<a href="https://colab.research.google.com/github/ZachPetroff/multiclass-classification/blob/main/multi_class_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd

In [None]:
train_df = pd.read_csv('../input/ghouls-goblins-and-ghosts-boo/train.csv.zip', compression='zip')
test_df = pd.read_csv('../input/ghouls-goblins-and-ghosts-boo/test.csv.zip', compression='zip')

train_df.head()

Unnamed: 0,id,bone_length,rotting_flesh,hair_length,has_soul,color,type
0,0,0.354512,0.350839,0.465761,0.781142,clear,Ghoul
1,1,0.57556,0.425868,0.531401,0.439899,green,Goblin
2,2,0.467875,0.35433,0.811616,0.791225,black,Ghoul
3,4,0.776652,0.508723,0.636766,0.884464,black,Ghoul
4,5,0.566117,0.875862,0.418594,0.636438,green,Ghost


In [None]:
cols = train_df.columns

for col in cols:
    print('Number Missing in ', col, ' column: ', sum(train_df[col].isnull()))

train_df.describe()

Number Missing in  id  column:  0
Number Missing in  bone_length  column:  0
Number Missing in  rotting_flesh  column:  0
Number Missing in  hair_length  column:  0
Number Missing in  has_soul  column:  0
Number Missing in  color  column:  0
Number Missing in  type  column:  0


Unnamed: 0,id,bone_length,rotting_flesh,hair_length,has_soul
count,371.0,371.0,371.0,371.0,371.0
mean,443.67655,0.43416,0.506848,0.529114,0.471392
std,263.222489,0.132833,0.146358,0.169902,0.176129
min,0.0,0.061032,0.095687,0.1346,0.009402
25%,205.5,0.340006,0.414812,0.407428,0.348002
50%,458.0,0.434891,0.501552,0.538642,0.466372
75%,678.5,0.517223,0.603977,0.647244,0.60061
max,897.0,0.817001,0.932466,1.0,0.935721


# Data Preprocessing
* Not much preprocessing to do. 
* The continuous values seem to already be between zero and one.
* There are no missing values, so no imputation or row deletion is necessary.
* However, the color and type columns need to be one-hot encoded.
* Because type is the target value, I am going to take a different approach. First I will use pandas to get the dummy values, then I will convert them to a numpy array. This array will be used in training.

In [None]:
color = pd.get_dummies(train_df['color'], prefix='color')

train_df = train_df.drop(['color'], axis=1)
train_df = pd.concat([train_df, color], axis=1)

In [None]:
target_df = pd.get_dummies(train_df['type'])
target = target_df.to_numpy()

train_df = train_df.drop(['type'], axis=1)
target_df.head()

Unnamed: 0,Ghost,Ghoul,Goblin
0,0,1,0
1,0,0,1
2,0,1,0
3,0,1,0
4,1,0,0


# Model Building
> Usually, I would use a random forest classifier to see if any unimportant columns can be dropped. However, there is very little data, so I am going to skip this step. Hopefully, because there is a low amount of data, many models can be tested in a very short time.

**Models:**
* SVC - My prediction for the best model, since it tends to work well on small datasets
* Nearest Neighbors - Also usually performs well
* Random Forest
* If nothing else works, I will try a simple feed-forward NN.

In [None]:
x = train_df.to_numpy()

I did not realize that sklearn expects the target to be a one-dimensional array of class labels. Below, I use the one-hot target array to create a 1-D array of class indices.

In [None]:
y = []

for v in target:
    y.append(list(v).index(1))
    
y = np.array(y)
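
The loop above can also be written as one vectorized call. This is a minimal sketch, assuming `target` is a one-hot array like the one built from `target_df`; the small array here is made up for illustration.

```python
import numpy as np

# hypothetical one-hot rows in the column order Ghost, Ghoul, Goblin
target = np.array([
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])

# the position of the 1 in each row is the class index
y = target.argmax(axis=1)
print(y)
```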

In [None]:
from sklearn import svm, neighbors
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

svc = svm.SVC()
nn = neighbors.KNeighborsClassifier(10)
forest = RandomForestClassifier(max_depth=2, random_state=0)

models = [svc, nn, forest]
names = ['SVC', 'Nearest Neighbors', 'Random Forest']

for name, model in zip(names, models):
    cv_results = cross_val_score(model, x, y, cv=2)
    print(name, ": ", cv_results.mean())

SVC :  0.3396396396396396
Nearest Neighbors :  0.34234234234234234
Random Forest :  0.7061319383900029


Random Forest heavily outperforms the other two models, which both score close to the one-in-three chance level. I will fine-tune it below.
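
One possible reason for the near-chance SVC and nearest-neighbor scores is that `x` still contains the `id` column, whose values run into the hundreds while every other feature lies between zero and one, so it can dominate distance computations. The sketch below illustrates the effect on synthetic data (not the competition data). `StandardScaler` and `make_pipeline` are additions that the notebook does not use, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 300
labels = rng.integers(0, 3, size=n)

# four informative features on a small scale, roughly like the monster features
signal = labels[:, None] * 0.3 + rng.normal(0, 0.1, size=(n, 4))
# an uninformative row identifier on a much larger scale
ids = np.arange(n, dtype=float)[:, None]
features_with_id = np.hstack([ids, signal])

# SVC on the raw matrix, identifier included
raw_score = cross_val_score(SVC(), features_with_id, labels, cv=3).mean()

# SVC after dropping the identifier and standardizing the remaining columns
scaled_svc = make_pipeline(StandardScaler(), SVC())
clean_score = cross_val_score(scaled_svc, features_with_id[:, 1:], labels, cv=3).mean()

print("with id:", raw_score)
print("without id, scaled:", clean_score)
```

Tree-based models split on one feature at a time, so they are much less sensitive to feature scale, which may explain why the forest is less affected.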

In [None]:
n_ests = [50, 100, 250, 500]
depths = [1, 2, 3]
crits = ['gini', 'entropy']

for est in n_ests:
    for d in depths:
        for crit in crits:
            forest = RandomForestClassifier(max_depth=d, n_estimators=est, criterion=crit, random_state=0)
            cv_results = cross_val_score(forest, x, y, cv=3)
            print("Estimators: ", est, " Depth: ", d, " Criterion: ", crit, " Score: ", cv_results.mean())

Estimators:  50  Depth:  1  Criterion:  gini  Score:  0.6442215228603899
Estimators:  50  Depth:  1  Criterion:  entropy  Score:  0.6496415770609318
Estimators:  50  Depth:  2  Criterion:  gini  Score:  0.6657924643762566
Estimators:  50  Depth:  2  Criterion:  entropy  Score:  0.6576405280181834
Estimators:  50  Depth:  3  Criterion:  gini  Score:  0.6818996415770608
Estimators:  50  Depth:  3  Criterion:  entropy  Score:  0.6765232974910393
Estimators:  100  Depth:  1  Criterion:  gini  Score:  0.6577060931899642
Estimators:  100  Depth:  1  Criterion:  entropy  Score:  0.655061631261474
Estimators:  100  Depth:  2  Criterion:  gini  Score:  0.6765888626628201
Estimators:  100  Depth:  2  Criterion:  entropy  Score:  0.6685024914765276
Estimators:  100  Depth:  3  Criterion:  gini  Score:  0.6791896144767899
Estimators:  100  Depth:  3  Criterion:  entropy  Score:  0.6711469534050178
Estimators:  250  Depth:  1  Criterion:  gini  Score:  0.6469315499606609
Estimators:  250  Depth:  1

Because two of the best scores have a max depth of 3, I will expand the search for depth.

In [None]:
depths = [3, 4, 5, 6]

for d in depths:
    forest = RandomForestClassifier(max_depth=d, n_estimators=50, random_state=0)
    cv_results = cross_val_score(forest, x, y, cv=3)
    print("Depth: ", d, " Score: ", cv_results.mean())

Depth:  3  Score:  0.6818996415770608
Depth:  4  Score:  0.7061806101931988
Depth:  5  Score:  0.6952530815630737
Depth:  6  Score:  0.6790147740187079


These parameters give the best score:
* N_Estimators: 50
* Max Depth: 4
* Criterion: Gini

Validation score: 0.7062 (mean accuracy over 3 folds, about 70.6%)
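
The nested loops used for tuning can also be expressed with scikit-learn's `GridSearchCV`, which runs the same kind of cross-validated search and records the best combination. This is a sketch on a synthetic three-class dataset from `make_classification`, not on the competition data; the grid is a shortened, hypothetical version of the one above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic three-class stand-in for x and y
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [1, 2, 3, 4],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # same number of folds as the manual loops
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```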

This score is good; however, a neural network might do better. I will use PyTorch to build and train a model.

In [None]:
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.optim as optim
import torch.utils.data
from torch.autograd import Variable

In [None]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # basic feed forward NN architecture: 11 input features, 3 output classes
        self.fc1 = nn.Linear(11, 20)
        self.fc2 = nn.Linear(20, 30)
        self.fc3 = nn.Linear(30, 20)
        self.fc4 = nn.Linear(20, 3)
        self.activation = nn.ReLU()
        # dim=1 normalizes over the class dimension of a (batch, 3) output
        self.output = nn.Softmax(dim=1)
    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.activation(self.fc3(x))
        x = self.output(self.fc4(x))
        return x
ann = Net()
# loss function
crit = nn.MSELoss()
# optimizer and learning rate
opt = optim.Adam(ann.parameters(), lr=5e-4)
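
The model above pairs a softmax output with `MSELoss`. A commonly used alternative for multiclass problems is to return raw logits and train with `nn.CrossEntropyLoss`, which applies the log-softmax itself and takes integer class labels. The sketch below is not the notebook's method: `LogitNet`, the synthetic data, and the learning rate are all hypothetical, and only the layer sizes mirror `Net`.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

class LogitNet(nn.Module):
    """Hypothetical variant of Net that returns raw logits (no softmax)."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(11, 20), nn.ReLU(),
            nn.Linear(20, 30), nn.ReLU(),
            nn.Linear(30, 20), nn.ReLU(),
            nn.Linear(20, 3),
        )

    def forward(self, x):
        return self.layers(x)

# synthetic data: 120 rows, 11 features, labels in {0, 1, 2}
features = torch.randn(120, 11)
labels = (features[:, 0] > 0).long() + (features[:, 1] > 0).long()

model = LogitNet()
loss_fn = nn.CrossEntropyLoss()  # expects logits and integer class labels
optimizer = optim.Adam(model.parameters(), lr=1e-2)

first_loss = loss_fn(model(features), labels).item()
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)  # whole dataset as one batch
    loss.backward()
    optimizer.step()
final_loss = loss_fn(model(features), labels).item()

print(first_loss, final_loss)
```

With this setup, class probabilities are obtained at prediction time by applying `torch.softmax(logits, dim=1)` to the model output.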

Split the data into training and test sets.

In [None]:
import random

x_train = []
x_test = []
y_train = []
y_test = []

for i in range(len(x)):
    if random.random() > .1:
        x_train.append(x[i])
        y_train.append(target[i])
    else:
        x_test.append(x[i])
        y_test.append(target[i])

x_train = np.array(x_train)
x_test = np.array(x_test)

train_x = torch.FloatTensor(x_train)
test_x = torch.FloatTensor(x_test)

train_y = torch.FloatTensor(y_train)
test_y = torch.FloatTensor(y_test)
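
The split above draws a fresh random number per row, so the test set size and contents change on every run. A reproducible alternative is scikit-learn's `train_test_split`. This is a sketch on made-up arrays shaped like `x` and `target`; the names `x_demo`, `target_demo`, and `class_ids` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# synthetic stand-ins: 100 rows, 11 features, one-hot targets for 3 classes
x_demo = rng.random((100, 11))
class_ids = np.arange(100) % 3
target_demo = np.eye(3)[class_ids]

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, target_demo,
    test_size=0.1,        # hold out 10% of the rows
    stratify=class_ids,   # keep class proportions similar in both parts
    random_state=0,       # same split on every run
)
print(x_tr.shape, x_te.shape)
```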

In [None]:
nb_epoch = 200
for epoch in range(1, nb_epoch + 1):
  train_loss = 0
  s = 0.
  for index in range(len(train_x)):
    opt.zero_grad()
    
    # get input as a (1, 11) batch
    inp = train_x[index].unsqueeze(0)

    # one-hot target for this row, reshaped to (1, 3) to match the output
    targ = train_y[index].unsqueeze(0)
    
    # get output from nn
    output = ann(inp)

    # get loss (mean squared error between output and one-hot target)
    loss = crit(output, targ)
    
    # propagate loss backward in the network
    loss.backward()
    
    # accumulate the root of this sample's MSE for the epoch average
    train_loss += np.sqrt(loss.data)
    s += 1.
    opt.step()
  print('epoch: '+str(epoch)+' loss: '+ str(train_loss/s))



epoch: 1 loss: tensor(0.4797)
epoch: 2 loss: tensor(0.4772)
epoch: 3 loss: tensor(0.4768)
epoch: 4 loss: tensor(0.4759)
epoch: 5 loss: tensor(0.4758)
epoch: 6 loss: tensor(0.4747)
epoch: 7 loss: tensor(0.4744)
epoch: 8 loss: tensor(0.4746)
epoch: 9 loss: tensor(0.4740)
epoch: 10 loss: tensor(0.4735)
epoch: 11 loss: tensor(0.4731)
epoch: 12 loss: tensor(0.4730)
epoch: 13 loss: tensor(0.4725)
epoch: 14 loss: tensor(0.4725)
epoch: 15 loss: tensor(0.4713)
epoch: 16 loss: tensor(0.4710)
epoch: 17 loss: tensor(0.4712)
epoch: 18 loss: tensor(0.4703)
epoch: 19 loss: tensor(0.4702)
epoch: 20 loss: tensor(0.4699)
epoch: 21 loss: tensor(0.4698)
epoch: 22 loss: tensor(0.4697)
epoch: 23 loss: tensor(0.4693)
epoch: 24 loss: tensor(0.4697)
epoch: 25 loss: tensor(0.4689)
epoch: 26 loss: tensor(0.4691)
epoch: 27 loss: tensor(0.4683)
epoch: 28 loss: tensor(0.4686)
epoch: 29 loss: tensor(0.4676)
epoch: 30 loss: tensor(0.4674)
epoch: 31 loss: tensor(0.4671)
epoch: 32 loss: tensor(0.4666)
epoch: 33 loss: t

In [None]:
def score(data, targ):
  acc = 0
  with torch.no_grad():
    for i in range(len(data)):
      inp = data[i].unsqueeze(0)

      # get output from nn
      output = ann(inp)

      # count a hit when the predicted class matches the one-hot target
      if int(output.argmax(dim=1)) == int(targ[i].argmax()):
        acc += 1

  return acc / len(data)

print(score(test_x, test_y))

0.4864864864864865
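
The same accuracy can be computed for all rows at once instead of one row per loop iteration. This is a minimal sketch: `accuracy` is a hypothetical helper, and the toy model is `nn.Identity`, chosen only so the predictions can be read directly from the inputs.

```python
import torch
import torch.nn as nn

def accuracy(model, inputs, one_hot_targets):
    """Fraction of rows whose highest-scoring class matches the one-hot target."""
    with torch.no_grad():
        predicted = model(inputs).argmax(dim=1)
    expected = one_hot_targets.argmax(dim=1)
    return (predicted == expected).float().mean().item()

# toy model whose output equals its input
toy_model = nn.Identity()
toy_inputs = torch.tensor([[0.9, 0.1, 0.0],
                           [0.2, 0.7, 0.1],
                           [0.1, 0.2, 0.7],
                           [0.6, 0.3, 0.1]])
toy_targets = torch.tensor([[1., 0., 0.],
                            [0., 1., 0.],
                            [0., 0., 1.],
                            [0., 1., 0.]])

# the first three rows match their targets, the last one does not
toy_accuracy = accuracy(toy_model, toy_inputs, toy_targets)
print(toy_accuracy)
```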




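This held-out accuracy (about 0.49 on 37 rows) is below the random forest's cross-validated 0.71, although the two numbers come from different splits and 37 rows is a small test set. I attempt to build a better model below, without deleting the original model.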

In [None]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # basic feed forward NN architecture: 11 input features, 3 output classes
        self.fc1 = nn.Linear(11, 12)
        self.fc2 = nn.Linear(12, 24)
        self.fc3 = nn.Linear(24, 12)
        self.fc4 = nn.Linear(12, 3)
        self.activation = nn.ReLU()
        # dim=1 normalizes over the class dimension of a (batch, 3) output
        self.output = nn.Softmax(dim=1)
    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.activation(self.fc3(x))
        x = self.output(self.fc4(x))
        return x
ann2 = Net()
# loss function
crit = nn.MSELoss()
# optimizer and learning rate
opt = optim.Adam(ann2.parameters(), lr=1e-3)

In [None]:
nb_epoch = 300
for epoch in range(1, nb_epoch + 1):
  train_loss = 0
  s = 0.
  for index in range(len(train_x)):
    opt.zero_grad()
    
    # get input as a (1, 11) batch
    inp = train_x[index].unsqueeze(0)

    # one-hot target for this row, reshaped to (1, 3) to match the output
    targ = train_y[index].unsqueeze(0)
    
    # get output from nn
    output = ann2(inp)

    # get loss (mean squared error between output and one-hot target)
    loss = crit(output, targ)
    
    # propagate loss backward in the network
    loss.backward()
    
    # accumulate the root of this sample's MSE for the epoch average
    train_loss += np.sqrt(loss.data)
    s += 1.
    opt.step()
  print('epoch: '+str(epoch)+' loss: '+ str(train_loss/s))



epoch: 1 loss: tensor(0.4781)
epoch: 2 loss: tensor(0.4759)
epoch: 3 loss: tensor(0.4748)
epoch: 4 loss: tensor(0.4738)
epoch: 5 loss: tensor(0.4731)
epoch: 6 loss: tensor(0.4733)
epoch: 7 loss: tensor(0.4720)
epoch: 8 loss: tensor(0.4721)
epoch: 9 loss: tensor(0.4723)
epoch: 10 loss: tensor(0.4712)
epoch: 11 loss: tensor(0.4711)
epoch: 12 loss: tensor(0.4710)
epoch: 13 loss: tensor(0.4709)
epoch: 14 loss: tensor(0.4715)
epoch: 15 loss: tensor(0.4731)
epoch: 16 loss: tensor(0.4710)
epoch: 17 loss: tensor(0.4712)
epoch: 18 loss: tensor(0.4709)
epoch: 19 loss: tensor(0.4708)
epoch: 20 loss: tensor(0.4707)
epoch: 21 loss: tensor(0.4704)
epoch: 22 loss: tensor(0.4703)
epoch: 23 loss: tensor(0.4704)
epoch: 24 loss: tensor(0.4702)
epoch: 25 loss: tensor(0.4700)
epoch: 26 loss: tensor(0.4700)
epoch: 27 loss: tensor(0.4700)
epoch: 28 loss: tensor(0.4698)
epoch: 29 loss: tensor(0.4699)
epoch: 30 loss: tensor(0.4696)
epoch: 31 loss: tensor(0.4696)
epoch: 32 loss: tensor(0.4694)
epoch: 33 loss: t

In [None]:
def score(data, targ):
  acc = 0
  with torch.no_grad():
    for i in range(len(data)):
      inp = data[i].unsqueeze(0)

      # get output from nn
      output = ann2(inp)

      # count a hit when the predicted class matches the one-hot target
      if int(output.argmax(dim=1)) == int(targ[i].argmax()):
        acc += 1

  return acc / len(data)

print(score(test_x, test_y))

0.21621621621621623




I wasn't able to create a better network: 0.22 is below the one-in-three chance level. Still, at this point I have three models (two networks and the random forest), and it might be a good idea to combine them by majority vote. I test this idea below.
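
The voting rule itself can be isolated as a small function. This is a sketch with a hypothetical helper, `majority_vote`, that returns the class picked by more than half of the models and otherwise falls back to a chosen default (for example, the forest's prediction).

```python
from collections import Counter

def majority_vote(votes, fallback):
    """Return the class chosen by more than half of the votes, else the fallback."""
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count > len(votes) / 2 else fallback

# two of three models agree on class 1
print(majority_vote([1, 1, 2], fallback=2))
# all three disagree, so the fallback is used
print(majority_vote([0, 1, 2], fallback=2))
```

Handling the no-agreement case explicitly matters with three classes, because three models can each predict a different class.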

In [None]:
forest = RandomForestClassifier(max_depth=4, n_estimators=50, random_state=0)
# note: the forest is fit on every row, including the held-out rows scored below
forest = forest.fit(x, y)

def score(data, targ):
  acc = 0
  for i in range(len(data)):
    with torch.no_grad():
      inp = data[i].unsqueeze(0)
      target = targ[i]
    
      # class index predicted by each of the three models
      output1 = int(ann(inp).argmax(dim=1))
      output2 = int(ann2(inp).argmax(dim=1))
      output3 = int(forest.predict(inp[0].numpy().reshape(1, -1))[0])
      print(output3)

      # majority vote; if the networks disagree, the forest decides
      # (this also covers the case where all three models disagree)
      if output1 == output2:
        pred = output1
      else:
        pred = output3
      
      if pred == int(target.argmax()):
        acc += 1

  return acc / len(data)

print(score(test_x, test_y))


1
1
2
0
0
2
0
2
0
1
2
0
1
0
1
1
0
2
0
1
2
0
2
1
1
2
0
1
1
1
2
0
1
2
1
2
1
0.5675675675675675




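On this held-out split the ensemble (0.57) beats both networks (0.49 and 0.22). The forest was not scored alone on this split, and it was fit on all rows, including these, so the 0.57 is not directly comparable to its cross-validated 0.71. I will use the ensemble for the test set.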

In [None]:
color = pd.get_dummies(test_df['color'], prefix='color')

test_df = test_df.drop(['color'], axis=1)
test_df = pd.concat([test_df, color], axis=1)
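
The encoding above yields the same columns as the training frame only if both files contain the same set of colors. As a hedged sketch, assuming a test file could lack a color, `reindex` can align the dummy columns with the training ones. The two small frames are made up for illustration.

```python
import pandas as pd

# made-up frames: the test frame has no 'green' rows
train_colors = pd.DataFrame({'color': ['clear', 'green', 'black']})
test_colors = pd.DataFrame({'color': ['black', 'clear']})

train_dummies = pd.get_dummies(train_colors['color'], prefix='color')
test_dummies = pd.get_dummies(test_colors['color'], prefix='color')

# add any column seen in training but absent here, filled with zeros,
# and put the columns in the training order
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_dummies.columns))
```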

In [None]:
test = test_df.to_numpy()
test_tens = torch.FloatTensor(test)

In [None]:
def predict(data):
  # same order as the target_df columns
  labels = ['Ghost', 'Ghoul', 'Goblin']
  preds = []
  for i in range(len(data)):
    with torch.no_grad():
      inp = data[i].unsqueeze(0)
    
      # class index predicted by each of the three models
      output1 = int(ann(inp).argmax(dim=1))
      output2 = int(ann2(inp).argmax(dim=1))
      output3 = int(forest.predict(inp[0].numpy().reshape(1, -1))[0])

      # majority vote; if the networks disagree, the forest decides
      # (this also covers the case where all three models disagree)
      if output1 == output2:
        pred = output1
      else:
        pred = output3

      # map the voted class index (not only the forest's output) to its label
      preds.append(labels[pred])

  return preds

preds = predict(test_tens)
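
The index-to-label mapping depends on the column order that `pd.get_dummies` produced for the target, which is sorted by label. Rather than hard-coding the names, they can be read from the dummy columns. A small sketch with made-up labels and hypothetical class indices:

```python
import numpy as np
import pandas as pd

# made-up labels standing in for the 'type' column
types = pd.Series(['Ghoul', 'Goblin', 'Ghoul', 'Ghost'])
label_names = pd.get_dummies(types).columns.to_numpy()

# hypothetical voted class indices, one per test row
class_indices = np.array([1, 2, 0, 1])
predicted_labels = label_names[class_indices].tolist()
print(predicted_labels)
```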


In [None]:
sample = pd.read_csv('../input/ghouls-goblins-and-ghosts-boo/sample_submission.csv.zip', compression='zip')
pd.DataFrame({'id': sample['id'], 'type': preds}).to_csv('submission.csv', index=False)

In [None]:
sample.head(-10)

Unnamed: 0,id,type
0,3,Ghost
1,6,Ghost
2,9,Ghost
3,10,Ghost
4,13,Ghost
...,...,...
514,880,Ghost
515,881,Ghost
516,882,Ghost
517,883,Ghost
