<a href="https://colab.research.google.com/github/ekanshi258/eye-cluster-emotions/blob/master/K2Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Here we train and test several classifiers on the clustered features produced by KMeans clustering with K=2.

In [None]:
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier as ABC
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np

In [None]:
df = pd.read_csv('k2means_eye.csv')
x = df.drop(columns=['emotion', 'Unnamed: 0'])
x = x.to_numpy()
y = np.array(df['emotion'])

In [None]:
# Train/test split: default test_size=0.25, stratified on the emotion label.
xtrain, xtest, ytrain, ytest = train_test_split(x, y, random_state=42, stratify=y)


The cell below shows that the test set contains 82 samples, which leaves 245 samples in the training set.

In [None]:
xtest.shape

(82, 2)
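
The 82/245 split follows from the default `test_size=0.25` of `train_test_split`. Here is a minimal sketch of the arithmetic, assuming the test count is rounded up and the train count is rounded down (as scikit-learn does for fractional sizes). The total of 327 is inferred from 82 + 245 rather than read from the data file.

```python
import math

n_samples = 82 + 245   # total inferred from the two set sizes quoted above
test_size = 0.25       # scikit-learn's default when neither size is given

n_test = math.ceil(test_size * n_samples)           # test count is rounded up
n_train = math.floor((1 - test_size) * n_samples)   # train count is rounded down

print(n_samples, n_test, n_train)
```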

**AdaBoost Classifier:**  
Base classifier: decision tree (scikit-learn's default is a depth-1 stump)



In [None]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([('clf',ABC())])
params = {'clf__n_estimators':[20, 30, 50], 'clf__learning_rate':[0.125, 0.25, 0.5, 0.75]}
gs = GridSearchCV(pipe, param_grid = params, cv = 5)
gs.fit(xtrain,ytrain)

print(gs.best_params_)

{'clf__learning_rate': 0.125, 'clf__n_estimators': 30}


The grid search above gave us:  
estimators = 30  
learning rate = 0.125  
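
As a side note, `GridSearchCV` refits the best configuration on the whole training set by default (`refit=True`), so the fitted search object can be scored directly instead of rebuilding the classifier by hand. The sketch below illustrates this on toy data from `make_classification`; the data, the variable names, and the `random_state` values are assumptions for illustration and are not the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy 2-feature, 3-class data standing in for the clustered features (hypothetical).
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)

param_grid = {'n_estimators': [20, 30, 50],
              'learning_rate': [0.125, 0.25, 0.5, 0.75]}
search = GridSearchCV(AdaBoostClassifier(random_state=0),
                      param_grid=param_grid, cv=5)
search.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ is already trained on X_train.
cv_accuracy = search.best_score_                               # mean CV accuracy
test_accuracy = search.best_estimator_.score(X_test, y_test)   # held-out accuracy
print(search.best_params_, round(cv_accuracy, 3), round(test_accuracy, 3))
```

Comparing `best_score_` with the held-out accuracy gives a quick check on whether the cross-validated estimate carries over to the test set.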

In [None]:
clf = ABC( n_estimators= 30,learning_rate=0.125)
clf.fit(xtrain, ytrain)
pred = clf.predict(xtest)
clf.score(xtest,ytest)

0.43902439024390244

Thus, the AdaBoost Classifier reaches a test accuracy of only `43.90%`.

Let's maintain a dataframe with results so that we can save them in a file.

In [None]:
# Start the results table with the AdaBoost score (DataFrame.append is gone in pandas 2.0).
result_df = pd.DataFrame({
    'classifier': ['ABC'],
    'k2score': [43.90]
})



---


**Decision Tree Classifier:**


In [None]:
from sklearn.tree import DecisionTreeClassifier as DTC
pipe = Pipeline([('clf',DTC())])
params = {'clf__criterion':['gini', 'entropy']}
gs = GridSearchCV(pipe, param_grid = params, cv = 5)
gs.fit(xtrain,ytrain)

print(gs.best_params_)

{'clf__criterion': 'gini'}


Best split criterion returned by the grid search: `gini`, which is also the default criterion in `sklearn`.

In [None]:
clf = DTC()
clf.fit(xtrain, ytrain)
clf.score(xtest,ytest)

0.6585365853658537

The test score for DTC is `65.85%`. Making a note of it:

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead.
result_df = pd.concat([result_df, pd.DataFrame({
    'classifier':['DTC'],
    'k2score':[65.85]
})], ignore_index=True)
result_df



---


**Gradient Boosting Classifier**  
Base learners: regression trees  
The final model uses `max_features='sqrt'`, meaning the number of features considered when looking for the best split is `sqrt(n_features)`.
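
With only two input features, `sqrt(n_features)` is not a whole number. The sketch below shows what it works out to, assuming scikit-learn truncates the square root to an integer and never goes below one feature; `features_per_split` is an illustrative name, not a library attribute.

```python
import math

n_features = 2  # the feature matrix has two columns, per xtest.shape == (82, 2)

# Assumed rule: integer part of sqrt(n_features), but never fewer than one feature.
features_per_split = max(1, int(math.sqrt(n_features)))
print(features_per_split)
```

Under that assumption each split considers a single feature chosen at random, which is a stronger restriction than it would be on a wider feature set.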

In [None]:
from sklearn.ensemble import GradientBoostingClassifier as GBC
pipe = Pipeline([('clf',GBC())])
params = {'clf__learning_rate':[0.125, 0.25, 0.5, 0.75], 'clf__n_estimators':[20, 30, 50]}
gs = GridSearchCV(pipe, param_grid = params, cv = 5)
gs.fit(xtrain,ytrain)

print(gs.best_params_)

{'clf__learning_rate': 0.5, 'clf__n_estimators': 50}


Since the grid search returned 0.5 as the optimal learning rate paired with 50 estimators, we use these values to check the classification score. Note that the grid search ran with the default `max_features`, while the model below also sets `max_features='sqrt'`:

In [None]:
clf = GBC(learning_rate=0.5, n_estimators= 50, max_features='sqrt')
clf.fit(xtrain, ytrain)
clf.score(xtest,ytest)

0.6219512195121951

GBC performed with an accuracy of about 62.20%. Saving result:

In [None]:
result_df = pd.concat([result_df, pd.DataFrame({
    'classifier':['GBC'],
    'k2score':[62.20]
})], ignore_index=True)
result_df



---

**K-Nearest Neighbors Classification**  
Distance Measure: Euclidean Distance (L2 Norm)
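
To make the distance measure concrete, here is a small pure-Python sketch of a nearest-neighbor vote using Euclidean distance. The helper `knn_predict` and the toy points are hypothetical and only illustrate the idea; scikit-learn's `KNeighborsClassifier` uses its own neighbor search and tie-breaking.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Predict the majority label among the k training points closest to query."""
    # Rank training indices by Euclidean (L2) distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points in two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ['a', 'a', 'a', 'b', 'b', 'b']

print(math.dist((0, 0), (3, 4)))                    # L2 distance of a 3-4-5 triangle
print(knn_predict(points, labels, (0.5, 0.5), k=3))
print(knn_predict(points, labels, (5.5, 5.5), k=3))
```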

In [None]:
from sklearn.neighbors import KNeighborsClassifier as KNC
pipe = Pipeline([('clf',KNC())])
params = {'clf__n_neighbors':[5,7,10,15,20]}
gs = GridSearchCV(pipe, param_grid = params, cv = 5)
gs.fit(xtrain,ytrain)

print(gs.best_params_)

{'clf__n_neighbors': 5}


The grid search returns 5 as the optimal number of neighbors to consider.

In [None]:
clf = KNC(n_neighbors = 5)
clf.fit(xtrain, ytrain)
clf.score(xtest,ytest)

0.6097560975609756

KNN gives a performance of about `60.98%`. Saving this result:

In [None]:
result_df = pd.concat([result_df, pd.DataFrame({
    'classifier':['KNN'],
    'k2score':[60.98]
})], ignore_index=True)
result_df



---

**Multinomial Naive Bayes**

In [None]:
from sklearn.naive_bayes import MultinomialNB as MNB
pipe = Pipeline([('clf',MNB())])
params = {'clf__alpha':[0.05,0.1,0.5,1,3]}
gs = GridSearchCV(pipe, param_grid = params, cv = 5)
gs.fit(xtrain,ytrain)

print(gs.best_params_)

{'clf__alpha': 0.05}


The grid search selects `0.05` as the optimal value of the additive (Laplace/Lidstone) smoothing parameter `alpha`.
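
For intuition about what `alpha` does, the sketch below applies additive smoothing to a pair of feature totals for one class: each estimate is `(count + alpha) / (total + alpha * n_features)`. The function name and the toy counts are assumptions for illustration, not values from the dataset.

```python
def smoothed_feature_probs(feature_counts, alpha):
    """Additive smoothing: (count_i + alpha) / (total + alpha * n_features)."""
    total = sum(feature_counts)
    n_features = len(feature_counts)
    return [(count + alpha) / (total + alpha * n_features)
            for count in feature_counts]

toy_counts = [3, 0]  # hypothetical per-class feature totals
probs = smoothed_feature_probs(toy_counts, alpha=0.05)
print(probs)
```

A small `alpha` keeps the estimates close to the raw frequencies while still giving the unseen feature a non-zero probability.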

In [None]:
clf = MNB(alpha=0.05)
clf.fit(xtrain, ytrain)
clf.score(xtest,ytest)

0.25609756097560976

MNB performs very unsatisfactorily, with the score lying just above the baseline score of 25.38%. However, we will keep a note of this result too.
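
The text does not say how the baseline was computed. One common choice, assumed here, is the accuracy of always predicting the most frequent class. The sketch below computes that on hypothetical labels; `majority_baseline` and `toy_labels` are illustrative names and not part of the notebook.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels, only to show the computation (not the notebook's data).
toy_labels = ['happy'] * 4 + ['sad'] * 3 + ['angry'] * 3
print(f"{majority_baseline(toy_labels):.2%}")
```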

In [None]:
result_df = pd.concat([result_df, pd.DataFrame({
    'classifier':['MNB'],
    'k2score':[25.61]
})], ignore_index=True)
result_df



---

**Random Forest Classifier**  
Max Features: `sqrt` (default)  
split: `gini` (default)

In [None]:
from sklearn.ensemble import RandomForestClassifier as RFC
pipe = Pipeline([('clf',RFC())])
params = {'clf__n_estimators':[10,20,30,50]}
gs = GridSearchCV(pipe, param_grid = params, cv = 5)
gs.fit(xtrain,ytrain)

print(gs.best_params_)

{'clf__n_estimators': 10}


Using 10 estimators:

In [None]:
clf = RFC(n_estimators=10)
clf.fit(xtrain, ytrain)
clf.score(xtest,ytest)

0.6707317073170732

RFC has given a performance score of `67.07%` which is the best so far.

In [None]:
result_df = pd.concat([result_df, pd.DataFrame({
    'classifier':['RFC'],
    'k2score':[67.07]
})], ignore_index=True)
result_df

Let's try one last classifier:


---


**Support Vector Classification**  
using Radial Basis Function (RBF) Kernel
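
The RBF kernel scores the similarity of two points as `exp(-gamma * ||a - b||^2)`, so `gamma` controls how quickly similarity decays with distance. A minimal sketch follows; `rbf_kernel` here is a hand-written helper for illustration, not the scikit-learn function of the same name.

```python
import math

def rbf_kernel(a, b, gamma):
    """exp(-gamma * squared Euclidean distance between a and b)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

origin, point = (0.0, 0.0), (1.0, 1.0)
for gamma in [0.5, 1, 2, 3]:  # the gamma values searched in the grid below
    print(gamma, rbf_kernel(origin, point, gamma))
```

Larger `gamma` values make the kernel more local, so each training point influences a smaller neighborhood.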

In [None]:
from sklearn.svm import SVC
pipe = Pipeline([('clf',SVC())])
params = {'clf__gamma':[0.5,1,2,3], 'clf__C':[1,2,3]}
gs = GridSearchCV(pipe, param_grid = params, cv = 5)
gs.fit(xtrain,ytrain)

print(gs.best_params_)

{'clf__C': 1, 'clf__gamma': 0.5}


Gridsearch returned `C = 1` (default value in sklearn) and `gamma = 0.5`

In [None]:
clf = SVC(gamma = 0.5, C = 1)
clf.fit(xtrain, ytrain)
clf.score(xtest,ytest)

0.4146341463414634

SVM has not performed very satisfactorily, giving a score of `41.46%` only.  

In [None]:
result_df = pd.concat([result_df, pd.DataFrame({
    'classifier':['SVC'],
    'k2score':[41.46]
})], ignore_index=True)
result_df

| | classifier | k2score |
|---|---|---|
| 0 | ABC | 43.9 |
| 1 | DTC | 65.85 |
| 2 | GBC | 62.2 |
| 3 | KNN | 60.98 |
| 4 | MNB | 25.61 |
| 5 | RFC | 67.07 |
| 6 | SVC | 41.46 |
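
Because the test set has only 82 samples, each accuracy corresponds to a whole number of correct predictions. The sketch below recovers those counts from the raw scores printed earlier, using `correct = accuracy * n_test`; the dictionary simply restates the notebook's outputs.

```python
n_test = 82  # size of the test set reported earlier

# Raw test accuracies as returned by clf.score(xtest, ytest) in the cells above.
scores = {
    'ABC': 0.43902439024390244,
    'DTC': 0.6585365853658537,
    'GBC': 0.6219512195121951,
    'KNN': 0.6097560975609756,
    'MNB': 0.25609756097560976,
    'RFC': 0.6707317073170732,
    'SVC': 0.4146341463414634,
}

# accuracy = correct / n_test, so correct = accuracy * n_test (rounded to absorb float error)
correct = {name: round(acc * n_test) for name, acc in scores.items()}
for name, hits in sorted(correct.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {hits}/{n_test} correct ({hits / n_test:.2%})")
```

Seen this way, the gap between RFC and DTC is a single test sample, so small differences between the top classifiers should be read with caution.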


We will save these results in a CSV file for later comparison with the results of KMeans clustering with K=3.

In [None]:
result_df.to_csv('results_cluster.csv')