<a href="https://colab.research.google.com/github/ekanshi258/eye-cluster-emotions/blob/master/K3Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Here, we will classify images based on the clustered features obtained from K-means clustering with K=3.

In [None]:
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier as ABC
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np

In [None]:
df = pd.read_csv('k3means_eye.csv')
x = df.drop(columns=['emotion', 'Unnamed: 0'])
x = x.to_numpy()
y = np.array(df['emotion'])

In [None]:
# Train Test Split:
xtrain, xtest, ytrain,ytest = train_test_split(x,y,random_state = 42, stratify = y)
xtest.shape

(82, 3)

The dataset is split into training and test sets, stratified by emotion label. The shape above shows that the test set contains 82 samples (each with 3 features), so the training set contains the remaining 245.
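
The 82/245 figures follow from the default `test_size` of 0.25, which `train_test_split` uses when no size is given. Below is a minimal sketch of that arithmetic, assuming the test count is rounded up and the remainder goes to training. The helper name `split_sizes` is illustrative and not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_fraction=0.25):
    """Return (n_train, n_test) for a fractional test size.

    Assumption: the test count is rounded up and the training set
    receives the remaining samples.
    """
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test

# 82 test samples + 245 training samples = 327 rows in total
n_train, n_test = split_sizes(82 + 245)
print(n_train, n_test)
```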

**Ada Boost Classifier:**  
Base Classifier: Decision Tree

In [None]:
# Grid Search:
from sklearn.pipeline import Pipeline
pipe = Pipeline([('clf',ABC())])
params = {'clf__n_estimators':[20, 30, 50], 'clf__learning_rate':[0.125, 0.25, 0.5, 0.75]}
gs = GridSearchCV(pipe, param_grid = params, cv = 5)
gs.fit(xtrain,ytrain)

print(gs.best_params_)

{'clf__learning_rate': 0.125, 'clf__n_estimators': 30}
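
To get a feel for how much work this search does, the sketch below enumerates the same grid with plain Python. It only counts candidates and cross-validation fits; it does not train anything, and the variable names are illustrative.

```python
from itertools import product

# Same grid as the AdaBoost search above
param_grid = {
    'clf__n_estimators': [20, 30, 50],
    'clf__learning_rate': [0.125, 0.25, 0.5, 0.75],
}
cv_folds = 5

# Every combination of the listed values is one candidate
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]

n_candidates = len(candidates)        # 3 * 4 = 12
n_cv_fits = n_candidates * cv_folds   # 12 * 5 = 60
print(n_candidates, n_cv_fits)
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the whole training set.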


Using the best parameters found by the grid search, we train on the training set and score on the test set:

In [None]:
clf = ABC( n_estimators= 30,learning_rate=0.125)
clf.fit(xtrain, ytrain)
pred = clf.predict(xtest)
clf.score(xtest,ytest)

0.47560975609756095

Thus, the AdaBoost classifier reaches a test accuracy of `47.56%`, which is only marginally better than the result with k=2 clustering.
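
For a classifier, `score` returns the mean accuracy, i.e. the fraction of test samples whose predicted label matches the true label. A minimal sketch of that computation follows. The toy labels are hypothetical, and the count of 39 correct predictions is inferred from the reported score and the 82 test samples rather than printed by the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels (hypothetical): 3 of 4 predictions are right
print(accuracy([0, 1, 2, 1], [0, 1, 1, 1]))

# Inferred from the reported score: 39 correct out of 82 test samples
print(round(39 / 82 * 100, 2))
```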

Let's maintain a dataframe with results so that we can add them in the results file as done for K=2.

In [None]:
# Build the table directly; DataFrame.append was removed in pandas 2.0
result_df = pd.DataFrame({
    'classifier':['ABC'],
    'k3score':[47.56]
})
result_df

Unnamed: 0,classifier,k3score
0,ABC,47.56




---


**Decision Tree Classifier:**

In [None]:
from sklearn.tree import DecisionTreeClassifier as DTC
pipe = Pipeline([('clf',DTC())])
params = {'clf__criterion':['gini', 'entropy']}
gs = GridSearchCV(pipe, param_grid = params, cv = 5)
gs.fit(xtrain,ytrain)

print(gs.best_params_)

{'clf__criterion': 'gini'}


Using `gini` as the split criterion:

In [None]:
clf = DTC(criterion='gini')  # 'gini' is also the default criterion
clf.fit(xtrain, ytrain)
clf.score(xtest,ytest)

0.6829268292682927

The best score returned by DTC is `68.29%`, again only slightly better than with the k=2 clustering.
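
The grid search for the decision tree compared the `gini` and `entropy` split criteria. As a rough illustration of what the two measure, here is a small sketch that computes both impurities for a list of class labels. The function names and toy labels are illustrative, and entropy is taken in bits.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Relative frequency of each class in the node."""
    counts = Counter(labels)
    total = len(labels)
    return [c / total for c in counts.values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k))."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))

pure = [0, 0, 0, 0]    # a node containing a single class
mixed = [0, 0, 1, 1]   # a node split evenly between two classes
print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))
```

Both measures are zero for a pure node and grow as the classes mix, which is why the two criteria often pick similar splits.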

In [None]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
result_df = pd.concat([result_df, pd.DataFrame({
    'classifier':['DTC'],
    'k3score':[68.29]
})], ignore_index=True)
result_df



---


**Gradient Boost Classifier**  
Uses regression trees.  
Setting `max_features='sqrt'` means that the number of features considered when looking for the best split is `sqrt(n_features)`.
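
Since the K=3 feature matrix has only three columns, the `sqrt` rule leaves very few candidate features per split. The sketch below works this out under the assumption that the square root is truncated to an integer with a floor of one. The helper `n_split_features` is hypothetical and only mirrors the rule described above.

```python
import math

def n_split_features(n_features):
    """Candidate features per split under a 'sqrt' rule.

    Assumption: the square root is truncated to an integer and at
    least one feature is always considered.
    """
    return max(1, math.isqrt(n_features))

# The K=3 feature matrix has 3 columns
print(n_split_features(3))
```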

In [None]:
from sklearn.ensemble import GradientBoostingClassifier as GBC
pipe = Pipeline([('clf',GBC())])
params = {'clf__learning_rate':[0.125, 0.25, 0.5, 0.75], 'clf__n_estimators':[20, 30, 50]}
gs = GridSearchCV(pipe, param_grid = params, cv = 5)
gs.fit(xtrain,ytrain)

print(gs.best_params_)

{'clf__learning_rate': 0.125, 'clf__n_estimators': 20}


Using the parameter values returned by the grid search above, together with `max_features='sqrt'`, we check the classification score:

In [None]:
clf = GBC(learning_rate=0.125, n_estimators= 20, max_features='sqrt')
clf.fit(xtrain, ytrain)
clf.score(xtest,ytest)

0.6219512195121951

GBC performed with an accuracy of about `62.20%`, which is the same as that obtained with the 2-means clustering.

In [None]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
result_df = pd.concat([result_df, pd.DataFrame({
    'classifier':['GBC'],
    'k3score':[62.20]
})], ignore_index=True)
result_df

Unnamed: 0,classifier,k3score
0,ABC,47.56
1,DTC,68.29
2,GBC,62.2




---

**K-Nearest Neighbors Classification**  
Distance Measure: Euclidean Distance (L2 Norm)
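
As a rough picture of what the classifier does, the sketch below predicts a label by majority vote among the `k` training points closest in Euclidean distance. It is a plain-Python illustration with hypothetical toy data, not the scikit-learn implementation.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Predict the majority label among the k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy 3-feature data (hypothetical values, two classes)
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (0.0, 0.2, 0.0),
          (1.0, 1.0, 1.0), (0.9, 1.0, 1.1), (1.0, 0.8, 1.0)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(points, labels, (0.1, 0.1, 0.0), k=3))
```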

In [None]:
from sklearn.neighbors import KNeighborsClassifier as KNC
pipe = Pipeline([('clf',KNC())])
params = {'clf__n_neighbors':[5,7,10,15,20]}
gs = GridSearchCV(pipe, param_grid = params, cv = 5)
gs.fit(xtrain,ytrain)

print(gs.best_params_)

{'clf__n_neighbors': 5}


In [None]:
clf = KNC(n_neighbors = 5)
clf.fit(xtrain, ytrain)
clf.score(xtest,ytest)

0.6341463414634146

KNN gives an accuracy of about `63.41%`, again slightly better than what the 2-means clustered features gave us. Saving this result in our dataframe:

In [None]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
result_df = pd.concat([result_df, pd.DataFrame({
    'classifier':['KNN'],
    'k3score':[63.41]
})], ignore_index=True)
result_df



---

**Multinomial Naive Bayes**

In [None]:
from sklearn.naive_bayes import MultinomialNB as MNB
pipe = Pipeline([('clf',MNB())])
params = {'clf__alpha':[0.05,0.1,0.5,1,3]}
gs = GridSearchCV(pipe, param_grid = params, cv = 5)
gs.fit(xtrain,ytrain)

print(gs.best_params_)

{'clf__alpha': 0.05}


The grid search selects `0.05` as the best value of the additive (Laplace/Lidstone) smoothing parameter `alpha`.
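
To illustrate what `alpha` does, the sketch below applies additive smoothing to per-class feature totals: each probability is `(count + alpha) / (total + alpha * n_features)`. The function name and the counts are hypothetical and serve only to show the effect.

```python
def smoothed_feature_probs(feature_counts, alpha=0.05):
    """Additively smoothed feature probabilities for one class.

    feature_counts[i] is the total of feature i over the samples of the class.
    """
    n_features = len(feature_counts)
    total = sum(feature_counts)
    return [(count + alpha) / (total + alpha * n_features)
            for count in feature_counts]

# Hypothetical per-class totals for three features, one of them never seen
probs = smoothed_feature_probs([6, 4, 0], alpha=0.05)
print(probs)
```

A small `alpha` keeps the estimates close to the raw frequencies while still preventing zero probabilities for unseen features.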

In [None]:
clf = MNB(alpha=0.05)
clf.fit(xtrain, ytrain)
clf.score(xtest,ytest)

0.25609756097560976

MNB performs very unsatisfactorily even with the 3-means clustered features: its accuracy equals that obtained with the 2-means clustered features and lies just above the baseline score of 25.38%. However, we will keep a note of this result too.

In [None]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
result_df = pd.concat([result_df, pd.DataFrame({
    'classifier':['MNB'],
    'k3score':[25.61]
})], ignore_index=True)
result_df



---

**Random Forest Classifier**  
Max Features: `sqrt` (default)  
split: `gini` (default)

In [None]:
from sklearn.ensemble import RandomForestClassifier as RFC
pipe = Pipeline([('clf',RFC())])
params = {'clf__n_estimators':[10,20,30,50]}
gs = GridSearchCV(pipe, param_grid = params, cv = 5)
gs.fit(xtrain,ytrain)

print(gs.best_params_)

{'clf__n_estimators': 10}


In [None]:
rfc = RFC(n_estimators=10)
rfc.fit(xtrain, ytrain)
rfc.score(xtest,ytest)

0.6707317073170732

RFC gives an accuracy of `67.07%`, the same as with the 2-means clustering, but it is not the best so far in this case, as DTC has performed better.

In [None]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
result_df = pd.concat([result_df, pd.DataFrame({
    'classifier':['RFC'],
    'k3score':[67.07]
})], ignore_index=True)
result_df

Let's try one last classifier:


---


**Support Vector Classification**  
using Radial Basis Function (RBF) Kernel
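
For reference, the RBF kernel scores the similarity of two points as `exp(-gamma * ||x - y||^2)`, so a larger `gamma` makes the similarity fall off faster with distance. The sketch below computes it for hypothetical 3-feature points; the function name and the sample values are illustrative.

```python
import math

def rbf_kernel(x, y, gamma=3.0):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

a = (0.2, 0.5, 0.3)
b = (0.8, 0.1, 0.1)
print(rbf_kernel(a, a))   # identical points are maximally similar
print(rbf_kernel(a, b))   # similarity decays with distance
```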

In [None]:
from sklearn.svm import SVC
pipe = Pipeline([('clf',SVC())])
params = {'clf__gamma':[0.5,1,2,3], 'clf__C':[1,2,3]}
gs = GridSearchCV(pipe, param_grid = params, cv = 5)
gs.fit(xtrain,ytrain)

print(gs.best_params_)

{'clf__C': 3, 'clf__gamma': 3}


In [None]:
clf = SVC(gamma = 3, C = 3)
clf.fit(xtrain, ytrain)
clf.score(xtest,ytest)

0.5609756097560976

SVM has performed better than with the 2-means clustered features, giving a score of about `56.10%`. However, with the 3-means clustered features it still trails DTC, RFC, KNN and GBC, and beats only ABC and MNB.  

In [None]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
result_df = pd.concat([result_df, pd.DataFrame({
    'classifier':['SVM'],
    'k3score':[56.10]
})], ignore_index=True)
result_df

Unnamed: 0,classifier,k3score
0,ABC,47.56
1,DTC,68.29
2,GBC,62.2
3,KNN,63.41
4,MNB,25.61
5,RFC,67.07
6,SVM,56.1


Let us save these results in the CSV file containing the results of the 2-means clustering. 

In [None]:
results = pd.read_csv('results_cluster.csv')
results = results.join(result_df['k3score'])
results = results.drop(columns=['Unnamed: 0'])
results

Unnamed: 0,classifier,k2score,k3score
0,ABC,43.9,47.56
1,DTC,65.85,68.29
2,GBC,62.2,62.2
3,KNN,60.98,63.41
4,MNB,25.61,25.61
5,RFC,67.07,67.07
6,SVC,41.46,56.1


In [None]:
results.to_csv('results_cluster.csv')

Now we can run some comparisons on the results.
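
As a first, minimal sketch of such a comparison, the snippet below computes the change in accuracy from K=2 to K=3 for each classifier. The scores are copied from the combined table above into plain dictionaries; the variable names are illustrative, and the same figures could equally be read back from `results_cluster.csv`.

```python
# Scores copied from the combined results table above
k2 = {'ABC': 43.9, 'DTC': 65.85, 'GBC': 62.2, 'KNN': 60.98,
      'MNB': 25.61, 'RFC': 67.07, 'SVC': 41.46}
k3 = {'ABC': 47.56, 'DTC': 68.29, 'GBC': 62.2, 'KNN': 63.41,
      'MNB': 25.61, 'RFC': 67.07, 'SVC': 56.1}

# Change in accuracy (percentage points) when moving from K=2 to K=3
gains = {name: round(k3[name] - k2[name], 2) for name in k2}
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gain:+.2f}")

best_k3 = max(k3, key=k3.get)
print("best K=3 classifier:", best_k3)
```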