# Facial Recognition

Neural networks are generally expected to be the best choice for image data. But how well do other classification techniques work? Here I'll try two tree-based ensemble methods, AdaBoost and random forest, along with logistic regression. These will supply a baseline for the neural networks to beat later on.

In [1]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pickle

In [2]:
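# Load the pickled data; forward slashes avoid backslash-escape problems in the paths
with open('data/TrainData.p', 'rb') as f:
    traindata = pd.DataFrame(pickle.load(f))
traindata = traindata.sample(n = len(traindata))  # shuffle the training rows
with open('data/TestData.p', 'rb') as f:
    testdata = pd.DataFrame(pickle.load(f))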

In [3]:
traindata.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2295,2296,2297,2298,2299,2300,2301,2302,2303,2304
3194,0.356863,0.415686,0.47451,0.537255,0.592157,0.643137,0.690196,0.72549,0.752941,0.772549,...,0.098039,0.231373,0.180392,0.12549,0.12549,0.227451,0.498039,0.713725,0.815686,3.0
15941,0.980392,0.980392,0.980392,0.980392,0.980392,0.980392,0.968627,1.0,0.741176,0.619608,...,0.027451,0.019608,0.015686,0.007843,0.011765,0.007843,0.035294,0.062745,0.054902,0.0
8084,0.862745,0.756863,0.819608,0.772549,0.721569,0.768627,0.788235,0.776471,0.752941,0.552941,...,0.894118,0.945098,0.952941,0.92549,0.901961,0.956863,0.976471,0.980392,0.988235,5.0
8335,0.313725,0.333333,0.145098,0.031373,0.019608,0.094118,0.184314,0.184314,0.164706,0.145098,...,0.713725,0.729412,0.752941,0.776471,0.768627,0.768627,0.768627,0.776471,0.764706,3.0
24243,1.0,0.996078,1.0,0.807843,0.313725,0.176471,0.12549,0.117647,0.152941,0.145098,...,0.74902,0.733333,0.854902,0.984314,0.992157,0.960784,0.94902,0.984314,0.996078,2.0


In [4]:
print(traindata.shape)
print(testdata.shape)

(31761, 2305)
(7178, 2305)


In [5]:
Xtrain, ytrain = traindata.iloc[:,0:2304], traindata.iloc[:,2304]
Xtest, ytest = testdata.iloc[:,0:2304], testdata.iloc[:,2304]
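
Before fitting real models, it helps to know what a trivial classifier scores, so the baselines have something to be compared against. The sketch below uses sklearn's `DummyClassifier` to always predict the most frequent class. The data here is a small synthetic stand-in (an assumption, including the made-up class proportions and the `*_demo_*` names), not the pickled dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in data (assumption): 7 imbalanced classes, random features
rng = np.random.RandomState(0)
class_probs = [0.13, 0.02, 0.14, 0.25, 0.17, 0.12, 0.17]
y_demo_train = rng.choice(7, size=2000, p=class_probs)
y_demo_test = rng.choice(7, size=500, p=class_probs)
X_demo_train = rng.rand(2000, 10)  # the dummy model ignores the features
X_demo_test = rng.rand(500, 10)

# Always predict whichever class was most common in the training labels
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_demo_train, y_demo_train)
baseline_acc = dummy.score(X_demo_test, y_demo_test)
print('majority-class accuracy', np.round(baseline_acc, 2))
```

On the real data the same idea would be `dummy.fit(Xtrain, ytrain)` followed by `dummy.score(Xtest, ytest)`.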

I could tune the AdaBoost hyperparameters here, but for now I just want a rough idea of how well this type of model works. Sklearn makes it easy to plug and play.

In [13]:
adaboost_mod = AdaBoostClassifier(n_estimators=500)
adaboost_mod.fit(Xtrain,ytrain)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=500, random_state=None)

In [14]:
preds = adaboost_mod.predict(Xtest)

In [15]:
print(classification_report(y_true = ytest, y_pred = preds))
print('\n')
print('accuracy', np.round(np.mean(preds == ytest),2))
print('\n')
print('confusion matrix')
print(confusion_matrix(y_true = ytest, y_pred = preds))

             precision    recall  f1-score   support

        0.0       0.23      0.13      0.17       958
        1.0       0.21      0.27      0.23       111
        2.0       0.20      0.09      0.12      1024
        3.0       0.42      0.61      0.50      1774
        4.0       0.29      0.22      0.25      1247
        5.0       0.39      0.50      0.44       831
        6.0       0.31      0.36      0.33      1233

avg / total       0.32      0.34      0.32      7178



accuracy 0.34


confusion matrix
[[ 125   21   71  313  144  104  180]
 [   7   30   10   24   18    5   17]
 [  86   16   92  315  146  182  187]
 [  86   28   56 1088  165  132  219]
 [ 119   19   84  371  277  114  263]
 [  31   13   70  145   49  417  106]
 [  83   18   71  351  159  111  440]]
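
To connect the confusion matrix to the report above it: sklearn puts the true class on the rows and the predicted class on the columns, so recall is the diagonal divided by the row sums and precision is the diagonal divided by the column sums. A small sketch that recomputes those numbers from the matrix printed above (the variable names are mine):

```python
import numpy as np

# AdaBoost confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([
    [125,  21,  71,  313, 144, 104, 180],
    [  7,  30,  10,   24,  18,   5,  17],
    [ 86,  16,  92,  315, 146, 182, 187],
    [ 86,  28,  56, 1088, 165, 132, 219],
    [119,  19,  84,  371, 277, 114, 263],
    [ 31,  13,  70,  145,  49, 417, 106],
    [ 83,  18,  71,  351, 159, 111, 440],
])

recall = np.diag(cm) / cm.sum(axis=1)     # correct / samples truly in the class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / samples predicted as the class
accuracy = np.trace(cm) / cm.sum()        # all correct / all samples

print('recall   ', np.round(recall, 2))
print('precision', np.round(precision, 2))
print('accuracy ', np.round(accuracy, 2))
```

Row 3, for example, gives 1088 / 1774, which is the 0.61 recall listed for class 3.0 in the report.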


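34% accuracy. For reference, always predicting the most common class (3.0, with 1774 of the 7178 test samples) would score about 25%, so the model certainly learned something, but we can probably do better. How about a random forest? Again I'll just go with general guidelines here. Random forests are nice because they usually require little tuning to work well.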

In [16]:
random_forest_mod = RandomForestClassifier(n_estimators = 500,max_features = 48)
random_forest_mod.fit(Xtrain,ytrain)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=48, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
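
A note on `max_features = 48`: with 2304 feature columns, 48 is exactly the square root of the feature count, which is a common rule of thumb for classification forests. The sketch below checks that arithmetic and shows that `max_features='sqrt'` resolves to the same value. The tiny random dataset and the `*_demo` names are assumptions made only to keep the example runnable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

n_features = 2304                     # feature columns per sample in this dataset
sqrt_rule = int(np.sqrt(n_features))  # square-root rule of thumb
print(sqrt_rule)

# Tiny random stand-in data (assumption), only to show the setting in use
rng = np.random.RandomState(0)
X_demo = rng.rand(60, n_features)
y_demo = rng.randint(0, 7, size=60)

rf_demo = RandomForestClassifier(n_estimators=5, max_features='sqrt', random_state=0)
rf_demo.fit(X_demo, y_demo)

# Each fitted tree records how many features it considers per split
print(rf_demo.estimators_[0].max_features_)
```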

In [17]:
preds = random_forest_mod.predict(Xtest)

In [18]:
print(classification_report(y_true = ytest, y_pred = preds))
print('\n')
print('accuracy', np.round(np.mean(preds == ytest),2))
print('\n')
print('confusion matrix')
print(confusion_matrix(y_true = ytest, y_pred = preds))

             precision    recall  f1-score   support

        0.0       0.60      0.19      0.29       958
        1.0       1.00      0.35      0.52       111
        2.0       0.57      0.27      0.37      1024
        3.0       0.44      0.82      0.57      1774
        4.0       0.40      0.39      0.39      1247
        5.0       0.71      0.62      0.66       831
        6.0       0.44      0.41      0.43      1233

avg / total       0.51      0.48      0.46      7178



accuracy 0.48


confusion matrix
[[ 184    0   46  399  158   40  131]
 [   4   39    4   45    7    2   10]
 [  38    0  276  331  174   81  124]
 [  23    0   25 1447  141   43   95]
 [  29    0   59  451  482   15  211]
 [  10    0   42  144   51  514   70]
 [  18    0   31  463  184   30  507]]


48% accuracy is actually pretty good, and much better than I would have expected given the AdaBoost results. How about logistic regression? Since the dataset is large, I'll tune the regularization parameter C (the inverse of the regularization strength) on just 2000 samples. Note that I already shuffled the training data when I imported it, so I can simply take the first 2000 rows here.

In [6]:
logistic_reg_mod = LogisticRegression()

In [7]:
param_grid = [
  {'C': [.1,1,10,100]}
 ]

In [9]:
lr_tune = GridSearchCV(logistic_reg_mod, param_grid, cv=5, scoring='f1_weighted')

In [10]:
lr_tune.fit(Xtrain[:2000],ytrain[:2000])

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'C': [0.1, 1, 10, 100]}], pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring='f1_weighted',
       verbose=0)

In [11]:
lr_tune.grid_scores_



[mean: 0.31588, std: 0.01305, params: {'C': 0.1},
 mean: 0.30439, std: 0.02534, params: {'C': 1},
 mean: 0.28887, std: 0.02149, params: {'C': 10},
 mean: 0.27940, std: 0.01824, params: {'C': 100}]
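
One caveat if you rerun this: `grid_scores_` was removed in scikit-learn 0.20, and newer versions expose the same information through `cv_results_`. Here is a minimal sketch of the equivalent lookup. It runs on small random data (an assumption, so the scores themselves are meaningless) and uses the default solver with a higher `max_iter` rather than the settings shown above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small random stand-in data (assumption), three classes
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 20)
y_demo = rng.randint(0, 3, size=200)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': [0.1, 1, 10, 100]},
    cv=5,
    scoring='f1_weighted',
)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to read
results = pd.DataFrame(search.cv_results_)[
    ['param_C', 'mean_test_score', 'std_test_score']
]
print(results)
print(search.best_params_)
```

In the scores above, the best C is the smallest value in the grid, so extending the grid below 0.1 might be worth a look.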

In [12]:
logistic_reg_mod = LogisticRegression(C = .1)
logistic_reg_mod.fit(Xtrain,ytrain)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [13]:
preds = logistic_reg_mod.predict(Xtest)

In [14]:
print(classification_report(y_true = ytest, y_pred = preds))
print('\n')
print('accuracy', np.round(np.mean(preds == ytest),2))
print('\n')
print('confusion matrix')
print(confusion_matrix(y_true = ytest, y_pred = preds))

             precision    recall  f1-score   support

        0.0       0.27      0.15      0.19       958
        1.0       0.09      0.42      0.15       111
        2.0       0.27      0.13      0.18      1024
        3.0       0.45      0.67      0.54      1774
        4.0       0.30      0.26      0.28      1247
        5.0       0.53      0.49      0.51       831
        6.0       0.34      0.32      0.33      1233

avg / total       0.36      0.37      0.35      7178



accuracy 0.37


confusion matrix
[[ 140   66   81  323  163   62  123]
 [   4   47    3   32   10    4   11]
 [  73   96  136  270  161  117  171]
 [  89   84   54 1196  154   54  143]
 [ 102   95   93  344  328   56  229]
 [  40   36   57  142   81  406   69]
 [  71   77   85  345  195   68  392]]


Not bad: 37% accuracy, slightly ahead of AdaBoost (34%) but well behind the random forest (48%). Surely we should be able to beat all of these with a neural network!