# Face Image Classification



In [173]:
from os import listdir
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from random import randint
from PIL import Image
from sklearn import svm
from sklearn.linear_model import LogisticRegression

## Data Modeling

To model the data, we represent each image as an array that maps a greyscale value to each pixel. For this, we open each image, read its pixel greyscale values, and flatten them into a single vector. We add this vector and its category to either the training set or the test set. The training set is the data we use to train our model, whereas the test set is the data we use to evaluate how well the model performs. Each image is assigned to one of the two sets at random, using a stratified split so that every category keeps the same proportion in both sets. Random assignment is a simple way to avoid bias between the training and test data, stratification is one of the easiest methods to separate the two sets while keeping the proportions between categories, and we have enough data to get a functioning model this way.
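To make the idea of a proportion-preserving random split concrete, here is a minimal pure-Python sketch. It is not how `train_test_split` is implemented; the helper name `stratified_split` and the toy labels are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.33, seed=0):
    """Return (train_indices, test_indices), splitting each label separately."""
    rng = random.Random(seed)

    # Group sample indices by their label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)  # random assignment within the label
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Toy labels (made up for illustration): 12 images of "a", 6 of "b".
labels = ["a"] * 12 + ["b"] * 6
train_idx, test_idx = stratified_split(labels)
test_labels = [labels[i] for i in test_idx]
print(len(train_idx), len(test_idx))
print(test_labels.count("a"), test_labels.count("b"))
```

Because each label is shuffled and cut on its own, roughly one third of every label ends up in the test set, whichever indices the shuffle happens to pick.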

In [174]:
from sklearn.model_selection import train_test_split
test_size = 0.33
random_state = 0

In [175]:
# Convert images to flat pixel vectors (X) and collect their labels (y)
X, y = [], []
for sample in listdir("cropped"):
    for pose in listdir("cropped/{}/face".format(sample)):
        X.append(np.array(Image.open("cropped/{}/face/{}".format(sample, pose))).flatten())
        y.append(sample)
X = np.array(X, dtype=int)
y = np.array(y, dtype=str)

# Build Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=test_size, random_state=random_state)

# Verify that the data has been stratified correctly
count_unique_labels_all = dict(zip(*np.unique(y, return_counts=True)))
count_unique_labels_test = dict(zip(*np.unique(y_test, return_counts=True)))
label_percentages = {k:[count_unique_labels_all[k]/len(y)*100, count_unique_labels_test[k]/len(y_test)*100] for k in count_unique_labels_all}
print("Label  | % in all data | % in test data")
print("-------|---------------|---------------")
for k in label_percentages:
    print("  {}   |     {:.2f}%     |     {:.2f}%".format(k, label_percentages[k][0], label_percentages[k][1]))


Label  | % in all data | % in test data
-------|---------------|---------------
  1a   |     6.61%     |     6.32%
  1b   |     6.09%     |     5.79%
  1c   |     4.52%     |     4.74%
  1d   |     4.17%     |     4.21%
  1e   |     4.52%     |     4.74%
  1f   |     4.00%     |     4.21%
  1g   |     3.30%     |     3.16%
  1h   |     3.83%     |     3.68%
  1i   |     3.48%     |     3.68%
  1j   |     5.57%     |     5.26%
  1k   |     5.91%     |     5.79%
  1l   |     5.91%     |     5.79%
  1m   |     4.52%     |     4.74%
  1n   |     5.22%     |     5.26%
  1o   |     3.30%     |     3.16%
  1p   |     4.52%     |     4.74%
  1q   |     4.52%     |     4.74%
  1r   |     5.74%     |     5.79%
  1s   |     8.35%     |     8.42%
  1t   |     5.91%     |     5.79%


In [176]:
print(X_test.shape, y_test.shape, X_train.shape, y_train.shape)

(190, 10304) (190,) (385, 10304) (385,)


In [184]:
print(y_train)

['1f' '1b' '1p' '1b' '1n' '1s' '1s' '1s' '1q' '1s' '1s' '1s' '1q' '1s'
 '1r' '1a' '1o' '1i' '1d' '1l' '1i' '1a' '1n' '1j' '1n' '1k' '1c' '1h'
 '1p' '1f' '1p' '1l' '1l' '1c' '1a' '1t' '1j' '1t' '1s' '1r' '1k' '1b'
 '1t' '1m' '1m' '1f' '1k' '1r' '1q' '1o' '1n' '1e' '1e' '1h' '1t' '1s'
 '1s' '1f' '1r' '1k' '1c' '1m' '1e' '1h' '1a' '1l' '1i' '1c' '1p' '1k'
 '1r' '1k' '1b' '1d' '1a' '1s' '1b' '1b' '1l' '1r' '1a' '1l' '1n' '1e'
 '1a' '1r' '1q' '1r' '1t' '1l' '1c' '1q' '1m' '1h' '1h' '1c' '1n' '1s'
 '1a' '1d' '1t' '1n' '1r' '1j' '1m' '1q' '1k' '1d' '1o' '1a' '1t' '1l'
 '1s' '1h' '1b' '1k' '1h' '1g' '1h' '1e' '1b' '1a' '1a' '1a' '1f' '1m'
 '1n' '1o' '1s' '1i' '1b' '1e' '1e' '1f' '1c' '1b' '1l' '1b' '1r' '1m'
 '1j' '1k' '1t' '1a' '1m' '1l' '1s' '1i' '1t' '1j' '1c' '1d' '1b' '1m'
 '1o' '1p' '1n' '1p' '1o' '1s' '1c' '1p' '1g' '1e' '1s' '1d' '1t' '1q'
 '1q' '1q' '1t' '1l' '1a' '1b' '1q' '1n' '1i' '1l' '1s' '1t' '1h' '1s'
 '1n' '1t' '1p' '1t' '1s' '1r' '1l' '1j' '1i' '1k' '1l' '1h' '1q' '1a'
 '1c' 

## Logistic Regression

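Logistic regression is a classification method built on the sigmoid function, which maps any real-valued score $z$ to a value between 0 and 1 that can be read as a class probability:
$$g(z) = \frac{1}{1+e^{-z}}$$
Because we have 20 labels, the model below uses the one-vs-rest scheme (`multi_class="ovr"`): it fits one binary classifier per label and predicts the label whose classifier gives the highest score.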
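As a small illustration of the sigmoid, the sketch below evaluates $g(z)$ for a few scores and picks the label with the highest value, which is roughly how a one-vs-rest prediction is made. The scores are hypothetical numbers chosen for the example, not values taken from the fitted model.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical per-label scores (made up for illustration).
scores = {"1a": -2.0, "1b": 0.0, "1c": 3.0}

# Squash each score into the interval (0, 1).
probabilities = {label: sigmoid(z) for label, z in scores.items()}

# One-vs-rest style decision: keep the label with the largest value.
predicted = max(probabilities, key=probabilities.get)
print(probabilities)
print(predicted)
```

A score of 0 maps to exactly 0.5, large positive scores approach 1, and large negative scores approach 0.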

In [185]:
log_reg = LogisticRegression(multi_class="ovr")
log_reg.fit(X_train, y_train)

LogisticRegression(multi_class='ovr')

In [186]:
print(log_reg.score(X_test, y_test))

0.9789473684210527


## Support Vector Machines

In [187]:
clf = svm.SVC(kernel='linear').fit(X_train, y_train)

In [188]:
print(clf.score(X_test, y_test))

0.9842105263157894
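For a classifier, `score` reports mean accuracy, so the two values printed in this notebook can be turned back into counts of correct and incorrect predictions on the 190 test images. The short sketch below does that arithmetic; the dictionary names are only illustrative.

```python
n_test = 190  # number of test images, from X_test.shape

# Accuracies copied from the notebook outputs.
accuracies = {
    "logistic regression": 0.9789473684210527,
    "linear SVM": 0.9842105263157894,
}

errors = {}
for name, accuracy in accuracies.items():
    correct = round(accuracy * n_test)
    errors[name] = n_test - correct
    print(name, correct, errors[name])
```

The difference between the two models therefore comes down to a single test image (4 misclassifications versus 3), which is too small a gap to rank them with confidence.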


## Performance Evaluation

In [189]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
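The macro-averaged precision and recall imported above can also be computed by hand from a confusion matrix, which helps when reading the matrix printed later. The following is a minimal sketch, assuming rows are true labels and columns are predicted labels (the convention `confusion_matrix` uses); the helper name and the toy matrix are made up for illustration, and returning 0.0 for an empty row or column is a simplification.

```python
def macro_precision_recall(matrix):
    """Macro-averaged precision and recall from a square confusion matrix."""
    n = len(matrix)
    precisions, recalls = [], []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_k = sum(matrix[row][k] for row in range(n))  # column sum
        actual_k = sum(matrix[k])  # row sum
        precisions.append(true_positive / predicted_k if predicted_k else 0.0)
        recalls.append(true_positive / actual_k if actual_k else 0.0)
    # Macro averaging: every class counts equally, whatever its size.
    return sum(precisions) / n, sum(recalls) / n


# Toy 3-class matrix (not the notebook's results): one sample of
# class 0 is misclassified as class 1.
toy_matrix = [
    [4, 1, 0],
    [0, 5, 0],
    [0, 0, 5],
]
toy_precision, toy_recall = macro_precision_recall(toy_matrix)
print(toy_precision, toy_recall)
```

In the toy matrix the single error lowers the recall of class 0 (4 of 5 found) and the precision of class 1 (5 of 6 predictions correct), while class 2 is unaffected.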

In [190]:
y_pred = log_reg.predict(X_test)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
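# Harmonic mean of the macro-averaged precision and recall. Note that this is
# not the same quantity as f1_score(average='macro'), which averages per-class F1 scores.
f1 = 2*precision*recall/(precision + recall)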
print(precision)
print(recall)
print(f1)

print(confusion_matrix(y_test, y_pred))

0.9817424242424242
0.9753571428571428
0.9785393671614059
[[11  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0]
 [ 0 11  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  1  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  7  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  9  1  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0 11  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0 11  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  9  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0 10  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  5  1  0  0  0  0