# Classifying Humans



## Data Modeling

To model the data, we represent each image as an array holding one greyscale value per pixel. We open each image, read its pixel values, and flatten them into a one-dimensional array. We then add this array and its category to either the training set or the test set. The training set is the data we use to train our model, whereas the test set is the data we use to evaluate how well the model performs. Each image is assigned to one of the two sets at random (with the code below, about one image in six goes to the test set). This is one of the easiest ways to separate the two sets without introducing bias. It keeps the proportions between categories approximately, though not exactly, the same in both sets, and we have enough data to get a functioning model using this method.
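
A purely random assignment preserves the class proportions only on average. If exact proportions mattered, a stratified split could be used instead. The following is a minimal sketch using only the standard library; `stratified_split` and the toy labels are hypothetical and not part of the notebook. scikit-learn's `train_test_split` offers a `stratify` parameter for the same purpose.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with the same test share in every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random choice of which images of this class are held out
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Hypothetical labels: three people with ten images each.
labels = ["a"] * 10 + ["b"] * 10 + ["c"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

The returned indices would then be used to pick rows from the full feature and label arrays, so every category contributes the same share of its images to the test set.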

In [59]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from PIL import Image
from os import listdir
from random import randint

In [60]:
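X = []
y = []
X_train = []
y_train = []
X_test = []
y_test = []
# randint(0, test_percentage) is inclusive on both ends, so it draws from
# test_percentage + 1 values: an image goes to the test set with probability
# 1 / (test_percentage + 1), i.e. about 17% here rather than 5%.
test_percentage = 5

samples = listdir("cropped")
for sample in samples:
    names = listdir("cropped/" + sample + "/face")
    for name in names:
        image = Image.open("cropped/" + sample + "/face/" + name)
        pixels = np.array(image).flatten()  # 2D greyscale image -> 1D feature vector
        label = sample[-1]  # the last character of the folder name is the category
        X.append(pixels)
        y.append(label)
        if randint(0, test_percentage) < test_percentage:
            X_train.append(pixels)
            y_train.append(label)
        else:
            X_test.append(pixels)
            y_test.append(label)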

y = np.array(y)
X = np.array(X)
y_train = np.array(y_train)
X_train = np.array(X_train)
y_test = np.array(y_test)
X_test = np.array(X_test)
dataset = {"data" : X, "target" : y}

In [61]:
print(X_test.shape, y_test.shape, X_train.shape, y_train.shape)

(97, 10304) (97,) (478, 10304) (478,)


In [62]:
print(y_train)

['a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a'
 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b'
 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'b' 'b'
 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c' 'c' 'c' 'c' 'c' 'c' 'c' 'c' 'c' 'c' 'c'
 'c' 'c' 'c' 'c' 'c' 'c' 'c' 'c' 'c' 'd' 'd' 'd' 'd' 'd' 'd' 'd' 'd' 'd'
 'd' 'd' 'd' 'd' 'd' 'd' 'd' 'd' 'd' 'd' 'd' 'd' 'e' 'e' 'e' 'e' 'e' 'e'
 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'e' 'f' 'f' 'f'
 'f' 'f' 'f' 'f' 'f' 'f' 'f' 'f' 'f' 'f' 'f' 'f' 'f' 'f' 'f' 'g' 'g' 'g'
 'g' 'g' 'g' 'g' 'g' 'g' 'g' 'g' 'g' 'g' 'g' 'g' 'h' 'h' 'h' 'h' 'h' 'h'
 'h' 'h' 'h' 'h' 'h' 'h' 'h' 'h' 'h' 'h' 'h' 'h' 'h' 'i' 'i' 'i' 'i' 'i'
 'i' 'i' 'i' 'i' 'i' 'i' 'i' 'i' 'i' 'i' 'j' 'j' 'j' 'j' 'j' 'j' 'j' 'j'
 'j' 'j' 'j' 'j' 'j' 'j' 'j' 'j' 'j' 'j' 'j' 'j' 'j' 'j' 'j' 'j' 'j' 'k'
 'k' 'k' 'k' 'k' 'k' 'k' 'k' 'k' 'k' 'k' 'k' 'k' 'k' 'k' 'k' 'k' 'k' 'k'
 'k' 'k' 'k' 'k' 'k' 'k' 'l' 'l' 'l' 'l' 'l' 'l' 'l

## Logistic Regression

Logistic regression classifies data by passing a linear score $z$ through the sigmoid function, which maps it to a probability between 0 and 1:
$$g(z) = \frac{1}{1+e^{-z}}$$
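
To make the formula concrete, here is a small sketch that evaluates $g(z)$ for a few scores. In the usual formulation, $z$ is a weighted sum of the pixel values plus a bias, and with the one-vs-rest setting used in the next cell one such binary classifier is trained per category. The function name `sigmoid` and the sample scores are illustrative choices, not taken from the notebook.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) can overflow, so use the equivalent form
    # exp(z) / (1 + exp(z)).
    e = math.exp(z)
    return e / (1.0 + e)

# Large negative scores map close to 0, large positive scores close to 1,
# and a score of 0 maps to exactly 0.5.
for z in (-4.0, 0.0, 4.0):
    print(z, sigmoid(z))
```

A score of 0 therefore sits on the decision boundary, and the sign of $z$ decides which side of the 0.5 threshold a sample falls on.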

In [67]:
log_reg = LogisticRegression(multi_class="ovr")
log_reg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(multi_class='ovr')
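
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling option follows. It assumes the images are 8-bit greyscale, so raw pixel values lie in 0..255; that range is an assumption, and `scale_pixels` and the toy array are hypothetical. Whether scaling alone removes the warning for this dataset has not been checked.

```python
import numpy as np

def scale_pixels(X, max_value=255.0):
    """Map raw pixel intensities to floats in [0, 1] (assumes values in 0..max_value)."""
    return np.asarray(X, dtype=np.float64) / max_value

# Hypothetical stand-in for two flattened 8-bit greyscale images.
X_raw = np.array([[0, 128, 255], [64, 64, 192]], dtype=np.uint8)
X_scaled = scale_pixels(X_raw)
print(X_scaled.min(), X_scaled.max())
```

The same scaling would have to be applied to both `X_train` and `X_test` before calling `fit` and `score`, so that the model sees test data in the range it was trained on.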

In [68]:
print(log_reg.score(X_test, y_test))

0.979381443298969


## Support Vector Machines

In [69]:
clf = svm.SVC(kernel='linear').fit(X_train, y_train)

In [70]:
print(clf.score(X_test, y_test))

1.0


## Performance Evaluation

In [71]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix

In [72]:
y_pred = log_reg.predict(X_test)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
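# Harmonic mean of the macro-averaged precision and recall. This is not the same
# quantity as sklearn's f1_score(..., average='macro'), which averages per-class F1.
f1 = 2*precision*recall/(precision + recall)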
print(precision)
print(recall)
print(f1)

print(confusion_matrix(y_test, y_pred))

0.9803571428571429
0.9791666666666667
0.9797615431348725
[[5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 3 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6]]
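
In the matrix above, rows are true categories and columns are predicted categories, so the two off-diagonal entries are the two misclassified test images (95 of 97 correct, matching the score of about 0.979). The sketch below shows how macro-averaged precision and recall follow from such a matrix, and how the F1 value computed in the cell above differs from the average of per-class F1 scores. The 3-class matrix and the helper names are hypothetical, chosen only to keep the arithmetic readable.

```python
def per_class_precision_recall(cm):
    """cm[i][j] counts samples of true class i predicted as class j."""
    n = len(cm)
    precisions, recalls = [], []
    for k in range(n):
        true_positive = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        # A class with no predictions or no samples is scored as 0.0 here.
        precisions.append(true_positive / predicted_k if predicted_k else 0.0)
        recalls.append(true_positive / actual_k if actual_k else 0.0)
    return precisions, recalls

def harmonic_mean(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical 3-class confusion matrix, not taken from the experiment above.
cm = [[5, 1, 0],
      [0, 3, 0],
      [0, 1, 2]]

precisions, recalls = per_class_precision_recall(cm)
macro_precision = sum(precisions) / len(precisions)
macro_recall = sum(recalls) / len(recalls)

# Variant used in the cell above: F1 of the macro-averaged precision and recall.
f1_of_macros = harmonic_mean(macro_precision, macro_recall)
# Alternative definition: average of the per-class F1 scores.
macro_f1 = sum(harmonic_mean(p, r) for p, r in zip(precisions, recalls)) / len(cm)

print(macro_precision, macro_recall, f1_of_macros, macro_f1)
```

The two F1 variants generally give different numbers, so it is worth stating which one is reported when comparing results.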
