# Facial Recognition with TensorFlow

## Supervised Learning

### About the Dataset
A popular area of computer vision and deep learning revolves around identifying faces, for applications ranging from unlocking your phone with your face to searching surveillance images for a particular suspect. This dataset is well suited to training and testing face detection models, particularly for recognising facial attributes such as whether a person has brown hair, is smiling, or is wearing glasses. The dataset is large and richly annotated, and its images cover large pose variations, background clutter, and diverse people. The data was originally collected by researchers at MMLAB, The Chinese University of Hong Kong (specific reference in Acknowledgment section).

https://www.kaggle.com/jessicali9530/celeba-dataset

### Purpose of this Notebook

The goal of this notebook is to create a supervised learning model that can predict a celebrity's hair color. To achieve this goal we have over 100,000 images of celebrities with their hair colors identified. These images have different backgrounds, which may cause some confusion in our model. However, this variety should also help the model in real-world applications. We will be using Keras to build a convolutional neural network (CNN) that generates the predictions.

In [243]:
# Basic lib imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import keras as ks
from skimage import io
import os
import csv

In [244]:
# Opening attributes in a new dataframe
attr = pd.read_csv('list_attr_celeba.csv')

In [245]:
# Setting hair columns to binary values (the CSV codes each attribute as 1 or -1)
hair_cols = ['Black_Hair','Bald','Blond_Hair','Brown_Hair','Gray_Hair']
hair = pd.DataFrame(np.where(attr[hair_cols] > 0, 1, 0), columns=hair_cols)

In [246]:
# Removing if all values are 0
hair = hair[(hair.T != 0).any()]
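
The two preprocessing steps above (mapping the attribute values to 0/1, then dropping rows with no hair label) can be sketched on a toy table. The three-row frame below is made up for illustration and is not taken from `list_attr_celeba.csv`; it only assumes the same 1 / -1 coding.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the attribute table (values are 1 / -1, made up for illustration)
toy_attr = pd.DataFrame({
    'Black_Hair': [1, -1, -1],
    'Bald':       [-1, -1, -1],
    'Blond_Hair': [-1, -1, 1],
})

# Map 1 -> 1 and -1 -> 0, keeping the original row index
toy_hair = pd.DataFrame(np.where(toy_attr > 0, 1, 0),
                        columns=toy_attr.columns, index=toy_attr.index)

# Keep only rows where at least one hair attribute is set
toy_hair = toy_hair[(toy_hair != 0).any(axis=1)]

print(toy_hair.index.tolist())  # rows 0 and 2 survive; row 1 had no hair label
```

The surviving index values are what the notebook later uses as `target_images` to decide which image files to load.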

In [247]:
# Getting list of target image ids
target_images = hair.index

In [282]:
# Importing cv2 to help work with images
import cv2

# Creating a function to load and format images
def load_images(path):
    img_data = [] # the image arrays
    index = [] # position of each image, used to reference the attribute rows
    x = -1
    # sorted() makes the listing order deterministic; this assumes the
    # CSV rows are in filename order so that x lines up with the row index
    for pic in sorted(os.listdir(path)):
        pic_path = os.path.join(path,pic)
        for img in sorted(os.listdir(pic_path)):
            x += 1
            if x in target_images: # selecting images in target index
                img_path = os.path.join(pic_path,img)
                image = cv2.imread(img_path) # OpenCV loads images in BGR order
                image = cv2.resize(image, (64, 64))
                image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) # convert to RGB
                img_data.append(image)
                index.append(x)
    return np.array(img_data), np.array(index) # image data as array and image numbers
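
OpenCV's `imread` returns channels in BGR order, while matplotlib and most other image tooling expect RGB. As a minimal sketch of what the color conversion step in `load_images` does, reversing the last axis of the array swaps the two orders. The one-pixel array below is made up for illustration.

```python
import numpy as np

# A single "pixel" image in BGR order: pure blue as OpenCV would load it
bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)

# Reversing the channel axis turns BGR into RGB (and vice versa)
rgb = bgr[..., ::-1]

print(rgb[0, 0].tolist())  # [0, 0, 255]: blue is now the last channel
```

Because the swap is symmetric, the BGR-to-RGB and RGB-to-BGR conversions produce the same numbers; only the name documents the intent.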

In [283]:
# Setting up file location for our load_images function
train_path = 'images'

In [284]:
# Running load_images, saving image arrays to X and ids to img_num
(X, img_num) = load_images(train_path)

In [286]:
# import to split data (sklearn.cross_validation was removed; use model_selection)
from sklearn.model_selection import train_test_split

In [287]:
# Separating data into training and test groups
X_train, X_test, y_train, y_test = train_test_split(X, hair, test_size=0.33, random_state=42)

In [288]:
# Checking shape
X_train.shape

(86416, 64, 64, 3)
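
The training-set size reported above can be checked by hand. This sketch assumes the total sample count is the sum of the 86,416 training rows and the 42,564 validation rows that Keras reports in the training log; with a float `test_size`, scikit-learn rounds the test share up and gives the remainder to the training set.

```python
import math

n_train_reported = 86416   # from X_train.shape above
n_test_reported = 42564    # from the Keras training log
n_samples = n_train_reported + n_test_reported

# With a float test_size, the test set gets ceil(test_size * n_samples) rows
n_test = math.ceil(0.33 * n_samples)
n_train = n_samples - n_test

print(n_samples, n_train, n_test)
```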

In [1]:
# Importing Modeling Tools
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten,Reshape
from keras.layers import Conv2D, MaxPooling2D, ZeroPadding2D, GlobalAveragePooling2D
from keras.layers import BatchNormalization

Using TensorFlow backend.


In [331]:
# Init Model
model = Sequential()

In [332]:
model.add(Conv2D(16, (3, 3), input_shape=(64,64,3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(16, (3, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Conv2D(32,(3,3 )))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Conv2D(64, (3, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))


model.add(Flatten())

model.add(Dense(512))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(5))

model.add(Activation('softmax'))
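
To see how large the input to the first `Dense` layer is, the spatial size can be traced through the stack by hand. This is a sketch that assumes the Keras defaults used above: `Conv2D` with `'valid'` padding and stride 1, and `MaxPooling2D` that rounds down. The helper names are hypothetical.

```python
def conv_valid(size, kernel=3):
    # 'valid' padding (the Keras Conv2D default) shrinks each side by kernel - 1
    return size - kernel + 1

def pool(size, window=2):
    # 2x2 max pooling halves the size, rounding down
    return size // window

size = 64
size = pool(conv_valid(conv_valid(size)))  # block 1: 64 -> 62 -> 60 -> 30
size = pool(conv_valid(conv_valid(size)))  # block 2: 30 -> 28 -> 26 -> 13
size = pool(conv_valid(size))              # block 3: 13 -> 11 -> 5

flat_features = size * size * 64           # 64 filters in the last conv layer
dense_params = flat_features * 512 + 512   # weights plus biases of Dense(512)
print(size, flat_features, dense_params)
```

Under these assumptions most of the trainable weights sit in the `Dense(512)` layer, which is one reason the dropout placed after it matters for overfitting.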

In [333]:
# Compiling Model
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',metrics=['accuracy'])
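
The `categorical_crossentropy` loss compares the one-hot target with the softmax output and penalizes low probability on the true class. Below is a minimal NumPy sketch of the formula, not the Keras implementation; the probability vectors are made up for illustration.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0), then average -sum(y_true * log(y_pred)) over samples
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

y_true = np.array([[1, 0, 0, 0, 0]])   # one-hot target: Black_Hair
confident = np.array([[0.9, 0.025, 0.025, 0.025, 0.025]])
unsure = np.array([[0.2, 0.2, 0.2, 0.2, 0.2]])

print(categorical_crossentropy(y_true, confident))  # about 0.105
print(categorical_crossentropy(y_true, unsure))     # about 1.609
```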

In [334]:
# Training Model
# batch size 32 with dropout 0.5 worked best; a higher batch size overfit
model.fit(X_train,y_train, epochs=11, batch_size = 32,
                    validation_data=(X_test,y_test))

Train on 86416 samples, validate on 42564 samples
Epoch 1/11
Epoch 2/11
Epoch 3/11
Epoch 4/11
Epoch 5/11
Epoch 6/11
Epoch 7/11
Epoch 8/11
Epoch 9/11
Epoch 10/11
Epoch 11/11


<keras.callbacks.History at 0x1f0059ca1d0>

In [335]:
# importing metrics
from sklearn.metrics import classification_report

In [342]:
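# Saving predicted class indices for new model
# (argmax over the softmax outputs; predict_classes is gone from newer Keras)
predictions = np.argmax(model.predict(X_test), axis=1)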

In [346]:
# Checking shape to fit to classification_report
print(y_test.shape)
predictions.shape

(42564, 5)


(42564,)

In [376]:
# fitting predictions to match y_test shape
from keras.utils import to_categorical
pred = to_categorical(predictions, num_classes=5) # force 5 columns even if a class is never predicted

In [377]:
# pred fit
pred.shape

(42564, 5)
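
The shape change from `(42564,)` to `(42564, 5)` is a one-hot encoding of the predicted class indices. A minimal NumPy sketch of the same idea, using made-up softmax outputs for two images:

```python
import numpy as np

probs = np.array([[0.1, 0.0, 0.7, 0.2, 0.0],   # made-up softmax outputs
                  [0.6, 0.1, 0.1, 0.1, 0.1]])

class_ids = np.argmax(probs, axis=1)        # shape (2,): index of the largest probability
one_hot = np.eye(5, dtype=int)[class_ids]   # shape (2, 5): one row per prediction

print(class_ids.tolist())  # [2, 0]
print(one_hot.tolist())    # [[0, 0, 1, 0, 0], [1, 0, 0, 0, 0]]
```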

In [387]:
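# Printing metrics; sklearn's signature is classification_report(y_true, y_pred)
# Note: the report shown below came from a run with the two arguments reversed,
# so its precision and recall columns are swapped and support counts predictions
print(classification_report(y_test,pred,target_names=['Black Hair','Bald','Blond Hair','Brown Hair','Gray Hair']))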

             precision    recall  f1-score   support

 Black Hair       0.97      0.88      0.92     17675
       Bald       0.77      0.81      0.79      1403
 Blond Hair       0.92      0.90      0.91     10216
 Brown Hair       0.75      0.94      0.83     10897
  Gray Hair       0.74      0.86      0.79      2373

avg / total       0.88      0.90      0.89     42564
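
When reading a report like this, argument order matters: scikit-learn's `classification_report` takes the true labels first and the predictions second, and reversing them swaps the precision and recall columns. The sketch below shows the effect with hand-rolled metrics on toy labels (not this notebook's data); the helper functions are illustrative, not scikit-learn's implementation.

```python
import numpy as np

def precision(y_true, y_pred, label):
    # Of everything predicted as `label`, the fraction that truly is `label`
    predicted = y_pred == label
    return float(np.sum(predicted & (y_true == label)) / np.sum(predicted))

def recall(y_true, y_pred, label):
    # Of everything that truly is `label`, the fraction predicted as `label`
    actual = y_true == label
    return float(np.sum(actual & (y_pred == label)) / np.sum(actual))

y_true = np.array([0, 0, 0, 1, 1])   # toy labels, not this notebook's data
y_pred = np.array([0, 0, 1, 1, 1])

print(precision(y_true, y_pred, 1), recall(y_true, y_pred, 1))
print(precision(y_pred, y_true, 1), recall(y_pred, y_true, 1))  # arguments reversed
```

The F1-score is symmetric in precision and recall, so the per-class F1 values are unaffected by the order.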



## Model Analysis

Overall we were able to create a fairly successful model, with a weighted-average F1-score of about 0.89 on the test set. Through trial and error I found hyperparameters that fit the model consistently. Before that, with a higher batch size and/or a lower dropout rate, the model was more strongly biased towards the training data: it reached 93% to 94% accuracy on the training set but performed much worse on the test data, staying in the mid 80s. After further research, I think this overfitting occurs because a model trained with a larger batch size loses some of its ability to generalize. Large-batch methods tend to converge to sharp minimizers of the training function, which are characterized by large positive eigenvalues of the loss Hessian, and such minimizers tend to generalize less well. Similarly, I think increasing the dropout rate avoids this bias because it reduces the number of ‘neurons’ the network draws information from at each training step, which improves the model's ability to generalize.
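
As a rough illustration of what the dropout layer does during training, here is a minimal NumPy sketch of inverted dropout. It is not the Keras implementation, and the `dropout` helper is hypothetical; it only shows the idea of silencing a random fraction of units and rescaling the rest.

```python
import numpy as np

def dropout(activations, rate, rng):
    # Inverted dropout: zero a random fraction `rate` of units and scale
    # the survivors by 1 / (1 - rate) so the expected activation is unchanged
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10000)
dropped = dropout(activations, rate=0.5, rng=rng)

print(np.mean(dropped == 0))   # close to 0.5
print(np.unique(dropped))      # only 0.0 and 2.0 remain
```

Because a different random subset is silenced at every step, no single unit can be relied on, which is the usual explanation for why a higher rate reduced overfitting here.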
   
### Performance by Color

Looking at the classification report, the performance of our model by hair color isn’t that surprising to me. Black and blond hair have the highest scores, which makes sense to me because they are the most extreme color values. Bald, with only about 1,400 samples, could easily be confused by other factors like background or skin color. As for brown hair, I think it would be the hardest: things like highlights, reflections and lighting would cause a lot of confusion. In direct sunlight brown hair could easily be mistaken for blond, while in a dark photo it could be seen as black. Last, I thought gray would do better; there could be a problem distinguishing gray from bald. But I think the model would have benefited from more samples, considering gray is the second-smallest class with about 2,400 samples.
   
### Concerns and Shortcomings

One thing that worries me about this model is the imbalance in the number of samples per class. Looking at the classification report, we can see that the most accurate prediction was for black hair at 97%, which was also the color with the most samples by over 6,000. The second-best score, 92%, was for blond hair. However, brown hair had a similar number of samples to blond and was much worse at 75%. Last, gray and bald had very few samples, so I think the model lacked the data needed to learn them properly. However, this could be addressed with more samples and a more balanced number of samples per class.
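
When collecting more data is not an option, one common mitigation for imbalance is to weight the loss by inverse class frequency. The sketch below was not tried in this notebook and is only an assumption about what might help. It uses the support counts from the classification report as rough class sizes and the "balanced" heuristic `n_samples / (n_classes * class_count)`.

```python
# Support counts taken from the classification report above, used here
# as rough class sizes (indices follow the column order of `hair`)
counts = {0: 17675, 1: 1403, 2: 10216, 3: 10897, 4: 2373}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count)
class_weight = {label: n_samples / (n_classes * count)
                for label, count in counts.items()}

for label, weight in class_weight.items():
    print(label, round(weight, 2))

# Hypothetical usage: model.fit(X_train, y_train, class_weight=class_weight, ...)
```

Keras's `fit` accepts a `class_weight` dictionary of this form. Whether it would actually improve the bald and gray scores here would need to be checked on the test set.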