# Facial Keypoint Detection Project

The objective of this project was to predict facial keypoints using Kaggle's Facial Keypoints Detection dataset (https://www.kaggle.com/c/facial-keypoints-detection). The input consists of grayscale pixel lists and keypoint coordinates for about 7000 images, each 96x96 pixels. Each predicted keypoint is specified by an (x,y) real-valued pair in the space of pixel indices. There are 15 keypoints, which represent elements of the face such as left_eye_center and right_eye_center. The final goal was to predict these keypoint positions for the test dataset.

We explored various neural network structures (dense, convolutional, VGG16, InceptionV3, etc.) for our predictions and ultimately trained two models for the project submission. Our final solution's predictions submitted to Kaggle produced an RMSE of 1.71947 on the unseen test data labels. This placed us 4th on the public Kaggle leaderboard.

## Group Members:
- Vanlunen, Daniel
- Nautiyal, Aniruddh
- Munjuluri, Anusha
- Challa, Usha

### Notes about the directory structure:

The zip file submitted contains 3 folders:

<b>1. data:</b> This folder contains the competition data. *Idlookuptable*, *test*, *training*, and *samplesubmission* files in this folder are needed for this notebook to run. Training data should be unzipped before running the notebook.

<b>2. notebook:</b> This folder contains this main submission notebook. It also contains a subfolder with a screenshot (img) of the Kaggle score, and another subfolder (augmentdata) that defines a custom image generating class.

<b>3. saved-models:</b> This folder contains the saved weights of the final models in .h5 files. It also has a subfolder that contains the training histories in pickle files.

### Remarks and Learnings: 

<b>1. Splitting data into two groups: </b>Initial data exploration showed that the training images come from two different marking schemes. Training two separate models for images labelled with different sets of keypoints (Group 1: 9 keypoints or more, up to 30; Group 2: 8 keypoints or fewer) reduced our training/validation loss.

<b>2. Augmenting data: </b>Augmenting images (rotating, flipping, etc.) was an effective way to artificially increase the variance in our training dataset and improve the model.

<b>3. Convolutions vs Dense: </b>Convolutions significantly improved the results.

<b>4. Choice of Optimizers: </b>Different optimizers behaved differently under different circumstances: Nadam converged faster, but Adam and Nadam eventually reached the same loss. However, Adam did achieve a better minimum loss for simple models without convolutions. Both gave a much better validation loss than the SGD optimizer. https://towardsdatascience.com/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f

<b>5. Normalization: </b>Normalization was helpful (both of the features and the output) and helped prevent divergence.

<b>6. Batch Normalization: </b>Batch normalization worked well after the activation. https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md

<b>7. Dropout and Batch Normalization: </b> Dropout and batch normalization together seemed to cause weird differences between training and validation error due to "variance shift". One fix was to only apply dropout after all the batch normalization layers had been applied. http://arxiv.org/abs/1801.05134v1

<b>8. Adding additional layers: </b> After bringing validation loss more in line with training loss through normalization and dropout, adding further complexity through more layers seemed to further reduce the validation loss.

<b>9. Activation Function: </b>Though ReLU is the most popular activation function, Exponential Linear Units (ELU) in some cases give better results because their mean output is closer to zero, so they allow less bias to propagate through the network. https://arxiv.org/pdf/1511.07289.pdf

<b>10. Using pre-tuned models: </b>Though training pre-tuned models ("from the zoo") is likely a good choice in real world applications, our results training InceptionNetV3 and VGG16 did not show a loss lower than our simpler architectures. This could be due to the small size of our training dataset; these complex architectures likely need more sample data to train.

<b>11. Strides and Convolution Window Sizes: </b> Large strides and convolution window sizes generally worked worse than smaller convolutions with a stride of 1. Smaller windows were in line with the ideas presented in the InceptionV3 model paper of replacing larger windows with multiple small windows. https://arxiv.org/pdf/1512.00567.pdf

<b>12. Training Time: </b> Depth and additional data significantly increased training time.

<b>13. Use of GPUs: </b> GPUs gave a ~30x speedup over CPU training. They were definitely necessary to quickly train, test and evaluate various models.
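
Remark 9 can be illustrated numerically. The sketch below is not part of the project code: it applies ReLU and ELU (with `alpha = 1`, an assumed setting) to standard-normal inputs and compares the mean activation of each. The helper names `relu` and `elu` are illustrative only.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return np.maximum(x, 0.0)

def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

# Zero-mean, unit-variance inputs stand in for pre-activation values
rng = np.random.default_rng(0)
pre_activations = rng.standard_normal(100_000)

relu_mean = relu(pre_activations).mean()
elu_mean = elu(pre_activations).mean()
print("mean ReLU output:", relu_mean)
print("mean ELU output: ", elu_mean)
```

For such inputs the ELU mean comes out at roughly 0.16 against roughly 0.40 for ReLU, because the negative branch of ELU offsets part of the positive mass.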

### Further exploration:
Given more time, there are a number of other things we would have evaluated:
- Use cross validation to see if our best architectures hold up when the seed value is not 42.
- Testing more advanced, inception based architectures.
- Utilizing the images with missing keypoints (total keypoints != 30 or 8). Our solution used only images with 30 or 8 keypoints to train the models.
- Training a separate model for the eye keypoints that uses the data in both groups of images, since it appears the two labelling schemes were the same for the center of the eye keypoints.
- Tuning the amount of data augmentation used.
- Importing layers and hyperparameter settings from the other group's model, to see if those settings offer improvement over the current group's model.
- Use Ensemble architecture to split dataset based on number of available keypoints, train on separate models and combine results.


### Location of other models tried:
https://drive.google.com/file/d/1dK069Q08DIZ6g_HIi4osF20DjV2zktC3/view?usp=sharing
https://drive.google.com/drive/folders/1IcfVLCy_btNzYmqzvKzyVPFQQkKrYVvZ?usp=sharing


## Setup
### Imports

In [None]:
%matplotlib inline
from pandas.io.parsers import read_csv
from sklearn.model_selection import train_test_split

import os
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from keras.models import Sequential, load_model, Model
from keras.layers import Dense, Conv2D, MaxPooling2D, Flatten, Dropout, BatchNormalization
from keras.optimizers import Nadam
from keras.callbacks import EarlyStopping, Callback, History

from augmentdata.CustImageDataGenerator import CustImageDataGenerator,CustNumpyArrayIterator

from tensorflow.python.client import device_lib

print(device_lib.list_local_devices()) # confirm using GPU

### Constants

In [None]:
TRAIN_DATA = '../data/training.csv'                # train dataset downloaded from Kaggle
TEST_DATA = '../data/test.csv'                     # test dataset downloaded from Kaggle
IMAGE_ROWS = 96                                    # number of pixel rows
IMAGE_COLS = 96                                    # number of pixel columns
INPUT_SHAPE = (IMAGE_ROWS, IMAGE_COLS, 1)          # shape of input to first Conv2D layer of the model
RETRAIN = False                                    # bool to load and use existing saved models
VERBOSE_TRAIN = True                               # bool to show/hide progress while training a model
NUM_KEYPOINTS = 30                                 # maximum no. of facial keypoints for any image

### EDA

We began by examining images from the train dataset, which is organised as (x,y) coordinates for each of the 15 facial features, for a total of 30 keypoint values (referred to simply as keypoints below). The 31st column contains the 96x96 image pixel array, coded as raw grayscale values from 0 to 255. There are about 7000 images in the train dataset.

In [None]:
# Setting up the data for EDA

df_train = pd.read_csv(TRAIN_DATA)
Y = np.array(df_train[df_train.columns.difference(['Image'])])
X = df_train['Image']
labels = list(df_train.columns.difference(['Image']))

img_vec_len = IMAGE_ROWS*IMAGE_COLS                             # images pixel grid size

imgArray = np.zeros((X.shape[0], img_vec_len), dtype=int)       # temporary array to save each image as numpy array

idx=0
for i in X.keys(): 
    imgArray[idx] = np.fromstring(X[i], dtype=int, sep=' ')
    idx += 1
X = np.reshape( imgArray, (X.shape[0], IMAGE_ROWS, IMAGE_COLS, 1) )
print("Total images in train dataset: ", X.shape[0])

In [None]:
# Function to subplot a group of images, and label the ones with missing keypoints distinctly

def plot_images(images, points, type='actual', subplotting=False, gridRows=0, gridCols=0, 
                imageIndices=1, subtitles=True, title=None, labelsList=[] ):
    
    plt.figure(figsize=(4*gridCols, 4*gridRows))    # figsize is (width, height)
    img_nums = images.shape[0]
    points_nums = points.shape[0]
    
    if ( ( img_nums != points_nums) | ( img_nums != imageIndices.shape[0] ) ):
        raise ValueError("Mismatch in number of images and keypoints' rows passed to plot_images().")
    
    
    for thisImg in range(0, gridRows*gridCols ):
        
        if subplotting:
            plt.subplot(gridRows, gridCols, thisImg + 1)
            noKeypNums  = np.isnan(points[thisImg]).sum()
            
            if subtitles:
                if( noKeypNums == 0 ):                            # no missing keypoints (group1)
                    plt.title("#: " + str(imageIndices[thisImg]) + 
                              ",  Points: " + str(NUM_KEYPOINTS - noKeypNums), color='k')
                
                elif( ( noKeypNums > 0) & (noKeypNums < 22 ) ):   # (1,21) missing keypoints
                    plt.title("#: " + str(imageIndices[thisImg]) + 
                              ",  Points: " + str(NUM_KEYPOINTS - noKeypNums), color='m')
                
                elif( ( noKeypNums == 22 ) ):                     # 22 missing keypoints (group2)
                    plt.title("#: " + str(imageIndices[thisImg]) + 
                              ",  Points: " + str(NUM_KEYPOINTS - noKeypNums), color='b')
                
                else:                                             # > 22 missing keypoints
                    plt.title("#: " + str(imageIndices[thisImg]) + 
                              ",  Points: " + str(NUM_KEYPOINTS - noKeypNums), color='r')
        
        plt.imshow(np.reshape(images[thisImg,:],(96,96)), cmap = 'gray')

        x = 0
        for idx in range(0, points[thisImg].shape[0]):
            label = labelsList[idx]
            if label[-1]=='x':
                x = points[thisImg, idx]
            else:
                if label in ['left_eye_center_y',
                             'left_eye_inner_corner_y', 
                             'left_eye_outer_corner_y', 
                             'left_eyebrow_inner_end_y', 
                             'left_eyebrow_outer_end_y',
                             'mouth_left_corner_y'
                            ]:
                    if(type=='actual'):
                        plt.plot(x, points[thisImg, idx], 'c<')
                    else:
                        plt.plot(x, points[thisImg, idx], 'c*')
                        
                elif label in ['right_eye_center_y',
                             'right_eye_inner_corner_y', 
                             'right_eye_outer_corner_y', 
                             'right_eyebrow_inner_end_y', 
                             'right_eyebrow_outer_end_y',
                              'mouth_right_corner_y']:
                    if(type=='actual'):
                        plt.plot(x, points[thisImg, idx], 'r>')
                    else:
                        plt.plot(x, points[thisImg, idx], 'r*')
                
                else:
                    if(type=='actual'):
                        plt.plot(x, points[thisImg, idx], 'mo')
                    else:
                        plt.plot(x, points[thisImg, idx], 'm*')
                    
        plt.axis('off')
    
    if (title != None):
        plt.suptitle(title)
    plt.show()

In [None]:
# Function to plot an array of image indices

idx_max = df_train.shape[0]                      # all images
grid_cols = 4                                    # grid columns size for a subplot of images
grid_rows = 4                                    # grid rows size for images subplot
subImgNum = grid_cols * grid_rows

def plot_img_group( thisGroup, dataset='train', denorm=False, thisLabels=labels, thisSubTitle=True, thisTitle=None ):

    if(dataset == 'train'):
        thisX = X
        thisY = Y
    elif(dataset == 'group1'):
        thisX = X1
        thisY = Y1
    elif(dataset == 'group2'):
        thisX = X2
        thisY = Y2     
    
    thisSubsetX = np.zeros( (subImgNum, IMAGE_ROWS, IMAGE_COLS, 1), dtype=float)
    thisSubsetY = np.zeros( (subImgNum, thisY.shape[1]), dtype=float)
    img_indices = np.zeros( (subImgNum, 1), dtype=int)

    img_sub = 0                                  # local iterator for images in subplot
    flushed = False
    for img in thisGroup:
        
        if( ( (img_sub + 1 )  % subImgNum ) != 0 ):
            thisSubsetX[img_sub,:] = thisX[img-1,:]
            if denorm:
                thisSubsetY[img_sub,:] = 48*thisY[img-1,:] + 48
            else:
                thisSubsetY[img_sub,:] = thisY[img-1,:]
            img_indices[img_sub] = img
            img_sub += 1
            flushed = False
            
        else:
            thisSubsetX[img_sub,:] = thisX[img-1,:]
            if denorm:
                thisSubsetY[img_sub,:] = 48*thisY[img-1,:] + 48
            else:
                thisSubsetY[img_sub,:] = thisY[img-1,:]
            img_indices[img_sub] = img
            
            # plot when all images for the subplot are accumulated
            plot_images(images=thisSubsetX, points=thisSubsetY, subplotting=True, 
                        gridRows=grid_rows, gridCols=grid_cols, imageIndices=img_indices, 
                        subtitles=thisSubTitle, title=thisTitle, labelsList=thisLabels )
            
            # reset subplot indexing pointer and subplot image/keypoints buckets
            img_sub = 0
            flushed = True
            thisSubsetX = np.zeros( (subImgNum, IMAGE_ROWS, IMAGE_COLS, 1), dtype=float)
            thisSubsetY = np.zeros( (subImgNum, thisY.shape[1]), dtype=float)
    
    if not flushed:                          # for images leftover from partial subplot grid

        thisGridRows = ( (img_sub - 1) // grid_rows ) + 1
        if( thisGridRows > 1 ):
            thisGridCols = grid_cols
        else:
            thisGridCols = img_sub

        plot_images(images=thisSubsetX, points=thisSubsetY, subplotting=True, 
                    gridRows=thisGridRows, gridCols=thisGridCols, imageIndices=img_indices, 
                    subtitles=thisSubTitle, title=thisTitle, labelsList=thisLabels )


Images in the train dataset broadly split into two groups. The first group, from index 1 (row 1 in the .csv) up to index 2284, had more than 8 keypoints, up to a maximum of 30. The images after index 2284 had a maximum of 8 keypoints.
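
The grouping rule can be sketched on a small synthetic label array. This is an illustration rather than the notebook's own code: `count_labelled`, `assign_group`, and the demo values are hypothetical, and the only assumption carried over is that missing labels are stored as NaN.

```python
import numpy as np

NUM_COORDS = 30  # 15 keypoints x (x, y), matching NUM_KEYPOINTS in the notebook

def count_labelled(Y):
    """Number of non-missing label values in each row of Y."""
    return NUM_COORDS - np.isnan(Y).sum(axis=1)

def assign_group(Y):
    """Return 1 for rows with more than 8 labelled values, else 2."""
    return np.where(count_labelled(Y) > 8, 1, 2)

# Tiny synthetic example (made-up values, not Kaggle data)
Y_demo = np.full((3, NUM_COORDS), np.nan)
Y_demo[0, :] = 40.0        # fully labelled row
Y_demo[1, :8] = 40.0       # exactly 8 labelled values
Y_demo[2, :20] = 40.0      # partially labelled row
print(count_labelled(Y_demo), assign_group(Y_demo))
```

Rows such as the third one (neither 30 nor 8 labelled values) fall into group 1 under this rule but are later excluded from training, as discussed under "Missing keypoints".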

We identified approx. 320 images for in-depth examination after an initial review of the dataset. These images had some peculiarities, examples of these are discussed below: 

#### Peculiarity in Group1 Images:

##### Eyes closed: 
For the center-of-eye keypoint, these images had only an approximate labelling and offered the model little to extract features from, e.g. no contrast between the white sclera and the relatively darker eyelids or iris. Since the dataset had more than 300 such images, we assumed this image type to be common in the test dataset as well, and retained them in the training dataset.

In [None]:
imgs_eyes_closed    = np.array([152, 242, 288, 1676])

print("\nSample images with eyes closed")
plot_img_group( imgs_eyes_closed )


#### Peculiarity in Group2 Images:

##### Highly blurred:
Group 2 had a higher number of extremely blurred images that offered no significant features to extract. These images were not useful in training the model and were dropped from our training dataset.

In [None]:
imgs_high_blur    = np.array([2322, 2574, 2916, 6605])

print("\nSample images that are highly blurred")
plot_img_group( imgs_high_blur )

#### Peculiarities common in both groups:

##### Partially/fully covered keypoints:
Some images had parts of the face, mostly near the eyes, covered by hair, a hat, or another obstruction (props, sunglasses, dark shadow). These images were marked with approximate locations for the obscured keypoints. Including them would not benefit model training, so most of them were eliminated from the training dataset.

##### A second partial/full face:
Some images had more than one face either partially or fully appearing in the foreground/background. Although some of these were labelled correctly, we assumed these could mislead the model, especially in cases where more than one face is prominent. We decided to drop images where the primary face was not sufficiently prominent.

##### Wrong/bad labels:
These images had wrong or bad labelling and were removed from the train dataset. 

##### Missing keypoints:
Images that did not fall into Group 1 or 2 (images where identified keypoints were not 30 or 8) were either out of frame, had the face viewed sideways, or were overshadowed and hence did not contain all keypoints. While these images were valid for training, we found that the model fitting function expected a coherent specification of keypoints for all input samples. Developing a solution to accurately handle all these scenarios requires additional models. Due to time constraints, we decided to limit our scope to include only images with 30 or 8 keypoints.


In [None]:
imgs_obstructed     = np.array([1697, 1947, 5628, 6755])
imgs_2nd_face       = np.array([1820, 2064, 4264, 4491])
imgs_bad_labels     = np.array([1748, 1878, 2200, 6493])
imgs_miss_keyps     = np.array([1709, 1875, 1882, 2322])

print("\nSample images with partial or fully covered/obscured keypoints")
plot_img_group( imgs_obstructed )

print("\nSample images with a second partial/full face")
plot_img_group( imgs_2nd_face )

print("\nSample images with bad/wrong labelling")
plot_img_group( imgs_bad_labels )

print("\nSample images with missing keypoints")
plot_img_group( imgs_miss_keyps )

It was evident from EDA that a different labeling scheme was used for images in the two groups. For example, images in group 2 had keypoints marked underneath the nose while images in group 1 had tip of the nose as a keypoint. Group2 also marks center of the bottom lip, in the center of the mouth, often on teeth.

These differences in keypoint location identification reconfirmed our approach to develop 2 separate models, one for group2 images with 8 keypoints, and another for group1 with 30.  

In [None]:
# Final indices of images to be dropped from the dataset

IDX_BAD_IMAGES = np.array( [1621, 1862, 1748, 1878, 1927, 2200, 2431, 2584, 2647, 
                            2671, 2765, 4198, 1627, 1628, 1637, 1957, 4477, 1820, 
                            2064, 2089, 2091, 2109, 2195, 4264, 4491, 6490, 6493, 
                            6494, 1655, 2096, 2454, 3206, 3287, 5628, 5653, 6754, 
                            6755, 2321, 2322, 2414, 2428, 2462, 2574, 2584, 2663, 
                            2691, 2694, 2830, 2910, 2916, 3126, 3176, 3291, 3299, 
                            3361, 4061, 4483, 4484, 4494, 4766, 4809, 4837, 4880, 
                            4905, 5068, 5362, 5566, 5868, 6535, 6538, 6588, 6605, 
                            6659, 6724, 6733, 6753, 6758, 6766, 6907 ] )

### Loading the data
This section loads data for the two groups (separate models were trained on each group because they appeared to have different labelling schemes).

In [None]:
# Function to clean up the train dataset, normalize it, drop bad images & labels, 
# and finally split the dataset into 2, for group1 and group2 modelling.

def loaderV2(test=False, seed=None, keeplabels=None):
    
    if seed:
        np.random.seed(seed)
    fileloc = TEST_DATA if test else TRAIN_DATA
    
    df = read_csv(fileloc)
    
    df['Image'] = df['Image'].apply(lambda x: np.fromstring(x, sep=' '))
    
    if keeplabels:
        df = df[list(keeplabels) + ['Image']]
        
    X = np.vstack(df['Image'])
    
    if not test:                                       # process train dataset
        Y = df[df.columns.difference(['Image'])].values
        Y = Y.astype(np.float32)
        
        # remove rows having bad images or labels
        X = np.delete( X, (IDX_BAD_IMAGES - 1), axis=0 )
        Y = np.delete( Y, (IDX_BAD_IMAGES - 1), axis=0 )
        
        # normalize - by pixel across the whole dataset subtract mean and divide by stdev
        X = X - np.tile(np.mean(X,axis=0),(X.shape[0],1))
        X = X / np.tile(np.std(X,axis=0),(X.shape[0],1))
        X = X.astype(np.float32)
    
        Y = (Y - 48) / 48                     # this helps, but tanh on output doesnt
        shuffle = np.random.permutation(np.arange(X.shape[0]))
        X, Y = X[shuffle], Y[shuffle]
    
        X = X.reshape(-1, 96, 96, 1)
        
        # split X and Y into datasets for model1 (more than 8 keypoints) and model2 (8 or fewer keypoints)
        X_model1 = np.zeros( (X.shape[0], 96,96,1), dtype=np.float32)
        X_model2 = np.zeros( (X.shape[0], 96,96,1), dtype=np.float32)
        Y_model1 = np.zeros( Y.shape, dtype=float)
        Y_model2 = np.zeros( Y.shape, dtype=float)
        tempIdx1 = 0
        tempIdx2 = 0
        
        for thisIdx in range(0, Y.shape[0]):
            numKeyps  = NUM_KEYPOINTS - np.isnan(Y[thisIdx]).sum()
            
            if( ( numKeyps > 8 ) ):
                X_model1[tempIdx1] = X[thisIdx,:,:]
                Y_model1[tempIdx1] = Y[thisIdx,:]
                tempIdx1 = tempIdx1 + 1
            else:
                X_model2[tempIdx2] = X[thisIdx,:,:]
                Y_model2[tempIdx2] = Y[thisIdx,:]
                tempIdx2 = tempIdx2 + 1
    
        # remove empty rows
        drop_idx1 = []
        drop_idx2 = []
        
        for idx in range(0, X.shape[0]):
            if( (np.all(Y_model1[idx] == 0)) | (np.isnan(Y_model1[idx]).sum() != 0) ):
                drop_idx1.append(idx)
            if( (np.all(Y_model2[idx] == 0)) | (np.isnan(Y_model2[idx]).sum() != 22) ):
                drop_idx2.append(idx)
        
        X_model1 = np.delete( X_model1, np.array(drop_idx1), axis=0 )
        Y_model1 = np.delete( Y_model1, np.array(drop_idx1), axis=0 )
        X_model2 = np.delete( X_model2, np.array(drop_idx2), axis=0 )
        Y_model2 = np.delete( Y_model2, np.array(drop_idx2), axis=0 )            
        
        # remove empty columns, setup lists of labels
        labels = df.columns.difference(['Image'])
        labels1 = labels
        labels2 = []
        drop_idx3 = []
        for idx in range(0, Y.shape[1]):
            if( (np.all(Y_model2[:,idx] == 0)) | (np.isnan(Y_model2[:,idx]).sum() != 0) ):
                drop_idx3.append(idx)
            else:
                labels2.append(labels[idx])
        Y_model2 = np.delete( Y_model2, np.array(drop_idx3), axis=1 ) 
        
        # return the original dataset and the group splits
        return X_model1, Y_model1, labels1, X_model2, Y_model2, labels2, X, Y, labels
    
    else:                                          # process test dataset
        Y = None
        
        # normalize - by pixel across the whole dataset subtract mean and divide by stdev
        X = X - np.tile(np.mean(X,axis=0),(X.shape[0],1))
        X = X / np.tile(np.std(X,axis=0),(X.shape[0],1))
        X = X.reshape(-1, 96, 96, 1)
        labels = df.columns.difference(['Image'])
        
        return X, Y, labels
    

In [None]:
X1, Y1, labels1,   X2, Y2, labels2,   X, Y, labels = loaderV2(seed=42)

print("Group1 data Y1 shape: ", Y1.shape, ", X1 shape: ", X1.shape)
print("Group2 data Y2 shape: ", Y2.shape, ", X2 shape: ", X2.shape)
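
The two normalization steps in `loaderV2` can be written compactly with NumPy broadcasting. The sketch below is a simplified stand-in, not a replacement for the loader: the helper names are hypothetical, the data is random, and the small `eps` term is an added assumption to guard against constant pixels (the loader itself divides by the raw standard deviation).

```python
import numpy as np

HALF_SIZE = 48.0  # half of the 96-pixel image side

def standardize_pixels(X, eps=1e-8):
    """Per-pixel standardization over the dataset (axis 0), via broadcasting."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)   # eps is an assumption, not in loaderV2

def scale_keypoints(Y):
    """Map pixel coordinates in [0, 96] to [-1, 1], as in (Y - 48) / 48."""
    return (Y - HALF_SIZE) / HALF_SIZE

def unscale_keypoints(Y_scaled):
    """Inverse of scale_keypoints: back to pixel coordinates."""
    return HALF_SIZE * Y_scaled + HALF_SIZE

# Random stand-in data (not Kaggle data)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(5, 96 * 96)).astype(np.float32)
Y_demo = rng.uniform(0, 96, size=(5, 30))

X_std = standardize_pixels(X_demo)
Y_back = unscale_keypoints(scale_keypoints(Y_demo))
print(X_std.shape, np.allclose(Y_back, Y_demo))
```

The inverse mapping `48 * y + 48` is the same one applied to the model outputs when the submission is assembled.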


## Model Fitting

### Model fitting function
Below is a general model fitting function used to train the model architectures in this notebook.

In [None]:
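def fit_model(model, data, modelname,
              generator=None, retrain=RETRAIN,
              epochs=10000, patience=1000, optimizer=None):
    # build a fresh optimizer per call; a default instance in the signature
    # would be created once and shared by every model trained with this function
    if optimizer is None:
        optimizer = Nadam()

    # check if the user wants to retrain or if the saved model doesn't exist
    if retrain or not os.path.exists('../saved-models/' + modelname + '.h5'):
        # data setup: data is [X_train, X_valid, y_train] or [X_train, X_valid, y_train, y_valid]
        X_train = data[0]
        y_train = data[2]
        if len(data) == 4:
            valid_dat = (data[1], data[3])
        else:
            valid_dat = None

        # default optimization routine to use Nadam and minimize mse
        model.compile(loss='mse', optimizer=optimizer)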
        
        # set an early stopping criteria
        if valid_dat:
            earlystop = EarlyStopping(monitor='val_loss',
                                     patience=patience,
                                     verbose=1,
                                     mode="auto")
            
        else:
            earlystop = EarlyStopping(monitor='loss',
                                     patience=patience,
                                     verbose=1,
                                     mode="auto")
        
        callbacks = [earlystop]
        
        # fit the model (with a data generator if present)
        if generator:
            history = model.fit_generator(generator,
                        epochs=epochs,
                        steps_per_epoch=data[0].shape[0]//32,
                        callbacks=callbacks,
                        validation_data=valid_dat
             )
        else:
            history = model.fit(X_train, y_train,
                                epochs=epochs,
                                batch_size=32,
                                callbacks=callbacks,
                                validation_data=valid_dat,
                                verbose=VERBOSE_TRAIN
                     )
        # save the model weights
        model.save('../saved-models/'+ modelname + '.h5')
        
        # save the model loss history
        with open('../saved-models/histories/'+modelname+'_hist',
                  'wb') as file_pi:
            pickle.dump(history.history, file_pi)
        history = history.history
    
    # if the user doesn't want to retrain, load the saved model and its loss history
    else:
        model = load_model('../saved-models/'+modelname+'.h5')
        with open('../saved-models/histories/' + modelname + '_hist', 'rb') as file_pi:
            history = pickle.load(file_pi)
        
    return history, model
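
The effect of the `patience` argument can be sketched in plain Python. This is a simplified illustration of patience-based early stopping, not the Keras `EarlyStopping` implementation (options such as a minimum improvement threshold are ignored); `early_stop_epoch` and the loss values are hypothetical.

```python
def early_stop_epoch(losses, patience):
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once `patience` consecutive epochs pass without the
    monitored loss improving on the best value seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss      # new best value: reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses for illustration
val_losses = [0.90, 0.50, 0.40, 0.41, 0.42, 0.39, 0.40, 0.41, 0.43]
stop_at = early_stop_epoch(val_losses, patience=3)
print("stops at epoch:", stop_at)
```

With the large default of `patience=1000` used by `fit_model`, training only halts after a long plateau, which is why the group 2 model passes a smaller value.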

### Data augmenting function
Keras offers an image data generator class that automatically transforms input images. Data augmentation helps artificially increase the size of the training dataset by adding variance to training images without distorting their meaning. The stock Keras data generator does not change the labels of the images.

In our case, when the image was changed, we needed the keypoint locations to adjust accordingly. Therefore, we had to create a custom subclass of the Keras ImageDataGenerator that not only transforms the images, but also transforms the labels. 

Our custom generator applies the following transformations to image data:
1. randomly rotates images by up to 5 degrees
2. flips images horizontally
3. translates images by up to 5% of the total height/width of the image.

These transformations were only applied if they would not move the keypoints outside of the image. Refer to CustImageDataGenerator.py file for more details.
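
Since CustImageDataGenerator.py is not reproduced here, the following is only an illustrative sketch of why labels must change along with the image, using the horizontal flip as the example. It is not the project's implementation: `swap_left_right` and `flip_horizontal` are hypothetical helpers, and the mirroring formula assumes coordinates refer to pixel centres, so that column `x` maps to `(width - 1) - x`.

```python
import numpy as np

def swap_left_right(name):
    """Swap the 'left'/'right' token in a label such as 'mouth_left_corner'."""
    swap = {"left": "right", "right": "left"}
    return "_".join(swap.get(token, token) for token in name.split("_"))

def flip_horizontal(image, keypoints):
    """Mirror an image left-to-right and adjust its keypoints.

    image:     array of shape (rows, cols)
    keypoints: dict mapping a label name to (x, y) in pixel coordinates,
               where x is the column index
    """
    width = image.shape[1]
    flipped_image = image[:, ::-1]
    flipped_points = {}
    for name, (x, y) in keypoints.items():
        # Mirror the column; the row is unchanged. A feature labelled
        # 'left' ends up on the other side, so its label is swapped too.
        flipped_points[swap_left_right(name)] = ((width - 1) - x, y)
    return flipped_image, flipped_points

# Made-up image and keypoints for illustration
demo_image = np.arange(96 * 96).reshape(96, 96)
demo_points = {"left_eye_center": (66.0, 37.0),
               "mouth_left_corner": (61.0, 73.0),
               "nose_tip": (48.0, 57.0)}
flipped_image, flipped_points = flip_horizontal(demo_image, demo_points)
print(flipped_points)
```

Rotations and translations need the same treatment: whatever geometric transform is applied to the pixels has to be applied to the (x,y) labels as well.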

In [None]:
# Generator for the models that predict 30 keypoints
datagen = CustImageDataGenerator(
    rotation_range=5. #degrees
     ,horizontal_flip=True
     ,width_shift_range=.05 # percent of image width
     ,height_shift_range=.05 # percent of image height
    ).flow(X1,Y1,whichlabels=list(labels1), batch_size=32)

# Generator for the models that predict 8 keypoints
g2_datagen = CustImageDataGenerator(
    rotation_range=5. #degrees
     ,horizontal_flip=True
     ,width_shift_range=.05 # percent of image width
     ,height_shift_range=.05 # percent of image height
    ).flow(X2,Y2,whichlabels=list(labels2), batch_size=32)

### Build and train Group 1 Images' Model (more than 8 and up to 30 keypoints to predict)
**Note**: In order to retrain the model, retrain=True should be passed. **Caution:** this will overwrite the old saved model in the saved-models directory.

In [None]:
g1_model3_new = Sequential()

g1_model3_new.add(Conv2D(32,
                 (6, 6),
                 activation='relu',
                input_shape=INPUT_SHAPE))
g1_model3_new.add(BatchNormalization())
g1_model3_new.add(MaxPooling2D(pool_size=(2, 2)))

g1_model3_new.add(Conv2D(filters=64,
                 kernel_size=(5, 5),
                 activation='relu'))
g1_model3_new.add(BatchNormalization())
g1_model3_new.add(MaxPooling2D(pool_size=(2, 2)))

g1_model3_new.add(Conv2D(filters=256,
                 kernel_size=(4, 4),
                 activation='relu'))
g1_model3_new.add(BatchNormalization())
g1_model3_new.add(MaxPooling2D(pool_size=(2, 2)))

g1_model3_new.add(Conv2D(filters=64,
                 kernel_size=(3,3),
                 activation='relu'))
g1_model3_new.add(BatchNormalization())
g1_model3_new.add(MaxPooling2D(pool_size=(2, 2)))

g1_model3_new.add(Conv2D(filters=128,
                 kernel_size=(2, 2),
                 activation='relu'))
g1_model3_new.add(BatchNormalization())
g1_model3_new.add(MaxPooling2D(pool_size=(2, 2)))

g1_model3_new.add(Flatten())
g1_model3_new.add(Dense(500, activation = "relu"))
g1_model3_new.add(BatchNormalization())

g1_model3_new.add(Dense(500, activation = "relu"))
g1_model3_new.add(BatchNormalization())
g1_model3_new.add(Dropout(.3))

g1_model3_new.add(Dense(30))

print(g1_model3_new.summary())
g1_model3_new_hist, g1_model3_new = fit_model(g1_model3_new, [X1, None, Y1],
                                'g1_CNN_aug_addedLayers',datagen,retrain=RETRAIN)

### Build and train Group 2 Images' Model (maximum 8 keypoints to predict)
**Note**: In order to retrain the model, retrain=True should be passed. **Caution:** this will overwrite the old saved model in the saved-models directory.

In [None]:
g2_model4 = Sequential()

g2_model4.add(Conv2D(filters=64,
                 kernel_size=(6, 6),
                 strides=1,
                 activation='elu',
                 input_shape=INPUT_SHAPE))
g2_model4.add(BatchNormalization())
g2_model4.add(Dropout(.1))

g2_model4.add(Conv2D(filters=128,
                 kernel_size=(5, 5),
                 strides=1,
                 activation='elu'))
g2_model4.add(BatchNormalization())
g2_model4.add(MaxPooling2D(pool_size=(2, 2)))
g2_model4.add(Dropout(.2))

g2_model4.add(Conv2D(filters=256,
                 kernel_size=(4, 4),
                 activation='elu'))
g2_model4.add(BatchNormalization())
g2_model4.add(MaxPooling2D(pool_size=(2, 2)))
g2_model4.add(Dropout(.2))

g2_model4.add(Conv2D(filters=512,
                 kernel_size=(3, 3),
                 activation='elu'))
g2_model4.add(BatchNormalization())
g2_model4.add(MaxPooling2D(pool_size=(2, 2)))
g2_model4.add(Dropout(.3))

g2_model4.add(Conv2D(filters=512,
                 kernel_size=(2, 2),
                 activation='elu'))
g2_model4.add(BatchNormalization())
g2_model4.add(MaxPooling2D(pool_size=(2, 2)))
g2_model4.add(Dropout(.4))

g2_model4.add(Flatten())
g2_model4.add(Dense(500, activation = "elu"))
g2_model4.add(BatchNormalization())
g2_model4.add(Dropout(.4))

g2_model4.add(Dense(100, activation = "elu"))
g2_model4.add(BatchNormalization())

g2_model4.add(Dense(8))

print(g2_model4.summary())

g2_model4_hist, g2_model4 = fit_model(g2_model4,[X2, None, Y2],
                                'g2_CNNv2_aug', generator=g2_datagen
                                     ,retrain=RETRAIN, patience=100)
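
The feature-map sizes produced by the two convolutional stacks above can be checked by hand. The sketch below assumes the Keras defaults used in the model definitions: unpadded ("valid") convolutions with stride 1 and non-overlapping 2x2 max pooling. The helper names are hypothetical, and the results are meant to be compared against `model.summary()`.

```python
def conv_out(size, kernel, stride=1):
    """Output length of a 'valid' (unpadded) convolution along one axis."""
    return (size - kernel) // stride + 1

def pool_out(size, pool=2):
    """Output length of non-overlapping max pooling (remainder dropped)."""
    return size // pool

def trace_sizes(input_size, layers):
    """Apply ('conv', k) / ('pool', p) steps and return every intermediate size."""
    sizes = [input_size]
    for kind, width in layers:
        step = conv_out if kind == "conv" else pool_out
        sizes.append(step(sizes[-1], width))
    return sizes

# Group 1 stack: each convolution is followed by 2x2 max pooling
g1_layers = [("conv", 6), ("pool", 2), ("conv", 5), ("pool", 2),
             ("conv", 4), ("pool", 2), ("conv", 3), ("pool", 2),
             ("conv", 2), ("pool", 2)]
# Group 2 stack: the first convolution has no pooling after it
g2_layers = [("conv", 6), ("conv", 5), ("pool", 2), ("conv", 4), ("pool", 2),
             ("conv", 3), ("pool", 2), ("conv", 2), ("pool", 2)]

g1_sizes = trace_sizes(96, g1_layers)
g2_sizes = trace_sizes(96, g2_layers)
g1_flat = g1_sizes[-1] ** 2 * 128   # 128 filters in the last group 1 convolution
g2_flat = g2_sizes[-1] ** 2 * 512   # 512 filters in the last group 2 convolution
print(g1_sizes, g1_flat)
print(g2_sizes, g2_flat)
```

Under these assumptions the group 1 stack reduces the 96x96 input to a 1x1 map (128 values after `Flatten`), while the group 2 stack ends at 4x4 (8192 values), so the group 2 model feeds a much wider vector into its first dense layer.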


## Predictions
### Load test images

In [None]:
out_images, _ , _ = loaderV2(test=True, seed=None, keeplabels=None)
print(out_images.shape)

### Create predictions
Predictions with both models: g1 predicts all 30 keypoints, g2 predicts 8 keypoints.

In [None]:
g1_prediction = g1_model3_new.predict(out_images)
g2_prediction = g2_model4.predict(out_images)


###  Prepare predictions into submission format
#### Adding keypoints per test image to be predicted
This is necessary to know which model's predictions should be used.

In [None]:
# aggregate the number of keypoints needed for each distinct image in test dataset

IdLookupTable = read_csv('../data/IdLookupTable.csv')

rowIDs = np.array(IdLookupTable['RowId'])
# keypoints to be predicted for each imageID
imgs_numKeyps = np.zeros(max(IdLookupTable['ImageId']), dtype=int)

for rowIdx in range(0, rowIDs.shape[0]):
    thisImgID = IdLookupTable.loc[rowIdx,'ImageId']
    imgs_numKeyps[thisImgID - 1] += 1

print("Total test images: {}".format(imgs_numKeyps.shape[0]) )

#### Use appropriate model for predicting each image's keypoints
If the test data requests 8 or fewer keypoints for an image, use the group 2 model; otherwise use the group 1 model. We also tried averaging the predictions of both models for common keypoints such as left_eye_center and right_eye_center, but this yielded a higher RMSE.

In [None]:
label_locs1 = {}
label_locs2 = {}
for i, label in enumerate(labels1):
    label_locs1[label]=i

for i, label in enumerate(labels2):
    label_locs2[label]=i
    
IdLookupTable['test'] = IdLookupTable['FeatureName'].replace(label_locs1)

thisRowId = 0
modelUsed = np.zeros(IdLookupTable['RowId'].shape[0], dtype=int)

for imgIdx in range(0, imgs_numKeyps.shape[0]):
    
    if( imgs_numKeyps[imgIdx] > 8):
        for keypIdx in range(0, imgs_numKeyps[imgIdx]):
            map_label_idx1 = label_locs1[IdLookupTable.loc[thisRowId, 'FeatureName']]
            IdLookupTable.loc[thisRowId, 'Location'] =  (48*g1_prediction[imgIdx, map_label_idx1]) + 48
            modelUsed[thisRowId] = 1
            thisRowId += 1
    else:
        for keypIdx in range(0, imgs_numKeyps[imgIdx]):
            map_label_idx2 = label_locs2[IdLookupTable.loc[thisRowId, 'FeatureName']]
            IdLookupTable.loc[thisRowId, 'Location'] =  (48*g2_prediction[imgIdx, map_label_idx2]) + 48
            
            modelUsed[thisRowId] = 2
            thisRowId += 1

IdLookupTable['Location'] = (IdLookupTable['Location'].
                             where(IdLookupTable['Location']<=96, 96).
                             where(IdLookupTable['Location']>=0, 0)
                            )
IdLookupTable.head()
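
The selection rule and the final clipping can also be expressed in vectorized form. The following is a hedged sketch on synthetic image IDs, not the code used for the submission: `choose_model` and `to_pixels` are hypothetical helpers, and the threshold of 8 mirrors the rule described above.

```python
import numpy as np

def choose_model(image_ids, threshold=8):
    """For every lookup row, return 1 (group 1 model) or 2 (group 2 model).

    image_ids: 1-D array holding the ImageId of each requested value.
    An image uses the group 1 model when more than `threshold` values
    are requested for it, otherwise the group 2 model.
    """
    ids = np.asarray(image_ids).ravel()
    _, inverse, counts = np.unique(ids, return_inverse=True, return_counts=True)
    requested_per_row = counts[inverse.ravel()]
    return np.where(requested_per_row > threshold, 1, 2)

def to_pixels(pred_scaled, size=96.0):
    """Undo the (y - 48) / 48 scaling and clip to the image bounds."""
    return np.clip(48.0 * np.asarray(pred_scaled) + 48.0, 0.0, size)

# Synthetic lookup: image 1 asks for 30 values, image 2 for 8, image 3 for 4
demo_ids = np.array([1] * 30 + [2] * 8 + [3] * 4)
demo_models = choose_model(demo_ids)
demo_locations = to_pixels(np.array([-1.2, 0.0, 1.5]))
print(demo_models, demo_locations)
```

This avoids the explicit row counter, at the cost of assuming nothing about row order in the lookup table.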

#### Creating the submission

In [None]:
Submission = IdLookupTable[['RowId','Location']]
Submission.to_csv(path_or_buf='Final_FacialKeypoints.csv',
                  index=False)

## Evaluation
Our final model's predictions submitted to Kaggle produced an RMSE of 1.71947 on the unseen test data labels. This placed us 4th on the public Kaggle leaderboard, and is much better than the baseline RMSE of 3.138 that we set by simply using the mean value of the labels as our prediction.

![](img/Score.png)
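
For reference, the metric and a mean-value baseline can be sketched as follows. How the 3.138 baseline was computed is not shown in this notebook, so this is one plausible way to do it on made-up numbers; `rmse` and `mean_baseline` are hypothetical helpers, and missing labels are assumed to be NaN.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over all labelled values (NaNs ignored)."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.nanmean(diff ** 2)))

def mean_baseline(y_train, n_rows):
    """Predict the per-column training mean for every row."""
    col_means = np.nanmean(y_train, axis=0)
    return np.tile(col_means, (n_rows, 1))

# Made-up labels for illustration (two label columns, pixel units)
y_train_demo = np.array([[10.0, 20.0],
                         [14.0, np.nan],
                         [12.0, 24.0]])
y_valid_demo = np.array([[12.0, 22.0],
                         [15.0, 26.0]])

baseline_pred = mean_baseline(y_train_demo, n_rows=y_valid_demo.shape[0])
baseline_rmse = rmse(y_valid_demo, baseline_pred)
print("baseline RMSE:", baseline_rmse)
```

Because the models are trained on scaled labels, predictions should be mapped back to pixel units before computing an RMSE that is comparable to the Kaggle score.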