
## Import Packages and Libraries

In [28]:
import os
import shutil
import math
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf
from sklearn.model_selection import train_test_split

# Import everything from the Keras bundled with TensorFlow so that the
# standalone `keras` package and `tensorflow.keras` are not mixed.
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import (Dense, Dropout, Activation, Flatten,
                                     Conv2D, MaxPooling2D, ZeroPadding2D,
                                     GlobalAveragePooling2D, BatchNormalization)
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import RMSprop, SGD, Adam
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau

## Load Data

In [29]:
# Load the datasets from Google Drive
train_npz = np.load("/content/drive/MyDrive/Capstone/datasets/train.npz")
test_npz = np.load("/content/drive/MyDrive/Capstone/datasets/test.npz")

# "arr_0" and "arr_1" are the default names np.savez gives to unnamed arrays
x_train = train_npz["arr_0"]
y_train = train_npz["arr_1"]
x_test = test_npz["arr_0"]
y_test = test_npz["arr_1"]

# Repeat the single grayscale channel 3 times to imitate RGB input
x_train = np.repeat(x_train[..., np.newaxis], 3, -1)
x_test = np.repeat(x_test[..., np.newaxis], 3, -1)

# Split the original test set into validation (70%) and test (30%) sets
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=0.3)

print('Train -- ', x_train.shape, y_train.shape)
print('Val -- ', x_val.shape, y_val.shape)
print('Test -- ', x_test.shape, y_test.shape)

Train --  (28709, 48, 48, 3) (28709,)
Val --  (5024, 48, 48, 3) (5024,)
Test --  (2154, 48, 48, 3) (2154,)
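The channel-replication step above can be checked on a small array. The sketch below uses a hypothetical stand-in array (`gray`) rather than the real dataset, and only confirms that every channel ends up holding the same grayscale values.

```python
import numpy as np

# Hypothetical stand-in for the real data: 2 grayscale images of 48x48 pixels
gray = np.arange(2 * 48 * 48, dtype=np.float32).reshape(2, 48, 48)

# Add a trailing channel axis, then copy it 3 times along that axis
rgb_like = np.repeat(gray[..., np.newaxis], 3, axis=-1)

print(rgb_like.shape)  # (2, 48, 48, 3)

# All three channels carry the same grayscale values
print(np.array_equal(rgb_like[..., 0], rgb_like[..., 2]))  # True
```

The replication adds no new information; it only makes the input shape compatible with a network that expects three channels.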


## Prepare Model 

In [30]:
# VGG16 convolutional base pretrained on ImageNet, without its classifier head
vgg = VGG16(weights='imagenet', include_top=False, input_shape=(48, 48, 3))

# Freeze layers 0-14 (input through the fourth pooling layer);
# only the last convolutional block stays trainable
for layer in vgg.layers[:15]:
  layer.trainable = False

for i, layer in enumerate(vgg.layers):
  print(str(i) + ' ' + layer.__class__.__name__, layer.trainable)

def fully_connected(base_model, num_classes):
  """Attach a dense classification head to the output of base_model."""
  x = base_model.output
  x = GlobalAveragePooling2D()(x)
  x = Dense(1024, activation='relu')(x)
  x = Dropout(0.3)(x)
  x = Dense(1024, activation='relu')(x)
  x = Dropout(0.3)(x)
  x = Dense(1024, activation='relu')(x)
  x = Dropout(0.3)(x)
  x = Dense(512, activation='relu')(x)
  x = Dense(num_classes, activation='softmax')(x)
  return x

num_class = 7
FC_head = fully_connected(vgg, num_class)
model = Model(inputs=vgg.input, outputs=FC_head)
# summary() prints the table itself; wrapping it in print() would also print "None"
model.summary()

0 InputLayer False
1 Conv2D False
2 Conv2D False
3 MaxPooling2D False
4 Conv2D False
5 Conv2D False
6 MaxPooling2D False
7 Conv2D False
8 Conv2D False
9 Conv2D False
10 MaxPooling2D False
11 Conv2D False
12 Conv2D False
13 Conv2D False
14 MaxPooling2D False
15 Conv2D True
16 Conv2D True
17 Conv2D True
18 MaxPooling2D True
Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_6 (InputLayer)        [(None, 48, 48, 3)]       0         
                                                                 
 block1_conv1 (Conv2D)       (None, 48, 48, 64)        1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 48, 48, 64)        36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 24, 24, 64)        0         
                                                                 

## Model Training 

The good news is that the model trains in a reasonable manner. The bad news is that it overfits the training data. What can we do?

Training accuracy was about 91%, while test and validation accuracy plateaued at about 42%.

Adding 30% dropout layers reduced the overfitting, but validation accuracy still plateaued at about 40%.

Tried so far:

- Added dropout.
- Used global average pooling: the layer aggregates the information in each feature map and helps prevent the overfitting caused by fully connected layers in the model.
- Tried all three optimizers (SGD, Adam, RMSprop).
- Training with `ImageDataGenerator` had an estimated time of 4 hours or more.

Still to try:

- Use balanced training data.
- Unfreeze more layers.
- Add batch normalization.
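Two of the measures above, global average pooling and dropout, can be sketched in plain NumPy to show what the Keras layers compute. This is a minimal illustration, not the Keras implementation: the function names and the input arrays are hypothetical. If each of the five pooling stages of VGG16 halves the spatial size (48 → 24 → 12 → 6 → 3 → 1), the convolutional base likely outputs a 1×1 grid for 48×48 inputs, in which case global average pooling only drops the spatial axes.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each channel over height and width.

    Input shape: (batch, height, width, channels); output: (batch, channels).
    """
    return feature_maps.mean(axis=(1, 2))

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units while training and
    rescale the survivors so the expected activation is unchanged."""
    if not training:
        return x  # at inference time the layer passes inputs through
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)

# Hypothetical feature maps: batch of 2, 3x3 spatial grid, 4 channels
fmap = rng.normal(size=(2, 3, 3, 4))
pooled = global_average_pool(fmap)
print(pooled.shape)  # (2, 4)

# Hypothetical activations, all ones, to make the effect of dropout visible
activations = np.ones((1000, 1024))
train_out = dropout(activations, rate=0.3, training=True, rng=rng)
eval_out = dropout(activations, rate=0.3, training=False, rng=rng)

print((train_out == 0).mean())  # fraction of dropped units, close to 0.3
print(train_out.mean())         # close to 1.0 thanks to the rescaling
print(np.array_equal(eval_out, activations))  # True
```

Because dropout is active only during training, training accuracy is measured on a network with units switched off while validation accuracy is measured on the full network. That is one possible reason for validation accuracy exceeding training accuracy once dropout is added.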

In [31]:
# Training hyperparameters
epochs = 100
bs = 32  # batch size
rms_lr = 0.001
sgd_lr = 0.01
adam_lr = 0.001
sgd_decay = 0.0001

# Reduce the learning rate 10x when val_loss has not improved for 10 epochs
lrd = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10, min_lr=0.00001, verbose=1)
# Stop when val_accuracy has not improved for 10 epochs and restore the best weights
es = EarlyStopping(monitor='val_accuracy', min_delta=0, patience=10, verbose=1, restore_best_weights=True)
# Save the model with the lowest val_loss seen so far
checkpoint = ModelCheckpoint('face_vgg.h5', monitor='val_loss', mode='min', save_best_only=True, verbose=1)
callbacks = [checkpoint, es, lrd]

# Three candidate optimizers; only the one passed to compile() below is used
sgd = SGD(learning_rate=sgd_lr, momentum=0.9, decay=sgd_decay, nesterov=True)
adam = Adam(learning_rate=adam_lr, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
rms = RMSprop(learning_rate=rms_lr)

# Labels are integer class ids, hence the sparse variant of the loss
model.compile(loss='sparse_categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=bs,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_val, y_val),
                    callbacks=callbacks)



Epoch 1/100

KeyboardInterrupt: ignored
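The `decay` argument passed to `SGD` above shrinks the learning rate a little on every update. The sketch below assumes the time-based rule used by the older Keras optimizers, `lr / (1 + decay * step)`; newer Keras versions handle `decay` differently, so treat the formula as an assumption to check against the installed version. The helper name `time_based_lr` is hypothetical.

```python
import math

def time_based_lr(initial_lr, decay, step):
    """Learning rate after `step` optimizer updates under time-based decay."""
    return initial_lr / (1.0 + decay * step)

# Values taken from the cells above
n_train, batch_size = 28709, 32
sgd_lr, sgd_decay = 0.01, 0.0001

# One optimizer update per batch; the last batch of an epoch may be partial
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)  # 898

for epoch in (0, 1, 10, 100):
    lr = time_based_lr(sgd_lr, sgd_decay, epoch * steps_per_epoch)
    print(f"after {epoch:3d} epochs: lr = {lr:.6f}")
```

Under this assumption the schedule interacts with `ReduceLROnPlateau`, which multiplies the base learning rate by 0.1 whenever validation loss stalls, so the effective rate can fall faster than either mechanism suggests on its own.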

In [20]:
loss, acc = model.evaluate(x_test, y_test, verbose=1)
print("Restored model, accuracy: {:5.2f}%".format(100 * acc))

Restored model, accuracy: 44.43%
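A single overall accuracy can hide large differences between classes, which matters if the training data is imbalanced. The sketch below computes accuracy per class; the helper `per_class_accuracy` and the toy label arrays are hypothetical and serve only as illustration.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Return the accuracy for each class (NaN for classes absent from y_true)."""
    acc = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            acc[c] = (y_pred[mask] == c).mean()
    return acc

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 0])

print((y_true == y_pred).mean())                       # 0.625
print(per_class_accuracy(y_true, y_pred, 3).tolist())  # [1.0, 0.5, 0.0]
```

With the trained model, the predictions could be obtained with `model.predict(x_test).argmax(axis=1)` and passed in together with `y_test` and `num_class`.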


## Questions

- Why is validation accuracy greater than training accuracy after adding dropout?
- Why does validation loss not decrease while validation accuracy increases?

## Notes

**Momentum**

Momentum accelerates gradient descent in the relevant direction and dampens oscillations. It speeds up movement along directions of consistent improvement (loss decrease) and can also help the network avoid getting stuck in local minima.

**Nesterov momentum**

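Nesterov momentum is a small change to standard momentum. The gradient term is not computed at the current position but at an intermediate (look-ahead) position, reached by first applying the momentum step. While the gradient usually points in the right direction, the momentum term may not, so evaluating the gradient at the intermediate position lets it redirect the momentum.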
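The difference between the two update rules can be sketched on a toy problem. The code below uses one common formulation of each rule; the loss function, the starting point, and the names `grad`, `momentum_step`, `nesterov_step`, and `run` are all hypothetical choices for illustration, and Keras implements an equivalent but differently arranged update.

```python
import numpy as np

def loss(w):
    """Toy quadratic loss with a steep and a shallow direction."""
    return 0.5 * (w[0] ** 2 + 10 * w[1] ** 2)

def grad(w):
    """Gradient of the toy loss."""
    return np.array([1.0, 10.0]) * w

def momentum_step(w, v, lr, mu):
    # Standard momentum: gradient evaluated at the current position w
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, lr, mu):
    # Nesterov momentum: gradient evaluated at the look-ahead position w + mu * v
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v

def run(step_fn, steps=50, lr=0.01, mu=0.9):
    """Apply `steps` updates from a fixed start and return the final weights."""
    w, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        w, v = step_fn(w, v, lr, mu)
    return w

print("start:   ", loss(np.array([1.0, 1.0])))
print("momentum:", loss(run(momentum_step)))
print("nesterov:", loss(run(nesterov_step)))
```

The only difference between the two functions is where `grad` is evaluated, which is the "intermediate position" described above.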

## References 

1. https://dominikschmidt.xyz/nesterov-momentum/