# VGGNet
***

The VGG network pursued the idea of a deeper network with much smaller filters in each convolutional layer. VGGNet models had 16 to 19 layers. Every convolutional layer used 3x3 filters, with periodic pooling throughout the network. It is a simple and elegant architecture, and it achieved a 7.3% top-5 error on the ImageNet dataset.

One of the team's biggest breakthroughs was finding that stacking several small filters gives the same effective receptive field as a single larger filter. Three stacked convolutional layers, each with 3x3 filters, have the same receptive field as one convolutional layer with a 7x7 filter. The advantages are that the stack needs fewer parameters in total (27C² weights versus 49C² for C input and output channels) and that it contains more non-linearities.
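The receptive field and parameter claims can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels in every layer and ignores biases; the helper names and the channel count `C = 64` are chosen only for illustration.

```python
def stacked_receptive_field(num_layers, kernel_size):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1  # each layer widens the field by (k - 1)
    return rf


def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count (biases ignored) when every layer maps `channels` -> `channels`."""
    return num_layers * kernel_size * kernel_size * channels * channels


C = 64  # illustrative channel count, not taken from the original VGG paper

rf_stack = stacked_receptive_field(num_layers=3, kernel_size=3)
rf_single = stacked_receptive_field(num_layers=1, kernel_size=7)
stack_params = conv_weights(3, C, num_layers=3)   # 27 * C^2
single_params = conv_weights(7, C, num_layers=1)  # 49 * C^2

print("receptive field:", rf_stack, "vs", rf_single)
print("weights:", stack_params, "vs", single_params)
```

Under these assumptions both options see a 7x7 window, while the stack of 3x3 layers uses a little over half the weights of the single 7x7 layer.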

Pooling layers after each block of convolutional layers spatially downsample their input, which makes the representations smaller and more manageable. The original VGGNet also has 3 fully connected layers, which come after the convolutional layers have extracted the relevant features.
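To make the downsampling concrete, the sketch below tracks the spatial size of a feature map through successive pooling layers. It assumes a 32x32 input, five 2x2 pooling layers with stride 2, and convolutions that preserve the spatial size; `pooled_sizes` is a hypothetical helper written for this illustration.

```python
def pooled_sizes(input_size, num_pools, pool=2):
    """Spatial size of the feature map after each pooling layer."""
    sizes = [input_size]
    for _ in range(num_pools):
        sizes.append(sizes[-1] // pool)  # 2x2 pooling with stride 2 halves each side
    return sizes


sizes = pooled_sizes(input_size=32, num_pools=5)
print(sizes)  # [32, 16, 8, 4, 2, 1]
```

With an input this small, five pooling stages leave a 1x1 feature map, so very few values reach the fully connected layers.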

However, the structure of the original VGGNet was slightly modified here because the input images are smaller. Only two fully connected layers were kept: one with 512 hidden nodes and a final one with 10 nodes for the classes. This reduces the number of parameters in the whole network. Dropout was also added; it reduces overfitting by preventing complex co-adaptations on the training data. It randomly drops a percentage of the units in a layer during training.
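As a rough illustration of what a dropout layer does during training, here is a minimal NumPy sketch of "inverted" dropout, where the surviving units are rescaled so the expected activation stays the same. The function name, the toy activations and the seed are assumptions made for this example; it is not the Keras implementation itself.

```python
import numpy as np


def dropout(x, rate, rng, training=True):
    """Zero a random fraction `rate` of units and rescale the rest (inverted dropout)."""
    if not training or rate == 0.0:
        return x  # dropout is a no-op at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob  # True for the units that survive
    return x * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones((1000, 512), dtype=np.float32)  # toy activations
dropped = dropout(activations, rate=0.5, rng=rng)

print("fraction of units zeroed:", np.mean(dropped == 0))
print("mean activation after dropout:", dropped.mean())
```

Roughly half of the units are zeroed, while the mean activation stays close to its original value because of the rescaling.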

Batch normalization was added as well. With batch norm, the earlier layers cannot shift their hidden unit values as much, because those values are constrained to have the same mean and variance. This makes learning easier for the later layers. Similar to dropout, it also adds some noise to each hidden layer's activations, even though this is an unintended side effect. With all these changes to the VGGNet, the model was reduced to ~15M parameters, making it faster to train.
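The "same mean and variance" constraint can be seen in a small NumPy sketch of the batch norm forward pass at training time. It assumes a 2D batch of dense-layer activations (for convolutional feature maps the statistics would be taken per channel); the function name, `eps` value and toy data are illustrative, not the Keras internals.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch, then apply a learned scale and shift."""
    mean = x.mean(axis=0)  # per-feature mean over the mini-batch
    var = x.var(axis=0)    # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
# Toy activations with a large offset and spread: 128 samples, 512 features
x = rng.normal(loc=5.0, scale=3.0, size=(128, 512))
out = batch_norm_train(x, gamma=np.ones(512), beta=np.zeros(512))

print("input mean/std: ", x.mean(), x.std())
print("output mean/std:", out.mean(), out.std())
```

Whatever the earlier layers produce, the next layer sees features with a mean near 0 and a standard deviation near 1 (before the learned `gamma` and `beta` are applied). Because the statistics come from each mini-batch, they fluctuate slightly from batch to batch, which is the source of the noise mentioned above.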

The dataset used was [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html), which consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The classes are completely mutually exclusive.

In [10]:
import numpy as np
import matplotlib.pyplot as plt
import keras
from keras.models import Model, Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.layers import Dropout, BatchNormalization
from keras.preprocessing.image import ImageDataGenerator
from keras import regularizers, optimizers
from keras.datasets import cifar10

When training a neural network, we want to normalize or standardize our data ahead of time as part of the preprocessing step. Both techniques put the data on the same scale. In this case, we will standardize the data, which means it will have a mean of zero and a standard deviation of 1. This brings the pixel values close to each other around zero, instead of spreading them over a scale from 0 to 255. Relatively large inputs can cascade through the layers of the network and produce unbalanced gradients, which may in turn cause the [exploding gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem). Additionally, non-normalized data can significantly decrease the training speed of the model.

In [2]:
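def normalize(X_train, X_test):
    # Global mean and std over all training images, pixels and channels
    mean = np.mean(X_train, axis=(0, 1, 2, 3))
    std = np.std(X_train, axis=(0, 1, 2, 3))
    # Use the training statistics for both sets; 1e-7 guards against division by zero
    X_train = (X_train - mean) / (std + 1e-7)
    X_test = (X_test - mean) / (std + 1e-7)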
    return X_train, X_test
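As a quick sanity check of this kind of standardization, the sketch below applies the same computation to random stand-in data rather than the real CIFAR-10 images. The array sizes and the seed are arbitrary assumptions, and the function is repeated only so the snippet runs on its own.

```python
import numpy as np


def normalize(X_train, X_test):
    # Same idea as above: one global mean/std computed on the training set only
    mean = np.mean(X_train)
    std = np.std(X_train)
    return (X_train - mean) / (std + 1e-7), (X_test - mean) / (std + 1e-7)


rng = np.random.default_rng(0)
# Random stand-ins for image batches with pixel values in [0, 255]
fake_train = rng.integers(0, 256, size=(100, 32, 32, 3)).astype('float32')
fake_test = rng.integers(0, 256, size=(20, 32, 32, 3)).astype('float32')

train_z, test_z = normalize(fake_train, fake_test)
print("train mean/std:", train_z.mean(), train_z.std())
print("test mean/std: ", test_z.mean(), test_z.std())
```

The training set ends up with a mean near 0 and a standard deviation near 1. The test set is only approximately standardized, because it reuses the training statistics instead of its own, which avoids leaking test information into preprocessing.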

In [3]:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

x_train, x_test = normalize(x_train, x_test)

y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

In [4]:
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

x_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples


In [5]:
#data augmentation
train_datagen = ImageDataGenerator(
            rotation_range=15,  # randomly rotate images in the range (degrees, 0 to 180)
            width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
            height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
            horizontal_flip=True)  # randomly flip images

train_datagen.fit(x_train)

It is helpful to reduce the learning rate as training progresses. Here we use step decay: the learning rate is halved every `lr_drop` epochs (20 in the configuration below).

In [6]:
def lr_scheduler(epoch):
    return learning_rate * (0.5 ** (epoch // lr_drop))

In [47]:
input_shape = (32,32,3)
classes = 10
weight_decay = 0.0005
learning_rate = 0.1
lr_decay = 1e-6
lr_drop = 20
batch_size = 128 
epochs = 250
reduce_lr = keras.callbacks.LearningRateScheduler(lr_scheduler)
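To see what this schedule produces, the snippet below evaluates the same formula at a few epochs, using the `learning_rate = 0.1` and `lr_drop = 20` values set above. It is a standalone sketch that does not need Keras; the chosen epochs are arbitrary.

```python
learning_rate = 0.1  # initial learning rate, as configured above
lr_drop = 20         # halve the learning rate every 20 epochs


def lr_scheduler(epoch):
    return learning_rate * (0.5 ** (epoch // lr_drop))


for epoch in (0, 19, 20, 40, 100, 249):
    print(f"epoch {epoch:3d}: lr = {lr_scheduler(epoch):.7f}")
```

The rate stays at 0.1 for the first 20 epochs, drops to 0.05 at epoch 20 and to 0.025 at epoch 40, and by the last of the 250 epochs it has been halved 12 times.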

In [48]:
model = Sequential()

# Block 1
model.add(Conv2D(64, (3,3),input_shape=input_shape,activation='relu',padding='same',kernel_regularizer=regularizers.l2(weight_decay),name='block1_conv1'))
model.add(BatchNormalization())
model.add(Dropout(0.3))
model.add(Conv2D(64, (3,3),activation='relu',padding='same',name='block1_conv2'))
model.add(BatchNormalization())
model.add(MaxPooling2D((2,2),name='block1_pool'))

# Block 2
model.add(Conv2D(128,(3,3),activation='relu',padding='same',kernel_regularizer=regularizers.l2(weight_decay),name='block2_conv1'))
model.add(BatchNormalization())
model.add(Dropout(0.4))
model.add(Conv2D(128,(3,3),activation='relu',padding='same',kernel_regularizer=regularizers.l2(weight_decay),name='block2_conv2'))
model.add(BatchNormalization())
model.add(MaxPooling2D((2,2),name='block2_pool'))

# Block 3
model.add(Conv2D(256,(3,3),activation='relu',padding='same',kernel_regularizer=regularizers.l2(weight_decay),name='block3_conv1'))
model.add(BatchNormalization())
model.add(Dropout(0.4))
model.add(Conv2D(256,(3,3),activation='relu',padding='same',kernel_regularizer=regularizers.l2(weight_decay),name='block3_conv2'))
model.add(BatchNormalization())
model.add(Dropout(0.4))
model.add(Conv2D(256,(3,3),activation='relu',padding='same',kernel_regularizer=regularizers.l2(weight_decay),name='block3_conv3'))
model.add(BatchNormalization())
model.add(MaxPooling2D((2,2),name='block3_pool'))

# Block 4
model.add(Conv2D(512,(3,3),activation='relu',padding='same',kernel_regularizer=regularizers.l2(weight_decay),name='block4_conv1'))
model.add(BatchNormalization())
model.add(Dropout(0.4))
model.add(Conv2D(512,(3,3),activation='relu',padding='same',kernel_regularizer=regularizers.l2(weight_decay),name='block4_conv2'))
model.add(BatchNormalization())
model.add(Dropout(0.4))
model.add(Conv2D(512,(3,3),activation='relu',padding='same',kernel_regularizer=regularizers.l2(weight_decay),name='block4_conv3'))
model.add(BatchNormalization())
model.add(MaxPooling2D((2,2),name='block4_pool'))

# Block 5
model.add(Conv2D(512,(3,3),activation='relu',padding='same',kernel_regularizer=regularizers.l2(weight_decay),name='block5_conv1'))
model.add(BatchNormalization())
model.add(Dropout(0.4))
model.add(Conv2D(512,(3,3),activation='relu',padding='same',kernel_regularizer=regularizers.l2(weight_decay),name='block5_conv2'))
model.add(BatchNormalization())
model.add(Dropout(0.4))
model.add(Conv2D(512,(3,3),activation='relu',padding='same',kernel_regularizer=regularizers.l2(weight_decay),name='block5_conv3'))
model.add(BatchNormalization())

model.add(MaxPooling2D((2,2),name='block5_pool'))

model.add(Dropout(0.5))

model.add(Flatten(name='flatten'))
model.add(Dense(units=512,activation='relu',kernel_regularizer=regularizers.l2(weight_decay),name='fc1')) # single hidden fully connected layer
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(units=classes,activation='softmax',name='predictions'))

sgd = optimizers.SGD(lr=learning_rate, decay=lr_decay, momentum=0.9, nesterov=True)
model.compile(sgd,loss='categorical_crossentropy',metrics=['accuracy'])

In [49]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
block1_conv1 (Conv2D)        (None, 32, 32, 64)        1792      
_________________________________________________________________
batch_normalization_31 (Batc (None, 32, 32, 64)        256       
_________________________________________________________________
dropout_32 (Dropout)         (None, 32, 32, 64)        0         
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 32, 32, 64)        36928     
_________________________________________________________________
batch_normalization_32 (Batc (None, 32, 32, 64)        256       
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 16, 16, 64)        0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 16, 16, 128)       73856     
__________
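The summary output above is cut off, so the total parameter count is not visible. It can be recomputed by hand from the layer sizes in the model definition, as in the sketch below. The helper names are hypothetical; the block layout (channels and number of 3x3 convolutions per block) is read from the code above, and each batch norm layer is counted with 4 values per channel (scale, shift, moving mean, moving variance).

```python
def conv_params(in_ch, out_ch, k=3):
    return k * k * in_ch * out_ch + out_ch  # weights + biases


def dense_params(in_units, out_units):
    return in_units * out_units + out_units  # weights + biases


def bn_params(channels):
    return 4 * channels  # gamma, beta, moving mean, moving variance


# (output channels, number of 3x3 conv layers) for blocks 1 to 5
blocks = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]

total = 0
in_ch = 3  # RGB input
for out_ch, n_convs in blocks:
    for _ in range(n_convs):
        total += conv_params(in_ch, out_ch) + bn_params(out_ch)
        in_ch = out_ch

flat_units = 1 * 1 * 512  # 32x32 input halved by five pooling layers
total += dense_params(flat_units, 512) + bn_params(512)  # fc1 + its batch norm
total += dense_params(512, 10)                            # predictions

print(f"total parameters: {total:,}")
```

The result is roughly 15.0M parameters, consistent with the ~15M figure quoted earlier, and the per-layer values match the rows that are visible in the summary (for example 1792 for `block1_conv1` and 73856 for `block2_conv1`).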

In [62]:
# fits the model on batches with real-time data augmentation:
history = model.fit_generator(train_datagen.flow(x_train, y_train, batch_size=batch_size),
                    steps_per_epoch=len(x_train) // batch_size, epochs=epochs,
                    validation_data=(x_test, y_test),callbacks=[reduce_lr])

In [28]:
model.save_weights('cifar10vgg.h5')

Final results:

loss: 0.2916 - acc: 0.9734 - val_loss: 0.5480 - val_acc: 0.9122

The model started converging at around 175 epochs.  

In [55]:
model.load_weights('cifar10vgg.h5')

In [54]:
# # summarize history for accuracy
# plt.plot(history.history['acc'])
# plt.plot(history.history['val_acc'])
# plt.title('model accuracy')
# plt.ylabel('accuracy')
# plt.xlabel('epoch')
# plt.legend(['train', 'test'], loc='upper left')
# plt.show()

# # summarize history for loss
# plt.plot(history.history['loss'])
# plt.plot(history.history['val_loss'])
# plt.title('model loss')
# plt.ylabel('loss')
# plt.xlabel('epoch')
# plt.legend(['train', 'test'], loc='upper left')
# plt.show()

The coremltools framework makes it easy to convert our Keras model into a Core ML model so that it can be integrated into an iOS app. This allows adding the model to an iPhone and doing offline classification.

In [41]:
import coremltools



In [45]:
model_name = 'CIFAR10.mlmodel'
# CIFAR-10 class names in label-index order (stray whitespace removed from the labels)
class_labels = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
coreml_model = coremltools.converters.keras.convert(model, input_names=['image'], image_input_names='image', class_labels=class_labels)
coreml_model.save(model_name)
print('Saved trained model')

0 : block1_conv1_input, <keras.engine.topology.InputLayer object at 0x1115b1550>
1 : block1_conv1, <keras.layers.convolutional.Conv2D object at 0x1115b1748>
2 : block1_conv1__activation__, <keras.layers.core.Activation object at 0x181c56ff60>
3 : batch_normalization_29, <keras.layers.normalization.BatchNormalization object at 0x1115b1fd0>
4 : block1_pool, <keras.layers.pooling.MaxPooling2D object at 0x11181fef0>
5 : block5_pool, <keras.layers.pooling.MaxPooling2D object at 0x181a0616a0>
6 : flatten, <keras.layers.core.Flatten object at 0x181a30cc50>
7 : fc1, <keras.layers.core.Dense object at 0x181a30c630>
8 : fc1__activation__, <keras.layers.core.Activation object at 0x181c56ff98>
9 : batch_normalization_30, <keras.layers.normalization.BatchNormalization object at 0x181a30ca58>
10 : predictions, <keras.layers.core.Dense object at 0x181a2cdd30>
11 : predictions__activation__, <keras.layers.core.Activation object at 0x181c56ffd0>
Saved trained model
