## AlexNet Implementation
In this Jupyter notebook I implement the AlexNet architecture as it was proposed in the paper:

Krizhevsky et al., 2012. ImageNet classification with deep convolutional neural networks
https://proceedings.neurips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

The model proposed in the paper was built as a solution to the ImageNet image classification problem for the year 2010 and outperformed the other competitors on both top-1 and top-5 error. Its complete end-to-end deep learning approach, its high accuracy and its use of two GPUs to reduce training time were novel at the time and drew a lot of attention to deep learning for computer vision tasks, making it a pioneering work. The idea proposed in the paper, a series of blocks consisting of convolutional layers followed by max-pooling layers, with fully connected layers after these blocks, is still used very widely today, with different hyperparameters as required by the problem at hand.

In this implementation I replicate the model architecture, train it on a dataset, and observe the training and validation loss and accuracy. The original dataset was the ImageNet subset provided by the competition, which consisted of roughly 1.2 million training images divided into 1000 classes. Most experiments in the paper use the 2010 version of the dataset because its test labels are available; results on the 2012 version are reported as well. As access to this dataset is restricted, I instead use a flowers dataset that is freely available on Kaggle.

I import TensorFlow, the Keras classes and NumPy, which are needed to implement the architecture.

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D,MaxPooling2D,Flatten,Dense
from tensorflow.keras.optimizers.legacy import SGD
import numpy as np

Now I create a path variable that references the dataset in the Kaggle notebook environment.

In [None]:
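import pathlib

#Absolute path, matching the /kaggle/input paths used for the train and val folders below
data_dir = pathlib.Path("/kaggle/input/d/yingyingkan/alexnet/flower_data/flower_photos")
#Class names are the entries of the folder; sorted so the label order is reproducible
class_names = np.array(sorted(item.name for item in data_dir.glob('*') if item.name != "LICENSE.txt"))
print(class_names)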

The paper uses a batch size of 128; since this dataset is much smaller, a batch size of 32 is sufficient here. In the paper the ImageNet images are also downsampled to 256x256. The network is trained on random 224x224 crops, and at test time predictions are averaged over five 224x224 crops (upper left, upper right, lower left, lower right and center) and their horizontal reflections. I do not need this, as the goal here is not a competition but to check that the model works, so I simply resize the flower images to 227x227 and classify them. I use 227x227 rather than 224x224 because, with unpadded ('valid') convolutions, a 227x227 input reproduces the convolution and max-pool output sizes described in the paper (55x55 after the first convolution), whereas a 224x224 input gives slightly smaller outputs (54x54 after the first convolution) because the partial window at the border is dropped.
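
The output sizes behind the 227x227 choice can be checked by hand. The sketch below is plain Python rather than TensorFlow and assumes the usual Keras rounding rules ('valid' drops partial windows, 'same' only shrinks by the stride). The names `conv_out`, `LAYERS` and `trace_sizes` are hypothetical helpers introduced for illustration; the kernel sizes and strides are the ones used in this notebook's model.

```python
def conv_out(size, kernel, stride=1, padding="valid"):
    """Spatial output size of a convolution or pooling layer."""
    if padding == "same":
        # 'same' pads the input, so only the stride shrinks the map: ceil(size / stride)
        return -(-size // stride)
    # 'valid' uses no padding and drops any partial window at the border
    return (size - kernel) // stride + 1


# (name, kernel size, stride, padding) for the conv and pooling layers in this notebook
LAYERS = [
    ("conv1", 11, 4, "valid"),
    ("pool1", 3, 2, "valid"),
    ("conv2", 5, 1, "same"),
    ("pool2", 3, 2, "valid"),
    ("conv3", 3, 1, "same"),
    ("conv4", 3, 1, "same"),
    ("conv5", 3, 1, "same"),
    ("pool3", 3, 2, "valid"),
]


def trace_sizes(input_size):
    """Return the feature-map width after each layer for a square input."""
    sizes = {}
    size = input_size
    for name, kernel, stride, padding in LAYERS:
        size = conv_out(size, kernel, stride, padding)
        sizes[name] = size
    return sizes


for input_size in (227, 224):
    print(input_size, trace_sizes(input_size))
```

With a 227x227 input the trace ends at 6x6 after the last pooling layer, while a 224x224 input ends at 5x5, so the flattened vector would no longer have 9216 entries.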

In [None]:
batch_size = 32             
img_height = 227            
img_width = 227             

image_generator = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255)

train_data_gen = image_generator.flow_from_directory(directory=str("/kaggle/input/d/yingyingkan/alexnet/flower_data/train"),
                                                     batch_size=batch_size,
                                                     shuffle=True,
                                                     target_size=(img_height, img_width),
                                                     classes = list(class_names))
valid_data_gen = image_generator.flow_from_directory(directory=str("/kaggle/input/d/yingyingkan/alexnet/flower_data/val"),
                                                     batch_size=batch_size,
                                                     shuffle=True,
                                                     target_size=(img_height, img_width),
                                                     classes = list(class_names))
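
Although this notebook does not use them, the corner and center crops mentioned above are simple to compute. The following is a minimal sketch in plain Python that only returns the crop coordinates; the function name `five_crop_boxes` and the `(top, left, bottom, right)` box layout are assumptions made for illustration, not something taken from the paper's code.

```python
def five_crop_boxes(image_size=256, crop_size=224):
    """Return (top, left, bottom, right) boxes for four corner crops and a center crop."""
    far = image_size - crop_size         # offset of crops touching the bottom/right edge
    mid = (image_size - crop_size) // 2  # offset of the centered crop
    offsets = {
        "upper_left": (0, 0),
        "upper_right": (0, far),
        "lower_left": (far, 0),
        "lower_right": (far, far),
        "center": (mid, mid),
    }
    return {
        name: (top, left, top + crop_size, left + crop_size)
        for name, (top, left) in offsets.items()
    }


for name, box in five_crop_boxes().items():
    print(name, box)
```

For an image stored as a NumPy array of shape (256, 256, 3), a crop would then be `img[top:bottom, left:right]`.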

### Model Architecture: 

The architecture described in the paper contains eight layers with weights: the first five are convolutional and the remaining three are fully connected. While the paper uses a 1000-class softmax output, I need only 5 classes for this dataset. The paper's description of the connectivity between layers is more complex because the network is split across two GPUs; here a more straightforward single-device implementation is enough.

I input a 227x227x3 RGB image and pass it through two sets of convolutional, batch normalization and max-pooling layers, then through three sets of convolutional and batch normalization layers, and finally through one more max-pooling layer. I flatten this output and pass it through two fully connected layers and finally through a softmax output layer. The kernel sizes, strides and other hyperparameters follow the paper and can be seen in the code; the one substitution is that batch normalization is used here, whereas the paper applies local response normalization after the first two convolutional layers. The expected output shape of each layer is given in a comment below the corresponding line of code. I also apply a dropout of 0.5 to the fully connected layers, as described in the paper.

This paper was also important for its justification of the ReLU (Rectified Linear Unit) activation function, which it applies after every convolutional and fully connected layer. Because ReLU does not saturate, the paper reports that networks using it train several times faster with gradient descent than equivalent networks using the tanh function.
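
One way to see why ReLU helps gradient descent is to compare the derivatives of the two activations. The sketch below is a small plain-Python illustration, not part of the paper; the helper names `relu_grad` and `tanh_grad` are hypothetical.

```python
import math


def relu_grad(x):
    """Derivative of max(0, x): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh(x), which shrinks towards 0 as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


for x in (0.5, 2.0, 5.0):
    print(f"x={x}: relu'={relu_grad(x)}, tanh'={tanh_grad(x):.6f}")
```

For large positive inputs the tanh derivative becomes very small, so the gradient passed back through that unit nearly vanishes, while the ReLU derivative stays at 1.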

In [None]:
#input image of size 227x227x3 (3 for RGB)

model = tf.keras.Sequential([
    
    tf.keras.layers.Conv2D(96,11,strides=4,padding='valid',activation='relu', input_shape=(227, 227, 3)),
    #output is of size: 55x55x96
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D(3,strides=2,padding='valid'),
    #output is of size: 27x27x96

    tf.keras.layers.Conv2D(256,5,strides=1,padding='same',activation='relu'),
    #output is of size: 27x27x256
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D(3,strides=2,padding='valid'),
    #output is of size: 13x13x256


    tf.keras.layers.Conv2D(384,3,strides=1,padding='same',activation='relu'),
    #output is of size: 13x13x384
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv2D(384,3,strides=1,padding='same',activation='relu'),
    #output is of size: 13x13x384
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv2D(256,3,strides=1,padding='same',activation='relu'),
    #output is of size: 13x13x256
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D(3,strides=2,padding='valid'),
    #output is of size: 6x6x256

    tf.keras.layers.Flatten(),
    #output is of shape(None,9216)

    tf.keras.layers.Dense(4096,activation='relu'),
    tf.keras.layers.Dropout(0.5),
    #4096 neurons
    tf.keras.layers.Dense(4096,activation='relu'),
    tf.keras.layers.Dropout(0.5),
    #4096 neurons
    tf.keras.layers.Dense(5, activation='softmax')
    #5 class probabilities as output
])

The paper uses stochastic gradient descent with an initial learning rate of 0.01, which is divided by 10 whenever the validation error stops improving. I use plain SGD with a smaller, fixed learning rate of 0.001. The paper also specifies how the weights are initialized, along with momentum and weight decay. I do not implement these here because they add complexity to the code without improving performance on the small dataset I use, which does not take long to train. The code for these hyperparameters is provided at the bottom of this notebook in case it is needed in other implementations.
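
For reference, the momentum and weight-decay update given in the paper can be written out for a single weight. The sketch below is plain Python and is not the Keras optimizer; the function name `sgd_step` and the toy numbers are hypothetical, and the default constants mirror the optimizer cell at the bottom of this notebook.

```python
def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update of a scalar weight w with velocity v and loss gradient grad.

    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    """
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    return w_next, v_next


# Toy example: start at w = 1.0 with zero velocity and a constant gradient of 2.0
w, v = 1.0, 0.0
for step in range(3):
    w, v = sgd_step(w, v, grad=2.0)
    print(f"step {step}: w={w:.6f}, v={v:.6f}")
```

The weight-decay term pulls each weight slightly towards zero on every step, in addition to the step taken along the negative gradient.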

In [None]:
#build/compile model -- optimizer etc.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

Here we can compare the output shapes with the expected shapes in the comments of the model architecture code. As an experiment, if you change the input size to 224x224 and rerun the same cells, the output of this cell no longer matches the shapes in the comments, which is why I use 227x227.

In [None]:
model.summary()
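
The parameter counts in the summary can also be checked by hand. The sketch below is plain Python and assumes the layer sizes from the model cell above, with batch normalization counted as four values per channel; the helper names are hypothetical.

```python
def conv_params(kernel, in_channels, filters):
    """Weights (kernel x kernel x in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(in_units, out_units):
    """One weight per input-output pair plus one bias per output unit."""
    return in_units * out_units + out_units


def batchnorm_params(channels):
    """gamma and beta (trainable) plus moving mean and variance (non-trainable)."""
    return 4 * channels


counts = {
    "conv1": conv_params(11, 3, 96),
    "bn1": batchnorm_params(96),
    "conv2": conv_params(5, 96, 256),
    "bn2": batchnorm_params(256),
    "conv3": conv_params(3, 256, 384),
    "bn3": batchnorm_params(384),
    "conv4": conv_params(3, 384, 384),
    "bn4": batchnorm_params(384),
    "conv5": conv_params(3, 384, 256),
    "bn5": batchnorm_params(256),
    "dense1": dense_params(6 * 6 * 256, 4096),
    "dense2": dense_params(4096, 4096),
    "output": dense_params(4096, 5),
}

for name, n in counts.items():
    print(f"{name:>7}: {n:,}")
print(f"  total: {sum(counts.values()):,}")
```

Most of the parameters sit in the two 4096-unit fully connected layers, which is worth keeping in mind when the model is trained on a small dataset.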

I train the model and at each epoch I also test the model on the validation set and determine performance.

In [None]:
history=model.fit(train_data_gen,epochs=50,validation_data=valid_data_gen,validation_freq=1)

We observe that the final training accuracy is extremely close to 1, which means AlexNet has learnt to fit the training set well, but the validation accuracy remains stagnant at around 0.7, which means it is not getting better at classifying unseen data. This is a symptom of overfitting, so next I plot the loss and accuracy over time to confirm.

In [None]:
#graph
import matplotlib.pyplot as plt
f,ax=plt.subplots(2,1,figsize=(10,10)) 

#Assigning the first subplot to graph training loss and validation loss
ax[0].plot(history.history['loss'],color='b',label='Training Loss')
ax[0].plot(history.history['val_loss'],color='r',label='Validation Loss')
ax[0].legend()  #a single plt.legend() would only label the last subplot

#Plotting the training accuracy and validation accuracy
ax[1].plot(history.history['accuracy'],color='b',label='Training Accuracy')
ax[1].plot(history.history['val_accuracy'],color='r',label='Validation Accuracy')
ax[1].legend()

plt.show()

We can see that validation accuracy stays within a narrow range while training accuracy keeps increasing and approaches an asymptote near 1. The losses also show that training loss decreases every epoch, whereas validation loss stops decreasing beyond a certain point and actually gets a little worse. This means the model has overfit. This happens when a model tunes its weights very precisely to maximize performance on the training set (seen data); such hyper-specific optimization generalizes worse, which leads to worse performance on unseen data.

AlexNet, being a very deep model, tends to learn the training set too closely. Methods to reduce overfitting include using a simpler model, getting more training data, using other regularization techniques and early stopping. However, fixing overfitting is beyond the scope of this notebook, as the goal is only to show how to build and train AlexNet, which has been done.
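
As an illustration of the last of these methods, a patience-based early-stopping rule can be sketched in a few lines. This is a plain-Python sketch with hypothetical names and made-up loss values, not a replacement for a framework callback; Keras provides `tf.keras.callbacks.EarlyStopping`, which can be passed to `model.fit` through the `callbacks` argument.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops at the first epoch where the validation loss has failed to
    improve on its best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best validation loss: remember it and reset the counter
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Patience never ran out, so training ran through the last epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses that improve and then drift upwards
losses = [1.2, 0.9, 0.8, 0.85, 0.9, 0.95, 1.0]
print(early_stopping_epoch(losses, patience=3))
```

The weights from `best_epoch` are the ones to keep, since later epochs only improved the fit to the training set.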

The complete architecture, including the initialization parameters, is:

In [None]:
#input image of size 227x227x3 (3 for RGB)

model = tf.keras.Sequential([
    
    tf.keras.layers.Conv2D(96,11,strides=4,padding='valid',activation='relu', input_shape=(227, 227, 3),
                           kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01)),
    #output is of size: 55x55x96
    
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D(3,strides=2,padding='valid'),
    #output is of size: 27x27x96

    tf.keras.layers.Conv2D(256,5,strides=1,padding='same',activation='relu',
                           kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01),
                           bias_initializer='ones'), 
    #output is of size: 27x27x256
    
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D(3,strides=2,padding='valid'),
    #output is of size: 13x13x256


    tf.keras.layers.Conv2D(384,3,strides=1,padding='same',activation='relu', 
                           kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01)),
    #output is of size: 13x13x384
    
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv2D(384,3,strides=1,padding='same',activation='relu', 
                           kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01),
                           bias_initializer='ones'),
    #output is of size: 13x13x384
    
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv2D(256,3,strides=1,padding='same',activation='relu', 
                           kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01),
                           bias_initializer='ones'),  
    #output is of size: 13x13x256
    
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D(3,strides=2,padding='valid'),
    #output is of size: 6x6x256

    tf.keras.layers.Flatten(),
    #output is of shape(None,9216)

    tf.keras.layers.Dense(4096,activation='relu',kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01),
                          bias_initializer='ones'),
    tf.keras.layers.Dropout(0.5),
    #4096 neurons
    tf.keras.layers.Dense(4096,activation='relu',kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01),
                          bias_initializer='ones'),
    tf.keras.layers.Dropout(0.5),
    #4096 neurons
    tf.keras.layers.Dense(5, activation='softmax',kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01))
    #5 class probabilities as output
])

In [None]:
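#Note: in the legacy Keras optimizers, `decay` is a time-based learning-rate decay, not the L2 weight decay described in the paper
optimizer = SGD(learning_rate=0.01, momentum=0.9, decay=0.0005)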

model.compile(optimizer=optimizer,loss='categorical_crossentropy',metrics=['accuracy'])