# AlexNet

## Dataset

1. ImageNet contains over 15 million labeled high-resolution images belonging to roughly 22,000 categories.
2. On ImageNet, it is customary to report two error rates:
    top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.
3. The images are down-sampled to a fixed resolution of 256 × 256.
4. The mean activity over the training set is subtracted from each pixel, and the network is trained on the (centered) raw RGB values of the pixels.
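
The top-k error rate and the centering step described above can be sketched in NumPy. This is a minimal illustration, not the authors' code: the function names `top_k_error` and `center_pixels` are hypothetical, and reading "mean activity" as a per-pixel mean image over the training set is an assumption.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k highest scores."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores per row (order within the k is irrelevant).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

def center_pixels(train_images, images):
    """Subtract the training-set mean image (assumed meaning of "mean activity")."""
    mean_image = train_images.mean(axis=0)  # shape: (H, W, 3)
    return images - mean_image
```

With `k=1` the same function gives the top-1 error rate, since only the single most probable label is checked.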

## The Architecture

* It contains eight learned layers — five convolutional and three fully-connected

#### ReLU

* In terms of training time with gradient descent, saturating nonlinearities such as tanh and sigmoid are much slower than the non-saturating nonlinearity f(x) = max(0, x) (ReLU).

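* ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. The authors still apply a local response normalization scheme, which they describe as a form of "brightness normalization", after the first and second convolutional layers because it aids generalization.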
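
A small sketch of why saturation matters for gradient descent: the tanh gradient shrinks toward zero for large inputs, while the ReLU gradient stays at 1 for any positive input. The helper names and the sample inputs are illustrative only.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 1 for positive inputs and 0 otherwise.
    return (x > 0).astype(float)

def tanh_grad(x):
    # Derivative of tanh; approaches 0 as |x| grows (saturation).
    return 1.0 - np.tanh(x) ** 2

x = np.array([-5.0, -1.0, 0.5, 5.0, 50.0])
print("relu      :", relu(x))
print("relu grad :", relu_grad(x))
print("tanh grad :", tanh_grad(x))
```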

#### Overlapping pooling

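* Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map.
* Pooling is overlapping when the stride between pooling units is smaller than the pooling window, so that adjacent windows share inputs.
* The authors generally observe during training that models with overlapping pooling find it slightly more difficult to overfit.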
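
Overlapping pooling can be sketched as a plain max pool whose stride is smaller than its window. The defaults below (window 3, stride 2) are the values reported in the AlexNet paper and are not stated in these notes; `max_pool2d` is a hypothetical helper that handles a single 2-D kernel map.

```python
import numpy as np

def max_pool2d(feature_map, window=3, stride=2):
    """Max-pool one 2-D kernel map; windows overlap when stride < window."""
    h, w = feature_map.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    out = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = feature_map[r:r + window, c:c + window].max()
    return out
```

Calling `max_pool2d(fm, window=2, stride=2)` gives the traditional non-overlapping variant for comparison.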

#### Overall Architecture

* The net contains eight layers with weights; the first five are convolutional and the remaining three are fully-connected.
* The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels

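* Two GPUs are used for parallel processing. Among the convolutional layers, only the third takes input from both GPUs; the other convolutional layers take input only from kernel maps on their own GPU. The fully-connected layers are connected to all neurons in the previous layer.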

* The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU.

* The kernels of the third convolutional layer are connected to all kernel maps in the second layer.

* Response-normalization layers follow the first and second convolutional layers. 

* Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer.

* The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.

* The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers.

![AlexNet architecture](input/alexnet_architecture.png "AlexNet architecture")
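
The response-normalization layers mentioned above can be sketched as follows. Each activation is divided by a term that depends on the squared activations of neighboring kernel maps at the same spatial position. The formula and the constants (k = 2, n = 5, alpha = 1e-4, beta = 0.75) are taken from the AlexNet paper, not from these notes, and the function name and `(channels, height, width)` layout are assumptions.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize activations `a` of shape (channels, height, width) across channels."""
    a = np.asarray(a, dtype=np.float64)
    num_channels = a.shape[0]
    squared = a ** 2
    out = np.empty_like(a)
    half = n // 2
    for i in range(num_channels):
        # Neighboring kernel maps, clipped at the first and last channel.
        lo = max(0, i - half)
        hi = min(num_channels - 1, i + half)
        denom = (k + alpha * squared[lo:hi + 1].sum(axis=0)) ** beta
        out[i] = a[i] / denom
    return out
```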

#### Kernel Sizes

* Input Image = 224 × 224 × 3

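* 1st Conv Layer = 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels
* 2nd Conv Layer = 256 kernels of size 5 × 5 × 48 (depth 48 because the 96 kernel maps of the first layer are split across the 2 GPUs, 48 per GPU)
* 3rd Conv Layer = 384 kernels of size 3 × 3 × 256 (depth 256 because the third layer sees all kernel maps of the second layer, from both GPUs)
* 4th Conv Layer = 384 kernels of size 3 × 3 × 192 (depth 192 because the 384 kernel maps of the third layer are split across the 2 GPUs, as for the second layer)
* 5th Conv Layer = 256 kernels of size 3 × 3 × 192 (depth 192 because the 384 kernel maps of the fourth layer are split across the 2 GPUs, as for the second layer)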
* 1st FC Layer = 4096 neurons
* 2nd FC Layer = 4096 neurons
* 3rd FC Layer = 1000 neurons (output)
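
The kernel depths listed above follow from the GPU connectivity rule: a layer restricted to same-GPU kernel maps sees half of the previous layer's maps, while a layer connected across GPUs sees all of them. The sketch below recomputes the shapes and counts the convolutional weights (biases excluded). The table layout and function names are illustrative.

```python
# (name, number of kernels, spatial size, connected only to same-GPU maps?)
CONV_LAYERS = [
    ("conv1", 96, 11, False),
    ("conv2", 256, 5, True),
    ("conv3", 384, 3, False),
    ("conv4", 384, 3, True),
    ("conv5", 256, 3, True),
]
NUM_GPUS = 2

def kernel_shapes(input_channels=3):
    """Return {layer: (num_kernels, height, width, depth)} for the conv layers."""
    shapes = {}
    prev_maps = input_channels
    for name, num_kernels, size, same_gpu_only in CONV_LAYERS:
        # Same-GPU layers only see the kernel maps held on their own GPU.
        depth = prev_maps // NUM_GPUS if same_gpu_only else prev_maps
        shapes[name] = (num_kernels, size, size, depth)
        prev_maps = num_kernels
    return shapes

def weight_count(shape):
    num_kernels, height, width, depth = shape
    return num_kernels * height * width * depth

for name, shape in kernel_shapes().items():
    print(name, shape, weight_count(shape))
```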

## Reducing Overfitting

#### Data Augmentation

1. Generating image translations and horizontal reflections.

2. Altering the intensities of the RGB channels in training images.

    * Object identity is invariant to changes in the intensity and color of the illumination, so the authors use the following method:
    
    * Perform PCA on the set of RGB pixel values throughout the ImageNet training set.
    * To each training image, we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1.
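
Both augmentations can be sketched in NumPy under a few assumptions: images are `(H, W, 3)` arrays, the crop size of 224 matches the network input listed earlier, and the function names are hypothetical. In the paper the same RGB offset is added to every pixel of an image, which is what `pca_color_jitter` does.

```python
import numpy as np

def random_crop_and_flip(image, crop=224, rng=None):
    """Random translation (crop) plus a horizontal reflection half of the time."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal reflection
    return patch

def fit_rgb_pca(pixels):
    """pixels: (N, 3) array of RGB values sampled from the training set."""
    cov = np.cov(pixels, rowvar=False)      # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)  # columns of eigvecs are components
    return eigvals, eigvecs

def pca_color_jitter(image, eigvals, eigvecs, sigma=0.1, rng=None):
    """Add principal components scaled by eigenvalue times a Gaussian draw."""
    rng = np.random.default_rng() if rng is None else rng
    alphas = rng.normal(0.0, sigma, size=3)  # one draw per component
    shift = eigvecs @ (alphas * eigvals)     # single RGB offset for the image
    return image + shift
```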
    
    
#### Dropout

* Setting to zero the output of each hidden neuron with probability 0.5.

* The neurons which are “dropped out” in this way do not contribute to the forward pass and do not participate in backpropagation.

* Every time an input is presented, the neural network samples a different architecture; it is therefore forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.

* At test time, we use all the neurons but multiply their outputs by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially-many dropout networks.
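
The dropout scheme described above can be sketched as three small functions: a training-time forward pass that zeroes units, a backward pass that blocks gradients for the dropped units, and a test-time pass that keeps all units but scales their outputs. The function names are illustrative.

```python
import numpy as np

def dropout_train(activations, p_drop=0.5, rng=None):
    """Zero each unit with probability p_drop; return the output and the mask."""
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(activations.shape) >= p_drop
    return activations * keep_mask, keep_mask

def dropout_backward(grad_output, keep_mask):
    """Dropped units receive no gradient, so they take no part in backprop."""
    return grad_output * keep_mask

def dropout_test(activations, p_drop=0.5):
    """Use every unit at test time, scaled by the keep probability."""
    return activations * (1.0 - p_drop)
```

Many current frameworks instead rescale the kept units during training ("inverted dropout") so that no scaling is needed at test time; the version above follows the description in these notes.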

## Details of learning

* Stochastic gradient descent
* Batch size = 128
* Momentum = 0.9
* Weight decay = 0.0005. Weight decay here is not merely a regularizer: it reduces the model’s training error.
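* Update rule, where $i$ is the iteration index, $v$ is the momentum variable, $w$ is the weight, $\alpha$ is the learning rate, and the gradient is averaged over the batch:

  $$
  \begin{aligned}
  v_{i+1} &:= 0.9 \cdot v_i - 0.0005 \cdot \alpha \cdot w_i - \alpha \cdot \frac{\partial L}{\partial w} \\
  w_{i+1} &:= w_i + v_{i+1}
  \end{aligned}
  $$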
                
* Initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01.
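
The weight initialization and the update rule above can be sketched as follows. The function names are hypothetical, and the learning rate is left as a required argument because these notes do not state its value.

```python
import numpy as np

def init_weights(shape, std=0.01, rng=None):
    """Draw weights from a zero-mean Gaussian with the given standard deviation."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(0.0, std, size=shape)

def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update: v <- 0.9*v - 0.0005*lr*w - lr*grad, then w <- w + v."""
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    return w_next, v_next
```

A training loop would call `sgd_momentum_step` once per batch of 128 examples, passing the batch-averaged gradient as `grad` and carrying `v` over from one step to the next.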