# AlexNet Summary
## Paper
- ImageNet Classification with Deep Convolutional Neural Networks 
  - Authors: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton (University of Toronto)
  - NIPS 2012

## Abstract
> _"We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry."_

## Summary
### Intro and Overview
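  - this paper is widely credited with starting the deep learning revolution in computer vision
  - the network is commonly called AlexNet, after first author Alex Krizhevsky
  - one of the first to show that a large CNN can be trained efficiently on GPUs with CUDA
  - won ILSVRC-2012 by a large margin (15.3% top-5 error vs. 26.2% for the second-best entry)
  - earlier vision systems relied mostly on hand-engineered features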
  
### The necessity of larger models
  - traditionally
    - small datasets such as CIFAR (32x32 pixel images)
      - can be handled reasonably well by classical computer vision models
    - but for large datasets with higher-resolution images
      - object recognition was much harder
      - ImageNet is one such large dataset
        - over 15 million high-resolution images in roughly 22,000 categories
  - a large model is required
    - to learn the complexity of object recognition
  - the model must also have lots of prior knowledge to compensate for all the data that is not available
  - at the time, CNNs were not popular
    - they were mostly used to recognize handwritten digits

### Why CNNs
  - convolutional operations encode a strong prior 
  - that prior is consistent with how objects appear in images
    - _"capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies)"_
  - one problem they anticipated with such large CNN models was overfitting
  - they also argued that CNNs trained on GPUs give better results
  - several methods were used to prevent overfitting
  - network size is limited by the memory available on the GPUs and by the amount of training time one is willing to tolerate

### ImageNet
  - Data
    - the ILSVRC subset of ImageNet is large
      - 1.2 million training images
      - 50,000 validation images
      - 150,000 testing images
      - images are downsampled to a fixed 256x256 RGB resolution
      - roughly 1000 images in each of the 1000 categories
      
### Model Architecture Overview
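  - contains layers such as 
    - convolution
    - max pooling
    - dense (fully-connected) layers
  - the number of feature maps increases through the network while the spatial resolution decreases
  - split across 2 GPUs, which communicate only at certain layers
    - today's larger GPUs make this kind of split unnecessary for a model of this size
    - frameworks such as TensorFlow did not exist back then, so the authors wrote their own GPU implementation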

<img src="./images/AlexNetModelArchitecture.png" width=500 height=500>

### ReLU Nonlinearities
  - earlier, neurons were mostly modeled with sigmoid or hyperbolic tangent (tanh) functions
    - these are smooth and differentiable everywhere
  - the problem is that they saturate: for large |x| the gradient is close to zero, so almost no learning happens there
    - _"In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity f(x) = max(0, x)."_
  - a network with ReLUs trains much faster
    - about 6 times faster than an equivalent tanh network to reach 25% training error on CIFAR-10
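
The saturation argument can be checked numerically. The following is a minimal sketch (the function names are illustrative and not from the paper) that compares the derivative of tanh with the derivative of ReLU as the input grows:

```python
import math

def tanh_grad(x: float) -> float:
    """Derivative of tanh(x): 1 - tanh(x)^2, which shrinks toward 0 as |x| grows."""
    return 1.0 - math.tanh(x) ** 2

def relu(x: float) -> float:
    """ReLU as defined in the paper: f(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 chosen at x = 0)."""
    return 1.0 if x > 0 else 0.0

for x in (0.5, 2.0, 5.0):
    print(f"x={x}: tanh'={tanh_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

For positive inputs the ReLU gradient stays at 1 regardless of magnitude, while the tanh gradient decays rapidly, which is one way to read the "non-saturating" claim in the quote.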

<img src="./images/AlexNet_ReLU.png" width=200 height=500>

### Multi-GPU training
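  - the model is distributed across 2 GPUs (model parallelism)
  - conceptually similar to later model-sharding schemes such as GShard
  - the GPUs communicate with each other only in certain layers
  - using 2 GPUs reduced top-1 and top-5 error by 1.7% and 1.2% compared with a net with half as many kernels per convolutional layer trained on one GPU
  - this is not how a model of this size would be trained today on larger-memory GPUs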

### Classification Results
  - much better than previous models: top-1 and top-5 test error rates of 37.5% and 17.0% on ILSVRC-2010
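
The top-1 and top-5 error rates quoted above count an example as correct when its true label is among the 1 or 5 highest-scoring classes. A minimal sketch of that metric, using a hypothetical helper `top_k_error` and made-up scores (not results from the paper):

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k highest-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the largest scores for this example.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)

# Toy example: 3 images, 4 classes.
scores = [
    [0.1, 0.6, 0.2, 0.1],    # true class 1: correct at top-1
    [0.4, 0.3, 0.2, 0.1],    # true class 1: wrong at top-1, correct at top-2
    [0.5, 0.3, 0.15, 0.05],  # true class 3: wrong at both
]
labels = [1, 1, 3]
print(top_k_error(scores, labels, k=1), top_k_error(scores, labels, k=2))
```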
  
<img src="./images/AlexNet_result1.png" width=200 height=300>
  
### Local Response Normalization
  - normalizes the responses of the ReLUs
  - the denominator uses the sum of squared activations of a few adjacent feature maps (the ones just before and just after) at the same spatial position
    - would it not be better to normalize over all feature maps?
      - no, because they wanted to capture a local effect
    - or over a fixed group of feature maps?
      - this is a question to think about
  - motivation: _"response normalization implements a form of lateral inhibition inspired by the type found in real neurons"_
  - resembles the _"local contrast normalization scheme of Jarrett et al"_
  - ? How is Local Response Normalization different from Batch Normalization?
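
The normalization can be sketched for the activations of all channels at a single spatial position. The defaults below (`n=5`, `k=2`, `alpha=1e-4`, `beta=0.75`) are the hyperparameters reported in the paper; the function name and the list-based layout are assumptions made for illustration:

```python
def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize channel activations `a` at one spatial position.

    b[i] = a[i] / (k + alpha * sum_j a[j]^2) ** beta, where j runs over the
    n adjacent channels centred on i (clipped at the first and last channel).
    """
    num_channels = len(a)
    half = n // 2
    out = []
    for i in range(num_channels):
        lo = max(0, i - half)
        hi = min(num_channels - 1, i + half)
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * sq_sum) ** beta)
    return out

activations = [10.0, 1.0, 1.0, 1.0]
print(local_response_norm(activations))
```

On the question above: in this sketch each example is normalized using only its own neighbouring channels, with fixed hyperparameters. Batch normalization, by contrast, normalizes each channel with statistics computed across the mini-batch and learns a scale and shift.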
  
### Overlapping Pooling
  - instead of non-overlapping pooling (stride equal to the window size, still common today), they used overlapping windows: 3x3 windows with a stride of 2
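
Pooling overlaps whenever the stride `s` is smaller than the window size `z`; the paper uses `s = 2`, `z = 3`. A minimal one-dimensional sketch (the helper `max_pool_1d` and the input row are made up for illustration):

```python
def max_pool_1d(values, window, stride):
    """Max-pool a 1-D sequence; windows overlap whenever stride < window."""
    return [
        max(values[start:start + window])
        for start in range(0, len(values) - window + 1, stride)
    ]

row = [1, 5, 2, 0, 3, 4, 1]
print(max_pool_1d(row, window=2, stride=2))  # non-overlapping: s = z = 2
print(max_pool_1d(row, window=3, stride=2))  # overlapping: s = 2, z = 3
```

In the overlapping case each interior element at an even index contributes to two windows, so neighbouring outputs share information.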
  
### Architecture
  - the input is 224x224 with 3 channels; the first convolution uses 11x11 kernels with a stride of 4
  - this becomes 55x55 with 48 feature maps per GPU (96 in total)
    - `feature maps keep increasing`, while
    - `image resolution keeps decreasing`
  - the stride of 4 downsamples the image at the same time as convolving it
  - multiple dense layers at the end 
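
The spatial sizes can be checked with the standard convolution output-size formula. This is a rough sketch; `conv_output_size` is a hypothetical helper. Note that with the stated 224x224 input and no padding the formula gives 54, not 55; a 227x227 input (an assumption often made in reimplementations, not something stated in these notes) does give 55:

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Spatial output size of a convolution or pooling: floor((W - K + 2P) / S) + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# First convolution: 11x11 kernels, stride 4.
print(conv_output_size(224, kernel_size=11, stride=4))
print(conv_output_size(227, kernel_size=11, stride=4))

# Overlapping 3x3 max pooling with stride 2 applied to a 55x55 map.
print(conv_output_size(55, kernel_size=3, stride=2))
```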

### Reducing overfitting
  - today's deep models are mostly heavily overparameterized, and overfitting is treated as less of a concern than it is here
    - _"Our neural network architecture has 60 million parameters. Although the 1000 classes of ILSVRC make each training example impose 10 bits of constraint on the mapping from image to label, this turns out to be insufficient to learn so many parameters without considerable overfitting. Below, we describe the two primary ways in which we combat overfitting."_
  - Data Augmentation
    - random cropping and horizontal reflections
      - both of these data augmentation procedures are still used today
      - random cropping is how image translations are generated
    - _"altering the intensities of the RGB channels in training images"_
      - PCA-based color augmentation 
        - Is this still in use? Probably not much
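
The first augmentation (random 224x224 crops of the 256x256 images plus horizontal reflections, as described in the paper) can be sketched without any image library. The function name and the list-of-rows image layout are assumptions made for illustration:

```python
import random

def random_crop_and_flip(image, crop_size, rng):
    """Return a random crop_size x crop_size patch, mirrored left-right half the time.

    `image` is a list of rows; each row is a list of pixels.
    """
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_size)   # randint bounds are inclusive
    left = rng.randint(0, width - crop_size)
    patch = [row[left:left + crop_size] for row in image[top:top + crop_size]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]   # horizontal reflection
    return patch

rng = random.Random(0)
# Dummy 256x256 "image" whose pixels record their own (row, column) position.
image = [[(y, x) for x in range(256)] for y in range(256)]
patch = random_crop_and_flip(image, crop_size=224, rng=rng)
print(len(patch), len(patch[0]))
```

Each call yields a differently translated (and possibly mirrored) view of the same image, which is what makes cropping act as translation augmentation.
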
### Dropout
  - used less often in today's vision models
  - _"Combining the predictions of many different models is a very successful way to reduce test errors\[1, 3\], but it appears to be too expensive for big neural networks that already take several days to train."_
  - dropout
    - _"consists of setting to zero the output of each hidden neuron with probability 0.5"_
      - the technique itself is still in use
    - found that dropout reduces overfitting, at the cost of roughly doubling the number of iterations needed to converge
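
A minimal sketch of dropout as the paper describes it: hidden outputs are zeroed with probability 0.5 during training, and at test time all neurons are kept but their outputs are multiplied by 0.5. The function names are illustrative, and the sketch operates on a plain list of activations:

```python
import random

def dropout_train(activations, rng, p_drop=0.5):
    """Training: zero each activation independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]

def dropout_test(activations, p_drop=0.5):
    """Test: keep every neuron but scale outputs by the keep probability."""
    return [a * (1.0 - p_drop) for a in activations]

rng = random.Random(0)
hidden = [1.0, 2.0, 3.0, 4.0]
print(dropout_train(hidden, rng))
print(dropout_test(hidden))
```

Many current frameworks instead use "inverted" dropout, scaling the kept activations by `1 / (1 - p_drop)` during training so that nothing needs to change at test time; the expected activations are the same either way.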

### More Results
  - trained with stochastic gradient descent with momentum
  - ensembles: results are reported for 1 CNN, 5 CNNs and 7 CNNs
  - use transfer learning
    - _"to classify the entire ImageNet Fall 2011 release (15M images, 22K categories), and then “fine-tuning” it on ILSVRC-2012 gives an error rate of 16.6%."_
  - qualitative results are very good
    - e.g. a dalmatian photographed with cherries: the label is "cherry", yet "dalmatian" is a sensible prediction
  - nearest-neighbor images in feature space are close and semantically related
    - the model learns to handle variation within a class
### Conclusion
  - the basic recipe hasn't changed much today
    - vs VGG16 or VGG19: the difference is mainly depth
    - vs ResNet - ?
  - today we usually don't use 3 dense layers,
    - often just 1 dense layer and 1 classification layer
  - the authors stress that depth is important
    - _"It is notable that our network’s performance degrades if a single convolutional layer is removed"_
      - later ResNets were ultra deep
  - did not use any unsupervised pre-training