### ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)

**Authors:** Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton
- Trained a large, deep CNN to classify the 1.2 million high-resolution training images of the ImageNet contest into 1000 different classes (with 50,000 validation images and 150,000 test images).
- ImageNet consists of variable-resolution images, so images were down-sampled to a fixed resolution of 256 x 256.
- Preprocessing: Subtracted the mean over the training set from each pixel
- NN has 60 million parameters, 650,000 neurons, 5 convolutional layers (some followed by max-pooling layers), 3 fully-connected layers, final layer uses softmax.
- Used non-saturating neurons (ReLU)
- Efficient GPU implementation of convolution operation (5-6 days to train)
- Used "dropout" regularization method to reduce overfitting. 
- Architecture: Network is spread across 2 GPUs
![](alexnet1.png)
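
The mean-subtraction step in the preprocessing note above can be sketched in NumPy. This is a minimal illustration, assuming the images are already resized to 256 x 256; the function name `subtract_training_mean` and the random stand-in data are assumptions, not code from the paper.

```python
import numpy as np

def subtract_training_mean(train_images, images):
    """Center `images` with the per-pixel mean of `train_images`.

    Both arrays have shape (N, H, W, 3); the mean image has shape (H, W, 3).
    """
    mean_image = train_images.mean(axis=0)
    return images - mean_image, mean_image

# Random stand-in data (an assumption): 8 RGB images already resized to 256 x 256.
rng = np.random.default_rng(0)
train = rng.uniform(0.0, 255.0, size=(8, 256, 256, 3))
centered, mean_image = subtract_training_mean(train, train)
print(mean_image.shape)  # (256, 256, 3)
```

The same `mean_image` would be reused for validation and test images, so the statistics come from the training set only.
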
- ReLU Nonlinearity: 
    - Saturating non-linearities $f(x) = \tanh(x)$ and $f(x) = (1+e^{-x})^{-1}$ train much more slowly with gradient descent than the non-saturating non-linearity $f(x) = \max(0, x)$
    - ReLUs do not require input normalization to prevent them from saturating
    
- Pooling Layers: A grid of pooling units spaced s pixels apart, each summarizing a neighborhood of size z x z centered at the location of the pooling unit. Setting s < z gives overlapping pooling (the paper uses s = 2, z = 3), which reduces error.
- Cost: Maximizing the average across training cases of the log-probability of the correct label under the prediction distribution (the multinomial logistic regression objective)
- Reducing Overfitting
    - Data Augmentation: Artificially enlarged the dataset using label-preserving transformations (reduces overfitting)
        - Generated image translations and horizontal reflections (extracted random patches and trained the network on the extracted patches)
        - Altered the intensities of the RGB channels using PCA on the RGB pixel values of the training set
    - Dropout: During training, set the output of each hidden neuron to zero with probability 0.5
        - Reduces complex co-adaptations of neurons (enables a neuron not to rely on the presence of particular other neurons). Neurons are forced to learn more robust features that are useful in conjunction with many different random subsets of other neurons.
- Training:
    - SGD with batch size = 128 examples, momentum = 0.9, weight decay = 0.0005
    - Weights initialized from a zero-mean Gaussian distribution with standard deviation 0.01. Biases in some layers initialized with the constant 1 (accelerates the early stages of learning by providing the ReLUs with positive inputs); biases in the remaining layers initialized with the constant 0
![](alexnet2.png)
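
The training notes above (momentum 0.9, weight decay 0.0005, Gaussian initialization with standard deviation 0.01) can be sketched as follows. The update rule follows the form given in the paper; the helper names `init_layer` and `sgd_step`, the learning rate, and the toy quadratic loss are assumptions made for illustration. In real training, `grad` would be the gradient averaged over a batch of 128 examples.

```python
import numpy as np

def init_layer(weight_shape, n_biases, bias_value, rng):
    """Zero-mean Gaussian weights (std 0.01) and constant biases (0 or 1)."""
    weights = rng.normal(0.0, 0.01, size=weight_shape)
    biases = np.full(n_biases, float(bias_value))
    return weights, biases

def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update: v <- momentum*v - weight_decay*lr*w - lr*grad, then w <- w + v."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

# Toy problem (an assumption): minimize 0.5 * ||w - target||^2,
# whose gradient with respect to w is simply w - target.
rng = np.random.default_rng(0)
w, b = init_layer((4,), 4, bias_value=1, rng=rng)
v = np.zeros_like(w)
target = np.array([1.0, -2.0, 0.5, 3.0])
for _ in range(500):
    w, v = sgd_step(w, v, grad=w - target, lr=0.01)
```

Because of the weight-decay term, `w` settles slightly short of `target` (at `target / 1.0005` for this toy loss), which shows how the decay pulls weights toward zero.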

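The ReLU, overlapping pooling, and dropout items in the notes above can be sketched for a single 2-D feature map. This is a simplified illustration under stated assumptions: the function names are hypothetical, pooling windows are anchored at their top-left corner with no padding, and z = 3, s = 2 are the pooling values used in the paper. Scaling outputs by 0.5 at test time follows the paper's description of dropout.

```python
import numpy as np

def relu(x):
    """Non-saturating non-linearity f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def max_pool2d(x, z=3, s=2):
    """Max-pool a 2-D map with z x z windows spaced s apart (s < z overlaps)."""
    h, w = x.shape
    out_h = (h - z) // s + 1
    out_w = (w - z) // s + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + z, j * s:j * s + z].max()
    return out

def dropout(x, rng, p=0.5, train=True):
    """Training: zero each unit with probability p. Test: scale by (1 - p)."""
    if not train:
        return x * (1.0 - p)
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask

rng = np.random.default_rng(0)
feature_map = np.arange(25, dtype=float).reshape(5, 5) - 12.0
activated = relu(feature_map)
pooled = max_pool2d(activated, z=3, s=2)   # 2 x 2 output, windows overlap by one pixel
dropped = dropout(pooled, rng, p=0.5, train=True)
```
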
**Information**
- ImageNet: Over 15 million labeled high-resolution (variable-resolution) images in roughly 22,000 categories. Images were collected from the web and labeled using Amazon Mechanical Turk.
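
The two data augmentation steps from the AlexNet notes above (random patches with horizontal reflections, and PCA-based changes to RGB intensities) can be sketched as below. The 224 x 224 patch size and the standard deviation of 0.1 for the random PCA coefficients are taken from the paper rather than from these notes; the function names and the random stand-in images are assumptions.

```python
import numpy as np

def random_crop_flip(image, rng, crop=224):
    """Random crop x crop patch, mirrored horizontally with probability 0.5."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]
    return patch

def rgb_pca(train_images):
    """Eigenvalues and eigenvectors of the 3 x 3 covariance of RGB pixel values."""
    pixels = train_images.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)
    return np.linalg.eigh(cov)  # eigenvectors are the columns of the second result

def pca_color_jitter(image, eigvals, eigvecs, rng, sigma=0.1):
    """Add sum_i p_i * alpha_i * lambda_i to every pixel, alpha_i ~ N(0, sigma)."""
    alpha = rng.normal(0.0, sigma, size=3)
    shift = eigvecs @ (alpha * eigvals)
    return image + shift

# Random stand-in data (an assumption): 4 RGB images of size 256 x 256.
rng = np.random.default_rng(0)
train = rng.uniform(0.0, 1.0, size=(4, 256, 256, 3))
eigvals, eigvecs = rgb_pca(train)
patch = random_crop_flip(train[0], rng)
jittered = pca_color_jitter(patch, eigvals, eigvecs, rng)
```

The color shift is the same for every pixel of one image and is redrawn each time the image is used, so object identity is preserved while illumination intensity and color vary.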

### Visualizing and Understanding Convolutional Networks (ZF Net)
**Authors:** Matthew Zeiler and Rob Fergus  
[Video link](https://www.youtube.com/watch?v=ghEmQSxT6tw)

- Introduced a visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. The technique also makes it possible to observe the evolution of features during training.
- Normally, only first-layer features can be projected to pixel space
- Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. Developed a new way to map feature activity back to input pixel space, showing what input pattern caused a given activation in the feature map.
- The visualization technique uses a multi-layered deconvolutional network (deconvnet) to project the feature activations back to the input pixel space. A deconvnet is similar to a convolutional network run in reverse.
    - Convolution: Mapping pixels to features
    - Deconvolution: Mapping features to pixels
- A deconvnet (with a path back to the image pixels) is attached to each layer of the CNN. An input is fed into the CNN and features are computed throughout the layers. To examine a given CNN activation, set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then (1) unpool, (2) rectify and (3) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation
    - Unpooling: The max pooling operation is non-invertible, but an approximate inverse can be obtained by recording the locations of the maxima within each pooling region in a set of switch variables.
    - Rectification: ReLU non-linearities rectify the feature maps (ensuring the feature maps are always non-negative). Valid feature reconstructions at each layer are obtained by passing the reconstructed signal through a ReLU
    - Filtering: The deconvnet applies transposed versions of the learned filters to the rectified maps
![](zfnet0.png)
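
The unpool, rectify, and filter steps described above can be sketched for one single-channel conv, ReLU, max-pool layer. This is a toy illustration, not the authors' implementation: all function names are hypothetical, the pooling is non-overlapping 2 x 2, and the "transposed filter" step is written as the adjoint of a valid cross-correlation.

```python
import numpy as np

def conv2d_valid(x, k):
    """Forward filtering: valid cross-correlation of map x with filter k."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def conv2d_transpose(y, k):
    """Deconvnet filtering: the transpose (adjoint) of conv2d_valid."""
    kh, kw = k.shape
    out = np.zeros((y.shape[0] + kh - 1, y.shape[1] + kw - 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            out[i:i + kh, j:j + kw] += y[i, j] * k
    return out

def max_pool_with_switches(x, size=2):
    """Max pooling that also records where each maximum came from (switches)."""
    ph, pw = x.shape[0] // size, x.shape[1] // size
    pooled = np.zeros((ph, pw))
    switches = np.zeros((ph, pw, 2), dtype=int)
    for i in range(ph):
        for j in range(pw):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            pooled[i, j] = window[r, c]
            switches[i, j] = (i * size + r, j * size + c)
    return pooled, switches

def unpool(pooled, switches, shape):
    """Approximate inverse of pooling: put each value back at its switch location."""
    out = np.zeros(shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = switches[i, j]
            out[r, c] = pooled[i, j]
    return out

def project_to_pixels(image, kernel, unit):
    """Project one pooled activation back to pixel space."""
    feat = np.maximum(0.0, conv2d_valid(image, kernel))   # conv + ReLU
    pooled, switches = max_pool_with_switches(feat)
    selected = np.zeros_like(pooled)
    selected[unit] = pooled[unit]                          # zero all other activations
    unpooled = unpool(selected, switches, feat.shape)      # (1) unpool
    rectified = np.maximum(0.0, unpooled)                  # (2) rectify
    return conv2d_transpose(rectified, kernel)             # (3) filter

rng = np.random.default_rng(0)
image = rng.normal(size=(7, 7))
kernel = rng.normal(size=(3, 3))
reconstruction = project_to_pixels(image, kernel, unit=(0, 0))
```

For a deeper network, the same three steps would be repeated layer by layer until the input pixel space is reached.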

- Architecture: Similar to AlexNet (sparse connections replaced with dense connections)
![](zfnet1.png)
- Training: 
    - ImageNet 2012 training set (1.3 million images)
    - SGD with batch size = 128 examples, momentum = 0.9, learning rate = 0.01
    - Weights initialized to 0.01 and biases set to 0
    - Observed that a few first-layer filters were dominating. Solution: Renormalize each filter in the convolution layers whose RMS value exceeds a fixed radius of 0.1 back to that radius
- AlexNet problems:
    - First layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies.
    - Second layer visualization shows aliasing distortion/artifacts caused by large stride 4 used in 1st layer convolution. Aliasing is an effect that causes different signals to become indistinguishable (or aliases of one another) when sampled.
    - Solution:
        - Reduced 1st layer filter size from 11 x 11 to 7 x 7
        - Reduced stride of convolutions from 4 to 2
        ![](zfnet2.png)
        - (b) and (d): AlexNet and (c) and (e): ZF Net
    - A smaller filter and a smaller stride retain more information
- Occlusion sensitivity: The probability of the correct class drops significantly when the object is occluded
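- Correspondence analysis: Occluded the same object part (e.g., an eye or the nose) in several images and measured how consistently the feature vectors changed, using the Hamming distance between the signs of the changes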
- Feature generalization: Complex invariances are learned in the convolution layers. Tested model performance on Caltech and other data sets by keeping the first 7 layers fixed and training a new final softmax layer. Found that the model generalized well.
![](zfnet3.png)
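
The occlusion sensitivity experiment noted above can be sketched as a sliding-occluder loop. This is a minimal illustration under stated assumptions: `occlusion_map` and `toy_model` are hypothetical names, the toy classifier merely stands in for a trained CNN, and the patch size, stride, and fill value are arbitrary choices (the paper uses a grey square).

```python
import numpy as np

def occlusion_map(image, predict_proba, true_class, patch=8, stride=8, fill=0.0):
    """Slide a square occluder over `image`; record p(true_class) at each position."""
    h, w = image.shape[:2]
    tops = range(0, h - patch + 1, stride)
    lefts = range(0, w - patch + 1, stride)
    heat = np.zeros((len(tops), len(lefts)))
    for a, top in enumerate(tops):
        for b, left in enumerate(lefts):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            heat[a, b] = predict_proba(occluded)[true_class]
    return heat

def toy_model(image):
    """Hypothetical 2-class stand-in: class 1 score is the brightness of one region."""
    p = float(image[8:16, 8:16].mean())
    return np.array([1.0 - p, p])

image = np.zeros((32, 32))
image[8:16, 8:16] = 1.0  # the "object" the toy model responds to
heat = occlusion_map(image, toy_model, true_class=1)
```

Low values in `heat` mark occluder positions that cover the evidence the classifier relies on; here the only low entry is the one where the occluder covers the bright region.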