### ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)

**Authors:** Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton
- Trained a large, deep CNN to classify the 1.2 million high-resolution training images of the ImageNet contest into 1000 different classes (50,000 validation images, 150,000 testing images).
- ImageNet consists of variable-resolution images, so images were down-sampled to a fixed resolution of 256 x 256.
- Preprocessing: Subtracted mean over the training set from each pixel
- NN has 60 million parameters, 650,000 neurons, 5 convolutional layers (some followed by max-pooling layers), 3 fully-connected layers, final layer uses softmax.
- Used non-saturating neurons (ReLU)
- Efficient GPU implementation of convolution operation (5-6 days to train)
- Used "dropout" regularization method to reduce overfitting. 
- Architecture: Network is spread across 2 GPUs
![](image/alexnet1.png)
- ReLU Nonlinearity: 
    - Saturating non-linearities $f(x) = \tanh(x)$ and $f(x) = (1+e^{-x})^{-1}$ train much more slowly with gradient descent than the non-saturating non-linearity $f(x) = \max(0, x)$
    - ReLUs do not require input normalization to prevent them from saturating
- Found that using Local Response Normalization aids generalization
- Pooling Layers: A grid of pooling units, each summarizing a neighborhood of size z x z centered at the location of the pooling unit. Overlapping pooling reduces error.
- Cost: Maximizing average across training cases of the log-probability of the correct label under the prediction distribution
- Reducing Overfitting
    - Data Augmentation: Artificially augmented the dataset using label-preserving transformations (reduces overfitting)
        - Generated image translations and horizontal reflections (extracted random patches and trained network on extracted patches)
        - Altered intensities of RGB channels (PCA)
    - Dropout: Set the output of each hidden neuron to zero with probability 0.5
        - Reduces complex co-adaptations of neurons (enables a neuron not to rely on the presence of particular other neurons). Neurons are forced to learn more robust features that are useful in conjunction with many different random subsets of other neurons.
- Training:
    - SGD with batch size = 128 examples, momentum = 0.9, weight decay = 0.0005
    - Weights initialized from a zero-mean Gaussian distribution with standard deviation 0.01. Biases in a few layers were initialized with the constant 1 (this accelerates the early stages of learning by providing the ReLUs with positive inputs) and the remaining biases with the constant 0
![](image/alexnet2.png)
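
The training hyperparameters above (momentum 0.9, weight decay 0.0005) can be read as a per-weight update rule. The following is a minimal sketch for a single scalar weight, using the usual momentum formulation with weight decay folded into the velocity. The learning rate value and the toy quadratic objective are assumptions made for illustration and are not taken from these notes.

```python
def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One SGD update for a single scalar weight.

    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    """
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    return w_next, v_next


def toy_grad(w):
    """Gradient of the toy objective L(w) = 0.5 * (w - 3)^2 (an assumption)."""
    return w - 3.0


w, v = 0.0, 0.0
for _ in range(500):
    # lr=0.01 is an illustrative value, not a claim about the paper's schedule.
    w, v = sgd_momentum_step(w, v, toy_grad(w), lr=0.01)

print(w)
```

Weight decay pulls the solution slightly below the unregularized minimum at 3, which is the small shrinkage effect the decay term is meant to provide.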

**Information**
- ImageNet: 15 million labeled high-resolution (variable) images in over 22,000 categories. Images collected from the web and labeled using Amazon Mechanical Turk.

### Visualizing and Understanding Convolutional Networks (ZF Net)
**Authors:** Matthew Zeiler and Rob Fergus  
[Matthew Zeiler Presentation](https://www.youtube.com/watch?v=ghEmQSxT6tw)

- Introduced a visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. The technique also makes it possible to observe the evolution of features during training.
- Normally only first layer projections to pixel space are possible
- Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. Developed a new way to map feature activity back to input pixel space, showing what input pattern caused a given activation in the feature map.
- [Visualization](https://www.youtube.com/watch?v=AgkfIQ4IGaM) technique uses a multi-layered deconvolutional network to project the feature activations back to the input pixel space. Deconvolutional network is similar to convolution network in reverse. 
    - Convolution: Mapping pixels to features
    - Deconvolution: Mapping features to pixels
- A deconvnet (with path back to image pixels) is attached to each layer of CNN. An input is fed into CNN and features are computed throughout the layers. To examine a given CNN activation, set all other activations in the layer to zero and pass the feature maps as input to attached deconvnet layer. Then (1) unpool, (2) rectify and (3) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation
    - Unpooling: Max pooling operation is non-invertible, however an approximate inverse can be obtained by recording the locations of the maxima within each pooling region in a set of switch variables. 
    - Rectification: ReLU non-linearities rectify the feature maps (makes sure feature maps are always positive). Valid feature reconstructions at each layer are obtained by passing reconstructed signal through a ReLU
    - Filtering: Deconvnet uses transposed versions of learned filters applied to rectified maps
![](image/zfnet0.png)
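
The unpooling and rectification steps described above can be sketched in plain Python. This is an illustrative toy, assuming non-overlapping 2 x 2 pooling on a single 2-D feature map; the function names are hypothetical, and the filtering step (transposed filters) is left out.

```python
def max_pool_with_switches(fmap, size=2):
    """Non-overlapping max pooling that also records where each maximum was."""
    rows, cols = len(fmap), len(fmap[0])
    pooled, switches = [], []
    for r in range(0, rows, size):
        pooled_row, switch_row = [], []
        for c in range(0, cols, size):
            # All (value, location) pairs inside this pooling region.
            region = [(fmap[i][j], (i, j))
                      for i in range(r, min(r + size, rows))
                      for j in range(c, min(c + size, cols))]
            value, location = max(region, key=lambda item: item[0])
            pooled_row.append(value)
            switch_row.append(location)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, shape):
    """Approximate inverse: put each pooled value back at its recorded location."""
    rows, cols = shape
    out = [[0.0] * cols for _ in range(rows)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (i, j) in zip(pooled_row, switch_row):
            out[i][j] = value
    return out


def rectify(fmap):
    """ReLU applied to the reconstructed map."""
    return [[max(0.0, v) for v in row] for row in fmap]


fmap = [[1, 5, 2, 0],
        [3, 4, 1, 7],
        [0, 2, 9, 8],
        [6, 1, 3, 4]]
pooled, switches = max_pool_with_switches(fmap)
restored = rectify(unpool(pooled, switches, (4, 4)))
```

Only the positions recorded in the switches are non-zero in `restored`, which is why the reconstruction keeps the structure of the stimulus that produced the activation.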

- Architecture: Similar to AlexNet (sparse connections replaced with dense connections)
![](image/zfnet1.png)
- Training: 
    - ImageNet 2012 training set (1.3 Million images)
    - SGD with batch size = 128 examples, momentum = 0.9, learning rate = 0.01
    - Weights initialized to 0.01 and biases set to 0
    - Observed that a few first-layer filters were dominating. Solution: Renormalize each filter in the convolution layers whose RMS value exceeds a fixed radius of 0.1 back to that radius
- AlexNet problems:
    - First layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies.
    - Second layer visualization shows aliasing distortion/artifacts caused by large stride 4 used in 1st layer convolution. Aliasing is an effect that causes different signals to become indistinguishable (or aliases of one another) when sampled.
    - Solution:
        - Reduced 1st layer filter size from 11 x 11 to 7 x 7
        - Reduced stride of convolutions from 4 to 2
        ![](image/zfnet2.png)
        - (b) and (d): AlexNet and (c) and (e): ZF Net
    - Smaller filter and smaller stride retains more information
- Occlusion sensitivity: Probability of the correct class drops significantly when the object is occluded
- Correspondence analysis: Used Hamming distance
- Feature generalization: Complex invariances learned in convolution layers. Tested model performance on Caltech and other data sets by fixing first 7 layers and training final softmax layer. Found the model was able to generalize well.
![](image/zfnet3.png)
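
The occlusion sensitivity experiment above can be sketched as a sliding occluder that records the correct-class score at each position. This is a toy illustration: `toy_score` is a hypothetical stand-in for a trained classifier, and the patch size and fill value are assumptions.

```python
def occlusion_map(image, score_fn, patch=2, fill=0.0):
    """Slide a square occluder over the image and record the score each time."""
    rows, cols = len(image), len(image[0])
    heat = []
    for r in range(rows - patch + 1):
        heat_row = []
        for c in range(cols - patch + 1):
            occluded = [row[:] for row in image]  # copy so the input is untouched
            for i in range(r, r + patch):
                for j in range(c, c + patch):
                    occluded[i][j] = fill
            heat_row.append(score_fn(occluded))
        heat.append(heat_row)
    return heat


def toy_score(image):
    """Hypothetical classifier: score is the mean brightness of the central 2x2."""
    return (image[1][1] + image[1][2] + image[2][1] + image[2][2]) / 4.0


# 4x4 image whose "object" is the bright central 2x2 block.
image = [[0.0] * 4 for _ in range(4)]
for i in (1, 2):
    for j in (1, 2):
        image[i][j] = 1.0

heat = occlusion_map(image, toy_score)
```

The lowest value in `heat` sits where the occluder covers the object, which mirrors the observation that the classifier localizes the object rather than relying on surrounding context.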

### Very Deep Convolutional Networks For Large-Scale Image Recognition (VGG Net)
**Authors:** Karen Simonyan and Andrew Zisserman

- Investigated the effect of convolutional network depth on accuracy in large-scale image recognition (used very small 3 x 3 filters)
- Architecture:
    - ConvNet input: Fixed size 224 x 224 RGB image (Preprocessing: Subtracted the mean RGB, computed on the training set, from each pixel)
    - Filter size: 3 x 3 (Used 1 x 1 in one configuration)
    - Convolution Stride is fixed to 1 pixel
    - Spatial Padding: 1 pixel to preserve spatial resolution after convolution 
    - Spatial Pooling: 5 max pooling layers which follow some of the convolution layers. Max pooling is performed over a 2 x 2 pixel window, with a stride of 2 pixels.
    - Fully Connected: 3 layers after a stack of convolution layers. Third FC layer is softmax layer
    - ReLU non-linearity applied to all hidden layers
    - Width (number of channels) of the convolution layers starts at 64 in the first layer and then increases by a factor of 2 after each max-pooling layer, until it reaches 512
    ![](image/vggnet0.png)
    - Convolutional layer parameter: “conv⟨receptive field size⟩-⟨number of channels⟩”
    ![](image/vggnet1.png)
- Showed that Local Response Normalization (as used in AlexNet) does not improve performance on the ImageNet dataset but increases memory consumption and computation time.
- Small Filters:
    - Used 3 x 3 filters (with stride 1). Stacking small filters increases the effective receptive field.
        - Example: Stack two 3 x 3 convolution layers (stride 1). Each neuron sees a 3 x 3 region of the previous activation map, so a neuron in the second convolution layer sees a 5 x 5 region of the input. A stack of two 3 x 3 convolution layers therefore has an effective filter size of 5 x 5
        ![](image/vggnet2.png)
        ![](image/vggnet3.png)
    - Stack of three 3 x 3 convolution layers vs a single 7 x 7 convolution layer: 3 non-linear rectification layers instead of one and decrease in number of parameters and computations.
    ![](image/vggnet4.png)
    - A 1 x 1 convolution layer (with the same number of input and output channels) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the convolution layers.
- Training:
    - ImageNet 2012 dataset (1.3 Million images)
    - SGD with batch size = 256, momentum = 0.9, learning rate = 0.01 (initially and then decreased by factor of 10 when validation set accuracy stopped improving)
    - Training was regularized by weight decay ($\lambda = 5 \times 10^{-4}$) and by dropout for the first 2 fully connected layers.
    - Weights initialized from a zero mean Gaussian distribution with standard deviation 0.01. Biases initialized with zero.
    - Data augmentation: Random horizontal flipping, random RGB color shift and scale jittering.
    - Used Caffe tool box to implement CNN
- *Demonstrated that convolutional network depth is beneficial for the classification accuracy*
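
The small-filter argument above comes down to simple arithmetic, sketched below. It assumes stride-1 convolutions, the same number of channels `channels` going in and out of every layer, and it ignores biases; the helper names and the choice of 256 channels are hypothetical.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    field = 1
    for _ in range(num_layers):
        field += kernel - 1  # each stride-1 layer widens the view by kernel - 1
    return field


def conv_stack_weights(num_layers, kernel, channels):
    """Weight count when every layer maps `channels` inputs to `channels` outputs."""
    return num_layers * kernel * kernel * channels * channels


channels = 256  # illustrative value
three_small = conv_stack_weights(3, 3, channels)  # 27 * channels^2
one_large = conv_stack_weights(1, 7, channels)    # 49 * channels^2

print(stacked_receptive_field(2), stacked_receptive_field(3))
print(three_small, one_large)
```

Three stacked 3 x 3 layers cover the same 7 x 7 region as one 7 x 7 layer while using 27/49 of the weights and applying three rectifications instead of one.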

**Object Localization**
- Last fully connected layer predicts the bounding box location. A bounding box is represented by a 4-D vector storing its center coordinates, width and height. ConvNet configuration D was used with last layer as bounding box prediction layer.
- Euclidean loss which penalises the deviation of the predicted bounding box parameters from the ground-truth was used during training.
- The bounding box prediction is correct if its intersection over union ratio with the ground-truth bounding box is above 0.5
![](image/iou.png)
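
The intersection-over-union criterion can be computed directly from the 4-D box representation described above (center coordinates, width, height). A minimal sketch, with hypothetical function names:

```python
def to_corners(box):
    """Convert (center_x, center_y, width, height) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    # Overlap along each axis is clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def is_correct(predicted, ground_truth, threshold=0.5):
    """A prediction counts as correct when IoU is above the threshold."""
    return iou(predicted, ground_truth) > threshold
```

For example, `iou((1, 1, 2, 2), (2, 1, 2, 2))` describes two 2 x 2 boxes shifted by one unit, which overlap in a 1 x 2 strip and so fall below the 0.5 threshold.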
    
**Information**
- Two VGG Net models are publicly available
- [VGG Net on iPhone](http://matthijshollemans.com/2016/08/30/vggnet-convolutional-neural-network-iphone/)

### Network In Network
**Authors:** Min Lin, Qiang Chen, and Shuicheng Yan

- Proposed a novel deep network structure called Network In Network (NIN).  
- Convolution layers take the inner product of the linear filter and the underlying receptive field, followed by a non-linear activation function, at every local portion of the input to produce feature maps.
- Convolution filter in CNN is a generalized linear model (GLM) for the underlying data patch. GLM can achieve a good extent of abstraction (the feature is invariant to the variants of the same concept) when the samples of latent concepts are linearly separable. However, the data for the same concept often live on a non-linear manifold, therefore the representations that capture these concepts are generally highly non-linear function of the input.
    - Idea: Replace GLM with non-linear function approximator
- In NIN the GLM is replaced with a "micro network" structure which is a general non-linear function approximator. Authors chose a Multi-layer Perceptron (MLP) as the instantiation of the micro network.
![](image/nin0.png)
- MLPConv layer maps the input local patch to feature map with an MLP by sliding it over the input. ReLU non-linearity is used as activation function in the MLP.
    - In a CNN using the linear rectifier, the feature map can be calculated as $f_{i, j, k} = \max(w_{k}^Tx_{i, j}, 0)$, where $(i, j)$ is the pixel index in the feature map, $x_{i,j}$ stands for the input patch centered at location $(i, j)$, and $k$ is used to index the channels of the feature map.
    - MLPConv layer performs calculation as follows ($n$ is number of layers in the multilayer perceptron): $$f_{i, j, k_1}^1 = \max({w_{k_1}^{1}}^Tx_{i, j} + b_{k_1}, 0)$$
    $$\vdots$$  $$f_{i, j, k_n}^n = \max({w_{k_n}^{n}}^Tf_{i, j}^{n-1} + b_{k_n}, 0)$$
    
- NIN structure consists of stacked multiple MLPConv layers. Instead of using fully connected layer (prone to overfitting) for classification, NIN uses *global average pooling* layer (acts as a regularizer) which feeds into softmax layer.
![](image/nin1.png)
- **Global Average Pooling:** Generate one feature map for each corresponding category of the classification task in the last MLPConv layer and then take average of each feature map and feed into softmax layer.
    - Advantages:
        - No parameter to optimize
        - More robust to spatial translation of the input
- Experiments: Used these four datasets: CIFAR-10, CIFAR-100, SVHN, MNIST
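
Global average pooling as described above reduces each class feature map to a single number before the softmax. The sketch below uses plain nested lists; the three toy feature maps are made-up values for illustration.

```python
import math


def global_average_pool(feature_maps):
    """Average each 2-D feature map (one per class) down to a single value."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One 2x2 feature map per class (toy values).
feature_maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
pooled = global_average_pool(feature_maps)
probabilities = softmax(pooled)

# Moving activations around inside a map leaves its average unchanged.
shifted = [[7.0, 5.0], [3.0, 1.0]]
```

The pooling step has no weights to learn, and because an average ignores where in the map an activation occurs, the result is less sensitive to spatial translation of the input.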

### Going Deeper with Convolutions (GoogLeNet)
**Authors:** Christian Szegedy et al.

- Proposed a deep CNN architecture (Inception-"Network in network") with increased depth and width of the network while improving utilization of computing resources (Models designed to keep a computational budget of 1.5 billion multiply-adds at inference time)
- Hebbian principle – Neurons that fire together, wire together: A method of determining how to alter the weights between model neurons. The weight between two neurons increases if the two neurons activate simultaneously, and reduces if they activate separately. Nodes that tend to be either both positive or both negative at the same time have strong positive weights, while those that tend to be opposite have strong negative weights.
- GoogLeNet uses 12 times fewer parameters than AlexNet and is 22 layers deep.
- Max pooling layers sometimes result in loss of accurate spatial information.
    - Advantages of Max pooling
        - No parameters
        - Often accurate
    - Disadvantages of Max pooling
        - More computationally expensive
        - More hyperparameters (pooling size and stride)
- [1 x 1 Explanation](https://www.youtube.com/watch?v=qVP574skyuM): 1 x 1 convolutional layers are used as dimension reduction modules to remove computational bottlenecks. This allows increase in depth as well as width (number of units at each level) of network. For example, a feature map with size 100 x 100 x C channels on convolution with $k$ 1 x 1 filters would result in a feature map of size 100 x 100 x $k$. 
- Deep network has drawbacks:
    - Large number of parameters make deep network more prone to overfitting
    - Training a deep network requires a lot of computational resources.
    - Solution: Efficient distribution of computational resources and introduce sparsity and replace fully connected layers by sparse layers
- Architecture
    - Filter size 1 x 1, 3 x 3, and 5 x 5 are used to avoid patch alignment issues
    - [Inception modules](https://www.youtube.com/watch?v=VxhSouuSZDY): Used 9 inception modules with over 100 layers in total
    
        - Naive version:
            - Merging of outputs of the pooling layer with outputs of the convolutional layer would increase the number of outputs from stage to stage and this will lead to a computational blow up within a few stages
        - Dimensionality Reduction Inception module (idea based on embeddings): Using 1 x 1 filter size to reduce dimension as well as to increase non-linearity
    ![](image/gnet0.png)
    - All convolutions use ReLU non-linearity for activations
    ![](image/gnet1.png)
    - “#3×3 reduce” and “#5×5 reduce” stand for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions.
    ![](image/gnet2.png)
    ![](image/gnet3.png)
    - Auxiliary classifiers were added to intermediate layers to combat vanishing gradient problem while providing regularization. During training auxiliary classifier loss (with discount weight 0.3) gets added to total loss of the network. 
    - Used Average pooling layer
- Training:
    - GoogLeNet networks were trained using the DistBelief (Large Scale Distributed Deep Networks) distributed machine learning system
    - Asynchronous SGD with momentum = 0.9, learning rate decreased by 4% every 8 epochs. 
    - ImageNet 2012 dataset (1.3 Million images)
- Trained 7 versions of same GoogLeNet model and performed ensemble prediction and obtained a top-5 error of 6.67%
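
The benefit of the 1 x 1 dimension-reduction layers in the Inception module can be seen by counting multiply-adds for a 5 x 5 branch with and without a reduction layer in front of it. The sizes below (28 x 28 x 192 input, 16 reduction filters, 32 output filters) are illustrative assumptions, and the count assumes stride 1, padding that preserves spatial size, and no biases.

```python
def conv_multiply_adds(height, width, in_channels, out_channels, kernel):
    """Multiply-adds of a stride-1 convolution that preserves spatial size."""
    return height * width * out_channels * kernel * kernel * in_channels


HEIGHT, WIDTH, IN_CHANNELS = 28, 28, 192  # illustrative input size
REDUCE, OUT_5X5 = 16, 32                  # illustrative filter counts

# Naive branch: 5x5 convolution applied directly to all input channels.
direct = conv_multiply_adds(HEIGHT, WIDTH, IN_CHANNELS, OUT_5X5, 5)

# Reduced branch: 1x1 convolution first, then 5x5 on the thinner tensor.
reduced = (conv_multiply_adds(HEIGHT, WIDTH, IN_CHANNELS, REDUCE, 1)
           + conv_multiply_adds(HEIGHT, WIDTH, REDUCE, OUT_5X5, 5))

print(direct, reduced, direct / reduced)
```

Under these assumptions the reduced branch needs roughly a tenth of the multiply-adds, which is what allows the network to grow in depth and width within a fixed computational budget.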

**Object Localization**
- Approach similar to R-CNN used but augmented with inception model as the region classifier

![](image/GoogLeNet.gif)
    


### Deep Residual Learning for Image Recognition (ResNet)
**Authors:** Kaiming He et al.

- Presented a residual learning framework that makes it easier to train deep networks
- Accuracy degradation problem occurs when deep networks start converging: with the network depth increasing the accuracy gets saturated and then degrades rapidly.
![](image/resnet0.png)
- Training accuracy degradation indicates that not all systems are similarly easy to optimize.
- Experiments on ImageNet to show:
    - Extremely deep residual networks are easy to optimize. Deep plain networks (simple stacks of layers) exhibit higher training error as the depth increases
    - Deep residual networks gain accuracy from increased depth
    - 152 layer residual network ensemble achieved *3.57* % top-5 error on ImageNet test set.
- Experiments on CIFAR-10: Explored models with over 1000 layers
- Consider $\mathcal{H}(\mathbf{x})$ as an underlying mapping to be fit by a few stacked layers, where $\mathbf{x}$ is the input to the first of these layers. The hypothesis that multiple non-linear layers can approximate complicated functions is equivalent to the hypothesis that they can approximate the residual function $\mathcal{H}(\mathbf{x}) - \mathbf{x}$. Instead of using stacked layers to approximate $\mathcal{H}(\mathbf{x})$, the stacked layers are used to approximate a residual function $\mathcal{F}(\mathbf{x}):=\mathcal{H}(\mathbf{x}) - \mathbf{x}$. The original function becomes $\mathcal{F}(\mathbf{x})+\mathbf{x}$

- If the added layers can be constructed as identity mappings a deeper model should not have training error greater than a shallower similar network. The accuracy degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple non-linear layers. With the residual learning reformulations, *if the identity mappings are optimal, the solvers may drive the weights of the multiple non-linear layers toward zero to approach identity mappings*.
- Desired underlying mapping $\mathcal{H}(\mathbf{x})$ can be realized by a feedforward network with *shortcut connections* (connections that skip one or more layers).
- In ResNet, shortcut connections simply perform *identity* mapping (a function that always returns the same value that was used as its argument). Identity shortcut connections do not add extra parameters or computational complexity.
- **Identity Mapping by Shortcuts**
    - Building block: $\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$ where $\mathbf{x}$ and $\mathbf{y}$ are the input and output vectors of the layers. Function $\mathcal{F}(\mathbf{x}, \{W_i\})$ represents the residual mapping to be learned.
    ![](image/resnet1.png)
    - Above example: $\mathcal{F} = W_2\sigma(W_1 \bf x)$, where $\sigma$ denotes non-linearity ReLU.
    - Operation $\mathcal{F} + \bf x$ is performed by a shortcut connection and element-wise addition. Then ReLU non-linearity is applied: $\sigma(\mathcal{F} + \bf x)$. Shortcut connection does not introduce extra parameter or computation complexity. 
    - Experiments involve function $\mathcal{F}$ that has 2 or more layers. If $\mathcal{F}$ has only a single layer then $\bf y = W_1 \bf x + \bf x$ (a linear layer) and it does not offer any advantages
- Training:
    - Image resized with its shorter side randomly sampled in [256, 480] for scale augmentation. A 224 x 224 crop is then randomly sampled from an image with the per-pixel mean subtracted.
    - Used batch normalization after each convolution and before activation
    - SGD batch size = 256, learning rate = 0.1, divided by 10 when the error plateaus
    - Weight decay = 0.0001 and momentum = 0.9
    - Convolution layers have 3 x 3 filters. Downsampling is done by using a stride of 2. 
    - Network ends with a global average pooling layer and  a fully connected softmax layer.
- Deeper Bottleneck Architecture: Modified the *building block* into a *bottleneck* design. For each residual function $\mathcal{F}$ a 3-layer stack (1 x 1, 3 x 3, 1 x 1) was used. The 1 x 1 layers reduce and then restore dimensions. Parameter-free identity shortcuts are important here: if they were replaced with projections, the time complexity and model size would double.
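
The two-layer building block with an identity shortcut can be sketched with plain lists. This is a toy, assuming fully connected layers without biases (matching the $W_2\sigma(W_1 x)$ form above) rather than convolutions with batch normalization; the helper names are hypothetical.

```python
def relu(vec):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in vec]


def matvec(matrix, vec):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def residual_block(x, w1, w2):
    """y = relu(F(x) + x) with F(x) = W2 * relu(W1 * x)."""
    residual = matvec(w2, relu(matvec(w1, x)))
    # Shortcut connection: element-wise addition of the unchanged input.
    return relu([f + xi for f, xi in zip(residual, x)])


x = [1.0, 2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

passthrough = residual_block(x, zeros, zeros)    # F(x) = 0, so y = relu(x)
doubled = residual_block(x, identity, identity)  # F(x) = x for non-negative x
```

With all weights at zero the block returns its non-negative input unchanged, which illustrates why driving the weights toward zero is enough to approach an identity mapping.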

![](image/ResNet.gif)

**Information**
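- Related work: "Highway Networks" also use shortcut connections, but with gating functions that have parameters, whereas ResNet's identity shortcuts are parameter-free.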

### Introspective Generative Modeling: Decide Discriminatively
**Authors:** Justin Lazarow et al.

- Developed Introspective Generative Modeling (IGM) that attains a generator using progressively learned deep convolutional neural networks. Generator is able to self-evaluate the difference between its generated samples and the given training data (i.e. Generator is also a discriminator).

- Unsupervised learning (much harder task because of learning complexity and assumptions) models are often *generative* and supervised classifiers are often *discriminative*

- Introspective Generative Modeling (IGM) is simultaneously a generator and a discriminator. Modeling consists of two stages during training:
    1. A pseudo-negative sampling stage (synthesis) for self-generation
    2. A CNN classifier learning stage (classification) for self evaluation and model updating.
- Some properties about IGM
    - Existing CNN classifiers can be directly made into generators (if trained properly)
    - Able to train on images of a size and generate an image of larger size while maintaining the coherence for the entire image.
- General pipeline of IGM is similar to Generative Modeling via Discriminative approach method (GDL) with boosting algorithm replaced by a CNN in IGM (demonstrated significant improvement in modeling and computational power)
- GDL learns a generative model through a sequence of discriminative classifiers (boosting) using repeatedly self-generated samples, called *pseudo-negatives*

- Differences between GDL and IGM
    - CNN in IGM results in a significant boost to feature learning
    - GDL: Markov Chain Monte Carlo based sampling process (a computational bottleneck). IGM: Backpropagation for the synthesis/sampling process
- Differences between GAN and IGM
    - IGM maintains a single model that is simultaneously a generator and a discriminator. GAN uses two CNNs, a generator and a discriminator.
    - GANs are hard to train. IGM carries out a straightforward use of backpropagation in both the sampling and the classifier training stage, making the learning process direct.
    - GAN generator is a mapping from features to images. IGM directly models the underlying statistics of an image with an efficient sampling/inference process, which makes IGM flexible.
    - GAN performs a forward pass to reconstruct an image. In IGM image synthesis is carried out using backpropagation so it is slower (but feasible)
    - IGM has larger model complexity (a cascade of ~ 60 to 200 CNN classifiers are included) than GAN.
    
----

- GAN - Discriminator helps a generator try not to be fooled by "fake" samples.
- GAN was motivated from an observation that adding small perturbations to an image leads to classification errors that are absurd to humans. 
- GAN uses a generator and a discriminator with the objective of making use of the discriminator to help the generator generate "faithful" samples. The discriminator in GAN is trained to classify between "real" and "fake" samples. It is not meant to perform a multi-class classification task.

### Introspective Classifier Learning: Empower Generatively
**Authors:** Long Jin et al.

- Proposed Introspective Classifier Learning (ICL) Framework - Discriminative classifier with generative capabilities (A single model that is simultaneously discriminative and generative)
- Studied how generative aspect of their model benefits its own discriminative training
- Developed an efficient sampling procedure to synthesize new data from discriminative classifier
- Devised *Reclassification-By-Synthesis* algorithm to iteratively augment negative samples and update the classifier.

**Discriminative Model** - Class of models used in machine learning for modeling the dependence of an unobserved variable $y$ on an observed variable $x$. Examples: Logistic Regression, SVM, Boosting, Linear Regression, Random Forests, Neural Networks.
- Models the posterior $p(y|x)$ directly or learn a direct map from inputs $x$ to class labels $y$.
- Superior performance for classification and regression tasks where joint distribution is not required.
- Samples can not be generated from the joint distribution of $x$ and $y$
- Inherently supervised and can not be easily extended to unsupervised learning.

**Generative Models** - A class of models used in machine learning for modeling how the data was generated in order to categorize an input. Examples: GMM, HMM, Naive Bayes, LDA, RBM, GAN.
- Learns a model of the joint probability $p(x, y)$ of the inputs $x$ and class label $y$, and makes predictions by using Bayes' theorem to calculate $p(y|x)$ and then picking the most likely label $y$. *The model asks: based on my generation assumptions, which category is most likely to generate this input?*
- $p(x, y)$ can be used to generate synthetic data similar to observed data
- More flexible than discriminative model in expressing dependencies in complex learning tasks.
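
The way a generative model turns a joint distribution into a prediction, and also into synthetic data, can be shown on a tiny discrete example. The joint table below is made up for illustration, and the function names are hypothetical.

```python
import random

# Toy joint distribution p(x, y) over one discrete feature and two labels.
joint = {
    ("bright", "cat"): 0.10,
    ("bright", "dog"): 0.30,
    ("dark", "cat"): 0.40,
    ("dark", "dog"): 0.20,
}


def posterior(joint, x):
    """p(y | x) = p(x, y) / p(x), with p(x) obtained by summing over labels."""
    evidence = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / evidence for (xi, y), p in joint.items() if xi == x}


def predict(joint, x):
    """Pick the most likely label under the posterior."""
    post = posterior(joint, x)
    return max(post, key=post.get)


def sample(joint, k, rng):
    """Draw k synthetic (x, y) pairs from the joint distribution."""
    pairs = list(joint)
    return rng.choices(pairs, weights=[joint[p] for p in pairs], k=k)


print(predict(joint, "dark"), predict(joint, "bright"))
print(sample(joint, 5, random.Random(0)))
```

A discriminative model would store only the posterior $p(y|x)$, so it could make the same predictions but could not produce the samples in the last line.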