# Convolutional Neural Network Notes

## Resources

Articles and notes used to construct these notes:

- [Neural Networks and Deep Learning Chapter 6](http://neuralnetworksanddeeplearning.com/chap6.html)
- [An Intuitive Explanation of Convolutional Neural Networks](https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/)
- [Understanding Convolutional Neural Networks for NLP](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/)
- [Understanding CNNs (3 Parts)](https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks/)
- [Stanford CS231n Architecture Overview](http://cs231n.github.io/convolutional-networks/)

## Concepts

### Local Receptive Fields

- A field of inputs that maps to a single neuron in the subsequent hidden (convolutional) layer.
 
![](http://neuralnetworksanddeeplearning.com/images/tikz44.png)

- The field gets shifted to map to the next neuron in the hidden layer.

![](http://neuralnetworksanddeeplearning.com/images/tikz45.png)


### Shared Weights and Biases

- Each of the neurons within a single filter (feature map) of a convolutional layer uses the same weights and bias.
- This implies
    - Each filter detects the same feature across all of the local receptive fields that connect to its neurons.
    - CNNs are well-suited for translation invariance: a feature can be detected wherever it appears in the image. Shared weights alone do not provide rotation invariance.
- The shared weights and bias are sometimes called a *kernel* or *filter*.
    - This is reflected in the Keras 1.x prototype for the `Convolution2D` layer: `Convolution2D(nb_filter=..)`, where `nb_filter` is the number of kernels or filters in that convolutional layer.
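
A minimal pure-Python sketch of the shared-weights idea: one kernel and one bias are reused at every local receptive field, producing a single feature map. The function name `feature_map`, the 5x5 image, and the 3x3 kernel are made up for illustration. The sketch assumes a stride of 1, no padding, and a square kernel, and it slides the kernel without flipping it (strictly a cross-correlation).

```python
def feature_map(image, kernel, bias):
    """Slide one shared kernel (plus one shared bias) over every local receptive field."""
    k = len(kernel)                    # kernel is assumed to be k x k
    out_rows = len(image) - k + 1      # stride 1, no padding
    out_cols = len(image[0]) - k + 1
    out = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Weighted sum over the k x k receptive field whose top-left corner is (r, c).
            total = bias
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        out.append(row)
    return out


# Illustrative values only: a 5x5 binary image and a 3x3 kernel.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
fmap = feature_map(image, kernel, bias=0)
# fmap == [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

Every entry of `fmap` is computed from the same nine weights and one bias, so this filter has 10 parameters no matter how large the image is.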

### Convolutional Layer

- A layer consisting of several filters applied over their inputs. Filters can be thought of as parallel layers within the convolutional layer.
- The number of filters in the convolutional layer is its **depth**.
- Each of the filters should be able to detect a single type of feature in its input. Generally, as you add more filters to a convolutional layer, you are able to detect more features.
- A filter in this layer has the same weights and biases across all of its neurons (shared weights and biases).

![](http://neuralnetworksanddeeplearning.com/images/tikz46.png)

- Some nice illustrations of a filter being applied across local receptive fields in an image to populate a feature map:

![](https://ujwlkarn.files.wordpress.com/2016/07/screen-shot-2016-07-24-at-11-25-24-pm.png?w=74&h=64)
![](https://ujwlkarn.files.wordpress.com/2016/07/convolution_schematic.gif?w=268&h=196)
![](https://ujwlkarn.files.wordpress.com/2016/08/giphy.gif?w=748)

- **Stride** is the number of pixels by which the filter being applied to each local receptive field is slid over the input matrix. The stride in the illustration above is 1. A stride of 2 might look like this:
![](https://adeshpande3.github.io/assets/Stride2.png)

- **Zero-padding** is when you pad the input image with a border of zeros. This allows you to apply a filter matrix to values along the borders in the same way it would be applied to values in the center. The illustration above has no padding. Using zero-padding is also referred to as **wide convolution**, whereas not using it is **narrow convolution**. Adding enough zero-padding keeps the convolution output at the same spatial dimensions as its input, so that only the pooling layers condense it.
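
Stride and zero-padding together determine the size of the feature map. For an input of width $W$, filter width $F$, padding $P$ on each side, and stride $S$, the output width is $(W - F + 2P)/S + 1$. A small sketch of that arithmetic follows; the helper name `conv_output_size` is hypothetical, and raising an error when the filter does not tile the input evenly is a choice made here, not something every library does.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output width (or height) of a convolution along one dimension."""
    span = input_size - filter_size + 2 * padding
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return span // stride + 1


narrow = conv_output_size(5, 3)              # 3: no padding shrinks the map
strided = conv_output_size(7, 3, stride=2)   # 3: a stride of 2 skips every other position
same = conv_output_size(5, 3, padding=1)     # 5: one ring of zeros preserves the size
```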

### Activation Layer

- A layer that applies an activation function after a convolution. This can also be implicitly applied at the convolution layer itself, but sometimes it's considered separately.
- Convolution by itself is linear, so adding an activation function introduces non-linearity.
- The most common activation in recent literature seems to be the Rectified Linear Unit (ReLU).

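```python
# Keras 1.x-style API; `model`, `nb_filters`, `kernel_size`, and `pool_size` are assumed to be defined earlier.
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1]))
model.add(Activation('relu'))  # ReLU applied as a separate layer after the convolution
model.add(MaxPooling2D(pool_size=pool_size))
```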
- The figure below shows how applying ReLU affects the feature map:

![](https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-07-at-6-18-19-pm.png?w=748)
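
ReLU itself is just an element-wise `max(0, x)`. A tiny sketch on a made-up 2x2 feature map (the `relu` helper and the values are illustrative only):

```python
def relu(feature_map):
    """Replace every negative value in a 2D feature map with zero."""
    return [[max(0.0, value) for value in row] for row in feature_map]


rectified = relu([[-3.0, 2.0], [0.5, -1.5]])
# rectified == [[0.0, 2.0], [0.5, 0.0]]
```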

### Pooling Layer

- Placed after a convolutional layer to condense the convolutional layer's output. (You can have multiple consecutive convolutional layers before a pooling layer.)
- Each neuron in the pooling layer takes an area of the convolutional layer (e.g. 2x2) as input and computes a condensed version of that input.
- Several ways to condense the convolutional layer:
    - **max-pooling**: take the max of the input region.
    - **average pooling**: take the mean of the input region.
    - **L2 pooling**: take the square root of the sum of the squares of the input region.
- The implications of pooling:
    - Reduce precision about exactly where a feature occurs in the convolutional output.
    - Maintain approximate spatial information about the features detected or not detected in the convolutional layer.
    - Reduce noise from small variations or distortions in the input.
    - Reduce the number of parameters that need to be learned in later layers, since pooling shrinks their input (the pooling layer itself has no weights).
    
![](http://neuralnetworksanddeeplearning.com/images/tikz47.png)
  
![](https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-10-at-3-38-39-am.png?w=494)
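
The three pooling variants differ only in how each region is reduced to one number. A rough sketch, assuming non-overlapping square regions; `pool2d`, `mean`, `l2_norm`, and the 4x4 input are hypothetical names and values chosen for illustration:

```python
import math


def pool2d(feature_map, size, reduce_fn):
    """Condense each non-overlapping size x size region into one value."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            region = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(reduce_fn(region))
        pooled.append(pooled_row)
    return pooled


def mean(region):
    return sum(region) / len(region)


def l2_norm(region):
    return math.sqrt(sum(value * value for value in region))


fmap = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
max_pooled = pool2d(fmap, 2, max)      # [[6, 8], [3, 4]]
avg_pooled = pool2d(fmap, 2, mean)     # [[3.25, 5.25], [2.0, 2.0]]
l2_pooled = pool2d(fmap, 2, l2_norm)   # same 2x2 shape, L2 norm of each region
```

The 4x4 map becomes 2x2 in every case, so the layer after it sees a quarter as many inputs.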

### Fully-Connected Layer

- The final (output) layer is fully-connected to the last pooling layer.
- The input to this layer should be thought of as a representation of the features present in the image.
- The fully connected layer uses this representation to learn to classify the image.
- Typically a softmax activation is used, which yields a probability distribution over the classes.

![](http://neuralnetworksanddeeplearning.com/images/tikz49.png)
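
A minimal sketch of the output step: a fully-connected layer turns the flattened feature representation into one score per class, and softmax turns the scores into probabilities. The helper names, the three features, and the weights are invented for illustration; subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def dense(inputs, weights, biases):
    """One fully-connected layer: each output is a weighted sum of all inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Convert raw class scores into a probability distribution."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


features = [1.0, 0.0, 2.0]          # flattened output of the last pooling layer
weights = [
    [0.5, 0.1, 0.2],                # one row of weights per class
    [0.1, 0.3, 0.9],
    [0.0, 0.2, 0.1],
]
biases = [0.0, 0.0, 0.0]
probabilities = softmax(dense(features, weights, biases))
predicted_class = probabilities.index(max(probabilities))
```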


### Dropout Layer

- A dropout layer sits between two layers and zeroes the outputs of a random subset of the preceding layer's neurons on each training step.
- This can help prevent over-fitting.
- It should only be active during training. Keras handles this automatically by disabling dropout at inference time.
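
A rough sketch of the train-versus-inference behavior, using the "inverted dropout" variant (kept activations are scaled up during training so nothing needs to change at inference). The `dropout` function, its `rng` parameter, and the rate of 0.5 are assumptions made for illustration, not a description of any library's internals.

```python
import random


def dropout(activations, rate, training, rng=random):
    """Zero each activation with probability `rate`, but only while training."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)       # inference: pass everything through unchanged
    keep = 1.0 - rate
    # Scale survivors by 1/keep so the expected activation stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


activations = [1.0] * 1000
at_inference = dropout(activations, rate=0.5, training=False)
at_training = dropout(activations, rate=0.5, training=True, rng=random.Random(0))
```

With a rate of 0.5, each training-time output is either 0.0 (dropped) or 2.0 (kept and rescaled), while the inference output equals the input.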

### Network-in-network Layers

- A 1x1 convolution acts as a dot-product across the depth (channel) dimension of the input at each spatial position.
- Example: if the input is (32x32x3), then applying a 1x1 convolution is like performing a 3-dimensional dot-product at each of the 32x32 positions.
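
The (32x32x3) example can be sketched directly: one 1x1 filter holds three weights and takes a dot-product with the three depth values at every pixel. The `conv1x1` helper, the synthetic input volume, and the weights below are made up for illustration.

```python
def conv1x1(volume, weights, bias=0):
    """Apply one 1x1 filter to an H x W x D volume, giving an H x W map."""
    return [[sum(w * v for w, v in zip(weights, pixel)) + bias for pixel in row]
            for row in volume]


# Synthetic 32x32x3 input: the three depth values at (r, c) are r, c, and 1.
volume = [[[r, c, 1] for c in range(32)] for r in range(32)]
weights = [1, 10, 100]                 # one weight per input channel
out = conv1x1(volume, weights)
# out[r][c] == 1*r + 10*c + 100*1, e.g. out[2][3] == 132
```

The spatial size stays 32x32; only the depth is collapsed, from 3 channels to 1 per filter.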

### Structure

- It seems that a typical CNN consists of a series of convolutional $\rightarrow$ pooling pairs, followed by a fully-connected layer, followed by another fully-connected output layer.
- Consider AlexNet, with the diagram below. Nielsen describes each layer under the heading "The 2012 KSH paper" [on this page](http://neuralnetworksanddeeplearning.com/chap6.html).

![](http://neuralnetworksanddeeplearning.com/images/KSH.jpg)
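
To make the convolutional, pooling, fully-connected pattern concrete, here is a small shape tracer. It is only a sketch: `trace_shapes` and its tuple format are hypothetical, convolutions are assumed to use stride 1 with no padding, pooling is assumed non-overlapping, and the layer sizes describe a small MNIST-sized network rather than AlexNet.

```python
def trace_shapes(input_shape, layers):
    """Return the (height, width, depth) produced by each layer in order."""
    h, w, d = input_shape
    shapes = []
    for kind, *args in layers:
        if kind == "conv":             # stride 1, no padding
            size, n_filters = args
            h, w, d = h - size + 1, w - size + 1, n_filters
        elif kind == "pool":           # non-overlapping regions
            (size,) = args
            h, w = h // size, w // size
        elif kind == "dense":          # flattens its input into `units` outputs
            (units,) = args
            h, w, d = 1, 1, units
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        shapes.append((h, w, d))
    return shapes


layers = [
    ("conv", 5, 20), ("pool", 2),      # first convolutional -> pooling pair
    ("conv", 5, 40), ("pool", 2),      # second pair
    ("dense", 100),                    # fully-connected layer
    ("dense", 10),                     # fully-connected output layer
]
shapes = trace_shapes((28, 28, 1), layers)
# [(24, 24, 20), (12, 12, 20), (8, 8, 40), (4, 4, 40), (1, 1, 100), (1, 1, 10)]
```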


## Training

- Haven't looked a lot into this, but on the surface it seems the same backpropagation can be used as with fully-connected networks.
- Dealing with dimensionality and various activations likely makes the implementation more difficult.
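
One detail that does differ from a fully-connected layer: because a kernel's weights are shared, the gradient for each weight is the sum of the contributions from every position where the kernel was applied. A minimal 1D sketch under assumed choices (a toy loss of half the sum of squared outputs, and made-up helper names and values):

```python
def conv1d(x, w, b):
    """1D convolution with stride 1 and no padding (kernel not flipped)."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) + b
            for i in range(len(x) - k + 1)]


def loss(y):
    """Toy loss: half the sum of squared outputs, so dL/dy[i] = y[i]."""
    return 0.5 * sum(value * value for value in y)


def shared_weight_grads(x, w, b):
    """Backpropagate the toy loss to the shared kernel weights and bias."""
    y = conv1d(x, w, b)
    dy = y                             # gradient of the toy loss w.r.t. each output
    # Each shared weight w[j] touched input x[i + j] at every output position i,
    # so its gradient sums over all of those positions.
    dw = [sum(dy[i] * x[i + j] for i in range(len(y))) for j in range(len(w))]
    db = sum(dy)                       # the shared bias feeds every output
    return dw, db


x = [1.0, 2.0, -1.0, 0.5]
w = [0.5, -1.0]
b = 0.1
dw, db = shared_weight_grads(x, w, b)
```

Comparing `dw` and `db` against finite differences of `loss(conv1d(...))` is a quick way to check that the summation is right.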

## Visualizing and Understanding

- DeConvNet (Zeiler, 2013) (founder of Clarifai)
    - [Presentation (youtube)](https://www.youtube.com/watch?v=ghEmQSxT6tw)
    - [Paper](http://www.matthewzeiler.com/pubs/arxive2013/eccv2014.pdf)


## Computer Vision Applications

### Classification

- Input: image, output: probability distribution over classes.

### Localization

- Input: image, output: bounding box around an object.

### Detection

- Input: image, output: multiple bounding boxes around objects.
- Methods
    - **Region CNN** (R-CNN)

### Segmentation

- Input: image, output: class and outline around objects.
- Papers:
    - [Fully Convolutional Networks for Semantic Segmentation](http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Long_Fully_Convolutional_Networks_2015_CVPR_paper.pdf)
    - [Fast object segmentation in unconstrained video](http://calvin.inf.ed.ac.uk/wp-content/uploads/Publications/papazoglouICCV2013-camera-ready.pdf)
    - [Shape guided object segmentation](https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/Borenstein06.pdf)
    - [Primary object regions](http://crcv.ucf.edu/papers/cvpr2013/VideoObjectSegmentation.pdf)
    - [Shape sharing](http://www.cs.utexas.edu/~grauman/papers/shape-sharing-ECCV2012.pdf)