# The Deep Learning Book (Simplified)
## Part II - Modern Practical Deep Networks
*This is a series of blog posts on the [Deep Learning book](http://deeplearningbook.org)
in which we summarize each chapter, highlighting the concepts
that we found most important, so that others can use the posts as a starting point
for reading the chapters. We also add further explanation in a few areas that we found difficult to grasp. Please refer to [this page](http://www.deeplearningbook.org/contents/notation.html) for more clarity on
notation.*


## Chapter 9: Convolutional Networks

**Convolutional networks**, also known as **convolutional neural networks**, or **CNN**s, are a specialized kind of neural network for processing data that has a known grid-like topology. <br>

The chapter is organized as follows:

**1. The Convolution Operation** <br>
**2. Motivation** <br>
**3. Pooling** <br>
**4. Convolution and Pooling as an Infinitely Strong Prior** <br>

### 1. The Convolution Operation

- The convolution operates on the **input** with a **kernel** (weights) to produce an **output map** given by:
$$ s(t) = \int x(a)w(t-a)da $$
or in discrete space as:
$$ s(t) = \sum_{a=-\infty}^{\infty} x(a)w(t-a) $$
and in 2D as:
$$ s(i,j) = \sum_m \sum_n x(m,n)w(i-m,j-n) $$
- Flipping the kernel relative to the input gives the formulation the commutative property, i.e.
$$ s(i,j) = \sum_m \sum_n x(m,n)w(i-m,j-n) = \sum_m \sum_n x(i-m,j-n)w(m,n) $$
- When the kernel isn't flipped, the result is the **cross-correlation**:
$$ s(i,j) = \sum_m \sum_n x(i+m,j+n)w(m,n) $$
This operation, however, lacks the commutative property. Many deep learning libraries implement cross-correlation but still call it convolution.
- Discrete convolution can be written as a matrix multiplication with a constrained matrix: a **Toeplitz** matrix for 1D convolution and a **doubly block circulant** matrix for 2D convolution.
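
As a minimal sketch of the definitions above, the following naive implementation computes the "full" 2D convolution and cross-correlation and checks which of the two commutes. The function names `conv2d_full` and `xcorr2d_full` are hypothetical, NumPy is an assumed dependency, and the loops favour readability over speed.

```python
import numpy as np

def conv2d_full(x, w):
    """Full 2D convolution: s(i, j) = sum_m sum_n x(m, n) * w(i - m, j - n)."""
    xh, xw = x.shape
    kh, kw = w.shape
    s = np.zeros((xh + kh - 1, xw + kw - 1))
    for m in range(xh):
        for n in range(xw):
            # x[m, n] adds a scaled copy of the kernel at offset (m, n)
            s[m:m + kh, n:n + kw] += x[m, n] * w
    return s

def xcorr2d_full(x, w):
    """Full 2D cross-correlation (up to an index offset), computed as
    convolution with the kernel rotated by 180 degrees."""
    return conv2d_full(x, w[::-1, ::-1])

# Illustrative values only
x = np.arange(12.0).reshape(3, 4)
w = np.array([[1.0, 2.0], [3.0, 4.0]])

conv_commutes = np.allclose(conv2d_full(x, w), conv2d_full(w, x))
xcorr_commutes = np.allclose(xcorr2d_full(x, w), xcorr2d_full(w, x))
print(conv_commutes, xcorr_commutes)
```

For these values `conv_commutes` should come out `True` and `xcorr_commutes` `False`: swapping the arguments of the cross-correlation returns the same numbers rotated by 180 degrees rather than the same array.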

### 2. Motivation

- **Sparse interactions**: Each output unit is connected to (affected by) only a subset of the input units, because the kernel is smaller than the input. This significantly reduces the number of parameters and hence the number of computations in the operation. In *deep CNNs*, the units in the deeper layers interact *indirectly* with large subsets of the input, which allows complex interactions to be modelled through sparse connections.

- **Parameter sharing**: In a CNN, each kernel weight is used at every input position (except perhaps at the boundaries, where different padding rules may apply). If the same linear operation needs to be applied at every position of an input image, the convolutional representation is much more economical than the equivalent fully connected one. Fewer parameters also means lower memory requirements and better statistical efficiency.

- **Equivariance**: Parameter sharing also provides **equivariance to translation**
    - A function *f* is said to be equivariant to a function *g* if $f(g(x)) = g(f(x))$, i.e. if the input changes, the output changes in the same way
    - Here, translating the image results in a corresponding translation of the output map (except perhaps for boundary pixels)
    - Note that the convolution operation by itself is not equivariant to changes in scale or rotation.
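
The equivariance condition $f(g(x)) = g(f(x))$ can be checked numerically. The sketch below is one possible illustration in 1D, assuming NumPy: `f` is a convolution with an arbitrary fixed kernel and `g` is a shift. The kernel, the signal and the shift amount are made up for the example, and the nonzero values are kept away from the borders so that boundary effects do not interfere.

```python
import numpy as np

w = np.array([1.0, 0.0, -1.0])  # illustrative kernel, not from the book

def f(x):
    """Convolve with the fixed kernel, keeping the input length."""
    return np.convolve(x, w, mode="same")

def g(x, shift=3):
    """Translate the signal to the right by `shift` positions."""
    return np.roll(x, shift)

# Nonzero values sit in the interior, so neither the zero padding of
# np.convolve nor the wrap-around of np.roll touches them.
x = np.zeros(10)
x[2:5] = [1.0, 2.0, 3.0]

is_equivariant = np.allclose(f(g(x)), g(f(x)))
print(is_equivariant)
```

Shifting then convolving gives the same result as convolving then shifting. Moving the nonzero values to the edge of `x` would break the equality, which is the boundary caveat mentioned above.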

### 3. Pooling

A convolutional layer can be broken into the following components:

1. Convolution
2. Activation (detector stage)
3. Pooling


- The pooling function replaces the output at each location with a **summary statistic** of the nearby outputs. Common statistics are the max, mean, weighted average and $L^2$ norm of a surrounding rectangular window.
- Pooling makes the representation approximately **invariant to small translations** of the input: a small shift in the input leaves most of the pooled outputs unchanged. This is useful when we care more about whether a feature is present than about exactly where it is in the image. It is also a strong requirement on the learned representation.
- Pooling over feature channels (rather than over spatial positions) can be used to learn invariance to certain transformations of the input. For example, units in a layer may learn differently rotated versions of a feature and then be pooled over, giving an output that is invariant to rotation. This idea is used in [maxout networks](http://proceedings.mlr.press/v28/goodfellow13.pdf).
- Pooling reduces the size of the input to the next layer, which in turn reduces the number of computations required in the subsequent layers.
- Variable-sized inputs are a problem for a fully connected layer, which expects a fixed input size. To counter this, the pooling operation may be performed over a fixed number of regions of the input (such as its four quadrants), so that the output size stays the same regardless of the input size.
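
Two of the points above can be sketched in a few lines, assuming NumPy. `max_pool_1d` (a hypothetical helper) pools over a window of three values with stride 1, padding the borders with $-\infty$ so the output has the same length as the input. `quadrant_max_pool` (also hypothetical) pools over the four quadrants of an image, so the output is 2x2 whatever the input size. The detector values are illustrative.

```python
import numpy as np

def max_pool_1d(x, width=3):
    """Max over a sliding window (stride 1), padded so output length == input length."""
    pad = width // 2
    padded = np.pad(x, pad, constant_values=-np.inf)
    return np.array([padded[i:i + width].max() for i in range(len(x))])

def quadrant_max_pool(img):
    """Max over each of the four quadrants; the output is always 2x2."""
    h, w = img.shape
    rows = (slice(0, h // 2), slice(h // 2, h))
    cols = (slice(0, w // 2), slice(w // 2, w))
    return np.array([[img[r, c].max() for c in cols] for r in rows])

# Detector-stage outputs before and after shifting the input right by one
detector = np.array([0.1, 1.0, 0.2, 0.1])
shifted = np.array([0.3, 0.1, 1.0, 0.2])

changed_inputs = int(np.sum(detector != shifted))
changed_pooled = int(np.sum(max_pool_1d(detector) != max_pool_1d(shifted)))
print(changed_inputs, changed_pooled)

# Inputs of different sizes give outputs of the same size
small = quadrant_max_pool(np.arange(24.0).reshape(4, 6))
large = quadrant_max_pool(np.arange(35.0).reshape(7, 5))
print(small.shape, large.shape)
```

Every detector value changes under the shift, but only half of the pooled values do. The quadrant pooling returns a 2x2 array for both the 4x6 and the 7x5 input, which is what lets a fixed-size fully connected layer follow it.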


[Theoretical guidelines](http://www.di.ens.fr/willow/pdfs/icml2010b.pdf) for which pooling to use have been studied. [Dynamic pooling](http://yann.lecun.com/exdb/publis/pdf/boureau-iccv-11.pdf) has also been studied.

### 4. Convolution and Pooling as an Infinitely Strong Prior

**What is a weight prior?** Assumptions about the weights (before learning), in terms of acceptable values and range, are encoded in the *prior* distribution over the weights. A *weak prior* has high variance and reflects low confidence in the initial value of a weight, so the data can move it freely. A *strong prior* has low variance and confines the weight to a narrow range of values about which we are confident before learning begins. An *infinitely strong prior* forbids certain values completely by assigning them zero probability, no matter how much the data supports them.

If we view a convolutional layer as a fully connected layer, **convolution imposes an infinitely strong prior** by placing the following restrictions on the weights:
1. The weights of one hidden unit must be identical to those of its neighbour, but shifted in space
2. Except for a small, spatially contiguous region (the unit's receptive field), all other weights must be zero


Likewise, the **pooling stage imposes an infinitely strong prior** by requiring each unit to be invariant to small translations.
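
The two restrictions on the convolution weights can be made concrete by writing a 1D convolution as the dense (Toeplitz) weight matrix of an equivalent fully connected layer. This is a minimal sketch assuming NumPy. `conv_as_dense` is a hypothetical helper, and the kernel and input are arbitrary illustrative values.

```python
import numpy as np

def conv_as_dense(w, n):
    """Dense weight matrix equivalent to a 'valid' 1D convolution
    of a length-n input with kernel w."""
    k = len(w)
    T = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        # every row holds the same (flipped) kernel, shifted one step right
        T[i, i:i + k] = w[::-1]
    return T

w = np.array([1.0, 2.0, 3.0])
x = np.arange(8.0)
T = conv_as_dense(w, len(x))

matches = np.allclose(T @ x, np.convolve(x, w, mode="valid"))
total_weights = T.size                        # what an unconstrained dense layer would learn
nonzero_weights = np.count_nonzero(T)         # restriction 2: all other entries are fixed at zero
free_parameters = len(np.unique(T[T != 0]))   # restriction 1: rows reuse the same values
print(matches, total_weights, nonzero_weights, free_parameters)
```

For this 8-element input and 3-element kernel, the dense layer has 48 weights, of which only 18 are allowed to be nonzero, and those 18 are copies of just 3 free parameters. The same counts illustrate the sparse interactions and parameter sharing discussed in the Motivation section.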

Insights:
1. Convolution and pooling can cause underfitting if the priors they impose are not suitable for the task. For example, if a task depends on precise spatial information, pooling over all features can increase the training error.
2. Convolutional models should only be compared with other convolutional models. Models that do not use convolution, such as fully connected networks, are **permutation invariant**: they can learn even when the input features are permuted (thus losing the spatial relationships), but they must then learn those spatial relationships from data, whereas CNNs have them hard-coded.
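
A rough sketch of the second insight, assuming NumPy and made-up values: a dense layer can absorb any fixed shuffling of its inputs by shuffling its weight columns in the same way, so nothing is lost when pixels are permuted. A convolution kernel is tied to neighbouring positions and has no such freedom.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1.0, 7.0)                # illustrative input features
perm = np.array([3, 0, 5, 1, 4, 2])    # a fixed shuffling of the features

# Dense layer: permuting the weight columns compensates for the shuffle exactly
W = rng.normal(size=(4, 6))
dense_same = np.allclose(W @ x, W[:, perm] @ x[perm])

# Convolution: the same kernel now sees different neighbours
w = np.array([1.0, 2.0, 3.0])
conv_same = np.allclose(np.convolve(x, w, mode="valid"),
                        np.convolve(x[perm], w, mode="valid"))
print(dense_same, conv_same)
```

Here `dense_same` should be `True` and `conv_same` `False`. The dense model has an equivalent set of weights for the shuffled input, so training on shuffled data is no harder for it, while the convolutional model relies on the original spatial order.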