# Convolutional Networks

**Convolutional networks**, also known as **convolutional neural networks**, or CNNs, are a specialized kind of neural network for processing data that has a known grid-like topology. 

Examples include time-series data, which can be thought of as a 1-D grid of samples taken at regular time intervals, and image data, which can be thought of as a 2-D grid of pixels.

Convolutional networks have been tremendously successful in practical applications. The name “convolutional neural network” indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

## The Convolution Operation

In its most general form, convolution is an operation on two functions of a real-valued argument. To motivate the definition of convolution, we start with examples of two functions we might use.

Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output $x(t)$, the position of the spaceship at time $t$. Both $x$ and $t$ are real valued, that is, we can get a different reading from the laser sensor at any instant in time.

Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship’s position, we would like to average several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We can do this with a weighting function $w(a)$, where $a$ is the age of a measurement. If we apply such a weighted average operation at every moment, we obtain a new function $s$ providing a smoothed estimate of the position of the spaceship:

$$
s(t) = \int x(a)w(t-a)\,da
$$

In practice, the sensor might provide data only at regular intervals, say once per second. The time index $t$ can then take on only integer values. If we now assume that $x$ and $w$ are defined only on integer $t$, we can define the discrete convolution, and the integral becomes

$$
s(t) = \sum_{a=-\infty}^\infty x(a)w(t-a)
$$

This operation is called **convolution**. The convolution operation is typically denoted with an asterisk: 

$$
s(t) = (x*w)(t)
$$
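As a rough illustration of the discrete sum, here is a minimal pure-Python sketch. It assumes both sequences are finite and zero outside their stored indices; the function name `convolve_1d`, the readings in `x`, and the weights in `w` are made up for this example.

```python
def convolve_1d(x, w):
    """Full discrete convolution s(t) = sum_a x(a) * w(t - a).

    x and w are finite lists, treated as zero outside their indices.
    """
    s = [0.0] * (len(x) + len(w) - 1)
    for a, x_a in enumerate(x):
        for b, w_b in enumerate(w):
            s[a + b] += x_a * w_b  # b plays the role of the age t - a
    return s


# Hypothetical noisy position readings, one per second.
x = [0.0, 1.2, 1.9, 3.1, 3.9, 5.2]
# Hypothetical weights by age: w[0] for the newest reading, w[2] for the oldest.
w = [0.5, 0.3, 0.2]

s = convolve_1d(x, w)
# s[2] is the weighted average 0.5 * x[2] + 0.3 * x[1] + 0.2 * x[0].
print(s[2])
```

Because the weights sum to one, each interior value of `s` is a weighted average of the three most recent readings, with the newest reading counting most.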

In convolutional network terminology, the first argument (in this example, the function $x$) to the convolution is often referred to as the **input**, and the second argument (in this example, the function $w$) as the **kernel**. The output is sometimes referred to as the **feature map**.

Finally, we often use convolutions over more than one axis at a time. For example, if we use a two-dimensional image $I$ as our input, we probably also want to use a two-dimensional kernel $K$:

$$
S(i,j) = (I*K)(i,j) = \sum_m\sum_n I(m,n)K(i-m,j-n)
$$

Convolution is commutative, meaning we can equivalently write

$$
S(i,j) = (K*I)(i,j) = \sum_m\sum_n I(i-m,j-n)K(m,n)
$$

Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of $m$ and $n$. In practice, many neural network libraries instead implement a related function called the **cross-correlation**, which is the same as convolution but without flipping the kernel. We write it with $\star$ here to distinguish it from convolution:

$$
S(i,j) = (K \star I)(i,j) = \sum_m\sum_n I(i+m,j+n)K(m,n)
$$

<img src="img/matrix.png">

An example of 2-D convolution without kernel flipping. We restrict the output to only positions where the kernel lies entirely within the image, called “valid” convolution in some contexts. We draw boxes with arrows to indicate how the upper-left element of the output tensor is formed by applying the kernel to the corresponding upper-left region of the input tensor.
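The difference between the two operations can be sketched in a few lines of pure Python. This is an illustrative sketch on nested lists, not how any particular library implements it; the function names and the small `image` and `kernel` are assumptions made for the example.

```python
def cross_correlate_2d_valid(image, kernel):
    """S(i, j) = sum_m sum_n I(i + m, j + n) * K(m, n), "valid" positions only."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def flip(kernel):
    """Reverse the kernel along both axes."""
    return [row[::-1] for row in kernel[::-1]]


def convolve_2d_valid(image, kernel):
    """True convolution, computed as cross-correlation with the flipped kernel."""
    return cross_correlate_2d_valid(image, flip(kernel))


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
kernel = [[1, 0],
          [0, -1]]

correlated = cross_correlate_2d_valid(image, kernel)
convolved = convolve_2d_valid(image, kernel)
print(correlated)  # [[-5, -5, -5], [-5, -5, -5]]
print(convolved)   # [[5, 5, 5], [5, 5, 5]]
```

For this asymmetric kernel the two results differ in sign. When the kernel is learned, the distinction rarely matters, because the network can simply learn the flipped kernel.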

Discrete convolution can be viewed as multiplication by a matrix, but the matrix has several entries constrained to be equal to other entries. For example, for univariate discrete convolution, each row of the matrix is constrained to be equal to the row above shifted by one element. This is known as a **Toeplitz matrix**. In two dimensions, a **doubly block circulant matrix** corresponds to convolution. In addition to these constraints that several elements be equal to each other, convolution usually corresponds to a very sparse matrix (a matrix whose entries are mostly equal to zero). This is because the kernel is usually much smaller than the input image.
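To make the matrix view concrete, the sketch below builds the Toeplitz matrix for a small 1-D kernel and checks that multiplying by it reproduces the convolution. It assumes "full" convolution with zero padding; the helper names (`toeplitz_for_kernel`, `matvec`, `convolve_1d`) are hypothetical and exist only for this example.

```python
def convolve_1d(x, w):
    """Full discrete convolution of two finite lists."""
    s = [0] * (len(x) + len(w) - 1)
    for a, x_a in enumerate(x):
        for b, w_b in enumerate(w):
            s[a + b] += x_a * w_b
    return s


def toeplitz_for_kernel(w, input_len):
    """Matrix T with T[t][a] = w[t - a], and zero where t - a is outside the kernel."""
    rows = input_len + len(w) - 1
    return [
        [w[t - a] if 0 <= t - a < len(w) else 0 for a in range(input_len)]
        for t in range(rows)
    ]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]


x = [1, 2, 3, 4]
w = [1, -1]
T = toeplitz_for_kernel(w, len(x))
for row in T:
    print(row)  # each row is the row above shifted right by one element

via_matrix = matvec(T, x)
via_convolution = convolve_1d(x, w)
print(via_matrix == via_convolution)  # True

nonzero = sum(1 for row in T for entry in row if entry != 0)
print(nonzero, "of", len(T) * len(T[0]), "entries are nonzero")
```

Even in this tiny case most entries are zero, and the matrix holds only two distinct nonzero values, the two kernel weights.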

## Motivation

Convolution leverages three important ideas that can help improve a machine learning system:
- sparse interactions
- parameter sharing
- equivariant representations

Moreover, convolution provides a means for working with inputs of variable size.

Traditional neural network layers use matrix multiplication by a matrix of parameters with a separate parameter describing the interaction between each input unit and each output unit. This means that every output unit interacts with every input unit. 

Convolutional networks, however, typically have sparse interactions (also referred to as sparse connectivity or sparse weights). This is accomplished by making the kernel smaller than the input. For example, when processing an image, the input image might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations.

<img src="img/ex.png" height="10%" weight="10%">

Sparse connectivity, viewed from below. We highlight one input unit, $x_3$, and highlight the output units in $s$ that are affected by this unit. (Top) When $s$ is formed by convolution with a kernel of width 3, only three outputs are affected by $x_3$. (Bottom) When $s$ is formed by matrix multiplication, connectivity is no longer sparse, so all the outputs are affected by $x_3$.

<img src="img/ex2.png" height="10%" weight="10%">

Sparse connectivity, viewed from above. We highlight one output unit, $s_3$, and highlight the input units in $x$ that affect this unit. These units are known as the receptive field of $s_3$. (Top) When $s$ is formed by convolution with a kernel of width 3, only three inputs affect $s_3$. (Bottom) When $s$ is formed by matrix multiplication, connectivity is no longer sparse, so all the inputs affect $s_3$.

<img src="img/ex3.png" height="10%" weight="10%"> 

The receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers. This effect increases if the network includes architectural features like strided convolution or pooling. This means that even though direct connections in a convolutional net are very sparse, units in the deeper layers can be indirectly connected to all or most of the input image.
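The growth of the receptive field with depth can be estimated with a short recurrence. The sketch below is not from the text: it uses the standard bookkeeping for a chain of 1-D layers, each described by a kernel width and a stride, and the function name and layer settings are assumptions chosen for illustration.

```python
def receptive_field(layers):
    """Receptive field width, in input units, of one unit after the given layers.

    layers: list of (kernel_width, stride) pairs, ordered from input to output.
    """
    field = 1  # a unit of the input sees only itself
    jump = 1   # distance, in input units, between adjacent units of the current layer
    for kernel_width, stride in layers:
        field += (kernel_width - 1) * jump
        jump *= stride
    return field


one_layer = receptive_field([(3, 1)])
two_layers = receptive_field([(3, 1), (3, 1)])
two_strided_layers = receptive_field([(3, 2), (3, 2)])

print(one_layer)           # 3
print(two_layers)          # 5
print(two_strided_layers)  # 7
```

Stacking a second width-3 layer widens the field from 3 to 5 inputs, and using a stride of 2 in each layer widens it further to 7.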

These improvements in efficiency are usually quite large. If there are $m$ inputs and $n$ outputs, then matrix multiplication requires $m\times n$ parameters, and the algorithms used in practice have $O(m\times n)$ runtime (per example). If we limit the number of connections each output may have to $k$, then the sparsely connected approach requires only $k\times n$ parameters and $O(k\times n)$ runtime. For many practical applications, it is possible to obtain good performance on the machine learning task while keeping $k$ several orders of magnitude smaller than $m$. In a deep convolutional network, units in the deeper layers may indirectly interact with a larger portion of the input. This allows the network to efficiently describe complicated interactions between many variables by constructing such interactions from simple building blocks that each describe only sparse interactions.

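Parameter sharing refers to using the same parameter for more than one function in a model. In a traditional neural net, each element of the weight matrix is used exactly once when computing the output of a layer. In a convolutional net, by contrast, each element of the kernel is used at every position of the input (except perhaps some boundary positions), so the layer learns a single set of parameters rather than a separate set for every location.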

As an example of both of these first two principles in action, the figure below shows how sparse connectivity and parameter sharing can dramatically improve the efficiency of a linear function for detecting edges in an image.

<img src="img/ex4.png" height="10%" weight="10%">

Efficiency of edge detection. The image on the right was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left. This shows the strength of all the vertically oriented edges in the input image, which can be a useful operation for object detection. Both images are 280 pixels tall. The input image is 320 pixels wide, while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements, and requires 319×280×3 = 267,960 floating-point operations (two multiplications and one addition per output pixel) to compute using convolution. To describe the same transformation with a matrix multiplication would take 320×280×319×280, or over eight billion, entries in the matrix, making convolution four billion times more efficient for representing this transformation. The straightforward matrix multiplication algorithm performs over sixteen billion floating-point operations, making convolution roughly 60,000 times more efficient computationally. Of course, most of the entries of the matrix would be zero. If we stored only the nonzero entries of the matrix, then both matrix multiplication and convolution would require the same number of floating-point operations to compute. The matrix would still need to contain 2×319×280 = 178,640 entries. Convolution is an extremely efficient way of describing transformations that apply the same linear transformation of a small local region across the entire input.
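The figures quoted in the caption can be reproduced with a few lines of arithmetic. The variable names below are made up for this check, and the dense operation count assumes one multiplication and one addition per matrix entry.

```python
height = 280
in_width = 320
out_width = in_width - 1  # each output pixel needs a left neighbor, so 319

out_pixels = out_width * height
kernel_entries = 2

conv_flops = out_pixels * 3                       # two multiplications, one addition
dense_entries = (in_width * height) * out_pixels  # one entry per input/output pair
dense_flops = 2 * dense_entries                   # assumed: one multiply, one add per entry
sparse_entries = 2 * out_pixels                   # nonzero entries only

print(conv_flops)                       # 267960
print(dense_entries)                    # 8003072000
print(dense_entries // kernel_entries)  # 4001536000
print(round(dense_flops / conv_flops))  # 59733
print(sparse_entries)                   # 178640
```

The ratios match the caption: about four billion times fewer stored values, and roughly 60,000 times fewer floating-point operations than the dense product.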

In the case of convolution, the particular form of parameter sharing causes the layer to have a property called **equivariance** to translation. To say a function is equivariant means that if the input changes, the output changes in the same way. Specifically, a function $f$ is equivariant to a function $g$ if $f(g(x)) = g(f(x))$. In the case of convolution, let $I$ be a function giving image brightness at integer coordinates. Let $g$ be a function mapping one image function to another image function, such that $I' = g(I)$ is the image function with $I'(x, y) = I(x - 1, y)$. This shifts every pixel of $I$ one unit to the right. If we apply this transformation to $I$, then apply convolution, the result will be the same as if we applied convolution to $I$, then applied the transformation $g$ to the output.
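The identity $f(g(x)) = g(f(x))$ can be checked numerically. The sketch below uses a 1-D signal as a simplified stand-in for the image, treats the signal as zero outside its stored samples, and uses full convolution so that nothing falls off the edge; the function names and sample values are assumptions made for the example.

```python
def convolve_1d(x, w):
    """Full discrete convolution of two finite lists (zero outside their indices)."""
    s = [0] * (len(x) + len(w) - 1)
    for a, x_a in enumerate(x):
        for b, w_b in enumerate(w):
            s[a + b] += x_a * w_b
    return s


def shift_right(signal):
    """g: move every sample one position to the right, so g(I)(t) = I(t - 1)."""
    return [0] + list(signal)


x = [3, 1, 4, 1, 5]  # hypothetical signal
w = [1, 2, 1]        # hypothetical kernel

shift_then_convolve = convolve_1d(shift_right(x), w)
convolve_then_shift = shift_right(convolve_1d(x, w))

print(shift_then_convolve)
print(shift_then_convolve == convolve_then_shift)  # True
```

Shifting the input simply shifts the feature map by the same amount, which is what equivariance to translation means.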

## Pooling

A typical layer of a convolutional network consists of three stages. In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations. In the second stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called the **detector stage**. In the third stage, we use a **pooling function** to modify the output of the layer further.

<img src="img/ex5.png" height="10%" weight="10%">

A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. For example, the max pooling operation reports the maximum output within a rectangular neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the $L^2$ norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel.

In all cases, pooling helps to make the representation approximately **invariant** to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change.

<img src="img/ex6.png" height="10%" weight="10%">

Max pooling introduces invariance. (Top) A view of the middle of the output of a convolutional layer. The bottom row shows outputs of the nonlinearity. The top row shows the outputs of max pooling, with a stride of one pixel between pooling regions and a pooling region width of three pixels. (Bottom) A view of the same network, after the input has been shifted to the right by one pixel. Every value in the bottom row has changed, but only half of the values in the top row have changed, because the max pooling units are sensitive only to the maximum value in the neighborhood, not its exact location.
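The effect described in the caption can be reproduced on a small example. This sketch uses a pool width of three and a stride of one; the detector values are invented for illustration, and only full-width windows are pooled, which is a simplifying assumption.

```python
def max_pool_1d(values, width, stride):
    """Max over each full window of `width` values, with windows `stride` apart."""
    return [
        max(values[i:i + width])
        for i in range(0, len(values) - width + 1, stride)
    ]


# Hypothetical detector-stage outputs.
detector = [0.1, 1.0, 0.2, 0.1, 0.0, 0.1]
# The same outputs shifted right by one; 0.3 is a made-up value entering from the left.
shifted = [0.3] + detector[:-1]

pooled = max_pool_1d(detector, width=3, stride=1)
pooled_shifted = max_pool_1d(shifted, width=3, stride=1)

changed_detector = sum(a != b for a, b in zip(detector, shifted))
changed_pooled = sum(a != b for a, b in zip(pooled, pooled_shifted))

print(pooled)          # [1.0, 1.0, 0.2, 0.1]
print(pooled_shifted)  # [1.0, 1.0, 1.0, 0.2]
print(changed_detector, "of", len(detector), "detector values changed")
print(changed_pooled, "of", len(pooled), "pooled values changed")
```

All six detector values change under the shift, but only two of the four pooled values do.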

Pooling over spatial regions produces invariance to translation, but if we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to.

Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling units than detector units, by reporting summary statistics for pooling regions spaced $k$ pixels apart rather than 1 pixel apart. See figure 9.10 for an example. This improves the computational efficiency of the network because the next layer has roughly $k$ times fewer inputs to process.

<img src="img/ex7.png" height="10%" weight="10%">

Example of learned invariances. A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input. Here we show how a set of three learned filters and a max pooling unit can learn to become invariant to rotation. All three filters are intended to detect a handwritten 5. Each filter attempts to match a slightly different orientation of the 5. When a 5 appears in the input, the corresponding filter will match it and cause a large activation in a detector unit. The max pooling unit then has a large activation regardless of which detector unit was activated. We show here how the network processes two different inputs, resulting in two different detector units being activated. The effect on the pooling unit is roughly the same either way. This principle is leveraged by maxout networks (Goodfellow et al., 2013a) and other convolutional networks. Max pooling over spatial positions is naturally invariant to translation; this multichannel approach is only necessary for learning other transformations.

<img src="img/ex8.png" height="10%" weight="10%">

Pooling with downsampling. Here we use max pooling with a pool width of three and a stride between pools of two. This reduces the representation size by a factor of two, which reduces the computational and statistical burden on the next layer. Note that the rightmost pooling region has a smaller size but must be included if we do not want to ignore some of the detector units.
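A minimal sketch of this kind of downsampling follows, with a pool width of three and a stride of two. It assumes a non-empty input and a stride no larger than the pool width, and it keeps a shorter final window so that no detector unit is dropped. The function name and detector values are illustrative.

```python
def max_pool_1d_downsample(values, width, stride):
    """Max pooling that allows a shorter final window so every unit is covered.

    Assumes `values` is non-empty and stride <= width.
    """
    pooled = []
    start = 0
    while True:
        pooled.append(max(values[start:start + width]))  # slice is cut off at the edge
        if start + width >= len(values):
            break
        start += stride
    return pooled


# Hypothetical detector-stage outputs.
detector = [0.1, 1.0, 0.2, 0.1, 0.0, 0.1]
downsampled = max_pool_1d_downsample(detector, width=3, stride=2)
print(downsampled)  # [1.0, 0.2, 0.1]
```

Six detector units are summarized by three pooling units, and the last pooling unit covers only the final two detector values.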

Some examples of complete convolutional network architectures for classification using convolution and pooling are shown in the figure below.

<img src="img/ex9.png" height="10%" weight="10%">

Examples of architectures for classification with convolutional networks. The specific strides and depths used in this figure are not advisable for real use; they are designed to be very shallow to fit onto the page. Real convolutional networks also often involve significant amounts of branching, unlike the chain structures used here for simplicity. (Left) A convolutional network that processes a fixed image size. After alternating between convolution and pooling for a few layers, the tensor for the convolutional feature map is reshaped to flatten out the spatial dimensions. The rest of the network is an ordinary feedforward network classifier, as described in chapter 6. (Center) A convolutional network that processes a variably sized image but still maintains a fully connected section. This network uses a pooling operation with variably sized pools but a fixed number of pools, in order to provide a fixed-size vector of 576 units to the fully connected portion of the network. (Right) A convolutional network that does not have any fully connected weight layer. Instead, the last convolutional layer outputs one feature map per class. The model presumably learns a map of how likely each class is to occur at each spatial location. Averaging a feature map down to a single value provides the argument to the softmax classifier at the top.
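The final step of the (Right) architecture, averaging each per-class feature map to a single score and passing the scores to a softmax, might look roughly like the sketch below. The three 2×2 feature maps are invented values, and the function names are assumptions made for this example.

```python
import math


def global_average(feature_map):
    """Average every value of a 2-D feature map down to a single number."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output of the last convolutional layer: one 2x2 map per class.
feature_maps = [
    [[1.0, 3.0], [2.0, 2.0]],
    [[0.0, 0.0], [1.0, -1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]

scores = [global_average(fm) for fm in feature_maps]
probabilities = softmax(scores)

print(scores)  # [2.0, 0.0, 0.5]
print(probabilities)
```

Because the average is taken over however many spatial positions the map has, the same code works for feature maps of any size, which is what lets this architecture accept variably sized images.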