# Chapter 7: Convolutional Neural Networks

### Dhuvi Karthikeyan

2/07/2023




## 7.1 From Fully Connected Layers to Convolutions

The assumptions of tabular data:
    
    * Tabular data is where the rows are examples and the columns are features
    * Assume no a priori structure on how the features interact with each other
    * A one-megapixel image has 10^6 input features, so a fully connected first layer would need 10^6 [Input_Dim] * Hidden_Dim params
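To make the parameter count concrete, here is a back-of-the-envelope calculation. The hidden width of 1,000 units is an assumption chosen purely for illustration; the notes above do not fix a value.

```python
# Parameter count for the first fully connected layer on a 1-megapixel image.
input_dim = 10**6      # one input feature per pixel (grayscale)
hidden_dim = 1_000     # ASSUMED hidden width, for illustration only

mlp_params = input_dim * hidden_dim + hidden_dim  # weights + biases
print(f"{mlp_params:,}")  # 1,000,001,000
```

Even with this modest hidden width, the first layer alone holds about a billion parameters.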
    

### 7.1.1 Invariance

Desiderata:

* Early layers need to focus on more local features (locality) whereas later layers need to focus on global features
* Early layers should respond similarly to similar patches of the image, regardless of where they appear in the global image (translational equivariance).
    * Invariance vs. Equivariance: 
        * Invariance: The output does not change when the input is translated
        * Equivariance: The output is translated by the same amount as the input
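A minimal sketch of translation equivariance, assuming PyTorch's `F.conv2d` (which computes the cross-correlation used by convolutional layers). The image size, kernel size, and one-pixel shift are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(1, 1, 5, 8)   # (batch, channels, height, width)
K = torch.randn(1, 1, 2, 2)   # (out_channels, in_channels, kernel_h, kernel_w)

# Shift the image one pixel to the right (the last column wraps around to column 0).
X_shifted = torch.roll(X, shifts=1, dims=3)

Y = F.conv2d(X, K)                  # shape (1, 1, 4, 7)
Y_shifted = F.conv2d(X_shifted, K)

# Away from the wrapped-around border, the new output is the old output
# shifted by the same one pixel: the layer is translation equivariant.
equivariant = torch.allclose(Y_shifted[..., 1:], Y[..., :-1], atol=1e-6)
print(equivariant)  # True
```

The first output column is excluded from the comparison because it mixes in the column that wrapped around.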
        

### 7.1.2 Constraining the MLP

Define $\textbf{X}$ to be the input image and $\textbf{H}$ its immediate hidden representation, both second-order tensors (matrices) of the same shape. To have all of the input pixels inform each hidden unit, we would need a 4th order weight tensor $W$ to index the interaction between $X_{k,l}$ and $H_{i,j}$. Each hidden unit also needs its own bias, so the bias vector becomes a bias matrix $\textbf{U}$.


#### Formulation
$$ \textbf{H}_{i,j} = \textbf{U}_{i,j} + \sum_{k}\sum_{l}W_{i,j,k,l}X_{k,l}$$
$$ = \textbf{U}_{i,j} + \langle W_{i,j}, X \rangle_F $$

This is essentially indexing the fourth order weight tensor along its first two axes to get the matrix $W_{i,j}$ that corresponds to the (i,j) entry of the hidden representation. For each $\textbf{H}_{i,j}$ we multiply each entry of $W_{i,j}$ with the matching entry of $X$ and sum over all of these products, which is the Frobenius inner product of the two matrices $W_{i,j}$ and $X$. To this inner product we add the bias from the bias matrix, specifically the entry at (i,j).
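The formulation above can be sketched with `torch.einsum`. The 4x5 image size is an assumption made only to keep the 4th order tensor small enough to inspect.

```python
import torch

torch.manual_seed(0)
h, w = 4, 5                      # tiny "image" so the weight tensor stays small
X = torch.randn(h, w)
W = torch.randn(h, w, h, w)      # W[i, j, k, l] couples hidden unit (i, j) to pixel (k, l)
U = torch.randn(h, w)            # one bias per hidden unit

# H[i, j] = U[i, j] + sum_k sum_l W[i, j, k, l] * X[k, l]
H = U + torch.einsum("ijkl,kl->ij", W, X)

# The same entry written as a Frobenius inner product of W[i, j] with X.
i, j = 2, 3
h_ij = U[i, j] + (W[i, j] * X).sum()
print(torch.allclose(H[i, j], h_ij, atol=1e-5))  # True
print(W.numel())  # 400 weights, i.e. (h * w) ** 2
```

The weight count grows with the square of the number of pixels, which is what the constraints in the following subsections remove.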

#### Translation Equivariance

For translation equivariance, the weight that connects input pixel (k,l) to hidden unit (i,j) should depend only on the offset between the two positions, (k-i, l-j), and not on the absolute position (i,j). As such we remove the (i,j) dependence from the bias matrix U and instead have a constant u. The 4th order weight tensor also reduces to a 2nd order one (matrix) indexed by the offset, and we get the following:

$$ \textbf{H}_{i,j} = u + \sum_k \sum_l \textbf{W}_{k-i,l-j}X_{k,l}$$

#### Locality

The idea behind this principle is to zero out the weights for pixels outside a specific window around the target pixel, so that only the local features of the image are incorporated into the hidden state at (i,j). We start by re-indexing the pixels with the offsets $a = k - i$ and $b = l - j$, which run over positive and negative values such that the edges of the image are reached but never crossed. Locality then restricts both offsets to the range $[-\Delta, \Delta]$.

$$ \text{Reindexing gives: } \textbf{H}_{i,j} = u + \sum_{a}\sum_{b}W_{a,b}X_{i+a,j+b}$$ 
$$ \textbf{H}_{i,j} = u + \sum_{a=-\Delta}^\Delta\sum_{b=-\Delta}^\Delta W_{a,b}X_{i+a,j+b}$$

The equation above is also known as a convolutional layer. The weight matrix of a convolutional layer, of dimensions $(2\Delta + 1) \times (2\Delta + 1)$, is also known as the convolution kernel or filter. We have thus removed most of the parameters by imposing an inductive bias which, when it agrees with the data, results in sample-efficient models that generalize well.
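A minimal sketch of this layer, written directly from the formula with zero padding outside the image. The helper name `local_layer`, the image size, and the values of `delta` and `u` are illustrative assumptions; the result is cross-checked against PyTorch's `F.conv2d`.

```python
import torch
import torch.nn.functional as F

def local_layer(X, W, u, delta):
    """H[i, j] = u + sum over a, b in [-delta, delta] of W[a, b] * X[i+a, j+b]."""
    h, w = X.shape
    # Zero-pad so that X[i+a, j+b] is defined (as 0) beyond the image border.
    Xp = F.pad(X, (delta, delta, delta, delta))
    H = torch.empty(h, w)
    for i in range(h):
        for j in range(w):
            # Window centred on (i, j); W[0, 0] plays the role of W_{-delta, -delta}.
            window = Xp[i:i + 2 * delta + 1, j:j + 2 * delta + 1]
            H[i, j] = u + (W * window).sum()
    return H

torch.manual_seed(0)
delta = 1
X = torch.randn(6, 8)
W = torch.randn(2 * delta + 1, 2 * delta + 1)   # (2 * delta + 1) ** 2 = 9 weights
u = 0.5

H = local_layer(X, W, u, delta)
# Cross-check against the built-in layer with matching zero padding.
H_ref = F.conv2d(X[None, None], W[None, None],
                 bias=torch.tensor([u]), padding=delta)[0, 0]
print(torch.allclose(H, H_ref, atol=1e-5))  # True
print(W.numel(), (6 * 8) ** 2)              # 9 weights vs. 2304 unconstrained
```

The explicit double loop is slow and only meant to mirror the equation term by term.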

#### Desideratum #3
The last thing we need is for the deeper layers of the model to respond to higher-level features.

### 7.1.3 Convolutions

In math, a convolution is an operation that takes two functions f and g as input and outputs the following:

$$ (f*g)(x) = \int f(z)g(x-z)dz$$

Or in the discrete case:

$$ (f*g)(i) = \sum_a f(a)g(i-a)$$


For the case of two-dimensional tensors:

$$(f*g)(i,j) = \sum_a \sum_b f(a,b)g(i-a, j-b)$$

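which is almost like the formula for $\textbf{H}_{i,j}$ except for the sign of the offsets: the convolution uses $(i-a, j-b)$ where $\textbf{H}_{i,j}$ uses $(i+a, j+b)$. Although we call them convolutions, the actual procedure used for $\textbf{H}_{i,j}$ is a cross-correlation.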

### 7.1.4 Channels 

Since color images consist of three channels (red, green, blue), we make the following adjustments:

* Add a channel dimension to the hidden representation and the convolutional kernel
* Give the bias term one entry per output channel ($U_d$ in the formula)
* To give input-channel-to-output-channel flexibility, s.t. each output channel gets weighted info from each input channel, we add a fourth dimension to the convolutional kernel.

$$ \textbf{H}_{i,j,d} = U_d + \sum_{a=-\Delta}^\Delta\sum_{b=-\Delta}^\Delta\sum_c W_{a,b,c,d}X_{i+a,j+b, c}$$

## 7.2 Convolutions for Images

### 7.2.1 Cross-Correlation Operation

The name "convolutional layer" is a misnomer, since the actual operation is a cross-correlation. In a 2-D cross-correlation the filter is placed on the top-left corner of the image and slid across it left to right and top to bottom. At each position the kernel and the input window under it are combined via the Frobenius inner product, and the output tensor is constructed elementwise. 

In this formulation the output tensor is **(image_height - kernel_height + 1) x (image_width - kernel_width + 1)** in dimension. In practice, the output tensor often retains the same dimensions as the input image, which is achieved by padding the input image with 0's at the borders to make room for the filter.

### 7.2.2 Convolutional Layers

Convolutional kernels are usually randomly initialized, the same as MLP weights. They are likewise updated by gradient descent.
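A minimal sketch of such a layer as a PyTorch module, assuming the loop-based `corr2d` helper used in these notes (repeated here so the block runs on its own). The class name `Conv2D` and the uniform random initialization are illustrative choices, not the initialization scheme of `nn.Conv2d`.

```python
import torch
from torch import nn

def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

class Conv2D(nn.Module):
    """Single-channel convolutional layer: cross-correlation plus a scalar bias."""
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))  # random initial kernel
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias

layer = Conv2D(kernel_size=(1, 2))
out = layer(torch.ones(6, 8))
print(out.shape)                                    # torch.Size([6, 7])
print(sum(p.numel() for p in layer.parameters()))   # 3 (two weights, one bias)
```

Because `weight` and `bias` are registered as `nn.Parameter`, any gradient-based optimizer can update them like MLP weights.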

### 7.2.3 Object Edge Detection in Images

Using a trivial example of an "image" whose columns are either all 1 or all 0, we can show that the convolutional filter [1, -1] detects vertical edges: the output is 1 where the image goes from 1 to 0, -1 where it goes from 0 to 1, and 0 everywhere else.

In [38]:
import torch 


def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

In [39]:
X = torch.ones((6, 8))
X[:, 2:6] = 0
X

tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.]])

In [40]:
K = torch.tensor([[1.0, -1.0]])

In [41]:
Y = corr2d(X, K)
Y

tensor([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.]])

In [42]:
# Transposing the image turns its vertical edges into horizontal ones, which the 1x2 kernel K cannot detect

corr2d(X.t(), K)

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])

In [44]:
# Transposing the kernel as well (2x1) detects the horizontal edges
corr2d(X.t(), K.t())

tensor([[ 0.,  0.,  0.,  0.,  0.,  0.],
        [ 1.,  1.,  1.,  1.,  1.,  1.],
        [ 0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.],
        [-1., -1., -1., -1., -1., -1.],
        [ 0.,  0.,  0.,  0.,  0.,  0.]])

### 7.2.4 Learning a Kernel

After about 10 iterations of gradient descent, a single convolutional layer is able to learn a kernel close to the edge detector [1, -1], using only the input X and the target output Y.
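A sketch of this experiment under stated assumptions: the layer is `nn.Conv2d` without a bias, the loss is the summed squared error, and the learning rate of 3e-2 is an assumed value that happens to be stable for this tiny problem.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Same data as the edge-detection example: an image with vertical edges...
X = torch.ones((6, 8))
X[:, 2:6] = 0
# ...and the target that the hand-written kernel [1, -1] would produce.
Y = X[:, :-1] - X[:, 1:]

# nn.Conv2d expects (batch, channels, height, width).
X = X.reshape(1, 1, 6, 8)
Y = Y.reshape(1, 1, 6, 7)

conv2d = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)
lr = 3e-2  # ASSUMED learning rate

for step in range(10):
    loss = ((conv2d(X) - Y) ** 2).sum()
    conv2d.zero_grad()
    loss.backward()
    with torch.no_grad():
        conv2d.weight -= lr * conv2d.weight.grad  # plain gradient descent step

learned = conv2d.weight.detach().reshape(1, 2)
print(learned)  # close to [[1., -1.]], though not exactly equal
```

The learned values depend on the random initialization, so only approximate agreement with [1, -1] should be expected after 10 steps.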

### 7.2.5 Cross-Correlation and Convolution

Let $f(a,b) = F_{a,b}$ and $g(a,b) = G_{a,b}$ for tensors F (the kernel) and G (the image).


Strict convolution:

$$(f*g)(i,j) = \sum_a \sum_b f(a,b)g(i-a, j-b)$$

Cross-Correlation:

$$(f \star g)(i,j) = \sum_a \sum_b f(a,b)g(i+a, j+b)$$

The only difference between cross-correlation and convolution is the sign of the offsets in the argument of g. Flipping the kernel f horizontally and vertically turns the addition back into the subtraction. Whether we use convolution with the flipped kernel or cross-correlation with the regular kernel, the same output will be generated.
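The relationship can be checked numerically. In this sketch, `conv2d_strict` is a hypothetical helper written for illustration that implements the strict convolution over the valid output region, and `corr2d` is the loop-based helper from these notes.

```python
import torch

def corr2d(X, K):
    """Cross-correlation: Y[i, j] = sum_a sum_b K[a, b] * X[i + a, j + b]."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def conv2d_strict(X, K):
    """Strict convolution: the kernel index runs against the image index."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            for a in range(h):
                for b in range(w):
                    Y[i, j] += K[a, b] * X[i + h - 1 - a, j + w - 1 - b]
    return Y

torch.manual_seed(0)
X = torch.randn(5, 6)
K = torch.randn(2, 3)
K_flipped = torch.flip(K, dims=(0, 1))  # flip vertically and horizontally

same = torch.allclose(conv2d_strict(X, K), corr2d(X, K_flipped), atol=1e-5)
print(same)  # True
```

For the edge detector `[[1.0, -1.0]]`, the flipped kernel is `[[-1.0, 1.0]]`, so a strict convolution with the same kernel would simply swap the signs of the detected edges.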

### 7.2.6 Feature Map and Receptive Field

**Feature Map**: The output of a convolutional layer is also called a feature map, since it holds the features learned from the previous layer (edges, among other things) and passes them as learned representations to subsequent layers. 

**Receptive Field**: The receptive field of an element x in a layer is the set of all elements from previous layers that may affect the calculation of x during the forward pass (it can be larger than the input size, depending on layer depth).
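One way to see a receptive field is to backpropagate from a single output element and look at which input pixels receive a nonzero gradient. This is a sketch with assumed sizes (a 9x9 input and two stacked 3x3 all-ones kernels), not a general-purpose tool.

```python
import torch
import torch.nn.functional as F

X = torch.ones(1, 1, 9, 9, requires_grad=True)
K = torch.ones(1, 1, 3, 3)

# Two stacked 3x3 layers without padding: 9x9 -> 7x7 -> 5x5.
Y = F.conv2d(F.conv2d(X, K), K)

# Input pixels with a nonzero gradient are exactly those that can
# influence the chosen output element.
Y[0, 0, 2, 2].backward()
field = X.grad[0, 0] != 0

print(Y.shape)           # torch.Size([1, 1, 5, 5])
print(int(field.sum()))  # 25, i.e. a 5x5 receptive field
```

Each additional 3x3 layer widens the receptive field by two pixels per axis, which is how deeper layers come to see larger parts of the image.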



## 7.3 Padding and Stride

Padding allows one to retain the input dimensions, whereas stride allows one to drastically reduce them. Padding also reduces the loss of information at the border, where pixels would otherwise be covered by far fewer kernel positions than interior pixels. 

### 7.3.1 Padding

Adding padding_height rows and padding_width columns of padding in total (split between the two sides) increases the output dimensions to:
**(image_height - kernel_height + padding_height + 1) x (image_width - kernel_width + padding_width + 1)**

In most cases, padding_height and padding_width are set to kernel_height - 1 and kernel_width - 1, so that the output has the same shape as the input. Odd kernel dimensions are preferred because the padding can then be split evenly between the two sides, and each output element is centered on its corresponding input pixel.
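A short shape check with `nn.Conv2d`, as a sketch. Note that PyTorch's `padding` argument is the amount added on each side, so it is half of the total padding used in the formula above; the 8x8 input is an arbitrary choice.

```python
import torch
from torch import nn

X = torch.rand(1, 1, 8, 8)  # (batch, channels, height, width)

# No padding: the output shrinks to 8 - 3 + 1 = 6 along each axis.
no_pad = nn.Conv2d(1, 1, kernel_size=3)
print(no_pad(X).shape)      # torch.Size([1, 1, 6, 6])

# One row/column per side = 2 in total = kernel_size - 1: shape is preserved.
same_pad = nn.Conv2d(1, 1, kernel_size=3, padding=1)
print(same_pad(X).shape)    # torch.Size([1, 1, 8, 8])

# Non-square kernel: pad each axis with (kernel size - 1) / 2 per side.
rect_pad = nn.Conv2d(1, 1, kernel_size=(5, 3), padding=(2, 1))
print(rect_pad(X).shape)    # torch.Size([1, 1, 8, 8])
```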

### 7.3.2 Stride

Moving the convolutional kernel more than one pixel at a time downsamples the input representation and also reduces computation. In the two-dimensional case, the stride specifies the number of rows and columns traversed per step. 

**Stride Formula:**

With image size $n$, kernel size $k$, total padding $p$, and stride $s$ along the height ($h$) and width ($w$):

$$ \left\lfloor \frac{n_h - k_h + p_h + s_h}{s_h} \right\rfloor \times \left\lfloor \frac{n_w - k_w + p_w + s_w}{s_w} \right\rfloor $$
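The stride formula can be checked against `nn.Conv2d`. The helper `conv_output_size` is a hypothetical name introduced here; its `p` is the total padding along one axis, whereas PyTorch's `padding` argument is per side.

```python
import torch
from torch import nn

def conv_output_size(n, k, p, s):
    """Output length along one axis; p is the TOTAL padding (both sides)."""
    return (n - k + p + s) // s

X = torch.rand(1, 1, 8, 8)

# 3x3 kernel, 1 pixel of padding per side (2 in total), stride 2.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
print(conv(X).shape)                  # torch.Size([1, 1, 4, 4])
print(conv_output_size(8, 3, 2, 2))   # 4

# Different kernel size, padding, and stride per axis.
conv = nn.Conv2d(1, 1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
print(conv(X).shape)                  # torch.Size([1, 1, 2, 2])
print(conv_output_size(8, 3, 0, 3), conv_output_size(8, 5, 2, 4))  # 2 2
```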

## 7.4 Multiple Input and Multiple Output Channels

### 7.4.1 Multiple Input Channels

With multiple input channels there is barely a difference. We simply must ensure that the kernel has the same number of channels as the input; the cross-correlation is then performed on each input channel with the matching channel of the kernel, and the per-channel results are summed elementwise into a single output.
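A small sketch of the multi-input-channel case, built on the `corr2d` helper from these notes (repeated so the block runs on its own). The helper name `corr2d_multi_in` and the tiny two-channel input are illustrative.

```python
import torch

def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def corr2d_multi_in(X, K):
    """Cross-correlate each input channel with its kernel channel, then sum."""
    return sum(corr2d(x, k) for x, k in zip(X, K))

# Two input channels of size 3x3 and a matching two-channel 2x2 kernel.
X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                  [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]],
                  [[1.0, 2.0], [3.0, 4.0]]])

Y = corr2d_multi_in(X, K)
print(Y)
# tensor([[ 56.,  72.],
#         [104., 120.]])
```

The top-left entry, for example, is 19 from the first channel plus 37 from the second.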

### 7.4.2 Multiple Output Channels

Popular neural network architectures actually increase the number of channels with layer depth, trading off spatial resolution for channel depth. This is because each output channel typically has its own filter, and each filter can represent its own spatial feature (edge detection, circle detection, etc.). For each output channel we have a kernel tensor of shape input_channels x kernel_height x kernel_width; these are stacked along the 0th axis, giving a full kernel of shape output_channels x input_channels x kernel_height x kernel_width.
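The kernel layout described above matches the weight layout of PyTorch's `F.conv2d`. This sketch, with arbitrary channel counts and sizes, shows that each output channel is produced by its own slice of the kernel.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
c_in, c_out, kh, kw = 3, 4, 2, 2
X = torch.randn(1, c_in, 5, 5)            # (batch, in_channels, height, width)
K = torch.randn(c_out, c_in, kh, kw)      # one (c_in, kh, kw) kernel per output channel

Y = F.conv2d(X, K)
print(Y.shape)  # torch.Size([1, 4, 4, 4])

# Output channel d depends only on the kernel slice K[d].
per_channel = torch.stack([F.conv2d(X, K[d:d + 1])[0, 0] for d in range(c_out)])
print(torch.allclose(Y[0], per_channel, atol=1e-5))  # True
```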

### 7.4.3 1x1 Convolutional Filter

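At first glance a 1x1 convolutional filter seems questionable, because such a kernel cannot correlate adjacent pixels and the cross-correlation intuition no longer applies. The only computation happens across channels: at every pixel, the 1x1 layer acts like a fully connected layer over the input channels. This architecture choice is included in several DL approaches. 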
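A sketch showing that a 1x1 convolution amounts to a matrix multiplication over the channel axis, applied independently at every pixel. The channel counts and image size are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
c_in, c_out, h, w = 3, 2, 4, 4
X = torch.randn(c_in, h, w)
K = torch.randn(c_out, c_in, 1, 1)

# 1x1 convolution via the built-in layer (batch dimension added and removed).
Y_conv = F.conv2d(X[None], K)[0]          # (c_out, h, w)

# Same computation as a (c_out x c_in) matrix applied to each pixel's channel vector.
Y_fc = (K.reshape(c_out, c_in) @ X.reshape(c_in, h * w)).reshape(c_out, h, w)

print(torch.allclose(Y_conv, Y_fc, atol=1e-5))  # True
```

This is why 1x1 layers are commonly used to change the number of channels without touching the spatial dimensions.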

### 7.4.4 Discussion

Computing a kxk convolution over an image of size (hxw) is O(h*w*k*k). Including i input channels and o output channels makes this an O(h*w*k*k*i*o) operation.
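To get a feel for these numbers, the following sketch counts multiply-add operations for an assumed 256x256 image, 5x5 kernel, and 128 input and output channels. These sizes are illustrative, and the count ignores padding and stride effects.

```python
def conv_multiply_adds(h, w, k, c_in, c_out):
    """Multiply-add count for a k x k convolution producing an h x w output."""
    return h * w * k * k * c_in * c_out

single = conv_multiply_adds(256, 256, 5, 1, 1)
multi = conv_multiply_adds(256, 256, 5, 128, 128)
print(f"{single:,}")  # 1,638,400
print(f"{multi:,}")   # 26,843,545,600
```

Adding channels multiplies the cost by i*o, here a factor of 16,384.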

## 7.5 Pooling

### 7.5.1 Maximum Pooling and Average Pooling

Pooling operators, like convolutional kernels, are fixed-size windows slid over regions of the input according to the stride, computing a scalar output at each step. Unlike a kernel, a pooling window has no parameters and computes a deterministic operation (min, max, avg). Max-pooling is usually preferred.
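A minimal example of both operators on a 3x3 input, using PyTorch's `F.max_pool2d` and `F.avg_pool2d` with a 2x2 window and stride 1 (values chosen so the result is easy to verify by hand).

```python
import torch
import torch.nn.functional as F

# 3x3 input with a batch and a channel dimension added in front.
X = torch.tensor([[0.0, 1.0, 2.0],
                  [3.0, 4.0, 5.0],
                  [6.0, 7.0, 8.0]])[None, None]

max_out = F.max_pool2d(X, kernel_size=2, stride=1)[0, 0]
avg_out = F.avg_pool2d(X, kernel_size=2, stride=1)[0, 0]

print(max_out)
# tensor([[4., 5.],
#         [7., 8.]])
print(avg_out)
# tensor([[2., 3.],
#         [5., 6.]])
```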

### 7.5.2 Padding and Stride

Since pooling operations change the output shape, padding and stride can be adjusted to give the output the desired shape.


The pooling window size and the stride are usually kept identical, so MaxPool(2) halves both the height and the width, leaving 1/4 of the elements. MaxPool(3) leaves 1/9 of the elements.
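A quick shape check with `nn.MaxPool2d`, whose stride defaults to the window size. The 6x6 input is an assumption chosen so that both window sizes divide it evenly.

```python
import torch
from torch import nn

X = torch.arange(36, dtype=torch.float32).reshape(1, 1, 6, 6)

# Stride defaults to the window size, so windows do not overlap.
print(nn.MaxPool2d(2)(X).shape)  # torch.Size([1, 1, 3, 3]): 36 -> 9 elements
print(nn.MaxPool2d(3)(X).shape)  # torch.Size([1, 1, 2, 2]): 36 -> 4 elements

# Stride and padding can also be set explicitly to reach a desired shape.
pool = nn.MaxPool2d(3, stride=2, padding=1)
print(pool(X).shape)             # torch.Size([1, 1, 3, 3])
```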