# Convolution

Convolution is one of the most important layers in image-processing problems. On the other hand, it is also one of the most computationally demanding layers. This tutorial explains how convolution works and discusses backpropagation through the convolutional layer.

## Forward direction (inference)

[![conv](https://img.youtube.com/vi/wUp5hx-onUI/0.jpg)](https://www.youtube.com/watch?v=wUp5hx-onUI)

The video above illustrates how a 2D convolution works in a simple case. A convolution can be viewed as an ordinary fully connected layer with shared weights. In a fully connected layer all the weights are independent (they can be adjusted independently), but in a convolutional layer each weight is reused by several neurons.
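The weight sharing can be made concrete with a small 1D example. The sketch below uses made-up values for `x` and `w` (they are not from the tutorial's library) and shows that sliding a 3-element kernel over 5 inputs gives the same result as a fully connected layer whose weight matrix repeats the same 3 weights in every row, shifted by one position.

```python
import numpy as np

# Hypothetical 1D example: 5 inputs and a kernel with 3 shared weights.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, -1.0, 2.0])

# Sliding-window form: each output neuron sees 3 neighboring inputs.
y_conv = np.array([x[i:i + 3] @ w for i in range(3)])

# Equivalent fully connected form: a 3x5 weight matrix in which every
# row holds the same 3 weights, shifted by one position. All other
# entries are fixed to zero.
W = np.zeros((3, 5))
for i in range(3):
    W[i, i:i + 3] = w
y_dense = W @ x

print(y_conv)                        # [4.5 6.  7.5]
print(np.allclose(y_conv, y_dense))  # True
```

The dense matrix has 15 entries, but only 3 independent parameters; this is what makes the convolutional layer cheap in terms of weights.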

A convolution has the following components (and characteristics):
* kernel (sometimes called filters)
* stride
* padding
* dilation.

The **kernel** is a 4-dimensional tensor that contains the weights to be tuned. It is essentially a window that slides over the input (image), as the video shows. The four dimensions are: 1) the number of filters, 2) the height of the window, 3) the width of the window and 4) the number of channels. Generally the number of kernel channels is the same as the number of channels of the input.

The **stride** defines how the kernel slides in the two spatial dimensions (height, width). In each step the kernel is moved from the current position to the next one according to the stride. For instance, a stride of (1, 1) means the kernel moves to the neighboring input value on the right, and at the end of a row (of pixels) it moves down by 1 row. A stride of (2, 1) means the window moves 1 column to the right in each step and 2 rows down when a row ends. The video shows an example with stride (1, 1).
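To illustrate the stride, the following sketch lists the top-left corner of every window position for an unpadded input. The helper names `valid_output_size` and `window_origins` are hypothetical and only serve this example.

```python
def valid_output_size(input_size, kernel_size, stride):
    """Number of window positions along one axis without padding."""
    return (input_size - kernel_size) // stride + 1


def window_origins(input_shape, kernel_shape, strides):
    """Top-left (row, column) of every window, in the order it is visited."""
    rows = range(0, input_shape[0] - kernel_shape[0] + 1, strides[0])
    cols = range(0, input_shape[1] - kernel_shape[1] + 1, strides[1])
    return [(r, c) for r in rows for c in cols]


# 5x5 input, 2x2 kernel, stride (2, 1): 2 rows down, 1 column right.
origins = window_origins((5, 5), (2, 2), (2, 1))
print(origins)
# [(0, 0), (0, 1), (0, 2), (0, 3), (2, 0), (2, 1), (2, 2), (2, 3)]
print(valid_output_size(5, 2, 2), valid_output_size(5, 2, 1))  # 2 4
```

The output therefore has 2 rows and 4 columns, one value per window position.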

The **padding** specifies whether additional values (pixels) are added artificially at the borders of the original input in order to increase its size. In the most general case the number of elements added on each side of the image can differ, and the padding value can be chosen as well. In deep learning the padding is usually either VALID or SAME, and most libraries, such as TensorFlow or PyTorch, provide both. VALID means there is no padding. SAME means that, with stride (1, 1), the output of the convolution has the same spatial size as the input. Without padding this is not possible with a kernel size greater than 1, so the amount of padding is chosen such that the output size does not change. The padding value is zero.
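The amount of SAME padding can be computed per axis. The sketch below follows the convention used by TensorFlow, where an odd total is split so that the extra element goes to the end (bottom or right). Whether `cnn.cnn_base` splits the padding the same way is an assumption here, and `same_padding` is a hypothetical helper.

```python
import math

import numpy as np


def same_padding(input_size, kernel_size, stride=1):
    """Return (before, after) zero padding along one axis for SAME."""
    out_size = math.ceil(input_size / stride)
    total = max((out_size - 1) * stride + kernel_size - input_size, 0)
    before = total // 2
    return before, total - before


# 3x3 image, 2x2 kernel, stride 1: one extra row and column are needed.
pad_h = same_padding(3, 2)
pad_w = same_padding(3, 2)
image = np.arange(9.0).reshape(3, 3)
padded = np.pad(image, (pad_h, pad_w), mode="constant", constant_values=0.0)

print(pad_h, pad_w)        # (0, 1) (0, 1)
print(padded.shape)        # (4, 4)
print(same_padding(5, 3))  # (1, 1)
```

With the padded 4x4 image, a 2x2 kernel with stride (1, 1) produces a 3x3 output, the same size as the original input.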

**Dilation** is rarely used in practice, but it can still be useful. The kernel can be dense, in the sense that the distance between neighboring kernel elements is 1. The distance can also be larger, which results in a larger receptive field.
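The effect of dilation on the receptive field can be sketched in a few lines. The example assumes the convention in which the dilation value is the number of input elements skipped between two neighboring kernel elements, so 0 means a dense kernel. Many libraries instead use a dilation rate where 1 means dense; in that case the rate equals `gap + 1`. Both helper functions are hypothetical.

```python
def effective_kernel_size(kernel_size, gap):
    """Extent of the input covered by the kernel along one axis.

    `gap` is the number of input elements skipped between two neighboring
    kernel elements (assumed convention: 0 means a dense kernel).
    """
    return kernel_size + (kernel_size - 1) * gap


def tap_offsets(kernel_size, gap):
    """Input offsets, relative to the window origin, read by each kernel element."""
    return [k * (gap + 1) for k in range(kernel_size)]


for gap in (0, 1, 2):
    print(gap, effective_kernel_size(3, gap), tap_offsets(3, gap))
# 0 3 [0, 1, 2]
# 1 5 [0, 2, 4]
# 2 7 [0, 3, 6]
```

The number of weights stays 3 in all cases; only the area of the input that they cover grows.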

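For the sake of simplicity, here is the formula of the convolution with padding VALID, stride (1, 1) and dilation (0, 0), that is, with a dense kernel. Only the spatial indices are shifted by the kernel position; the channel index $k_c$ is simply summed over:

\begin{equation}
O_{b, i, j, f} = \sum_{k_i, k_j, k_c}{I_{b, i + k_i, j + k_j, k_c} \cdot K_{f, k_i, k_j, k_c}}.
\end{equation}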

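$b$ is the batch index, $i$ and $j$ are the spatial positions in the output, $f$ is the filter index, and $k_i$, $k_j$, $k_c$ run over the height, width and channels of the kernel. To see why this operation is called a convolution, compare it with the definition of the convolution of two 1-dimensional, continuous functions: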

\begin{equation}
\left(f * g \right)(z) = \int_x{f(z - x) \cdot g(x) dx}.
\end{equation}
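Both formulas sum products of an input and a kernel while one of them is shifted. There is one difference worth noting: the layer formula indexes the input with $i + k_i$, whereas the definition above uses $z - x$. Strictly speaking the layer therefore computes a cross-correlation, which equals a convolution with a flipped kernel. Since the kernel is learned, the flip does not matter in practice. The discrete 1D case can be checked with NumPy; the arrays `f` and `g` are made-up values.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.0, -1.0])

# Convolution in the mathematical sense: the kernel is flipped (z - x).
true_conv = np.convolve(f, g, mode="valid")

# Cross-correlation: the kernel slides without flipping (i + k_i),
# which is what the convolutional layer computes.
cross_corr = np.correlate(f, g, mode="valid")

print(true_conv)   # [2. 2.]
print(cross_corr)  # [-2. -2.]

# Flipping the kernel turns one operation into the other.
print(np.allclose(np.correlate(f, g[::-1], mode="valid"), true_conv))  # True
```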

In [1]:
%matplotlib inline
%config IPCompleter.greedy=True # this line is for autocomplete

In [2]:
from cnn.cnn_base import Convolution
import numpy as np

### Programming example for convolution (forward)

In [3]:
input_shape = (1, 3, 3, 2)
filters = 1
kernel_size = (2, 2)
strides = (1, 1)
dilations = (0, 0)
padding = 'SAME'
conv = Convolution(input_shape, filters, kernel_size, strides, dilations, padding)

In [4]:
# create a kernel filled with ones
conv.kernel = np.ones((1, 2, 2, 2))

# create the input tensor
x = np.zeros((1, 3, 3, 2))
x[0, :, :, 0] = np.array([[2, 1, 1.5], [3, 2.2, 5], [5, 3, 2]])
x[0, :, :, 1] = np.array([[7.3, 4, 2], [4, 3, 2], [0.8, 2.7, 3.9]])

In [5]:
conv.output_shape()

(1, 3, 3, 1)

In [6]:
y = conv(x)
print(y)

[[[[26.5]
   [20.7]
   [10.5]]

  [[23.7]
   [23.8]
   [12.9]]

  [[11.5]
   [11.6]
   [ 5.9]]]]
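The printed result can be cross-checked with plain NumPy. The sketch below is not the `cnn.cnn_base` implementation: `conv2d_valid` is a hypothetical helper that evaluates the VALID, stride (1, 1) formula directly, and it assumes that SAME padding adds the one missing row and column of zeros at the bottom and right, which is the placement that reproduces the numbers above.

```python
import numpy as np


def conv2d_valid(inputs, kernel):
    """Naive convolution with stride (1, 1), a dense kernel and no padding.

    inputs: (batch, height, width, channels)
    kernel: (filters, kernel_height, kernel_width, channels)
    """
    batch, height, width, _ = inputs.shape
    filters, k_h, k_w, _ = kernel.shape
    out = np.zeros((batch, height - k_h + 1, width - k_w + 1, filters))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = inputs[:, i:i + k_h, j:j + k_w, :]
            # Sum over k_i, k_j, k_c for every batch item and filter.
            out[:, i, j, :] = np.einsum("bhwc,fhwc->bf", window, kernel)
    return out


# Same input and kernel as in the cells above.
x = np.zeros((1, 3, 3, 2))
x[0, :, :, 0] = np.array([[2, 1, 1.5], [3, 2.2, 5], [5, 3, 2]])
x[0, :, :, 1] = np.array([[7.3, 4, 2], [4, 3, 2], [0.8, 2.7, 3.9]])
kernel = np.ones((1, 2, 2, 2))

# Assumption: SAME pads one row at the bottom and one column on the right.
x_padded = np.pad(x, ((0, 0), (0, 1), (0, 1), (0, 0)))
y_check = conv2d_valid(x_padded, kernel)
print(y_check[0, :, :, 0])
```

For example, the top-left output is the sum of the top-left 2x2 window over both channels: (2 + 1 + 3 + 2.2) + (7.3 + 4 + 4 + 3) = 26.5.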


The forward calculation can be sped up by different methods such as the Winograd algorithm. See for example: [Winograd](https://arxiv.org/abs/1509.09308)


## Backward (backpropagation)

During training we need the right gradients in order to update the current parameters. The gradient of the loss $L$ with respect to the kernel is given by the following formula, obtained by applying the chain rule.

\begin{equation}
\frac{\partial L}{\partial K_{f, k_i, k_j, k_c}} = \sum_{b, i, j}{\frac{\partial L}{\partial O_{b, i, j, f}} \frac{\partial O_{b, i, j, f}}{\partial K_{f, k_i, k_j, k_c}} } = \sum_{b, i, j}{\frac{\partial L}{\partial O_{b, i, j, f}} \cdot I_{b, i + k_i, j + k_j, k_c} }
\end{equation}

In the last expression the first factor comes from the derivatives that were already propagated back from the subsequent layers. Therefore the gradient of the convolution output with respect to the kernel ($J$) can be written as follows:

\begin{equation}
J^{b, i, j}_{k_i, k_j, k_c} = I_{b, i + k_i, j + k_j, k_c}.
\end{equation}
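The kernel gradient can be verified numerically. The following is a minimal sketch for the VALID, stride (1, 1) case, not the code of the `cnn.cnn_backward` module. The helpers `conv2d_valid` and `kernel_gradient` are hypothetical, the tensors are random, and the loss is assumed to be $L = \sum O \cdot G$ for a fixed upstream gradient $G$, so that $\partial L / \partial O = G$.

```python
import numpy as np


def conv2d_valid(inputs, kernel):
    """Naive forward pass: stride (1, 1), dense kernel, no padding."""
    batch, height, width, _ = inputs.shape
    filters, k_h, k_w, _ = kernel.shape
    out = np.zeros((batch, height - k_h + 1, width - k_w + 1, filters))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = inputs[:, i:i + k_h, j:j + k_w, :]
            out[:, i, j, :] = np.einsum("bhwc,fhwc->bf", window, kernel)
    return out


def kernel_gradient(inputs, grad_output, kernel_shape):
    """dL/dK: sum over b, i, j of dL/dO[b, i, j, f] * I[b, i + k_i, j + k_j, k_c]."""
    _, k_h, k_w, _ = kernel_shape
    out_h, out_w = grad_output.shape[1:3]
    grad_kernel = np.zeros(kernel_shape)
    for ki in range(k_h):
        for kj in range(k_w):
            # All inputs that the kernel element (ki, kj) was multiplied with.
            patch = inputs[:, ki:ki + out_h, kj:kj + out_w, :]
            grad_kernel[:, ki, kj, :] = np.einsum("bijf,bijc->fc", grad_output, patch)
    return grad_kernel


rng = np.random.default_rng(0)
inputs = rng.normal(size=(2, 4, 4, 3))
kernel = rng.normal(size=(2, 2, 2, 3))
grad_output = rng.normal(size=(2, 3, 3, 2))  # plays the role of dL/dO

analytic = kernel_gradient(inputs, grad_output, kernel.shape)

# Finite-difference check of every kernel element.
eps = 1e-6
numeric = np.zeros_like(kernel)
base = conv2d_valid(inputs, kernel)
for idx in np.ndindex(kernel.shape):
    shifted = kernel.copy()
    shifted[idx] += eps
    numeric[idx] = np.sum((conv2d_valid(inputs, shifted) - base) * grad_output) / eps

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```

Because the output is linear in the kernel, the finite-difference estimate agrees with the analytic gradient up to floating-point error.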



A reference implementation can be found in the cnn.cnn_backward module. 