In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

# Convolutions

## Concept

[Tutorial](https://theano-pymc.readthedocs.io/en/latest/tutorial/conv_arithmetic.html)

Each filter must have the same number of channels as the input tensor, and each filter produces a single output channel, so the number of filters determines the number of output channels. All the filters taken together form a single weight tensor, which is usually also just called "the filter".
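Here is a minimal shape check of these points. It assumes a random 3-channel 6×6 input and 8 random 3×3 filters, with values chosen only for illustration. Each filter spans all 3 input channels, so the stacked weight tensor has shape 8×3×3×3, and the output has one channel per filter.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# One 3-channel input: (minibatch, in_channels, height, width)
x = torch.randn(1, 3, 6, 6)

# 8 filters, each with as many channels (3) as the input:
# (out_channels, in_channels, kH, kW)
weight = torch.randn(8, 3, 3, 3)

out = F.conv2d(x, weight)
print(out.shape)  # torch.Size([1, 8, 4, 4]): one output channel per filter

# Output channel 2 is exactly what filter 2 alone produces
single = F.conv2d(x, weight[2:3])
print(torch.allclose(out[:, 2:3], single, atol=1e-5))  # True
```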

## Implementation

To use `F.conv2d`, the input must have shape minibatch×in_channels×iH×iW and the filter (weight) must have shape out_channels×in_channels×kH×kW. In the simple case of convolving a single 2D tensor with a single 2D filter, the minibatch size, in_channels, and out_channels are all 1. This is why the tensors below are reshaped with `view`.

In [3]:
x = torch.tensor([[3., 0., 1., 2., 7., 4.],
                  [1., 5., 8., 9., 3., 1.],
                  [2., 7., 2., 5., 1., 3.],
                  [0., 1., 3., 1., 7., 8.],
                  [4., 2., 1., 6., 2., 8.],
                  [2., 4., 5., 2., 3., 9.]])
x = x.view(1, 1, 6, 6)

w = torch.tensor([[2., 0., -2.],
                  [2., 0., -2.],
                  [2., 0., -2.]])
w = w.view(1, 1, 3, 3)

In [4]:
F.conv2d(x, w)

tensor([[[[-10.,  -8.,   0.,  16.],
          [-20.,  -4.,   4.,   6.],
          [  0.,  -4.,  -8., -14.],
          [ -6.,  -4.,  -6., -32.]]]])
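The following sketch shows what `F.conv2d` computes at each position. It uses a plain sliding-window loop (stride 1, no padding) on the same `x` and `w` as above. Like most deep-learning libraries, `F.conv2d` computes a cross-correlation: the kernel slides over the input as-is, without being flipped. The helper `naive_conv2d` is hypothetical and written only for this check.

```python
import torch
import torch.nn.functional as F

def naive_conv2d(img: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    """Valid (no padding), stride-1 cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = torch.zeros(oh, ow)
    for i in range(oh):
        for j in range(ow):
            # Multiply the kernel element-wise with the window it covers, then sum
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

x = torch.tensor([[3., 0., 1., 2., 7., 4.],
                  [1., 5., 8., 9., 3., 1.],
                  [2., 7., 2., 5., 1., 3.],
                  [0., 1., 3., 1., 7., 8.],
                  [4., 2., 1., 6., 2., 8.],
                  [2., 4., 5., 2., 3., 9.]])
w = torch.tensor([[2., 0., -2.],
                  [2., 0., -2.],
                  [2., 0., -2.]])

manual = naive_conv2d(x, w)
library = F.conv2d(x.view(1, 1, 6, 6), w.view(1, 1, 3, 3))[0, 0]
print(torch.allclose(manual, library))  # True
print(manual[0, 0])  # tensor(-10.)
```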

In [5]:
x = torch.tensor([[30., 30., 30., 0., 0., 0.],
                  [30., 30., 30., 0., 0., 0.],
                  [30., 30., 30., 0., 0., 0.],
                  [30., 30., 30., 0., 0., 0.],
                  [30., 30., 30., 0., 0., 0.],
                  [30., 30., 30., 0., 0., 0.]])
x = x.view(1, 1, 6, 6)

F.conv2d(x, w)

tensor([[[[  0., 180., 180.,   0.],
          [  0., 180., 180.,   0.],
          [  0., 180., 180.,   0.],
          [  0., 180., 180.,   0.]]]])

In [6]:
5*5*3*8

600
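The cell above appears to count the weights of a layer with 8 filters of size 5×5 over 3 input channels. This interpretation is an assumption, because the cell has no accompanying text. Here is a quick cross-check with `nn.Conv2d`, which by default also adds one bias per output channel.

```python
import torch.nn as nn

# 3 input channels, 8 filters (output channels), 5x5 kernels
layer = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)

n_weights = layer.weight.numel()  # 8 * 3 * 5 * 5
n_biases = layer.bias.numel()     # one bias per filter
print(layer.weight.shape)         # torch.Size([8, 3, 5, 5])
print(n_weights, n_biases, n_weights + n_biases)  # 600 8 608
```

The count does not depend on the spatial size of the input, because the same filters are slid over every position.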

In [7]:
red_filter = np.array([
    [1., 0., -1.],
    [1., 0., -1.],
    [1., 0., -1.]
])
red_filter_full = torch.tensor(red_filter.reshape(1, 1, 3, 3))

In [8]:
red = np.array([
    [10., 10., 10., 0., 0., 0.],
    [10., 10., 10., 0., 0., 0.],
    [10., 10., 10., 0., 0., 0.],
    [10., 10., 10., 0., 0., 0.],
    [10., 10., 10., 0., 0., 0.],
    [10., 10., 10., 0., 0., 0.]
])
red_batch_of_one = torch.tensor(red.reshape((1, 1, 6, 6)))

In [9]:
red_out = F.conv2d(red_batch_of_one, red_filter_full)
print(red_out.shape)
red_out[0, 0, :, :]

torch.Size([1, 1, 4, 4])


tensor([[ 0., 30., 30.,  0.],
        [ 0., 30., 30.,  0.],
        [ 0., 30., 30.,  0.],
        [ 0., 30., 30.,  0.]], dtype=torch.float64)

In [10]:
green_filter = np.array([
    [2., 0., -2.],
    [2., 0., -2.],
    [2., 0., -2.]
])
green_filter_full = torch.tensor(green_filter.reshape(1, 1, 3, 3))

In [11]:
green = np.array([
    [30., 30., 30., 0., 0., 0.],
    [30., 30., 30., 0., 0., 0.],
    [30., 30., 30., 0., 0., 0.],
    [30., 30., 30., 0., 0., 0.],
    [30., 30., 30., 0., 0., 0.],
    [30., 30., 30., 0., 0., 0.]
])
green_batch_of_one = torch.tensor(green.reshape((1, 1, 6, 6)))

In [12]:
green_out = F.conv2d(green_batch_of_one, green_filter_full)
print(green_out.shape)
green_out[0, 0, :, :]

torch.Size([1, 1, 4, 4])


tensor([[  0., 180., 180.,   0.],
        [  0., 180., 180.,   0.],
        [  0., 180., 180.,   0.],
        [  0., 180., 180.,   0.]], dtype=torch.float64)

In [13]:
blue_filter = np.array([
    [0.5, 0., -0.5],
    [0.5, 0., -0.5],
    [0.5, 0., -0.5]
])
blue_filter_full = torch.tensor(blue_filter.reshape(1, 1, 3, 3))

In [14]:
blue = np.array([
    [20., 20., 20., 0., 0., 0.],
    [20., 20., 20., 0., 0., 0.],
    [20., 20., 20., 0., 0., 0.],
    [20., 20., 20., 0., 0., 0.],
    [20., 20., 20., 0., 0., 0.],
    [20., 20., 20., 0., 0., 0.]
])
blue_batch_of_one = torch.tensor(blue.reshape((1, 1, 6, 6)))

In [15]:
blue_out = F.conv2d(blue_batch_of_one, blue_filter_full)
print(blue_out.shape)
blue_out[0, 0, :, :]

torch.Size([1, 1, 4, 4])


tensor([[ 0., 30., 30.,  0.],
        [ 0., 30., 30.,  0.],
        [ 0., 30., 30.,  0.],
        [ 0., 30., 30.,  0.]], dtype=torch.float64)

In [16]:
img = torch.tensor(np.expand_dims(np.stack([red, green, blue]), axis=0))
img.shape

torch.Size([1, 3, 6, 6])

In [17]:
img_filter = torch.tensor(np.expand_dims(np.stack([red_filter, green_filter, blue_filter]), axis=0))
img_filter.shape

torch.Size([1, 3, 3, 3])

In [18]:
img_out = F.conv2d(img, img_filter)
print(img_out.shape)
img_out[0, 0, :, :]

torch.Size([1, 1, 4, 4])


tensor([[  0., 240., 240.,   0.],
        [  0., 240., 240.,   0.],
        [  0., 240., 240.,   0.],
        [  0., 240., 240.,   0.]], dtype=torch.float64)

In [19]:
red_out + green_out + blue_out

tensor([[[[  0., 240., 240.,   0.],
          [  0., 240., 240.,   0.],
          [  0., 240., 240.,   0.],
          [  0., 240., 240.,   0.]]]], dtype=torch.float64)

### Padding 
With no padding, the center pixels get a bigger say in the convolved output tensor than the border pixels, because more kernel positions cover them. With padding, all the pixels can be given a more or less equal say. This is useful when there are features at the border that need to be captured.

![padding](./padding.png)

Here are the different padding types that I know of, for an $i \times i$ input and a $k \times k$ kernel at stride 1:
  * **Valid padding**: There is no padding, so the output tensor is smaller than the input tensor ($(i - k + 1) \times (i - k + 1)$).
  * **Half or same padding**: The input tensor is framed by zeros, with just enough rows/cols added so that the output tensor has the same dimensions as the input tensor. For an odd $k$, this is $(k - 1)/2$ on each side. Most libraries calculate the exact number of "frames" to add.
  * **Full padding**: Enough zeros are added ($k - 1$ on each side) that every input pixel is covered by the kernel the same number of times ($k^2$). The output grows to $(i + k - 1) \times (i + k - 1)$. Again, most libraries calculate the exact number of "frames" to surround the input tensor with.

Half and Full padding are instances of "zero" padding, because the pad value is zero.
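This sketch compares the three padding types for a 6×6 input and a 3×3 all-ones kernel at stride 1. Here, valid, same, and full padding correspond to `padding=0`, `padding=(k - 1) // 2`, and `padding=k - 1`. PyTorch 1.9 and later also accept the string `"same"` for stride-1 convolutions. The hypothetical helper `coverage` uses autograd to count how many kernel positions cover each input pixel.

```python
import torch
import torch.nn.functional as F

x = torch.ones(1, 1, 6, 6)   # i = 6
w = torch.ones(1, 1, 3, 3)   # k = 3
k = w.shape[-1]

valid = F.conv2d(x, w, padding=0)            # side 6 - 3 + 1 = 4
same = F.conv2d(x, w, padding=(k - 1) // 2)  # side 6
full = F.conv2d(x, w, padding=k - 1)         # side 6 + 3 - 1 = 8
same_str = F.conv2d(x, w, padding="same")    # string form, stride 1 only

print(valid.shape[-1], same.shape[-1], full.shape[-1])  # 4 6 8

def coverage(padding):
    """Count how many kernel positions cover each input pixel (all-ones kernel)."""
    x_req = torch.ones(1, 1, 6, 6, requires_grad=True)
    F.conv2d(x_req, w, padding=padding).sum().backward()
    return x_req.grad[0, 0]

print(coverage(0)[0, 0].item(), coverage(0)[2, 2].item())  # 1.0 9.0: corner vs interior
print(coverage(k - 1).unique())  # tensor([9.]): full padding covers every pixel 9 times
```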

## Pooling

Pooling is a form of downsampling. It boils the input tensor down to its essentials, and the smaller pooled tensor is cheaper to run subsequent computations on than the full tensor. Pooling has no learned parameters, but it is still differentiable. For example, when max pooling is backpropagated, the gradient passes through as-is to the pixels that were selected as the maximum of their window, and every other pixel gets a gradient of 0. See [this Data Science Stack Exchange question](https://datascience.stackexchange.com/questions/11699/backprop-through-max-pooling-layers).

![pooled](./pooled.png)
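The gradient behaviour described above can be checked with autograd. This minimal sketch uses a 4×4 input with 2×2 windows and stride 2. After `backward()`, only the position of the maximum in each window receives a gradient of 1, and every other position receives 0.

```python
import torch
import torch.nn.functional as F

# Leaf tensor that tracks gradients
x = torch.tensor([[1., 3., 2., 1.],
                  [2., 9., 1., 1.],
                  [1., 3., 2., 3.],
                  [5., 6., 1., 2.]], requires_grad=True)

pooled = F.max_pool2d(x.view(1, 1, 4, 4), kernel_size=2, stride=2)
print(pooled[0, 0])  # values [[9, 2], [6, 3]]
pooled.sum().backward()  # upstream gradient of 1 for each pooled value

print(x.grad)
# tensor([[0., 0., 1., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 0., 1.],
#         [0., 1., 0., 0.]])
```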

In [20]:
x = torch.tensor([[1., 3., 2., 1.],
                  [2., 9., 1., 1.],
                  [1., 3., 2., 3.],
                  [5., 6., 1., 2.]])
x = x.view(1, 1, 4, 4)

F.max_pool2d(x, kernel_size=2, stride=2)

tensor([[[[9., 2.],
          [6., 3.]]]])

In [21]:
x = torch.tensor([[1., 3., 2., 1., 3.],
                  [2., 9., 1., 1., 5.],
                  [1., 3., 2., 3., 2.],
                  [8., 3., 5., 1., 0.],
                  [5., 6., 1., 2., 9.]])
x = x.view(1, 1, 5, 5)
F.max_pool2d(x, kernel_size=3, stride=1)

tensor([[[[9., 9., 5.],
          [9., 9., 5.],
          [8., 6., 9.]]]])
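For a square input of side $i$, a kernel of side $k$, and stride $s$ with no padding, the pooled output has side $\lfloor (i - k)/s \rfloor + 1$. That gives 2 for the first example above and 3 for the second. The sketch below performs the same computation with explicit loops and checks it against `F.maxpool2d`. The helper `naive_max_pool2d` is hypothetical and written only for illustration.

```python
import torch
import torch.nn.functional as F

def naive_max_pool2d(img: torch.Tensor, k: int, s: int) -> torch.Tensor:
    """Max pooling of a 2D tensor with a k x k window and stride s, no padding."""
    oh = (img.shape[0] - k) // s + 1
    ow = (img.shape[1] - k) // s + 1
    out = torch.empty(oh, ow)
    for i in range(oh):
        for j in range(ow):
            # Largest value inside the current window
            out[i, j] = img[i * s:i * s + k, j * s:j * s + k].max()
    return out

x = torch.tensor([[1., 3., 2., 1., 3.],
                  [2., 9., 1., 1., 5.],
                  [1., 3., 2., 3., 2.],
                  [8., 3., 5., 1., 0.],
                  [5., 6., 1., 2., 9.]])

manual = naive_max_pool2d(x, k=3, s=1)
library = F.max_pool2d(x.view(1, 1, 5, 5), kernel_size=3, stride=1)[0, 0]
print(torch.equal(manual, library))  # True
print(manual)
# tensor([[9., 9., 5.],
#         [9., 9., 5.],
#         [8., 6., 9.]])
```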

## Upsampling

In a sense, upsampling is the opposite of pooling: it can be used to "blow up" a tensor to a larger spatial size. Just like pooling, it has no learnable parameters, and it too is differentiable.

In [22]:
x = torch.tensor([[20, 45],
                  [10, 43]], dtype=torch.float32)
x = x.view(1, 1, 2, 2)
F.interpolate(x, scale_factor=2, mode="nearest")

tensor([[[[20., 20., 45., 45.],
          [20., 20., 45., 45.],
          [10., 10., 43., 43.],
          [10., 10., 43., 43.]]]])

In [23]:
F.interpolate(x, scale_factor=2, mode="bilinear")



tensor([[[[20.0000, 26.2500, 38.7500, 45.0000],
          [17.5000, 24.2500, 37.7500, 44.5000],
          [12.5000, 20.2500, 35.7500, 43.5000],
          [10.0000, 18.2500, 34.7500, 43.0000]]]])
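The sketch below makes two quick checks. First, nearest-neighbour upsampling by 2 is the same as repeating each row and each column twice, done here with `torch.repeat_interleave`. Second, each input value is copied to 4 output positions, so the gradient of the summed output with respect to each input value is 4. Note that `mode="bilinear"` uses `align_corners=False` by default, which is consistent with the values above. Passing `align_corners=True` gives different values.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[20., 45.],
                  [10., 43.]], requires_grad=True)

up = F.interpolate(x.view(1, 1, 2, 2), scale_factor=2, mode="nearest")[0, 0]

# Repeat every row twice, then every column twice
repeated = x.repeat_interleave(2, dim=0).repeat_interleave(2, dim=1)
print(torch.equal(up, repeated))  # True

# Each input value fills a 2x2 block of the output, so its gradient is 4
up.sum().backward()
print(x.grad)
# tensor([[4., 4.],
#         [4., 4.]])
```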

## Transposed Convolution

Think of this as upsampling with learned parameters. Transposed convolutions are also sometimes (incorrectly) called deconvolutions.

![transpose](./transpose.png)

So far I have only seen square kernels with square strides. With no padding, consider the following input:
  * a square input of size $i \times i$,
  * a square kernel of size $k \times k$, and
  * a square stride of size $s \times s$.

The output is then a square tensor of size $o \times o$, where
$$
o = s(i - 1) + k
$$
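This sketch checks the output-size formula against `F.conv_transpose2d` for a few arbitrary sizes. It relies on the function's defaults: no padding, `output_padding=0`, and dilation 1. The helper `transposed_out_size` is hypothetical.

```python
import torch
import torch.nn.functional as F

def transposed_out_size(i: int, k: int, s: int) -> int:
    # Side length s(i - 1) + k, valid when there is no padding
    return s * (i - 1) + k

for i, k, s in [(2, 2, 1), (2, 2, 2), (3, 3, 2), (4, 3, 3)]:
    x = torch.randn(1, 1, i, i)
    w = torch.randn(1, 1, k, k)
    out = F.conv_transpose2d(x, w, stride=s)
    assert out.shape[-1] == transposed_out_size(i, k, s), (i, k, s)
    print(i, k, s, "->", out.shape[-1])
# 2 2 1 -> 3
# 2 2 2 -> 4
# 3 3 2 -> 7
# 4 3 3 -> 12
```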

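The problem with transposed convolutions is uneven overlap. When the kernel size is not divisible by the stride, neighbouring kernel footprints overlap on some output pixels but not on others. As a result, some output pixels receive contributions from more input pixels than their neighbours do. This can lead to the so-called [checkerboard pattern](https://distill.pub/2016/deconv-checkerboard/) in the output tensor. Doing a (parameterless) upsampling followed by a same convolution is one way to combat this.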
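Here is a minimal sketch of the upsample-then-convolve alternative. It assumes nearest-neighbour upsampling by 2 followed by a 3×3 convolution with `padding=1` (same padding), built from `nn.Upsample` and `nn.Conv2d`. The channel counts are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical channel counts, chosen only for illustration
in_channels, out_channels = 16, 8

resize_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),                     # parameter-free resize
    nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),  # 3x3 "same" convolution
)

x = torch.randn(1, in_channels, 5, 5)
y = resize_conv(x)
print(y.shape)  # torch.Size([1, 8, 10, 10])

# All learnable parameters live in the convolution: 16*8*3*3 weights + 8 biases
print(sum(p.numel() for p in resize_conv.parameters()))  # 1160
```

The spatial size doubles as it would with a stride-2 transposed convolution. Because the convolution runs at stride 1 on the already-enlarged tensor, every output pixel is computed from the same number of kernel taps.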

In [24]:
x = torch.tensor([
    [1, 4],
    [0, 2]
]).view(1, 1, 2, 2)
w = torch.tensor([
    [2, 2],
    [1, 1]
]).view(1, 1, 2, 2)

F.conv_transpose2d(x, w)

tensor([[[[ 2, 10,  8],
          [ 1,  9,  8],
          [ 0,  2,  2]]]])

In [25]:
F.conv_transpose2d(x, w, stride=2)

tensor([[[[2, 2, 8, 8],
          [1, 1, 4, 4],
          [0, 0, 4, 4],
          [0, 0, 2, 2]]]])
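Here is one way to see what `F.conv_transpose2d` computes. Each input value scales a copy of the kernel, and that copy is placed at position (row·s, col·s) in the output. Overlapping copies are summed. The sketch below implements this scatter-add view and checks it against the two results above. The helper `naive_conv_transpose2d` is hypothetical and assumes square inputs and kernels with no padding. Float tensors are used here as a precaution.

```python
import torch
import torch.nn.functional as F

def naive_conv_transpose2d(x: torch.Tensor, w: torch.Tensor, s: int) -> torch.Tensor:
    """Transposed convolution of a square 2D input with a square 2D kernel, stride s, no padding."""
    i, k = x.shape[0], w.shape[0]
    o = s * (i - 1) + k
    out = torch.zeros(o, o)
    for r in range(i):
        for c in range(i):
            # Scale the kernel by the input value and add it at the strided position
            out[r * s:r * s + k, c * s:c * s + k] += x[r, c] * w
    return out

x = torch.tensor([[1., 4.],
                  [0., 2.]])
w = torch.tensor([[2., 2.],
                  [1., 1.]])

for s in (1, 2):
    manual = naive_conv_transpose2d(x, w, s)
    library = F.conv_transpose2d(x.view(1, 1, 2, 2), w.view(1, 1, 2, 2), stride=s)[0, 0]
    print(s, torch.allclose(manual, library))  # True for both strides
```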