intuition for object detection in an image
by enumerating a few desiderata to guide our design
of a neural network architecture suitable for computer vision:

1. In the earliest layers, our network
   should respond similarly to the same patch,
   regardless of where it appears in the image. This principle is called *translation invariance* (or *translation equivariance*).
1. The earliest layers of the network should focus on local regions,
   without regard for the contents of the image in distant regions. This is the *locality* principle.
   Eventually, these local representations can be aggregated
   to make predictions at the whole image level.
1. As we proceed, deeper layers should be able to capture longer-range features of the 
   image, in a way similar to higher level vision in nature. 
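
As a small numerical illustration of the first two principles, the sketch below applies a fixed kernel to an image and to a shifted copy of it, using `torch.nn.functional.conv2d` (which computes a cross-correlation). The image size, the kernel values, and the shift `(dy, dx)` are arbitrary choices made for this illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.rand(8, 8)                      # arbitrary "image"
K = torch.tensor([[1.0, 0.0, -1.0],
                  [2.0, 0.0, -2.0],
                  [1.0, 0.0, -1.0]])      # arbitrary fixed 3x3 kernel

def cross_correlate(image, kernel):
    # F.conv2d expects (batch, channels, height, width) tensors
    return F.conv2d(image[None, None], kernel[None, None])[0, 0]

# Shift the image down by dy rows and right by dx columns, filling with zeros
dy, dx = 2, 3
X_shifted = torch.zeros_like(X)
X_shifted[dy:, dx:] = X[:-dy, :-dx]

Y = cross_correlate(X, K)
Y_shifted = cross_correlate(X_shifted, K)

# Translation equivariance: the response moves together with the input
equivariant = torch.allclose(Y_shifted[dy:, dx:], Y[:-dy, :-dx], atol=1e-6)

# Locality: Y[0, 0] depends only on X[0:3, 0:3], so changing a distant pixel
# leaves it unchanged
X_far = X.clone()
X_far[7, 7] += 10.0
local = torch.allclose(cross_correlate(X_far, K)[0, 0], Y[0, 0], atol=1e-6)

print(equivariant, local)
```

Away from the zero-filled border, the output of the shifted image is the shifted output of the original image, which is the sense in which the layer responds "the same way" to a patch wherever it appears.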

Let's see how this translates into mathematics.

In [38]:
import torch
from torch import nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

In [124]:
def corr2d(X, k):

    h, w = k.shape
    y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))

    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = (k * X[i : i + h, j : j + w]).sum()

    return y

def test_corr2d():
    con = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, bias=False)
    weight = con.weight.data[0, 0].clone().detach()
    x = torch.rand((6, 6))
    x.requires_grad = True
    res1 = corr2d(x, weight)
    res1.backward(torch.ones_like(res1))

    x2 = x.clone().detach().requires_grad_(True)
    res2 = con(x2[None, None, :, :])
    res2.backward(torch.ones_like(res2))


    # Compare with a tolerance on the absolute difference; a signed difference
    # would let arbitrarily large negative errors pass the check.
    assert torch.allclose(res1, res2[0, 0], atol=1e-5), f"a problem occurred: corr2d result is {res1} and conv result is {res2[0, 0]}"
    assert torch.allclose(x.grad, x2.grad, atol=1e-5), f"a problem occurred: x grad is {x.grad} and x2 grad is {x2.grad}"

test_corr2d()

In [125]:
from typing import Any


class Conv2D(nn.Module):

    def __init__(self, kernel_size) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.rand(1))
    
    def forward(self, X):
        return corr2d(X, self.weight) + self.bias
    


In [126]:
#edge detection - vertical

X = torch.ones((6, 8))
X[:, 2:6] = 0

K = torch.tensor([[1.0, -1.0]])

Y = corr2d(X, K)
X, Y, corr2d(X.t(), K)


(tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
         [1., 1., 0., 0., 0., 0., 1., 1.],
         [1., 1., 0., 0., 0., 0., 1., 1.],
         [1., 1., 0., 0., 0., 0., 1., 1.],
         [1., 1., 0., 0., 0., 0., 1., 1.],
         [1., 1., 0., 0., 0., 0., 1., 1.]]),
 tensor([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
         [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
         [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
         [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
         [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
         [ 0.,  1.,  0.,  0.,  0., -1.,  0.]]),
 tensor([[0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.]]))

In [127]:
# Learning the edge detector

# We use X as the input and Y as the target output

conv = nn.LazyConv2d(out_channels=1, kernel_size=(1, 2), bias=False)
lr = 1e-4

X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))

min_loss = 10
best = None



for epoch in range(1501):

    res = conv(X)
    loss = (res - Y) ** 2

    with torch.no_grad():
        if loss.sum().item() < min_loss:

            min_loss = loss.sum().item()
            best = conv.weight.data.clone()
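    # Note: the gradient is never zeroed here, so conv.weight.grad accumulates
    # the gradients of all previous iterations. The update then behaves like
    # undamped momentum, which is why the printed loss oscillates. Calling
    # conv.zero_grad() before backward() would give plain gradient descent.
    loss.sum().backward()
    # the use of [:] makes the operation in place
    conv.weight.data[:] += -lr * conv.weight.grad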

    if (epoch + 1) % 500 == 0:
        print(f'epoch {epoch + 1}, loss {loss.sum():.3f}, min loss {min_loss:.5f}')

print()
print(f"best fit - {best}")
print(f"last results - {conv.weight.data}")

epoch 500, loss 0.955, min loss 0.09738
epoch 1000, loss 9.858, min loss 0.00369
epoch 1500, loss 3.367, min loss 0.00369

best fit - tensor([[[[ 1.0025, -1.0159]]]])
last results - tensor([[[[ 0.9246, -1.3891]]]])
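
For comparison, here is a minimal sketch of the same fit with the gradient cleared on every iteration, so that each step is a plain gradient-descent update. The learning rate of `1e-2`, the 200 iterations, and the seed are assumptions chosen for this sketch, not values taken from the run above, and `nn.Conv2d` with an explicit input channel count stands in for `nn.LazyConv2d`.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Same data as above: a 6x8 image with two vertical edges
X = torch.ones((6, 8))
X[:, 2:6] = 0
K = torch.tensor([[1.0, -1.0]])
# Target: cross-correlation of X with the 1x2 kernel K, written out directly
Y = X[:, :-1] * K[0, 0] + X[:, 1:] * K[0, 1]

X4, Y4 = X.reshape(1, 1, 6, 8), Y.reshape(1, 1, 6, 7)
conv = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)
lr = 1e-2  # assumed step size for this sketch

for step in range(200):
    loss = ((conv(X4) - Y4) ** 2).sum()
    conv.zero_grad()          # clear the gradient left over from the last step
    loss.backward()
    with torch.no_grad():
        conv.weight -= lr * conv.weight.grad

learned = conv.weight.detach().reshape(-1)
print(learned)  # should be close to the target kernel [1, -1]
```

With the gradient reset each step, the loss decreases steadily instead of oscillating, so tracking the best weights seen so far is no longer necessary.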


## Padding

As described above, one tricky issue when applying convolutional layers
is that we tend to lose pixels on the perimeter of our image. Consider :numref:`img_conv_reuse` that depicts the pixel utilization as a function of the convolution kernel size and the position within the image. The pixels in the corners are hardly used at all. 

![Pixel utilization for convolutions of size $1 \times 1$, $2 \times 2$, and $3 \times 3$ respectively.](../img/conv-reuse.svg)
:label:`img_conv_reuse`

Since we typically use small kernels,
for any given convolution
we might only lose a few pixels
but this can add up as we apply
many successive convolutional layers.
One straightforward solution to this problem
is to add extra pixels of filler around the boundary of our input image,
thus increasing the effective size of the image.
Typically, we set the values of the extra pixels to zero.
In :numref:`img_conv_pad`, we pad a $3 \times 3$ input,
increasing its size to $5 \times 5$.
The corresponding output then increases to a $4 \times 4$ matrix.
The shaded portions are the first output element as well as the input and kernel tensor elements used for the output computation: $0\times0+0\times1+0\times2+0\times3=0$.

![Two-dimensional cross-correlation with padding.](../img/conv-pad.svg)
:label:`img_conv_pad`

In general, if we add a total of $p_\textrm{h}$ rows of padding
(roughly half on top and half on bottom)
and a total of $p_\textrm{w}$ columns of padding
(roughly half on the left and half on the right),
the output shape will be

$$(n_\textrm{h}-k_\textrm{h}+p_\textrm{h}+1)\times(n_\textrm{w}-k_\textrm{w}+p_\textrm{w}+1).$$

This means that the height and width of the output
will increase by $p_\textrm{h}$ and $p_\textrm{w}$, respectively.

In many cases, we will want to set $p_\textrm{h}=k_\textrm{h}-1$ and $p_\textrm{w}=k_\textrm{w}-1$
to give the input and output the same height and width.
This will make it easier to predict the output shape of each layer
when constructing the network.
Assuming that $k_\textrm{h}$ is odd here,
we will pad $p_\textrm{h}/2$ rows on both sides of the height.
If $k_\textrm{h}$ is even, one possibility is to
pad $\lceil p_\textrm{h}/2\rceil$ rows on the top of the input
and $\lfloor p_\textrm{h}/2\rfloor$ rows on the bottom.
We will pad both sides of the width in the same way.
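
To make the even-kernel case concrete, the following sketch pads an input asymmetrically with `torch.nn.functional.pad` before applying an unpadded convolution. The $4 \times 4$ kernel, the $8 \times 8$ input, and the helper name `same_padding` are hypothetical choices for illustration.

```python
import math
import torch
import torch.nn.functional as F
from torch import nn

def same_padding(k):
    """Split p = k - 1 rows (or columns) into (before, after) amounts."""
    p = k - 1
    return math.ceil(p / 2), math.floor(p / 2)

kh, kw = 4, 4                      # even kernel size
top, bottom = same_padding(kh)     # (2, 1)
left, right = same_padding(kw)     # (2, 1)

X = torch.rand(1, 1, 8, 8)         # (batch, channels, height, width)
# F.pad lists the last dimension first: (left, right, top, bottom)
X_padded = F.pad(X, (left, right, top, bottom))

conv2d = nn.Conv2d(1, 1, kernel_size=(kh, kw))   # no built-in padding
Y = conv2d(X_padded)

# Output shape formula: n - k + p + 1 with p = k - 1 gives back n
out_h = X.shape[2] - kh + (top + bottom) + 1
out_w = X.shape[3] - kw + (left + right) + 1
print(Y.shape[2:], (out_h, out_w))
```

For an odd kernel size, `same_padding` returns equal amounts on both sides, which is the symmetric case described above.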

CNNs commonly use convolution kernels
with odd height and width values, such as 1, 3, 5, or 7.
Choosing odd kernel sizes has the benefit
that we can preserve the dimensionality
while padding with the same number of rows on top and bottom,
and the same number of columns on left and right.

Moreover, this practice of using odd kernels
and padding to precisely preserve dimensionality
offers a clerical benefit.
For any two-dimensional tensor `X`,
when the kernel's size is odd
and the number of padding rows and columns
on all sides are the same,
thereby producing an output with the same height and width as the input,
we know that the output `Y[i, j]` is calculated
by cross-correlation of the input and convolution kernel
with the window centered on `X[i, j]`.
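
This centering property can be checked numerically. The sketch below uses a random $3 \times 3$ kernel with one pixel of zero padding on each side and compares one output element with the window centered on the same input position; the index `(i, j) = (4, 2)` is an arbitrary interior position chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.rand(8, 8)
K = torch.rand(3, 3)

# Cross-correlation with 1 pixel of zero padding on every side
Y = F.conv2d(X[None, None], K[None, None], padding=1)[0, 0]

i, j = 4, 2  # arbitrary interior position
window = X[i - 1 : i + 2, j - 1 : j + 2]   # 3x3 window centered on X[i, j]
centered_value = (window * K).sum()

print(Y.shape)                                   # same height and width as X
print(torch.allclose(Y[i, j], centered_value))
```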

In the following example, we create a two-dimensional convolutional layer
with a kernel height and width of 3
and (**apply 1 pixel of padding on all sides**).
Given an input with a height and width of 8,
we find that the height and width of the output are also 8.


In [4]:
# We define a helper function to calculate convolutions. It initializes the
# convolutional layer weights and performs corresponding dimensionality
# elevations and reductions on the input and output
def comp_conv2d(conv2d, X):
    # (1, 1) indicates that batch size and the number of channels are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    # Strip the first two dimensions: examples and channels
    return Y.reshape(Y.shape[2:])

# 1 row and column is padded on either side, so a total of 2 rows or columns
# are added
conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1)
X = torch.rand(size=(8, 8))
comp_conv2d(conv2d, X).shape



torch.Size([8, 8])

In [5]:
conv2d = nn.LazyConv2d(1, kernel_size=3)
X = torch.rand(size=(8, 8))
comp_conv2d(conv2d, X).shape

torch.Size([6, 6])

In [28]:
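kernel_size = (11, 3)
X = torch.rand(size=(15, 15))
# For an odd kernel size k, padding k // 2 on each side preserves the input size
pad_last_dim = [kernel_size[1] // 2, kernel_size[1] // 2]    # left, right
pad_first_dim = [kernel_size[0] // 2, kernel_size[0] // 2]   # top, bottom
# F.pad expects the padding of the last dimension first
pad_val = (*pad_last_dim, *pad_first_dim)
x_padded = F.pad(X, pad_val, 'constant', 0)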

# fig, ax = plt.subplots(1, 2, figsize=(15, 10))
# ax[0].imshow(x_padded)
# ax[0].set_xticks([])
# ax[0].set_yticks([])
# ax[1].imshow(X)
# ax[1].set_xticks([])
# ax[1].set_yticks([])

conv2d = nn.LazyConv2d(1, kernel_size=kernel_size)
comp_conv2d(conv2d, x_padded).shape

torch.Size([15, 15])

## Stride

When computing the cross-correlation,
we start with the convolution window
at the upper-left corner of the input tensor,
and then slide it over all locations both down and to the right.
In the previous examples, we defaulted to sliding one element at a time.
However, sometimes, either for computational efficiency
or because we wish to downsample,
we move our window more than one element at a time,
skipping the intermediate locations. This is particularly useful if the convolution 
kernel is large since it captures a large area of the underlying image.

We refer to the number of rows and columns traversed per slide as *stride*.
So far, we have used strides of 1, both for height and width.
Sometimes, we may want to use a larger stride.
:numref:`img_conv_stride` shows a two-dimensional cross-correlation operation
with a stride of 3 vertically and 2 horizontally.
The shaded portions are the output elements as well as the input and kernel tensor elements used for the output computation: $0\times0+0\times1+1\times2+2\times3=8$, $0\times0+6\times1+0\times2+0\times3=6$.
We can see that when the second element of the first column is generated,
the convolution window slides down three rows.
The convolution window slides two columns to the right
when the second element of the first row is generated.
When the convolution window continues to slide two columns to the right on the input,
there is no output because the input element cannot fill the window
(unless we add another column of padding).

![Cross-correlation with strides of 3 and 2 for height and width, respectively.](../img/conv-stride.svg)
:label:`img_conv_stride`

In general, when the stride for the height is $s_\textrm{h}$
and the stride for the width is $s_\textrm{w}$, the output shape is

$$\lfloor(n_\textrm{h}-k_\textrm{h}+p_\textrm{h}+s_\textrm{h})/s_\textrm{h}\rfloor \times \lfloor(n_\textrm{w}-k_\textrm{w}+p_\textrm{w}+s_\textrm{w})/s_\textrm{w}\rfloor.$$

If we set $p_\textrm{h}=k_\textrm{h}-1$ and $p_\textrm{w}=k_\textrm{w}-1$,
then the output shape can be simplified to
$\lfloor(n_\textrm{h}+s_\textrm{h}-1)/s_\textrm{h}\rfloor \times \lfloor(n_\textrm{w}+s_\textrm{w}-1)/s_\textrm{w}\rfloor$.
Going a step further, if the input height and width
are divisible by the strides on the height and width,
then the output shape will be $(n_\textrm{h}/s_\textrm{h}) \times (n_\textrm{w}/s_\textrm{w})$.
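
The shape formula can be cross-checked against PyTorch. In the sketch below, `conv_output_shape` is a hypothetical helper, not a library function. Note that the `padding` argument of `nn.Conv2d` gives the padding per side, so the totals $p_\textrm{h}$ and $p_\textrm{w}$ in the formula are twice that value.

```python
import torch
from torch import nn

def conv_output_shape(n, k, pad_per_side, s):
    """Output (height, width) for input n, kernel k, per-side padding, stride s."""
    return tuple(
        (n_i - k_i + 2 * p_i + s_i) // s_i
        for n_i, k_i, p_i, s_i in zip(n, k, pad_per_side, s)
    )

configs = [
    # (input size, kernel size, padding per side, stride)
    ((8, 8), (3, 3), (1, 1), (2, 2)),
    ((8, 8), (3, 5), (0, 1), (3, 4)),
    ((5, 5), (2, 2), (0, 0), (3, 2)),
]

for n, k, p, s in configs:
    conv2d = nn.Conv2d(1, 1, kernel_size=k, padding=p, stride=s)
    actual = tuple(conv2d(torch.rand(1, 1, *n)).shape[2:])
    print(n, k, p, s, "->", conv_output_shape(n, k, p, s), actual)
```

The first configuration is the case $p_\textrm{h}=k_\textrm{h}-1$ with a stride that divides the input size, so the output is $(8/2) \times (8/2)$.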

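Below, we [**set the stride to 3 on the height and 2 on the width**]
for a $5 \times 5$ input and a $2 \times 2$ kernel without padding.
By the formula above, the output shape is
$\lfloor(5-2+0+3)/3\rfloor \times \lfloor(5-2+0+2)/2\rfloor = 2 \times 2$.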


In [37]:
X = torch.rand([5, 5])
conv2d = nn.LazyConv2d(1, kernel_size=(2, 2), stride=(3, 2))
comp_conv2d(conv2d, X)

tensor([[ 0.1125,  0.0626],
        [ 0.0193, -0.0944]], grad_fn=<ViewBackward0>)

In [65]:
from typing import Any


class Conv:

    def __init__(self, kernel_size, bias=True, pad=(0, 0), stride=(1, 1)) -> None:
        # pad and stride are (height, width) pairs; pad is applied on both sides.
        # The stride defaults to (1, 1): a stride of 0 would divide by zero below.
        self.kernel_size = kernel_size
        self.weights = torch.rand(kernel_size)
        self.bias = torch.rand(1) if bias else 0

        self.pad, self.stride = pad, stride

    def __call__(self, x) -> Any:
        kh, kw = self.kernel_size
        sh, sw = self.stride
        # F.pad takes the padding of the last dimension first: (left, right, top, bottom)
        x = F.pad(x, [self.pad[1], self.pad[1], self.pad[0], self.pad[0]])
        # Output shape: floor((n - k + s) / s), where n already includes the padding
        h = (x.shape[0] + sh - kh) // sh
        w = (x.shape[1] + sw - kw) // sw
        res = torch.zeros((h, w))
        for i in range(h):
            for j in range(w):
                window = x[i * sh : i * sh + kh, j * sw : j * sw + kw]
                res[i, j] = (window * self.weights).sum() + self.bias

        return res



def test_corr2d_final():

    x = torch.rand((23, 23))

    for ks in [(5, 3), (7, 3), (9, 7), (3, 11)]:
        for pad in [(2, 1), (2, 2), (3, 5), (9, 13)]:
            for st in [(1, 3), (3, 5), (7, 3)]:

                conv2d = nn.Conv2d(1, 1, kernel_size=ks, stride=st, padding=pad)
                corr = Conv(kernel_size=ks, bias=True, pad=pad, stride=st)

                # Copy the reference layer's parameters into the naive implementation
                corr.weights = conv2d.weight.reshape(ks)
                corr.bias = conv2d.bias.reshape(-1)

                diff = (corr(x) - conv2d(x.reshape(1, 1, *x.shape))).abs()
                assert (diff > 1e-6).sum() == 0, "naive conv did not match torch result"

test_corr2d_final()