## From Fully-Connected Layers to Convolutions

in the mlps seen so far, we anticipate that the patterns we seek could involve interactions among features, but we do not assume any structure a priori concerning how these features interact. 

if we do not know how to design crafty architectures that exploit such structure, an mlp may be the best we can do. but for high dimensional data, such mlps can grow unwieldy. 

for example, consider a neural network that classifies pictures of dogs and cats. if each picture is 1 megapixel, the input has 1000000 dimensions, and a single fully connected hidden layer with 1000 units already needs 10^6 x 10^3 = 10^9 (1 billion) parameters. even with a lot of patience and very powerful GPUs, training such a model is infeasible. 

one might argue that we may not need a 1 megapixel image. but even if we reduce the resolution to 100000 pixels, a 1000-unit hidden layer still needs 10^8 weights, and a layer that small is probably too small to learn good image representations, so a realistic model would still run into billions of parameters. fitting that many parameters also requires collecting an enormous dataset. yet, humans are easily able to distinguish cats from dogs. this is because images exhibit rich structure that can be exploited by humans and machine learning models alike. CNNs are one creative way to exploit the structure in such images. 
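
the parameter counts above are simple products, which the following sketch reproduces. the helper name `dense_weights` and the 10000-unit "wider" layer are assumptions made for illustration, and biases are ignored.

```python
def dense_weights(num_inputs, num_hidden):
    """Weights in one fully connected layer: every input feeds every hidden unit."""
    return num_inputs * num_hidden

one_megapixel = dense_weights(1_000_000, 1_000)   # 1 megapixel image, 1000 hidden units
reduced = dense_weights(100_000, 1_000)           # lower resolution, same hidden layer
wider = dense_weights(100_000, 10_000)            # hypothetical wider hidden layer

print(f"{one_megapixel:,}")   # 1,000,000,000
print(f"{reduced:,}")         # 100,000,000
print(f"{wider:,}")           # 1,000,000,000
```

lowering the resolution helps only until the hidden layer has to grow again, at which point the count is back in the billions.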

#### Invariance

spatial invariance: the model's ability to identify a target regardless of its position in the image. for example, the model should be able to identify pigs in the sky and planes on the ground. 

translation invariance: in the early layers of our network, the model should respond similarly to a patch regardless of its position in the image. 

locality principle: in the early layers of our network, the network should focus on local regions regardless of the contents of the image in other distant regions. eventually, these local representations can be aggregated to make predictions at the whole image level. 

#### Constraining the MLP

let X be a two dimensional image which is the input to the network, and let H be its hidden representation. both are two dimensional tensors of the same shape, and we assume that both the image and the hidden representation possess spatial structure. 

we switch from weight matrices to tensors of fourth order. 

![hidden cell equation](./images/6/6.1.png "Hidden Layer Equation")

the transformation from W to V is purely cosmetic, to have a one to one correspondence with the weight tensor and the image. 

k = i + a,

l = j + b

the indices a and b run over both positive and negative offsets, covering the entire image. 

i.e., for any given location (i, j) in the hidden representation H, we compute the value by summing over the pixels in X, centered at (i, j) and weighted by V(i, j, a, b).
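
a small numerical check that the re-indexing k = i + a, l = j + b changes nothing. this is a sketch on a hypothetical 3x3 image with random weights; the bias term is left out and all names are illustrative.

```python
import random

n = 3  # hypothetical 3x3 image, small enough to enumerate every index
random.seed(0)
X = [[random.random() for _ in range(n)] for _ in range(n)]
# W[i][j][k][l]: weight connecting input pixel (k, l) to hidden unit (i, j)
W = [[[[random.random() for _ in range(n)] for _ in range(n)]
      for _ in range(n)] for _ in range(n)]

def hidden_with_W(i, j):
    """Sum over absolute pixel positions (k, l)."""
    return sum(W[i][j][k][l] * X[k][l] for k in range(n) for l in range(n))

def hidden_with_V(i, j):
    """Sum over offsets (a, b), using V[i, j, a, b] = W[i, j, i + a, j + b]."""
    total = 0.0
    for a in range(-i, n - i):        # keeps k = i + a inside the image
        for b in range(-j, n - j):    # keeps l = j + b inside the image
            total += W[i][j][i + a][j + b] * X[i + a][j + b]
    return total

max_gap = max(abs(hidden_with_W(i, j) - hidden_with_V(i, j))
              for i in range(n) for j in range(n))
print(max_gap < 1e-12)   # True
```

both functions visit exactly the same (weight, pixel) pairs; only the way the pairs are addressed differs.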

#### Translation Invariance

translational invariance means a shift in the input should lead to the same shift in the hidden representation H. this is possible when V and U do not depend on i and j. so we simplify the definition of H as follows. 

![hidden cell equation](./images/6/6.2.png "Hidden Layer Equation")

this is called a _convolution_

we are essentially weighting the pixels at (i+a, j+b) in the vicinity of location (i, j) with coefficients V(a,b) to obtain the value H(i,j).

V(a,b) requires far fewer parameters than V(i,j,a,b) because the weights no longer depend on the location within the image. 
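
the "same shift" property can be checked directly once the weights V(a,b) are shared across locations. below is a minimal pure-python sketch; `corr2d`, `shift_right`, the toy image and the kernel values are all assumptions made for the demonstration.

```python
def corr2d(X, K):
    """Cross-correlate 2-D lists X and K (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]

def shift_right(X, s):
    """Shift every row s columns to the right, filling with zeros."""
    return [[0.0] * s + row[:len(row) - s] for row in X]

X = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 2.0, 0.0, 0.0, 0.0],
     [0.0, 3.0, 4.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
K = [[1.0, -1.0], [0.5, 2.0]]   # arbitrary kernel shared by every location

H = corr2d(X, K)
H_shifted = corr2d(shift_right(X, 2), K)
# the response pattern moves by the same 2 columns as the input did
print(H_shifted == shift_right(H, 2))   # True
```

with location-dependent weights V(i,j,a,b) this equality would generally fail, because the shifted patch would meet different weights.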

#### Locality

locality means we should not have to look very far from location (i,j) in order to gain relevant information to assess what's going on at H(i,j). this means we set a limit Δ on the offsets: whenever |a| > Δ or |b| > Δ, we set V(a,b) to 0. 

the updated equation of this becomes

$$[\mathbf{H}]_{i, j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} [\mathbf{V}]_{a, b}  [\mathbf{X}]_{i+a, j+b}$$

this equation in a nutshell, is the convolution layer. 
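
the equation translates almost line by line into code. the sketch below assumes that pixels outside the image count as zero (the text does not say how borders are handled) and stores V as a (2Δ+1) x (2Δ+1) list; the function name `conv_layer` is illustrative.

```python
def conv_layer(X, V, u, delta):
    """H[i][j] = u + sum over a, b in [-delta, delta] of V[a][b] * X[i+a][j+b].

    Offset a is stored in row a + delta of V. Pixels outside the image are
    treated as zero, which is an assumption of this sketch.
    """
    n_h, n_w = len(X), len(X[0])
    H = [[u] * n_w for _ in range(n_h)]
    for i in range(n_h):
        for j in range(n_w):
            for a in range(-delta, delta + 1):
                for b in range(-delta, delta + 1):
                    if 0 <= i + a < n_h and 0 <= j + b < n_w:
                        H[i][j] += V[a + delta][b + delta] * X[i + a][j + b]
    return H

X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
# identity kernel: only the centre offset (0, 0) is non-zero
V = [[0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0]]
H = conv_layer(X, V, u=0.5, delta=1)
print(H[1][1])   # 5.5
```

with Δ = 1 the layer has (2Δ+1)^2 = 9 weights plus one bias, no matter how large the image is.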

convolutional neural networks are a family of neural networks that contain these convolution layers. 

V is referred to as a convolution kernel, a filter, or simply the layer's weights, and it is learned. 

with the help of this kernel, the number of parameters goes down from millions to a few hundred without altering the dimensionality of the inputs or hidden representations. the price is that the features are now translation invariant and that the layer can only incorporate local information when computing each hidden value. 

these are inductive biases. if the biases do not agree with reality, for example if the images are not actually translation invariant, the model will struggle to fit our training data. 

#### Convolution

in mathematics, the convolution between two functions f and g is defined as

$$(f * g)(x) = \int f(z)\, g(x - z)\, dz$$

this means we measure the overlap between f and g when one function is _flipped_ and shifted by x. when we have discrete objects, the integral is replaced by sum ( like in the case of convolution kernel)

applying the above convolution formula in the case of the convolution kernel, we get

$$(f * g)(i, j) = \sum_a \sum_b f(a, b)\, g(i - a, j - b)$$

the difference between the previous notation and the current notation is that i+a, j+b are replaced by i-a and j-b. 

this difference is mostly cosmetic, since either notation can be matched to the other by flipping the kernel. strictly speaking, our original definition (with i+a, j+b) describes what is called a _cross correlation_. more on this later.

#### Applying above concepts theoretically

consider the _where's waldo_ problem. we need to learn a model in which the convolution layer scans windows of a given size, weights the pixels in each window according to V, and so scores how much _waldoness_ each window contains. 

the window where waldo is most likely to be is then the peak in the hidden layer representation.
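
a toy version of this search, as a sketch only: the "detector" weights are set by hand to a hypothetical 2x2 template instead of being learned, and the image is all zeros apart from one copy of that template. plain correlation is not a reliable template matcher in general (bright regions also score high), so this only illustrates "score every window, then take the peak".

```python
def corr2d(X, K):
    """Cross-correlate 2-D lists X and K (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]

def argmax2d(H):
    """Return (row, col) of the largest entry of a 2-D list."""
    return max(((i, j) for i in range(len(H)) for j in range(len(H[0]))),
               key=lambda ij: H[ij[0]][ij[1]])

# toy image: zeros with one 2x2 patch equal to the template
image = [[0.0] * 6 for _ in range(5)]
pattern = [[1.0, 2.0], [3.0, 4.0]]   # hypothetical "waldo" template
top, left = 2, 3
for a in range(2):
    for b in range(2):
        image[top + a][left + b] = pattern[a][b]

waldoness = corr2d(image, pattern)   # one score per window position
print(argmax2d(waldoness))           # (2, 3)
```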

#### Channels

the problem with the above approach is that images are not two dimensional objects but third order tensors, where the third axis indexes one of three channels: red, green and blue. 

this means the image representation goes from X(i,j) to X(i,j,c). because the hidden representation also gets a channel axis, indexed by d, the convolution kernel goes from V(a,b) to V(a,b,c,d).

now the equation for the hidden representation is adapted to work with the third order tensors as

$$[\mathbf{H}]_{i, j, d} = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} \sum_c [\mathbf{V}]_{a, b, c, d} [\mathbf{X}]_{i+a, j+b, c}$$

here d indexes the output channels in the hidden representation. 

## Convolutions for Images

#### The Cross-Correlation Operation

the _convolution_ seen before is a misnomer as the operations can more accurately be expressed as cross correlations where input tensor and a kernel tensor are combined to produce an output tensor. 

if we ignore channels, this is how it works 

![Convolution Layer](./images/6/6.4.png "Convolution Layer")

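the kernel window slides over the input tensor, left to right and top to bottom. at each position, the overlapping elements are multiplied element-wise and summed to give one output value. because the kernel has to fit entirely inside the input, the output is smaller than the input. its size is

$$(n_h - k_h + 1) \times (n_w - k_w + 1)$$

where nh and nw are the height and width of the input and kh and kw are the height and width of the kernel. 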

we can pad the input image with zeros around it, to get hidden representations of the same size. 

In [17]:
import tensorflow as tf
tf.get_logger().setLevel('INFO')

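def corr2d(X, K):
    """computes the two dimensional cross correlation of input X with kernel K"""
    h, w = K.shape
    # the output shape is (nh - kh + 1) x (nw - kw + 1)
    Y = tf.Variable(tf.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1)))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # multiply the window at (i, j) element-wise with the kernel and sum
            Y[i, j].assign(tf.reduce_sum(X[i:i+h, j:j+w] * K))
    return Y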

In [18]:
X = tf.constant([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = tf.constant([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)

<tf.Variable 'Variable:0' shape=(2, 2) dtype=float32, numpy=
array([[19., 25.],
       [37., 43.]], dtype=float32)>

#### Convolution Layer

a convolution layer cross correlates the input and kernel and adds a scalar bias to produce an output ( the hidden representation). the two parameters are the kernel and scalar bias. when training models on convolution layers, we typically initialize them randomly, like we do in fully connected layers. 



In [19]:
class Conv2D(tf.keras.layers.Layer):
    def __init__(self, kernel_size):
        super().__init__()
        # keras calls build() with the input shape, so the kernel size is stored here
        self.kernel_size = kernel_size
    def build(self, input_shape):
        initializer = tf.random_normal_initializer()
        # kernel weight
        self.weight = self.add_weight(name = 'w', shape = self.kernel_size, initializer = initializer)
        # scalar bias
        self.bias = self.add_weight(name = 'b', shape = (1, ), initializer = initializer)
    def call(self, inputs):
        # cross correlate the input with the kernel, then add the bias
        return corr2d(inputs, self.weight) + self.bias

#### Object Edge Detection in Images

creating an "image" of size 6x8, middle four columns are black and the rest white. 

In [20]:
X = tf.Variable(tf.ones((6, 8)))
X[:, 2:6].assign(tf.zeros(X[:, 2:6].shape))
X

<tf.Variable 'Variable:0' shape=(6, 8) dtype=float32, numpy=
array([[1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.]], dtype=float32)>

creating a 1x2 kernel. when we perform the cross correlation operation, the output is 0 if horizontally adjacent elements are the same, and non zero otherwise. 

In [21]:
K = tf.constant([[1.0, -1.0]])

In [22]:
Y = corr2d(X, K)
Y

<tf.Variable 'Variable:0' shape=(6, 7) dtype=float32, numpy=
array([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.]], dtype=float32)>

the kernel only detects vertical edges, i.e., it won't respond to horizontal edges such as those in the transposed image.

In [23]:
corr2d(tf.transpose(X), K)

<tf.Variable 'Variable:0' shape=(8, 5) dtype=float32, numpy=
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]], dtype=float32)>
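
to pick up the horizontal edges of the transposed image, the kernel can be transposed as well. a pure-python sketch (plain lists instead of tensors, helper names are illustrative):

```python
def corr2d(X, K):
    """Cross-correlate 2-D lists X and K (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]

# transposed image (8x6): rows 2..5 are black (0), so the edges are horizontal
X_t = [[0.0 if 2 <= i < 6 else 1.0 for _ in range(6)] for i in range(8)]
K = [[1.0, -1.0]]        # the 1x2 kernel: compares horizontal neighbours
K_t = [[1.0], [-1.0]]    # its transpose, a 2x1 kernel: compares vertical neighbours

flat = corr2d(X_t, K)    # 8x5, all zeros
Y_t = corr2d(X_t, K_t)   # 7x6
print([row[0] for row in Y_t])   # [0.0, 1.0, 0.0, 0.0, 0.0, -1.0, 0.0]
```

the 1 marks the white-to-black edge and the -1 the black-to-white edge, now along the vertical direction.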

#### Learning a Kernel

hand-crafting the kernel [1,-1] to detect edges in a small image makes sense. but for larger inputs and multiple convolution layers, it will be impossible to specify by hand precisely what each filter should be. 

we first initialize the kernel as a random tensor, then in each iteration use the squared error to compare Y with the output of the convolution layer and calculate the gradient to update the kernel. 

using the built in keras conv layer for simplicity and ignoring the bias.

In [24]:
conv2d = tf.keras.layers.Conv2D(1, (1, 2), use_bias = False)

# the two dimensional conv layer uses 4 dimensional inputs and outputs
# in the format (example, height, width, channel); here the batch size and number of channels are both 1.
X = tf.reshape(X, (1, 6, 8, 1))
Y = tf.reshape(Y, (1, 6, 7, 1))
# learning rate
lr = 3e-2

# calling the layer once builds it, so that conv2d.weights exists
Y_hat = conv2d(X)

for i in range(10):
    with tf.GradientTape(watch_accessed_variables = False) as g:
        g.watch(conv2d.weights[0])
        Y_hat = conv2d(X)
        # element-wise squared error
        l = (abs(Y_hat - Y)**2)
    # updating the kernel with one step of gradient descent
    update = tf.multiply(lr, g.gradient(l, conv2d.weights[0]))
    weights = conv2d.get_weights()
    weights[0] = conv2d.weights[0] - update
    conv2d.set_weights(weights)
    print(f"Batch: {i+1}, Loss: {tf.reduce_sum(l):.3f}")

Batch: 1, Loss: 0.078
Batch: 2, Loss: 0.050
Batch: 3, Loss: 0.032
Batch: 4, Loss: 0.020
Batch: 5, Loss: 0.013
Batch: 6, Loss: 0.008
Batch: 7, Loss: 0.005
Batch: 8, Loss: 0.003
Batch: 9, Loss: 0.002
Batch: 10, Loss: 0.001


In [25]:
tf.reshape(conv2d.get_weights()[0], (1, 2))

<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[ 0.9961692, -1.0038975]], dtype=float32)>
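
the same update can be written without automatic differentiation, which makes the gradient explicit. this is a sketch under stated assumptions: plain lists instead of tensors, a deterministic all-zero start instead of a random one, and a step size of 2e-2 chosen to be stable for this tiny problem.

```python
def corr2d(X, K):
    """Cross-correlate 2-D lists X and K (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]

# same 6x8 image and edge-detector target as above, as plain lists
X = [[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0] for _ in range(6)]
Y = corr2d(X, [[1.0, -1.0]])

K = [[0.0, 0.0]]   # assumed deterministic start
lr = 2e-2          # assumed step size

for step in range(50):
    Y_hat = corr2d(X, K)
    # d(loss)/dK[0][b] for loss = sum((Y_hat - Y)**2)
    grad = [sum(2 * (Y_hat[i][j] - Y[i][j]) * X[i][j + b]
                for i in range(len(Y)) for j in range(len(Y[0])))
            for b in range(2)]
    K = [[K[0][b] - lr * grad[b] for b in range(2)]]

print([round(k, 3) for k in K[0]])   # [1.0, -1.0]
```

the learned kernel approaches the hand-crafted [1, -1], mirroring what the keras layer finds.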

#### Cross-Correlation and Convolution

in order to perform strict convolution instead of cross correlation, we need to flip the two dimensional kernel both horizontally and vertically and then perform the cross correlation operation. 

since these kernels are learned from data, it doesn't matter whether the convolution layers perform cross correlation or convolution. 

if K is the kernel learned from cross-correlation and K' is learned from strict convolution, K' will be the same as K when flipped both horizontally and vertically. 

in standard deep learning terminology, cross-correlation is often referred to as the convolution even though they are slightly different. 
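
the flip relationship can be verified on the small 3x3 example used earlier. a pure-python sketch; `conv2d_strict` and `flip` are illustrative helper names, and only the "valid" part of the convolution (where the kernel fits inside the input) is computed.

```python
def corr2d(X, K):
    """Cross-correlate 2-D lists X and K (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]

def conv2d_strict(X, K):
    """Strict convolution: the kernel is traversed backwards over each window."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + h - 1 - a][j + w - 1 - b] * K[a][b]
                 for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]

def flip(K):
    """Flip a 2-D kernel both vertically and horizontally."""
    return [row[::-1] for row in K[::-1]]

X = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
K = [[0.0, 1.0], [2.0, 3.0]]

print(corr2d(X, K))          # [[19.0, 25.0], [37.0, 43.0]]
print(conv2d_strict(X, K))   # [[5.0, 11.0], [23.0, 29.0]]
print(conv2d_strict(X, K) == corr2d(X, flip(K)))   # True
```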

#### Feature Map and Receptive Field

the convolution layer's output is often referred to as a feature map, as it can be regarded as the learned features/representations passed to the subsequent layer.

for any element x of some layer, its receptive field refers to all the elements from all previous layers that may affect the calculation of x during forward propagation. note that the receptive field may be larger than the actual size of the input, as it is the set of all earlier values that feed into an element. 

![Convolution Layer](./images/6/6.4.png "Convolution Layer")

for example, in the above image, given the 2x2 kernel, receptive field of 19, is the four elements in the shaded portion of input. now assume, the output of this layer, Y,  is the input for another layer which outputs a single value z. the receptive field of z on Y includes all four elements of Y, but the receptive field of z on input includes all the nine input elements. 

thus if we need more elements in the receptive field of an element in a feature map, we create deeper networks. 
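
for stacked layers with stride 1 and no padding, each k x k layer widens the receptive field by k - 1 in each direction. the helper below is a sketch under exactly those assumptions (square kernels, stride 1); its name is illustrative.

```python
def receptive_field(kernel_sizes):
    """Side length of the input region seen by one output element after
    stacking stride-1, unpadded square kernels of the given sizes."""
    size = 1
    for k in kernel_sizes:
        size += k - 1
    return size

print(receptive_field([2]))         # 2: one 2x2 layer sees a 2x2 patch
print(receptive_field([2, 2]))      # 3: two stacked 2x2 layers see 3x3 inputs
print(receptive_field([3, 3, 3]))   # 7
```

the second line matches the example above: z depends on all nine elements of the 3x3 input.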

## Padding and Stride

as discussed previously, if nh and nw are heights and widths of the input and kh and kw are the height and width of the kernel, the output shape of convolution kernel by default would be (nh-kh+1)x(nw-kw+1).
this means after multiple convolutions, we end up with image shapes that are much smaller than the input shape. 

this can be handled by incorporating padding. in some other cases, we might want to reduce the dimensionality of the input shape drastically. in such cases we make use of striding. 

#### Padding

we basically add filler pixels around the input shape and typically set these filler pixels to 0.

![Padding](./images/6/6.5.svg "Padding")

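in general, if we add a total of ph rows of padding (roughly half on top and half on bottom) and a total of pw columns of padding (roughly half on the left and half on the right), the output shape will be

(nh - kh + ph + 1)x(nw - kw + pw + 1)

in many cases we set ph = kh - 1 and pw = kw - 1 to get input and output shapes of the same height and width. 

if kh is odd, we pad ph/2 rows on both the top and the bottom. if kh is even, one possibility is to pad ⌈ph/2⌉ rows on top and ⌊ph/2⌋ rows on the bottom. the same applies to kw and the columns. 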

cnns commonly use convolution kernels with odd height and width values such as 1,3,5,7. choosing odd kernel sizes has the benefit that we can preserve the spatial dimensionality while padding with the same number of rows on top and bottom and the same number of columns on left and right. 

another benefit of using odd shaped convolution kernels and padding to get outputs that are the same shape as inputs is, the output Y[i,j] is a direct result of cross correlation of the input centered at X[i,j].
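
the padding rule can be sketched for a single axis as follows. the ceil-before / floor-after split for even kernels is one possible convention (libraries may place the extra row on the other side), and the function names are assumptions.

```python
import math

def same_padding(k):
    """Total padding k - 1 along one axis, split into (before, after)."""
    p = k - 1
    return math.ceil(p / 2), math.floor(p / 2)

def out_size(n, k, p):
    """Output length along one axis for input n, kernel k, total padding p."""
    return n - k + p + 1

for k in (1, 3, 5, 4):
    before, after = same_padding(k)
    print(k, (before, after), out_size(8, k, before + after))
```

odd kernels (1, 3, 5) split evenly into (0, 0), (1, 1), (2, 2), the even kernel 4 gives (2, 1), and the output length stays 8 in every case.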

In [26]:
def comp_conv2d(conv2d, X):
    """this function performs reshaping of input and outputs produced by applying the conv2d layer"""
    # (1,) and (1,) denote batch size and number of channels respectively.
    X = tf.reshape(X, (1,) + X.shape + (1,))
    Y = conv2d(X)
    return tf.reshape(Y, Y.shape[1:3])

# padding = 'same' pads the input in such a way that the output and input are the same shape. 
# in this case it adds 2 rows and 2 columns in total, i.e., 1 on each side. 
conv2d = tf.keras.layers.Conv2D(1, kernel_size = 3, padding = 'same')
X = tf.random.uniform(shape = (8, 8))
comp_conv2d(conv2d, X).shape

TensorShape([8, 8])

the output and input are the same shape as a result of padding = 'same' no matter the shape of kernel. 

In [27]:
conv2d = tf.keras.layers.Conv2D(1, kernel_size = (5, 3), padding = 'same')
comp_conv2d(conv2d, X).shape

TensorShape([8, 8])

#### Stride

in normal cross correlation, we moved the kernel over the input one element at a time. however, to decrease computation time or to downsample the input, we can move the window more than one element at a time, skipping the intermediate locations. 

we refer to the number of rows and columns traversed per slide as a _stride_. 

!["3 2 striding"](./images/6/6.6.svg)

in the above image, cross correlation is applied with strides of 3 and 2 for height and width respectively.
the kernel slides 2 columns at a time horizontally until it is unable to fit one more window (unless more padding is applied),
and it slides 3 rows at a time vertically to produce the output. 
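
a strided cross correlation only changes which window positions are visited. the sketch below uses plain lists and an assumed example: a 3x3 input zero-padded by one pixel on every side, a 2x2 kernel, and strides of 3 (height) and 2 (width). helper names are illustrative.

```python
def corr2d_stride(X, K, s_h=1, s_w=1):
    """Cross-correlate X with K, moving s_h rows / s_w columns per step."""
    h, w = len(K), len(K[0])
    rows = range(0, len(X) - h + 1, s_h)      # top edge of each window
    cols = range(0, len(X[0]) - w + 1, s_w)   # left edge of each window
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
             for j in cols] for i in rows]

def zero_pad(X, p):
    """Add p rows/columns of zeros on every side."""
    width = len(X[0]) + 2 * p
    return ([[0.0] * width for _ in range(p)]
            + [[0.0] * p + row + [0.0] * p for row in X]
            + [[0.0] * width for _ in range(p)])

X = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
K = [[0.0, 1.0], [2.0, 3.0]]
Y = corr2d_stride(zero_pad(X, 1), K, s_h=3, s_w=2)
print(Y)   # [[0.0, 8.0], [6.0, 8.0]]
```

with both strides left at 1 and no padding, the function reduces to the ordinary cross correlation.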

when stride for height is sh and stride for width is sw, the output shape is

$$\lfloor (n_h - k_h + p_h + s_h)/s_h \rfloor \times \lfloor (n_w - k_w + p_w + s_w)/s_w \rfloor$$

In [28]:
conv2d = tf.keras.layers.Conv2D(1, kernel_size = 3, padding = 'same', strides = 2)
comp_conv2d(conv2d, X).shape

TensorShape([4, 4])

the input got downsampled because of the strides. 

In [29]:
# padding = 'valid' means no artificial zeros are added to the data. windows that do not fit entirely inside the input are dropped. 
conv2d = tf.keras.layers.Conv2D(1, kernel_size = (3, 5), padding = 'valid', strides = (3, 4))
comp_conv2d(conv2d, X).shape

TensorShape([2, 1])
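
the output shape formula can be checked against the shapes printed above. a sketch, assuming p is the total padding per axis and that padding = 'same' corresponds to p = k - 1 as in the text (keras may pad differently internally while producing the same shapes here); the function name is illustrative.

```python
def conv_out_shape(n, k, p=(0, 0), s=(1, 1)):
    """floor((n - k + p + s) / s) per axis; p is the TOTAL padding per axis."""
    return tuple((n_ - k_ + p_ + s_) // s_ for n_, k_, p_, s_ in zip(n, k, p, s))

print(conv_out_shape((8, 8), (3, 3), p=(2, 2)))             # (8, 8)
print(conv_out_shape((8, 8), (3, 3), p=(2, 2), s=(2, 2)))   # (4, 4)
print(conv_out_shape((8, 8), (3, 5), s=(3, 4)))             # (2, 1)
```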

## Multiple Inputs and Output Channels

with one channel, we were able to think of inputs, outputs and kernels as two dimensional matrices. but with the addition of channels, they become three dimensional tensors. 

for example, input becomes 3 x h x w. this 3 is referred to as channel dimension. 

#### Multiple Input Channels

when there are multiple channels, the convolution kernel should have the same number of channels as the input so that cross correlation can be performed. 

if ci is the number of input channels, then there needs to be a two dimensional kernel for each channel, so the shape of the kernel will be ci x kh x kw. 

since each of the ci channels has its own kernel, we perform a two dimensional cross correlation between each channel of the input and the corresponding channel of the kernel, and then add the ci results together. this sum is the two dimensional output of the cross correlation between a multi channel input and a multi channel convolution kernel. 

![multi channel cross convolution](./images/6/6.7.svg)

In [30]:
def corr2d_multi_in(X, K):
    return tf.reduce_sum([corr2d(x, k) for x, k in zip(X, K)], axis = 0)

In [31]:
X = tf.constant([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
               [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = tf.constant([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])

corr2d_multi_in(X, K)

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[ 56.,  72.],
       [104., 120.]], dtype=float32)>

#### Multiple Output Channels

so far, the output has always had a single channel. but it turns out to be essential to have multiple channels at each layer, since each layer's outputs are passed as inputs to the next layer. 

in most popular neural network architectures, we purposefully increase the channel dimension as we go deeper into the architecture, typically downsampling to trade off spatial resolution for greater channel depth. 

intuitively, these multiple channels can be thought of as different sets of learned features, but in reality, these channels are not learned independently but are rather optimized to be jointly useful. 

assume ci and co are input and output channels, kh and kw are height and width of kernels. 

we create a kernel tensor of shape ci x kh x kw for each output channel and concatenate them on the output channel dimension to get a kernel of shape co x ci x kh x kw.

In [32]:
def corr2d_multi_in_out(X, K):
    """iterate through the 0th dimension of K and each time perform the cross correlation operation with X. 
    the results are then stacked together"""
    return tf.stack([corr2d_multi_in(X, k) for k in K], 0)

In [33]:
K = tf.stack((K, K+1, K+2), 0)
K.shape

TensorShape([3, 2, 2, 2])

In [34]:
corr2d_multi_in_out(X, K)

<tf.Tensor: shape=(3, 2, 2), dtype=float32, numpy=
array([[[ 56.,  72.],
        [104., 120.]],

       [[ 76., 100.],
        [148., 172.]],

       [[ 96., 128.],
        [192., 224.]]], dtype=float32)>

#### 1x1 Convolutional Layer