## From Fully-Connected Layers to Convolutions

in other neural network models such as MLPs, we anticipate that the patterns we seek could involve interactions among features, but we do not assume any structure concerning how these features interact. 

if we do not know how to craft architectures that exploit such structure, an MLP may be the best we can do. but for high-dimensional data, such MLPs can grow unwieldy. 

for example, consider a neural network that classifies pictures of dogs and cats. if each picture is 1 megapixel, the input has 1,000,000 dimensions, and a fully connected hidden layer of size 1000 would need 10^6 × 10^3 = 10^9 (1 billion) parameters. even with a lot of patience and very fast GPUs, training such a model is infeasible. 

one might argue that we do not need a 1-megapixel image. but even if we reduce the resolution to 100,000 pixels, a hidden layer of size 1000 still needs 100 million parameters, and 1000 hidden units is far too few to learn good image representations, so a practical network would still need billions of parameters. fitting that many parameters requires collecting an enormous dataset. yet humans distinguish cats from dogs easily. this is because images exhibit rich structure that can be exploited by humans and machine learning models alike. CNNs are one creative way to exploit the structure in such images. 
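the parameter counts above can be checked with a few lines of arithmetic. this is only a sketch: biases are ignored, and the 10,000-unit hidden layer is a made-up value to show how quickly a wider layer brings the count back to a billion.

```python
def dense_params(num_inputs, num_hidden):
    """Number of weights in one fully connected layer (biases ignored)."""
    return num_inputs * num_hidden

one_megapixel = dense_params(1_000_000, 1_000)  # 1-megapixel image, 1000 hidden units
downsampled = dense_params(100_000, 1_000)      # 100,000-pixel image, same hidden layer
wider_layer = dense_params(100_000, 10_000)     # hypothetical wider hidden layer

print(f"{one_megapixel:,}")  # 1,000,000,000
print(f"{downsampled:,}")    # 100,000,000
print(f"{wider_layer:,}")    # 1,000,000,000
```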

#### Invariance

spatial invariance: the model should be able to identify a target regardless of its position in the image. for example, the model should be able to identify a pig in the sky or a plane on the ground. 

translation invariance: in the early layers of our network, the model should respond similarly to a patch regardless of its position in the image. 

locality principle: in the early layers of our network, the network should focus on local regions regardless of the contents of the image in other distant regions. eventually, these local representations can be aggregated to make predictions at the whole image level. 

#### Constraining the MLP

let X be a two-dimensional image that is the input to the network, and let H be its hidden representation. both are two-dimensional tensors of the same shape, and X(i, j) and H(i, j) denote the pixel and the hidden unit at location (i, j). we assume that both the image and its hidden representation possess spatial structure. 

so that each hidden unit can receive input from every pixel, we switch from weight matrices to fourth-order weight tensors W. 

![hidden cell equation](./images/6/6.1.png "Hidden Layer Equation")

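the switch from W to V is purely cosmetic: there is a one-to-one correspondence between the coefficients of the two fourth-order tensors. we simply re-index the subscripts (k, l) so that

k = i + a,

l = j + b

i.e., V(i, j, a, b) = W(i, j, i + a, j + b).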

the indices a and b run over both positive and negative offsets, covering the entire image. 

i.e., for any given location (i, j) in the hidden representation H, we compute the value by summing over the pixels in X, centered at (i, j) and weighted by V(i, j, a, b).
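to see that the re-indexing changes nothing, here is a minimal sketch that computes H once with W(i, j, k, l) and once with V(i, j, a, b) = W(i, j, i + a, j + b). the 3x3 image size, the random values, and the function names are illustrative assumptions, and the bias term is left out.

```python
import random

random.seed(0)
n = 3  # hypothetical tiny n x n image

X = [[random.random() for _ in range(n)] for _ in range(n)]
# fourth-order weights W[i][j][k][l]: one weight per (hidden location, pixel) pair
W = [[[[random.random() for _ in range(n)] for _ in range(n)]
      for _ in range(n)] for _ in range(n)]

def hidden_from_W(X, W):
    """H[i][j] = sum over k, l of W[i][j][k][l] * X[k][l]."""
    n = len(X)
    return [[sum(W[i][j][k][l] * X[k][l] for k in range(n) for l in range(n))
             for j in range(n)] for i in range(n)]

def hidden_from_V(X, W):
    """Same sum, indexed by offsets: V(i, j, a, b) = W(i, j, i + a, j + b)."""
    n = len(X)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for a in range(-i, n - i):      # keeps 0 <= i + a < n
                for b in range(-j, n - j):  # keeps 0 <= j + b < n
                    V_ijab = W[i][j][i + a][j + b]
                    H[i][j] += V_ijab * X[i + a][j + b]
    return H

H_w = hidden_from_W(X, W)
H_v = hidden_from_V(X, W)
same = all(abs(H_w[i][j] - H_v[i][j]) < 1e-12 for i in range(n) for j in range(n))
print(same)  # True
```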

#### Translation Invariance

translation invariance means that a shift in the input should lead to the same shift in the hidden representation H. this is only possible if V and U do not depend on (i, j), i.e., V(i, j, a, b) = V(a, b) and U is a constant u. so we simplify the definition of H as follows. 

![hidden cell equation](./images/6/6.2.png "Hidden Layer Equation")

this is called a _convolution_.

we are essentially weighting the pixels at (i+a, j+b) in the vicinity of location (i, j) with coefficients V(a,b) to obtain the value H(i,j).

V(a,b) requires far fewer parameters than V(i,j,a,b) because it no longer depends on the location within the image. 

#### Locality

locality means that we should not have to look very far from location (i, j) to gather the information relevant to H(i, j). this means that outside some range, |a| > Δ or |b| > Δ, we set V(a, b) = 0. 

the equation for H then becomes

$$[\mathbf{H}]_{i, j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} [\mathbf{V}]_{a, b}  [\mathbf{X}]_{i+a, j+b}$$

this equation, in a nutshell, is the convolution layer. 
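a minimal pure-Python sketch of this equation follows. the function name `conv_layer`, the example kernel, and the choice to treat pixels outside the image as zero are assumptions made for illustration. the two checks at the end show the two properties built into the layer: shifting the input shifts the output, and a distant pixel does not affect H(i, j).

```python
def conv_layer(X, V, u=0.0):
    """H[i][j] = u + sum over a, b in [-delta, delta] of V(a, b) * X[i+a][j+b].

    V is a (2*delta + 1) x (2*delta + 1) nested list, so V(a, b) is stored at
    V[a + delta][b + delta]. Pixels outside the image count as zero (assumption).
    """
    delta = len(V) // 2
    n_h, n_w = len(X), len(X[0])
    H = [[u] * n_w for _ in range(n_h)]
    for i in range(n_h):
        for j in range(n_w):
            for a in range(-delta, delta + 1):
                for b in range(-delta, delta + 1):
                    if 0 <= i + a < n_h and 0 <= j + b < n_w:
                        H[i][j] += V[a + delta][b + delta] * X[i + a][j + b]
    return H

def blank(n):
    return [[0.0] * n for _ in range(n)]

# hypothetical 3x3 kernel, so delta = 1
V = [[1.0, 2.0, 1.0],
     [0.0, 0.0, 0.0],
     [-1.0, -2.0, -1.0]]

X = blank(7)
X[2][2] = 1.0          # single bright pixel at (2, 2)
X_shifted = blank(7)
X_shifted[3][4] = 1.0  # the same pixel moved by (1, 2)

H = conv_layer(X, V)
H_shifted = conv_layer(X_shifted, V)

# translation: the response pattern moves by the same (1, 2) offset
shift_ok = all(H_shifted[i + 1][j + 2] == H[i][j]
               for i in range(6) for j in range(5))

# locality: a pixel more than delta away from (1, 2) leaves H[1][2] unchanged
X_far = [row[:] for row in X]
X_far[6][6] = 5.0
local_ok = conv_layer(X_far, V)[1][2] == H[1][2]

print(shift_ok, local_ok)  # True True
```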

convolutional neural networks are a family of neural networks that contain these convolution layers. 

V is referred to as a convolution kernel, a filter, or simply the layer's weights, which are learned. 

with the help of this kernel, the number of parameters goes down from millions to a few hundred without altering the dimensionality of the inputs or hidden representations. the price we pay is that the features are now translation invariant and the layer can only incorporate local information when determining the value of each hidden activation. 
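the reduction can be tallied for the one-megapixel example. this is a rough count that ignores biases and channels; the radius Δ = 5 is a hypothetical choice.

```python
n = 1000   # the image is n x n, i.e. one megapixel
delta = 5  # hypothetical locality radius

full = (n * n) * (n * n)      # V(i, j, a, b): every hidden location sees every pixel
shared = (2 * n - 1) ** 2     # V(a, b): offsets a, b each range over -(n-1)..(n-1)
local = (2 * delta + 1) ** 2  # V(a, b) restricted to |a|, |b| <= delta

print(f"{full:,}")    # 1,000,000,000,000
print(f"{shared:,}")  # 3,996,001
print(f"{local:,}")   # 121
```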

all of this rests on inductive biases. if those biases do not agree with reality, for example if the images are not actually translation invariant, the model may struggle even to fit the training data. 

#### Convolution

in mathematics, the convolution between two functions f and g is defined as $$(f * g)(x) = \int f(z) g(x - z) \, dz$$

this means we measure the overlap between f and g when one function is _flipped_ and shifted by x. when we have discrete objects, the integral is replaced by a sum (as in the case of the convolution kernel).

applying the above convolution formula in the case of the convolution kernel, we get $$(f * g)(i, j) = \sum_a \sum_b f(a, b) g(i - a, j - b)$$

the difference from the earlier equation for H is that (i + a, j + b) is replaced by (i − a, j − b). 

this difference is mostly cosmetic, since flipping the kernel, i.e., using V(−a, −b) in place of V(a, b), turns one form into the other. the earlier definition of H more properly describes what is called a _cross-correlation_. more on this later.
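a small sketch of the two sums at a single location makes the relationship concrete. the 3x3 values and the function names are made up for illustration; V(a, b) is stored at V[a + Δ][b + Δ].

```python
def cross_correlate_at(V, X, i, j, delta):
    """sum over a, b of V(a, b) * X[i + a][j + b]"""
    return sum(V[a + delta][b + delta] * X[i + a][j + b]
               for a in range(-delta, delta + 1)
               for b in range(-delta, delta + 1))

def convolve_at(V, X, i, j, delta):
    """sum over a, b of V(a, b) * X[i - a][j - b]"""
    return sum(V[a + delta][b + delta] * X[i - a][j - b]
               for a in range(-delta, delta + 1)
               for b in range(-delta, delta + 1))

def flip(V):
    """Reverse both axes, so the entry for (a, b) becomes the entry for (-a, -b)."""
    return [row[::-1] for row in V[::-1]]

X = [[0, 1, 2],
     [3, 4, 5],
     [6, 7, 8]]
V = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

corr = cross_correlate_at(V, X, 1, 1, 1)
conv = convolve_at(V, X, 1, 1, 1)
conv_flipped = convolve_at(flip(V), X, 1, 1, 1)

print(corr, conv, conv_flipped)  # 240 120 240
```

the two operations differ for an asymmetric kernel, but convolving with the flipped kernel reproduces the cross-correlation. since the kernel is learned from data, either convention leads to an equivalent layer.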

#### Applying above concepts theoretically

consider the _where's waldo_ problem. we need to learn a model in which the convolution layer looks at windows of a given size, weights the pixels in each window according to V, and scores how much _waldoness_ that window contains. 

to locate waldo, we then find the peak in the hidden layer representation.
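finding the peak is a plain argmax over the hidden representation. the scores below are made-up numbers standing in for the output of a learned layer, and `peak_location` is a hypothetical helper.

```python
def peak_location(H):
    """Return ((i, j), value) for the largest entry of a 2-D nested list H."""
    best = max(
        ((i, j, value) for i, row in enumerate(H) for j, value in enumerate(row)),
        key=lambda entry: entry[2],
    )
    return (best[0], best[1]), best[2]

# hypothetical "waldoness" scores, one per window position
waldoness = [[0.1, 0.3, 0.2],
             [0.0, 0.9, 0.4],
             [0.2, 0.1, 0.5]]

location, score = peak_location(waldoness)
print(location, score)  # (1, 1) 0.9
```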

#### Channels

the problem with the above approach is that images are not two-dimensional objects but third-order tensors, with height, width, and a channel axis that indexes one of three channels: red, green, and blue. 

this means the image representation goes from X(i,j) to X(i,j,c), and similarly the convolution kernel goes from V(a,b) to V(a,b,c).

since the hidden representation should also have channels, we add a fourth coordinate d to the kernel, giving V(a,b,c,d). the equation for the hidden representation, adapted to third-order tensors, is

$$[\mathbf{H}]_{i, j, d} = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} \sum_c [\mathbf{V}]_{a, b, c, d} [\mathbf{X}]_{i+a, j+b, c}$$

here d indexes the output channels in the hidden representation. 
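the multi-channel equation can be sketched with nested lists. the layout X[row][col][channel], the zero treatment of out-of-image pixels, and the example weights are assumptions. the example uses Δ = 0 (a 1x1 kernel) so the channel mixing is easy to follow: output channel 0 sums red, green, and blue, and output channel 1 copies red.

```python
def conv_layer_multichannel(X, V, delta):
    """H[i][j][d] = sum over a, b, c of V(a, b, c, d) * X[i+a][j+b][c].

    V(a, b, c, d) is stored at V[a + delta][b + delta][c][d].
    Pixels outside the image count as zero (assumption).
    """
    n_h, n_w, n_c = len(X), len(X[0]), len(X[0][0])
    n_d = len(V[0][0][0])
    H = [[[0.0] * n_d for _ in range(n_w)] for _ in range(n_h)]
    for i in range(n_h):
        for j in range(n_w):
            for a in range(-delta, delta + 1):
                for b in range(-delta, delta + 1):
                    if not (0 <= i + a < n_h and 0 <= j + b < n_w):
                        continue  # skip offsets that fall outside the image
                    for c in range(n_c):
                        for d in range(n_d):
                            H[i][j][d] += (V[a + delta][b + delta][c][d]
                                           * X[i + a][j + b][c])
    return H

# hypothetical 2x2 image with 3 channels (red, green, blue)
X = [[[1, 2, 3], [4, 5, 6]],
     [[7, 8, 9], [10, 11, 12]]]

# 1x1 kernel (delta = 0), 3 input channels, 2 output channels
V = [[[[1, 1],    # red:   contributes to d = 0 and d = 1
       [1, 0],    # green: contributes to d = 0 only
       [1, 0]]]]  # blue:  contributes to d = 0 only

H = conv_layer_multichannel(X, V, delta=0)
print(H[0][0], H[1][1])  # [6.0, 1.0] [33.0, 10.0]
```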

## Convolutions for Images

#### The Cross-Correlation Operation

the _convolution_ layer seen before is, strictly speaking, a misnomer: the operation it performs is more accurately described as a cross-correlation, in which an input tensor and a kernel tensor are combined to produce an output tensor. 

if we ignore channels, this is how it works:

![cross-correlation](./images/6/6.3.svg "Cross-Correlation")

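the kernel window slides over the input tensor, from left to right and top to bottom. at each position, the input values inside the window are multiplied elementwise by the kernel and summed to give one output value. since the kernel is smaller than the input, the size of the output is $$(n_h - k_h + 1) \times (n_w - k_w + 1)$$ where n_h and n_w are the height and width of the input and k_h and k_w are the height and width of the kernel. 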

we can pad the input image with zeros around its border so that the hidden representation has the same size as the input. 
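the output size, with and without padding, can be sketched as a small helper. `output_shape` and its `pad_h` / `pad_w` parameters are hypothetical; the sketch assumes the same number of zero rows (or columns) is added on each side, which adds 2 * pad to each dimension before the kernel is applied.

```python
def output_shape(n_h, n_w, k_h, k_w, pad_h=0, pad_w=0):
    """Output height and width of a cross-correlation with stride 1.

    pad_h / pad_w: zero rows / columns added on EACH side of the input.
    """
    return (n_h + 2 * pad_h - k_h + 1, n_w + 2 * pad_w - k_w + 1)

print(output_shape(3, 3, 2, 2))                    # (2, 2)
print(output_shape(8, 8, 3, 3))                    # (6, 6)
print(output_shape(8, 8, 3, 3, pad_h=1, pad_w=1))  # (8, 8)
```

with a 3x3 kernel, one ring of zeros is enough to keep the output the same size as the input.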

```python
import tensorflow as tf

tf.get_logger().setLevel('INFO')  # reduce TensorFlow log output


def corr2d(X, K):
    """Compute the 2-D cross-correlation of input X with kernel K."""
    h, w = K.shape
    # the output shrinks by (kernel size - 1) along each axis
    Y = tf.Variable(tf.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1)))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # multiply the current window elementwise by K and sum
            Y[i, j].assign(tf.reduce_sum(X[i:i + h, j:j + w] * K))
    return Y
```

```python
X = tf.constant([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = tf.constant([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)
```

```
<tf.Variable 'Variable:0' shape=(2, 2) dtype=float32, numpy=
array([[19., 25.],
       [37., 43.]], dtype=float32)>
```

#### Convolution Layer