## From Fully-Connected Layers to Convolutions

With a fully connected network (MLP), we anticipate that the patterns we seek could involve interactions among features, but we do not assume any structure a priori concerning how those features interact.

If we do not know enough to craft an architecture that exploits such structure, an MLP may be the best we can do. But for high-dimensional data, such MLPs can grow unwieldy.

For example, consider a model that classifies pictures of cats and dogs, where each picture is one megapixel, i.e., 1,000,000 input dimensions. A single fully connected hidden layer of 1,000 units would then need $10^6 \times 10^3 = 10^9$ (one billion) weights. Even with a lot of patience and very fast GPUs, training such a model is impractical.
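The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper name `dense_layer_params` is made up for illustration, and the float32 storage estimate is an assumption that is not part of the original argument.

```python
def dense_layer_params(n_inputs: int, n_hidden: int, bias: bool = True) -> int:
    """Number of learnable parameters in one fully connected layer."""
    weights = n_inputs * n_hidden          # one weight per (input, hidden unit) pair
    biases = n_hidden if bias else 0       # one bias per hidden unit
    return weights + biases

megapixel = 1_000 * 1_000                  # a 1000 x 1000 image, flattened
n_params = dense_layer_params(megapixel, 1_000)

print(f"{n_params:,} parameters")                  # 1,000,001,000 parameters
print(f"{n_params * 4 / 1e9:.1f} GB as float32")   # 4.0 GB as float32
```

The weights dominate the count; the 1,000 biases are negligible next to the billion weights.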

One might argue that we do not need a one-megapixel image. But even if we reduce the image to 100,000 pixels, a hidden layer of 1,000 units still needs $10^8$ (100 million) weights, and 1,000 hidden units is likely far too few to learn good image representations, so a practical fully connected model would still need billions of parameters. Fitting that many parameters also requires collecting an enormous dataset. Yet humans can easily tell cats from dogs. This is because images exhibit rich structure that humans and machine learning models alike can exploit. CNNs are one creative way to exploit the structure in such images.

#### Invariance

Spatial invariance: the model's ability to identify a target regardless of its position in the image. For example, the model should be able to identify a pig in the sky or a plane on the ground.

Translation invariance: in the early layers of the network, the model should respond similarly to the same patch regardless of where it appears in the image.

Locality principle: in the early layers, the network should focus on local regions, without regard for the contents of the image in distant regions. Eventually, these local representations can be aggregated to make predictions at the whole-image level.

#### Constraining the MLP

Let X be a two-dimensional image that is the input to the network, and let H be its hidden representation. Both are two-dimensional tensors of the same shape, and we assume that both possess spatial structure.

So that every hidden unit can receive input from every pixel, we switch from weight matrices to fourth-order weight tensors.

![hidden cell equation](./images/6/6.1.png "Hidden Layer Equation")

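The switch from W to V is purely cosmetic: there is a one-to-one correspondence between the coefficients of the two fourth-order tensors. We simply re-index the subscripts (k, l) so that

k = i + a,

l = j + b

In other words, V(i, j, a, b) = W(i, j, i + a, j + b).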

The indices a and b run over both positive and negative offsets, covering the entire image.

That is, for any location (i, j) in the hidden representation H, we compute the value by summing over the pixels of X, centered at (i, j), each weighted by V(i, j, a, b).
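To make the re-indexing concrete, here is a small sketch that builds a random fourth-order tensor W for a tiny 3 x 3 image, derives V from it with a = k - i and b = l - j, and checks that both forms give the same hidden values. The dictionary-of-tuples storage, the bias tensor `U`, and the function names are assumptions chosen for readability; they are not taken from the original notes.

```python
import math
import random

random.seed(0)
n = 3  # tiny n x n "image" so the fourth-order tensor stays small

X = [[random.random() for _ in range(n)] for _ in range(n)]
U = [[random.random() for _ in range(n)] for _ in range(n)]  # per-location bias

# W[(i, j, k, l)]: weight linking input pixel (k, l) to hidden unit (i, j)
W = {(i, j, k, l): random.random()
     for i in range(n) for j in range(n)
     for k in range(n) for l in range(n)}

# Re-index with offsets: V[(i, j, a, b)] = W[(i, j, i + a, j + b)]
V = {(i, j, k - i, l - j): w for (i, j, k, l), w in W.items()}

def hidden_from_W(i, j):
    return U[i][j] + sum(W[i, j, k, l] * X[k][l]
                         for k in range(n) for l in range(n))

def hidden_from_V(i, j):
    # Offsets cover every (a, b) that keeps (i + a, j + b) inside the image,
    # so they are negative as well as positive.
    return U[i][j] + sum(V[i, j, a, b] * X[i + a][j + b]
                         for a in range(-i, n - i)
                         for b in range(-j, n - j))

same = all(math.isclose(hidden_from_W(i, j), hidden_from_V(i, j))
           for i in range(n) for j in range(n))
print(same)  # True
```

Both tensors hold the same n^4 numbers; only the way they are addressed changes.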

#### Translation Invariance

Translation invariance means that a shift in the input X should simply lead to the same shift in the hidden representation H. This is only possible when V and U do not depend on (i, j), i.e., V(i, j, a, b) = V(a, b) and U is a constant u. So we simplify the definition of H as follows.

![hidden cell equation](./images/6/6.2.png "Hidden Layer Equation")

This is called a _convolution_.

We are essentially weighting the pixels at (i+a, j+b) in the vicinity of location (i, j) with coefficients V(a,b) to obtain the value H(i,j).
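A short sketch can illustrate why sharing V(a, b) across locations gives the shift behaviour described above. It assumes pixels outside the image count as zero and stores the kernel as a dictionary keyed by offset; both choices, and the names `shared_weight_layer` and `shift`, are illustrative rather than anything prescribed by the notes.

```python
def shared_weight_layer(X, V, u=0.0):
    """H[i][j] = u + sum over (a, b) of V[(a, b)] * X[i+a][j+b].

    Pixels outside X are treated as 0 (an assumption of this sketch).
    """
    h, w = len(X), len(X[0])
    H = [[u] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for (a, b), v in V.items():
                if 0 <= i + a < h and 0 <= j + b < w:
                    H[i][j] += v * X[i + a][j + b]
    return H

def shift(M, di, dj):
    """Move every entry of M down by di rows and right by dj columns, filling with 0."""
    h, w = len(M), len(M[0])
    return [[M[i - di][j - dj] if 0 <= i - di < h and 0 <= j - dj < w else 0.0
             for j in range(w)] for i in range(h)]

# A single bright pixel at (1, 1) in a 6 x 6 image
X = [[0.0] * 6 for _ in range(6)]
X[1][1] = 1.0

# The same weights V(a, b) are used at every location (i, j)
V = {(0, 0): 1.0, (0, 1): -1.0, (1, 0): 0.5}

H = shared_weight_layer(X, V)
H_shifted_input = shared_weight_layer(shift(X, 2, 1), V)

# Shifting the input by (2, 1) shifts the hidden representation by (2, 1)
print(H_shifted_input == shift(H, 2, 1))  # True
```

The equality holds here because the bright pixel stays away from the image border; near the border, the zero-padding assumption would clip part of the response.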

V(a,b) requires far fewer parameters than V(i,j,a,b) because the weights no longer depend on the location within the image.
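As a rough illustration of the saving, the counts for an n x n image can be worked out directly. The sketch assumes the offsets a and b each range over -(n - 1), ..., n - 1 so that the kernel can reach the whole image from any location; the function names are hypothetical.

```python
def full_tensor_params(n: int) -> int:
    # V(i, j, a, b): one weight per (hidden location, input pixel) pair
    return (n * n) * (n * n)

def shared_kernel_params(n: int) -> int:
    # V(a, b): offsets a and b each take 2n - 1 values
    return (2 * n - 1) ** 2

n = 1_000  # one-megapixel image
print(f"{full_tensor_params(n):.1e}")    # 1.0e+12
print(f"{shared_kernel_params(n):.1e}")  # 4.0e+06
```

Under these assumptions, dropping the dependence on (i, j) cuts the weight count from about a trillion to about four million.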

#### Locality

Locality means we should not have to look very far from location (i,j) to gather the information relevant to H(i,j). So we pick a range Δ and set V(a,b) = 0 whenever |a| > Δ or |b| > Δ.

The equation then becomes

$$[\mathbf{H}]_{i, j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} [\mathbf{V}]_{a, b}  [\mathbf{X}]_{i+a, j+b}$$

This equation, in a nutshell, is the convolutional layer.
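The equation above translates almost line by line into nested loops. The following is a minimal pure-Python sketch, not an efficient or library-grade implementation: the name `conv_layer`, the nested-list kernel layout, and the choice to treat pixels outside the image as zero are all assumptions made for the example.

```python
def conv_layer(X, V, u=0.0):
    """H[i][j] = u + sum over |a|, |b| <= delta of V(a, b) * X[i+a][j+b].

    V is a (2*delta + 1) x (2*delta + 1) nested list; the weight for offset
    (a, b) is stored at V[a + delta][b + delta]. Pixels outside X count as 0
    (an assumption of this sketch).
    """
    delta = len(V) // 2
    h, w = len(X), len(X[0])
    H = [[u] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for a in range(-delta, delta + 1):
                for b in range(-delta, delta + 1):
                    if 0 <= i + a < h and 0 <= j + b < w:
                        H[i][j] += V[a + delta][b + delta] * X[i + a][j + b]
    return H

# 4 x 6 image: bright left half, dark right half
X = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0] for _ in range(4)]

# delta = 1 kernel that computes X[i][j] - X[i][j+1]
V = [[0.0, 0.0, 0.0],
     [0.0, 1.0, -1.0],
     [0.0, 0.0, 0.0]]

H = conv_layer(X, V)
print(H[0])  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

With Δ = 1 the layer has only (2Δ + 1)² = 9 weights plus the bias u, no matter how large the image is, and the output responds only at the column where the brightness changes.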

