# 2.0 Convolutional Layer
One of the earliest breakthroughs in pattern recognition came from LeCun's digit recognition model [[1]](#ref1), which used convolutional layers to learn features. Its design builds on Fukushima's Neocognitron architecture, which was in turn inspired by how the visual cortex processes images. This mimicry shows that much can be learned from nature.

At present (as of Sep. 1st, 2025), many computer vision models still use convolutional layers, though they are being displaced by Transformers. They remain the basis for most computer vision tasks because they extract local features and patterns far more efficiently than the approaches in the previous notebook. Once we stack layers to add complexity and train on bigger images with more channels, the time and space requirements of our earlier model grow by orders of magnitude, rendering it ill-suited for the task. A convolutional layer is much more efficient: it relies on **weight-sharing** through a small **kernel**, which is justified by two assumptions about images, **translation invariance** and **locality**.
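
To get a feel for what weight-sharing buys us, here is a rough back-of-the-envelope count. The image size, number of hidden units, and number of kernels are hypothetical values picked only for illustration, and the two layers are not equivalent in what they compute, so treat the comparison as a sketch rather than a benchmark.

```python
def dense_weights(height, width, channels, hidden_units):
    """Weights in a fully connected layer: one per (input value, hidden unit) pair."""
    return height * width * channels * hidden_units


def conv_weights(kernel_h, kernel_w, in_channels, num_kernels):
    """Weights in a convolutional layer: each kernel is reused at every image position."""
    return kernel_h * kernel_w * in_channels * num_kernels


# Hypothetical sizes: a 224x224 RGB image, 1000 hidden units, 64 kernels of size 3x3.
dense = dense_weights(224, 224, 3, 1000)
conv = conv_weights(3, 3, 3, 64)

print(f"fully connected: {dense:,} weights")
print(f"convolutional:   {conv:,} weights")
```

The convolutional count does not depend on the image size at all, which is exactly the point of sharing one kernel across every position.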

## 2.1 Convolutional Math
The reader may have heard of convolution in an ODE course (or taught it, if you're reading this, Professor Lin!). The convolution of two functions $f$ and $g$ has the following form:
$$
(f * g)(t) = \int_{-\infty}^{\infty} f(x)g(t-x)\,dx
$$
What convolution does is slide one function over the other and add up the products wherever they overlap, producing a new function of the shift $t$. Anyhow, an integral is just the limit of a Riemann sum, so for discrete data like pixels we can get away with a sum, and because an image and a kernel are zero outside a finite range, only finitely many terms survive (NOT SUBSTANTIATED, TRUST ME AT YOUR OWN PERIL!!!).
$$
(f * g)(t) = \sum_{x=-\infty}^{\infty} f(x)g(t-x)
$$
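
The discrete sum can be sketched in a few lines of plain Python. This is a minimal illustration, assuming both sequences are zero outside the indices we store; the name `convolve_1d` and the sample values are made up for this example.

```python
def convolve_1d(f, g):
    """Full discrete convolution: out[t] = sum over x of f[x] * g[t - x]."""
    out = [0] * (len(f) + len(g) - 1)
    for t in range(len(out)):
        for x in range(len(f)):
            # Outside 0..len(g)-1 we treat g as zero, so those terms are skipped.
            if 0 <= t - x < len(g):
                out[t] += f[x] * g[t - x]
    return out


signal = [1, 2, 3]
kernel = [1, 0, -1]
result = convolve_1d(signal, kernel)
print(result)  # [1, 2, 2, -2, -3]
```

Swapping the arguments, `convolve_1d(kernel, signal)`, gives the same list, which is the commutativity that comes up again in the 2D case.
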
In the context of a classification problem, we want our convolution to sweep over every part of an image and extract the important underlying features. We would like to think of these features as things like an edge (the outline of a character you want to identify), an ear, a mouth, etc. Now, these features can be anywhere (REALLY IMPORTANT TO REMEMBER THIS STATEMENT), which means we can reuse one small detector, the **kernel** (also called a **filter**), across the whole image; the grid of responses it produces is the **feature map**. What convolution is saying is:

> "Hey, can you like try applying a square detector on every part of an image and see where there's an ear? Just divide the image into sections and find the sections where there may be an ear."
> 
> ![A visualization of Convolution](https://anhreynolds.com/img/cnn.png)
> Image courtesy of [Anh H. Reynolds](https://anhreynolds.com/blogs/cnn.html)



Pretty simple, right? Okay, let $O(i,j)$ be the output of a convolution at position $(i,j)$, $I(i,j)$ be the function that grabs the pixel value at $(i,j)$ (we will clarify this later), and $K(i,j)$ be the function that grabs the weight at $(i,j)$ from a kernel. Let $m$ and $n$ be the row and column indices we sum over.
$$
O(i,j) = \sum_{m}\sum_{n} I(m, n)K(i - m, j - n)
$$
Now, according to Goodfellow et al. [[2]](#ref2), convolution is commutative (yeah, prove that, Lin), so we can equivalently flip the indices onto the image and let $m$ and $n$ range over the kernel instead:
$$
O(i,j) = \sum_{m}\sum_{n} I(i - m, j - n)K(m, n)
$$
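But we can just follow the cross-correlation version, which is the same operation without flipping the kernel. According to Goodfellow et al. [[2]](#ref2), many machine learning libraries implement this and still call it convolution.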
$$
O(i,j) = \sum_{m}\sum_{n} I(i + m, j + n)K(m, n)
$$
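
The cross-correlation formula maps almost one-to-one onto nested loops. Below is a minimal sketch in plain Python, assuming the kernel is only applied where it fits entirely inside the image (so the output shrinks); the function names and the tiny edge-detecting kernel are made up for illustration. The flipped-kernel variant matches the convolution formula up to a shift in how the output is indexed.

```python
def cross_correlate_2d(image, kernel):
    """O(i, j) = sum_m sum_n I(i + m, j + n) * K(m, n), kernel fully inside the image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for m in range(kh):
                for n in range(kw):
                    output[i][j] += image[i + m][j + n] * kernel[m][n]
    return output


def flip_kernel(kernel):
    """Reverse the kernel along both rows and columns."""
    return [row[::-1] for row in kernel[::-1]]


def convolve_2d(image, kernel):
    """Convolution expressed as cross-correlation with a flipped kernel."""
    return cross_correlate_2d(image, flip_kernel(kernel))


# A dark-to-bright vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A hypothetical vertical-edge detector: responds where brightness increases left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]

correlated = cross_correlate_2d(image, kernel)
convolved = convolve_2d(image, kernel)
print(correlated)  # [[0, 2, 0], [0, 2, 0]]
print(convolved)   # [[0, -2, 0], [0, -2, 0]]
```

The detector only fires in the column where the edge sits, and flipping the kernel just flips the sign of the response here. Since a network learns the kernel values anyway, the flip makes no practical difference, which is why we can get away with cross-correlation.
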
## 2.2 Striding and Padding
It seems a bit costly to go over every pixel of an image to find an ear; after all, the ear is usually in one or two spots of our image. We could get by with applying the kernel only at every $s$-th pixel, which is called **striding** with stride $s$.

$$
O(i,j) = \sum_{m}\sum_{n} I(i \cdot s + m, j \cdot s + n)K(m, n)
$$
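
Striding only changes where the kernel lands: instead of moving one pixel at a time, it jumps `stride` pixels. Here is a minimal sketch in plain Python, assuming no padding and the cross-correlation form of the operation; the function name and the sample 4x4 image are made up for this example.

```python
def strided_cross_correlate_2d(image, kernel, stride=1):
    """Apply the kernel at every `stride`-th position instead of at every pixel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for m in range(kh):
                for n in range(kw):
                    output[i][j] += image[i * stride + m][j * stride + n] * kernel[m][n]
    return output


# A 4x4 image holding the values 0..15, and a 2x2 kernel that sums its window.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
kernel = [[1, 1], [1, 1]]

every_pixel = strided_cross_correlate_2d(image, kernel, stride=1)
every_second = strided_cross_correlate_2d(image, kernel, stride=2)
print(len(every_pixel), len(every_pixel[0]))  # 3 3
print(every_second)  # [[10, 18], [42, 50]]
```

With a stride of 2 the kernel is evaluated at 4 positions instead of 9, so we do less work but also get a smaller output.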

Convolution **down samples** a matrix, meaning its dimensions are reduced: without padding, an $H \times W$ input and a $k_h \times k_w$ kernel produce an output of size $(H - k_h + 1) \times (W - k_w + 1)$. We want to avoid that if we are to make our model more complex by stacking more layers on top. Without a way to mitigate aggressive down sampling, we would end up with a pretty small matrix whose content cannot be used for further analysis by the model. To prevent down sampling, we implement **padding**, or specifically **zero-padding**: adding $p$ rows and columns of zeroes all around the matrix so that we retain some or all of the dimensions. With padding $p$ and stride $s$, the output height becomes

$$
H_{out} = \left\lfloor \frac{H + 2p - k_h}{s} \right\rfloor + 1
$$

and likewise for the width with $W$ and $k_w$.
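
Zero-padding itself is just wrapping the matrix in a border of zeroes. The sketch below, in plain Python, pads a small matrix and computes the resulting output size along one dimension using the standard formula `(input + 2 * pad - kernel) // stride + 1`; the function names and sample sizes are made up for illustration.

```python
def zero_pad(image, pad):
    """Surround the image with `pad` rows and columns of zeroes on every side."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]            # top border
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)  # left and right borders
    padded.extend([0] * width for _ in range(pad))        # bottom border
    return padded


def output_size(input_size, kernel_size, pad=0, stride=1):
    """Output length along one dimension for a given padding and stride."""
    return (input_size + 2 * pad - kernel_size) // stride + 1


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
padded = zero_pad(image, 1)
for row in padded:
    print(row)

print(output_size(3, 3))         # 1: a 3x3 kernel collapses a 3x3 image to one value
print(output_size(3, 3, pad=1))  # 3: one ring of zeroes keeps the original size
```

For an odd kernel size $k$ and a stride of 1, choosing a padding of $(k - 1) / 2$ keeps the output the same size as the input.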


## References

<a name="ref1">[1]</a> Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998, doi: 10.1109/5.726791.

<a name="ref2">[2]</a> I. Goodfellow, Y. Bengio, and A. Courville, "Convolutional Networks," in Deep Learning. Cambridge, MA, USA: MIT Press, 2016, pp. 321–360.