# More About Convolution

On Tuesday we learned *how* to do a convolution (on image data):
1. How to make a kernel
2. How to pass the kernel over an image
3. Considerations, like padding and stride, that affect the implementation of passing a kernel over an image

Convolution in a CNN is analogous to *net_in* in an MLP. We use it to make a linear combination of input features:

<img src="https://i.imgur.com/j3lk26U.png" width="300" height="300" />

*Image source: https://www.kaggle.com/code/ryanholbrook/convolution-and-relu*

The numbers in the kernel are analogous to the weights in *net_in*. Some notes:
* We can also add a bias term to a convolution layer!
* A kernel for this class will be 2D because we will use it for computer vision applications. However, kernels could be 1D or 3D, depending on the data and the task.
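To make the analogy with *net_in* concrete, here is a minimal sketch of a single convolution step. The patch, kernel, and bias values are made up for illustration and do not come from the figure above.

```python
import numpy as np

# Hypothetical 3x3 image patch and 3x3 kernel (illustrative values only).
patch = np.array([[1.0, 2.0, 0.0],
                  [0.0, 1.0, 3.0],
                  [2.0, 1.0, 1.0]])
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])
bias = 0.5

# net_in for one output position: elementwise product, summed, plus bias.
net_in = np.sum(patch * kernel) + bias

# The same number written MLP-style: a dot product of flattened inputs and weights.
net_in_mlp = patch.ravel() @ kernel.ravel() + bias

print(net_in, net_in_mlp)
```

Both expressions give the same value, which is the sense in which the kernel entries play the role of weights.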

So what we learn when training a CNN is the weights in the kernels (and biases). We do *not* train to learn:
* The sizes of the kernels
* The stride
* The amount of padding
* The orientation of passing a kernel over the input - for computer vision, we always pass over each channel separately



What is the equivalent of activation in a CNN? Activation!

Here is a simple kernel:

<img src="https://www.kaggleusercontent.com/kf/94853598/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..VHcvCnXRYUNJ8HS2lqXFrA.ZVTaYd13Mk9R6X_Sa-1G_VyeVVHWjMIQAGbdb-ulzlikRLnfsMWOLtE_VPDD-qjFGMVQ56_w12X57mFUeTBXpDG6m8fQwVZsc_rEJIK-NHdMNd_WEshlUvGjaZRzK7o5OON7IBOvkRBQrG4x9GgarZ0PjNrLiIrVZzf5saAJVGr1A_U9FoJyc7xdtdixeHHhEzoIqKmFCiAmJ9BsHNi7c4UFXyxmgS9AhVhZMqUiZq6jUUhKoTYlWHD3lqD_PKhKhvmoN9ldmd8Lv4eTsev9nCyKiKj-7BZd91QpPh3w_IwjnlV7_eCm_0qyS4tql3s9SHKbtDVWlGdzYk_tJxDb5Wsy1gBzNGQAEoaxgRhE4ycqjAvq90J2-qjDvR7XeU57SF9_18ll3fdyX1Tiuv0v0Ox6M_dDMHJ9XqNCSnhzFHjHxX5GSd0XRxNnVwIek-8Va_wmiuHdwKUM0PlAH0yURXvugqWr0Q9Vwbl8UOpmkzmLyGCD9IqnXDeBY6spYZHImIwiJvJiCfSTXv8n9Zgq6k3ZmLuXqp2JuBhkEa1kV_K-mYq3arjd34JRfiiEmeap6Q7DWSk2V7JtnIDtrQPxl3iAT8ApHeDvczKSm5A7TW-mTTbCNwxAe5i8hX6mbi2CkW4XaqZyk_dg967-4_6P89w9-5kES6aarwDHqCYizZU.SjQdB5nSifJyu9HzGGVHGg/__results___files/__results___9_0.png" height=200>

*Image source: https://www.kaggle.com/code/ryanholbrook/convolution-and-relu*


Here is the feature map that results from convolution of this kernel over an input image:

<img src="https://www.kaggleusercontent.com/kf/94853598/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..VHcvCnXRYUNJ8HS2lqXFrA.ZVTaYd13Mk9R6X_Sa-1G_VyeVVHWjMIQAGbdb-ulzlikRLnfsMWOLtE_VPDD-qjFGMVQ56_w12X57mFUeTBXpDG6m8fQwVZsc_rEJIK-NHdMNd_WEshlUvGjaZRzK7o5OON7IBOvkRBQrG4x9GgarZ0PjNrLiIrVZzf5saAJVGr1A_U9FoJyc7xdtdixeHHhEzoIqKmFCiAmJ9BsHNi7c4UFXyxmgS9AhVhZMqUiZq6jUUhKoTYlWHD3lqD_PKhKhvmoN9ldmd8Lv4eTsev9nCyKiKj-7BZd91QpPh3w_IwjnlV7_eCm_0qyS4tql3s9SHKbtDVWlGdzYk_tJxDb5Wsy1gBzNGQAEoaxgRhE4ycqjAvq90J2-qjDvR7XeU57SF9_18ll3fdyX1Tiuv0v0Ox6M_dDMHJ9XqNCSnhzFHjHxX5GSd0XRxNnVwIek-8Va_wmiuHdwKUM0PlAH0yURXvugqWr0Q9Vwbl8UOpmkzmLyGCD9IqnXDeBY6spYZHImIwiJvJiCfSTXv8n9Zgq6k3ZmLuXqp2JuBhkEa1kV_K-mYq3arjd34JRfiiEmeap6Q7DWSk2V7JtnIDtrQPxl3iAT8ApHeDvczKSm5A7TW-mTTbCNwxAe5i8hX6mbi2CkW4XaqZyk_dg967-4_6P89w9-5kES6aarwDHqCYizZU.SjQdB5nSifJyu9HzGGVHGg/__results___files/__results___13_0.png" height=300>

*Image source: https://www.kaggle.com/code/ryanholbrook/convolution-and-relu*

And here is the result of applying ReLU activation to that feature map:

<img src="https://www.kaggleusercontent.com/kf/94853598/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..VHcvCnXRYUNJ8HS2lqXFrA.ZVTaYd13Mk9R6X_Sa-1G_VyeVVHWjMIQAGbdb-ulzlikRLnfsMWOLtE_VPDD-qjFGMVQ56_w12X57mFUeTBXpDG6m8fQwVZsc_rEJIK-NHdMNd_WEshlUvGjaZRzK7o5OON7IBOvkRBQrG4x9GgarZ0PjNrLiIrVZzf5saAJVGr1A_U9FoJyc7xdtdixeHHhEzoIqKmFCiAmJ9BsHNi7c4UFXyxmgS9AhVhZMqUiZq6jUUhKoTYlWHD3lqD_PKhKhvmoN9ldmd8Lv4eTsev9nCyKiKj-7BZd91QpPh3w_IwjnlV7_eCm_0qyS4tql3s9SHKbtDVWlGdzYk_tJxDb5Wsy1gBzNGQAEoaxgRhE4ycqjAvq90J2-qjDvR7XeU57SF9_18ll3fdyX1Tiuv0v0Ox6M_dDMHJ9XqNCSnhzFHjHxX5GSd0XRxNnVwIek-8Va_wmiuHdwKUM0PlAH0yURXvugqWr0Q9Vwbl8UOpmkzmLyGCD9IqnXDeBY6spYZHImIwiJvJiCfSTXv8n9Zgq6k3ZmLuXqp2JuBhkEa1kV_K-mYq3arjd34JRfiiEmeap6Q7DWSk2V7JtnIDtrQPxl3iAT8ApHeDvczKSm5A7TW-mTTbCNwxAe5i8hX6mbi2CkW4XaqZyk_dg967-4_6P89w9-5kES6aarwDHqCYizZU.SjQdB5nSifJyu9HzGGVHGg/__results___files/__results___15_0.png" height=300>

*Image source: https://www.kaggle.com/code/ryanholbrook/convolution-and-relu*
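As a small sketch of what the activation step does to a feature map, the snippet below applies ReLU elementwise. The `relu` helper and the 2x2 values are hypothetical, chosen only to show that negative responses are zeroed and the shape is unchanged.

```python
import numpy as np

def relu(feature_map):
    """Elementwise ReLU: negative responses become 0, positive ones pass through."""
    return np.maximum(feature_map, 0)

# Hypothetical 2x2 feature map with a mix of positive and negative responses.
feature_map = np.array([[-2.0, 0.5],
                        [3.0, -0.1]])
activated = relu(feature_map)

print(activated)
print(activated.shape == feature_map.shape)
```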


Let's imagine applying a single kernel over an input image with minimal padding. 
* What would be the shape of the output feature map?
* After activation, what would be the shape?
* What happens to the spread of information if we made a second layer, applying this same kernel+activation on the output of the previous layer?

Now let's imagine applying three kernels over an input image with minimal padding.
* What would be the shape of the output feature maps?
* After activation, what would be the shape?
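One way to check your answers to the shape questions above is a small helper like the one below. It is a sketch under stated assumptions: "minimal padding" is read as no padding, the stride is 1, and the 32x32 input and 3x3 kernel are hypothetical sizes. The name `conv_output_shape` is made up for this example.

```python
def conv_output_shape(in_h, in_w, k_h, k_w, n_kernels=1, stride=1, pad=0):
    """Shape of the stack of feature maps produced by n_kernels kernels."""
    out_h = (in_h + 2 * pad - k_h) // stride + 1
    out_w = (in_w + 2 * pad - k_w) // stride + 1
    return (n_kernels, out_h, out_w)

# Hypothetical 32x32 input and 3x3 kernels, no padding, stride 1.
one_kernel = conv_output_shape(32, 32, 3, 3)
three_kernels = conv_output_shape(32, 32, 3, 3, n_kernels=3)

print(one_kernel)
print(three_kernels)
```

Activation is applied elementwise, so it leaves these shapes unchanged.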

Okay, the promise of CNNs is that as you add depth, each successive layer can look at larger and larger contexts of the input, but do we have that? 
And what are we going to do with the explosion of dimensionality of each layer as we add kernels?

# Pooling

So far, every network we have built has used the same type of layer throughout. In a CNN, we typically interleave convolution layers with **pooling** layers. A pooling layer reduces the dimensionality of its input. It works like a kernel with no learned weights: a window walks over the input and takes the min/max/median/mode/mean at each step. (Note: we could get fancier with our aggregation! The textbook mentions stochastic pooling. Or we could learn the aggregation operation: hold that thought for when we get to **attention** at the end of the semester.)

People have tried various types of pooling operators, but by far the most common one today is max pooling.

Here is the result of applying max pooling to the feature map coming out of the previous convolution layer:

<img src="https://www.kaggleusercontent.com/kf/94853595/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..IT6yYZjyyR49ErwiLsTQgw.1spwHe9nU6C7ZRnDDvkWcoJ-P1hVMpDjDbDymFWaU5OB3dtUDDWowbAB9lmJGWNPe-i1midhqg1G7aI53czrUFxdgQIkLi3YHbTbscU9JkU0vaROGuaTWDy8zLCYFY4RaORAlHYNjRq0MlxbVJPPU3KDjC4aA26ixg4UE3ieL21hWjokDBtpMd9hB3GvOzL19XHr4a33UT5SocrjPjTBiPoOG7nOzsd2Fm-GRIsvKpQaQnuvJdgs79Hz_xyCUmgBcjZS5UzvIJN1Q1u89Jh-xVOQUAay-uELUVdu8pz21lyMpnL1WBzp90oI1jb6AQTn_0ZiMrbAEYLk6tTCejTJ50s0aOsLZeFgcjorePgGNiRBP-hCXAIe0Q0zoqjz8lbN8eyABA9ICDyoLqVe-j2NCHAOtDfk7h-tWzYioVQYAYhua-Y9yLL5eEOFJoq648fN8lp9H2KRpT38Y_ETXguCIILLejC43WlHIT7qheYCnWoS7Fus9gpYE7ehQSkZHOyoW1HRCCxBurgu0qcReKvV-6FIQ9WhYkdhPzrt5nQcjO8vYU4B7aEJwdQILavypvkkbzi3RKVi7N8hdOoYT7iqpLuByqOP09ps6pAvOxrmiap6Ao37wFLF9-5nRnxrxzo4S84hnUl_qZ4kjeHm6T0KQg.FyiiKdGdtrAYcgt8iCpCJQ/__results___files/__results___5_0.png" height=300>

*Image source: https://www.kaggle.com/code/ryanholbrook/convolution-and-relu*

If we pool, what is the shape of the output?

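Along each spatial dimension, the output size is $\lfloor\frac{I_x-P}{S} \rfloor +1$, where $I_x$ is the input size along dimension $x$, $P$ is the pooling window size along that dimension, and $S$ is the stride (assuming no padding).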
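The formula can be tried out with a couple of hypothetical sizes. The helper name `pooled_size` and the example numbers are assumptions for illustration.

```python
def pooled_size(input_size, window, stride):
    """Output size along one dimension: floor((I - P) / S) + 1, no padding."""
    return (input_size - window) // stride + 1

# A 28-pixel-wide input with a 2-wide window and stride 2.
even_case = pooled_size(28, 2, 2)

# A 7-pixel-wide input with a 3-wide window and stride 2: (7 - 3) // 2 + 1.
odd_case = pooled_size(7, 3, 2)

print(even_case, odd_case)
```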

Max pooling helps with:
* Translation invariance (for small shifts of a feature)
* Dimensionality reduction
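The sketch below illustrates both points with a hypothetical 2x2, stride-2 max pool written using a NumPy reshape. The helper `max_pool_2x2` and the toy inputs are made up for this example. Note that the invariance is only local: a shift that crosses a pooling window boundary does change the output.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling (stride 2); trims odd trailing rows/cols."""
    h, w = x.shape
    trimmed = x[:h - h % 2, :w - w % 2]
    # Axes after reshape: (row block, row within block, col block, col within block).
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4))
a[0, 0] = 1.0  # a single "feature" in the top-left corner
b = np.zeros((4, 4))
b[1, 1] = 1.0  # same feature shifted by one pixel, still inside the same window
c = np.zeros((4, 4))
c[2, 2] = 1.0  # shifted far enough to land in a different window

print(max_pool_2x2(a).shape)                              # 4x4 input becomes 2x2
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))   # small shift: same output
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(c)))   # larger shift: different output
```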

Typically in a CNN you'll see convolution layers, then max pooling layers, repeated several times, and finally one or more dense layers and then the output layer.

# Convolution for Compression and Representation Learning

CNNs can be used for classification (classifying whole things), object detection (finding things in things), and even regression.

You might think that the convolution/pooling process loses information, but convolution can also be used for representation learning and compression:
* https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7803544
* https://arxiv.org/pdf/1809.03684.pdf


# Convolution and Max Pooling in a CNN

Typically, during the forward pass for a convolution layer we:
1. Compute net_in:
  a. Convolve each kernel with the image, separately on each color channel
  b. Sum the output of each kernel across color channels 
  c. Add bias to each kernel's output (one bias term per kernel)
2. Compute net_act on the net_in of each kernel 
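The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the course's reference implementation. It assumes stride 1, no padding, ReLU as the activation, and that each kernel holds one 2D slice of weights per color channel, giving a kernel array of shape `(K, C, kh, kw)`. The function name `conv_layer_forward` is hypothetical.

```python
import numpy as np

def conv_layer_forward(image, kernels, biases):
    """image: (C, H, W); kernels: (K, C, kh, kw); biases: (K,)."""
    n_k, n_c, kh, kw = kernels.shape
    _, h, w = image.shape
    out_h, out_w = h - kh + 1, w - kw + 1  # stride 1, no padding (assumed)
    net_in = np.zeros((n_k, out_h, out_w))
    for k in range(n_k):
        for i in range(out_h):
            for j in range(out_w):
                patch = image[:, i:i + kh, j:j + kw]
                # Steps 1a and 1b: weight each channel's patch, then sum
                # over window positions and over channels.
                net_in[k, i, j] = np.sum(patch * kernels[k])
        net_in[k] += biases[k]  # Step 1c: one bias per kernel
    net_act = np.maximum(net_in, 0)  # Step 2: ReLU (assumed activation)
    return net_in, net_act

# Hypothetical sizes: a 3-channel 8x8 image and four 3x3 kernels.
rng = np.random.default_rng(0)
image = rng.normal(size=(3, 8, 8))
kernels = rng.normal(size=(4, 3, 3, 3))
biases = np.zeros(4)
net_in, net_act = conv_layer_forward(image, kernels, biases)
print(net_act.shape)  # (4, 6, 6)
```

The triple loop is written for readability rather than speed. The output has one feature map per kernel, regardless of how many channels the input had.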

And for a max pooling layer we:
1. Compute net_in: max pooling across each feature map output by a kernel on the previous layer
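A rough sketch of this step, assuming the previous layer's output is stored as an array of shape `(K, H, W)` with one feature map per kernel. The function name `max_pool_layer` and the default 2x2 window with stride 2 are assumptions for illustration. Each map is pooled independently, and the layer has no weights to learn.

```python
import numpy as np

def max_pool_layer(feature_maps, window=2, stride=2):
    """feature_maps: (K, H, W) -> (K, out_h, out_w), pooling each map separately."""
    n_k, h, w = feature_maps.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    pooled = np.zeros((n_k, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # Max over the spatial window, kept separate for every feature map.
            pooled[:, i, j] = feature_maps[:, r:r + window, c:c + window].max(axis=(1, 2))
    return pooled

# Two hypothetical 4x4 feature maps filled with 0..31.
maps = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
pooled = max_pool_layer(maps)
print(pooled.shape)  # (2, 2, 2)
```

The number of feature maps is unchanged by pooling. Only the height and width shrink.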

As we stack these layers, the spatial size (height and width) of each layer's output shrinks and the number of filters (kernels) grows, so layers get skinnier and deeper:

<img src="https://learnopencv.com/wp-content/uploads/2023/01/tensorflow-keras-cnn-vgg-architecture-1024x611.png" height=300>

*Image source: https://learnopencv.com/understanding-convolutional-neural-networks-cnn/*

By the way, these are two other great explanations of CNNs for computer vision:
* https://learnopencv.com/understanding-convolutional-neural-networks-cnn/
* https://www.kaggle.com/learn/computer-vision


On Tuesday we implemented a convolution operation. Today, implement pooling.

In [None]:
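import numpy as np

def pooling(data, kernel, stride=1, pad=0, method="max"):
    # method can be "max", "average", or "median"; only kernel.shape is used
    agg = {"max": np.max, "average": np.mean, "median": np.median}[method]
    data = np.pad(data, pad)  # zero-pad every side; zeros can win the max if data is negative
    kh, kw = kernel.shape
    res = np.zeros(((data.shape[0] - kh) // stride + 1, (data.shape[1] - kw) // stride + 1))
    for i in range(res.shape[0]):
        for j in range(res.shape[1]):
            window = data[i * stride:i * stride + kh, j * stride:j * stride + kw]
            res[i, j] = agg(window)  # stride defaults to 1: a stride of 0 would never advance
    return res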