# Programming for Data Science and Artificial Intelligence

## 16. Convolutional Neural Networks from Scratch

### Readings

- [WEIDMAN] Ch6
- [CHARU] Ch8



In [7]:
# import last session's work so we can extend it further
from neuralnet.second_version import *
import numpy as np
from numpy import ndarray

So far we have focused on **Dense** layers (also known as fully-connected layers), which are good at learning relationships between features.  Adding an **activation function** like Sigmoid or Tanh allows the network to capture non-linear relationships between features and output, with Tanh having a steeper gradient, allowing the network to learn faster.  Adding **SoftMaxCrossEntropy** also strengthens the gradients produced, but remember that it only works for classification problems.  Adding **Dropout** helps against overfitting; **Glorot initialization** makes sure the weights start from a well-scaled distribution; **learning rate decay** makes sure we eventually settle at the minimum instead of hopping all over the place; and last, **momentum** helps us avoid getting stuck in local minima.  Such an architecture is usually quite adequate for **ordinary classification** problems.

However, when we talk about specific classification problems such as image, text or signal classification, each kind of data has its own nature that benefits from a different architecture.

Today, we are going to work on images (this field is called computer vision), discuss why Dense layers may not be the best choice, and propose the CNN (Convolutional Neural Network) as a better way of dealing with image classification.

There are mainly three layers that help in dealing with images:

1. Convolutional layer
2. Max/Average pooling layer
3. Flatten layer

### 1. Convolutional Layer
Let's say we are given an image of 24 x 24 pixels = 576 features like this.  Each data point is an array of numbers describing how dark each pixel is, with values ranging from 0 to 255.  These values can be normalized to range from 0 to 1. For example, for the following digit (the digit 1), we could have:

<img src ="figures/one.png" width="500">

We might input these features into Dense layers and ask the Dense layers to learn the relationships.

However, this is not optimal, since it does not **take the nature of images into account**.  The key is that each single pixel holds very little information.  The patterns in an image are better recognized from patches of pixels than from single pixels.  Imagine I give you only one-fourth of a picture of a cat: can you recognize that it's a cat?  Probably yes.  But what if I give you a single pixel?  You will have zero idea.

**Why are patterns in images better recognized by patches?**  Because humans recognize visual patterns like corners, edges and sharpness.  Combining all these visual patterns forms the image.  This is how humans see, and we should apply the same principle to neural networks.

**So how do we generate a feature from each patch?**  It is actually very easy.  We simply perform a convolution operation like this:

![](figures/no_padding_no_strides.gif)

Mathematically, it looks like this:

Let's say we have a 5 x 5 input image I:

$$ I = \begin{bmatrix}
i_{11} & i_{12} & i_{13} & i_{14} & i_{15}
\\
i_{21} & i_{22} & i_{23} & i_{24} & i_{25}
\\
i_{31} & i_{32} & i_{33} & i_{34} & i_{35}
\\
i_{41} & i_{42} & i_{43} & i_{44} & i_{45}
\\
i_{51} & i_{52} & i_{53} & i_{54} & i_{55}
\end{bmatrix}
$$

Each of these pixels may represent a brightness ranging from 0 to 255.

If we define a 3 x 3 patch, commonly called the **weights (W)** or, in computer vision, the **filter/kernel**, like this:

$$ w = \begin{bmatrix}
w_{11} & w_{12} & w_{13}
\\
w_{21} & w_{22} & w_{23}
\\
w_{31} & w_{32} & w_{33}
\end{bmatrix}
$$

Let's say we are scanning the middle of the image, so that the kernel is centered on $i_{33}$; then the output feature would be (we'll denote this as $o_{33}$):

$$
\begin{aligned}
o_{33} &= w_{11} * i_{22} + w_{12} * i_{23} + w_{13} * i_{24} \\
       &+ w_{21} * i_{32} + w_{22} * i_{33} + w_{23} * i_{34} \\
       &+ w_{31} * i_{42} + w_{32} * i_{43} + w_{33} * i_{44}
\end{aligned}
$$
           
This results in one output feature.  Of course, we may add a bias to it, and the result is then fed through an activation function.

When we do this operation across the whole image, it is called a **convolution**, and it produces the output features.
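The single output $o_{33}$ and the full sweep over the image can be sketched in NumPy as below.  This is only a minimal illustration: the image values, the all-ones kernel and the helper name `conv2d_valid` are made up for the example, and no padding, bias or activation is applied.

```python
import numpy as np

# Hypothetical 5 x 5 image and 3 x 3 kernel, chosen only for illustration
image = np.arange(1, 26, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3))

# Output for the centre position: the 3 x 3 patch around i_33
# (rows 2-4 and columns 2-4 in the 1-based notation of the text)
patch = image[1:4, 1:4]
o_33 = np.sum(patch * kernel)


def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over every position where it fits entirely inside the image."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # elementwise product of the patch and the kernel, then sum
            output[r, c] = np.sum(image[r:r + k_h, c:c + k_w] * kernel)
    return output


feature_map = conv2d_valid(image, kernel)
print(o_33)               # 117.0
print(feature_map.shape)  # (3, 3)
```

Without padding, the 5 x 5 image yields a 3 x 3 feature map, and its centre entry `feature_map[1, 1]` is the same number as `o_33`.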

A few questions remain to be answered, namely:

1. What should the weights/filters/kernels look like? How many filters should we apply?
2. How should we convolve?
3. What about the step of the convolution?
4. What should be the shape of the weight matrix (the convolutional filters) and of the output matrix?

#### A. Kernels

1. **What should the weights/filters/kernels look like?**  It turns out that each filter detects the presence of a certain visual pattern.  For example, the filter below detects whether there is an edge at that location of the image.  There are other similar filters detecting corners, lines, etc.  Check out https://setosa.io/ev/image-kernels/ and try changing the values.

$$ w = \begin{bmatrix}
0 & 1 & 0
\\
1 & -4 & 1
\\
0 & 1 & 0
\end{bmatrix}
$$
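To see why this kernel responds to edges, here is a small sketch that applies it to two made-up 3 x 3 patches: one with uniform brightness and one containing a vertical edge.  The patch values are assumptions chosen purely for illustration.

```python
import numpy as np

edge_kernel = np.array([[0, 1, 0],
                        [1, -4, 1],
                        [0, 1, 0]], dtype=float)

# Hypothetical patches: a flat region and a region with a vertical edge
flat_patch = np.full((3, 3), 0.5)
edge_patch = np.array([[0, 0, 1],
                       [0, 0, 1],
                       [0, 0, 1]], dtype=float)

# One convolution step = elementwise product followed by a sum
flat_response = np.sum(flat_patch * edge_kernel)
edge_response = np.sum(edge_patch * edge_kernel)

print(flat_response)  # 0.0 -> the kernel weights sum to zero, so flat areas cancel out
print(edge_response)  # 1.0 -> a brightness change next to the centre gives a non-zero response
```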

Real kernels can look like this.  They may look somewhat random at first glance, but we can see that a clear structure is learned in most kernels. For example, kernels 3 and 4 seem to be learning diagonal edges in opposite directions, and others capture round edges or enclosed spaces:

<img src ="figures/kernels.png" width="450">

**Then how many filters should we apply?**  For each image, we can apply multiple filters.  If we apply 2 filters, the output features become 3D like this:

$$ 2 * \text{output-width} * \text{output-height} $$

We commonly call the number of filters the number of **channels** (or **feature maps**), and this easily generalizes to

$$ \text{num-channels} * \text{output-width} * \text{output-height} $$

Actual feature maps look like this.  Each feature map is the output of convolving one kernel over a single training example.  In simple words, if we have n filters, then we have n feature maps.  They show where in the image the pattern of the corresponding kernel is activated.

<img src ="figures/feature-map2.png" width="450">


#### B. Padding

2. **How should we convolve?**  Should we cover the entire image?  Should we keep the output features the same size as the input features?  Recall this image:

<img src ="figures/no_padding_no_strides.gif" width="150">

It has 4 x 4 pixels = 16 features.  But after the convolution, we only have 2 x 2 pixels = 4 features left.  Is that good?  There is no single correct answer here, but we can be quite sure that we lose some information.  In fact, it is usually nice to **keep the output features the same size as the input features**, but how?  There is no room left to convolve, since the filter is 3 x 3 and it can only shift right one time.

The answer is **padding**, where we enlarge the input image by padding its surroundings with zeros.  How much?  We pad until the output has the original size or a larger size.  The example below uses **half padding**, which results in output features of the same size as the input features.

<img src ="figures/same_padding_no_strides.gif" width="150">

The example below uses **full padding**, which pads enough that every single pixel is visited by every position of the filter, and which results in output features that are even larger than the input.

<img src ="figures/full_padding_no_strides.gif" width="150">

Mathematically, it is easiest to understand padding from the 1D input like this:

$$ input = [1, 2, 3, 4, 5] $$

to

$$ input_{padded} = [0, 1, 2, 3, 4, 5, 0] $$

Normally, a larger output may benefit from more features, but it also suffers from a longer training time.  It is probably best to pad only enough to get the same size as the input features.
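The same idea carries over to 2D images.  As a minimal sketch, NumPy's `np.pad` can add a border of zeros around a small made-up 4 x 4 image (the image values are arbitrary and only serve as an illustration):

```python
import numpy as np

# Hypothetical 4 x 4 image
image = np.arange(1, 17).reshape(4, 4)

# Add one row/column of zeros on every side
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(image.shape)   # (4, 4)
print(padded.shape)  # (6, 6)
print(padded)
```

The original pixels sit untouched in the middle of the padded array, surrounded by a one-pixel border of zeros.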

#### C. Strides

3. **What about the step of the convolution?**  Should we shift 1 step per convolution, or 2 steps, or more?  **In fact, it really depends on how much detail you want.  Defining bigger steps reduces the feature size and thus reduces the computation time.**  A bigger step is like a human scanning a picture more roughly.  Whether to use it is something to be determined by experiment.

In computer vision, we call this step the **stride**.  Some examples:

**No padding with stride of 1**

<img src ="figures/no_padding_strides.gif" width="150">

**Padding with stride of 1**

<img src ="figures/padding_strides.gif" width="150">

**Padding with stride (odd)**

<img src ="figures/padding_strides_odd.gif" width="150">

Actual image convolution can look like this:

<img src ="figures/conv.gif" width="500">

And the convolved images look like this:

<img src ="figures/convimages.png" width="500">

**The padding needed to keep the spatial size of the output equal to that of the input, with a stride of 1,** is

$$ P = \frac{K-1}{2} $$

where $K$ is the filter size (assumed to be odd).

Here, our $K$ is 3, so the padding should be $(3-1)/2 = 1$.

This means that if our image has size $24 * 24$ and the filter size is $3 * 3$, then we need to add **a border of one pixel valued 0 around the outside of the image**, which results in a padded input image of size $26 * 26$.

In some Python libraries such as Keras, we can set <code>padding="same"</code> to get this effect.
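The arithmetic above can be sketched in a few lines of Python.  The helper name `same_padding` is hypothetical, and the sketch assumes an odd filter size and a stride of 1:

```python
def same_padding(kernel_size: int) -> int:
    """Padding per side that keeps the output size equal to the input size (stride 1)."""
    if kernel_size % 2 == 0:
        # assumption of this sketch: only odd kernels give a symmetric padding
        raise ValueError("this sketch only handles odd kernel sizes")
    return (kernel_size - 1) // 2


image_size = 24
kernel_size = 3

padding = same_padding(kernel_size)           # 1
padded_size = image_size + 2 * padding        # 26
output_size = padded_size - kernel_size + 1   # 24, same as the input

print(padding, padded_size, output_size)
```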

#### D. Shape of the weight matrix and output

4. **What should be the shape of the weight matrix (the convolutional filters) and of the output matrix?**  Recall that in a Dense layer, the shape of the weight matrix is defined as $$neuron_{in} * neuron_{out}$$

For example, take an image of 24 x 24 pixels = 576 features.  Let's say we have around 1000 images, so our input has a shape of (1000, 576).  Thus the input layer should have 576 neurons.  Let's say our next hidden layer has 10 neurons; what should be the shape of the weight matrix?  The answer is easy: we simply need to find the ? here:

$$ (1000, 576) @ ? = (1000, 10) $$

Obviously, the weight matrix would be

$$ (576, 10) $$

where you can clearly see 576 is the number of input neurons and 10 is the number of output neurons.

Now our question is **what about convolutional layers?**  In a convolutional layer, the input has the shape (num of samples, image height, image width, num of channels).  For example, let's say that after our first CNN layer we have 1000 samples, 4 channels (say edges, corners, horizontal lines and vertical lines), and an image height and width of 24, so the shape of the input is (1000, 24, 24, 4).  Now let's say we want to go to the next CNN layer and apply 2 filters.  What should be the shape of the weight matrix?  Also, what would be the shape of the output?  The answer is a little tough, but we do know that the number of samples remains the same and that the number of output channels equals the number of filters we apply, thus we get:

$$ (1000, 24, 24, 4) \circledast ? = (1000, ?, ?, 2) $$

The output size depends on the stride ($S$), the padding ($P$), the filter size ($F$) and the input size ($W$), with the formula as follows:

$$ O = \frac{W-F+2P}{S} + 1 $$

Here, $O$ is the output height/width, $W$ is the input height/width, $F$ is the filter size, $P$ is the padding, and $S$ is the stride.

Suppose we have an input image of size $24*24*4$ and we apply 2 filters of size $3*3$, with a stride of 1 and no zero padding.

Here W=24, F=3, P=0 and S=1.

The output height and width will be $([24-3+0]/1)+1 = 22$. Therefore the output volume will be $22*22*2$.  Thus

$$ (1000, 24, 24, 4) \circledast (4, 3, 3, 2)_{p=0, s=1} = (1000, 22, 22, 2) $$
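The output-size formula is easy to wrap in a small helper.  The function name `conv_output_size` is hypothetical, and raising an error for a non-integer result is a design choice of this sketch rather than something every library does:

```python
def conv_output_size(w: int, f: int, p: int = 0, s: int = 1) -> int:
    """Return O = (W - F + 2P) / S + 1, raising if the result is not a whole number."""
    numerator = w - f + 2 * p
    if numerator < 0 or numerator % s != 0:
        raise ValueError(
            f"W={w}, F={f}, P={p}, S={s} does not give an integer output size"
        )
    return numerator // s + 1


print(conv_output_size(24, 3, p=0, s=1))  # 22, the worked example above
print(conv_output_size(24, 3, p=1, s=1))  # 24, padding of 1 keeps the size
print(conv_output_size(5, 3, p=0, s=2))   # 2
```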


In conclusion, 

- The input will have a 4D shape of $(\text{size}, \text{image height}, \text{image width}, \text{input channels})$

- The output will have a 4D shape of $(\text{size}, \text{changed image height}, \text{changed image width}, \text{output channels})$

- The convolutional filters will have 4D shape of $(\text{input channels}, \text{filter height}, \text{filter width}, \text{output channels})$

**Note: The order of the dimensions is a convention that depends on the Python library you use, but these four dimensions always exist in a CNN.**

**The general rule for selecting padding, stride and filter size is, of course, trial and error.  But it is important to remember that they should result in an integer output size, not a fractional one.**
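To make these shapes concrete, here is a deliberately naive sketch of a multi-channel convolution with no padding and a stride of 1.  The function name `conv_forward`, the random data and the small batch of 2 samples (instead of 1000) are assumptions made for the example; it follows the (input channels, filter height, filter width, output channels) layout listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.random((2, 24, 24, 4))   # (size, height, width, input channels)
filters = rng.random((4, 3, 3, 2))   # (input channels, filter height, filter width, output channels)


def conv_forward(batch: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """Naive convolution: no padding, stride 1, loops over output positions."""
    n, h, w, c_in = batch.shape
    f_in, f_h, f_w, c_out = filters.shape
    assert c_in == f_in, "input channels of batch and filters must match"
    out_h, out_w = h - f_h + 1, w - f_w + 1
    output = np.zeros((n, out_h, out_w, c_out))
    for r in range(out_h):
        for c in range(out_w):
            patch = batch[:, r:r + f_h, c:c + f_w, :]  # (n, f_h, f_w, c_in)
            # sum over filter height, filter width and input channels
            output[:, r, c, :] = np.einsum("nijc,cijo->no", patch, filters)
    return output


output = conv_forward(batch, filters)
print(output.shape)  # (2, 22, 22, 2)
```

Each output channel sums over all input channels, which is why the filters need the input-channel dimension as well as the output-channel dimension.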

#### Demo

https://www.cs.ryerson.ca/~aharley/vis/conv/

### 2. Max/Average Pooling Layer

Talking about **reducing computation time**, another way is to add a **pooling layer**, which simply downsamples the image by averaging a set of pixels or by taking their maximum value.  If we define a pooling size of 2, this maps each 2 x 2 block of pixels to one output, like this:

<img src ="figures/pooling.png" width="300">

Nevertheless, pooling has a really big downside: it loses a lot of information.  Compared to strides, which simply scan less but keep the same resolution, pooling reduces the resolution of the images.  As Geoffrey Hinton said in a Reddit AMA in 2014: **The pooling operation used in CNN is a big mistake and the fact that it works so well is a disaster**.  In fact, more recent CNN architectures like ResNets use pooling very minimally or not at all.  In this lecture, we are not going to implement pooling; we only mention it for the sake of completeness, since early architectures like AlexNet use pooling.
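Although pooling is not part of the implementation in this lecture, a short sketch shows how little is involved.  The helper name `pool_2d` and the 4 x 4 example values are made up, and the sketch assumes the image height and width are divisible by the pooling size:

```python
import numpy as np


def pool_2d(image: np.ndarray, size: int = 2, mode: str = "max") -> np.ndarray:
    """Downsample a 2D image by taking the max or mean of each size x size block."""
    h, w = image.shape
    # assumption: h and w are multiples of `size`
    blocks = image.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))


image = np.array([[1, 2, 5, 6],
                  [3, 4, 7, 8],
                  [9, 10, 13, 14],
                  [11, 12, 15, 16]], dtype=float)

max_pooled = pool_2d(image, size=2, mode="max")    # [[4, 8], [12, 16]]
avg_pooled = pool_2d(image, size=2, mode="mean")   # [[2.5, 6.5], [10.5, 14.5]]
print(max_pooled)
print(avg_pooled)
```

Sixteen input values become four outputs, which illustrates both the saving in computation and the loss of information.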

### 3. Flatten Layer

A CNN will usually contain many convolutional layers.  However, if we want to predict a class, it typically makes sense to use a Dense layer as the last (output) layer.

However, the question is how do we send an input of shape $(\text{size}, \text{image height}, \text{image width}, \text{input channels})$ into a Dense layer?

This is actually quite easy.  What we can do is simply reshape the 4D array into a 2D array.  For example, given (1000, 22, 22, 2), a *flatten* operation turns the array into (1000, 968), since $22 * 22 * 2 = 968$.  We can then multiply it with weights just like in a Dense layer, make predictions, and calculate the loss just like we did in the previous class.

Why can we perform the *flatten* operation?  Does it not lose information?  No: flattening only rearranges the same values into another representation, so nothing is lost.  It also allows the Dense layer to learn the relationships between the visual patterns from the prior convolutional layers and the output.
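As a quick sketch with random stand-in data (the arrays and the weight matrix below are made up for illustration), flattening is a single `reshape`, and reshaping back recovers exactly the original array:

```python
import numpy as np

rng = np.random.default_rng(0)
conv_output = rng.random((1000, 22, 22, 2))   # stand-in for the last convolutional output

# Keep the sample dimension, merge everything else: 22 * 22 * 2 = 968
flat = conv_output.reshape(conv_output.shape[0], -1)
print(flat.shape)  # (1000, 968)

# Reshaping back gives the original values, so no information is lost
restored = flat.reshape(conv_output.shape)
print(np.array_equal(restored, conv_output))  # True

# The flattened features can now feed a Dense layer with, say, 10 output neurons
dense_weights = rng.random((968, 10))          # hypothetical weights
dense_output = flat @ dense_weights
print(dense_output.shape)  # (1000, 10)
```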

Flattening is as simple as this:

<img src ="figures/flatten.png" width="150">



### Let's start coding!!

First off, to make the CNN code easier to understand, let's start simple and work with 1D input.  Let's also write some helpers to make our life easier, namely <code>assert_same_shape</code> and <code>assert_dim</code>.

In [17]:
def assert_same_shape(A: ndarray, B: ndarray):
    assert A.shape == B.shape
    
def assert_dim(X: ndarray, dim: int):
    assert len(X.shape) == dim


#### Padding

Padding can be easily coded.  Let's start simple with 1D input like this:

In [47]:
input_1d = np.array([1,2,3,4,5])
param_1d = np.array([1,1,1])

In [48]:
def _pad_1d(input_: ndarray,
            padding: int) -> ndarray:
    zero = np.array([0])
    zero = np.repeat(zero, padding)  # `padding` zeros for each side
    return np.concatenate([zero, input_, zero])

In [49]:
_pad_1d(input_1d, 1)

array([0, 1, 2, 3, 4, 5, 0])

#### Forward pass - convolution

Convolution in 1D is simple.

We are actually doing something like this:

In [62]:
def conv_1d(input_: ndarray, 
            param: ndarray) -> ndarray:
    
    # assert 1D data
    assert_dim(input_, 1)
    assert_dim(param, 1)
    
    # pad the input
    # for an odd kernel length k, (k - 1) / 2 equals k // 2, where // is floor division
    param_len = param.shape[0]  #3
    param_mid = param_len // 2  #3 // 2 = 1
    input_pad = _pad_1d(input_, param_mid) # [0, 1, 2, 3, 4, 5, 0]
    
    # initialize the output
    # we let the output have the same shape as the input
    output = np.zeros(input_.shape) # [0, 0, 0, 0, 0]

    # perform the 1d convolution
    for o in range(output.shape[0]): #0 to 4
        for p in range(param_len):  #0 to 2
            output[o] += param[p] * input_pad[o+p] #o move along with p thus o+p
        
    # ensure input has same shape as output
    # this is actually optional
    assert_same_shape(input_, output)

    return output

Next, we code the sum, which simply adds up everything returned by the convolution.

In [60]:
def conv_1d_sum(input_: ndarray, 
                param: ndarray) -> ndarray:
    output = conv_1d(input_, param)
    return np.sum(output)

In [61]:
conv_1d_sum(input_1d, param_1d)

39.0
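As a cross-check, the same numbers can be reproduced with NumPy's built-in `np.correlate`.  This is only a sanity check for this example, not a replacement for the hand-written version: what deep learning calls "convolution" slides the kernel without flipping it, which is what `np.correlate` computes, and `mode="same"` zero-pads so the output is as long as the input.

```python
import numpy as np

input_1d = np.array([1, 2, 3, 4, 5])
param_1d = np.array([1, 1, 1])

# mode="same" pads with zeros so that the output has the same length as the input
output = np.correlate(input_1d, param_1d, mode="same")

print(output)        # values 3, 6, 9, 12, 9
print(output.sum())  # 39
```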

#### Gradients

How do we compute the gradients of a convolution?

Let's first try some set of numbers and manually get the gradients:

In [63]:
# arbitrarily choose to increase the 5th element by 1
# so we can find the gradient of the convolution sum with respect to the 5th element
input_1d_2 = np.array([1,2,3,4,6])
param_1d = np.array([1,1,1])

conv_1d_sum(input_1d_2, param_1d)

41.0

What does this mean?  Increasing the 5th element by 1 increases the convolution sum by 2 (from 39 to 41), thus the gradient with respect to the 5th element is 2.

Let's see how we actually get this 2.

Given 

$$ t = [0, 1, 2, 3, 4, 5, 0]  $$

and 

$$ w = [1, 1, 1] $$

First, let's look at the convolution equation like this:

$$ o_1 = t_1*w_1 + t_2*w_2 + t_3*w_3 $$
$$ o_2 = t_2*w_1 + t_3*w_2 + t_4*w_3 $$
$$ o_3 = t_3*w_1 + t_4*w_2 + t_5*w_3 $$
$$ o_4 = t_4*w_1 + t_5*w_2 + t_6*w_3 $$
$$ o_5 = t_5*w_1 + t_6*w_2 + t_7*w_3 $$

Look at $t_6$, which is our 5th element: $t_6$ is multiplied by $w_3$ in $o_4$, by $w_2$ in $o_5$, and it would be multiplied by $w_1$ in $o_6$ (which we don't have anyway).

This gradient can be written as:

$$ \frac{\partial L}{\partial t_6} = \frac{\partial L}{\partial o_4} * w_3 +  \frac{\partial L}{\partial o_5} * w_2 + \frac{\partial L}{\partial o_6} * w_1$$

Since each $o_i$ is simply added into the convolution sum $L$, its derivative is $$\frac{\partial L}{\partial o_i} = 1$$

Thus, $$ \frac{\partial L}{\partial t_6} = \frac{\partial L}{\partial o_4} * w_3 +  \frac{\partial L}{\partial o_5} * w_2 + \frac{\partial L}{\partial o_6} * w_1 = 1 * w_3 + 1 * w_2 + 0 * w_1 = 2$$

where the last term vanishes because $o_6$ does not exist.
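The same reasoning applies to every input element, and we can check it numerically.  The sketch below is an illustration under stated assumptions, not the layer's final backward pass: the helper names `input_grad_analytic` and `input_grad_numeric` are made up, the loss is the plain convolution sum (so $\partial L / \partial o_i = 1$), and `np.correlate` stands in for the hand-written convolution.

```python
import numpy as np


def conv_1d_sum(input_: np.ndarray, param: np.ndarray) -> float:
    """Convolution sum L, using np.correlate as a stand-in for the hand-written loop."""
    return np.correlate(input_, param, mode="same").sum()


def input_grad_analytic(input_: np.ndarray, param: np.ndarray) -> np.ndarray:
    """dL/dt: every input element collects the weights it gets multiplied by."""
    pad = param.shape[0] // 2
    n = input_.shape[0]
    grad_padded = np.zeros(n + 2 * pad)
    for o in range(n):
        for p in range(param.shape[0]):
            grad_padded[o + p] += 1.0 * param[p]  # dL/do = 1 for every output
    return grad_padded[pad:pad + n]               # drop the gradients of the padding


def input_grad_numeric(input_: np.ndarray, param: np.ndarray, eps: float = 1.0) -> np.ndarray:
    """Bump each element by eps and measure the change in the convolution sum."""
    base = conv_1d_sum(input_, param)
    grad = np.zeros(input_.shape[0])
    for i in range(input_.shape[0]):
        bumped = input_.astype(float).copy()
        bumped[i] += eps
        grad[i] = (conv_1d_sum(bumped, param) - base) / eps
    return grad


input_1d = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
param_1d = np.array([1.0, 1.0, 1.0])

analytic = input_grad_analytic(input_1d, param_1d)
numeric = input_grad_numeric(input_1d, param_1d)
print(analytic)  # gradients 2, 3, 3, 3, 2
print(numeric)
```

The two border elements get a gradient of 2 because they take part in only two outputs, while the inner elements take part in three.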