# Programming for Data Science and Artificial Intelligence

## 16. Convolutional Neural Networks from Scratch

### Readings

- [WEIDMAN] Ch6
- [CHARU] Ch8



In [1]:
# import the work from last time so we can extend it further
from neuralnet.second_version import *

So far we have focused on **Dense** layers (also known as fully-connected layers), which are good at learning relationships between features.  Adding an **activation function** like Sigmoid or Tanh lets the network model the non-linear relationship between features and output, with Tanh having a steeper gradient, allowing the network to learn faster.  Adding **SoftMaxCrossEntropy** also improves the gradient produced, but remember that it only works with classification problems.  **Dropout** helps reduce overfitting; **Glorot initialization** scales the initial weights so that the variance of the signal stays roughly constant from layer to layer; **learning rate decay** makes sure we eventually reach the minimum instead of hopping all over the place; and last, **momentum** helps us avoid getting stuck in a local minimum.  Such an architecture is usually quite okay for **normal classification** problems.

However, specific classification problems such as image, text, or signal classification each have their own nature and benefit from different architectures.

Today, we are going to work on images (this field is called computer vision), discuss why Dense layers may not be the best choice, and propose the CNN (Convolutional Neural Network), which is built from convolutional layers, as a better way of dealing with image classification.

### Convolutional Layer
Let's say we are given an image of 24 x 24 pixels = 576 features.  We might input these features into Dense layers and ask the Dense layers to learn the relationships.

However, this is not optimal because it ignores **the actual nature of images**.  The key is that each single pixel holds very little information, right?  The patterns in an image are better recognized from patches of pixels than from single pixels.  Imagine I give you only one-fourth of a picture of a cat; can you recognize that it's a cat?  Probably yes.  But what if I give you a single pixel... you will have zero idea.

**Why are the patterns in images better recognized by patches?** ...because humans recognize visual patterns like corners, edges, and sharpness.  Combining all these visual patterns forms the image.  This is how humans see, and we should apply the same principle to neural networks.

**So how do we generate a feature from each patch?** ...actually, it is very easy.  We simply perform a convolution operation like this:

![](figures/no_padding_no_strides.gif)

Mathematically, it looks like this:

Let's say we have a 5 x 5 input image I:

$$ I = \begin{bmatrix}
i_{11} & i_{12} & i_{13} & i_{14} & i_{15}
\\
i_{21} & i_{22} & i_{23} & i_{24} & i_{25}
\\
i_{31} & i_{32} & i_{33} & i_{34} & i_{35}
\\
i_{41} & i_{42} & i_{43} & i_{44} & i_{45}
\\
i_{51} & i_{52} & i_{53} & i_{54} & i_{55}
\end{bmatrix}
$$

Each of these pixels may represent a brightness ranging from 0 to 255.

If we define a 3 x 3 patch of **weights (W)**, which in computer vision are called **filters/kernels**, like this:

$$ w = \begin{bmatrix}
w_{11} & w_{12} & w_{13}
\\
w_{21} & w_{22} & w_{23}
\\
w_{31} & w_{32} & w_{33}
\end{bmatrix}
$$

Let's say we are scanning the middle of the image, then the output feature would be (we'll denote this as $o_{33}$):

$$\begin{aligned}
o_{33} ={}& w_{11} * i_{22} + w_{12} * i_{23} + w_{13} * i_{24} \\
         &+ w_{21} * i_{32} + w_{22} * i_{33} + w_{23} * i_{34} \\
         &+ w_{31} * i_{42} + w_{32} * i_{43} + w_{33} * i_{44}
\end{aligned}$$
           
This results in one output feature.  Of course, we may add a bias to it and then feed it through an activation function.

When we perform this operation across the whole image, it is called **convolution**, and it produces the output features.
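
To make the sliding-window idea concrete, here is a minimal NumPy sketch of the operation described above, without padding and with a step of 1. The function name `conv2d_valid` and the values of `I` and `w` are assumptions chosen for illustration; they are not part of the `neuralnet` package.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, step of 1) and return the output features."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + k_h, c:c + k_w]
            # element-wise product of the patch and the kernel, summed into one feature
            output[r, c] = np.sum(patch * kernel)
    return output

# 5 x 5 input image I and 3 x 3 filter w (arbitrary illustrative values)
I = np.arange(1, 26, dtype=float).reshape(5, 5)
w = np.ones((3, 3))

out = conv2d_valid(I, w)
print(out.shape)   # (3, 3)
# The text labels the middle feature o_33 after its center pixel i_33;
# in the 3 x 3 output array it sits at index [1, 1].
print(out[1, 1])   # 117.0
```

A bias and an activation function would be applied to `out` afterwards, exactly as with a Dense layer.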

A few questions remain to be answered:

#### Kernels

1. **What should the weights/filters/kernels look like?**  It turns out that each filter detects the presence of a certain visual pattern.  For example, the filter below detects whether there is an edge at that location of the image.  There are also similar filters that detect corners, lines, etc.  Check out https://setosa.io/ev/image-kernels/ and try changing the values.

$$ w = \begin{bmatrix}
0 & 1 & 0
\\
1 & -4 & 1
\\
0 & 1 & 0
\end{bmatrix}
$$

**Then how many filters should we apply?**  For each image, we can apply multiple filters.  If we apply 2 filters, the output features become 3D, with shape $2 * \text{output-width} * \text{output-height}$.  We can easily generalize this to
$$ \text{num-filters} * \text{output-width} * \text{output-height} $$
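
As a rough illustration of both points, the sketch below applies the edge filter shown above to a tiny image containing a single vertical edge, then stacks its output with the output of a second filter. The helper `convolve`, the 6 x 6 test image, and the averaging filter are assumptions made up for this example.

```python
import numpy as np

def convolve(image, kernel):
    """Convolution without padding and with a step of 1."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    return np.array([[np.sum(image[r:r + k_h, c:c + k_w] * kernel)
                      for c in range(out_w)]
                     for r in range(out_h)])

# hypothetical 6 x 6 image: dark left half, bright right half (one vertical edge)
image = np.zeros((6, 6))
image[:, 3:] = 1.0

edge_filter = np.array([[0, 1, 0],
                        [1, -4, 1],
                        [0, 1, 0]], dtype=float)
blur_filter = np.full((3, 3), 1 / 9)   # a second, averaging filter (assumption)

# one output map per filter, stacked along a new leading axis
feature_maps = np.stack([convolve(image, f) for f in (edge_filter, blur_filter)])

print(feature_maps.shape)   # (2, 4, 4) -> num-filters * output-width * output-height
print(feature_maps[0][0])   # [ 0.  1. -1.  0.] -> non-zero only around the edge
```

The edge filter responds only where the brightness changes and outputs zero over flat regions, which is what "detecting the presence of a pattern" means here.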

#### Padding

2. **How should we convolve?**  Should we cover the entire image?  Should we keep the output features the same size as the input features?  Recall this image:

<img src ="figures/no_padding_no_strides.gif" width="150">

It has 4 x 4 pixels = 16 features.  But after convolution, we only have 2 x 2 pixels = 4 features left.  Is that good?  There is no correct answer here, but we can be quite sure that we lose some information.  In fact, it is often nice to **keep the output features the same size as the input features**, but how?  There is no room left to convolve, since the filter is 3 x 3 and can only shift right one time.

The answer is **padding**: we enlarge the input image by padding its surroundings with zeros.  How much?  Pad until the output has the original size or a larger size, for example like this.  The figure below uses **half padding**, which makes the output features the same size as the input features.

<img src ="figures/same_padding_no_strides.gif" width="150">

The figure below uses **full padding**, which pads enough to make sure every single pixel is fully convolved, resulting in output features that are even larger than the input.

<img src ="figures/full_padding_no_strides.gif" width="150">

Mathematically, it is easiest to understand padding from a 1D input like this:

$$ \text{input} = [1, 2, 3, 4, 5] $$

to

$$ \text{input}_{\text{padded}} = [0, 1, 2, 3, 4, 5, 0] $$

Normally, a larger output may benefit from more features, but it also suffers from longer training time.  It is probably best to pad only enough to get the same size as the input features.
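
The 1D example can be reproduced with `np.pad`. The sketch below is only an illustration: the size-3 filter of ones and the helper `conv1d` are assumptions, used to show how the amount of padding changes the number of output features.

```python
import numpy as np

def conv1d(signal, kernel):
    """1D convolution with a step of 1 over whatever signal it is given."""
    k = len(kernel)
    return np.array([np.sum(signal[i:i + k] * kernel)
                     for i in range(len(signal) - k + 1)])

x = np.array([1, 2, 3, 4, 5])
kernel = np.array([1, 1, 1])   # hypothetical filter of size 3

# pad one zero on each side (half padding for a filter of size 3)
x_padded = np.pad(x, pad_width=1, mode="constant", constant_values=0)
print(x_padded)                     # [0 1 2 3 4 5 0]

no_pad = conv1d(x, kernel)
half_pad = conv1d(x_padded, kernel)
full_pad = conv1d(np.pad(x, pad_width=2), kernel)   # filter size - 1 zeros per side

print(len(no_pad), no_pad)          # 3 [ 6  9 12]
print(len(half_pad), half_pad)      # 5 [ 3  6  9 12  9]
print(len(full_pad))                # 7
```

With a filter of size 3, one zero per side keeps the output at 5 features, the same as the input, while two zeros per side grow it to 7.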

#### Strides

3. **What about the step of the convolution?**  Should we shift 1 step per convolution, or 2 steps, or more?  **It really depends on how detailed you want the scan to be.  Defining bigger steps reduces the feature size and thus reduces the computation time.**  A bigger step is like a human scanning a picture more roughly... whether to use it is something to be determined by experiment.

In computer vision, we call this step the **stride**.  Examples look like this:

**No padding with stride of 2**

<img src ="figures/no_padding_strides.gif" width="150">

**Padding with stride of 2**

<img src ="figures/padding_strides.gif" width="150">

**Padding with stride (odd)**

<img src ="figures/padding_strides_odd.gif" width="150">
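
Padding and stride together determine the output size. For an input of width $n$, a filter of width $k$, padding $p$ on each side, and stride $s$, the standard result is $\lfloor (n + 2p - k) / s \rfloor + 1$. The sketch below checks this against a direct implementation; the function names and the example sizes are assumptions chosen for illustration, and the filter is assumed to be square.

```python
import numpy as np

def conv_output_size(n, k, padding=0, stride=1):
    """Number of output features along one dimension."""
    return (n + 2 * padding - k) // stride + 1

def conv2d(image, kernel, padding=0, stride=1):
    """Convolution with zero padding and a configurable stride (square kernel assumed)."""
    padded = np.pad(image, padding)   # np.pad defaults to constant zeros
    k = kernel.shape[0]
    out_h = conv_output_size(image.shape[0], k, padding, stride)
    out_w = conv_output_size(image.shape[1], k, padding, stride)
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            top, left = r * stride, c * stride   # jump `stride` pixels per step
            out[r, c] = np.sum(padded[top:top + k, left:left + k] * kernel)
    return out

kernel = np.ones((3, 3))
a = conv2d(np.ones((5, 5)), kernel, padding=0, stride=2)
b = conv2d(np.ones((5, 5)), kernel, padding=1, stride=2)
c = conv2d(np.ones((6, 6)), kernel, padding=1, stride=2)

print(a.shape)   # (2, 2)
print(b.shape)   # (3, 3)
print(c.shape)   # (3, 3)
```

The last case shows the effect of the floor: a 6 x 6 input with padding 1 and stride 2 gives the same 3 x 3 output as a 5 x 5 input, because the final row and column of the padded image are never reached by the filter.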

#### Max/Average pooling

Talking about **reducing computation time**, another way is to add a **pooling layer**, which simply downsamples the image by averaging a set of pixels or by taking their maximum value.  If we define a pooling size of 2, each 2 x 2 block of pixels is mapped to one output, like this:

<img src ="figures/pooling.png" width="300">
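
Although pooling is not implemented in this lecture, a minimal NumPy sketch shows how little is involved. The function `pool2d` and the 4 x 4 values are assumptions for illustration; the windows do not overlap, and any rows or columns that do not fill a complete window are cropped.

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    """Downsample `image` by mapping each non-overlapping size x size block to one value."""
    h, w = image.shape
    h_c, w_c = h - h % size, w - w % size   # crop to a multiple of `size`
    # axes after reshape: (block row, row inside block, block column, column inside block)
    blocks = image[:h_c, :w_c].reshape(h_c // size, size, w_c // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")

image = np.array([[1, 3, 2, 4],
                  [5, 7, 6, 8],
                  [9, 2, 1, 0],
                  [3, 4, 5, 6]], dtype=float)

max_pooled = pool2d(image, size=2, mode="max")
avg_pooled = pool2d(image, size=2, mode="average")

print(max_pooled)   # [[7. 8.]
                    #  [9. 6.]]
print(avg_pooled)   # [[4.  5. ]
                    #  [4.5 3. ]]
```

Sixteen input values become four outputs, and nothing is learned in the process: pooling has no weights.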

Nevertheless, pooling has a really big downside: it basically loses a lot of information.  Compared to pooling, a stride simply scans fewer positions while each output still looks at the pixels at their original resolution, whereas pooling reduces the resolution of the image itself.  As Geoffrey Hinton said in his Reddit AMA in 2014: **The pooling operation used in CNN is a big mistake and the fact that it works so well is a disaster**.  In fact, most recent CNN architectures like ResNets use pooling very minimally or not at all.  In this lecture, we are not going to implement pooling; we only talk about it for the sake of completeness, since very early architectures like AlexNet use pooling.