# CNN
- Brief Recap
- Applications


## Filter

Edge detection: edge detection and other kinds of object detection are performed by applying filters to a given image.

Vertical edge detection filter:

$$
\begin{bmatrix}
1 & 0 & -1 \\
1 & 0 & -1 \\
1 & 0 & -1
\end{bmatrix}
$$

Horizontal edge detection filter:
$$
\begin{bmatrix}
1 & 1 & 1 \\
0 & 0 & 0 \\
-1 & -1 & -1
\end{bmatrix}
$$

Usually, the elements of an edge detection filter sum to 0, so that a region of uniform brightness (no edge) produces a response of 0.
For the vertical edge detection filter, each row sums to 0. For the horizontal edge detection filter, each column sums to 0.
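
To make the filtering step concrete, here is a minimal sketch, assuming NumPy is available. The helper `correlate2d_valid` and the toy 6x6 image are hypothetical names and data chosen for illustration; the helper slides the filter over the image with no padding and stride 1.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), summing elementwise products."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out


# Vertical edge detection filter from the notes; its transpose is the horizontal one.
vertical = np.array([[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]])
horizontal = vertical.T

# Toy image (an assumption for illustration): bright left half, dark right half.
image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])

edges = correlate2d_valid(image, vertical)
print(edges)                                 # each row is [0, 30, 30, 0]
print(correlate2d_valid(image, horizontal))  # all zeros: the image has no horizontal edge
```

The uniform regions give 0 because the filter elements sum to 0; only the windows that straddle the bright-to-dark boundary respond.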

A CNN learns its filters automatically from data instead of relying on manually designed ones.

## Convolutional Layer

**Filter**: 

- size: 3 x 3 x channels, where channels = 3 for color images and channels = 1 for grayscale images
- The values in the filter are not known, and are learned during training. 

**Stride**:
- 1, 2, 3, etc.: how many pixels the filter moves at each step
- striding:
  - the original image is $(n, n)$, the filter is $(f, f)$, the stride is $s$
  - after a strided convolution, the output size is $(\frac{n-f}{s}+1, \frac{n-f}{s}+1)$
    - if $\frac{n-f}{s}$ is not an integer, take the floor
    - the output size is then $(\lfloor \frac{n-f}{s} \rfloor + 1, \lfloor \frac{n-f}{s} \rfloor + 1)$

**Padding**: add zeros around the image
- filtering and striding shrink the image, so we pad the input if we don't want the output to be smaller
- notation:
  - the original image is $(n, n)$, the filter is $(f, f)$, the padding is $p$, the stride is $s$
    - padding $p$ means adding $p$ rows or columns of zeros on each side of the image
  - with padding $p$, the output size is $(\lfloor \frac{n-f+2p}{s} + 1 \rfloor, \lfloor \frac{n-f+2p}{s} + 1 \rfloor)$
- how much to pad?
  - no padding ("valid" convolution, $s = 1$): (n,n) -> (n-f+1, n-f+1)
  - same convolution: pad so that the output size is the same as the input size
    - with $s = 1$: $n = n-f+2p+1$ -> $p = \frac{f-1}{2}$ (an integer when $f$ is odd)
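
The size formulas for stride and padding can be checked with a small sketch. The function names `conv_output_size` and `same_padding` are hypothetical, and integer floor division stands in for the floor in the formula.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length: floor((n - f + 2p) / s) + 1."""
    if n + 2 * p < f:
        raise ValueError("filter is larger than the padded input")
    return (n - f + 2 * p) // s + 1


def same_padding(f):
    """Padding p = (f - 1) / 2 that keeps the size unchanged at stride 1."""
    if f % 2 == 0:
        raise ValueError("this sketch only handles odd filter sizes")
    return (f - 1) // 2


print(conv_output_size(6, 3))                     # no padding: 6 -> 4
print(conv_output_size(7, 3, s=2))                # stride 2: 7 -> 3
print(conv_output_size(8, 3, s=2))                # non-integer case, floored: 8 -> 3
print(conv_output_size(6, 3, p=same_padding(3)))  # same convolution: 6 -> 6
```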

**Convolutions over Volume**
- RGB image -> 3D volume: $(n, n, n_c)$
- filter -> 3D volume: $(f, f, n_c)$
- padding $p$, stride $s$
- the output is a 2D image: $(\lfloor \frac{n-f+2p}{s} + 1 \rfloor, \lfloor \frac{n-f+2p}{s} + 1 \rfloor)$
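
A convolution over a volume can be sketched in NumPy as follows. This is an illustrative loop-based version, not an efficient implementation; `conv_volume` is a hypothetical name, and the all-ones input and filter are made up so the result is easy to verify by hand.

```python
import numpy as np


def conv_volume(x, w, p=0, s=1):
    """Convolve an (n, n, n_c) input with one (f, f, n_c) filter; returns a 2D array."""
    n, _, n_c = x.shape
    f = w.shape[0]
    if w.shape != (f, f, n_c):
        raise ValueError("filter depth must match the number of input channels")
    # Zero-pad height and width only, never the channel axis.
    x = np.pad(x, ((p, p), (p, p), (0, 0)))
    n_out = (n - f + 2 * p) // s + 1
    out = np.zeros((n_out, n_out))
    for i in range(n_out):
        for j in range(n_out):
            patch = x[i * s:i * s + f, j * s:j * s + f, :]
            out[i, j] = np.sum(patch * w)  # sum over height, width and channels
    return out


x = np.ones((6, 6, 3))   # toy RGB-like volume
w = np.ones((3, 3, 3))   # one 3D filter

print(conv_volume(x, w).shape)        # (4, 4)
print(conv_volume(x, w)[0, 0])        # 27.0 = 3 * 3 * 3
print(conv_volume(x, w, p=1).shape)   # (6, 6)
print(conv_volume(x, w, s=2).shape)   # (2, 2)
```

Summing over the channel axis is what collapses the 3D volume into a single 2D output per filter.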
  
**Convolutional Layer**: A convolution layer is a set of filters. Each filter is a 3D tensor. The number of filters is the number of channels in the output image.
- input: $(n, n, n_c)$
- filter size: $(f, f, n_c)$, filter number: $n_f$, padding: $p$, stride: $s$
  - filter output: $(\lfloor \frac{n-f+2p}{s} + 1 \rfloor, \lfloor \frac{n-f+2p}{s} + 1 \rfloor, n_f)$
- activation function: $a(z)$
  - input: the filter output plus a bias term (one bias per filter)
  - output: same size as the input
- How many parameters in one layer?
  - filters: $n_f$ filters
    - filter: $(f, f, n_c)$ -> $f^2n_c$
    - bias: 1
  - total: $n_f(f^2n_c+1)$
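
The parameter count can be cross-checked by allocating arrays of the corresponding shapes. This is a small sketch; the layer sizes (3x3 filters, 3 input channels, 10 filters) are hypothetical values chosen for illustration.

```python
import numpy as np


def conv_layer_param_count(f, n_c, n_f):
    """n_f filters, each with f * f * n_c weights and 1 bias."""
    return n_f * (f * f * n_c + 1)


f, n_c, n_f = 3, 3, 10                 # hypothetical layer sizes

weights = np.zeros((f, f, n_c, n_f))   # one (f, f, n_c) filter per output channel
biases = np.zeros(n_f)                 # one bias per filter

print(weights.size + biases.size)           # 280
print(conv_layer_param_count(f, n_c, n_f))  # 280
```

Note that the count does not depend on the input height or width $n$.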

## Pooling Layer 
**Pooling**: subsampling the pixels does not change the object in the image
- no weights to learn
- max pooling: take the max value in the filter
- average pooling: take the average value in the filter
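
Both pooling variants can be sketched with one loop-based helper. The name `pool2d` and the 4x4 example input are hypothetical; the window size `f` and stride `s` follow the notation used for convolutions above.

```python
import numpy as np


def pool2d(x, f=2, s=2, mode="max"):
    """Pool a 2D array with an (f, f) window and stride s; there are no weights to learn."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    reduce = np.max if mode == "max" else np.mean
    n_h, n_w = x.shape
    out = np.zeros(((n_h - f) // s + 1, (n_w - f) // s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = reduce(x[i * s:i * s + f, j * s:j * s + f])
    return out


x = np.arange(16, dtype=float).reshape(4, 4)

print(pool2d(x, mode="max"))      # [[ 5.  7.] [13. 15.]]
print(pool2d(x, mode="average"))  # [[ 2.5  4.5] [10.5 12.5]]
```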

A note on terminology:
- because pooling usually has no weights, most of the literature does not count pooling as a layer when counting the number of layers in a CNN


## Typical CNN architecture

image -> convolutional layer -> pooling layer -> convolutional layer -> pooling layer -> flatten() -> fully connected layer -> output

Why convolution?
- sparsity of connections -> far fewer learnable parameters than a fully connected layer
  - fully connected layer: $n_x \times n_y \times n_c \times n_f$ weights for an $(n_x, n_y, n_c)$ input and $n_f$ output units
  - CNN: $(f^2n_c+1) \times n_f$
- parameter sharing
  - a feature detector that is useful in one part of the image is probably useful in another part of the image
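
To get a feel for the gap in parameter counts, here is a small sketch. The sizes are assumptions chosen for illustration: a (32, 32, 3) input mapped to a (28, 28, 6) output, either by a fully connected layer between the flattened volumes or by six 5x5 filters.

```python
def fully_connected_weights(n_in, n_out):
    """Every input value connects to every output unit (biases ignored)."""
    return n_in * n_out


def conv_params(f, n_c, n_f):
    """(f^2 * n_c + 1) parameters per filter, times n_f filters."""
    return (f * f * n_c + 1) * n_f


n_in = 32 * 32 * 3    # flattened input volume: 3072 values
n_out = 28 * 28 * 6   # flattened output volume: 4704 values

print(fully_connected_weights(n_in, n_out))  # 14450688
print(conv_params(f=5, n_c=3, n_f=6))        # 456
```

The convolutional layer reuses the same 456 parameters at every position, which is the parameter sharing described above.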


## Backpropagation

How is backpropagation implemented for a CNN?


## Applications

**AlphaGo**

CNN design:
- input:
  - the image size is 19 x 19
  - channels = 48: human-designed channels that capture features of the game
- output: 19 x 19 -> where the next move should be

Pooling is not used in AlphaGo, because pooling is like removing rows and columns from the image, which is not good for board games where every position matters.

**Speech**
The characteristics of speech need to be considered when designing the CNN architecture.

Reference:
- https://dl.acm.org/doi/10.1109/TASLP.2014.2339736

**Natural Language Processing**

Reference:
- https://www.aclweb.org/anthology/S15-2079.pdf

## Some criticism of CNN
- CNN is not invariant to scale and rotation -> we need data augmentation
- Spatial Transformer Layer can be used to solve this problem as well

## Awesome Architectures

### VGG

### Residual Net

A residual net can easily learn an identity mapping for a layer in a deep neural network, so the deep network can at least match the performance of a shallower one. With luck, the residual net learns a better function than the shallower network.
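
The identity argument can be illustrated with a toy residual block. This sketch uses fully connected layers instead of convolutions to keep it short, which is a simplifying assumption; `residual_block` and the weight names are hypothetical.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def residual_block(x, w1, b1, w2, b2):
    """Two layers plus a skip connection: relu(w2 @ relu(w1 @ x + b1) + b2 + x)."""
    h = relu(w1 @ x + b1)
    return relu(w2 @ h + b2 + x)


# Input is assumed non-negative, as it would be after a previous ReLU.
x = np.array([0.5, 2.0, 0.0])

# With all weights and biases at zero, the block reduces to relu(x) = x.
zeros_w, zeros_b = np.zeros((3, 3)), np.zeros(3)
print(residual_block(x, zeros_w, zeros_b, zeros_w, zeros_b))  # [0.5 2.  0. ]
```

So pushing the weights toward zero is enough for the block to pass its input through unchanged, which is why adding such blocks should not hurt the network.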

### Inception Net or GoogLeNet

A typical convolutional network shrinks the height and width of the image but increases the number of channels.
A 1x1 convolution can be used to shrink the number of channels while keeping the height and width of the image.

A 1x1 convolution is a simple convolutional layer with a filter size of 1x1, followed by a nonlinearity such as ReLU.

Say we want to design a layer that transforms (28,28,192) into (28,28,32).
For a normal convolutional layer with a 5x5 filter (same padding), the number of parameters is 32x5x5x192 ≈ 150K, and the total number of multiplications is 150Kx28x28 ≈ 120M.
Using a 1x1 convolution, the number of parameters is 32x1x1x192 ≈ 6K, and the total number of multiplications is 6Kx28x28 ≈ 4.8M.
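
The arithmetic above can be reproduced with a short sketch, which also shows that a 1x1 convolution amounts to a matrix product over the channel axis at every pixel. The helper `conv_cost` is a hypothetical name, biases are ignored, and the random input is made up for the shape check.

```python
import numpy as np


def conv_cost(n_h, n_w, n_c_in, n_c_out, f):
    """Return (parameters, multiplications) for a same-padded f x f convolution, biases ignored."""
    params = f * f * n_c_in * n_c_out
    mults = params * n_h * n_w  # every filter weight is used once per output position
    return params, mults


print(conv_cost(28, 28, 192, 32, f=5))  # (153600, 120422400)
print(conv_cost(28, 28, 192, 32, f=1))  # (6144, 4816896)

# A 1x1 convolution followed by ReLU: mix the 192 channels at each pixel into 32.
rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))
w = rng.standard_normal((192, 32))  # the (1, 1, 192, 32) filter bank with its unit axes dropped
y = np.maximum(x @ w, 0.0)
print(y.shape)                          # (28, 28, 32)
```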

Inception module
- input: previous layer: (28,28,192)
- channel concatenate: stack the outputs of the following layers by channels
  - 1x1 conv -> (28,28,64)
  - 1x1 conv: (1,1,192,96) -> 3x3 conv -> (28, 28, 128) 
  - 1x1 conv: (1,1,192,16) -> 5x5 conv -> (28, 28, 32)
  - 3x3 maxpool (padding for same, s=1) -> (28, 28, 192) -> 1x1 conv -> (28, 28, 32)
- output: stacking leads to a layer output of (28, 28, 256), since 64 + 128 + 32 + 32 = 256
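
The channel concatenation at the end of the module can be sketched with placeholder arrays. Only the shapes matter here; the zero-filled arrays stand in for the real branch outputs, and the dictionary keys are hypothetical labels.

```python
import numpy as np

# Output channels of the four branches listed above.
branch_channels = {"1x1": 64, "1x1 -> 3x3": 128, "1x1 -> 5x5": 32, "maxpool -> 1x1": 32}

# Placeholder branch outputs: same height and width, different channel counts.
branch_outputs = [np.zeros((28, 28, c)) for c in branch_channels.values()]

# Stack along the channel (last) axis.
module_output = np.concatenate(branch_outputs, axis=-1)
print(module_output.shape)  # (28, 28, 256)
```

Concatenation only works because every branch keeps the 28 x 28 spatial size, which is why the pooling branch uses same padding and stride 1.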

## MobileNet

used for mobile and embedded vision applications due to low computational cost

key idea: normal vs depthwise-separable convolutions
- normal convolution: n_c' filters of shape (f, f, n_c) -> output (n_h', n_w', n_c')
- depthwise-separable convolution: depthwise convolution with an (f, f, n_c) filter -> (n_h', n_w', n_c) -> pointwise convolution with a (1, 1, n_c, n_c') filter -> (n_h', n_w', n_c')
    - input: (n_h, n_w, n_c)
    - depthwise: (f, f, n_c)
      - output: (n_h', n_w', n_c)
      - But we want n_c' instead of n_c.
    - pointwise: (1, 1, n_c, n_c')
      - output: (n_h', n_w', n_c')
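
The two steps can be sketched in NumPy to confirm the shapes. This is a loop-based illustration with no padding and stride 1; `depthwise_conv` and `pointwise_conv` are hypothetical helper names, and the all-ones tensors are made up so the values are easy to check.

```python
import numpy as np


def depthwise_conv(x, w):
    """x: (n_h, n_w, n_c), w: (f, f, n_c). Each channel is filtered by its own (f, f) slice."""
    n_h, n_w, n_c = x.shape
    f = w.shape[0]
    out = np.zeros((n_h - f + 1, n_w - f + 1, n_c))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum over height and width only, so channels stay separate.
            out[i, j, :] = np.sum(x[i:i + f, j:j + f, :] * w, axis=(0, 1))
    return out


def pointwise_conv(x, w):
    """x: (n_h', n_w', n_c), w: (n_c, n_c_out). A 1x1 convolution that mixes channels."""
    return x @ w


x = np.ones((6, 6, 3))
dw = depthwise_conv(x, np.ones((3, 3, 3)))
pw = pointwise_conv(dw, np.ones((3, 5)))

print(dw.shape, dw[0, 0, 0])  # (4, 4, 3) 9.0
print(pw.shape, pw[0, 0, 0])  # (4, 4, 5) 27.0
```

The depthwise step handles the spatial filtering and the pointwise step changes the channel count from n_c to n_c'.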

Say we have an input of (6,6,3) and want to convolve it with 5 filters of shape (3,3,3), i.e. a filter tensor of (3,3,3,5), to output (4,4,5).
The computation cost in terms of multiplications:
- normal convolution: # filter params * # filter operations * # of filters
  - multiplications: (3x3x3) x (4x4) * 5 = 2160
  - learnable parameters: (3x3x3)*5 = 135
- depthwise-separable convolution
  - depthwise filter: (3,3,3)
    - output: (4,4,3)
    - multiplications: (3x3) * (4x4) * 3 = 432
    - learnable parameters: (3x3)*3 = 27
  - pointwise filter: (1,1,3,5)
    - output: (4,4,5)
    - multiplications: (1x1x3) * (4x4) * 5 = 240
    - learnable parameters: (1x1x3) * 5 = 15
  - in total:
    - multiplications: 432 + 240 = 672  
    - learnable parameters: 27 + 15 = 42
- comparison:
  - the depthwise-separable convolution is cheaper: only 672/2160 ≈ 0.31 of the multiplications of the normal convolution
  - fewer parameters: 42/135 ≈ 0.31
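
The cost comparison generalizes to other sizes. The following sketch reproduces the numbers above; the function names are hypothetical, biases are ignored, and `n_out` denotes the output height and width.

```python
def normal_conv_cost(f, n_c, n_c_out, n_out):
    """Return (multiplications, parameters) for a normal convolution."""
    params = f * f * n_c * n_c_out
    mults = (f * f * n_c) * (n_out * n_out) * n_c_out
    return mults, params


def separable_conv_cost(f, n_c, n_c_out, n_out):
    """Return (multiplications, parameters) for depthwise plus pointwise convolution."""
    dw_params = f * f * n_c
    dw_mults = (f * f) * (n_out * n_out) * n_c
    pw_params = n_c * n_c_out
    pw_mults = n_c * (n_out * n_out) * n_c_out
    return dw_mults + pw_mults, dw_params + pw_params


normal = normal_conv_cost(f=3, n_c=3, n_c_out=5, n_out=4)
separable = separable_conv_cost(f=3, n_c=3, n_c_out=5, n_out=4)

print(normal)                               # (2160, 135)
print(separable)                            # (672, 42)
print(round(separable[0] / normal[0], 2))   # 0.31
```

Dividing the two multiplication counts gives a ratio of 1/n_c' + 1/f^2, so the saving grows with the number of output channels and the filter size.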

### MobileNet v1

architecture:
- input 
- module
  - depthwise convolution
  - pointwise convolution
- output


### MobileNet v2
two changes:
- add a residual connection between input and output
- add an expansion filter before the depthwise filter, so each block consists of
  - expansion filter
  - depthwise filter
  - projection filter (i.e., pointwise filter)

## EfficientNet

Scale a given deep network up or down to fit a particular device.

Three dimensions along which a neural network can be scaled up or down:
- resolution: resolution of input images
- depth: depth of networks
- width: make layers wider

## Practical Advice

1. use open-source implementations
2. transfer learning
3. data augmentation
   - mirroring
   - shape/scale invariance
     - random cropping
     - rotation
     - shearing
     - local warping
   - color shifting -> e.g., (R,G,B) + (20,-20,10) -> color invariance
     - PCA color augmentation
4. implementing distortions during training
   - large images in hard disk
   - multiple threads
     - load image
     - perform distortion
     - formulate a mini-batch
   - another CPU thread or GPU
     - training
5. tips for doing well on benchmarks/winning competitions
   - ensembling
     - train several (3-15) networks independently and average their outputs (similar to bagging); this typically gives a ~2% improvement
       - rarely used in production
   - multi-crop at test time
     - run the classifier on multiple versions of each test image and average the results
     - e.g., 10-crop
     - slows down real-time inference
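
Three of the augmentation operations from item 3 (mirroring, random cropping, color shifting) can be sketched with NumPy alone. The function names and the 8x8 toy image are hypothetical; images are assumed to be `uint8` arrays of shape (height, width, 3).

```python
import numpy as np


def mirror(img):
    """Flip the image left to right."""
    return img[:, ::-1, :]


def random_crop(img, size, rng):
    """Cut a random (size, size) window out of the image."""
    h, w, _ = img.shape
    top = rng.integers(0, h - size + 1)    # upper bound is exclusive
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size, :]


def color_shift(img, shift):
    """Add a per-channel offset, e.g. (20, -20, 10), and clip to the valid range."""
    shifted = img.astype(np.int16) + np.asarray(shift, dtype=np.int16)
    return np.clip(shifted, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
img = np.full((8, 8, 3), 100, dtype=np.uint8)   # toy image

print(mirror(img).shape)                        # (8, 8, 3)
print(random_crop(img, 6, rng).shape)           # (6, 6, 3)
print(color_shift(img, (20, -20, 10))[0, 0])    # [120  80 110]
```

In a training pipeline these would typically run in the data-loading threads described in item 4, so the distorted mini-batches are ready while the previous batch is being trained on.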