# Building Blocks of a CNN
## Convolution layer
- One of the challenges of computer vision is that input images can be very large, so in a standard fully connected NN the weight matrix has very large dimensions. With so many parameters, it is difficult to get enough data to prevent the NN from overfitting, and the computational and memory requirements for training become infeasible. To work with large images, we therefore use the convolution operation.
- In general, an $n\times n$ matrix convolved with an $f \times f$ filter/kernel gives an $(n-f+1) \times (n-f+1)$ matrix.
- In plain Python you would write a function such as `conv_forward`; in TensorFlow use `tf.nn.conv2d`; in Keras use the `Conv2D` layer.
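
As a rough illustration of the $(n-f+1)$ rule, here is a minimal NumPy sketch of the operation. The function name `conv2d_valid` and the example image are assumptions made for this note, not part of any library.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the current window and the kernel, then sum.
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# 6x6 image: bright left half, dark right half (a vertical edge in the middle).
image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])
# 3x3 vertical edge detector.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

result = conv2d_valid(image, kernel)
print(result.shape)  # (4, 4), i.e. (n - f + 1) x (n - f + 1)
print(result[0])     # the two middle columns respond to the edge
```

The nested loops are slow but make the sliding window explicit; library functions such as `tf.nn.conv2d` compute the same kind of result far more efficiently.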

### Padding
- We want to apply the convolution operation multiple times, but if the image shrinks at every step we lose a lot of data along the way. Also, edge pixels are used less often than other pixels in the image.
- To solve these problems, we can pad the input image before convolution by adding rows and columns around it. The padding amount $p$ is the number of rows/columns inserted at the top, bottom, left and right of the image.
- The general rule becomes: an $n\times n$ matrix convolved with an $f\times f$ filter/kernel and padding $p$ gives an $(n+2p-f+1) \times (n+2p-f+1)$ matrix.
- In computer vision, $f$ is usually odd. One reason is that an odd-sized filter has a center value.
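
The padding rule can be checked with a small sketch. It assumes zero padding via `np.pad`; the helper names `same_padding` and `padded_output_size` are made up for this example. With odd $f$ and stride 1, choosing $p = (f-1)/2$ keeps the output the same size as the input.

```python
import numpy as np

def same_padding(f):
    """Padding that keeps the output size equal to the input size (stride 1, odd f)."""
    return (f - 1) // 2

def padded_output_size(n, f, p):
    """Output size n + 2p - f + 1 for stride 1."""
    return n + 2 * p - f + 1

image = np.arange(36, dtype=float).reshape(6, 6)
p = same_padding(3)
# Add p rows/columns of zeros on every side of the image.
padded = np.pad(image, p, mode="constant", constant_values=0)

print(padded.shape)                 # (8, 8)
print(padded_output_size(6, 3, p))  # 6
```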

### Stride Convolution
- In a strided convolution, the stride $s$ is the number of pixels the filter/kernel jumps each time it moves across the image.
- General rule: an $n\times n$ matrix convolved with an $f \times f$ filter/kernel, with padding $p$ and stride $s$, gives a $(\frac{n+2p-f}{s} +1) \times(\frac{n+2p-f}{s} +1)$ matrix.
- If $\frac{n+2p-f}{s}$ is not an integer, we take its floor: $\lfloor \frac{n+2p-f}{s} \rfloor + 1$.
- In math textbooks, the convolution operation flips the filter (horizontally and vertically) before applying it. What we do here is technically called cross-correlation, but by convention the deep learning literature calls it convolution.
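
A small sketch of the strided rule and of the flipping remark, assuming square inputs and zero padding. The function names are illustrative only.

```python
import numpy as np

def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1"""
    return (n + 2 * p - f) // s + 1

def cross_correlate2d(image, kernel, p=0, s=1):
    """Cross-correlation with zero padding p and stride s (square image and kernel)."""
    padded = np.pad(image, p, mode="constant")
    f = kernel.shape[0]
    n_out = (padded.shape[0] - f) // s + 1
    out = np.zeros((n_out, n_out))
    for i in range(n_out):
        for j in range(n_out):
            # The window's top-left corner moves s pixels per step.
            window = padded[i * s:i * s + f, j * s:j * s + f]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.arange(9, dtype=float).reshape(3, 3)

strided = cross_correlate2d(image, kernel, p=0, s=2)
print(strided.shape)                     # (3, 3)
print(conv_output_size(6, 3, p=0, s=2))  # 2, since 3/2 is floored to 1

# Textbook convolution is cross-correlation with the kernel flipped on both axes.
textbook_conv = cross_correlate2d(image, np.flip(kernel))
```

For a kernel that is not symmetric, `textbook_conv` differs from the plain cross-correlation. Since the filter values are learned anyway, the distinction does not matter in practice for a CNN.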

### Convolutions over volumes
- An image of shape height $\times$ width $\times$ number of channels is convolved with a filter of shape height $\times$ width $\times$ the same number of channels. The number of channels in the image and in the filter must match. The output of a single filter is only 2D.
- The different channels of the filter can either hold the same matrix or different matrices.
- You can apply several volume filters in one step to detect different features. Each gives a 2D matrix, and the results are stacked into a 3D matrix. Thus, if an $n\times n \times n_c$ matrix is convolved with $n_c'$ filters of shape $f \times f \times n_c$, we get an $(n-f+1) \times (n-f+1) \times n_c'$ matrix.
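
The stacking of 2D outputs into a volume can be sketched as follows. The function `conv_volume` and the array layout (filters stored as `(n_filters, f, f, n_c)`) are assumptions chosen for this illustration.

```python
import numpy as np

def conv_volume(image, filters):
    """image: (n, n, n_c); filters: (n_filters, f, f, n_c) -> (n-f+1, n-f+1, n_filters)."""
    n, _, n_c = image.shape
    n_filters, f, _, n_c_f = filters.shape
    assert n_c == n_c_f, "filter depth must match the number of input channels"
    n_out = n - f + 1
    out = np.zeros((n_out, n_out, n_filters))
    for k in range(n_filters):
        for i in range(n_out):
            for j in range(n_out):
                # Each filter spans all channels and produces one number per position.
                out[i, j, k] = np.sum(image[i:i + f, j:j + f, :] * filters[k])
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))       # n = 6, n_c = 3
filters = rng.standard_normal((2, 3, 3, 3))  # two 3x3x3 filters
output = conv_volume(image, filters)
print(output.shape)  # (4, 4, 2)
```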

## Pooling layers
- CNNs often use pooling layers to reduce the size of the inputs, speed up computation, and make some of the feature detectors more invariant to their position in the input.
- The two types of pooling layers are max pooling and average pooling. Max pooling is used much more often than average pooling.
- The intuition behind max pooling is: if the feature is detected anywhere in the window, keep the highest number. The main reason people use pooling, though, is that it works well in practice and reduces computation; there is no widely agreed explanation of exactly why it works.
- Max pooling slides an $(f,f)$ window over the input and stores the max value of the window in the output. Pooling is done on each channel independently.
- Just like in convolution, $f$ and $s$ are hyperparameters of the pooling layer. Padding is rarely used. Most often $f=2$, $s=2$, which shrinks the height and width by a factor of about 2.
- A pooling layer has no parameters for backprop to train.
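
Max pooling as described above can be sketched in a few lines of NumPy. The function name `max_pool` and the `(height, width, channels)` layout are assumptions for this example.

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """x: (n_h, n_w, n_c). Takes the max of each f x f window, per channel."""
    n_h, n_w, n_c = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * s:i * s + f, j * s:j * s + f, :]
            # Max over the spatial axes only, so channels stay independent.
            out[i, j, :] = window.max(axis=(0, 1))
    return out

x = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = max_pool(x)
print(pooled.shape)  # (2, 2, 1)
# The four 2x2 blocks have maxima 5, 7, 13 and 15.
```

Note that the function has no weights at all, which is why there is nothing for backprop to learn in this layer.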

# One layer of convolutional network
- We first convolve the input with some filters, add a bias to the output of each filter, and then apply a ReLU activation to the result. The filters play a role similar to the weight matrix in a fully connected layer.
- Summary of notation, if layer $l$ is a convolution layer:
    - $f^{[l]}$ = filter size in layer $l$
    - $p^{[l]}$ = padding
    - $s^{[l]}$ = stride
    - $n_c^{[l]}$ = number of filters
    - Each filter is $f^{[l]} \times f^{[l]} \times n_c^{[l-1]}$
    - Input: $n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]}$
    - Output: $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$
    - $n_H^{[l]} = \left\lfloor \frac{n_H^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} \right\rfloor + 1$, and likewise for $n_W^{[l]}$
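
To make the notation concrete, the sketch below computes the output shape and parameter count of one convolution layer, and a single output activation. The function names and the 39x39x3 example input are assumptions for illustration.

```python
import numpy as np

def conv_layer_shape_and_params(n_h_prev, n_w_prev, n_c_prev, f, p, s, n_c):
    """Return the output shape and the number of trainable parameters of one CONV layer."""
    n_h = (n_h_prev + 2 * p - f) // s + 1
    n_w = (n_w_prev + 2 * p - f) // s + 1
    # Each filter has f*f*n_c_prev weights plus one bias.
    n_params = (f * f * n_c_prev + 1) * n_c
    return (n_h, n_w, n_c), n_params

def conv_unit(window, W, b):
    """One output activation: ReLU(sum(window * W) + b)."""
    return max(0.0, float(np.sum(window * W) + b))

shape, n_params = conv_layer_shape_and_params(39, 39, 3, f=3, p=0, s=1, n_c=10)
print(shape, n_params)  # (37, 37, 10) 280

window = np.ones((3, 3, 3))
W = np.ones((3, 3, 3))
print(conv_unit(window, W, b=1.0))    # 28.0
print(conv_unit(window, W, b=-30.0))  # 0.0, the ReLU clips the negative value
```

The parameter count does not depend on the input height or width, only on the filter size and the channel counts.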


# Why convolutions
- The two main advantages of convolutions are:
    - Parameter sharing
        - A feature detector that is useful in one part of the image is probably useful in another part of the image.
    - Sparsity of connections
        - In each layer, each output value depends only on a small number of inputs.

    Through these two mechanisms, a neural network has far fewer parameters, which allows it to be trained with smaller training sets and makes it less prone to overfitting. Applying the same filter at every position also helps the network cope with translations of the input.
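
A quick back-of-the-envelope comparison shows how large the saving is. The layer sizes below (a 32x32x3 input mapped to a 28x28x6 output with six 5x5 filters) are an assumed example, not a measurement.

```python
# Fully connected layer mapping every input value to every output value.
n_in = 32 * 32 * 3   # 3072 input values
n_out = 28 * 28 * 6  # 4704 output values
fc_params = n_in * n_out + n_out  # weights + biases

# Convolution layer producing the same output shape with six 5x5x3 filters.
conv_params = (5 * 5 * 3 + 1) * 6  # weights + one bias per filter

print(fc_params)    # 14455392
print(conv_params)  # 456
```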

# Deep convolutional models: case studies
- Here are some classic CNN architectures:
    - LeNet-5
    - AlexNet
    - VGG
- ResNet, whose 152-layer version won the ImageNet (ILSVRC) 2015 competition, is a much deeper architecture.
- There is also an architecture called Inception, developed at Google, that is very useful to learn and apply to your own tasks.
  <img src="../Images/LeNet5.png" />
### LeNet-5
- Introduced by Yann LeCun in 1998.
- Some statistics about the example:
    - $n_c$: number of filters


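| Layer | Operation | Output/Activation shape | # of parameters |
|-------|-----------|-------------------------|-----------------|
| Input | | (32,32,3) | |
| 1 | CONV1 (f1=5, s1=1, p1=0, n_c=6) | (28,28,6) | 456 |
| | MaxPooling (f1p=2, s1p=2) | (14,14,6) | 0 |
| 2 | CONV2 (f2=5, s2=1, p2=0, n_c=16) | (10,10,16) | 2416 |
| | MaxPooling (f2p=2, s2p=2) | (5,5,16) | 0 |
| 3 | FC3 (120 neurons) | (120,1) | 48120 |
| 4 | FC4 (84 neurons) | (84,1) | 10164 |
| 5 | Softmax | (?,1) | ? |

Parameter counts include biases: (f·f·n_c_prev + 1)·n_c for a CONV layer and (n_prev + 1)·n for an FC layer, where FC3 takes the 5·5·16 = 400 flattened values as input.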
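
The activation shapes in the table can be traced with the output-size formula. This is a minimal sketch assuming the 32x32 input, the filter sizes and the strides listed above; `out_size` is a helper written for this note.

```python
def out_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1"""
    return (n + 2 * p - f) // s + 1

# (name, filter size, stride) for the convolution and pooling steps.
layers = [("CONV1", 5, 1), ("POOL1", 2, 2), ("CONV2", 5, 1), ("POOL2", 2, 2)]

n = 32
sizes = []
for name, f, s in layers:
    n = out_size(n, f, p=0, s=s)
    sizes.append(n)
    print(name, n)  # 28, 14, 10, 5

# The last pooled volume is 5 x 5 x 16, flattened before the first FC layer.
flattened = n * n * 16
print(flattened)  # 400
```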
### AlexNet
- The model was built for the ImageNet challenge, which classifies images into 1000 classes.
- Summary:
    ```
    Conv => Max-pool => Conv => Max-pool => Conv => Conv => Conv => Max-pool => Flatten => FC => FC => Softmax
    ```
- The paper convinced many computer vision researchers that deep learning is worth taking seriously.
### VGG-16
- A modification of AlexNet.
- It focuses on using only these blocks:
    - CONV = 3x3 filters, s=1, same padding
    - MAX-POOL = 2x2, s=2
- Pooling is the only operation responsible for shrinking the height and width.
## Residual Networks (ResNets)
- Very deep NNs are difficult to train because of the vanishing and exploding gradient problems.
- ResNets use skip connections, which take the activation from one layer and feed it directly to a layer much deeper in the network. This makes it possible to train much deeper NNs, even with more than 100 layers.
- Theoretically, a deeper NN should reach a smaller training error, but in practice the performance of a plain network suffers as it gets deeper because of vanishing and exploding gradients. ResNets allow deeper NNs without hurting performance.
### Why ResNets work?
- The identity function is easy for a residual block to learn. This means that adding the two layers of a residual block to your NN does not really hurt its ability to do as well as the simpler network without these two layers. And if the residual block actually learns something useful, you may do even better.
- What goes wrong in very deep plain nets without residual blocks is that, as the network gets deeper, it becomes very difficult for it to choose parameters that learn even the identity function, which is why adding many layers can make the result worse.
- The main reason residual networks work is that it is so easy for the extra layers to learn the identity function that they are very unlikely to hurt performance, and much of the time they even help.
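
The identity argument can be seen in a tiny sketch. It assumes a residual block made of two fully connected layers of equal width (real ResNets use convolutions) and a non-negative input, as you would get after a ReLU. All names are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def residual_block(a_prev, W1, b1, W2, b2):
    """Two layers with a skip connection from a_prev to the second activation."""
    a1 = relu(W1 @ a_prev + b1)
    z2 = W2 @ a1 + b2
    # Skip connection: add the block's input before the final ReLU.
    return relu(z2 + a_prev)

a_prev = np.array([0.5, 1.0, 2.0])  # non-negative, as after a ReLU
zero_W = np.zeros((3, 3))
zero_b = np.zeros(3)

# With all weights and biases at zero, the block simply passes its input through.
out = residual_block(a_prev, zero_W, zero_b, zero_W, zero_b)
print(np.allclose(out, a_prev))  # True
```

A plain two-layer block with the same zero weights would output all zeros instead, which is one way to see why the identity is harder for a plain network to represent.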

## 1x1 convolutions / Network in Network
- The idea of the 1x1 convolution has been very influential. For example:
    - It can shrink the number of channels and therefore save computation (this is also called feature transformation), whereas a pooling layer only shrinks $n_H$ and $n_W$.
    - If you keep the number of channels the same, that is fine too. The 1x1 convolution then simply adds non-linearity.
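
A 1x1 convolution is the same small fully connected layer applied at every pixel across the channel axis. The sketch below assumes a `(height, width, channels)` layout and a ReLU activation; the shapes 192 and 32 are example values.

```python
import numpy as np

def conv1x1(x, W, b):
    """x: (n_h, n_w, n_c_in); W: (n_c_in, n_c_out); b: (n_c_out,)."""
    # Matrix product over the channel axis at every pixel, then ReLU.
    return np.maximum(x @ W + b, 0)

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))
W = rng.standard_normal((192, 32))
b = np.zeros(32)

y = conv1x1(x, W, b)
print(y.shape)  # (28, 28, 32): height and width unchanged, channels reduced
```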
    

## Inception network
- When designing a CNN, you have to decide on every layer: will you pick a 3x3 conv, a 5x5 conv, or maybe a max pooling layer? There are many choices.
- The Inception idea is: why not use all of them at once? Apply the pooling and the different convolutions in parallel, then stack their outputs together along the channel dimension to form one output volume.
- The computational cost of this is high, but a 1x1 convolution can be used to reduce it. The 1x1 conv used this way is called a bottleneck layer.
- It turns out that, when the bottleneck is sized reasonably, the 1x1 convolution does not hurt performance.
- An Inception module combines these parallel branches, and an Inception network is the Inception module repeated several times.
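
The saving from a bottleneck can be estimated by counting multiplications. The numbers below (a 28x28x192 input, 32 filters of size 5x5, a 16-channel bottleneck, same padding) are assumed for illustration and are not taken from the notes above.

```python
def conv_multiplications(n_h, n_w, n_c_in, f, n_c_out):
    """Multiplications for a 'same' convolution: f*f*n_c_in products per output value."""
    return n_h * n_w * n_c_out * f * f * n_c_in

# Direct 5x5 convolution: 28x28x192 -> 28x28x32.
direct = conv_multiplications(28, 28, 192, f=5, n_c_out=32)

# Bottleneck: 1x1 conv down to 16 channels, then the 5x5 conv up to 32 channels.
reduce_1x1 = conv_multiplications(28, 28, 192, f=1, n_c_out=16)
conv_5x5 = conv_multiplications(28, 28, 16, f=5, n_c_out=32)
bottleneck = reduce_1x1 + conv_5x5

print(direct)      # 120422400
print(bottleneck)  # 12443648
```

Under these assumptions the bottleneck version needs roughly a tenth of the multiplications while producing an output of the same shape.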

## Transfer Learning
- If you are using a specific NN architecture that has already been trained, you can start from its pretrained parameters instead of a random initialization to solve your own problem.
- Frameworks provide options to freeze the parameters of some layers so that they are not updated during training, for example by setting `trainable = False` on a layer in Keras; other frameworks expose an equivalent freeze option.
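
What freezing means can be shown without any framework. The sketch below is a hypothetical toy: the `params` dictionary, the `trainable` flag and `sgd_step` are made up for this note and only mimic what a framework's freeze option does.

```python
import numpy as np

# Hypothetical model: a "pretrained" layer we freeze and a new output layer we train.
params = {
    "pretrained": {"W": np.array([[1.0, -1.0], [0.5, 2.0]]), "trainable": False},
    "new_head": {"W": np.zeros((1, 2)), "trainable": True},
}

def sgd_step(params, grads, lr=0.1):
    """Gradient descent update that skips every frozen layer."""
    for name, layer in params.items():
        if layer["trainable"]:
            layer["W"] = layer["W"] - lr * grads[name]

# Dummy gradients, standing in for what backprop would produce.
grads = {"pretrained": np.ones((2, 2)), "new_head": np.ones((1, 2))}

before = params["pretrained"]["W"].copy()
sgd_step(params, grads)

print(np.array_equal(params["pretrained"]["W"], before))  # True: frozen layer unchanged
print(params["new_head"]["W"])                            # updated by -lr * gradient
```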

## Data Augmentation
- The more data you have, the better your deep NN usually performs. Data augmentation is one of the techniques used in deep learning to improve the performance of a deep NN.
- Some data augmentation methods used for CV tasks include:
    - Mirroring
    - Random cropping
        - The issue with this technique is that you might take a bad crop that misses the object; the solution is to make the crops big enough.
    - Color shifting
        - For example, we add different distortions to the R, G and B channels so that the image still looks the same to a human but is different for the computer.
        - In practice, the added values are drawn from some probability distribution and the shifts are quite small.
        - This makes your algorithm more robust to color changes in images.
        - There is an algorithm called PCA color augmentation that decides the needed shifts automatically; it is described in the AlexNet paper.
    - Rotation
    - Shearing
    - Local warping
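
Three of the methods above (mirroring, random cropping and color shifting) can be sketched with NumPy alone. The function names, the crop size and the Gaussian shift with standard deviation 10 are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def mirror(image):
    """Flip the image horizontally (reverse the width axis)."""
    return image[:, ::-1, :]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w patch at a random position."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w, :]

def color_shift(image, rng, scale=10.0):
    """Add a small random offset to each of the R, G and B channels."""
    shift = rng.normal(0.0, scale, size=3)
    return np.clip(image + shift, 0, 255)

# Random stand-in for a 32x32 RGB image with values in [0, 255].
image = rng.integers(0, 256, size=(32, 32, 3)).astype(float)

flipped = mirror(image)
cropped = random_crop(image, 24, 24, rng)
shifted = color_shift(image, rng)
print(flipped.shape, cropped.shape, shifted.shape)
```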

## State of Computer Vision
- Speech recognition, for example, has a large amount of data, while image recognition has a medium amount and object detection currently has a small amount.
- If your problem has a large amount of data, researchers tend to use:
    - Simpler algorithms
    - Less hand engineering
- If you don't have much data, people tend to use more hand engineering ("hacks") for the problem, such as choosing a more complex NN architecture.
- Learning algorithms have two sources of knowledge:
    - Labeled data, i.e. (x, y) pairs
    - Hand-engineered features, network architectures and other components
- Tips for doing well on benchmarks/winning competitions:
    - Ensembling
        - Train several networks independently and average their outputs.
        - This can improve results by about 2%.
        - But it slows down prediction by a factor equal to the number of models in the ensemble, and it takes more memory because all the models must be kept in memory.
        - People use this in competitions, but few use it in real production systems.
    - Multi-crop at test time
        - Apply data augmentation to the test data as well.
        - Run the classifier on multiple versions of each test image and average the results.
    - Use open source code
        - Use network architectures published in the literature.
        - Use open source implementations if possible.
        - Use pretrained models and fine-tune them on your dataset.
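
Both ensembling and multi-crop come down to averaging predictions. The sketch below assumes each model (or each crop) has already produced a vector of class probabilities; the probability values are invented for illustration.

```python
import numpy as np

def average_predictions(prob_list):
    """Average class probabilities from several models or several crops."""
    return np.mean(np.stack(prob_list), axis=0)

# Made-up softmax outputs of three independently trained networks for one image.
probs_a = np.array([0.6, 0.3, 0.1])
probs_b = np.array([0.2, 0.5, 0.3])
probs_c = np.array([0.5, 0.4, 0.1])

avg = average_predictions([probs_a, probs_b, probs_c])
print(avg.argmax())  # 0: the ensemble picks class 0 although model b preferred class 1
```

Note that the outputs are averaged, not the weights: every model still has to be run on every input, which is where the extra time and memory cost comes from.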
