# Building Blocks of a CNN
## Convolution layer
- One of the challenges of computer vision is that input images can be very large, so in a standard fully connected NN the weight matrix has a very large dimension. With so many parameters, it is difficult to get enough data to keep the NN from overfitting, and the computational and memory requirements of training become infeasible. To work with large images, we therefore use the convolution operation.
- In general, an $n\times n$ matrix convolved with an $f \times f$ filter/kernel gives an $(n-f+1) \times (n-f+1)$ matrix.
- In plain Python you implement this yourself (e.g. in a `conv_forward` function); in TensorFlow use `tf.nn.conv2d`; in Keras use the `Conv2D` layer.
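
To make the $(n-f+1)$ rule concrete, here is a minimal pure-Python sketch of the operation (no padding, stride 1). The function name `conv2d_valid`, the 6x6 image and the vertical edge filter are illustrative assumptions, not code from a library.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    n_h, n_w = len(image), len(image[0])
    f_h, f_w = len(kernel), len(kernel[0])
    out = []
    for i in range(n_h - f_h + 1):
        row = []
        for j in range(n_w - f_w + 1):
            # Element-wise product of the kernel and the window it covers, summed.
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(f_h)
                for b in range(f_w)
            )
            row.append(total)
        out.append(row)
    return out


# Illustrative input: bright left half, dark right half (a vertical edge).
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]  # simple vertical edge detector

result = conv2d_valid(image, kernel)
print(len(result), len(result[0]))  # 6 - 3 + 1 = 4 in each dimension
print(result[0])                    # the edge shows up in the middle columns
```

A 6x6 input with a 3x3 filter gives a 4x4 output, and the large values mark where the edge is.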

### Padding
- We want to apply the convolution operation multiple times, but if the image shrinks at every step we lose a lot of information along the way. Also, pixels at the edges are used less often than other pixels in the image.
- To solve these problems, we can pad the input image before convolution by adding rows and columns to it. We call the padding amount $p$: the number of rows/columns inserted at the top, bottom, left and right of the image.
- The general rule is now: an $n\times n$ matrix convolved with an $f\times f$ filter/kernel and padding $p$ gives an $(n+2p-f+1) \times (n+2p-f+1)$ matrix.
- In computer vision $f$ is usually odd. One reason is that an odd-sized filter has a center position.

### Stride Convolution
- When performing the convolution, the stride $s$ tells us how many pixels we jump each time we move the filter/kernel.
- General rule: an $n\times n$ matrix convolved with an $f \times f$ filter/kernel, padding $p$ and stride $s$ gives a $(\frac{n+2p-f}{s} +1) \times(\frac{n+2p-f}{s} +1)$ matrix.
- If $\frac{n+2p-f}{s}$ is not an integer, we take its floor, i.e. the output size is $\lfloor \frac{n+2p-f}{s} \rfloor + 1$.
- In math textbooks, the convolution operation flips the filter before applying it. What we are doing here is technically cross-correlation, but by convention the deep learning literature calls it convolution.
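
The size rule with padding and stride can be checked with a small helper. This is a sketch; `conv_output_size` and `same_padding` are hypothetical names, and the "same" padding formula $p = (f-1)/2$ assumes stride 1 and an odd $f$.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length floor((n + 2p - f) / s) + 1 for an n x n input."""
    return (n + 2 * p - f) // s + 1


def same_padding(f):
    """Padding that keeps the size unchanged (assumes stride 1 and odd f)."""
    return (f - 1) // 2


print(conv_output_size(6, 3))                        # no padding, stride 1
print(conv_output_size(6, 3, p=same_padding(3)))     # size preserved
print(conv_output_size(7, 3, s=2))                   # strided convolution
print(conv_output_size(6, 3, s=2))                   # non-integer case, floored
```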

### Convolutions over volumes
- An image of size height x width x number of channels is convolved with a filter of size height x width x the same number of channels. The number of channels in the image and in the filter must match. The output here is only 2D.
- Different layers/channels in the filter can either be the same matrix or different.
- You can use several volume filters in one step to detect different features. Each of them gives a 2D matrix, and the results are stacked together to form a 3D matrix. Thus, if an $n\times n \times n_c$ matrix is convolved with $n_c'$ filters of shape $f \times f \times n_c$, we get an $(n-f+1) \times (n-f+1) \times n_c'$ matrix.
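
As a rough illustration of convolving over a volume with several filters, the sketch below uses nested lists indexed as `[row][column][channel]`. The name `conv_volume` and the toy input are assumptions made for this example; a 6x6x3 input with two 3x3x3 filters gives a 4x4x2 output.

```python
def conv_volume(image, filters):
    """Convolve an (n_h, n_w, n_c) volume with a list of (f, f, n_c) filters."""
    n_h, n_w, n_c = len(image), len(image[0]), len(image[0][0])
    f = len(filters[0])
    out_h, out_w = n_h - f + 1, n_w - f + 1
    out = [[[0.0] * len(filters) for _ in range(out_w)] for _ in range(out_h)]
    for k, filt in enumerate(filters):
        for i in range(out_h):
            for j in range(out_w):
                # Each filter sums over height, width AND channels: one number per position.
                out[i][j][k] = sum(
                    image[i + a][j + b][c] * filt[a][b][c]
                    for a in range(f)
                    for b in range(f)
                    for c in range(n_c)
                )
    return out


image = [[[1.0, 2.0, 3.0] for _ in range(6)] for _ in range(6)]  # 6 x 6 x 3
ones = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]         # 3 x 3 x 3 filter
zeros = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]        # 3 x 3 x 3 filter

volume = conv_volume(image, [ones, zeros])
shape = (len(volume), len(volume[0]), len(volume[0][0]))
print(shape)  # one output channel per filter
```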

## Pooling layers
- CNNs often use pooling layers to reduce the size of the inputs, speed up computation, and make some of the feature detectors more invariant to their position in the input.
- The two types of pooling layers are max pooling and average pooling. Max pooling is used much more often than average pooling.
- The intuition behind max pooling is: if a feature is detected anywhere in the window, keep the highest number. But the main reason people use pooling is that it works well in practice and reduces computation; it is not fully understood why it works.
- Max pooling slides an (f,f) window over the input and stores the max value of the window in the output. Pooling is done on each layer/channel independently.
- Just like for convolution, f and s are hyperparameters of a pooling layer. Padding is rarely used. Most often s=2 and f=2, which shrinks the height and width by a factor of about 2.
- A pooling layer has no parameters for backprop to train.
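
A minimal sketch of max pooling on a single channel, using the common f=2, s=2 setting. The function name `max_pool` and the 4x4 input values are illustrative assumptions.

```python
def max_pool(channel, f=2, s=2):
    """Slide an f x f window with stride s and keep the maximum of each window."""
    n_h, n_w = len(channel), len(channel[0])
    out = []
    for i in range(0, n_h - f + 1, s):
        row = []
        for j in range(0, n_w - f + 1, s):
            row.append(max(channel[i + a][j + b] for a in range(f) for b in range(f)))
        out.append(row)
    return out


channel = [
    [1, 3, 2, 1],
    [2, 9, 1, 1],
    [1, 3, 2, 3],
    [5, 6, 1, 2],
]
pooled = max_pool(channel)
print(pooled)  # 4x4 input shrinks to 2x2
```

For a multi-channel input, the same function would be applied to each channel separately, and there is nothing to learn: f and s are fixed hyperparameters.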

# One layer of a convolutional network
- We first convolve the input with some filters, add a bias to the output of each filter, and then apply a ReLU activation to the result. The filters play a role similar to that of the weight matrix.
- Summary of notation, if layer $l$ is a convolution layer:
    - $f^{[l]}$ = filter size in layer $l$
    - $p^{[l]}$ = padding
    - $s^{[l]}$ = stride
    - $n_c^{[l]}$ = number of filters
    - Each filter is $f^{[l]} \times f^{[l]} \times n_c^{[l-1]}$
    - Input: $n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]}$
    - Output: $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$
    - $n_H^{[l]} = \lfloor \frac{n_H^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} \rfloor + 1$, and likewise for $n_W^{[l]}$


# Why convolutions
- Two main advantages of Conv are:
    - Parameter sharing
        - A feature detector that's useful in one part of the image is probably useful in another part of the image
    - Sparsity of connections
        - In each layer, each output value depends only on a small number of inputs. Together with parameter sharing, this helps the network capture translation invariance.
- Through these two mechanisms, a neural network has far fewer parameters, which allows it to be trained with smaller training sets and makes it less prone to overfitting.
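
A quick back-of-the-envelope comparison shows how large the saving is. The layer sizes below (a 32x32x3 input mapped to a 28x28x6 output with six 5x5 filters) are an assumed example, chosen only for illustration.

```python
# Assumed example: map a 32 x 32 x 3 input to a 28 x 28 x 6 output.
n_in = 32 * 32 * 3   # 3072 input values
n_out = 28 * 28 * 6  # 4704 output values

# Fully connected: one weight per (input, output) pair.
fc_weights = n_in * n_out

# Convolution: six 5 x 5 x 3 filters, each with one bias, shared across all positions.
conv_parameters = (5 * 5 * 3 + 1) * 6

print(fc_weights)       # about 14 million
print(conv_parameters)  # a few hundred
```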

# Deep convolutional models: case studies
- Here are some classical CNN networks:
    - LeNet-5
    - AlexNet
    - VGG
- ResNet, the architecture that won the ImageNet (ILSVRC) 2015 classification challenge, has a variant with 152 layers.
- There is also an architecture called Inception, made by Google, that is very useful to learn and to apply to your tasks.
  <img src="../Images/LeNet5.png" />
### LeNet-5
- Invented by Yann LeCun in 1998.
- Some statistics about the example ($n_c$ is the number of filters; parameter counts include one bias per filter/neuron):

| Layer | Operation | Output/Activation Shape | # of parameters |
|-------|-----------|-------------------------|-----------------|
| Input | | (32,32,3) | |
| 1 | CONV1 (f1=5, s1=1, p1=0, n_c=6) | (28,28,6) | 456 |
| | MaxPooling (f1p=2, s1p=2) | (14,14,6) | 0 |
| 2 | CONV2 (f2=5, s2=1, p2=0, n_c=16) | (10,10,16) | 2416 |
| | MaxPooling (f2p=2, s2p=2) | (5,5,16) | 0 |
| 3 | FC3 (120 neurons) | (120,1) | 48120 |
| 4 | FC4 (84 neurons) | (84,1) | 10164 |
| 5 | Softmax | (?,1) | ? |
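
Parameter counts for a stack like this can be checked with a few lines of arithmetic. The sketch below assumes the usual convention of $f \cdot f \cdot n_c^{prev}$ weights plus one bias per filter for a conv layer, and one weight per input-output pair plus one bias per neuron for an FC layer; `conv_params` and `fc_params` are hypothetical helper names.

```python
def conv_params(f, n_c_prev, n_c):
    """Weights (f * f * n_c_prev per filter) plus one bias per filter."""
    return (f * f * n_c_prev + 1) * n_c


def fc_params(n_in, n_out):
    """One weight per input-output pair plus one bias per output neuron."""
    return n_in * n_out + n_out


conv1 = conv_params(f=5, n_c_prev=3, n_c=6)    # input (32,32,3) -> (28,28,6)
conv2 = conv_params(f=5, n_c_prev=6, n_c=16)   # (14,14,6) -> (10,10,16)
fc3 = fc_params(5 * 5 * 16, 120)               # flattened (5,5,16) -> 120
fc4 = fc_params(120, 84)                       # 120 -> 84
print(conv1, conv2, fc3, fc4)
```

Pooling layers contribute nothing to the total because they have no trainable parameters.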
### AlexNet
- The goal of the model was the ImageNet challenge, which classifies images into 1000 classes.
- Summary:
    ```
    Conv => Max-pool => Conv => Max-pool => Conv => Conv => Conv => Max-pool => Flatten => FC => FC => Softmax
    ```
- The paper convinced many computer vision researchers that deep learning really works.
### VGG-16
- A modification of AlexNet.
- Focuses on having only these blocks:
    - CONV = 3x3 filters, s=1, same padding
    - MAX-POOL = 2x2, s=2
- Pooling is the only operation responsible for shrinking the height and width.
## Residual Networks (ResNets)
- The main benefit of a very deep network is that it can represent very complex functions. It can also learn features at many different levels of abstraction, from edges (in the shallower layers) to very complex features (in the deeper layers). However, using a deeper network doesn't always help. A huge barrier to training very deep networks is vanishing gradients: the gradient signal often goes to zero quickly, making gradient descent unbearably slow. More specifically, as you backprop from the final layer back to the first layer, you multiply by the weight matrix at each step, so the gradient can decrease exponentially quickly to zero (or, in rare cases, grow exponentially quickly and explode to very large values).
- ResNets address this problem: a "shortcut" or "skip connection" allows the gradient to be backpropagated directly to earlier layers.
- Very deep NNs are difficult to train because of vanishing and exploding gradient problems.
- A skip connection takes the activation from one layer and feeds it directly to a layer much deeper in the NN, which allows you to train NNs with more than 100 layers.
- Theoretically, a deeper NN should lead to a smaller training error, but in practice the performance of a plain network suffers as it gets deeper because of vanishing and exploding gradients. ResNets allow for deeper NNs without hurting performance.
- Two main types of blocks are used in a ResNet, depending mainly on whether the input/output dimensions are the same or different.
    - The identity block: applies to the case where the input activation (say $a^{[l]}$) has the same dimension as the output activation (say $a^{[l+2]}$).
    - The convolutional block: applies to the case where the input and output dimensions don't match. The difference from the identity block is that there is a CONV2D layer in the shortcut path. This CONV2D layer resizes the input x to a different dimension, so that the dimensions match in the final addition that adds the shortcut value back to the main path.
        
### Why ResNets work?
- The identity function is easy for a residual block to learn, which means that adding these two layers (the residual block) to your NN doesn't really hurt its ability to do as well as the simpler network without them. And if the residual blocks actually learn something useful, you may do even better.
- What goes wrong in very deep plain nets without residual blocks is that, as the network gets deeper and deeper, it becomes very difficult for it to choose parameters that learn even the identity function, which is why adding many layers can make the result worse.
- The main reason residual networks work is that it is so easy for the extra layers to learn the identity function that you are more or less guaranteed they don't hurt performance, and a lot of the time they even help.
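
The "identity is easy to learn" argument can be seen in a toy residual block. This sketch uses two fully connected layers instead of conv layers to keep it short, and assumes the input activation is non-negative (as it would be after a ReLU); all names are illustrative.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def dense(W, b, a):
    """Compute W a + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w * x for w, x in zip(row, a)) + b_i for row, b_i in zip(W, b)]


def residual_block(a, W1, b1, W2, b2):
    """a[l+2] = relu(z[l+2] + a[l]): the shortcut adds the input back before the last ReLU."""
    a1 = relu(dense(W1, b1, a))
    z2 = dense(W2, b2, a1)
    return relu([z + skip for z, skip in zip(z2, a)])


a = [0.5, 2.0, 0.0]                      # non-negative, e.g. the output of a previous ReLU
zero_W = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

# With all weights and biases at zero, the block simply passes its input through.
out = residual_block(a, zero_W, zero_b, zero_W, zero_b)
print(out)
```

So shrinking the weights towards zero (for example through weight decay) moves the block towards the identity rather than towards a useless mapping.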

## 1x1 convolutions / Network in Network
- The idea of the 1x1 convolution has been very influential. It is useful, for example:
    - To shrink the number of channels and therefore save on computation. This is also called feature transformation. Pooling layers, by contrast, only shrink $n_W, n_H$.
    - If you want to keep the number of channels, that's fine too. The effect of the 1x1 convolution is then just to add non-linearity.
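
A 1x1 convolution is a small fully connected layer applied to the channel vector at every pixel. The sketch below (hypothetical name `conv1x1`, toy 2x2x4 input) reduces 4 channels to 2 and applies a ReLU.

```python
def conv1x1(volume, filters, biases):
    """volume[row][col][c_in]; filters[k][c_in]; returns ReLU output [row][col][k]."""
    return [
        [
            [
                max(0.0, sum(w * x for w, x in zip(filt, pixel)) + b)
                for filt, b in zip(filters, biases)
            ]
            for pixel in row
        ]
        for row in volume
    ]


volume = [[[1.0, 2.0, 3.0, 4.0] for _ in range(2)] for _ in range(2)]  # 2 x 2 x 4
filters = [
    [1.0, 0.0, 0.0, -1.0],       # first 1 x 1 x 4 filter
    [0.25, 0.25, 0.25, 0.25],    # second 1 x 1 x 4 filter (channel average)
]
biases = [0.0, 0.0]

reduced = conv1x1(volume, filters, biases)
print(len(reduced), len(reduced[0]), len(reduced[0][0]))  # height and width kept, channels 4 -> 2
```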
    

## Inception network
- When designing a CNN, you have to decide on every layer: will you pick a 3x3 Conv, a 5x5 Conv, or maybe a max-pooling layer? There are many choices.
- What Inception tells us is: why not use all of them at once? Do the pooling and the convs in parallel, then stack their outputs together to form one output volume.
- The computation in the Inception model is costly, but we can use a 1x1 convolution to reduce it. The 1x1 Conv used this way is called a bottleneck layer.
- It turns out that shrinking the representation with a 1x1 convolution doesn't seem to hurt performance.
- Inception module

- An Inception network is the Inception module repeated several times.
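
To see why the 1x1 bottleneck saves computation, one can count multiplications. The shapes below (a 28x28x192 input, 32 output channels, a bottleneck of 16 channels, "same" padding so height and width stay 28) are assumptions chosen for illustration.

```python
def conv_multiplications(out_h, out_w, n_filters, f, n_c_in):
    """Each output value needs f * f * n_c_in multiplications."""
    return out_h * out_w * n_filters * f * f * n_c_in


# Direct 5x5 convolution: 28x28x192 -> 28x28x32.
direct = conv_multiplications(28, 28, 32, 5, 192)

# Bottleneck: 1x1 conv down to 16 channels, then 5x5 conv up to 32 channels.
bottleneck = (
    conv_multiplications(28, 28, 16, 1, 192)
    + conv_multiplications(28, 28, 32, 5, 16)
)

print(direct)      # roughly 120 million
print(bottleneck)  # roughly 12 million, about a tenth of the direct cost
```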

## Transfer Learning
- If you are using a specific NN architecture that has been trained before, you can use its pretrained parameters instead of random initialization to solve your problem.
- Frameworks have options to freeze the parameters of selected layers, for example by setting `layer.trainable = False` in Keras; other frameworks offer a similar freeze option.

## Data Augmentation
- The more data you have, the better your deep NN tends to perform. Data augmentation is one of the techniques used in deep learning to improve the performance of a deep NN.
- Some data augmentation methods used for CV tasks include:
    - Mirroring
    - Random cropping
        - The issue with this technique is that you might take a bad crop; the solution is to make your crops big enough.
    - Color shifting
        - For example, we add different distortions to the R, G and B channels, so that the image still looks the same to a human but is different for the computer.
        - In practice, the added values are drawn from some probability distribution, and the shifts are quite small.
        - This makes your algorithm more robust to color changes in images.
        - There is an algorithm called PCA color augmentation that decides the needed shifts automatically; it is described in the AlexNet paper.
    - Rotation
    - Shearing
    - Local warping
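
The first three methods can be sketched on a tiny image stored as nested lists `[row][column][channel]` with values in 0..255. The helper names, the seeded random generator and the shift values are assumptions for this example; real pipelines would normally use an image library.

```python
import random


def mirror(image):
    """Flip the image horizontally."""
    return [list(reversed(row)) for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def color_shift(image, shifts):
    """Add a per-channel offset and clip the result to the valid 0..255 range."""
    return [
        [[min(255, max(0, v + d)) for v, d in zip(pixel, shifts)] for pixel in row]
        for row in image
    ]


rng = random.Random(0)  # seeded so the example is reproducible
image = [[[10 * r + c, 100, 200] for c in range(4)] for r in range(4)]  # 4 x 4 x 3

flipped = mirror(image)
crop = random_crop(image, 3, 3, rng)
shifted = color_shift(image, (20, -20, 60))
```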

## State of Computer Vision
- Speech recognition, for example, has a large amount of data, image recognition has a medium amount, and object detection has a relatively small amount nowadays.
- If your problem has a large amount of data, researchers tend to use:
    - Simpler algorithms.
    - Less hand engineering.
- If you don't have that much data, people tend to do more hand engineering for the problem ("hacks"), like choosing a more complex NN architecture.
- Learning algorithms have two sources of knowledge:
    - Labeled data, i.e. (x,y) pairs.
    - Hand-engineered features/network architectures/other components.
- Tips for doing well on benchmarks/winning competitions:
    - Ensembling
        - Train several networks independently and average their outputs.
        - This can give you a boost of about 2%.
        - But it slows down prediction by a factor equal to the number of models in the ensemble, and it takes more memory because all the models must be kept in memory.
        - People use this in competitions, but few use it in real production.
    - Multi-crop at test time
        - Do data augmentation on the test data as well.
        - Run the classifier on multiple versions of each test image and average the results.
    - Use open source code
        - Use architectures of networks published in the literature.
        - Use open source implementations if possible.
        - Use pretrained models and fine-tune them on your dataset.


## Keras tutorial
- Define a function that describes your model. Remember to create a model instance in that function; you will use this instance to train/test the model.
- To train and test a model in Keras, there are four steps: Create -> Compile -> Train -> Test

    - 1. Create the model by calling the function you defined.
    - 2. Compile by calling `model.compile(optimizer=..., loss=..., metrics=[...])`
    - 3. Train the model on the training data by calling `model.fit(x=..., y=..., epochs=..., batch_size=...)`
    - 4. Test the model on the test data by calling `model.evaluate(x=..., y=...)`
- Two other basic features of Keras that are useful:
     - `model.summary()`: prints the details of your layers in a table, with the sizes of their inputs/outputs and the number of parameters at each layer.
     - `plot_model()`: plots your graph in a nice layout. You can save it as a ".png" file through its `to_file` argument, or display it in a notebook using `SVG()`.
- To choose the Keras backend, edit `$HOME/.keras/keras.json` and set the desired backend, such as Theano or TensorFlow.
- After you create the model, you can also run it directly in a TensorFlow session, without Keras's compile, train and test steps.
- You can save your model with `model.save(...)` and load it with `load_model(...)`. This saves your whole trained model to disk, including the trained weights.

# Object detection
## Object Localization

- What are localization and detection?
    - Image classification
        - Classify an image into a specific class. The whole image represents one class. We don't need to know exactly where the object is. Usually only one object is present.
    - Classification with localization
        - Given an image, we want to learn the class of the image and where that class is located in the image. We need to predict a class and a rectangle around the object. Usually only one object is present.
    - Object detection
        - Given an image, we want to detect all the objects in the image that belong to specific classes and give their locations. An image can contain more than one object, with different classes.
    - Semantic segmentation
        - We want to label each pixel in the image with a category label. Semantic segmentation doesn't differentiate instances; it only cares about pixels. It detects no objects, just pixels.
        - If two objects of the same class overlap, we won't be able to separate them.
- To do classification with localization, we use a Conv net with a softmax attached to the end of it, plus four numbers `bx, by, bh, bw` that tell you the location of the object in the image.
- Target label Y in classification with localization problems:
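    ```
    Y = [  Pc    # probability that an object is present
           bx    # x-coordinate of the center of the bounding box
           by    # y-coordinate of the center of the bounding box
           bh    # height of the bounding box
           bw    # width of the bounding box
           c1    # the classes
           c2
           ...
        ]
    ```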
- In practice, we use a log-likelihood loss for the classes and squared error for the bounding box.
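
One way to write such a loss for a single example is sketched below. It assumes the label layout `[Pc, bx, by, bh, bw, c1, c2, c3]`, uses a logistic (binary cross-entropy) term for `Pc`, and ignores the box and class terms when no object is present; these choices and the name `localization_loss` are assumptions for illustration, not a definitive recipe.

```python
import math


def _safe_log(p, eps=1e-12):
    """Log with the argument clipped into [eps, 1] to avoid log(0)."""
    return math.log(min(max(p, eps), 1.0))


def localization_loss(y_true, y_pred):
    pc_true, box_true, cls_true = y_true[0], y_true[1:5], y_true[5:]
    pc_pred, box_pred, cls_pred = y_pred[0], y_pred[1:5], y_pred[5:]

    # Assumption: logistic loss on the "is there an object" output.
    loss = -(pc_true * _safe_log(pc_pred) + (1 - pc_true) * _safe_log(1 - pc_pred))

    if pc_true == 1:
        # Squared error for the bounding box.
        loss += sum((t - p) ** 2 for t, p in zip(box_true, box_pred))
        # Log-likelihood (cross-entropy) loss for the classes.
        loss += -sum(t * _safe_log(p) for t, p in zip(cls_true, cls_pred))
    return loss


y_object = [1, 0.5, 0.5, 0.2, 0.3, 0, 1, 0]
y_background = [0, 0, 0, 0, 0, 0, 0, 0]

perfect = localization_loss(y_object, [1.0, 0.5, 0.5, 0.2, 0.3, 0.0, 1.0, 0.0])
unsure = localization_loss(y_background, [0.5, 0.9, 0.9, 0.9, 0.9, 0.3, 0.3, 0.4])
```

When `Pc` is 0 in the label, the predicted box and class values are "don't cares" and do not affect the loss.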

# Landmark Detection
- In some computer vision problems we need to output a set of points. This is called landmark detection.
- For example, if you are working on a face recognition problem, you might want some points on the face, like the corners of the eyes, the corners of the mouth, the corners of the nose and so on. This can help in many applications, like detecting the pose of a person.

# Object Detection
- We will use a Conv net to solve the object detection problem, using a technique called the sliding windows detection algorithm.
- For example, say we are working on a car detection algorithm.
- First, we train a Conv net on a labeled training set of closely cropped car images (meaning x is pretty much only the car) and non-car images.
- After training this Conv net, we use it with the sliding window technique.
- Sliding window detection algorithm:
    1. Decide on a rectangle size.
    2. Split your image into rectangles of the size you picked, so that every region is covered. You can use a stride as well.
    3. Feed each rectangle into the Conv net and decide whether it is a car or not.
    4. Pick larger/smaller rectangles and repeat steps 2 and 3. The hope is that, as long as there is a car somewhere in the image, some window will localize it.
    5. Store the rectangles that contain cars.
    6. If two or more rectangles intersect, choose the rectangle with the highest confidence.
- The disadvantage of sliding windows is the computation time.
- In the era of machine learning before deep learning, people used hand-crafted linear classifiers to classify the object and then applied the sliding window technique. The linear classifier made the computation cheap. In the deep learning era this is computationally expensive because of the complexity of the deep learning model.
- The computation cost problem has a solution: sliding window object detection can be implemented convolutionally.
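
The naive sliding window loop described above can be sketched as follows. The classifier is passed in as a function `is_car`, which here is only a hypothetical stand-in for the trained Conv net; windows are returned as `(top, left, bottom, right)` tuples.

```python
def sliding_windows(img_h, img_w, win, stride):
    """Yield every win x win window position, moving by `stride` pixels."""
    for top in range(0, img_h - win + 1, stride):
        for left in range(0, img_w - win + 1, stride):
            yield (top, left, top + win, left + win)


def detect(image_size, window_sizes, stride, is_car):
    """Run the classifier on every window of every size and keep the positives."""
    hits = []
    for win in window_sizes:
        for box in sliding_windows(image_size[0], image_size[1], win, stride):
            if is_car(box):  # one classifier call per window: this is the expensive part
                hits.append(box)
    return hits


windows = list(sliding_windows(8, 8, 4, 2))
print(len(windows))  # number of classifier calls for one window size

# Hypothetical classifier that only fires on one specific window.
hits = detect((8, 8), [4, 6], 2, lambda box: box == (2, 2, 6, 6))
```

The number of classifier calls grows quickly with smaller strides and more window sizes, which is the cost the convolutional implementation avoids.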


# Convolutional Implementation of Sliding Windows
## FC layers can be turned into convolution layers
- An FC layer can be turned into a Conv layer by using filters with the same width and height as the input, one filter per output unit. Mathematically, this is the same as a fully connected layer.
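
The equivalence can be checked numerically for one output unit: a filter that covers the whole input has exactly one valid position, and its output is the same weighted sum a fully connected neuron would compute. The names and the 2x2x2 toy input below are assumptions for this sketch.

```python
def fc_neuron(flat_input, weights, bias):
    """One fully connected unit: weighted sum of the flattened input plus bias."""
    return sum(w * x for w, x in zip(weights, flat_input)) + bias


def full_size_conv(volume, filt, bias):
    """Convolution with a filter as large as the input: a single output value."""
    return sum(
        volume[i][j][c] * filt[i][j][c]
        for i in range(len(volume))
        for j in range(len(volume[0]))
        for c in range(len(volume[0][0]))
    ) + bias


def flatten(volume):
    return [v for row in volume for pixel in row for v in pixel]


volume = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]        # 2 x 2 x 2 input
filt = [[[0.5, -1.0], [0.0, 2.0]], [[1.0, 1.0], [-0.5, 0.25]]]       # 2 x 2 x 2 filter

as_fc = fc_neuron(flatten(volume), flatten(filt), bias=0.1)
as_conv = full_size_conv(volume, filt, bias=0.1)
print(as_fc, as_conv)
```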

## Convolution Implementation of Sliding windows
- By turning the FC layers into convolutions, the network can make the predictions for all windows at the same time, in one forward pass through the big Conv net.
- It is more efficient because the windows now share computation.
- The weakness of the algorithm is that the position of the rectangle won't be very accurate. It may be that none of the rectangles matches the position of the object you want to recognize perfectly.

## Bounding box Predictions
- A better algorithm than sliding windows is the YOLO algorithm.
- YOLO stands for "You Only Look Once" and was developed in 2015.
- The basic idea is:
    - Let's say we have an image.
    - Place an $n \times n$ grid on the image. In practice a fine grid is used, say 19x19, to make it less likely that different objects are assigned to the same cell.
    - Apply the classification and localization algorithm to each cell of the grid. Each grid cell has a label of the format `[Pc, bx, by, bw, bh, c1, c2, c3]`. The YOLO algorithm takes the midpoint of each object in the image and assigns the object to the grid cell containing that midpoint.
    - $b_x, b_y, b_w, b_h$ are specified relative to the grid cell. $b_x, b_y$ must be between 0 and 1, while $b_w, b_h$ can be greater than 1.
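
The assignment of an object to a grid cell, and the cell-relative box coordinates, can be sketched like this. The example assumes the object's box is given as `(x_center, y_center, width, height)` in image coordinates normalised to [0, 1); `encode_box` is a hypothetical helper, and a small 4x4 grid is used to keep the numbers readable.

```python
def encode_box(box, grid_size):
    """Return (row, col) of the responsible cell and (bx, by, bw, bh) relative to it."""
    x, y, w, h = box
    col = int(x * grid_size)      # cell containing the midpoint
    row = int(y * grid_size)
    b_x = x * grid_size - col     # midpoint offset inside the cell, in [0, 1)
    b_y = y * grid_size - row
    b_w = w * grid_size           # size in units of one cell, may exceed 1
    b_h = h * grid_size
    return row, col, (b_x, b_y, b_w, b_h)


row, col, rel = encode_box((0.625, 0.375, 0.5, 0.125), grid_size=4)
print(row, col, rel)
```

Here the object is half the image wide, so its width spans two grid cells and $b_w$ is greater than 1, while $b_x$ and $b_y$ stay below 1.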
    
## Techniques that make YOLO work better
### Intersection over Union
- Intersection over Union (IoU) is a function used to evaluate an object detection algorithm.
- It computes the size of the intersection and divides it by the size of the union of the output bounding box and the ground-truth bounding box. More generally, IoU is a measure of the overlap between two bounding boxes.
- By convention, a detection with IoU >= 0.5 is counted as correct; the best possible value is 1.
- The higher the IoU, the more accurate the bounding box.
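
A minimal IoU sketch, assuming boxes are given by their corners as `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2` (this representation is an assumption of the example).

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width/height are clamped at 0 so disjoint boxes give zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # partial overlap
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # no overlap
```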

### Non-max suppression
- One of the problems of object detection is that, rather than detecting an object just once, your algorithm might detect it multiple times. Non-max suppression is a way to make sure that your algorithm detects each object only once.
- "Non-max" means that you output the detections with maximal probability but suppress the nearby ones that are non-maximal.
- Algorithm:
    - Get the output of each grid cell.
    - Discard all boxes with $P_c < 0.6$.
    - While there are remaining boxes:
        - Pick the box with the largest $P_c$ and output it as a prediction.
        - Discard any remaining box with $IoU > 0.5$ with the box output in the previous step.
    - Repeat until every box has either been output as a prediction or discarded for having too high an IoU with a predicted box.
- If you are detecting multiple classes, say three, you should carry out non-max suppression independently three times, once for each output class.
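
The algorithm above for a single class can be sketched as follows. Detections are assumed to be `(p_c, box)` pairs with corner-format boxes `(x1, y1, x2, y2)`; the thresholds 0.6 and 0.5 are the ones from the steps above, and the sample detections are made up.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.6, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop boxes that overlap them too much."""
    # Step 1: discard low-confidence boxes, then sort by confidence.
    remaining = sorted(
        (d for d in detections if d[0] >= score_threshold),
        key=lambda d: d[0],
        reverse=True,
    )
    kept = []
    while remaining:
        best = remaining.pop(0)  # box with the largest Pc
        kept.append(best)
        # Step 2: drop every remaining box that overlaps the chosen one too much.
        remaining = [d for d in remaining if iou(best[1], d[1]) <= iou_threshold]
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # overlaps the first box heavily
    (0.7, (20, 20, 30, 30)),  # a separate object
    (0.3, (0, 0, 10, 10)),    # below the score threshold
]
kept = non_max_suppression(detections)
```

For several classes, this function would be called once per class on that class's detections.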
 
## Anchor boxes
- One of the problems with object detection as we have seen it so far is that each grid cell can detect only one object. What if a grid cell needs to detect multiple objects?
- For example, if a person is standing in front of a car, the centers of the car and the person are at almost the same position and both fall into the same grid cell. The algorithm we saw before cannot give two detections there.
- To solve this problem, we can use the idea of anchor boxes.
- With anchor boxes, we choose several different pre-defined shapes and associate one prediction with each anchor box.
- Previously, each object in a training image was assigned to the grid cell that contains the object's midpoint. With anchor boxes, each object in a training image is assigned to the grid cell that contains the object's midpoint and to the anchor box with the highest IoU for that cell.
- For example, if we have two anchor boxes, the target label will be $P_c, b_x, b_y, b_w, b_h, c_1, c_2, c_3$ associated with anchor box 1, followed by another such set of outputs associated with anchor box 2.
- Each anchor box essentially corresponds to one set of values $P_c, b_x, b_y, b_w, b_h, c_1, c_2, c_3$, so the output for each grid cell has dimension 8 * number of anchor boxes.
- You may use k-means clustering to choose the anchor boxes.

# YOLO algorithm
- YOLO9000: Better, Faster, Stronger
- YOLO implementations can be found here:
    - https://github.com/allanzelener/YAD2K
    - https://github.com/thtrieu/darkflow
    - https://pjreddie.com/darknet/yolo/

# Region Proposals
- Region proposals (as used in R-CNN) are another family of ideas in object detection.


## Special Applications of CNN: face recognition and Neural Style Transfer
### Face recognition
- Face verification vs. face recognition
    - Verification: "Is this the claimed person?"
        - Input: image and name/ID
        - Output: whether the input image is that of the claimed person
    - Recognition: "Who is this person?"
        - Has a database of K persons
        - Get an input image
        - Output the ID if the image is any of the K persons
- We can use a face verification system as a building block for a face recognition system.

###  One shot learning
- One of the challenges of face recognition is the one-shot learning problem: recognizing a person from a single image.
- Historically, deep learning doesn't work well with so little data.
- Instead, we learn a similarity function:
    - d(img1, img2) = degree of difference between the images
    - We want d to be low for images of the same face
    - We use a threshold $\tau$ for d:
        - If d(img1, img2) <= $\tau$ then the faces are the same
- A similarity function helps us solve one-shot learning, and it is also robust to new inputs.

### Siamese Network
- The similarity function can be implemented using a type of CNN called a Siamese network.
- A Siamese network outputs a feature vector as an encoding of the input image; the same network (same parameters) is applied to both images.
- $d(img1, img2) = \| f(img1) - f(img2) \|^2$, where $f(\cdot)$ is the feature vector (encoding) of an image.
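
Given two encodings, the distance and the threshold test are a few lines of code. The encodings below are made-up numbers standing in for the output of a Siamese network, and the threshold value `tau=0.5` is an arbitrary assumption.

```python
def squared_distance(enc_a, enc_b):
    """d = || f(img1) - f(img2) ||^2 for two encoding vectors."""
    return sum((a - b) ** 2 for a, b in zip(enc_a, enc_b))


def is_same_person(enc_a, enc_b, tau):
    """Verification decision: same person if the distance is at most tau."""
    return squared_distance(enc_a, enc_b) <= tau


# Hypothetical encodings (in practice these come from the network).
enc_query = [1.0, 2.0, 0.5]
enc_same = [1.0, 2.5, 0.5]
enc_other = [3.0, 0.0, 0.5]

print(is_same_person(enc_query, enc_same, tau=0.5))
print(is_same_person(enc_query, enc_other, tau=0.5))
```

For recognition, the query encoding would be compared against the stored encoding of each of the K persons in the database.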

### Triplet Loss
- The triplet loss is one of the loss functions we can use to learn the parameters of a Siamese network.
- The learning objective of the triplet loss is to compare the distance between an anchor image A and a positive image P with the distance between A and a negative image N.
    - Positive means the same person, while negative means a different person.
- Formally, we want
    - The positive distance to be less than the negative distance.
    - To make sure the NN cannot satisfy this trivially by outputting zero or by setting all encodings equal to each other, we add a margin $\alpha$:
        $\| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 + \alpha \le 0$
- Final loss function:
    - $L(A,P,N) = \max(\| f(A) - f(P) \|^2 - \| f(A) - f(N) \|^2 + \alpha, 0)$. As long as you have achieved your goal, i.e. the first argument is $\le 0$, the loss is 0.
    - $J = \sum_i L(A^{(i)}, P^{(i)}, N^{(i)})$ over all triplets of images.
- During training, if A, P, N are chosen randomly, $d(A,P) + \alpha < d(A,N)$ is easily satisfied, so the network learns little from those triplets.
- Instead, choose triplets that are hard to train on.
- Available implementations of face recognition using deep learning include:
    - OpenFace
    - FaceNet
    - DeepFace
- Another way to learn the parameters is to pose face recognition as a binary classification problem: take a pair of images, feed them to the Siamese network, and output 1/0.
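
The triplet loss formula can be sketched directly on encoding vectors. The encodings and the margin value `alpha=0.2` below are illustrative assumptions; in practice the encodings are produced by the network being trained.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0)"""
    return max(squared_distance(f_a, f_p) - squared_distance(f_a, f_n) + alpha, 0.0)


def triplet_cost(triplets, alpha=0.2):
    """J: sum of the loss over all (A, P, N) triplets."""
    return sum(triplet_loss(a, p, n, alpha) for a, p, n in triplets)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
easy_neg = [3.0, 0.0]   # distance 9: constraint already satisfied
hard_neg = [1.0, 0.0]   # distance 1: as close as the positive

easy = triplet_loss(anchor, positive, easy_neg)
hard = triplet_loss(anchor, positive, hard_neg)
print(easy, hard)
```

The easy triplet gives zero loss and therefore no gradient, which is why hard triplets are preferred for training.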

### Neural Style Transfer
- To visualize what a layer of a deep network is learning, pick a unit in layer l and find the nine image patches that maximize that unit's activation.
- It turns out that the shallow layers learn low-level features like edges or colors, while deeper layers learn more complex features.
#### Cost function
- Given a content image C, a style image S and a generated image G:
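    - $J(G) = \alpha \, J(C,G) + \beta \, J(S,G)$, where $J(C,G)$ is the content cost and $J(S,G)$ is the style cost.
    - $\alpha, \beta$ are the relative weights of the two similarity terms; they are hyperparameters.
    - In practice only their ratio matters, so one hyperparameter would be enough.
- Find the generated image G:
    - Initialize G randomly.
    - Use gradient descent to minimize $J(G)$: $G := G - \frac{\partial J(G)}{\partial G}$ (scaled by a learning rate).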
#### Content cost function
- Say you use hidden layer l to compute the content cost:
    - If we choose l to be very small (a shallow layer), it will force your generated image to have pixel values very similar to your content image.
    - In practice l is neither too shallow nor too deep, but somewhere in the middle.
- Use a pre-trained ConvNet (e.g. a VGG network).
- Let $a^{(C)[l]}$ and $a^{(G)[l]}$ be the activations of layer l on the two images.
- If $a^{(C)[l]}$ and $a^{(G)[l]}$ are similar, the images have similar content.
    - $J(C,G)$ at layer l $= \frac{1}{2} \| a^{(C)[l]} - a^{(G)[l]} \|^2$
    
#### Style Cost function
- Say you are using layer l's activation to measure Style
- Define style as the correlation between activations across channels.
- Correlation tells you which of the high-level features/texture components detected by different channels tend to occur together and which do not.
- More formally, given an image, calculate the style matrix:
  - Let $a^{[l]}_{ijk}$ = activation at position $(i,j,k)$ (height, width, channel).
  - The style matrix $G^{[l]}$ has dimension $n_c^{[l]} \times n_c^{[l]}$.
  - $G_{kk'}^{[l]} = \sum_{i=1}^{n_H^{[l]}} \sum_{j=1}^{n_W^{[l]}} a_{ijk}^{[l]} a_{ijk'}^{[l]}$
- More specifically, $G^{[l]}_{kk'}$ measures how correlated the activations in channel $k$ are with the activations in channel $k'$.
- A large $G_{kk'}$ means the two channels are more strongly correlated.
- It turns out that you get more visually pleasing results if you use the style cost function from multiple layers, so the overall style cost function is a weighted sum of the per-layer style costs, with one weight per layer as an additional hyperparameter.
- Steps to create a TensorFlow model for neural style transfer:
    - Create an Interactive Session
    - Load the content image
    - Load the style image
    - Randomly initialize the image to be generated
    - Load the VGG16 model
    - Build the TensorFlow graph:
        - Run the content image through the VGG16 model and compute the content cost
        - Run the style image through the VGG16 model and compute the style cost
        - Compute the total cost
        - Define the optimizer and the learning rate
- Initialize the TensorFlow graph and run it for a large number of iterations, updating the generated image at every step.
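
The style matrix from the style cost section sums, over all spatial positions, the product of the activations in channels $k$ and $k'$. A minimal pure-Python sketch, with activations stored as `[row][column][channel]` and a made-up 2x2x2 example (in practice the activations would come from a layer of the pretrained network):

```python
def style_matrix(activations):
    """G[k][k2] = sum over positions (i, j) of a[i][j][k] * a[i][j][k2]."""
    n_c = len(activations[0][0])
    G = [[0.0] * n_c for _ in range(n_c)]
    for row in activations:
        for pixel in row:
            for k in range(n_c):
                for k2 in range(n_c):
                    G[k][k2] += pixel[k] * pixel[k2]
    return G


# Hypothetical activations of one layer: height 2, width 2, 2 channels.
activations = [
    [[1.0, 0.0], [2.0, 0.0]],
    [[0.0, 3.0], [1.0, 1.0]],
]
G = style_matrix(activations)
print(G)  # n_c x n_c, symmetric
```

The two channels are active together at only one position, so the off-diagonal entries are small compared with the diagonal ones.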


    
