<h1><center>CMSC 478: Machine Learning</center></h1>

<center><img src="img/title.jpg" align="center"/></center>


<h3 style="color:blue;"><center>Instructor: Fereydoon Vafaei</center></h3>


<h5 style="color:purple;"><center>Convolutional Neural Networks (CNNs)</center></h5>

<center><img src="img/UMBC_logo.png" align="center"/></center>

<h1><center>Agenda</center></h1>

- <b>Convolutional Neural Networks</b>
    - Computer Vision Applications
    - Convolution Operation
    - Pooling Operation
    - Padding and Strides
    - CNN Architectures
        - LeNet-5
        - AlexNet
            - Data Augmentation
        - GoogLeNet
            - Inception Module
        - VGGNet
        - ResNet
            - Residual Learning
        - Xception
        - SENet
- <b>Further Applications</b>
    - Object Detection (IoU & Non-Max Suppression)
    - Face Recognition

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

<h1><center>Computer Vision</center></h1>

- Computer Vision is one of the areas that have advanced rapidly thanks to Deep Learning, and specifically to **CNN**s.


- CV in self-driving cars includes different ML tasks such as: image classification, object detection, image segmentation, etc. 

<center><img src="img/computer-vision.gif" align="center"/></center>

<center><img src="img/cv-2.png" align="center"/></center>

<font size='1'>Image from Ref[4]</font>

<h1><center>Image Classification vs Object Detection</center></h1>

- **Image classification** takes an image as input and outputs a class label for that image together with a score (probability, confidence, etc.).


- **Object detection** is the task of classifying and localizing multiple objects in an image.

<center><img src="img/object-localization.png" align="center"/></center>
<font size=1>Image from: https://www.kaggle.com/getting-started/169984</font>

<h1><center>Object Detection Examples</center></h1>

[Tensorflow](https://www.tensorflow.org/lite/models/object_detection/overview)

[YOLO](https://github.com/pjreddie/darknet)

<h1><center>Neural Style Transfer</center></h1>

[Tensorflow Tutorial](https://www.tensorflow.org/tutorials/generative/style_transfer)

<center><img src="img/neural-style-transfer.jpeg" align="center"/></center>

<font size='1'>Image from Ref[5]</font>

<h1><center>Image and Video Colorization</center></h1>

[Image Colorization API from DeepAI](https://deepai.org/machine-learning-model/colorizer)


[DeOldify - A library in GitHub](https://github.com/jantic/DeOldify)


[Coloring Movie Psycho (1960)](https://www.youtube.com/watch?v=l3UXXid04Ys&feature=emb_logo)

<h1><center>Motivation - Large Images & Too Many Parameters</center></h1>

<center><img src="img/large-images.png" align="center"/></center>

<font size='1'>Image from Ref[3]</font>

<h1><center>Motivation - Convolution Operation</center></h1>

- Convolutional networks have been tremendously successful in CV practical applications. The name **convolutional neural network** indicates that the network employs a mathematical operation called **convolution**.


- **Convolution** is a specialized kind of linear operation.


- Convolutional networks are simply neural networks that use **convolution** in place of general matrix multiplication in at least one of their layers.

<h1><center>Motivation - Convolution Operation</center></h1>

- **Convolution** leverages three important ideas that can help improve a machine learning system:
    - Sparse interactions

    - Parameter sharing

    - Equivariant representations


- Moreover, **convolution** provides a means for working with inputs of variable size.

<h1><center>Edge Detection Example</center></h1>

<center><img src="img/edge-detection.png" align="center"/></center>

<font size='1'>Image from Ref[3]</font>

<h1><center>Convolution Operation - Vertical Edge Detection</center></h1>

<center><img src="img/conv-op-1.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Convolution Operation - Vertical Edge Detection</center></h1>

<center><img src="img/conv-op-2.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Convolution Operation - Vertical Edge Detection</center></h1>

<center><img src="img/conv-op-3.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Convolution Operation - Vertical Edge Detection</center></h1>

<center><img src="img/conv-op-4.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>
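
The vertical edge detection shown in the slides above can be reproduced with a few lines of plain Python. This is a minimal sketch, not the slides' own code: the 6 × 6 image, the filter values, and the helper name `conv2d_valid` are illustrative assumptions. As is common in deep learning, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the elementwise products."""
    f = len(kernel)
    out_h = len(image) - f + 1
    out_w = len(image[0]) - f + 1
    return [
        [
            sum(image[i + u][j + v] * kernel[u][v] for u in range(f) for v in range(f))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Illustrative 6x6 image: bright (10) left half, dark (0) right half.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]

# 3x3 vertical edge filter: positive left column, negative right column.
vertical_edge = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

edges = conv2d_valid(image, vertical_edge)
for row in edges:
    print(row)  # each row is [0, 30, 30, 0]
```

The large values in the middle two columns mark the position of the bright-to-dark transition, while the flat regions on either side give 0.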

<h1><center>Convolution Operation Example</center></h1>

<center><img src="img/conv-op-text.png" align="center"/></center>

<font size='1'>Image from Ref[2]</font>

<h1><center>Convolution Operation Animation</center></h1>

<center><img src="img/convolution-operation-1.gif" align="center"/></center>

<font size=1> Image from Ref[13]</font>

<h1><center>Convolution Operation Example-1</center></h1>

<center><img src="img/convolution-operation-2.gif" align="center"/></center>

<font size=1> Image from Ref[13]</font>

<h1><center>Convolution Operation Example-2</center></h1>

<center><img src="img/convolution-operation-3.gif" align="center"/></center>

<font size=1> Image from Ref[15]</font>

<h1><center>Inspiration from Visual Cortex</center></h1>

<center><img src="img/visual-cortex.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Convolution Layers</center></h1>

<center><img src="img/cnn-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Filters and Feature Maps</center></h1>

<center><img src="img/feature-map-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Padding</center></h1>

<center><img src="img/padding.gif" align="center"/></center>

<font size=1> Image from Ref[13]</font>

<h1><center>Zero Padding</center></h1>

<center><img src="img/padding-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>"same" Padding</center></h1>

- Pad so that output size is the same as input size:

    - $n+2p-f+1 = n \implies p = \frac{f-1}{2}$
    
    
- Note: Here stride is assumed to be 1.

<h1><center>Padding "same" vs "valid"</center></h1>

- If `padding` is set to "same", the convolutional layer uses zero padding if necessary.

- In that case, zeros are added as evenly as possible around the inputs. When `strides=1`, the layer's outputs have the same spatial dimensions (width and height) as its inputs, hence the name "same".

- If `padding` is set to "valid", the layer uses no zero padding, so the output is smaller than the input.

<center><img src="img/padding-same-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Strided Convolution</center></h1>

<center><img src="img/strided.gif" align="center"/></center>

<font size=1> Image from Ref[13]</font>

<h1><center>Strided Convolution Example - 1</center></h1>

<center><img src="img/strided-1.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Strided Convolution Example - 2</center></h1>

<center><img src="img/strided-2.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Strided Convolution Example - 3</center></h1>

<center><img src="img/strided-3.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Strided Convolution Example - 4</center></h1>

<center><img src="img/strided-4.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Strided Convolution - Computing Output Dimension</center></h1>

$$\frac{n + 2p - f}{s} + 1$$

- where:
    - n: input size
    - f: filter size
    - p: padding
    - s: stride
      

- If the fraction is not an integer, round it down, i.e. apply the floor() function to it before adding 1.
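
The formula above is easy to check numerically. The helper below is a small sketch (the name `conv_output_size` and the example sizes are assumptions chosen for illustration). It uses integer floor division to do the rounding down.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size along one spatial dimension: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

no_padding = conv_output_size(6, 3)          # 6 - 3 + 1 = 4
same_padding = conv_output_size(6, 3, p=1)   # p = (f - 1) / 2 = 1 keeps the size at 6
strided = conv_output_size(7, 3, s=2)        # (7 - 3) / 2 + 1 = 3
rounded_down = conv_output_size(8, 3, s=2)   # (8 - 3) / 2 = 2.5, floored to 2, then + 1 = 3

print(no_padding, same_padding, strided, rounded_down)
```

The second call also confirms the "same" padding rule for stride 1: with $p = \frac{f-1}{2}$ the output size equals the input size.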

<h1><center>Convolution Summary</center></h1>

<center><img src="img/conv-summary.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Strided Convolution Effect - Dimensionality Reduction</center></h1>

<center><img src="img/striding-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Convolution over Volume</center></h1>

<center><img src="img/conv-3D.png" align="center"/></center>

<font size='1'>Slide from Ref[3]</font>

<h1><center>Convolution Over RGB Channels</center></h1>

<center><img src="img/rgb.gif" align="center"/></center>

<font size=1> Image from Ref[13]</font>

**Equation 14-1: Computing the output of a neuron in a convolutional layer**

$
z_{i,j,k} = b_k + \sum\limits_{u = 0}^{f_h - 1} \, \, \sum\limits_{v = 0}^{f_w - 1} \, \, \sum\limits_{k' = 0}^{f_{n'} - 1} \, \, x_{i', j', k'} \times w_{u, v, k', k}
\quad \text{with }
\begin{cases}
i' = i \times s_h + u \\
j' = j \times s_w + v
\end{cases}
$
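
To make the indices in Equation 14-1 concrete, here is a direct, unoptimized translation into plain Python loops. It is only a sketch: the function name, the nested-list layout (`x[row][col][channel]`, `w[u][v][k_prev][k]`), and the toy values are assumptions made for illustration, and real frameworks compute this far more efficiently.

```python
def conv_neuron_output(x, w, b, i, j, k, s_h=1, s_w=1):
    """Compute z[i][j][k] as in Equation 14-1.

    x: input indexed as x[row][col][channel]
    w: weights indexed as w[u][v][k_prev][k]
    b: bias terms, one per output feature map
    s_h, s_w: vertical and horizontal strides
    """
    f_h, f_w, f_n_prev = len(w), len(w[0]), len(w[0][0])
    z = b[k]
    for u in range(f_h):
        for v in range(f_w):
            for k_prev in range(f_n_prev):
                z += x[i * s_h + u][j * s_w + v][k_prev] * w[u][v][k_prev][k]
    return z

# Toy input: 3x3 image with 2 channels, where x[r][c] = [r + c, 1].
x = [[[r + c, 1] for c in range(3)] for r in range(3)]
# A single 2x2 filter (one output feature map) with all weights 1 and bias 0.5.
w = [[[[1], [1]] for _ in range(2)] for _ in range(2)]
b = [0.5]

z_00 = conv_neuron_output(x, w, b, i=0, j=0, k=0)  # (0+1+1+2) + 4*1 + 0.5 = 8.5
z_11 = conv_neuron_output(x, w, b, i=1, j=1, k=0)  # (2+3+3+4) + 4*1 + 0.5 = 16.5
print(z_00, z_11)
```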

<h1><center>Stacking Multiple Feature Maps</center></h1>

- A convolutional layer has multiple filters (you decide how many) and outputs one feature map per filter, so it is more accurately represented in 3D (see Figure 14-6).

- It has one neuron per pixel in each feature map, and all neurons within a given feature map share the same parameters (i.e., the same weights and bias term). Neurons in different feature maps use different parameters. A neuron’s receptive field extends across all the previous layers’ feature maps.

- In short, a convolutional layer simultaneously applies multiple trainable filters to its inputs, making it capable of detecting multiple features anywhere in its inputs.

<center><img src="img/conv-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Shared Parameters in CNN</center></h1>


- The fact that all neurons in a feature map **share the same parameters** dramatically reduces the number of parameters in the model.


- Once the CNN has learned to recognize a pattern in one location, it can recognize it in any other location. 


- In contrast, once a regular DNN has learned to recognize a pattern in one location, it can recognize it only in that particular location.
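
A quick count shows how much parameter sharing saves. The sketch below compares a 3 × 3 convolutional layer with a fully connected layer that maps the same input to an output of the same size. The layer sizes and helper names are illustrative assumptions.

```python
def conv_layer_params(f_h, f_w, in_channels, n_filters):
    """Each filter has f_h * f_w * in_channels weights plus one bias, reused at every position."""
    return (f_h * f_w * in_channels + 1) * n_filters

def dense_layer_params(n_inputs, n_outputs):
    """Every output unit has its own weight for every input, plus one bias."""
    return (n_inputs + 1) * n_outputs

# Assumed sizes: a 32x32 RGB input mapped to 32 feature maps of size 32x32.
height, width, channels, n_filters = 32, 32, 3, 32

conv_params = conv_layer_params(3, 3, channels, n_filters)
dense_params = dense_layer_params(height * width * channels, height * width * n_filters)

print(conv_params)   # 896
print(dense_params)  # 100696064
```

Under these assumptions the convolutional layer needs fewer than a thousand parameters, whereas the dense layer needs about a hundred million.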

<h1><center>Sparse Connectivity</center></h1>

- Traditional neural network layers use matrix multiplication by a matrix of parameters with a separate parameter describing the interaction between each input unit and each output unit. This means that every output unit interacts with every input unit. 


- Convolutional networks, however, typically have sparse interactions---also referred to as **sparse connectivity** or **sparse weights**.

<h1><center>Effect of Sparse Connectivity - Viewed from Below</center></h1>

<center><img src="img/sparse-connectivity-1.png" align="center"/></center>

<font size='1'>Image from Ref[2]</font>

<h1><center>Effect of Sparse Connectivity - Viewed from Above</center></h1>

<center><img src="img/sparse-connectivity-2.png" align="center"/></center>

<font size='1'>Image from Ref[2]</font>

<h1><center>Pooling</center></h1>

- The goal of **pooling** is to subsample (i.e., shrink) the input image in order to reduce the computational load, the memory usage, and the number of parameters (thereby limiting the risk of overfitting).

- Just like in convolutional layers, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field. You must define its size, the stride, and the padding type, just like before. However, a pooling neuron has no weights; all it does is aggregate the inputs using an aggregation function such as the max or mean.

- Figure 14-8 shows a max pooling layer, which is the most common type of pooling layer. In this example, we use a 2 × 2 pooling kernel, with a stride of 2 and no padding. Only the max input value in each receptive field makes it to the next layer, while the other inputs are dropped.

- For example, in the lower-left receptive field in Figure 14-8, the input values are 1, 5, 3, 2, so only the max value, 5, is propagated to the next layer. Because of the stride of 2, the output image has half the height and half the width of the input image (rounded down since we use no padding).

<center><img src="img/pooling-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>
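
Max pooling as described above can be sketched in a few lines. The input values below are made up for illustration (they are not the values of Figure 14-8), and `max_pool2d` is a hypothetical helper, not a library function.

```python
def max_pool2d(image, size=2, stride=2):
    """Max pooling with no padding; windows that do not fit entirely are dropped."""
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    return [
        [
            max(
                image[i * stride + u][j * stride + v]
                for u in range(size)
                for v in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

image = [
    [1, 5, 0, 2],
    [3, 2, 4, 1],
    [7, 0, 1, 1],
    [2, 6, 3, 8],
]
pooled = max_pool2d(image)
print(pooled)  # [[5, 4], [7, 8]]
```

Note that the function has no weights at all: it only keeps the largest value of each 2 × 2 window, and the output has half the height and half the width of the input.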

<h1><center>Max Pooling</center></h1>

- Other kernels we’ve discussed so far had trainable weights, but pooling kernels do not: they are just stateless sliding windows.

<center><img src="img/max-pooling.png" align="center"/></center>

<font size=1> Image from Ref[14]</font>

<h1><center>Pooling Effect - Translation Invariance</center></h1>

- Other than reducing computations, memory usage, and the number of parameters, a max pooling layer also introduces some level of invariance to small translations, as shown in Figure 14-9. Here we assume that the bright pixels have a lower value than dark pixels, and we consider three images (A, B, C) going through a max pooling layer with a 2 × 2 kernel and stride 2. Images B and C are the same as image A, but shifted by one and two pixels to the right.

- As you can see, the outputs of the max pooling layer for images A and B are identical. This is what **translation invariance** means.

- For image C, the output is different: it is shifted one pixel to the right (but there is still 75% invariance). By inserting a max pooling layer every few layers in a CNN, it is possible to get some level of translation invariance at a larger scale.

- Moreover, max pooling offers a small amount of rotational invariance and a slight scale invariance. Such invariance (even if it is limited) can be useful in cases where the prediction should not depend on these details, such as in classification tasks.

<center><img src="img/pooling-invariance-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>
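
The invariance to small translations can be checked on a toy example. This is a sketch under simplifying assumptions: the images are tiny, the pattern is a single column with a high value (the opposite convention to the figure, which does not change the argument), and the helper names are hypothetical.

```python
def max_pool2d(image, size=2, stride=2):
    """Max pooling without padding."""
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    return [
        [
            max(image[i * stride + u][j * stride + v]
                for u in range(size) for v in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def shift_right(image, pixels):
    """Shift every row to the right, filling the vacated columns with zeros."""
    width = len(image[0])
    return [[0] * pixels + row[:width - pixels] for row in image]

image_a = [
    [9, 0, 0, 0, 0, 0],
    [9, 0, 0, 0, 0, 0],
]
image_b = shift_right(image_a, 1)  # same pattern, one pixel to the right
image_c = shift_right(image_a, 2)  # same pattern, two pixels to the right

pooled_a = max_pool2d(image_a)
pooled_b = max_pool2d(image_b)
pooled_c = max_pool2d(image_c)
print(pooled_a, pooled_b, pooled_c)
```

Here images A and B give the same pooled output, while the output for C is shifted by one position. The invariance is only partial: had the pattern started in an odd column, a one-pixel shift would already have moved it into the next pooling window.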

<h1><center>Depthwise Pooling</center></h1>

- Note that max pooling and average pooling can be performed along the depth dimension rather than the spatial dimensions, although this is not as common. This can allow the CNN to learn to be invariant to various features.

- For example, it could learn multiple filters, each detecting a different rotation of the same pattern (such as handwritten digits; see Figure 14-10), and the depthwise max pooling layer would ensure that the output is the same regardless of the rotation.

- The CNN could similarly learn to be invariant to anything else: thickness, brightness, skew, color, and so on.

<center><img src="img/pooling-invariance-2-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>CNN Architecture</center></h1>

- Typical CNN architectures stack a few convolutional layers (each one generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another pooling layer, and so on.

<center><img src="img/cnn-architecture-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>CNN Architecture</center></h1>

- The image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper (i.e., with more feature maps), thanks to the convolutional layers.

- At the top of the stack, a regular feedforward neural network is added, composed of a few fully connected layers (+ReLUs), and the final layer outputs the prediction (e.g., a softmax layer that outputs estimated class probabilities).

<center><img src="img/cnn-mathworks.png" align="center"/></center>

<font size='1'>Image from Ref[15]</font>

<h1><center>CNN in Tensorflow</center></h1>

- In TensorFlow, each input image is typically represented as a 3D tensor of shape [height, width, channels].


- A mini-batch is represented as a 4D tensor of shape [mini-batch size, height, width, channels].


- The weights of a convolutional layer are represented as a 4D tensor of shape [$f_h, f_w, f_{n'}, f_n$].


- The bias terms of a convolutional layer are simply represented as a 1D tensor of shape [$f_n$].

In [67]:
# CNN Example: The following code works on a 10-class image classification with input size: 32x32x3
model = keras.models.Sequential()
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1), padding='same',
                        input_shape=(32, 32, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu')) # number of filters usually grows
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu')) 
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu')) # FC Layer
model.add(layers.Dense(10, activation='softmax')) # Output layer for 10-class classification

In [68]:
model.summary()

Model: "sequential_16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_42 (Conv2D)           (None, 32, 32, 32)        896       
_________________________________________________________________
max_pooling2d_28 (MaxPooling (None, 16, 16, 32)        0         
_________________________________________________________________
conv2d_43 (Conv2D)           (None, 14, 14, 64)        18496     
_________________________________________________________________
max_pooling2d_29 (MaxPooling (None, 7, 7, 64)          0         
_________________________________________________________________
conv2d_44 (Conv2D)           (None, 5, 5, 64)          36928     
_________________________________________________________________
flatten_13 (Flatten)         (None, 1600)              0         
_________________________________________________________________
dense_28 (Dense)             (None, 64)              
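
As a cross-check, the output shapes and parameter counts of the convolutional and pooling layers in the summary above can be recomputed by hand. The sketch below does this in plain Python. The helper names and the `layer_specs` list are assumptions that simply mirror the model defined earlier (3 × 3 kernels, stride 1, 2 × 2 max pooling).

```python
def conv_out(n, f, padding, s=1):
    """Spatial size after a convolution: "same" gives ceil(n / s), "valid" uses no padding."""
    if padding == "same":
        return -(-n // s)  # ceiling division
    return (n - f) // s + 1

def conv_params(f, in_channels, n_filters):
    """f * f * in_channels weights plus one bias per filter."""
    return (f * f * in_channels + 1) * n_filters

# (layer type, number of filters, padding) for the model defined above.
layer_specs = [
    ("conv", 32, "same"),
    ("pool", None, None),
    ("conv", 64, "valid"),
    ("pool", None, None),
    ("conv", 64, "valid"),
]

size, channels = 32, 3  # 32x32x3 input
rows = []
for kind, n_filters, padding in layer_specs:
    if kind == "conv":
        params = conv_params(3, channels, n_filters)
        size, channels = conv_out(size, 3, padding), n_filters
    else:
        params = 0        # pooling layers have no weights
        size = size // 2  # 2x2 window with stride 2, rounded down
    rows.append((size, size, channels, params))

flattened = size * size * channels
for row in rows:
    print(row)
print(flattened)  # 1600
```

The printed rows match the `Output Shape` and `Param #` columns of the summary, and the flattened size matches the 1600 inputs of the first dense layer.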

<h1><center>CNN Architectures and ImageNet</center></h1>

- Over the years, variants of this fundamental architecture have been developed, leading to amazing advances in the field.

- A good measure of this progress is the error rate in competitions such as the ILSVRC **ImageNet** challenge.

- In this competition the top-five error rate for image classification fell from over 26% to less than 2.3% in just six years.

- The top-five error rate is the fraction of test images for which the system’s top five predictions did not include the correct answer.

- The images are large (256 pixels high) and there are 1,000 classes, some of which are really subtle (try distinguishing 120 dog breeds).

- Looking at the evolution of the winning entries is a good way to understand how CNNs work.

- Keras has implementations of multiple CNN architectures: https://keras.io/api/applications/
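
To make the top-five error rate mentioned above concrete, here is a small sketch of a top-k error computation. The scores and labels are made up, and with only three classes the example uses k = 1 and k = 2 in place of k = 5. The function name `top_k_error` is an assumption, not a Keras API.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k highest-scoring classes."""
    misses = 0
    for class_scores, label in zip(scores, labels):
        ranked = sorted(range(len(class_scores)), key=lambda c: class_scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)

# Made-up class scores for 4 test images over 3 classes.
scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
labels = [0, 2, 2, 2]

top1_error = top_k_error(scores, labels, k=1)  # images 2 and 3 are missed: 0.5
top2_error = top_k_error(scores, labels, k=2)  # only image 3 is missed: 0.25
print(top1_error, top2_error)
```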

<h1><center>Using CNN Architectures</center></h1>


- In general, you won’t have to implement standard models like GoogLeNet or ResNet manually, since pretrained networks are readily available with a single line of code in the [`keras.applications`](https://keras.io/api/applications/) package.


- For example, you can load the ResNet-50 model, pretrained on ImageNet, with the following line of code:

In [3]:
model = keras.applications.resnet50.ResNet50(weights="imagenet")

In [5]:
'''- The global average pooling layer computes the mean activation for each feature map:
for example, if its input contains 256 feature maps, 
it will output 256 numbers representing the overall level of response for each filter.'''
model.summary()

Model: "resnet50"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, 224, 224, 3) 0                                            
__________________________________________________________________________________________________
conv1_pad (ZeroPadding2D)       (None, 230, 230, 3)  0           input_2[0][0]                    
__________________________________________________________________________________________________
conv1_conv (Conv2D)             (None, 112, 112, 64) 9472        conv1_pad[0][0]                  
__________________________________________________________________________________________________
conv1_bn (BatchNormalization)   (None, 112, 112, 64) 256         conv1_conv[0][0]                 
___________________________________________________________________________________________

<h1><center>Transfer Learning with CNNs</center></h1>

- If you want to build an image classifier but you do not have enough training data, then it is often a good idea to reuse the lower layers of a pretrained model, as we discussed for transfer learning in DNNs.


- For example, you can train a model to classify pictures of flowers, reusing a pretrained **Xception** model.

In [None]:
'''- Load an Xception model, pretrained on ImageNet.
- We exclude the top of the network by setting `include_top=False`:
    this excludes the global average pooling layer and the dense output layer.
- We then add our own global average pooling layer,
    based on the output of the base model, followed by a dense output layer with one unit per class,
    using the softmax activation function. Finally, we create the Keras Model.'''
n_classes = 5  # placeholder (assumption): replace with the number of classes in your dataset
base_model = keras.applications.xception.Xception(weights="imagenet",
                                                  include_top=False)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
output = keras.layers.Dense(n_classes, activation="softmax")(avg)
model = keras.Model(inputs=base_model.input, outputs=output)

<h1><center>LeNet-5 Architecture</center></h1>

- The LeNet-5 architecture is perhaps the most widely known CNN architecture. As mentioned earlier, it was created by Yann LeCun in 1998 and has been widely used for handwritten digit recognition (MNIST). It is composed of the layers shown in Table 14-1.

<center><img src="img/lenet-5-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>AlexNet</center></h1>

- The AlexNet CNN architecture won the 2012 ImageNet ILSVRC challenge by a large margin: it achieved a top-five error rate of 17%, while the second best achieved only 26%!


- AlexNet was developed by Alex Krizhevsky (hence the name), Ilya Sutskever, and Geoffrey Hinton.


- AlexNet is similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of one another, instead of stacking a pooling layer on top of each convolutional layer.


- To reduce overfitting, the authors used two regularization techniques:
    - First, they applied dropout with a 50% dropout rate during training to the outputs of the two fully connected layers F9 and F10.
    - Second, they performed **data augmentation** by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions.

<h1><center>AlexNet Architecture</center></h1>

<center><img src="img/alexnet-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Data Augmentation</center></h1>

- Data augmentation artificially increases the size of the training set by generating many realistic variants of each training instance.


- This reduces overfitting, making this a regularization technique.


- The generated instances should be as realistic as possible: ideally, given an image from the augmented training set, a human should not be able to tell whether it was augmented or not. Simply adding white noise will not help; the modifications should be learnable (white noise is not).


- For example, you can slightly shift, rotate, and resize every picture in the training set by various amounts and add the resulting pictures to the training set.


- This forces the model to be more tolerant to variations in the position, orientation, and size of the objects in the pictures.


- For a model that’s more tolerant of different lighting conditions, you can similarly generate many images with various contrasts.


- In general, you can also flip the pictures horizontally (except for text, and other asymmetrical objects). By combining these transformations, you can greatly increase the size of your training set.
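
The shifting and flipping transformations described above can be sketched without any library. This is only an illustration: the helper names and the tiny 3 × 3 "image" are assumptions, and a real pipeline would sample offsets randomly rather than use a fixed list.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def shift(image, dx, dy, fill=0):
    """Shift the image dx pixels right and dy pixels down, padding with `fill`."""
    h, w = len(image), len(image[0])
    return [
        [
            image[r - dy][c - dx] if 0 <= r - dy < h and 0 <= c - dx < w else fill
            for c in range(w)
        ]
        for r in range(h)
    ]

def augment(image):
    """Return the original image plus a few simple variants."""
    variants = [image, flip_horizontal(image)]
    for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
        variants.append(shift(image, dx, dy))
    return variants

image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
augmented = augment(image)
print(len(augmented))  # 6 images from 1 original
```

Each variant keeps the label of the original image, which is what lets the training set grow without new annotation work.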

<h1><center>Data Augmentation</center></h1>

<center><img src="img/data-aug-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Image Augmentation</center></h1>

https://github.com/aleju/imgaug

<center><img src="img/imgaug.png" align="center"/></center>

<font size='1'>Image from Ref[6]</font>

<h1><center>Image Augmentation Using Tensorflow ImageDataGenerator()</center></h1>


In [None]:
# https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Create data generator
data_generator = ImageDataGenerator(width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)

# Prepare iterator
iterate_train = data_generator.flow(X_train, y_train, batch_size=64)
steps = X_train.shape[0] // 64  # batches per epoch; X_train is the array passed to flow() above

# Train the model using the iterator
# Note: In the previous tf versions, fit_generator() was used to train generators instead of fit() 
# fit() now supports generators, so there is no longer any need to use fit_generator()
# https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit_generator

history = model.fit(iterate_train, steps_per_epoch=steps, validation_data=(X_test, y_test),
                              epochs=EPOCHS, callbacks=[early_stop], verbose=1)

<h1><center>GoogLeNet - Inception Module</center></h1>

- GoogLeNet uses subnetworks called **inception modules**, which allow it to use parameters much more efficiently than previous architectures: GoogLeNet actually has 10 times fewer parameters than AlexNet (roughly 6 million instead of 60 million).

- Figure 14-13 shows the architecture of an inception module. The notation “3 × 3 + 1(S)” means that the layer uses a 3 × 3 kernel, stride 1, and "same" padding. The input signal is first copied and fed to four different layers. All convolutional layers use the ReLU activation function.

- Note that the second set of convolutional layers uses different kernel sizes (1 × 1, 3 × 3, and 5 × 5), allowing them to capture patterns at different scales.

- Also note that every single layer uses a stride of 1 and "same" padding (even the max pooling layer), so their outputs all have the same height and width as their inputs. This makes it possible to concatenate all the outputs along the depth dimension in the final depth concatenation layer (i.e., stack the feature maps from all four top convolutional layers).

- This concatenation layer can be implemented in TensorFlow using the `tf.concat()` operation, with `axis=3` (the axis is the depth).

<center><img src="img/inception-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Inception Module</center></h1>

- You may wonder why inception modules have convolutional layers with 1 × 1 kernels. How can these layers capture any features when they look at only one pixel at a time?

- In fact, the layers serve three purposes:
    - Although they cannot capture spatial patterns, they can capture patterns along the depth dimension.
    - They are configured to output fewer feature maps than their inputs, so they serve as bottleneck layers, meaning they reduce dimensionality. This cuts the computational cost and the number of parameters, speeding up training and improving generalization.
    - Each pair of convolutional layers ([1 × 1, 3 × 3] and [1 × 1, 5 × 5]) acts like a single powerful convolutional layer, capable of capturing more complex patterns.

- Indeed, instead of sweeping a simple linear classifier across the image (as a single convolutional layer does), this pair of convolutional layers sweeps a two-layer neural network across the image.

- In short, you can think of the whole inception module as a convolutional layer on steroids, able to output feature maps that capture complex patterns at various scales.
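
The bottleneck effect of the 1 × 1 layers can be quantified with a parameter count. The sketch below compares a direct 5 × 5 convolution with a [1 × 1, 5 × 5] pair. The channel counts (192 in, 16 in the bottleneck, 32 out) are assumptions chosen for illustration, and `conv_params` is a hypothetical helper.

```python
def conv_params(kernel, in_channels, out_channels):
    """kernel * kernel * in_channels weights plus one bias per output feature map."""
    return (kernel * kernel * in_channels + 1) * out_channels

in_ch, bottleneck_ch, out_ch = 192, 16, 32  # illustrative sizes

direct = conv_params(5, in_ch, out_ch)
with_bottleneck = conv_params(1, in_ch, bottleneck_ch) + conv_params(5, bottleneck_ch, out_ch)

print(direct)           # 153632
print(with_bottleneck)  # 15920
```

With these sizes the pair needs roughly a tenth of the parameters of the direct 5 × 5 layer, because the expensive 5 × 5 filters only see 16 feature maps instead of 192.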

<h1><center>GoogLeNet Architecture</center></h1>

- Now let’s look at the architecture of the GoogLeNet CNN (see Figure 14-14). The number of feature maps output by each convolutional layer and each pooling layer is shown before the kernel size.

- The architecture is so deep that it has to be represented in three columns, but GoogLeNet is actually one tall stack, including nine inception modules (the boxes with the spinning tops). The six numbers in the inception modules represent the number of feature maps output by each convolutional layer in the module (in the same order as in Figure 14-13).

- Note that all the convolutional layers use the ReLU activation function.

<center><img src="img/googlenet-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>VGGNet</center></h1>

- The runner-up in the ILSVRC 2014 challenge was VGGNet, developed by Karen Simonyan and Andrew Zisserman from the Visual Geometry Group (VGG) research lab at Oxford University.


- VGGNet has a very simple and classical architecture, with 2 or 3 convolutional layers and a pooling layer, then again 2 or 3 convolutional layers and a pooling layer, and so on (reaching a total of just 16 or 19 convolutional layers, depending on the VGG variant), plus a final dense network with 2 hidden layers and the output layer.


- VGGNet uses only 3 × 3 filters, but many filters.

<h1><center>Residual Learning</center></h1>

- As CNN models get deeper and deeper, training them gets more and more challenging: for example, vanishing and exploding gradients may arise.


- The key to being able to train such a deep network is to use skip connections (also called shortcut connections): the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack.

<center><img src="img/residual-learning-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Residual Network vs DNN</center></h1>

- If you add many skip connections, the network can start making progress even if several layers have not started learning yet.

- Thanks to skip connections, the signal can easily make its way across the whole network.

<center><img src="img/dnn-resnet-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>
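
The effect of skip connections can be illustrated with a deliberately simplified model. In the sketch below, a "layer" that has not started learning is assumed to output zeros, and activations are ignored. The function names are hypothetical, and this is not how ResNet is actually implemented.

```python
def untrained_layer(x):
    """Stand-in for a layer whose weights are still near zero: it outputs zeros."""
    return [0.0 for _ in x]

def residual_unit(x, layer):
    """Skip connection: the layer's output is added to its unchanged input."""
    return [f_i + x_i for f_i, x_i in zip(layer(x), x)]

def plain_stack(x, layers):
    """Regular deep network: each layer only sees the previous layer's output."""
    for layer in layers:
        x = layer(x)
    return x

def residual_stack(x, layers):
    """Same layers, but each one is wrapped in a skip connection."""
    for layer in layers:
        x = residual_unit(x, layer)
    return x

signal = [1.0, -2.0, 3.0]
blocked_layers = [untrained_layer] * 10

plain_out = plain_stack(signal, blocked_layers)
residual_out = residual_stack(signal, blocked_layers)
print(plain_out)     # [0.0, 0.0, 0.0]
print(residual_out)  # [1.0, -2.0, 3.0]
```

In the plain stack the signal is lost as soon as one layer blocks it, whereas with skip connections the input reaches the output unchanged, so the network simply models the identity function until its layers start learning.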

<h1><center>ResNet Architecture</center></h1>

<center><img src="img/resnet-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Skip Connection - Matching Dimensions</center></h1>

- Note that the number of feature maps is doubled every few residual units, at the same time as their height and width are halved (using a convolutional layer with stride 2).

- When this happens, the inputs cannot be added directly to the outputs of the residual unit because they don’t have the same shape.

- To solve this problem, the inputs are passed through a 1 × 1 convolutional layer with stride 2 and the right number of output feature maps.

<center><img src="img/skip-connection.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Xception</center></h1>

- Another variant of the GoogLeNet architecture is worth noting: Xception (which stands for Extreme Inception) was proposed in 2016 by François Chollet (the author of Keras), and it significantly outperformed Inception-v3 on a huge vision task (350 million images and 17,000 classes).

- Just like Inception-v4, it merges the ideas of GoogLeNet and ResNet, but it replaces the inception modules with a special type of layer called a depthwise separable convolution layer (or separable convolution layer for short).

- While a regular convolutional layer uses filters that try to simultaneously capture spatial patterns (e.g., an oval) and cross-channel patterns (e.g., mouth + nose + eyes = face), a separable convolutional layer makes the strong assumption that spatial patterns and cross-channel patterns can be modeled separately. Thus, it is composed of two parts: the first part applies a single spatial filter for each input feature map, then the second part looks exclusively for cross-channel patterns—it is just a regular convolutional layer with 1 × 1 filters.

<center><img src="img/depthwise.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>
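
To make the difference concrete, the sketch below compares the number of trainable parameters in a regular convolutional layer and a separable one. The layer sizes (64 input feature maps, 128 output feature maps, 3 × 3 kernels) are assumptions chosen for illustration only.

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(32, 32, 64))  # assumed: 64 input feature maps

regular = layers.Conv2D(128, 3, padding="same")
separable = layers.SeparableConv2D(128, 3, padding="same")

# Calling each layer on the input builds its weights.
regular(inputs)
separable(inputs)

# Regular: 3*3*64*128 kernel weights + 128 biases
regular_params = regular.count_params()
# Separable: 3*3*64 depthwise weights + 1*1*64*128 pointwise weights + 128 biases
separable_params = separable.count_params()

print(regular_params)    # 73856
print(separable_params)  # 8896
```

In this configuration the separable layer uses roughly 8 times fewer parameters, which is the payoff of modeling spatial and cross-channel patterns separately.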

<h1><center>SENet</center></h1>

- The winning architecture in the ILSVRC 2017 challenge was the Squeeze-and-Excitation Network (SENet).

- This architecture extends existing architectures such as inception networks and ResNets, and boosts their performance.

- This allowed SENet to win the competition with an astonishing 2.25% top-five error rate!

- The extended versions of inception networks and ResNets are called SE-Inception and SE-ResNet, respectively.

- The boost comes from the fact that a SENet adds a small neural network, called an **SE Block**, to every unit in the original architecture (i.e., every inception module or every residual unit).


<center><img src="img/senet.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>SE Block</center></h1>

- An SE block analyzes the output of the unit it is attached to, focusing exclusively on the depth dimension (it does not look for any spatial pattern), and it learns which features are usually most active together.

- It then uses this information to recalibrate the feature maps, as shown in the figure below.

- For example, an SE block may learn that mouths, noses, and eyes usually appear together in pictures: if you see a mouth and a nose, you should expect to see eyes as well. So if the block sees a strong activation in the mouth and nose feature maps, but only mild activation in the eye feature map, it will boost the eye feature map (more accurately, it will reduce irrelevant feature maps). If the eyes were somewhat confused with something else, this feature map recalibration will help resolve the ambiguity.

<center><img src="img/se-block-1.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>SE Block</center></h1>

- An SE block is composed of just three layers: a global average pooling layer, a hidden dense layer using the ReLU activation function, and a dense output layer using the sigmoid activation function.

<center><img src="img/se-block-2.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>
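
The three layers described above can be sketched in `tf.keras` as follows. The helper name `se_block`, the reduction `ratio=16` for the hidden layer, and the 28 × 28 × 256 input shape are assumptions for illustration; the `Reshape` and `Multiply` layers are only there to apply the learned per-channel weights to the feature maps.

```python
from tensorflow import keras
from tensorflow.keras import layers

def se_block(feature_maps, ratio=16):
    """Hypothetical helper: recalibrate feature maps along the depth dimension."""
    channels = feature_maps.shape[-1]

    # Squeeze: global average pooling gives one number per feature map.
    s = layers.GlobalAveragePooling2D()(feature_maps)

    # Excitation: a small bottleneck dense layer (ReLU), then a dense output
    # layer (sigmoid) producing one weight in [0, 1] per feature map.
    s = layers.Dense(channels // ratio, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)

    # Recalibration: scale each feature map by its weight (broadcast over H and W).
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([feature_maps, s])

inputs = keras.Input(shape=(28, 28, 256))  # assumed output of a residual unit
outputs = se_block(inputs)
se_model = keras.Model(inputs, outputs)
print(se_model.output_shape)  # (None, 28, 28, 256)
```

The output has the same shape as the input, so the block can be attached to an existing unit without changing the rest of the architecture.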

<h1><center>Object Localization</center></h1>


- Localizing an object in a picture can be expressed as a regression task.


- To predict a bounding box around the object, a common approach is to predict the horizontal and vertical coordinates of the object’s center, as well as its height and width. This means we have four numbers to predict.

<center><img src="img/object-detection.png" align="center"/></center>

<font size='1'>Image from Ref[7]</font>
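
The four predicted numbers fully determine the box. As a small illustration, the hypothetical helper below converts a (center x, center y, height, width) prediction into corner coordinates, which is the form usually needed for drawing or comparing boxes. The function name and the sample values are assumptions.

```python
def center_to_corners(cx, cy, height, width):
    """Convert (center x, center y, height, width) to (x_min, y_min, x_max, y_max)."""
    x_min = cx - width / 2
    y_min = cy - height / 2
    x_max = cx + width / 2
    y_max = cy + height / 2
    return x_min, y_min, x_max, y_max

# Example: a box centered at (50, 40), 20 pixels tall and 60 pixels wide.
box = center_to_corners(cx=50, cy=40, height=20, width=60)
print(box)  # (20.0, 30.0, 80.0, 50.0)
```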

<h1><center>Intersection over Union (IoU)</center></h1>

- The MSE often works fairly well as a cost function to train a model for regression, but it is not a great metric to evaluate how well the model can predict bounding boxes.


- The most common metric for this is the Intersection over Union (IoU): the area of overlap between the predicted bounding box and the target bounding box, divided by the area of their union.


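- `tf.keras` provides the `tf.keras.metrics.MeanIoU` class. Note that it computes the mean IoU over class labels (as used in semantic segmentation) rather than directly from bounding box coordinates.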

<center><img src="img/iou.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>
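
For two axis-aligned bounding boxes, the IoU definition above can be computed directly. This is a minimal sketch, assuming boxes are given as `(x_min, y_min, x_max, y_max)`; the function name `iou` and the sample boxes are illustrative.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so non-overlapping boxes get an empty intersection.
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0

predicted = (0, 0, 10, 10)
target = (5, 5, 15, 15)
print(round(iou(predicted, target), 3))  # 0.143
```

Here the overlap is 5 × 5 = 25 and the union is 100 + 100 − 25 = 175, so the IoU is 25 / 175 ≈ 0.143. Identical boxes give 1.0 and disjoint boxes give 0.0.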

<h1><center>Object Detection</center></h1>

- The task of classifying and localizing multiple objects in an image is called **object detection**.

- Until a few years ago, a common approach was to take a CNN that was trained to classify and locate a single object, then slide it across the image, as shown below.

- This technique is fairly straightforward, but as you can see it will detect the same object multiple times, at slightly different positions. Some post-processing will then be needed to get rid of all the unnecessary bounding boxes. A common approach for this is called **non-max suppression**.

<center><img src="img/object-detection-text.png" align="center"/></center>

<font size='1'>Image from Ref[1]</font>

<h1><center>Non-Max Suppression</center></h1>

<center><img src="img/non-max.jpg" align="center"/></center>

<font size='1'>Image from Ref[10]</font>
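
A common greedy form of non-max suppression keeps the highest-scoring box, discards every remaining box that overlaps it too much, and repeats. The sketch below is one simple NumPy version, not the exact procedure of any particular detector. It assumes boxes are `(x_min, y_min, x_max, y_max)` with positive area; the function name and the `iou_threshold=0.5` default are illustrative choices.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-max suppression."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))

        # IoU between the best box and every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        ious = inter / (areas[best] + areas[rest] - inter)

        # Drop boxes that overlap the best box too much; keep the others.
        order = rest[ious <= iou_threshold]
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed; the third box does not overlap at all and is kept as a separate detection.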

<h1><center>YOLO - You Only Look Once</center></h1>

- YOLO (You Only Look Once), developed at the University of Washington, is a widely used real-time object detection algorithm that combines several components, including CNNs, IoU, and non-max suppression.

https://pjreddie.com/darknet/yolo/

<h1><center>Face Recognition</center></h1>

<center><img src="img/face-recog.jpg" align="center"/></center>

<font size='1'>Image from Ref[12]</font>

<h1><center>Face Recognition with One-Shot Learning</center></h1>

<center><img src="img/one-shot.png" align="center"/></center>

<font size='1'>Image from Ref[9]</font>

<h1><center>Triplet Loss</center></h1>

<center><img src="img/triplet_loss.png" align="center"/></center>

<font size='1'>Image from Ref[3]</font>

<h1><center>Triplet Loss</center></h1>

<center><img src="img/triplet_loss-2.png" align="center"/></center>
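
As a rough illustration, the triplet loss is commonly written as $\max(\|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 + \alpha,\ 0)$, where $A$, $P$, and $N$ are the anchor, positive, and negative images and $f$ is the embedding network. The NumPy sketch below assumes this common form; the function name, the `margin=0.2` value, and the 2-D toy embeddings are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on embedding vectors, using squared Euclidean distances."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)  # anchor vs. same person
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)  # anchor vs. other person
    # Zero loss once the negative is farther than the positive by at least the margin.
    return np.maximum(pos_dist - neg_dist + margin, 0.0)

anchor = np.array([0.0, 1.0])
positive = np.array([0.0, 0.9])       # close to the anchor
easy_negative = np.array([1.0, 0.0])  # far from the anchor
hard_negative = np.array([0.1, 1.0])  # as close as the positive

print(triplet_loss(anchor, positive, easy_negative))  # 0.0
print(triplet_loss(anchor, positive, hard_negative))  # approximately 0.2
```

Easy negatives contribute no loss, which is why training usually focuses on hard triplets where the negative is about as close to the anchor as the positive.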

<h1><center>Facial Expression Recognition</center></h1>

**Demo:**

https://github.com/justadudewhohacks/face-api.js#age-estimation--gender-recognition

<h1><center>References</center></h1>

[1] Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron (2019)

[2] Deep Learning [Textbook](http://www.deeplearningbook.org/contents/convnets.html) by Ian Goodfellow et al.

[3] Andrew Ng's CNN Course on [Coursera](https://www.coursera.org/learn/convolutional-neural-networks?=)

[4] https://towardsdatascience.com/everything-you-ever-wanted-to-know-about-computer-vision-heres-a-look-why-it-s-so-awesome-e8a58dfb641e

[5] https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-neural-style-transfer-ef88e46697ee

[6] https://github.com/aleju/imgaug

[7] https://towardsdatascience.com/object-detection-with-10-lines-of-code-d6cb4d86f606

[8] https://github.com/justadudewhohacks/face-api.js#age-estimation--gender-recognition

[9] https://blog.netcetera.com/face-recognition-using-one-shot-learning-a7cf2b91e96c

[10] https://www.pyimagesearch.com/2014/11/17/non-maximum-suppression-object-detection-python/

[11] https://pjreddie.com/darknet/yolo/

[12] https://www.nec.com/en/global/solutions/biometrics/face/index.html

[13] https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1

[14] https://www.freecodecamp.org/news/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050/

[15] https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html