<a href="https://colab.research.google.com/github/chaitragopalappa/MIE590-690D/blob/main/5_NN_for_image_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Networks for Image Data
## Convolution Neural Networks (CNN)

Sources:
* Chapter 14, Probabilistic Machine Learning: An Introduction by Kevin Murphy  
* Dive into Deep Learning, by Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola https://d2l.ai/index.html

---

# **Challenges with applying MLP to image data**

## **Application of MLP to image data and corresponding challenges**
1. Variable-sized input issue
  * An image is essentially a grid of tiny squares called pixels. Each pixel contains information about color and intensity.
  * In classification ($Pr(y|\mathbf{x})$) tasks, e.g., cat or dog, defective product or good product, $\mathbf{x}\in \mathbb{R}^{H \times W\times C}$, where $H \times W \times C$ represents height, width, and channels (color, e.g., RGB); $\mathbf{x}$ is first flattened into a 1D array for input to the MLP.
  * However, images (sample data) may not have fixed $H, W$.
<img src="https://raw.githubusercontent.com/chaitragopalappa/MIE590-690D/main/images/MLP_Image.png" width="400" height="200">

2. Parameter explosion
  * For a 128×128×3 image, even modest hidden layer sizes result in millions of parameters.
3. Lack of translation invariance
  * A pattern that occurs in one location may not be recognized when it occurs in a different location, as the weights of the NN are not shared across locations, e.g., a defective product may be misclassified as a good product if the defect occurs in a location different from any of the samples it was trained on.  
   <img src="https://raw.githubusercontent.com/chaitragopalappa/MIE590-690D/main/images/Defect_SpatialInv_MLP.png" width="400" height="200">

## Solution: Convolution neural networks (CNN)


# **Convolution neural networks (CNN)**

## **CNN: Core concepts**
* Divide the input into overlapping 2d image patches
* Compare each patch with a set of small **weight matrices**, or **filters**, which represent parts of an object (**template** matching).
  * Learn these templates from data
  * Because the templates are small (often just 3x3 or 5x5), the number of parameters is significantly reduced.
  * Because we use convolution to do the template matching, instead of matrix multiplication, the model will be translationally invariant.
  
      <img src="https://raw.githubusercontent.com/probml/pml-book/main/book1-figures/Figure_14.2.png" width="200" height="200">

*Classify a digit by looking for certain discriminative features (image templates) occurring in the correct (relative) locations.
Figure Source: Embedded from GitHub repo of PML: An Introduction by Murphy (https://github.com/probml)
Original Source: F. Chollet. Deep learning with Python. Manning, 2017.*
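
To make the parameter comparison concrete, the following is a minimal sketch that counts the parameters in the first layer of an MLP and of a convolution layer for a 128×128×3 image. The hidden layer width (256), the filter size (3×3), and the number of filters (16) are assumed values chosen only for illustration.

```python
# Hypothetical sizes chosen for illustration only.
H, W, C = 128, 128, 3          # input image: height, width, channels
hidden_units = 256             # assumed width of the first MLP hidden layer

# Fully connected layer: one weight per (input value, hidden unit) pair, plus biases.
mlp_params = (H * W * C) * hidden_units + hidden_units

# Convolution layer: one small k x k template per (input channel, filter) pair, plus biases.
k, num_filters = 3, 16         # assumed filter size and number of filters
conv_params = (k * k * C) * num_filters + num_filters

print(f"MLP first layer : {mlp_params:,} parameters")   # 12,583,168
print(f"Conv first layer: {conv_params:,} parameters")  # 448
```

The convolution layer count does not depend on $H$ or $W$, because the same small templates are reused at every location.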





## **CNN Visualization tool**
[CNN Visualization tool](https://poloclub.github.io/cnn-explainer/)

# **Basic architecture of CNN**
* Convolution layers
* Pooling layers
* Normalization layer
-----

## **Convolution layers**

### **What is convolution and cross-correlation in the context of two 1D continuous functions?**

### Convolution of two continuous 1D functions
Convolution between two functions, say $f, g: \mathbb{R}^D \to \mathbb{R}$ (with $D = 1$ in the 1D case), is defined as  
$[f \circledast g](z) = \int_{\mathbb{R}^D} f(u)\, g(z - u)\, du$  
This can be interpreted as: flip one of the functions and drag it along the axis of the other, taking the integral of the product of the two functions at each position.
### Cross-correlation of two continuous 1D functions
Cross-correlation: (do not flip any function) Drag one of the functions along the axis of the other, taking the integral of the product of the two functions at each position.

  <img src="https://upload.wikimedia.org/wikipedia/commons/2/21/Comparison_convolution_correlation.svg" width="600" height="600">

*Visualization of convolution and cross-correlation [See GIF](https://en.wikipedia.org/wiki/Convolution#/media/File:Convolution_of_spiky_function_with_box2.gif)  
Convolution: Flip one of the functions and drag it along the axis of the other, taking the integral of the product of the two functions at each point.  
Cross-correlation: (do not flip any of the functions) Drag one of the functions along the axis of the other, taking the integral of the product of the two functions at each point.  
Autocorrelation: cross-correlation of a signal with itself.  
Figure source: Directly embedded from Wikipedia page*


Commutative property of convolution:
$f \circledast g = g \circledast f$

**Convolution = cross-correlation if the flipped vector is symmetric.**

---





### **What is convolution and cross-correlation - in 1D discrete domain?**
### Two discrete functions (two vectors)
Suppose we have two discrete functions (or continuous functions defined at discrete points), then we have two vectors:
* $w$, say, defined on domain $\{0, 1, \ldots, L\}$  
* $x$, say, defined on domain $\{0, 1, \ldots, N\}$

**Convolution** $[w \circledast x]$: Flip one of the vectors and drag along the axis of the other, taking the dot product of the two vectors at each position $(i)$ .
$$
[w \circledast x](i) = w_{0} \, x_{i-0}  + w_{1} \, x_{i-1} + \cdots + w_{L} \, x_{i-L}
$$


<img src="https://raw.githubusercontent.com/chaitragopalappa/MIE590-690D/main/images/1D_convolution_MurphytextBook.png" width="600" height="200">


**Cross-correlation** $[w \ast x]$: Drag one of the vectors along the axis of the other, taking the dot product of the two vectors at each position $(i)$.
$$
[w \ast x](i) = w_{0} \, x_{i+0} + w_{1} \, x_{i+1} + \cdots + w_{L} \, x_{i+L}
$$

*Figure Source: copied from PML: An Introduction, by Murphy (https://github.com/probml)*
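
The two definitions above can be sketched in a few lines of plain Python. This is an illustrative sketch, not code from the textbook: the helper names `cross_correlate_1d` and `convolve_1d` are hypothetical, and only the "valid" positions (where the filter fully overlaps the input) are kept.

```python
def cross_correlate_1d(w, x):
    """Valid 1D cross-correlation: slide w over x without flipping."""
    L = len(w)
    return [sum(w[u] * x[i + u] for u in range(L)) for i in range(len(x) - L + 1)]

def convolve_1d(w, x):
    """Valid 1D convolution: flip w, then cross-correlate."""
    return cross_correlate_1d(w[::-1], x)

x = [1, 2, 3, 4]
w = [5, 6, 7]        # asymmetric filter
w_sym = [1, 2, 1]    # symmetric filter

print(cross_correlate_1d(w, x))  # [38, 56]
print(convolve_1d(w, x))         # [34, 52]
# For a symmetric filter, flipping changes nothing, so both operations agree
print(convolve_1d(w_sym, x) == cross_correlate_1d(w_sym, x))  # True
```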

---

### **Convolution in CNN for images (2D)**

The 2D convolution of a filter $\mathbf{W}$ with an input $\mathbf{X}$ at position $(i, j)$ is defined as:

$$
[\mathbf{W} \circledast \mathbf{X}](i,j) = \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} w_{u,v} \, x_{i+u,j+v}
$$

where:
- $\mathbf{ W} $ is the filter of size $( H \times W )$
- $ \mathbf{ X} $ is the input image
- $ w_{u,v} $ are the filter weights
- $ x_{i+u,j+v} $ are the input pixel values

For example, consider convolving a 3 × 3 input $ \mathbf{ X} $  with a 2 × 2 kernel $\mathbf{ W}$ to compute a 2 × 2 output $\mathbf{ Y}$

$$
\mathbf{ Y} =
\begin{bmatrix}
x_1 & x_2 & x_3 \\
x_4 & x_5 & x_6 \\
x_7 & x_8 & x_9
\end{bmatrix}
\;\; \circledast \;\;
\begin{bmatrix}
w_1 & w_2 \\
w_3 & w_4
\end{bmatrix}
=
\begin{bmatrix}
(w_1x_1 + w_2x_2 + w_3x_4 + w_4x_5) & (w_1x_2 + w_2x_3 + w_3x_5 + w_4x_6) \\
(w_1x_4 + w_2x_5 + w_3x_7 + w_4x_8) & (w_1x_5 + w_2x_6 + w_3x_8 + w_4x_9)
\end{bmatrix}
$$

<img src="https://github.com/probml/probml-notebooks/blob/main/images/d2l-correlation.png?raw=true" height=200>

*Source: Embedded from GitHub repo of textbook (PML: An Introduction by Murphy). See textbook for original source.*

[Convolution in 2D Discrete Functions- GIF](https://en.wikipedia.org/wiki/Convolution#/media/File:2D_Convolution_Animation.gif)



#### Cross-correlation code: [Source](https://colab.research.google.com/github/d2l-ai/d2l-pytorch-colab/blob/master/chapter_convolutional-neural-networks/conv-layer.ipynb#scrollTo=3b5cdda0)

In [None]:
!pip install d2l==1.0.3

In [4]:
import torch
from torch import nn
#from d2l import torch as d2l

In [5]:
def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

In [6]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)

tensor([[19., 25.],
        [37., 43.]])

#### Kernel interpretation- example of edge detection
[Code Source](https://colab.research.google.com/github/d2l-ai/d2l-pytorch-colab/blob/master/chapter_convolutional-neural-networks/conv-layer.ipynb#scrollTo=3b5cdda0)

In [10]:
#Interpretation of kernels- example of edge detection
#Create an image with zeros for black and 1 for white
X = torch.ones((6, 8))
X[:, 2:6] = 0
X

tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.]])

In [13]:
#Create kernel of size 1X 2
K = torch.tensor([[1.0, -1.0]])
print(K)

tensor([[ 1., -1.]])


In [None]:
#Do cross-correlation
Y = corr2d(X, K)
Y
#Notice that the output shows the location of the edges
#The kernel served as a finite difference operator

#### **Kernel as an operator**
In the above example, the kernel served as a finite difference operator.
At location $(i,j)$ (row $i$, column $j$) it computes $x_{i,j} - x_{i,(j+1)}$, the difference between horizontally adjacent pixels, which is a discrete approximation of the first derivative in the horizontal direction (for a function $f(i,j)$, its derivative $-\partial_j f(i,j) = \lim_{\epsilon \to 0} \frac{f(i,j) - f(i,j+\epsilon)}{\epsilon}$).

---


### **Convolution or cross-correlation in CNN?**
In CNN:
* We are convolving an input signal (2D function) with a weight matrix (2D function)
* The weight matrix is called 'kernel' or 'filter'.
* **Model fitting** in context of CNN involves learning the weight matrix (equivalent to learning the weights on the arrows in MLP).
  * Because the weights are learned, flipping the matrix (as the definition of convolution requires) is unnecessary: the network would simply learn the flipped weights. Cross-correlation therefore sufficiently *describes* the operation of a CNN.  
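
As a small illustration of the difference, the sketch below compares cross-correlation with true convolution (cross-correlation with a flipped kernel). The helper `conv2d_flipped` is a hypothetical name introduced here; `corr2d` is repeated so the block is self-contained.

```python
import torch

def corr2d(X, K):
    """2D cross-correlation (valid mode), the operation used in CNN layers."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def conv2d_flipped(X, K):
    """True 2D convolution: flip the kernel along both axes, then cross-correlate."""
    return corr2d(X, torch.flip(K, dims=(0, 1)))

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])

Y_corr = corr2d(X, K)          # tensor([[19., 25.], [37., 43.]])
Y_conv = conv2d_flipped(X, K)  # tensor([[ 5., 11.], [23., 29.]])
print(Y_corr)
print(Y_conv)
```

The two outputs differ for a fixed kernel, but a layer that learns its kernel can reach either result with either operation, since the flip is absorbed into the learned weights.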

---

#### **Model fitting in CNN entails "learning" a kernel**
[Code Source](https://colab.research.google.com/github/d2l-ai/d2l-pytorch-colab/blob/master/chapter_convolutional-neural-networks/conv-layer.ipynb#scrollTo=3b5cdda0)

In the above example code we specified the kernel ( K = torch.tensor([[1.0, -1.0]]) ).
The objective of a CNN is to learn a kernel that provides a good fit to the data (supervised learning, i.e., we will have samples of X and Y, and we want to fit kernels to this data). Below we write code to learn the kernel given one sample image. In reality, we will have many samples; the code below is just a demonstration of using the built-in NN classes from PyTorch to 'learn' a filter (we know the 'accurate' filter should be a finite difference operator: is the CNN capable of learning this?).

In [16]:
#Suppose we have one sample and its corresponding label; let's learn the filter for this dataset
print("X=", X)
print("Y=", Y)

X= tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.]])
Y= tensor([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.]])


In [17]:
#In the __init__ constructor method, weight and bias are declared as the two model parameters. The 'forward' function does the forward pass (also called forward propagation) by calling the corr2d function and adding the bias.
class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias

In [None]:
# Construct a two-dimensional convolutional layer with 1 output channel and a
# kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = nn.LazyConv2d(out_channels=1, kernel_size=(1, 2), bias=False)  # use PyTorch's built-in layer (instead of the Conv2D class defined above)
# Note: nn.Conv2d requires the number of input channels to be specified explicitly.
# nn.LazyConv2d is an nn.Conv2d module with lazy initialization of the in_channels
# argument, i.e., in_channels is inferred from input.size(1). The attributes that
# are lazily initialized are weight and bias.

# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, channel, height, width), example refers to the batch
# size (number of examples in the batch), here batch size and the number of channels are both 1
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))

lr = 3e-2  # Learning rate

for i in range(10):
    Y_hat = conv2d(X)  # forward pass
    l = (Y_hat - Y) ** 2  # squared-error loss for each output element
    conv2d.zero_grad()  # reset gradients to zero
    l.sum().backward()  # apply backprop to calculate gradients of the loss
    # Update the kernel with a gradient descent (SGD) step
    conv2d.weight.data[:] -= lr * conv2d.weight.grad
    if (i + 1) % 2 == 0:
        print(f'epoch {i + 1}, loss {l.sum():.3f}')


In [None]:
conv2d.weight.data.reshape((1, 2))
## Notice that the learned kernel is close to the kernel [1, -1] that we pre-defined to generate the data Y.
## In the above example we used just one data sample. Would this filter be able to detect edges in arbitrary images? No: a kernel of shape (1, 2) only responds to differences between horizontally adjacent pixels. In reality, the model will be trained using multiple samples of different types of edges for an edge detection task, to generate better filters.

### **Rewriting convolution as matrix-vector multiplication**
We can rewrite the above as a matrix-vector multiplication by flattening the 2d
matrix $\mathbf{X}$ into a 1d vector $x$, multiplying by a Toeplitz-like matrix $\mathbf{C}$ derived from the kernel $\mathbf{W}$, and recovering the 2 × 2 output by reshaping the 4 × 1 vector $y$ back to $\mathbf{Y}$, as follows:

$$
y = Cx =
\begin{bmatrix}
w_1 & w_2 & 0   & w_3 & w_4 & 0   & 0   & 0   & 0 \\
0   & w_1 & w_2 & 0   & w_3 & w_4 & 0   & 0   & 0 \\
0   & 0   & 0   & w_1 & w_2 & 0   & w_3 & w_4 & 0 \\
0   & 0   & 0   & 0   & w_1 & w_2 & 0   & w_3 & w_4
\end{bmatrix}
\begin{bmatrix}
x_1 \\
x_2 \\
x_3 \\
x_4 \\
x_5 \\
x_6 \\
x_7 \\
x_8 \\
x_9
\end{bmatrix}
=
\begin{bmatrix}
w_1x_1 + w_2x_2 + w_3x_4 + w_4x_5 \\
w_1x_2 + w_2x_3 + w_3x_5 + w_4x_6 \\
w_1x_4 + w_2x_5 + w_3x_7 + w_4x_8 \\
w_1x_5 + w_2x_6 + w_3x_8 + w_4x_9
\end{bmatrix}
$$

$$
\mathbf{ Y} =
\begin{bmatrix}
y_{1} & y_{2} \\
y_{3} & y_{4}
\end{bmatrix}
$$
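
The construction above can be checked numerically. The following is a minimal sketch that builds the Toeplitz-like matrix for a "valid" cross-correlation; the function name `toeplitz_like` is hypothetical, and the example values of $\mathbf{X}$ and $\mathbf{W}$ are the same ones used in the `corr2d` example earlier.

```python
import torch

def toeplitz_like(K, in_h, in_w):
    """Build C such that C @ X.flatten() equals the valid cross-correlation of X with K."""
    kh, kw = K.shape
    out_h, out_w = in_h - kh + 1, in_w - kw + 1
    C = torch.zeros((out_h * out_w, in_h * in_w))
    for i in range(out_h):
        for j in range(out_w):
            row = i * out_w + j                      # index of output y_{i,j} in the flattened y
            for u in range(kh):
                for v in range(kw):
                    col = (i + u) * in_w + (j + v)   # index of x_{i+u,j+v} in the flattened x
                    C[row, col] = K[u, v]
    return C

X = torch.arange(9.0).reshape(3, 3)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])

C = toeplitz_like(K, 3, 3)             # shape (4, 9)
y = C @ X.reshape(-1)                  # matrix-vector product
Y = y.reshape(2, 2)                    # reshape back to the 2 x 2 output
print(Y)  # tensor([[19., 25.], [37., 43.]])
```

Each row of `C` contains the same four kernel weights, which is another way to see the weight sharing of a convolution layer.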

---

### **Interpretation of convolution in image data- feature detection or feature map**
* 2d convolution is equivalent to template matching, i.e., the output at a point $(i, j)$ will be large if the corresponding image patch centered on $(i, j)$ is similar to $\mathbf{W}$.
* If the template $\mathbf{W}$, corresponds to an oriented edge, then convolving with it will cause the output heat map to “light up” in regions that contain edges that match that orientation (see Figure below).
* More generally, we can think of convolution as a form of feature detection.
* The resulting output $\mathbf{Y} = \mathbf{W} \circledast \mathbf{X}$ is therefore called a feature map.
<img src="https://raw.githubusercontent.com/probml/pml-book/main/book1-figures/Figure_14.6.png" height=200>

*Figure: Convolving a 2d image (left) with a 3 × 3 filter (middle) produces a 2d response map (right). The bright spots of the response map correspond to locations in the image which contain diagonal lines sloping down and to the right.*
Source: Embedded from GitHub repo of textbook (PML: An Introduction by Murphy). See textbook for original source.

---

### **Boundary conditions and padding**
Valid convolution:
* In the above equations, convolving a $f_h \times f_w$ filter over an image of size $x_h \times x_w$ produces an output of size $(x_h - f_h + 1) \times (x_w - f_w + 1)$.
* It is called valid convolution because we only apply the filter to “valid” parts of the input, i.e., we don’t let it “slide off the ends”.

Same convolution:
* If we want the output to have the same size as the input, we can use zero-padding, which means we add a border of 0s to the image.
* This is called same convolution.
* Setting the padding size to $p_h = \frac{f_h - 1}{2}, p_w = \frac{f_w - 1}{2}$ (for odd filter sizes) gives an output of size $(x_h + 2p_h - f_h + 1) \times (x_w + 2p_w - f_w + 1)=(x_h \times x_w)$.

<img src= "https://github.com/probml/pml-book/blob/main/book1-figures/Figure_14.7.png?raw=true" height=400>

*Figure: Same-convolution (using zero-padding) ensures the output is the same size as the input*

Source: Embedded from GitHub repo of textbook (PML: An Introduction by Kevin Murphy). See textbook for original source.
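
As a quick check of the two output sizes, the sketch below applies PyTorch's `nn.Conv2d` with and without zero-padding. The 5 × 7 input and 3 × 3 filter are assumed example sizes.

```python
import torch
from torch import nn

X = torch.rand(1, 1, 5, 7)  # (batch, channels, height, width)

valid_conv = nn.Conv2d(1, 1, kernel_size=3, padding=0)  # valid convolution: no padding
same_conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # same convolution: p = (3 - 1) / 2 = 1

valid_shape = tuple(valid_conv(X).shape)
same_shape = tuple(same_conv(X).shape)
print(valid_shape)  # (1, 1, 3, 5): shrinks by f - 1 along each spatial axis
print(same_shape)   # (1, 1, 5, 7): same spatial size as the input
```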

---

### **Strided convolution**
* In the approach discussed above, neighboring outputs can be similar in value because their inputs overlap (as the kernel slides over the input one pixel at a time).
* To reduce this redundancy (and speed up computation), we can apply the filter only at every $s^{th}$ position, skipping the positions in between. This is called strided convolution.
* This also makes the output smaller than the input.
  * If we have input data of size $x_h \times x_w$, a kernel of size $f_h \times f_w$, zero-padding on all sides of size $p_h$ and $p_w$, and strides of size $s_h$ and $s_w$, then the output has the following size:

$$
\left\lfloor \frac{x_h + 2p_h - f_h + s_h}{s_h} \right\rfloor \times \left\lfloor \frac{x_w + 2p_w - f_w + s_w}{s_w} \right\rfloor
$$
<div style="display: flex; align-items: center;">
  <img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_14.8_A.png?raw=true" height="400" width ="400" style="margin-right: 2px;">
  <img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_14.8_B.png?raw=true" height="400" width ="400">
</div>

*Figure: Illustration of stride in 2d convolution on a 5 × 7 input using a 3 × 3 filter. (a) A stride of 1 creates an output of size 5 × 7. (b) A stride of 2 creates an output of size 3 × 4.  
Source: Embedded from GitHub repo of textbook (PML: An Introduction by Kevin Murphy). See textbook for original source.*
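
The output sizes quoted in the figure caption can be reproduced with a short sketch. It assumes the floor-division form of the output-size formula, $\lfloor (x + 2p - f + s)/s \rfloor$ per axis, and compares it with the shape produced by `nn.Conv2d`; the helper name `conv_output_size` is hypothetical.

```python
import torch
from torch import nn

def conv_output_size(x, f, p, s):
    """Output length along one axis: floor((x + 2p - f + s) / s)."""
    return (x + 2 * p - f + s) // s

x_h, x_w, f, p = 5, 7, 3, 1   # 5 x 7 input, 3 x 3 filter, zero-padding of 1
results = {}
for s in (1, 2):
    predicted = (conv_output_size(x_h, f, p, s), conv_output_size(x_w, f, p, s))
    layer = nn.Conv2d(1, 1, kernel_size=f, padding=p, stride=s)
    actual = tuple(layer(torch.rand(1, 1, x_h, x_w)).shape[-2:])
    results[s] = (predicted, actual)
    print(f"stride {s}: formula {predicted}, nn.Conv2d {actual}")
# stride 1 gives (5, 7) and stride 2 gives (3, 4) from both the formula and the layer
```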

---

### **Multiple layers**
The CNN can have multiple layers (deep network).
As we go deeper into the network, a pixel in a layer has a larger **receptive field** (the region of the input image that the pixel represents).
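
The growth of the receptive field can be sketched with a simple recursion along one axis. This recursion is not stated in the source; it is an assumption based on the standard relation that each layer adds $(k - 1)$ steps, where a step is measured in input pixels and grows with the product of the earlier strides. The function name `receptive_field` is hypothetical.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output."""
    r, jump = 1, 1  # receptive field size, and spacing (in input pixels) between adjacent units
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

print(receptive_field([(3, 1)]))                  # 3
print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
print(receptive_field([(3, 2), (3, 2)]))          # 7: strides make the field grow faster
```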

---

### **Multiple input channels**
* The above illustrations used images in gray-scale.
* The input can have multiple channels (e.g., RGB, or hyper-spectral bands for satellite images).
  * This would extend the convolution to 3d:
 $\mathbf{W}$ will be a 3d weight matrix or [tensor](https://colab.research.google.com/drive/1pPB_YTQ93pXyXctHPP-TMBN5woWJvV6J).
* We compute the output by convolving channel $c$ of the input with kernel $\mathbf{W}_{:,:,c}$, and then summing over channels:
$$
z_{i,j} = b + \sum_{u=0}^{H - 1} \sum_{v=0}^{W - 1} \sum_{c=0}^{C - 1} x_{s i + u,\, s j + v,\, c} w_{u, v, c}
$$
where $s$ is the stride (assumed to be the same for both height and width for simplicity), and $b$ is the bias term.
<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_14.9.png?raw=true" height="200" width ="400">

*Figure: Illustration of 2d convolution applied to an input with 2 channels (no padding, stride 1).
Source: Embedded from GitHub repo of textbook (PML: An Introduction by Kevin Murphy). See textbook for original source.*
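
A minimal sketch of the multi-channel sum, using PyTorch's `F.conv2d` (which computes cross-correlation). The input and kernel values are assumed example numbers. Note that PyTorch stores inputs as (batch, C, H, W) and weights as (D, C, H, W), which differs from the $(H, W, C)$ index order used in the equation above.

```python
import torch
import torch.nn.functional as F

# Two input channels, each 3 x 3
X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                  [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
# One 2 x 2 kernel slice per input channel
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]],
                  [[1.0, 2.0], [3.0, 4.0]]])

# All channels at once: add a batch axis to X and an output-channel axis to K
Z = F.conv2d(X.unsqueeze(0), K.unsqueeze(0)).squeeze()

# Same result by correlating each channel separately and summing over channels
Z_per_channel = sum(
    F.conv2d(X[c].reshape(1, 1, 3, 3), K[c].reshape(1, 1, 2, 2)).squeeze()
    for c in range(X.shape[0])
)
print(Z)  # tensor([[ 56.,  72.], [104., 120.]])
```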

---

### **Multiple output channels**
Each filter (the weight matrix) can detect a single kind of feature. We can use multiple filters to detect multiple kinds of features by making $\mathbf{W}$ a 4D weight tensor. Then the output of the convolution will be multiple feature maps (each feature map will be equivalent to a channel for the next convolution layer).
Specifically,
* The filter to detect feature type $d$ in input channel $c$ is stored in $\mathbf{W}_{:,:,c,d}$, and the output is

$$
z_{i,j,d} = b_d + \sum_{u=0}^{H - 1} \sum_{v=0}^{W - 1} \sum_{c=0}^{C - 1} x_{s i + u,\, s j + v,\, c} w_{u, v, c, d}
$$

* For every convolution layer, the $c$ refers to the channels in the previous layer.
  * For the first convolution layer, $c$ refers to channels in the input image.
  * For subsequent convolution layers, $c$ refers to the feature maps in the previous layer
* Thus, a neuron located in row $i$, column $j$ of the feature map $d$ in a given convolutional layer $l$ is connected to the outputs of the neurons in the previous layer $l - 1$, located in rows $i \times s_h$ to $i \times s_h + f_h - 1$ and columns $j \times s_w$ to $j \times s_w + f_w - 1$, across all feature maps in layer $l - 1$.

*Note: In Tensorflow, a filter for 2d CNNs has shape (H,W,C,D), and a minibatch of feature maps has shape (batch-size, image-height, image-width, image-channels); this is called NHWC format. Other systems use different data layouts.*

<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_14.10.png?raw=true" height="200" width ="400">

*Figure: Illustration of a CNN with 2 convolutional layers. The input has 3 color channels. The feature
maps at internal layers have multiple channels. The cylinders correspond to hypercolumns, which are feature
vectors at a certain location.
Source: Embedded from GitHub repo of textbook (PML: An Introduction by Kevin Murphy). See textbook for original source.*
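
The shapes involved in a layer with multiple output channels can be inspected with a short sketch. The sizes (3 input channels, 8 filters, 3 × 3 kernels, 32 × 32 images) are assumed values for illustration; PyTorch uses the NCHW layout and stores the filter tensor as (D, C, H, W).

```python
import torch
from torch import nn

C_in, D_out, k = 3, 8, 3   # assumed: RGB input, 8 filters, 3 x 3 kernels
conv = nn.Conv2d(in_channels=C_in, out_channels=D_out, kernel_size=k)

X = torch.rand(4, C_in, 32, 32)   # batch of 4 images in NCHW layout
Z = conv(X)

print(tuple(conv.weight.shape))  # (8, 3, 3, 3): one 3 x 3 slice per (filter d, input channel c)
print(tuple(conv.bias.shape))    # (8,): one bias b_d per output channel
print(tuple(Z.shape))            # (4, 8, 30, 30): 8 feature maps per image
num_params = sum(p.numel() for p in conv.parameters())
print(num_params)                # 8*3*3*3 + 8 = 224
```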


---



### **1 × 1 (pointwise) convolution layer**
Sometimes we just want to take a weighted combination of the features. This can be done by convolving with a kernel of size 1x1. Such a layer is called 1 × 1 convolution or pointwise convolution.
$$
z_{i,j,d} = b_d + \sum_{c=0}^{C - 1} x_{i,j,c} \cdot w_{0,0,c,d}
$$

<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_14.11.png?raw=true" height="150" width ="400">

*Figure: Mapping 3 channels to 2 using convolution with a filter of size 1 × 1 × 3 × 2.
Source: Embedded from GitHub repo of textbook (PML: An Introduction by Kevin Murphy). See textbook for original source.*
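
The following sketch checks that a 1 × 1 convolution is the same as a weighted combination of the channels at each pixel, as in the equation above. It maps 3 channels to 2, as in the figure; the 4 × 4 spatial size is an assumed example value.

```python
import torch
from torch import nn

torch.manual_seed(0)
conv1x1 = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=1)
X = torch.rand(1, 3, 4, 4)  # (batch, C, H, W)
Z = conv1x1(X)

# Same computation written as a per-pixel weighted sum over channels
W = conv1x1.weight.reshape(2, 3)   # (D, C), since the kernel is 1 x 1
Z_manual = torch.einsum('dc,nchw->ndhw', W, X) + conv1x1.bias.reshape(1, 2, 1, 1)

print(tuple(Z.shape))                          # (1, 2, 4, 4): spatial size is unchanged
print(torch.allclose(Z, Z_manual, atol=1e-6))  # True
```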

----

## **Activation function**
Note that convolutions are mainly linear transformations of the input. To add non-linearity, each element (pixel) of the convolution output is passed through an activation function.

---

## **Pooling layers**
Equivariance property of convolution: Convolution preserves information about the location of input features.
In some cases, we want the output to be invariant to the location, e.g., for image classification we want to know if an object of interest (e.g., a face) is present anywhere in the image.
This can be achieved by pooling.

**Max pooling**: computes the maximum over its incoming values (see Figure below).

<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_14.12.png?raw=true" height="150" width ="400">

*Figure: Illustration of maxpooling with a 2x2 filter and a stride of 1.
Source: Embedded from GitHub repo of textbook (PML: An Introduction by Kevin Murphy). See textbook for original source.*

**Average pooling**: replaces the max by the mean.
In either case, the output neuron has the same response no matter where the input pattern occurs in its receptive field. Note: pooling is applied to each feature map (channel) independently.

**Global average pooling**: Average over all the locations in a feature map, i.e., a $H \times W \times D$ feature map will be converted into a $1 \times 1 \times D$ dimensional feature map; this can be reshaped to a $D$-dimensional vector, which can be passed into a fully connected (FFNN) layer to map it to a $C$-dimensional vector ($C$ classes for a classification problem) before passing into a softmax output. The use of global average pooling implies we can apply the classifier to an image of any size, since the final feature map will always be converted to a fixed $D$-dimensional vector before being mapped to a distribution over the $C$ classes.
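
The three pooling variants can be sketched with PyTorch as follows. The 3 × 3 input and the feature map sizes are assumed example values; global average pooling is written here as a mean over the two spatial axes.

```python
import torch
import torch.nn.functional as F

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]).reshape(1, 1, 3, 3)

max_out = F.max_pool2d(X, kernel_size=2, stride=1)
avg_out = F.avg_pool2d(X, kernel_size=2, stride=1)
print(max_out.squeeze())  # tensor([[4., 5.], [7., 8.]])
print(avg_out.squeeze())  # tensor([[2., 3.], [5., 6.]])

# Global average pooling: feature maps of any spatial size become a D-dimensional vector
pooled_shapes = []
for h, w in [(8, 8), (13, 20)]:
    feature_maps = torch.rand(1, 16, h, w)   # D = 16 channels
    pooled = feature_maps.mean(dim=(2, 3))   # average over height and width
    pooled_shapes.append(tuple(pooled.shape))
print(pooled_shapes)  # [(1, 16), (1, 16)]
```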

---

## **Flatten, Fully Connected, Softmax layers**
It is also typical to pass the outputs of the convolution layers into a fully connected feedforward network. The inputs to fully connected networks are 1D. Thus, a Flatten layer is added to convert a tensor into a 1D vector. If the task of the CNN is classification, it is typical to have a softmax layer as the last layer.  

---

# **Sample CNN architecture** - Putting all layers together

<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_14.13.png?raw=true" height="200" width ="600">

*Figure: A simple CNN for classifying images.
Source: Embedded from GitHub repo of textbook (PML: An Introduction by Kevin Murphy). See textbook for original source.*
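
As a rough sketch of how these layers fit together in PyTorch, the model below stacks convolution, activation, pooling, flatten, and fully connected layers. It is not the architecture in the figure: the layer sizes, the 28 × 28 grayscale input, and the 10 classes are all assumed values for illustration.

```python
import torch
from torch import nn

num_classes = 10  # assumed number of classes
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 1 x 28 x 28 -> 8 x 28 x 28
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                 # -> 8 x 14 x 14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # -> 16 x 14 x 14
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                 # -> 16 x 7 x 7
    nn.Flatten(),                                # -> vector of length 16 * 7 * 7 = 784
    nn.Linear(16 * 7 * 7, num_classes),          # class scores (logits)
)

X = torch.rand(4, 1, 28, 28)           # batch of 4 grayscale images
logits = model(X)
probs = torch.softmax(logits, dim=1)   # softmax turns scores into class probabilities
print(tuple(probs.shape))              # (4, 10)
```

In practice the softmax is often folded into the loss function (e.g., `nn.CrossEntropyLoss` takes logits), which is why it is applied outside the model in this sketch.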

[CNN Visualization tool](https://poloclub.github.io/cnn-explainer/)

---

# **Normalization (regularization)**
Why normalization?
  A common design pattern is to create a CNN by alternating convolutional layers with max pooling
layers, followed by a final linear classification layer at the end, which would be ok for shallow CNNs. For deeper models this creates issues of vanishing or exploding gradient.

Solution: **Normalize each layer.**

---

## **Batch normalization**
Replace the activation vector $z_n$ (n is some sample image in that batch) by $\tilde{z}_n$
$$
\begin{aligned}
\tilde{z}_n &= \gamma \odot \hat{z}_n + \beta \\
\hat{z}_n &= \frac{z_n - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\
\mu_B &= \frac{1}{|B|} \sum_{z \in B} z \\
\sigma_B^2 &= \frac{1}{|B|} \sum_{z \in B} (z - \mu_B)^2
\end{aligned}
$$
where
* $B$ is the minibatch containing sample $n$ (samples may be referred to as examples too),
* $\mu_B$ is the mean of the activations for this batch, $\sigma_B^2$ is the corresponding variance,
  * Note: When applied to a convolutional layer, we average across spatial locations and across examples (samples), but not across channels (so the length of μ is the number of channels). When applied to a fully connected layer, we just average across examples (so the length of μ is the width (number of nodes) of the layer).
* $\hat{z}_n$ is the standardized activation vector,
* $\odot$ is element-wise multiplication, e.g., if
$$
z_{n} = [z_{1}, z_{2}, \dots, z_{k}]_n
\quad\text{and}\quad \gamma = [\gamma_{1}, \gamma_{2}, \dots, \gamma_{k}]
$$
then
$$
\gamma \odot z_{n} = [\gamma_{1} z_{1}, \gamma_{2} z_{2}, \dots, \gamma_{k} z_{k}]_n
$$
where $k$ is the number of feature maps (or channels),

* $\tilde{z}_n$ is the shifted and scaled version (the output of the BN layer),
* β and γ are learnable parameters for this layer, and
* ϵ > 0 is a small constant.   

**As this transformation is differentiable, we can easily pass gradients back to the input of the layer and to the BN parameters β and γ.**
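
The batch norm equations can be written out directly and compared with PyTorch's `nn.BatchNorm2d` in training mode. This is a minimal sketch: the tensor sizes are assumed, and $\gamma = 1$, $\beta = 0$ are used because those are the initial values of the learnable parameters in `nn.BatchNorm2d`.

```python
import torch
from torch import nn

torch.manual_seed(0)
Z = torch.rand(4, 3, 5, 5)  # (batch, channels, height, width)
eps = 1e-5

# Per-channel statistics: average over batch and spatial axes, but not over channels
mu_B = Z.mean(dim=(0, 2, 3), keepdim=True)
var_B = Z.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
Z_hat = (Z - mu_B) / torch.sqrt(var_B + eps)

gamma = torch.ones(1, 3, 1, 1)   # scale, initial value
beta = torch.zeros(1, 3, 1, 1)   # shift, initial value
Z_tilde = gamma * Z_hat + beta

bn = nn.BatchNorm2d(num_features=3, eps=eps)       # a new module is in training mode by default
print(torch.allclose(Z_tilde, bn(Z), atol=1e-5))   # True
print(tuple(mu_B.flatten().shape))                 # (3,): one mean per channel
```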

### Note:
* During training: While the data in the input layer is static, the activations of the other layers keep changing as the weights are updated; thus, the mean and variance used for normalization are recomputed for every minibatch during training.
* During testing: If we have a single data point, we cannot calculate a mean and variance. Thus, it is typical to compute the mean and variance for every layer at the end of training, using all samples, and freeze them for use on test data.
  * For speed, a frozen batch norm layer can be combined with the previous layer (**fused batchnorm**).    

**Fused batchnorm**
* Suppose previous layer computes $\mathbf{XW+b}$, then the BN layer will be $\gamma \odot \frac{\mathbf{XW+b}-\mu}{\sigma} +\beta$  
* Suppose we rewrite $\mathbf{W'}=\gamma \odot \frac{\mathbf{W}}{\sigma}$ and $\mathbf{b'}=\gamma \odot \frac{\mathbf{b}-\mu}{\sigma}+\beta $   
* Then, we can write the combined layers as $\mathbf{XW'+b'}$
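
The fusion can be verified numerically for a fully connected layer. In this sketch all values are random placeholders; `sigma` stands for the frozen standard deviation and is kept positive by construction.

```python
import torch

torch.manual_seed(0)
X = torch.rand(5, 4)            # 5 samples, 4 input features
W = torch.rand(4, 3)            # previous layer weights
b = torch.rand(3)               # previous layer bias
mu = torch.rand(3)              # frozen batch norm mean
sigma = torch.rand(3) + 0.5     # frozen batch norm standard deviation (kept positive)
gamma, beta = torch.rand(3), torch.rand(3)

# Two separate steps: linear layer followed by frozen batch norm
two_step = gamma * ((X @ W + b) - mu) / sigma + beta

# Fused: fold the batch norm into the weights and bias
W_fused = W * (gamma / sigma)               # scales column d of W by gamma_d / sigma_d
b_fused = gamma * (b - mu) / sigma + beta
fused = X @ W_fused + b_fused

print(torch.allclose(two_step, fused, atol=1e-5))  # True
```

The fused layer needs a single matrix multiplication at inference time, which is where the speed-up comes from.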


**Issues with batch normalization** : struggles when the batch size is small, since the estimated mean and variance parameters can be unreliable  
**Solution**: Compute the mean and variance by pooling statistics across other dimensions of the tensor, but not across samples in the batch.

---

## **Layer/Instance/Group normalization**
Let $z_i$ refer to the $i$’th element of a tensor; in the case of 2d images, the index $i$ has 4 components: batch, height, width, and channel, $i = (i_N, i_H, i_W, i_C)$. Compute the mean and standard deviation for each index $i$ as follows:
$$
\mu_i = \frac{1}{|S_i|} \sum_{k \in S_i} z_k, \quad
\sigma_i = \sqrt{\frac{1}{|S_i|} \sum_{k \in S_i} (z_k - \mu_i)^2 + \epsilon}
$$
where $S_i$ is the set of elements to average over. We then compute
$$
\hat{z}_i = \frac{z_i - \mu_i}{\sigma_i} \quad \text{and} \quad \tilde{z}_i = \gamma_c \hat{z}_i + \beta_c
$$
where $c$ is the channel corresponding to index $i$.  
**Batch norm** (discussed above): pool over batch, height, width   
**Layer normalization**: pool over channel, height, width   
**Instance normalization**: separately normalize each channel and sample (so pool over height and width within each sample and channel).   
**Group normalization**: pool over all locations whose channel is in the same group as $i$

<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_14.14.png?raw=true" height="150" width ="600">

*Figure: Illustration of different activation normalization methods for a CNN. Each subplot shows a
feature map tensor, with $N$ as the batch axis, $C$ as the channel axis, and $(H, W)$ as the spatial axes. The pixels in blue are normalized by the same mean and variance, computed by aggregating the values of these pixels. Left to right: batch norm, layer norm, instance norm, and group norm (with 2 groups of 3 channels).
Source: Embedded from GitHub repo of textbook (PML: An Introduction by Kevin Murphy). See textbook for original source.*
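
The four methods differ only in the set of axes that are pooled. The sketch below makes this explicit with a hypothetical helper `normalize` and compares each result with the corresponding PyTorch function. The learnable $\gamma_c$ and $\beta_c$ are left out (equivalent to $\gamma_c = 1$, $\beta_c = 0$), and the tensor sizes and the number of groups are assumed values.

```python
import torch
import torch.nn.functional as F

def normalize(z, dims, eps=1e-5):
    """Standardize z using the mean and variance pooled over the given axes."""
    mu = z.mean(dim=dims, keepdim=True)
    var = z.var(dim=dims, unbiased=False, keepdim=True)
    return (z - mu) / torch.sqrt(var + eps)

torch.manual_seed(0)
N, C, H, W, G = 2, 6, 4, 4, 2    # G = number of groups (assumed)
z = torch.rand(N, C, H, W)

batch_norm = normalize(z, dims=(0, 2, 3))     # pool over batch, height, width
layer_norm = normalize(z, dims=(1, 2, 3))     # pool over channel, height, width
instance_norm = normalize(z, dims=(2, 3))     # pool over height, width
# Group norm: split channels into G groups, pool over (channels in group, height, width)
group_norm = normalize(z.reshape(N, G, C // G, H, W), dims=(2, 3, 4)).reshape(N, C, H, W)

print(torch.allclose(batch_norm, F.batch_norm(z, None, None, training=True), atol=1e-5))
print(torch.allclose(layer_norm, F.layer_norm(z, (C, H, W)), atol=1e-5))
print(torch.allclose(instance_norm, F.instance_norm(z), atol=1e-5))
print(torch.allclose(group_norm, F.group_norm(z, G), atol=1e-5))
```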

---

## **Normalizer-free networks**
Normalizer-free networks use no normalization layers. Instead, they use adaptive gradient clipping to avoid training instabilities. The
resulting model was found to be faster to train, and more accurate, than other competitive models trained with batchnorm.
*Reference: A. Brock, S. De, S. L. Smith, and K. Simonyan. “High-Performance Large-Scale Image Recognition Without Normalization”. In: (2021). arXiv: 2102.06171 [cs.CV].*

---

# **Codes**
[Convolution_2D](https://colab.research.google.com/github/probml/pyprobml/blob/master/notebooks/book1/14/conv2d_torch.ipynb)  
[Batch_norm](https://colab.research.google.com/github/probml/pyprobml/blob/master/notebooks/book1/14/batchnorm_torch.ipynb)  
[Layer_norm](https://colab.research.google.com/github/probml/pyprobml/blob/master/notebooks/book1/14/layer_norm_torch.ipynb)

---

# **Common architectures**  
* [LeNet](https://colab.research.google.com/github/probml/pyprobml/blob/master/notebooks/book1/14/lenet_torch.ipynb): Designed to classify images of handwritten digits, and trained on the MNIST dataset. LeNet was developed by LeCun and colleagues in the 1990s and used in [ATMs for reading checks](https://en.wikipedia.org/wiki/LeNet).
 Classifying isolated digits is of limited applicability. In the real world, people usually write strings of digits or other letters. This requires both segmentation and classification. A convolutional neural network combined with a model similar to a conditional random field was designed to solve this problem. The system was deployed by the US Postal Service.
  * Reference: Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. “Gradient-Based Learning Applied to Document Recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324. See this reference for a more detailed account of the system.
* [AlexNet](https://d2l.ai/chapter_convolutional-modern/alexnet.html) (2012)
  * Was the winner of the 2012 ImageNet challenge, achieving a top-5 error of 15%, a dramatic improvement over the runner-up, which had 26%.
  * Much deeper than LeNet and used ReLU instead of sigmoid (made possible by improved compute capacity, and by ReLU, which helps avoid vanishing/exploding gradient issues).
  * The last two layers have more than 40 million parameters, trained on a dataset of 60 thousand images (in the linked D2L example)
    * But training and validation loss are nearly identical
    * Because of regularization, such as dropout, that avoids overfitting

---

## **Neural architecture search (NAS) or AutoML**
* Method to solve for the optimal architecture (layers, stride, learning rate, number of channels etc.).
* It is also called AutoML as it is an automated search process.
* It can be setup as a multi-objective optimization problem - to optimize accuracy, model size, training or inference speed, etc.
* Solution algorithms include derivative free optimization, or more efficient Bayesian optimization.
* But it is a computationally expensive approach given the nature of the problem. Research into more efficient solution methods is ongoing.

---