In [21]:
import numpy as np
import collections
from scipy.ndimage import convolve

# Neural Network<br><br>

A <span style="color:blue"><b>neural network unit</b></span> is a primitive neural network that consists of only the "input layer" and an output layer with a single output. It is represented pictorially as follows:<br>
<img style="float: center;" src="https://feiyiwang.github.io/notebook/jupyter/img/images_lec8_neuralnetunit.svg" width="300">

# <span style="color:green">Feed-forward Neural Networks (FFNNs)</span>

A <b>deep (feedforward) neural network</b> refers to a neural network that contains not only the input and output layers, but also hidden layers in between. For example, below is a deep feedforward neural network with 2 hidden layers, each consisting of 5 units:<br>
<img style="float: center;" src="https://feiyiwang.github.io/notebook/jupyter/img/images_lec8_deepneuralnet.svg" width="250">

### Why Deep Learning
1. lots of data - many significant problems can only be solved at scale
2. computational resources (esp. GPUs) - platforms/systems that support running deep (machine) learning algorithms at scale
3. large models are easier to train - large models can be successfully estimated with simple gradient based learning algorithms
4. flexible neural “lego pieces” - common representations, diversity of architectural choices

### Advantages
1. One of the main advantages of deep neural networks is that in many cases, they can learn to extract very complex and sophisticated features from just the raw features presented to them as their input. For instance, in the context of image recognition, neural networks can extract the features that differentiate a cat from a dog based only on the raw pixel data presented to them from images.
2. The initial few layers of a neural network typically capture the simpler and smaller features, whereas the later layers use information from these low-level features to identify more complex and sophisticated features.

### Example 1 Representation Power of Neural Networks
The logic NAND function is defined as $y=\text{NOT}(x_1 \text{ AND } x_2)$, where $x_1, x_2\in\{0,1\}$ are binary inputs ($1$ denotes True and $0$ denotes False).<br>
If the activation function is the step function $U(z)=\begin{cases} 0 & \text{ if } z \leqslant 0 \\ 1 & \text{ if } z >0 \end{cases}$, write a possible combination of $w_0, w_1, w_2$.

<img style="float: center;" src="https://feiyiwang.github.io/notebook/jupyter/img/images_lec8_logicNAND.svg" width="250">

<b>Solution:</b><br>
$\because NAND(x_1,x_2)=\begin{cases} 0 & \text{ if } (x_1,x_2)=(1,1) \\ 1 & \text{ otherwise }  \end{cases}$ and $U(z)=\begin{cases} 0 & \text{ if } z \leqslant 0 \\ 1 & \text{ if } z >0 \end{cases}$.<br>
$\therefore$ when $x_1=x_2=1, w_1x_1+w_2x_2+w_0=w_1+w_2+w_0\leqslant0$<br>
&nbsp;&nbsp;&nbsp;&nbsp;when $x_1=x_2=0, w_1x_1+w_2x_2+w_0=w_0>0$<br>
&nbsp;&nbsp;&nbsp;&nbsp;when $x_1=1, x_2=0, w_1x_1+w_2x_2+w_0=w_1+w_0>0$<br>
&nbsp;&nbsp;&nbsp;&nbsp;when $x_1=0, x_2=1, w_1x_1+w_2x_2+w_0=w_2+w_0>0$<br>
So, one example can be $w_0=6, w_1=-5, w_2=-5$
<br><br>
The NAND function is known as a universal logic function: it can be used to implement any Boolean function, including XOR, without the use of any other type of function (except for the identity and zero functions).<br>
Using only the NAND function as the basic neural network unit, together with De Morgan's laws from Boolean algebra, we can build the functions below. <br><br>
<b>AND  function:</b>
<img style="float: center;" src="https://feiyiwang.github.io/notebook/jupyter/img/images_lec8_logicAND.svg" width="250"><br>
Here, each pair of edges of the same color along with the nodes they are connected to form a neural network unit that represents the NAND function. (They do not represent values of inputs or outputs). In the example above,  𝑥1  and  𝑥2  are inputs to two NAND units, and are connected to output of respective units by the blue and orange arrows.<br>
<b>NOT function:</b>
<img style="float: center;" src="https://feiyiwang.github.io/notebook/jupyter/img/images_lec8_logicNOT.svg" width="250"><br>NOT(𝑥1 AND 𝑥1)=NOT 𝑥1, <br><br>
<b>OR function:</b>
<img style="float: center;" src="https://feiyiwang.github.io/notebook/jupyter/img/images_lec8_logicOR.svg" width="250"><br>NOT( NOT (𝑥1) AND NOT(𝑥2))=NOT( NOT(𝑥1 OR 𝑥2))=𝑥1 OR 𝑥2. <Br><br>
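
As a quick check of the constructions above, the sketch below implements the NAND unit with the example weights $w_0=6, w_1=-5, w_2=-5$ and composes it into NOT, AND and OR. The helper names (`step`, `nand`, `not_`, `and_`, `or_`) are illustrative choices and do not come from the lecture material.

```python
import itertools

def step(z):
    """Step activation U(z): 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def nand(x1, x2, w0=6, w1=-5, w2=-5):
    """A single neural network unit computing NAND with the example weights."""
    return step(w1 * x1 + w2 * x2 + w0)

def not_(x):
    # NOT(x AND x) = NOT x
    return nand(x, x)

def and_(x1, x2):
    # AND is NAND followed by NOT
    return not_(nand(x1, x2))

def or_(x1, x2):
    # De Morgan: NOT(NOT x1 AND NOT x2) = x1 OR x2
    return nand(not_(x1), not_(x2))

# Print the truth table: x1, x2, NAND, AND, OR
for x1, x2 in itertools.product([0, 1], repeat=2):
    print(x1, x2, nand(x1, x2), and_(x1, x2), or_(x1, x2))
```

Every function here is built from `nand` alone, which is the sense in which NAND is universal.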

### Hidden layer representation
<img src="img/8.1.png" width="500"/><img src="img/8.2.png" width="500"/>
<img src="img/8.3.png" width="500"/>
### Does orientation matter?
<img src="img/8.4.png" width="500"/><img src="img/8.5.png" width="500"/>
### Random hidden units
<img src="img/8.6.png" width="500"/><br><img src="img/8.7.png" width="500"/>
<img src="img/8.8.png" width="500"/>

### Summary
1. Units in neural networks are linear classifiers, just with different output non-linearity
2. The units in feed-forward neural networks are arranged in layers (input, hidden,..., output)
3. By learning the parameters associated with the hidden layer units, we learn how to represent examples (as hidden layer activations)
4. The representations in neural networks are learned directly to facilitate the end-to-end task
5. A simple classifier (output unit) suffices to solve complex classification tasks if it operates on the hidden layer representations

<br><br>
# <span style="color:green">Back-propagation Algorithm</span>

<img src="img/9.1.png" width="500"/>

Gradient Descent Update: <br>
$$w_1 \leftarrow w_1 - \eta\cdot\nabla_{w_1}Loss(y, f_L)$$

Recursive Expression: <br>
$$\frac{\partial Loss}{\partial w_1}=\frac{\partial z_1}{\partial w_1}\frac{\partial Loss}{\partial z_1}$$<br>
$$\because z_1=w_1x\quad\rightarrow \quad\frac{\partial z_1}{\partial w_1}=\frac{\partial (w_1\cdot x)}{\partial w_1}=x$$<br>
$$\therefore \frac{\partial Loss}{\partial w_1} = x \delta_1 \qquad(1)$$

Chain rule:<br>
$$\delta_1=\frac{\partial f_1}{\partial z_1}\cdot\frac{\partial z_2}{\partial f_1}\cdot\frac{\partial Loss}{\partial z_2}$$
$$\because f_1=tanh(z_1) \quad\rightarrow\quad\frac{\partial f_1}{\partial z_1}=(1-f_1^2)$$
$$\textrm{and } z_2=w_2 \cdot f_1\quad\rightarrow\quad\frac{\partial z_2}{\partial f_1}=w_2$$
$$\therefore \delta_1=(1-f_1^2)\cdot w_2 \cdot\frac{\partial Loss}{\partial z_2}=(1-f_1^2)\cdot w_2 \cdot \delta_2\qquad(2)$$

Final Expression of the Gradient:<br>
$$\because \textrm{(1) & (2)} \quad \rightarrow \quad \delta_{L-1}=(1-f_{L-1}^2)\cdot w_L \cdot \delta_L$$
$$\therefore \delta_L = \frac{\partial Loss}{\partial z_L}=\frac{\partial Loss}{\partial f_L} \cdot \frac{\partial f_L}{\partial z_L}=\frac{\partial (f_L-y)^2}{\partial f_L}\frac{\partial f_L}{\partial z_L}$$
$$=2(f_L-y)\frac{\partial f_L}{\partial z_L}=2(f_L-y)(1-f_L^2)$$
$$\therefore \frac{\partial Loss}{\partial w_1}=x \cdot \delta_1=x \cdot (1-f_1^2) \cdot w_2 \cdot \delta_2 $$
$$=x \cdot (1-f_1^2) \cdot w_2 \cdot (1-f_2^2)\cdot w_3 \cdot\delta_3$$ 
$$= ...$$
$$=x \cdot (1-f_1^2) \cdot (1-f_2^2)\cdot...\cdot (1-f_L^2)\cdot w_2 \cdot w_3 \cdot ...\cdot w_L \cdot(2(f_L-y))$$
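
The recursion can be checked numerically. The sketch below assumes the same setting as the derivation: a chain of scalar units with $z_1=w_1x$, $z_i=w_if_{i-1}$, $f_i=\tanh(z_i)$ and squared loss. The function names and the numeric values are arbitrary illustrations.

```python
import numpy as np

def forward(ws, x):
    """Return the activations f_1..f_L of a chain of scalar tanh units."""
    fs, inp = [], x
    for w in ws:
        inp = np.tanh(w * inp)
        fs.append(inp)
    return fs

def grad_w1(ws, x, y):
    """dLoss/dw_1 for Loss = (f_L - y)^2, computed with the delta recursion."""
    fs = forward(ws, x)
    delta = 2 * (fs[-1] - y) * (1 - fs[-1] ** 2)       # delta_L
    for i in range(len(ws) - 2, -1, -1):               # layers L-1 down to 1
        delta = (1 - fs[i] ** 2) * ws[i + 1] * delta   # delta_i from delta_{i+1}
    return x * delta                                   # equation (1)

ws, x, y = [0.5, -1.2, 0.8], 0.7, 0.3   # arbitrary illustrative values

def loss(w_list):
    return (forward(w_list, x)[-1] - y) ** 2

# Central finite difference in w_1 as an independent check
eps = 1e-6
numeric = (loss([ws[0] + eps] + ws[1:]) - loss([ws[0] - eps] + ws[1:])) / (2 * eps)
print(grad_w1(ws, x, y), numeric)
```

The two printed numbers should agree to several decimal places, which is a common sanity check for a hand-written back-propagation routine.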

### Optimization

1. Train 2 hidden units (Randomly initialized weights + zero offset)<br>
<img src="img/9.2.png" width="500"/><br>
After ~10 passes through the data<br>
<img src="img/9.3.png" width="500"/>
2. Train 10 hidden units (Randomly initialized weights + zero offset)<br>
<img src="img/9.4.png" width="500"/>
3. Cannot get solved via 2 hidden units<br>
<img src="img/9.5.png" width="250"/>
<img src="img/9.6.png" width="250"/>
<img src="img/9.7.png" width="250"/>
4. ReLU units
- Many recent architectures use ReLU units (cheap to evaluate, sparsity)
- Easier to learn as large models<br>
<img src="img/9.8.png" width="250"/>
<img src="img/9.9.png" width="250"/>
<img src="img/9.10.png" width="250"/>

### Issues: Exploding Gradient

<b>What Are Exploding Gradients?</b>
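An error gradient is the direction and magnitude calculated during the training of a neural network that is used to update the network weights in the right direction and by the right amount. Exploding gradients occur when large error gradients accumulate across layers and result in very large updates to the network weights during training.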

<b>What Is the Problem with Exploding Gradients?</b>
In deep multilayer Perceptron networks, exploding gradients can result in an unstable network that at best cannot learn from the training data and at worst results in NaN weight values that can no longer be updated.

<b>How do You Know if You Have Exploding Gradients?</b>
Signs to watch for:
1. The model is unable to get traction on your training data (e.g. poor loss).
2. The model is unstable, resulting in large changes in loss from update to update.
3. The model loss goes to NaN during training.

Signs to Confirm:
1. The model weights quickly become very large during training.
2. The model weights go to NaN values during training.
3. The error gradient values are consistently above 1.0 for each node and layer during training.

<b>How to Fix Exploding Gradients?</b>
1. Re-Design the Network Model
- exploding gradients may be addressed by redesigning the network to have fewer layers.

In RNNs, gradient exploding can occur given the inherent instability in the training of this type of network, e.g. via Backpropagation through time that essentially transforms the recurrent network into a deep multilayer Perceptron neural network.

2. Use Long Short-Term Memory Networks
- exploding gradients can be reduced by using LSTM memory units, perhaps because of their gated-type neuron structures.

Exploding gradients can still occur in very deep Multilayer Perceptron networks with a large batch size and LSTMs with very long input sequence lengths.

3. Use Gradient Clipping
- check for and limit the size of gradients during the training of your network.

4. Use Weight Regularization 
- check the size of network weights and apply a penalty to the network's loss function for large weight values.
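
A minimal sketch of gradient clipping (fix 3), assuming clipping by the joint L2 norm of all gradients; clipping each value to a fixed range is another common variant. The function name `clip_by_norm` and the threshold are hypothetical.

```python
import numpy as np

def clip_by_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))   # leave small gradients untouched
    return [g * scale for g in grads]

grads = [np.array([30.0, -40.0]), np.array([0.0])]   # joint norm is 50
clipped = clip_by_norm(grads, max_norm=5.0)
print(clipped[0])   # direction preserved, norm reduced to about 5
```

Rescaling the whole gradient keeps its direction, so the update still points downhill while its size stays bounded.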

### Issues: Vanishing Gradient

<b>What Are Vanishing Gradients?</b>
During back-propagation, i.e. moving backward through the network and calculating the gradients of the loss (error) with respect to the weights, the gradients tend to get smaller and smaller the further back we go. This means that neurons in the earlier layers learn very slowly compared to neurons in the later layers, so the earlier layers of the network are the slowest to train.

<b>What Is the Problem with Vanishing Gradients?</b>
The training process takes too long, and the prediction accuracy of the model decreases.

<b>How to Fix Vanishing Gradients?</b>
1. In deep models, use simple activation functions such as ReLU
- avoid Sigmoid and Tanh as activation functions, since they saturate and cause vanishing gradients

### Summary

1. Neural networks can be learned with SGD similarly to linear classifiers
2. The derivatives necessary for SGD can be evaluated effectively via back-propagation
3. Multi-layer neural network models are complicated, so we are no longer guaranteed to reach global (only local) optimum with SGD
4. Larger models tend to be easier to learn because their units only need to be adjusted so that they are, collectively, sufficient to solve the task

<br><br>
# <span style="color:green">Recurrent Neural Networks (RNNs)</span>

### Feed-forward networks VS RNNs:
When it comes to temporal sequences, RNNs in general can automatically address some issues that need to be engineered with feed-forward networks. 

E.g. How many time steps back should we look at in the feature vector?<br>
How to retain important items mentioned far back?

RNN can learn the data/history and encode it into a feature vector, unlike feed-forward networks.

3 differences in architecture between an encoder (unfolded RNN) and a standard feed-forward network:
- input is received at each layer (per word), not just at the beginning as in a typical feed-forward network
- the number of layers varies, and depends on the length of the sentence
- parameters of each layer (representing an application of an RNN) are shared (same RNN at each step)
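
The three differences can be seen in a small sketch. It assumes the simple update $s_t=\tanh(W^{s,s}s_{t-1}+W^{s,x}x_t)$; the matrix names, the sizes and the random "word vectors" are hypothetical and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_word = 4, 3                       # hypothetical sizes
W_ss = rng.normal(scale=0.5, size=(d_state, d_state))
W_sx = rng.normal(scale=0.5, size=(d_state, d_word))

def encode(words):
    """Fold a sequence of word vectors into a single state vector."""
    s = np.zeros(d_state)                    # initial state
    for x in words:                          # one "layer" per word: depth = sentence length
        s = np.tanh(W_ss @ s + W_sx @ x)     # input at every step, same parameters each time
    return s

short = rng.normal(size=(2, d_word))         # a 2-word sentence
long = rng.normal(size=(7, d_word))          # a 7-word sentence
print(encode(short).shape, encode(long).shape)
```

Both sentences are mapped to a vector of the same size, even though the unfolded network has a different number of layers for each.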

### <span style="color:orange">Encoding</span>
- e.g., mapping a sequence to a vector

Easy to introduce adjustable “lego pieces” and optimize them for end-to-end performance
<img src="img/10.1.png" width="450"/>

Here, for a sentence beginning with "Efforts and courage", we expect the states at time step 3 and onward to contain information about the first three words.

### Gate

<img src="img/10.2.png" width="450"/>

Here, the gate vector $g_t$ has the same dimension as $S_t$ and determines "how much information to overwrite in the next state." The symbol  ⨀  denotes element-wise multiplication.

$\therefore$ If the  𝑖 th element of  $g_{t}$  is 0, the  𝑖 th element of  $S_t$  and that of  $S_{t-1}$  are equal; 

$\therefore$ If  $g_{t}$  is a vector whose elements are all 0,  $S_t$  and  $S_{t-1}$  are equal
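
A small sketch of these two statements, assuming the gated update takes the form $S_t=(1-g_t)\odot S_{t-1}+g_t\odot \tilde{S}_t$, where $\tilde{S}_t$ is the candidate new state. This form is an assumption consistent with the statements above; the function name and the values are made up.

```python
import numpy as np

def gated_update(s_prev, candidate, g):
    """S_t = (1 - g) * S_{t-1} + g * candidate, all element-wise."""
    return (1 - g) * s_prev + g * candidate

s_prev = np.array([0.2, -0.7, 0.5])
candidate = np.array([0.9, 0.9, 0.9])    # stands in for the tanh(...) candidate state
g = np.array([0.0, 1.0, 0.5])            # keep, overwrite, and mix

print(gated_update(s_prev, candidate, g))
print(gated_update(s_prev, candidate, np.zeros(3)))   # an all-zero gate copies the old state
```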

In [15]:
w_fh, w_fx, b_f, w_ch = 0, 0, -100, -100
w_ih, w_ix, b_i, w_cx = 0, 100, 100, 50
w_oh, w_ox, b_o, b_c = 0, 100, 0, 0
h_t, c_t, h = 0, 0, []

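def calc_func(wh, h, wx, x, b, func):
    """Evaluate a gate or cell activation on z = wh*h + wx*x + b."""
    z = wh*h + wx*x + b
    # Saturated approximation: clamp to the limiting value once |z| >= 1
    if func == 'sigmoid':
        return 1 if z >= 1 else 0 if z <= -1 else 1/(1 + np.exp(-z))
    elif func == 'tanh':
        return 1 if z >= 1 else -1 if z <= -1 else np.tanh(z)
    raise ValueError('unknown activation: ' + str(func))

for x_t in [0, 0, 1, 1, 1, 0]: # For the LSTM unit
    f_t = calc_func(w_fh, h_t, w_fx, x_t, b_f, 'sigmoid')  # forget gate
    i_t = calc_func(w_ih, h_t, w_ix, x_t, b_i, 'sigmoid')  # input gate
    o_t = calc_func(w_oh, h_t, w_ox, x_t, b_o, 'sigmoid')  # output gate
    # Memory cell: keep part of the old cell, add the gated candidate
    c_t = np.multiply(f_t, c_t) + np.multiply(i_t, calc_func(w_ch, h_t, w_cx, x_t, b_c, 'tanh'))
    # Visible state: tanh(c_t), obtained by passing c_t as the offset with all other terms zero
    h_t = np.multiply(o_t, calc_func(0,0,0,0,c_t,'tanh'))
    # Treat an output of exactly +/-0.5 as 0
    if h_t == 0.5 or h_t == -0.5:
        h_t = 0
    h.append(h_t)
print(h)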

[0.0, 0.0, 1, -1, 1, 0]


### Long Short-Term Memory (LSTM)
The diagram below shows a single LSTM unit that consists of Input, Output, and Forget gates.

$f_t$ is <b>forget gate</b>; $i_t$ is <b>input gate</b>; $o_t$ is <b>output gate</b>; $c_t$ is <b>memory cell</b>; $h_t$ is <b>visible state</b>; 

<img src="img/images_hw4_p2.png" width="450"/><br>
The behavior of such a unit as a recurrent neural network is specified by a set of update equations. These equations define how the gates, “memory cell"  $c_t$  and the “visible state"  $h_t$  are updated in response to input  $x_t$  and previous states  $c_{t-1}$ ,  $h_{t-1}$ . For the LSTM unit,<br>
<img src="img/images_hw4_p3.png" width="400"/><br>
where the symbol ⊙ stands for element-wise multiplication. The adjustable parameters in this unit are the matrices $W^{f,h}$, $W^{f,x}$, $W^{i,h}$, $W^{i,x}$, $W^{o,h}$, $W^{o,x}$, $W^{c,h}$, $W^{c,x}$, as well as the offset parameter vectors $b_f$, $b_i$, $b_o$, and $b_c$. By changing these parameters, we change how the unit evolves as a function of the inputs $x_t$.

To keep things simple, in this problem we assume that  $x_t$ ,  $c_t$ , and  $h_t$  are all scalars. Concretely, suppose that the parameters are given by<br>
<img src="img/images_hw4_p4.png" width="400"/><br>
We run this unit with initial conditions  $h_{-1}=0$  and  $c_{-1}=0$ , and in response to the following input sequence: [0, 0, 1, 1, 1, 0] (For example,  $x_0=0$ ,  $x_1=0$ ,  $x_2=1$ , and so on).

### <span style="color:orange">Decoding</span> 

<img src="img/12.22.png" width="400"/>

### Markov Models
Next word in a sentence depends on previous symbols already written. Say, use previous words to predict "$bumfuzzled$"
$$The \quad lecture \quad leaves \quad me \quad bumfuzzled$$

Let $V$ denote the set of possible words/symbols $w \in V$, which includes unknown words (UNK) as well as the beginning ($<beg>$) and end ($<end>$) symbols
$$<beg> \quad The \quad lecture \quad leaves \quad me \quad UNK \quad <end>$$
$$w_0 \qquad w_1 \qquad w_2 \qquad w_3 \qquad w_4 \qquad w_5 \qquad w_6$$

#### A simple first order Markov model
Each symbol (except $<beg>$) in the sequence is predicted using the same conditional probability table
until an $<end>$ symbol is seen

<img src="img/12.23.png" width="400"/>

<b>Maximum likelihood estimation: </b> The goal is to maximize the probability that the model can generate all the observed sentences (corpus S)
$$s \in S, \quad s=\left \{ w_1^s,w_2^s, ... , w_{\left | s \right |}^s \right \}$$

The ML estimate is obtained as normalized counts of successive word occurrences (matching statistics)

##### Example1 
The probability of generating the following sentence <span style="color:blue">$<beg>$ ML course UNK $<end>$</span> is

𝑃(𝑀𝐿|<𝑏𝑒𝑔>)×𝑃(𝑐𝑜𝑢𝑟𝑠𝑒|𝑀𝐿)×𝑃(𝑈𝑁𝐾|𝑐𝑜𝑢𝑟𝑠𝑒)×𝑃(<𝑒𝑛𝑑>|𝑈𝑁𝐾) = 0.7×0.5×0.1×0.2 = 0.007 
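
The same product can be computed programmatically. The sketch below contains only the four transition probabilities quoted in this example; the rest of the conditional probability table is not reproduced, and pairs missing from this partial table are treated as probability 0, which is an assumption of the sketch.

```python
import math

# Partial transition table: only the four entries used in Example 1
P = {
    ('<beg>', 'ML'): 0.7,
    ('ML', 'course'): 0.5,
    ('course', 'UNK'): 0.1,
    ('UNK', '<end>'): 0.2,
}

def sentence_prob(words, table):
    """Product of first-order transition probabilities along the sentence."""
    pairs = zip(words[:-1], words[1:])
    return math.prod(table.get(pair, 0.0) for pair in pairs)

prob = sentence_prob(['<beg>', 'ML', 'course', 'UNK', '<end>'], P)
print(prob)   # approximately 0.007, up to floating-point rounding
```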

##### Example2
Some sentences below CANNOT be generated.
- <span style="color:blue">$<beg>$ course ML is UNK $<end>$</span>
- <span style="color:blue">$<beg>$  $<end>$</span>
- <span style="color:blue">course is ML $<end>$</span>

##### Example3
Suppose our training examples are the following three sentences.
- ML courses are cool.
- Humanities courses are cool.
- But some courses are boring.

Using a BIGRAM model, the maximum likelihood estimate for the probability that the next word is 'cool', given that the previous word is 'are', is 

In [22]:
data = ['Humanities courses are cool.', 'But some courses are boring.', 'ML courses are cool.']
data = [['<beg>']+i.replace('.',' <end>').split(' ') for i in data]
data = [bi_w for this_s in data for bi_w in zip(this_s[:-1],this_s[1:])]
counter=collections.Counter(data)
bow = dict(counter)
bow

{('<beg>', 'Humanities'): 1,
 ('Humanities', 'courses'): 1,
 ('courses', 'are'): 3,
 ('are', 'cool'): 2,
 ('cool', '<end>'): 2,
 ('<beg>', 'But'): 1,
 ('But', 'some'): 1,
 ('some', 'courses'): 1,
 ('are', 'boring'): 1,
 ('boring', '<end>'): 1,
 ('<beg>', 'ML'): 1,
 ('ML', 'courses'): 1}

In [20]:
bow[('are', 'cool')]/sum([v for i,v in bow.items() if i[0] == 'are'])

0.6666666666666666

#### Feature based Markov Model
We can also represent the Markov model as a feed-forward neural network, which is easy to extend.
We apply a softmax activation to the outputs. Given a word $i$, let $p_j$ be the probability that word $j$ occurs next. The probabilities $p_k$ satisfy:

$$\sum_{k =1}^{K}p_k=1 \quad \text{where } p_k \geqslant 0 \text{ for all } k \in \{1,\dots,K\}$$

In this case, the words are one-hot encoded, so each input word would activate one unique node on the input layer.

Advantages of FFNNs over Markov models:
- they contain fewer parameters

For example, with a vocabulary of 10 words and a model that predicts the next word from the previous two, a Markov model would have 100 choices for the previous two words and 10 choices for the next word, leading to a table of size 1000. A feedforward neural network would have an input layer of size 20 (two one-hot vectors) and an output layer of size 10, leading to a weight matrix of size 200, plus 10 parameters for the bias vector.

- their complexity can easily be controlled by introducing hidden layers

However, any information encoded in a neural network could also be encoded in a very large transition probability matrix, i.e. a Markov model. Therefore, the essential information is the same.
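
The parameter counts and the softmax output can be reproduced in a few lines. The sketch assumes a 10-word vocabulary and a two-word context, matching the sizes quoted above; the weights are random placeholders rather than trained values, and the chosen word indices are arbitrary.

```python
import numpy as np

V, context = 10, 2                            # assumed vocabulary size and context length
markov_params = V ** context * V              # 100 contexts x 10 next words
ffnn_params = (context * V) * V + V           # 20 x 10 weights + 10 biases

def softmax(z):
    e = np.exp(z - z.max())                   # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W, b = rng.normal(size=(V, context * V)), np.zeros(V)

x = np.zeros(context * V)
x[3] = 1                                      # one-hot: word 3 in the first position
x[V + 7] = 1                                  # one-hot: word 7 in the second position

p = softmax(W @ x + b)                        # distribution over the next word
print(markov_params, ffnn_params, p.sum())
```

Because the inputs are one-hot, `W @ x` simply adds up one column of `W` per context word, so each word activates its own set of weights.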

#### Temporal / sequence problems
Language modeling: what comes next?

<img src="img/12.23.png" width="300"/>

A trigram language model

<img src="img/9.4.png" width="300"/>

### Some real cases
<img src="img/12.29.png" width="500"/><br>
<img src="img/12.30.png" width="800"/><br>

### Summary
Markov models for sequences
- how to formulate, estimate, sample sequences from

RNNs for generating (decoding) sequences - relation to Markov models
- evolving hidden state
- sampling from

Decoding vectors into sequences

<br><br>
# <span style="color:green">Convolutional Neural Networks (CNNs)</span>

### Convolution

#### Continuous Case
$$(f*g)(t)\equiv\int_{-\infty}^{+\infty}f(\tau)g(t-\tau)d\tau$$
𝜏  is the dummy variable for integration and  𝑡  is the parameter. Intuitively, convolution 'blends' the two functions  𝑓  and  𝑔  by expressing the amount of overlap of one function as it is shifted over the other.

<img src="img/images_L12_cov_f.png" width="300"/><img src="img/images_L12_cov_g.png" width="300"/><img src="img/images_L12_cov_f*g.png" width="300"/>

The area under the convolution: $\int_{-\infty}^{+\infty}(f*g)dt$ is the product of the areas under  𝑓  and  𝑔<br>
<img src="img/12.1.png" width="400"/>

#### Discrete Case
$$(f*g)[n]\equiv\sum_{m=-\infty}^{m=+\infty}f[m]g[n-m]$$

1. 1D discrete signal example:

Let  𝑓[𝑛]=[1,2,3] ,  𝑔[𝑛]=[2,1]  and suppose  𝑛  starts from  0 . We are computing  ℎ[𝑛]=𝑓[𝑛]∗𝑔[𝑛] .
As  𝑓  and  𝑔  are finite signals, we just put  0  to where  𝑓  and  𝑔  are not defined. This is usually called zero padding. Now, let's compute  ℎ[𝑛] step by step:<br>
<img src="img/12.2.png" width="800"/><br>
The other parts of  ℎ  are all  0 .

<img src="img/images_L12_cov_example.png" width="600"/>

In practice, it is common to call the flipped  𝑔′  the filter or kernel for the input signal or image  𝑓 .

Because we are forced to pad zeros where the input is not defined, the results at the edges of the input may not be accurate. To avoid this, we can keep only the part of the convolution result where the input  𝑓  is actually defined, that is,  ℎ[𝑛]=[5,8] .

So with zero padding, ℎ[𝑛]=[2,5,8,3];  without zero padding, ℎ[𝑛]=[5,8]
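
Both results can be reproduced with NumPy's `np.convolve`, whose `mode` argument selects between the zero-padded result (`'full'`) and the part where the two signals overlap completely (`'valid'`).

```python
import numpy as np

f = np.array([1, 2, 3])
g = np.array([2, 1])

h_full = np.convolve(f, g, mode='full')     # zero padding at both ends
h_valid = np.convolve(f, g, mode='valid')   # only where g fully overlaps f

print(h_full)    # [2 5 8 3]
print(h_valid)   # [5 8]
```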

2. 2D discrete signal example:

$$f=\begin{bmatrix}1 & 2 & 1\\ 2 & 1 & 1 \\1&1&1\end{bmatrix} \qquad g'=\begin{bmatrix}1 & 0.5\\ 0.5 & 1\end{bmatrix} \qquad$$
Without zero padding, we have $h=\begin{bmatrix}4 & 4\\ 4 & 3\end{bmatrix}$
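
A minimal sketch of this 2D computation: since $g'$ is already the flipped kernel, it is slid over $f$ directly and the element-wise products are summed at each position. The helper name `correlate2d_valid` is illustrative.

```python
import numpy as np

def correlate2d_valid(f, k):
    """Slide kernel k over f without padding and sum the element-wise products."""
    rows = f.shape[0] - k.shape[0] + 1
    cols = f.shape[1] - k.shape[1] + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = f[i:i + k.shape[0], j:j + k.shape[1]]
            out[i, j] = np.sum(window * k)
    return out

f = np.array([[1, 2, 1], [2, 1, 1], [1, 1, 1]])
g_flipped = np.array([[1, 0.5], [0.5, 1]])

h = correlate2d_valid(f, g_flipped)
print(h)   # [[4. 4.] [4. 3.]]
```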

### Pooling
Pooling region and “stride” may vary
- pooling induces translation invariance at the cost of spatial resolution
- stride reduces the size of the resulting feature map

<img src="img/12.3.png" width="500"/>
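
A short sketch of max pooling showing how the stride changes the size of the output. The input values, the function name `max_pool2d` and the window size are made up for illustration.

```python
import numpy as np

def max_pool2d(x, size, stride):
    """Take the max over size x size windows, moving `stride` cells at a time."""
    rows = (x.shape[0] - size) // stride + 1
    cols = (x.shape[1] - size) // stride + 1
    out = np.empty((rows, cols), dtype=x.dtype)
    for i in range(rows):
        for j in range(cols):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

x = np.array([[1, 3, 2, 0],
              [4, 6, 1, 1],
              [0, 2, 9, 5],
              [1, 1, 3, 7]])

pooled_s2 = max_pool2d(x, size=2, stride=2)   # 2x2 output
pooled_s1 = max_pool2d(x, size=2, stride=1)   # 3x3 output
print(pooled_s2)
print(pooled_s1.shape)
```

A larger stride gives a smaller feature map, at the cost of spatial resolution.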

### Flattening
Flattening converts the data into a 1-dimensional array so that it can be fed to the next layer. We flatten the output of the convolutional layers to create a single long feature vector, which is then connected to the final classification model, called a fully-connected layer.

<img src="img/12.20.png" width="300"/>
<img src="img/12.21.png" width="300"/>

### CNN Example
Say we have just one conv layer consisting of just one filter  𝐹  of shape  2×2 followed by a max-pooling layer of shape  2×2 . The input image is of shape  3×3.

Assume that the stride for the convolution and pooling layers is 1, and that the image $I$ and filter weights $F$ are as below.
$$I=\begin{bmatrix}1 & 0 & 2\\ 3 & 1 & 0 \\ 0&0&4\end{bmatrix} \qquad F=\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix} \qquad$$

The output of the CNN is calculated as $Pool(ReLU(Conv(I)))$. 

In [16]:
I = np.array([[1,0,2],[3,1,0],[0,0,4]])
F = np.array([[1,0],[0,1]])
# without zero padding
Conv_I = convolve(I,F)[:-1,:-1]
ReLU_I = np.maximum(0, Conv_I)
Pool_I = np.max(ReLU_I)
print('Thus, the final output is '+str(Pool_I))

Thus, the final output is 5


torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros')

Documentation: https://pytorch.org/docs/stable/nn.html#conv2d

### Some real cases
<img src="img/12.4.png" width="500"/><br>(LeCun 13’)

<img src="img/12.5.png" width="500"/><br>(Krizhevsky et al., 12’)

<br><br>
# <span style="color:green">Optimization Algorithms</span>

### <span style="color:orange;">Gradient Descent</span>
$$\theta=\theta-\eta\cdot \nabla J(\theta)$$

<img src="img/12.6.png" width="350">
- Intuition: the error is large when the weight values are either too small or too large, so we want to update the weights until they are neither. We therefore descend in the direction opposite to the gradient until we find a local minimum.

### Variants of Gradient Descent
##### <span style="color:orange;">1. Stochastic gradient descent</span>
$$\theta=\theta-\eta\cdot \nabla J(\theta;x^{(i)};y^{(i)})$$
- Pros: usually a much faster technique. Because the parameters are updated so frequently, the updates have high variance and cause the loss function to fluctuate with varying intensity. This helps us discover new and possibly better local minima, whereas standard gradient descent only converges to the minimum of the basin it starts in.
- Cons: the frequent updates and fluctuations complicate convergence to the exact minimum, and SGD keeps overshooting. However, it has been shown that when the learning rate η is decreased slowly, SGD shows the same convergence pattern as standard gradient descent.

<img src="img/12.7.png" width="350">

##### <span style="color:orange;">2. Mini Batch Gradient Descent</span>
- Pros: very efficient; reduces the variance of the parameter updates and the unstable convergence discussed above, which leads to better and more stable convergence. Mini-batch sizes commonly range from 50 to 256, but can vary with the application and problem being solved.
- Cons: still difficult to choose a proper learning rate; cannot apply different learning rates to different parameter updates; can get trapped in numerous sub-optimal local minima
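
The three variants differ only in how many examples are used per update. The sketch below runs mini-batch gradient descent on a toy least-squares problem; the data, learning rate and batch size are arbitrary assumptions, and the targets are noiseless so that the result is easy to check.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta                           # noiseless targets for a clean demo

def grad(theta, Xb, yb):
    """Gradient of the mean squared error on one batch."""
    return 2 * Xb.T @ (Xb @ theta - yb) / len(yb)

theta, eta, batch_size = np.zeros(2), 0.1, 50
for epoch in range(100):
    order = rng.permutation(len(y))          # reshuffle the data every epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        theta = theta - eta * grad(theta, X[idx], y[idx])

print(theta)   # close to true_theta
```

Setting `batch_size = 1` turns this into stochastic gradient descent, and `batch_size = len(y)` into standard (full-batch) gradient descent.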

##### <span style="color:orange;">3. Mini Batch Stochastic Gradient Descent</span>

### Optimizing the Gradient Descent
##### <span style="color:orange;">1. Momentum</span>
$$V_t=\gamma V_{t-1}+\eta\cdot \nabla J(\theta) \quad\rightarrow \quad \theta=\theta-V_t $$
 
The momentum term γ is usually set to 0.9 or a similar value. It is the fraction of the previous step's update vector that is added to the current update vector.

The momentum here is the same as momentum in classical physics: when we throw a ball down a hill, it gathers momentum and its velocity keeps increasing. As a result, the updates build up along directions where successive gradients agree.

- Pros: accelerates SGD by navigating along the relevant direction and softens the high variance oscillations in irrelevant directions; leads to faster and stable convergence.
- Cons: A ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We’d like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.

##### <span style="color:orange;">2. Nesterov accelerated gradient</span>
$$V_t=\gamma V_{t-1}+\eta\cdot \nabla J(\theta-\gamma V_{t-1}) \quad\rightarrow \quad \theta=\theta-V_t $$

1. make a big jump based on our previous momentum
2. calculate the gradient
3. make a correction, which results in a parameter update.

- Pros: this anticipatory update prevents us from going too fast and overshooting the minimum, and makes the method more responsive to changes.
- Cons: we would also like to adapt our updates to each individual parameter, performing larger or smaller updates depending on their importance.
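
The two update rules differ only in where the gradient is evaluated. A minimal sketch on an illustrative quadratic objective, $J(\theta)=\frac{1}{2}(\theta_1^2+10\,\theta_2^2)$; the objective, starting point and hyperparameters are assumptions chosen for the demo.

```python
import numpy as np

def grad_J(theta):
    """Gradient of the illustrative objective J = 0.5 * (theta_1^2 + 10 * theta_2^2)."""
    return np.array([1.0, 10.0]) * theta

def run(nesterov, steps=200, eta=0.02, gamma=0.9):
    theta, v = np.array([5.0, 5.0]), np.zeros(2)
    for _ in range(steps):
        # Nesterov evaluates the gradient at the look-ahead point theta - gamma * v
        point = theta - gamma * v if nesterov else theta
        v = gamma * v + eta * grad_J(point)
        theta = theta - v
    return theta

print(run(nesterov=False))   # plain momentum
print(run(nesterov=True))    # Nesterov accelerated gradient
```

Both runs end near the minimum at the origin; on this problem the look-ahead gradient damps the oscillations somewhat faster.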

##### <span style="color:orange;">3. Ada-grad</span>
$$\theta_{t+1,i}=\theta_{t,i}-\frac{\eta}{\sqrt{G_{t,ii}+\varepsilon}} \cdot g_{t,i}$$

Most implementations use a default learning rate η of 0.01 and leave it at that.

- Pros: no need to manually tune the learning rate η. It makes big updates for infrequent parameters and small updates for frequent parameters. Thus, it is well-suited for dealing with sparse data.
- Cons: its effective learning rate is always decaying and eventually becomes very small, because the squared gradients accumulate in the denominator and every added term is positive. As a result, convergence becomes very slow and the model eventually stops learning entirely.
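
The decaying effective learning rate is easy to observe in code. The sketch below follows the update formula above with a per-parameter accumulator; the toy gradient function (gradient equal to θ) and the starting point are assumptions for illustration.

```python
import numpy as np

def adagrad(grad_fn, theta, eta=0.01, eps=1e-8, steps=100):
    """Adagrad with a per-parameter accumulator G (the diagonal entries G_{t,ii})."""
    G = np.zeros_like(theta)
    rates = []
    for _ in range(steps):
        g = grad_fn(theta)
        G = G + g ** 2                      # accumulate squared gradients
        rate = eta / np.sqrt(G + eps)       # per-parameter effective step size
        theta = theta - rate * g
        rates.append(rate)
    return theta, np.array(rates)

theta, rates = adagrad(lambda th: th, np.array([1.0, 0.01]))
print(rates[0], rates[-1])   # effective rates at the first and last step
```

The parameter with the small gradient starts with a much larger effective rate than the one with the large gradient, and the rate of the first parameter shrinks at every step.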

##### <span style="color:orange;">4. AdaDelta</span>
$$E[g^2]_t=\gamma E[g^2]_{t-1}+(1-\gamma)\cdot g_t^2$$

Instead of inefficiently storing w previous squared gradients, the sum of gradients is recursively defined as a decaying mean of all past squared gradients. 

The running average $E[g^2]_t$ at time step t then depends (as a fraction γ similarly to the Momentum term) only on the previous average and the current gradient.

Set γ to a similar value as the momentum term, around 0.9.

$$\Delta\theta_t=-\eta\cdot g_{t,i}\quad \rightarrow \quad \theta_{t+1}=\theta_t+\Delta\theta_t$$
$$\Delta \theta_{t}=-\frac{\eta}{\sqrt{E[g^2]_t+\varepsilon}} \cdot g_{t}$$

As the denominator is just the root mean squared (RMS) error criterion of the gradient, we can replace it with the criterion shorthand:

$$\Delta \theta_{t}=-\frac{\eta}{RMS[g]_t} \cdot g_{t}$$

- Pros: rectifies the problem of the decaying learning rate. Instead of accumulating all previous squared gradients, Adadelta limits the window of accumulated past gradients to some fixed size w. We don't even need to set a default learning rate, since the full Adadelta update replaces η with $RMS[\Delta\theta]_{t-1}$.

What we achieved so far:
1. calculate different learning Rates for each parameter.
2. calculate momentum.
3. prevent Vanishing(decaying) learning Rates.

##### <span style="color:orange;">5. Adaptive Moment Estimation (Adam)</span>
$$m_t=\beta_1 m_{t-1}+(1-\beta_1)\cdot g_t \qquad v_t=\beta_2 v_{t-1}+(1-\beta_2)\cdot g_t^2$$
$$\hat{m}_t=\frac{m_t}{1-\beta_1^t} \qquad \hat{v}_t=\frac{v_t}{1-\beta_2^t} $$

Adam keeps an exponentially decaying average of past gradients $m_t$, similar to momentum, and of past squared gradients $v_t$. Here, $m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively; $\hat{m}_t$ and $\hat{v}_t$ are their bias-corrected versions.

$$\theta_{t+1}=\theta_{t}-\frac{\eta}{\sqrt{\hat{v}_t}+\varepsilon} \cdot \hat{m}_t$$

The default values are 0.9 for β1, 0.999 for β2, and $10^{-8}$ for ϵ.

- Pros: not only computes adaptive learning rates for each parameter, but also calculates individual momentum changes for each parameter and stores them separately.
Adam works well in practice and compares favorably to other adaptive learning-rate algorithms: it converges quickly, and it addresses problems faced by other optimization techniques, such as a vanishing learning rate, slow convergence, or high variance in the parameter updates that leads to a fluctuating loss function.
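
A minimal sketch of the Adam update, using the common form in which ε is added outside the square root. The quadratic objective, the starting point and the learning rate used in the demo are assumptions for illustration.

```python
import numpy as np

def adam(grad_fn, theta, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = np.zeros_like(theta)                     # first-moment estimate
    v = np.zeros_like(theta)                     # second-moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

def grad_J(theta):
    """Gradient of the illustrative objective J = 0.5 * (theta_1^2 + 10 * theta_2^2)."""
    return np.array([1.0, 10.0]) * theta

start = np.array([1.0, 1.0])
first_step = adam(grad_J, start, eta=0.01, steps=1)
final = adam(grad_J, start, eta=0.01, steps=2000)
print(first_step, final)
```

After bias correction the very first step has size close to η in every coordinate, regardless of how large each gradient component is. This is the per-parameter adaptation described above.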

### Comparisons
In the comparisons below, Adam and the other adaptive learning-rate techniques tend to outperform the other optimization algorithms.

<img src="img/12.8.png" width="500">
<img src="img/12.9.gif" width="500">
<img src="img/12.10.gif" width="500">
<img src="img/12.11.gif" width="500">

<br><br>
# <span style="color:green">Regularization</span>

Cost function = Loss (say, binary cross entropy) + Regularization term

### <span style="color:orange;">L1 & L2 Regularization</span>
1. L2: $$\text{Cost function} = \text{Loss} + \frac{\lambda}{2m}\cdot\sum\left \| w \right \|^2$$

L2 regularization is also known as weight decay as it forces the weights to decay towards zero (but not exactly zero).

2. L1: $$\text{Cost function} = \text{Loss} + \frac{\lambda}{2m}\cdot\sum\left \| w \right \|$$

Unlike L2, the weights may be reduced to zero here. Hence, it is very useful when we are trying to compress our model. Otherwise, we usually prefer L2 over it.
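
Both penalties are simple to compute from the weight arrays. In the sketch below, the weights, the loss value, λ and the number of examples m are made-up numbers used only to illustrate the two formulas.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """lambda / (2m) times the sum of squared weights."""
    return lam / (2 * m) * sum(np.sum(w ** 2) for w in weights)

def l1_penalty(weights, lam, m):
    """lambda / (2m) times the sum of absolute weights."""
    return lam / (2 * m) * sum(np.sum(np.abs(w)) for w in weights)

weights = [np.array([[1.0, -2.0], [0.0, 3.0]]), np.array([0.5, -0.5])]
loss, lam, m = 0.40, 0.1, 100        # hypothetical loss, strength and sample count

cost_l2 = loss + l2_penalty(weights, lam, m)
cost_l1 = loss + l1_penalty(weights, lam, m)
print(cost_l2, cost_l1)
```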

### <span style="color:orange;">Dropout</span>
<img src="img/12.12.png" width="250">
<img src="img/12.13.png" width="250">

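At every iteration, dropout randomly selects some nodes and removes them along with all of their incoming and outgoing connections. So each iteration has a different set of nodes, and this results in a different set of outputs. It can also be thought of as an ensemble technique in machine learning.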

<img src="img/12.14.gif" width="500">

It produces very good results and is consequently one of the most frequently used regularization techniques in deep learning.
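
A minimal sketch of dropout applied to a layer's activations. It assumes the "inverted dropout" variant, where the kept activations are rescaled during training so that their expected value is unchanged; the keep probability of 0.8 and the array shape are arbitrary.

```python
import numpy as np

def dropout(a, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob and rescale the rest."""
    mask = rng.random(a.shape) < keep_prob    # a fresh random mask on every call
    return a * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 50))                       # stand-in for hidden layer activations
out = dropout(a, keep_prob=0.8, rng=rng)

print((out == 0).mean())   # fraction of dropped units, close to 0.2
print(out.mean())          # mean activation, close to 1.0
```

At test time the layer is used without the mask; with the rescaling above, no further correction is needed.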

### <span style="color:orange;">Data Augmentation</span>
The simplest way to reduce overfitting is to increase the size of the training data. In practice, however, collecting more labeled data is often too costly, so we generate additional training examples from the data we already have.

There are a few ways of increasing the size of the training data – rotating the image, flipping, scaling, shifting, etc. 

<img src="img/12.15.png" width="500">
<img src="img/12.16.png" width="500">
<img src="img/12.17.png" width="500">
<img src="img/12.18.png" width="500">
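
A few of these transformations can be sketched with plain NumPy on a tiny array standing in for an image. The function name `augment` and the choice of transformations are illustrative; real pipelines usually apply them randomly during training.

```python
import numpy as np

def augment(img):
    """Return simple transformed copies of a 2D image array."""
    return {
        'flip_lr': np.fliplr(img),                       # horizontal flip
        'flip_ud': np.flipud(img),                       # vertical flip
        'rot90': np.rot90(img),                          # rotate 90 degrees counterclockwise
        'shift_right': np.roll(img, shift=1, axis=1),    # shift by one pixel, wrapping around
    }

img = np.arange(9).reshape(3, 3)
variants = augment(img)
for name, out in variants.items():
    print(name, out.tolist())
```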

### <span style="color:orange;">Early Stopping</span>
A kind of cross-validation strategy where we keep one part of the training set as the validation set. When we see that the performance on the validation set is getting worse, we immediately stop the training on the model. This is known as early stopping.


<img src="img/12.19.png" width="400">
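
The stopping rule can be sketched as a small function over a sequence of validation losses. The `patience` parameter (how many epochs without improvement to tolerate) is an assumption added here; `patience=1` corresponds to stopping immediately. The validation curve is made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch seen before training is stopped."""
    best, best_epoch, waited = float('inf'), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0   # improvement: reset the counter
        else:
            waited += 1
            if waited >= patience:                      # no improvement for too long
                break
    return best_epoch

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.45]   # made-up validation curve
print(early_stopping_epoch(val_losses))                # 3
```

In this example training stops after epoch 5 and keeps the model from epoch 3, so the later improvement at epoch 6 is never seen. This is the trade-off of a small patience value.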

### Use GPU to run models
Google Colab (free):
https://colab.research.google.com/notebooks/intro.ipynb#recent=true

Paperspace (<$1/hr):
https://www.paperspace.com

In [None]:
Resnet, AlexNet