# Deep learning

A deep neural network can be used to extract features for a linear basis regression. This avoids manually designing or selecting features for a given problem.

Typical questions to ask:
- how many layers? how many neurons in each layer?
  - trial and error + intuition
- can the structure be determined automatically?
  - yes. Research is ongoing but not yet widely applied, for example evolutionary artificial neural networks.
- can we design the network structure?
  - yes. Most popular deep neural network models have their own structural design, such as residual networks, recurrent networks, and convolutional networks. None of these models is a plain fully connected network.

## 1. Backpropagation


### 1.1 Chain Rule

**Case 1**: 

Given $y = g(x)$ and $z=h(y)$, a change propagates as $\Delta x \rightarrow \Delta y \rightarrow \Delta z$, which results in: $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$


**Case 2**

Given $x=g(s)$, $y=h(s)$, and $z=k(x,y)$, a change in $s$ reaches $z$ through both $x$ and $y$, so we have $\frac{dz}{ds} = \frac{\partial z}{\partial x} \frac{dx}{ds} + \frac{\partial z}{\partial y}\frac{dy}{ds}$.
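As a quick sanity check of Case 2, the sketch below compares the chain-rule result with a finite-difference estimate. The concrete functions $g(s)=s^2$, $h(s)=\sin(s)$, and $k(x,y)=xy$ are assumptions chosen only for illustration.

```python
import math

def g(s):            # x = g(s), illustrative choice
    return s ** 2

def h(s):            # y = h(s), illustrative choice
    return math.sin(s)

def k(x, y):         # z = k(x, y), illustrative choice
    return x * y

def dz_ds_chain(s):
    """dz/ds from the multivariate chain rule."""
    x, y = g(s), h(s)
    dz_dx, dz_dy = y, x                  # partial derivatives of k(x, y) = x * y
    dx_ds, dy_ds = 2 * s, math.cos(s)    # derivatives of g and h
    return dz_dx * dx_ds + dz_dy * dy_ds

def dz_ds_numeric(s, eps=1e-6):
    """Central finite-difference estimate of dz/ds."""
    z_plus = k(g(s + eps), h(s + eps))
    z_minus = k(g(s - eps), h(s - eps))
    return (z_plus - z_minus) / (2 * eps)

s = 1.3
analytic = dz_ds_chain(s)
numeric = dz_ds_numeric(s)
print(analytic, numeric)   # the two values should agree closely
```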

### 1.2 Forward and Backward Pass

For a neural network, the forward pass computes the output of the network for a given input. The backward pass computes the gradient of the loss function with respect to the parameters of the network.

We define a neural network as follows.

- $x$: input
- $x_i$: input for the $i$-th layer
- $z_i$: input for the activation function $\delta_i$
- $w_i$: weight for the $i$-th layer
- $\delta_i$: activation function for the $i$-th layer
- $\mathcal{L}$: loss function

Therefore, we have:

$$ z_i = w_i x_i$$
$$ x_{i+1} = \delta_i(z_i)$$

Procedure: start from the output layer, backpropagate the gradient to the input layer.

$$ \frac{\partial \mathcal{L}}{\partial w_i} = \frac{\partial \mathcal{L}}{\partial x_{i+1}} \frac{\partial x_{i+1}}{\partial z_i} \frac{\partial z_i}{\partial w_i}$$
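The forward and backward pass can be sketched for a tiny network with one scalar weight per layer. This is a minimal illustration under stated assumptions, not a general implementation: layer $i$ maps $x_i$ to $x_{i+1}$, the activation is a sigmoid, and the loss is the squared error. The weights and inputs are arbitrary, and Python lists are 0-based, so `xs[0]` holds $x_1$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x1):
    """Forward pass: z_i = w_i * x_i, x_{i+1} = sigmoid(z_i)."""
    xs = [x1]
    for w_i in w:
        xs.append(sigmoid(w_i * xs[-1]))
    return xs

def loss(output, target):
    return 0.5 * (output - target) ** 2

def backward(w, xs, target):
    """Backward pass: start at the output layer and walk back to the input."""
    grads = [0.0] * len(w)
    dL_dx = xs[-1] - target                      # dL/dx at the network output
    for i in reversed(range(len(w))):
        dx_dz = xs[i + 1] * (1.0 - xs[i + 1])    # sigmoid'(z_i), written via its output
        dL_dz = dL_dx * dx_dz
        grads[i] = dL_dz * xs[i]                 # dz_i/dw_i = x_i
        dL_dx = dL_dz * w[i]                     # dz_i/dx_i = w_i, passed to the previous layer
    return grads

def numeric_grad(w, x1, target, i, eps=1e-6):
    """Finite-difference estimate of dL/dw_i, used only to check backward()."""
    w_plus, w_minus = list(w), list(w)
    w_plus[i] += eps
    w_minus[i] -= eps
    l_plus = loss(forward(w_plus, x1)[-1], target)
    l_minus = loss(forward(w_minus, x1)[-1], target)
    return (l_plus - l_minus) / (2 * eps)

w = [0.8, -1.5]
x1, target = 0.5, 1.0
xs = forward(w, x1)
grads = backward(w, xs, target)
print(grads, [numeric_grad(w, x1, target, i) for i in range(len(w))])
```

Each loop iteration in `backward` multiplies exactly the three factors of the formula above: the gradient arriving from the next layer, the activation derivative, and $\partial z_i / \partial w_i$.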



## 2. Tips for Deep Learning
The overall decision tree for tuning a deep learning model is shown in the following figure.
Reference: https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/overfit-v6.pdf

![training-tips](./resources/imgs/training-tips.png)

If the training does not perform well, diagnose it by comparing the training and validation losses:
- large training loss
    - underfit: model bias is large -> model is too simple
    - optimization failures
- small training loss, large validation loss
    - overfit: model variance is large -> model is too complex
    - mismatch
- small training loss, small validation loss
    - good model

### 2.1 How to Debug

**Model Bias**
Underfitting -> model bias is large -> model is too simple
- increase the model complexity and check the training loss
- if the model is complex enough, the training loss should become small. The validation loss may become large after the complexity increase; at that point we can stop adding complexity.

**Optimization Failures**
- gain insights from comparisons: compare with a more complex model
- start from shallower networks or other models that are easier to optimize
- if deeper networks do not obtain a smaller loss on the training data, then there are optimization failures.

Solutions:
- critical points
  - smaller batch
  - momentum
  - adaptive learning rate
- vanishing gradients
  - change of activation function
  - skip connections
- exploding gradients
  - gradient clipping
  - smaller learning rate
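Gradient clipping from the list above can be sketched as rescaling the gradient whenever its norm exceeds a threshold. The function name `clip_by_norm` and the threshold value are assumptions for illustration; clipping by norm is one common variant, and clipping each component to a fixed range is another.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)                 # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]      # same direction, shorter length

clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)    # norm 50 is rescaled to norm 5
unchanged = clip_by_norm([0.3, 0.4], max_norm=5.0)    # norm 0.5 is left alone
print(clipped, unchanged)
```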

**Model Variance**
Overfitting -> model variance is large -> model is too complex
- small training loss, large validation loss

Solutions:
- reduce model complexity (trade-off between bias and variance, may use cross-validation)
  - **why can cross-validation reduce variance? isn't it just a way to evaluate the model?**
    - it is used for model selection: perform cross-validation on a set of models, and choose the one with the smallest validation loss
  - fewer parameters: smaller model, sharing parameters
  - fewer features
  - early stopping
  - regularization
  - dropout
- more training data
  - data augmentation

**Mismatch**
The training data and testing data have different distributions.
How to detect the distribution mismatch?




### 2.2 When gradient is small

The training process is not always successful. It may fail to converge, or converge to a bad local minimum.
When the loss is not small enough, the optimization has most likely failed because of critical points such as local minima or saddle points.
- we can check whether the gradients are close to zero. If they are, the current point is likely a critical point.
- a local minimum is difficult to escape.
- a saddle point can be escaped easily. However, the gradient around it is small, so training is slow.


How do we know whether training is stuck at a saddle point or a local minimum, given that the explicit shape of the loss function is difficult or impossible to draw?
We can use a Taylor series expansion to inspect the loss surface around the critical point.

Say after a few epochs of training, the parameters are $w$ and the loss is $\mathcal{L}(w)$. We can use a Taylor series expansion to approximate the loss function around $w$.
Define a very small perturbation $\Delta w$, then we have

$$ \mathcal{L}(w + \Delta w) \approx \mathcal{L}(w) + \Delta w^T \nabla \mathcal{L}(w) + \frac{1}{2} \Delta w^T H \Delta w$$

where $H$ is the Hessian matrix of the loss function at $w$.
The gradient is zero at the critical point, thus the first order term is zero, which leads to:

$$ \mathcal{L}(w + \Delta w) \approx \mathcal{L}(w) + \frac{1}{2} \Delta w^T H \Delta w$$

- If $H$ is positive definite (i.e., $\Delta w^T H \Delta w > 0$ for all $\Delta w \neq 0$), then for all $\Delta w$ around $w$, $\mathcal{L}(w+\Delta w) > \mathcal{L}(w)$, which means the critical point at $w$ is a local minimum.
- If $H$ is negative definite (i.e., $\Delta w^T H \Delta w < 0$ for all $\Delta w \neq 0$), then for all $\Delta w$ around $w$, $\mathcal{L}(w+\Delta w) < \mathcal{L}(w)$, which means the critical point at $w$ is a local maximum.
- If $H$ is indefinite (i.e., $\Delta w^T H \Delta w$ is positive for some $\Delta w$ and negative for others), then for some $\Delta w$ around $w$, $\mathcal{L}(w+\Delta w) > \mathcal{L}(w)$, and for some other $\Delta w$, $\mathcal{L}(w+\Delta w) < \mathcal{L}(w)$, which means the critical point at $w$ is a saddle point.

A positive definite matrix has all positive eigenvalues, a negative definite matrix has all negative eigenvalues, and an indefinite matrix has both positive and negative eigenvalues.
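The eigenvalue test can be sketched with NumPy for a small, explicitly given Hessian. The function name `classify_critical_point` and the tolerance are assumptions for illustration, and the example Hessians belong to the toy losses $w_1^2 + w_2^2$ and $w_1^2 - w_2^2$.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eigenvalues = np.linalg.eigvalsh(hessian)    # eigenvalues of a symmetric matrix
    if np.all(eigenvalues > tol):
        return "local minimum"
    if np.all(eigenvalues < -tol):
        return "local maximum"
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "saddle point"
    return "inconclusive"                        # some eigenvalues are (near) zero

bowl = np.array([[2.0, 0.0], [0.0, 2.0]])        # Hessian of w1^2 + w2^2
saddle = np.array([[2.0, 0.0], [0.0, -2.0]])     # Hessian of w1^2 - w2^2
print(classify_critical_point(bowl), classify_critical_point(saddle))
```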


If the critical point is a saddle point, it is easy to escape by moving along the direction of an eigenvector with a negative eigenvalue.

Proof:

- let $H = Q \Lambda Q^T$, where $\Lambda$ is a diagonal matrix with the eigenvalues on the diagonal, and the columns of $Q$ are the orthonormal eigenvectors.
- let $u$ be a column of $Q$ with eigenvalue $\lambda$, so that $Hu = \lambda u$ and $u^Tu = 1$. If we use this eigenvector as the perturbation, $\Delta w = u$, then we have $\Delta w^T H \Delta w = u^T H u = u^T \lambda u = \lambda u^T u$
- With the Taylor expansion, we then have $\mathcal{L}(w + \Delta w) \approx \mathcal{L}(w) + \frac{1}{2} \lambda u^T u = \mathcal{L}(w) + \frac{1}{2} \lambda$
  - if we want to escape the saddle point and move towards a minimum value, then we use an eigenvector with a negative eigenvalue, which leads to $\mathcal{L}(w + \Delta w) < \mathcal{L}(w)$
  - if we want to escape the saddle point and move towards a maximum value, then we use an eigenvector with a positive eigenvalue, which leads to $\mathcal{L}(w + \Delta w) > \mathcal{L}(w)$
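The escape direction can be illustrated on a toy loss $\mathcal{L}(w) = w_1^2 - w_2^2$, which has a saddle point at the origin. The loss, the step size, and the variable names are assumptions for illustration. The eigenvector is scaled by a small step so that the perturbation stays small.

```python
import numpy as np

def loss(w):
    return w[0] ** 2 - w[1] ** 2                 # toy loss with a saddle point at the origin

hessian = np.array([[2.0, 0.0], [0.0, -2.0]])    # Hessian of the toy loss
eigenvalues, eigenvectors = np.linalg.eigh(hessian)   # eigenvalues in ascending order
lam = eigenvalues[0]                             # most negative eigenvalue
u = eigenvectors[:, 0]                           # matching unit eigenvector

w = np.zeros(2)                                  # the saddle point, where the gradient is zero
step = 0.1
w_new = w + step * u                             # perturbation: step * u

predicted_change = 0.5 * step ** 2 * lam         # second-order Taylor term
actual_change = loss(w_new) - loss(w)
print(lam, predicted_change, actual_change)
```

Because the toy loss is exactly quadratic, the predicted and actual changes coincide; for a real network the Taylor term is only an approximation.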

But this method is rarely used in practice due to the high computational cost of computing the Hessian matrix and its eigenvalues for deep models.

How easily does a model get stuck in local minima or saddle points?
- local minima are rare in a high dimensional space. Most of the time the Hessian matrix is indefinite. We can check the ratio of positive eigenvalues of the Hessian matrix after training. Most likely we will find that the ratio is close to 0.5.

**(1). Batch size**

Batch: use a batch of data to compute the gradient, and then update the parameters.

Why batch?
- large batch vs small batch
  - a small batch requires more time for one epoch (more updates), and its gradient is noisier -> better optimization and generalization
  - a large batch requires less time for one epoch, and its gradient is less noisy -> worse training accuracy due to optimization failure
  - in terms of training time with parallel computation, a good batch size is ~100-1000
  - in terms of training accuracy, a noisy gradient is better than a less noisy gradient

Thus, a smaller batch size may help escape critical points thanks to the noisy gradient.
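The link between batch size and gradient noise can be checked on synthetic data. The sketch below is only an illustration under assumptions: a one-parameter linear model $\hat{y} = wx$, a mean squared error loss, and arbitrarily chosen batch sizes. It measures how much the mini-batch gradient fluctuates from batch to batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic regression data: y = 3x + noise
x = rng.normal(size=10_000)
y = 3.0 * x + rng.normal(scale=0.5, size=x.shape)

def batch_gradient(w, idx):
    """Gradient of the mean squared error of y_hat = w * x over the examples in idx."""
    residual = w * x[idx] - y[idx]
    return np.mean(2.0 * residual * x[idx])

def gradient_std(w, batch_size, n_batches=500):
    """Standard deviation of the mini-batch gradient across randomly drawn batches."""
    grads = []
    for _ in range(n_batches):
        idx = rng.integers(0, len(x), size=batch_size)   # sample one random batch
        grads.append(batch_gradient(w, idx))
    return float(np.std(grads))

w = 0.0
std_small = gradient_std(w, batch_size=8)
std_large = gradient_std(w, batch_size=512)
print(std_small, std_large)   # the small batch gives the noisier gradient
```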

**(2). Adaptive learning rate**

See Section 2.4.


### 2.3 Gradient Vanishing

**Gradient vanishing**: 
the gradient becomes smaller and smaller as it is propagated back through more layers. 
This is because the gradient is the product of the gradients of each layer. If the gradient of each layer is smaller than 1, the product shrinks with every additional layer.
Thus, the layers near the output may have large gradients, learn fast, and converge early, while the first few layers have small gradients, learn slowly, and stay almost random.
As a result, the deeper the network, the harder the training.
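The shrinking product can be made concrete with a sigmoid activation, whose derivative is at most 0.25. The sketch assumes a chain of scalar layers that all share the same pre-activation `z` and weight `w`; both values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def backprop_factor(n_layers, z=0.0, w=1.0):
    """Product of the per-layer factors w * sigmoid'(z) over n_layers layers."""
    factor = 1.0
    for _ in range(n_layers):
        factor *= w * sigmoid_grad(z)
    return factor

for n in (1, 5, 10, 20):
    print(n, backprop_factor(n))  # the factor shrinks geometrically with depth
```

Even in the best case for the sigmoid (`z = 0`), ten layers scale the gradient by $0.25^{10}$, which is below $10^{-6}$.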

This is usually caused by using the `Sigmoid` function as the activation. 
The `Sigmoid` function compresses the infinite range of real numbers into a finite range, $(0,1)$. 
From the forward pass point of view, even if we apply a large perturbation $\Delta w$ to a weight in the first layer (so that $\Delta z$ is large), the resulting change $\Delta x$ in the `Sigmoid` output is small.
After a few layers of `Sigmoid` functions, the effect of the large perturbation on the loss function vanishes.
This means the loss is not sensitive to the weights of the first few layers.

Since the first few layers have small gradients and the output layers have large gradients, an adaptive learning rate can mitigate this problem. For example, we can use the Adam optimizer.

Since the `Sigmoid` function has such limitations, what are good alternatives?

- change the activation function

***ReLU***: fast and easy to compute, and can be viewed as the sum of infinitely many sigmoid functions with different biases. 

$$ f(x) = \begin{cases} x & x > 0 \\ 0 & x \leq 0 \end{cases}$$

ReLU outputs zero for neurons with negative inputs, so for a given input the remaining active neurons form a thinner linear network.
However, it has a problem called "dead neuron". If the input to a neuron is negative, its gradient is zero, so its weights are not updated. A neuron whose input stays negative is dead. This is less of a problem for the first few layers, but it is a problem for the last few layers, because they sit right before the output layer, which has large gradients. If the last few layers are dead, the output layer will not learn anything.

***Leaky ReLU / Parametric ReLU***: 

$$ f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \leq 0 \end{cases}$$

where $\alpha$ is a small positive number, such as 0.01. In Leaky ReLU $\alpha$ is fixed; in Parametric ReLU it is learned during training.
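The two piecewise definitions and their derivatives can be written out directly. This is a scalar sketch; the default `alpha=0.01` follows the example value in the text, and the derivative at exactly zero is set to the value of the negative branch by convention.

```python
def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0          # zero gradient for non-positive inputs

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha        # small but non-zero gradient for non-positive inputs

for value in (-2.0, 3.0):
    print(value, relu(value), relu_grad(value), leaky_relu(value), leaky_relu_grad(value))
```

For a negative input, the ReLU gradient is exactly zero while the Leaky ReLU gradient is `alpha`, which is why the leaky variant keeps the neuron trainable.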


***Maxout***: a learnable activation function. It can represent any piecewise linear convex activation function, such as ReLU.

$$ f(x) = \max(w_1^T x + b_1, w_2^T x + b_2)$$

where $w_1, w_2, b_1, b_2$ are learnable parameters. If $w_2, b_2$ are zero, it is Relu.

- the activation function in a maxout network can be any piecewise linear convex function
- the number of pieces depends on the number of elements in a group. For example, if there are 3 elements in a group, then the activation function is a piecewise linear convex function with up to 3 pieces.

How to differentiate through a $\max()$ operator?
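One common answer is that, for a given input, $\max()$ simply selects one linear piece, so the gradient flows only through the piece that attains the maximum and is zero for the others. The sketch below illustrates this for a single maxout unit. The function names and example values are assumptions, and ties are resolved by taking the first maximal piece.

```python
def maxout(x, weights, biases):
    """Return max_k (w_k^T x + b_k) and the index of the winning piece."""
    values = [
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    winner = max(range(len(values)), key=values.__getitem__)
    return values[winner], winner

def maxout_grad(x, weights, winner):
    """Gradients of the maxout output: only the winning piece receives gradient."""
    grad_w = [[0.0] * len(x) for _ in weights]
    grad_w[winner] = list(x)            # d out / d w_winner = x
    grad_b = [0.0] * len(weights)
    grad_b[winner] = 1.0                # d out / d b_winner = 1
    grad_x = list(weights[winner])      # d out / d x = w_winner
    return grad_w, grad_b, grad_x

# with w_2 = 0 and b_2 = 0 the unit behaves like ReLU applied to w_1^T x + b_1
weights = [[1.0, 0.5], [0.0, 0.0]]
biases = [0.0, 0.0]

out_pos, win_pos = maxout([2.0, -1.0], weights, biases)   # first piece wins
out_neg, win_neg = maxout([-2.0, 1.0], weights, biases)   # zero piece wins
grad_w, grad_b, grad_x = maxout_grad([2.0, -1.0], weights, win_pos)
print(out_pos, win_pos, out_neg, win_neg, grad_w)
```

Different inputs select different pieces, so over a whole training set every piece can still receive gradient updates.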

### 2.4 Error surface is rugged

Training can be difficult even without critical points. The error surface can be rugged, which means the loss function is not smooth.
In this case, the learning rate cannot be one-size-fits-all. We need to use an adaptive learning rate.

A stuck training run does not necessarily mean a small gradient. 
There are cases where the gradient is big but the training is stuck.
  - the gradient may be big, so the updates bounce back and forth across a valley of the error surface instead of moving along it, which leads to slow training.
  - different parameters may need different learning rates. If the gradient in some direction is too small, we want a big learning rate. If the gradient in some direction is too big, we want a small learning rate. Thus, we need to adjust the learning rate for each parameter.
    - can be achieved by using RMSProp or Adam.
  
**(1). Adaptive learning rate**:

- different updates for different parameters based on the magnitude of the gradient, such as RMSProp and Adam.
- learning rate decay: decrease the learning rate as the training goes on.
- warm up: start with a small learning rate, increase it, and then decrease it as the training goes on.
  - not fully explained yet.
  - one explanation is that during the warm-up period, the running gradient statistics used to scale the learning rate are not accurate enough.

***RMSProp***: uses a decayed average of the previous gradients to adjust the learning rate. Only the magnitude of the previous gradients is used; no direction information is used.

$$ w^1 \leftarrow w^0 - \frac{\eta}{\sqrt{v^0 + \epsilon}} g^0$$
$$ w^2 \leftarrow w^1 - \frac{\eta}{\sqrt{v^1 + \epsilon}} g^1, \hspace{5pt} v^1 = \alpha v^0 + (1-\alpha)(g^1)^2$$

where $v^t$ is the moving average of the square of the gradient, $\alpha$ is the decay rate, and $\epsilon$ is a small number to avoid division by zero.
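A single RMSProp step for one parameter can be sketched as follows, treating $v$ as the moving average of the squared gradient. The hyperparameter values and the initialization `v = 0` are assumptions for illustration, not prescribed by the notes.

```python
import math

def rmsprop_step(w, g, v, lr=0.01, alpha=0.9, eps=1e-8):
    """One RMSProp update for a single parameter w with gradient g."""
    v = alpha * v + (1.0 - alpha) * g * g       # moving average of squared gradients
    w = w - lr / math.sqrt(v + eps) * g         # step scaled by the gradient magnitude
    return w, v

# two parameters whose gradients differ by four orders of magnitude
w_big, v_big = rmsprop_step(w=0.0, g=100.0, v=0.0)
w_small, v_small = rmsprop_step(w=0.0, g=0.01, v=0.0)
print(abs(w_big), abs(w_small))   # the two step sizes are nearly the same
```

Dividing by $\sqrt{v}$ normalizes the step, so a parameter with a huge gradient and one with a tiny gradient move by roughly the same amount.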


***Adam***: momentum + RMSProp. Momentum considers the previous movement, which accumulates all the past gradients, including their direction information. The momentum part proceeds as follows:

- start at point $w^0$
- movement of last step: $v^0 = 0$
- compute gradient at $w^0$: $g^0 = \nabla \mathcal{L} (w^0)$
- movement $v^1 = \lambda v^0 -\eta g^0$
- move to $w^1 = w^0 + v^1$
- compute gradient at $w^1$: $g^1 = \nabla \mathcal{L} (w^1)$
- movement $v^2 = \lambda v^1 -\eta g^1$
- move to $w^2 = w^1 + v^2$

where $\lambda$ is the momentum coefficient, and $\eta$ is the learning rate. Note that $v$ here denotes the movement, not the moving average of squared gradients used in RMSProp.
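The momentum steps above, and the way Adam combines a momentum-style average with an RMSProp-style scaling, can be sketched on a one-dimensional toy problem. The loss $\mathcal{L}(w) = (w-3)^2$ and the hyperparameter values are assumptions for illustration. The Adam part follows the commonly published update with bias correction, which the notes do not spell out.

```python
import math

def grad(w):
    return 2.0 * (w - 3.0)                      # gradient of the toy loss (w - 3)^2

def momentum_descent(w, steps=300, lr=0.05, lam=0.9):
    v = 0.0                                     # movement of the last step
    for _ in range(steps):
        v = lam * v - lr * grad(w)              # new movement = decayed old movement - lr * gradient
        w = w + v
    return w

def adam_descent(w, steps=1000, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g         # momentum part: average of gradients
        v = beta2 * v + (1 - beta2) * g * g     # RMSProp part: average of squared gradients
        m_hat = m / (1 - beta1 ** t)            # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_momentum = momentum_descent(0.0)
w_adam = adam_descent(0.0)
print(w_momentum, w_adam)   # both should end close to the minimizer w = 3
```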


**(2). Change loss function**

Cross entropy loss: $-\sum_i y_i \log \hat{y}_i$

Softmax loss: $-\log \frac{e^{z_i}}{\sum_j e^{z_j}}$, where $z$ is the raw output of the network (the logits) and $i$ is the index of the true class. The PyTorch implementation is worth noting: calling `cross_entropy` first applies softmax to the input and then computes the cross entropy loss. Therefore there is no need to add a softmax layer at the output.

The loss function can affect the difficulty of training.
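Combining softmax and cross entropy in one function can be sketched in plain Python. This is an illustrative re-implementation under assumptions, not PyTorch's code: the input is a list of raw scores (logits), the target is a class index, and the maximum is subtracted before exponentiating to avoid overflow.

```python
import math

def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def cross_entropy(logits, target_index):
    """Cross entropy between softmax(logits) and a one-hot target."""
    return -log_softmax(logits)[target_index]

uniform_loss = cross_entropy([0.0, 0.0, 0.0], target_index=1)    # equals log(3)
confident_loss = cross_entropy([1000.0, 0.0], target_index=0)    # no overflow, loss near 0
wrong_loss = cross_entropy([1000.0, 0.0], target_index=1)        # large loss
print(uniform_loss, confident_loss, wrong_loss)
```

Because the softmax is folded into the loss, the model only needs to output the raw scores.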

**(3). Batch normalization**

The error surface is rugged, and the gradient is not smooth. Batch normalization can smooth the error surface and make the gradient more stable.

Batch normalization is a layer that normalizes its input over the batch so that each feature has zero mean and unit variance.
Because the mean and variance depend on all examples in the batch, the whole batch is effectively processed by one very large network, and the gradient also flows through the mean and variance.

The batch normalization layer is usually added after the linear layer and before the activation layer. It works best when the batch size is large enough to estimate the mean and variance reliably.


For inference:

- we do not always have a batch of data at the testing stage
- instead, compute a moving average of the batch means and variances during training, and use it at test time
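The difference between training and inference can be sketched with a minimal NumPy layer. The class name `SimpleBatchNorm`, the moving-average coefficient `momentum=0.1`, and the synthetic data are assumptions for illustration; the sketch also leaves out the learnable scale and shift that practical implementations usually add.

```python
import numpy as np

class SimpleBatchNorm:
    """Hypothetical minimal batch normalization (no learnable scale or shift)."""

    def __init__(self, n_features, momentum=0.1, eps=1e-5):
        self.running_mean = np.zeros(n_features)
        self.running_var = np.ones(n_features)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training):
        if training:
            # statistics of the current batch, one value per feature
            mean, var = x.mean(axis=0), x.var(axis=0)
            # moving averages, stored for use at inference time
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # no batch statistics at inference: use the stored moving averages
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)

rng = np.random.default_rng(0)
bn = SimpleBatchNorm(n_features=4)
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
    out = bn.forward(batch, training=True)

single = rng.normal(loc=5.0, scale=3.0, size=(1, 4))
single_out = bn.forward(single, training=False)   # works for a single example
print(out.mean(axis=0), bn.running_mean, single_out)
```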

Reference:
- how does batch normalization help optimization? https://arxiv.org/pdf/1805.11604.pdf


Many other normalization methods have since been proposed, such as layer normalization, instance normalization, and group normalization.

### 2.5 Overfitting
Overfitting: the model gets good results on the training data, but bad results on the test data. Common remedies:

- parameter sharing (Fully connected -> Convolutional)
- early stopping
- regularization
- dropout


**(1). Regularization**

- add a penalty term to the loss function to be minimized, for example $L_1$ regularization or $L_2$ regularization.
  - $L_2$ is also known as weight decay. Each update decays the weight by a constant factor -> close to 0
  - $L_1$ is the penalty used in Lasso regression. Each update decreases the weight if it is positive and increases it if it is negative -> close to 0
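The two update rules can be compared side by side for a single weight. The sketch assumes the penalties $\frac{\lambda}{2} w^2$ for $L_2$ and $\lambda |w|$ for $L_1$; these forms, the function names, and the numbers are illustrative choices.

```python
def l2_update(w, grad, lr, lam):
    """Gradient step with penalty (lam / 2) * w^2: the weight is multiplied by (1 - lr * lam)."""
    return (1.0 - lr * lam) * w - lr * grad

def sign(w):
    return (w > 0) - (w < 0)

def l1_update(w, grad, lr, lam):
    """Gradient step with penalty lam * |w|: a fixed-size step towards zero."""
    return w - lr * grad - lr * lam * sign(w)

# data gradient set to zero to isolate the effect of the penalty
lr, lam = 0.1, 0.5
print(l2_update(4.0, 0.0, lr, lam), l1_update(4.0, 0.0, lr, lam))     # 4.0 shrinks to about 3.8 and 3.95
print(l2_update(-4.0, 0.0, lr, lam), l1_update(-4.0, 0.0, lr, lam))   # the negative weight moves up towards 0
```

The $L_2$ step is proportional to the weight, so large weights shrink faster, while the $L_1$ step has the same size for every non-zero weight.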


**(2). Dropout**

- training:
  - randomly drop some neurons
- testing:
  - use all neurons, but scale the output by the keep probability
  - if the dropout rate at training is p%, then the weights at testing are scaled by (1-p%)
  - the dropout rate is set per dropout layer, and each neuron in that layer is dropped independently

Dropout is a kind of ensemble method. It is like training multiple networks that share parameters.
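The training and testing behaviour can be sketched with NumPy. The function names and the dropout rate are assumptions for illustration. The sketch scales activations at test time, which has the same effect on the next layer as scaling the outgoing weights; many libraries instead rescale during training ("inverted dropout") so that nothing changes at test time.

```python
import numpy as np

def dropout_train(x, p, rng):
    """Training: drop each activation independently with probability p."""
    mask = rng.random(x.shape) >= p        # True means the neuron is kept
    return x * mask

def dropout_test(x, p):
    """Testing: keep every neuron and scale by the keep probability 1 - p."""
    return x * (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(100_000)
p = 0.3

train_mean = dropout_train(x, p, rng).mean()   # close to 1 - p on average
test_mean = dropout_test(x, p).mean()          # equal to 1 - p
print(train_mean, test_mean)
```

The scaling at test time keeps the expected activation the same as during training, which is why the two printed means agree.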

### 2.6 Mismatch between train and test distribution
