# Deep learning

A deep neural network is used to extract features for a linear basis regression. This avoids manually designing or selecting features for a given problem.

Typical questions to ask:
- How many layers? How many neurons in each layer?
  - Trial and error plus intuition.
- Can the structure be determined automatically?
  - Yes. Research is ongoing, for example evolutionary artificial neural networks, but it is not widely applied.
- Can we design the network structure?
  - Yes. Most popular deep neural network models have their own structural design, such as residual networks, recurrent networks, and convolutional networks. None of these models is a plain fully connected network.

## Backpropagation


### Chain Rule

**Case 1**: 

Given $y = g(x)$ and $z = h(y)$, a change propagates as $\Delta x \rightarrow \Delta y \rightarrow \Delta z$, which gives $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$.


**Case 2**:

Given $x=g(s)$, $y=h(s)$, and $z=k(x,y)$, a change in $s$ reaches $z$ through both $x$ and $y$, so $\frac{dz}{ds} = \frac{\partial z}{\partial x} \frac{dx}{ds} + \frac{\partial z}{\partial y}\frac{dy}{ds}$.
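
As a quick sanity check of Case 2, the sketch below compares the chain-rule formula with a finite-difference estimate. The concrete functions $g(s)=s^2$, $h(s)=\sin s$, and $k(x,y)=xy$ are assumptions chosen only for illustration; they do not come from the notes.

```python
import math

def g(s):            # x = g(s), illustrative choice
    return s ** 2

def h(s):            # y = h(s), illustrative choice
    return math.sin(s)

def k(x, y):         # z = k(x, y), illustrative choice
    return x * y

def dz_ds_chain(s):
    """dz/ds computed with the Case 2 chain rule."""
    x, y = g(s), h(s)
    dz_dx, dz_dy = y, x                  # partial derivatives of k(x, y) = x * y
    dx_ds, dy_ds = 2 * s, math.cos(s)    # derivatives of g and h
    return dz_dx * dx_ds + dz_dy * dy_ds

def dz_ds_numeric(s, eps=1e-6):
    """Central finite difference of z(s) = k(g(s), h(s))."""
    z = lambda t: k(g(t), h(t))
    return (z(s + eps) - z(s - eps)) / (2 * eps)

s0 = 0.7
print(dz_ds_chain(s0), dz_ds_numeric(s0))   # the two values should agree closely
```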

### Forward and Backward Pass

For a neural network, the forward pass computes the output of the network from the input. The backward pass computes the gradient of the loss function with respect to the parameters of the network.

We define a neural network as follows:

- $x$: input
- $x_i$: input to the $i$-th layer
- $z_i$: input to the activation function $\delta_i$
- $w_i$: weight of the $i$-th layer
- $\delta_i$: activation function of the $i$-th layer
- $\mathcal{L}$: loss function

Therefore, we have:

$$ z_i = w_i x_i$$
$$ x_{i+1} = \delta_i(z_i)$$

Procedure: start from the output layer, backpropagate the gradient to the input layer.

$$ \frac{\partial \mathcal{L}}{\partial w_i} = \frac{\partial \mathcal{L}}{\partial x_{i+1}} \frac{\partial x_{i+1}}{\partial z_i} \frac{\partial z_i}{\partial w_i}$$
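
The procedure can be sketched on a toy network. The example below assumes scalar weights, a sigmoid activation in every layer, $z_i = w_i x_i$ with $x_i$ the input to layer $i$, and a squared-error loss; all of these are simplifying assumptions, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights):
    """Forward pass: return the input of every layer followed by the network output."""
    xs = [x]
    for w in weights:
        z = w * xs[-1]            # z_i = w_i * x_i
        xs.append(sigmoid(z))     # x_{i+1} = delta_i(z_i)
    return xs

def loss(output, target):
    return 0.5 * (output - target) ** 2   # assumed squared-error loss

def backward(xs, weights, target):
    """Backward pass: return dL/dw_i for every layer, starting from the output layer."""
    grads = [0.0] * len(weights)
    dL_dx = xs[-1] - target                # dL/dx at the network output
    for i in reversed(range(len(weights))):
        s = xs[i + 1]                      # sigmoid(z_i), reused for its derivative
        dL_dz = dL_dx * s * (1.0 - s)      # dL/dx_{i+1} * dx_{i+1}/dz_i
        grads[i] = dL_dz * xs[i]           # times dz_i/dw_i, which equals x_i
        dL_dx = dL_dz * weights[i]         # gradient passed on to the previous layer
    return grads

weights = [0.5, -1.2, 0.8]
xs = forward(1.5, weights)
grads = backward(xs, weights, target=1.0)
print(grads)
```

Each layer reuses the gradient already computed for the layer after it, which is why the backward pass costs about as much as the forward pass.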



## Tips for Deep Learning

### Overfitting
Overfitting: good results on the training data but bad results on the test data. Common remedies:

- early stopping
- regularization
- dropout
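
A minimal sketch of early stopping, assuming a held-out validation set whose loss is recorded after every epoch. The `patience` parameter and the loss values are hypothetical and only illustrate the idea.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights would be kept."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break                 # no improvement for `patience` epochs: stop training
    return best_epoch

# Hypothetical validation losses: they improve, then rise as the model starts to overfit.
print(early_stopping_epoch([0.9, 0.6, 0.5, 0.55, 0.6, 0.7]))
```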


Regularization:
- Add a penalty term to the loss function to be minimized, for example $L_1$ or $L_2$ regularization.
  - $L_2$ regularization is also known as weight decay. Each update multiplies the weight by a factor slightly smaller than 1, so the weight decays toward 0.
  - $L_1$ regularization is the penalty used in Lasso regression. Each update decreases a positive weight and increases a negative weight by a constant amount, so the weight also moves toward 0.
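
The two update rules can be sketched for a single scalar weight. The penalty forms $\frac{\lambda}{2} w^2$ and $\lambda |w|$, the names `lr` and `lam`, and the numbers used are assumptions made for illustration.

```python
def sign(w):
    return (w > 0) - (w < 0)

def l2_step(w, grad, lr, lam):
    """Gradient step on loss + (lam / 2) * w**2."""
    # Same as shrinking w by the factor (1 - lr * lam), then taking the usual step.
    return (1.0 - lr * lam) * w - lr * grad

def l1_step(w, grad, lr, lam):
    """(Sub)gradient step on loss + lam * |w|."""
    # Subtracts the constant lr * lam from a positive weight, adds it to a negative one.
    return w - lr * grad - lr * lam * sign(w)

# With a zero data gradient, both rules pull the weight toward 0.
w_l2 = w_l1 = 2.0
for _ in range(3):
    w_l2 = l2_step(w_l2, grad=0.0, lr=0.1, lam=0.5)
    w_l1 = l1_step(w_l1, grad=0.0, lr=0.1, lam=0.5)
print(w_l2, w_l1)
```

The $L_2$ shrinkage is proportional to the weight, so large weights shrink fastest, while the $L_1$ shrinkage is the same constant for every nonzero weight.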


Dropout


### Gradient Vanishing
Symptom: bad results even on the training data.

Don't always blame overfitting. Poor results can also mean that the model is not expressive enough or that the training itself is failing.

Gradient vanishing: the gradient becomes smaller and smaller as it is propagated back through more layers. This is because the gradient at a layer is a product of the per-layer gradients between that layer and the output. If each factor is smaller than 1, the product shrinks with every additional layer.
Thus, the layers near the output may have large gradients, learn fast, and converge early, while the first few layers have small gradients, learn slowly, and stay almost random.
As a result, the deeper the network, the harder it is to train.

This is usually caused by using the sigmoid function as the activation.
The sigmoid function compresses the infinite range of real numbers into the finite range $(0,1)$.
From the forward-pass point of view, even if we apply a large perturbation $\Delta w$ to a weight in the first layer (so that $\Delta z$ is large), the resulting change $\Delta x$ in the sigmoid output is small.
After a few layers of sigmoid functions, the effect of that large perturbation on the loss function vanishes.
This means the loss is not sensitive to the weights of the first few layers.
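
The shrinking product can be illustrated numerically. The sketch below assumes, purely for illustration, that every layer has the same pre-activation `z` and the same scalar weight `w`, so each layer contributes the factor $w \cdot \text{sigmoid}'(z)$ to the backpropagated gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def backprop_factor(depth, z=0.0, w=1.0):
    """Product of the per-layer factors w * sigmoid'(z) over `depth` layers."""
    factor = 1.0
    for _ in range(depth):
        factor *= w * sigmoid_grad(z)
    return factor

for depth in (1, 5, 10):
    print(depth, backprop_factor(depth))
```

Because the sigmoid derivative never exceeds 0.25, the factor shrinks geometrically with depth unless the weights are large enough to compensate.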

Since the first few layers have small gradients and the layers near the output have large gradients, an adaptive learning rate can mitigate this problem. For example, we can use the Adam optimizer.

Since the sigmoid function has such limitations, what are good alternatives?

- Change the activation function

ReLU: fast and easy to compute, and can be viewed as the sum of an infinite number of sigmoid functions with different biases.

$$ f(x) = \begin{cases} x & x > 0 \\ 0 & x \leq 0 \end{cases}$$

For a given input, ReLU switches off the neurons whose input is negative (their output is 0), while the remaining active neurons behave linearly, so the network reduces to a thinner linear network for that input.
However, it has a problem called "dead neuron". If the input of a neuron is negative, its gradient is zero; if this holds for every training example, the neuron stops updating and is dead. The effect is most damaging in the last few layers: if the units there are dead, no signal reaches the output layer and no gradient flows back through them, so the network cannot learn.

Leaky ReLU / Parametric ReLU:

$$ f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \leq 0 \end{cases}$$

where $\alpha$ is a small positive number, such as 0.01, for Leaky ReLU. In Parametric ReLU, $\alpha$ is a learnable parameter.
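
A small sketch comparing the two activations and their gradients for scalar inputs. The function names are hypothetical, and `alpha=0.01` follows the example value above.

```python
def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0          # zero gradient for negative inputs

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha        # small but nonzero gradient for negative inputs

for x in (-2.0, 3.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For a negative input the ReLU gradient is exactly zero, whereas the leaky variant still passes a small gradient, so the neuron can keep updating.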


Maxout: a learnable activation function. It can represent any piecewise linear convex activation function, such as ReLU.

$$ f(x) = \max(w_1^T x + b_1, w_2^T x + b_2)$$

where $w_1, w_2, b_1, b_2$ are learnable parameters. If $w_2$ and $b_2$ are zero, it reduces to ReLU.

- The activation function in a maxout network can be any piecewise linear convex function.
- The number of pieces depends on the number of elements in a group. For example, with 3 elements in a group, the activation function is a piecewise linear convex function with up to 3 pieces.

How to differentiate through a $\max()$ operator?
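
One common answer is that, for a given input, $\max()$ simply selects one of its arguments, so the gradient flows only through the winning piece and the other pieces receive zero gradient. The sketch below illustrates this for a single maxout unit; the function names and the `(w, b)` list layout are assumptions made for the example.

```python
def maxout(x, pieces):
    """pieces is a list of (w, b). Return the max value and the index of the winning piece."""
    values = [sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in pieces]
    k = max(range(len(values)), key=values.__getitem__)
    return values[k], k

def maxout_grads(x, pieces, upstream=1.0):
    """Gradients of the output w.r.t. each (w, b); only the winning piece is nonzero."""
    _, k = maxout(x, pieces)
    grads = []
    for j, (w, _) in enumerate(pieces):
        if j == k:
            grads.append(([upstream * xi for xi in x], upstream))
        else:
            grads.append(([0.0] * len(w), 0.0))
    return grads

x = [1.0, -2.0]
pieces = [([0.5, 0.5], 0.0), ([0.0, 0.0], 0.0)]   # second piece is all zeros: ReLU-like
print(maxout(x, pieces))
print(maxout_grads(x, pieces))
```

Different inputs select different pieces, so over a whole training set every piece can still receive gradient.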



- Adaptive learning rate

RMSProp: uses a decayed moving average of the previous squared gradients to adjust the learning rate.

$$ w^1 \leftarrow w^0 - \frac{\eta}{\sqrt{v^0 + \epsilon}} g^0, \hspace{5pt} v^0 = (g^0)^2$$
$$ w^2 \leftarrow w^1 - \frac{\eta}{\sqrt{v^1 + \epsilon}} g^1, \hspace{5pt} v^1 = \alpha v^0 + (1-\alpha)(g^1)^2$$

where $v^t$ is the moving average of the squared gradient, $\alpha$ is the decay rate, and $\epsilon$ is a small number to avoid division by zero.
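
A minimal RMSProp sketch for a single scalar parameter, assuming $v$ is a decayed moving average of squared gradients initialised with the first squared gradient. The quadratic test loss and the hyperparameter values are hypothetical.

```python
import math

def rmsprop(grad_fn, w, lr=0.1, alpha=0.9, eps=1e-8, steps=100):
    """Minimise a scalar function with RMSProp, given its gradient function."""
    v = None
    for _ in range(steps):
        g = grad_fn(w)
        # v starts as g**2, then becomes alpha * v + (1 - alpha) * g**2
        v = g * g if v is None else alpha * v + (1.0 - alpha) * g * g
        w = w - lr / math.sqrt(v + eps) * g
    return w

# Hypothetical test problem: L(w) = (w - 3)^2, so the gradient is 2 * (w - 3).
grad = lambda w: 2.0 * (w - 3.0)
w_final = rmsprop(grad, w=0.0)
print(w_final)
```

Because the step is divided by the recent gradient magnitude, the first step has size close to `lr` regardless of how large the gradient is.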


Adam: momentum + RMSProp. The momentum part takes the previous movement into account, which effectively accumulates all past gradients:

- start at point $w^0$
- movement of last step: $v^0 = 0$
- compute gradient at $w^0$: $g^0 = \nabla \mathcal{L} (w^0)$
- movement $v^1 = \lambda v^0 -\eta g^0$
- move to $w^1 = w^0 + v^1$
- compute gradient at $w^1$: $g^1 = \nabla \mathcal{L} (w^1)$
- movement $v^2 = \lambda v^1 -\eta g^1$
- move to $w^2 = w^1 + v^2$

where $\lambda$ is the momentum coefficient and $\eta$ is the learning rate.
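
The momentum steps listed above can be sketched as a loop for one scalar parameter. This covers only the momentum part, not the full Adam update; the quadratic test loss and the values of `lr` and `lam` are hypothetical.

```python
def momentum_descent(grad_fn, w, lr=0.1, lam=0.9, steps=200):
    """Gradient descent with momentum; returns the list of visited points."""
    v = 0.0                        # movement of the last step, initially 0
    path = [w]
    for _ in range(steps):
        g = grad_fn(w)             # gradient at the current point
        v = lam * v - lr * g       # new movement = lam * old movement - lr * gradient
        w = w + v                  # move by the new movement
        path.append(w)
    return path

# Hypothetical test problem: L(w) = (w - 3)^2, so the gradient is 2 * (w - 3).
path = momentum_descent(lambda w: 2.0 * (w - 3.0), w=0.0)
print(path[1], path[2], path[-1])
```

Unrolling the recursion shows that each movement is a weighted sum of all past gradients, with older gradients discounted by powers of `lam`.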



### Gradient Exploding
