# General Guide
![Guide to optimization algorithms](../img/general_guide_of_DL.png)

# Components of Deep Learning
1. Model
   1. LinearRegression
   2. NeuralNetwork
   3. LogisticRegression
      - sigmoid
   4. Classification
      - softmax
   5. ConvolutionalNeuralNetwork
   6. ......
2. Loss
   - Loss = $\frac{1}{N}\sum_n e_n$
   - Ways to compute the per-example error $e$ ($\hat y$ is the label, $y'$ is the model output)
     - MeanSquareError
        - $e=\sum_i (\hat y_i - y_i')^2$
     - Cross-entropy
        - $e=-\sum_i \hat y_i\ln y_i'$
        - Minimizing cross-entropy is equivalent to maximizing likelihood (maximum likelihood estimation)
        - Usually used together with the softmax function
3. Optimizer
   1. GradientDescent
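The two per-example errors listed under Loss can be sketched in plain Python. This is a minimal illustration, assuming a one-hot label `y_hat` and a softmax output `y_pred`; the function names and the `eps` guard are illustrative choices, not part of the notes.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]

def mean_square_error(y_hat, y_pred):
    """e = sum_i (y_hat_i - y'_i)^2 for one example."""
    return sum((t - p) ** 2 for t, p in zip(y_hat, y_pred))

def cross_entropy(y_hat, y_pred, eps=1e-12):
    """e = -sum_i y_hat_i * ln(y'_i); eps (an assumption) guards against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_hat, y_pred))

y_hat = [0.0, 1.0, 0.0]            # one-hot label
y_pred = softmax([1.0, 2.0, 0.5])  # model output after softmax
print(mean_square_error(y_hat, y_pred), cross_entropy(y_hat, y_pred))
```

The total loss is then the average of `e` over all `N` examples.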

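## Problems Gradient Descent May Encounter
1. Saddle points
    - A point where the gradient is zero (a critical point) is not necessarily a local extremum
    - Compute the eigenvalues $\lambda$ of the Hessian matrix to decide whether the point is a local minimum
      - $\lambda$ both positive and negative: saddle point
      - All $\lambda$ negative: local maximum
      - All $\lambda$ positive: local minimum
    - Ways to escape
      - Move along the eigenvector of a negative eigenvalue (rarely used in practice; hard to compute)
      - Small Batch
      - Momentum
2. Learning rate
    - After the loss becomes very low, the gradient keeps oscillating within a range
    - Use an Adaptive Learning Rate to adjust the LR so that the loss function can reach its minimum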
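The Hessian eigenvalue test for critical points can be sketched for a function of two variables. The helper names and the tolerance `tol` are hypothetical; the closed-form eigenvalues are only valid for a symmetric 2x2 matrix `[[a, b], [b, c]]`.

```python
import math

def eigenvalues_2x2_symmetric(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    return mean - radius, mean + radius

def classify_critical_point(a, b, c, tol=1e-12):
    """Classify a critical point from the signs of the Hessian eigenvalues."""
    lo, hi = eigenvalues_2x2_symmetric(a, b, c)
    if lo > tol:
        return "local minimum"   # all eigenvalues positive
    if hi < -tol:
        return "local maximum"   # all eigenvalues negative
    if lo < -tol and hi > tol:
        return "saddle point"    # mixed signs
    return "inconclusive"        # some eigenvalue is (numerically) zero

# f(x, y) = x^2 - y^2 has Hessian [[2, 0], [0, -2]] at the origin
print(classify_critical_point(2.0, 0.0, -2.0))
```

For a real network the Hessian is far too large for this kind of direct computation, which is consistent with the note that this approach is rarely used in practice.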

## Batch
- The batch size is a hyperparameter
- The full dataset is split into several batches; one epoch consists of several updates, and each update uses one batch
<table>
    <tr>
        <th></th>
        <th>Small Batch</th>
        <th>Large Batch</th>
    </tr>
    <tr>
        <td>Speed for one update (no parallel)</td>
        <td>Faster</td>
        <td>Slower</td>
    </tr>
    <tr>
        <td>Speed for one update (with parallel)</td>
        <td>Same</td>
        <td>Same (if not too large)</td>
    </tr>
    <tr>
        <td>time for one epoch</td>
        <td>Slower</td>
        <td>Faster</td>
    </tr>
    <tr>
        <td>Gradient</td>
        <td>Noisy</td>
        <td>Stable</td>
    </tr>
    <tr>
        <td>Optimization</td>
        <td>Better</td>
        <td>Worse</td>
    </tr>
    <tr>
        <td>Generalization</td>
        <td>Better</td>
        <td>Worse</td>
    </tr>
</table>
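
The relation between dataset, batch, update, and epoch described in this section can be sketched as follows. The 10-element list stands in for a real dataset, and `iterate_batches` is a hypothetical helper, not a library function.

```python
import math

def iterate_batches(dataset, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]

dataset = list(range(10))  # stand-in for 10 training examples
batch_size = 4             # hyperparameter

batches = list(iterate_batches(dataset, batch_size))
# One epoch = one pass over all batches = this many parameter updates
updates_per_epoch = math.ceil(len(dataset) / batch_size)
print(updates_per_epoch, batches)
```

A smaller `batch_size` gives more updates per epoch, which matches the "time for one epoch" row of the table.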

## Momentum
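- The momentum coefficient is a hyperparameter
- Acts as inertia for gradient descent
- Movement with momentum: movement of the last step minus the gradient at present
- Every gradient descent step adds the movement of the previous step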
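A minimal sketch of the momentum update on a toy loss $L(\theta)=\theta^2$. The coefficient name `lam`, the learning rate `eta`, and their default values are assumptions chosen for illustration.

```python
def gradient(theta):
    """Gradient of the toy loss L(theta) = theta^2."""
    return 2.0 * theta

def momentum_descent(theta, eta=0.1, lam=0.9, steps=200):
    movement = 0.0  # there is no previous movement at the start
    for _ in range(steps):
        # new movement = decayed last movement minus the current gradient step
        movement = lam * movement - eta * gradient(theta)
        theta = theta + movement
    return theta

print(momentum_descent(5.0))
```

Because each movement carries part of the previous one, the parameter can keep moving where the current gradient is close to zero, which is why momentum is listed as a way to escape saddle points.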

## Adaptive Learning Rate

### Root Mean Square Method (used in Adagrad)

$\theta_i^1=\theta_i^0-\frac{\eta}{\sigma_i^0}*g_i^0$, $\sigma_i^0=\sqrt{(g_i^0)^2}=|g_i^0|$ 

$\theta_i^2=\theta_i^1-\frac{\eta}{\sigma_i^1}*g_i^1$, $\sigma_i^1=\sqrt{\frac{1}{2}[(g_i^0)^2+(g_i^1)^2]}$ 

$\theta_i^3=\theta_i^2-\frac{\eta}{\sigma_i^2}*g_i^2$, $\sigma_i^2=\sqrt{\frac{1}{3}[(g_i^0)^2+(g_i^1)^2+(g_i^2)^2]}$ 

......

$\theta_i^{t+1}=\theta_i^t-\frac{\eta}{\sigma_i^t}*g_i^t$, $\sigma_i^t=\sqrt{\frac{1}{t+1}\sum_{k=0}^t(g_i^k)^2}$ 
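The root mean square update can be sketched for a single parameter $\theta_i$. The toy gradient, the default values, and the small `eps` added to the denominator are assumptions for illustration; the formulas in the notes do not include `eps`.

```python
import math

def rms_descent(grad_fn, theta, eta=0.1, steps=100, eps=1e-8):
    """Gradient descent for one parameter with a root-mean-square scaled step."""
    sum_sq = 0.0  # running sum of squared gradients g^0 ... g^t
    for t in range(steps):
        g = grad_fn(theta)
        sum_sq += g * g
        sigma = math.sqrt(sum_sq / (t + 1))      # sigma^t
        theta = theta - eta / (sigma + eps) * g  # eps (assumed) avoids division by zero
    return theta

def toy_gradient(theta):
    """Gradient of the toy loss L(theta) = theta^2."""
    return 2.0 * theta

one_step = rms_descent(toy_gradient, 5.0, steps=1)
final = rms_descent(toy_gradient, 5.0, steps=100)
print(one_step, final)
```

On the first step $\sigma_i^0=|g_i^0|$, so the parameter moves by almost exactly $\eta$ regardless of how large the gradient is.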


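### RMSProp Method
- hyperParameter: $\alpha \quad (0<\alpha<1)$ controls how strongly the current gradient influences the learning rate (the previous $\sigma$ is weighted by $\alpha$, the current gradient by $1-\alpha$)
- When the gradient is small, $\sigma_i^t$ is small, so $\frac{\eta}{\sigma_i^t}$ (the learning rate) is large, and vice versa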

$\theta_i^1=\theta_i^0-\frac{\eta}{\sigma_i^0}*g_i^0$, $\sigma_i^0=\sqrt{(g_i^0)^2}=|g_i^0|$ 

$\theta_i^2=\theta_i^1-\frac{\eta}{\sigma_i^1}*g_i^1$, $\sigma_i^1=\sqrt{\alpha(\sigma_i^0)^2+(1-\alpha)(g_i^1)^2}$ 

$\theta_i^3=\theta_i^2-\frac{\eta}{\sigma_i^2}*g_i^2$, $\sigma_i^2=\sqrt{\alpha(\sigma_i^1)^2+(1-\alpha)(g_i^2)^2}$ 

......

$\theta_i^{t+1}=\theta_i^t-\frac{\eta}{\sigma_i^t}*g_i^t$, $\sigma_i^t=\sqrt{\alpha(\sigma_i^{t-1})^2+(1-\alpha)(g_i^t)^2}$ 
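A sketch of the RMSProp recurrence for one parameter, following the formulas above with $\sigma_i^0=|g_i^0|$. The toy gradient, the default values, and `eps` are assumptions for illustration.

```python
import math

def rmsprop_descent(grad_fn, theta, eta=0.01, alpha=0.9, steps=500, eps=1e-8):
    """Gradient descent for one parameter with an RMSProp-scaled step."""
    sigma_sq = None
    for _ in range(steps):
        g = grad_fn(theta)
        if sigma_sq is None:
            sigma_sq = g * g  # sigma^0 = |g^0|
        else:
            # exponentially weighted mix of the old sigma^2 and the new g^2
            sigma_sq = alpha * sigma_sq + (1 - alpha) * g * g
        theta = theta - eta / (math.sqrt(sigma_sq) + eps) * g  # eps is an assumption
    return theta

def toy_gradient(theta):
    """Gradient of the toy loss L(theta) = theta^2."""
    return 2.0 * theta

print(rmsprop_descent(toy_gradient, 1.0))
```

Unlike the plain root mean square, old gradients fade out here, so $\sigma_i^t$ can shrink again once recent gradients become small.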

### Learning Rate Scheduling
- Learning Rate Decay: the learning rate decays over time
- Warm Up: the learning rate increases first and then decreases
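One possible schedule combining both ideas is a linear warm-up followed by exponential decay. The shape of the curve, the function name, and every default value are hypothetical; the notes do not prescribe a particular schedule.

```python
def learning_rate(step, base_lr=0.1, warmup_steps=10, decay=0.99):
    """Hypothetical schedule: linear warm-up, then exponential decay."""
    if step < warmup_steps:
        # Warm Up: ramp linearly up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Learning Rate Decay: shrink by a constant factor every step
    return base_lr * decay ** (step - warmup_steps)

for step in (0, 5, 9, 10, 50, 200):
    print(step, learning_rate(step))
```

The returned value plays the role of a time-dependent $\eta^t$ in the update rule.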

### Adam Optimizer: RMSProp + Momentum
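A sketch of how the two ingredients combine for one parameter. It follows the commonly published form of Adam, including bias correction, which these notes do not spell out; the default values and the toy gradient are assumptions.

```python
import math

def adam_descent(grad_fn, theta, eta=0.05, beta1=0.9, beta2=0.999, steps=500, eps=1e-8):
    """Gradient descent for one parameter with an Adam-style update."""
    m = 0.0  # momentum-like running average of gradients
    v = 0.0  # RMSProp-like running average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta

def toy_gradient(theta):
    """Gradient of the toy loss L(theta) = theta^2."""
    return 2.0 * theta

print(adam_descent(toy_gradient, 1.0))
```

Here `m_hat` corresponds to the momentum term and `sqrt(v_hat)` to the root mean square $\sigma_i^t$.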

## Summary of Optimization
### (Vanilla) Gradient Descent
$\theta_i^{t+1} \leftarrow \theta_i^t-\eta g_i^t$

### Various Improvements
$\theta_i^{t+1} \leftarrow \theta_i^t-\frac{\eta^t}{\sigma_i^t}m_i^t$
- $\eta^t$: Learning Rate Decay or Warm Up
- $\sigma_i^t$: Root Mean Square
- $m_i^t$: Momentum