# Day 12

> [DL Basic] [Optimization](https://github.com/changwoomon/Boostcamp-AI-Tech/blob/main/Week%203/Day%2012/optm.ipynb)

### Gradient Descent
First-order iterative optimization algorithm for finding a local minimum of a differentiable function.

It uses only first-order derivatives, minimizes iteratively, and can only be expected to converge to a local minimum
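
A minimal sketch of the update loop described above. The objective $f(x)=(x-3)^2$, the learning rate, and the step count are assumptions chosen only for illustration.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeat x <- x - lr * grad(x), using only first-derivative information."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


# Hypothetical objective: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
# Its only minimum is at x = 3, so the iterates should approach 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

On a non-convex function the same loop would stop at whichever local minimum the starting point leads to.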

### Important Concepts in **Optimization**
- Generalization (일반화)
    - How well the learned model will behave on unseen data
    - The goal is for performance to stay the same even when the input data changes
- [Under-fitting vs. Over-fitting](https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html)
- [Cross Validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))
    - Cross-validation is a model validation technique for assessing how the model will generalize to an independent (test) dataset
    - [train / validation / test](https://github.com/boostcamp-ai-tech-4/peer-session/issues/47#issuecomment-771590550)
        - train / valid -> used during training
            - train data : directly affects model training
            - valid data : indirectly affects model training
        - test -> must not be used during training
            - test data : no effect on training
- [Bias-variance tradeoff](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff)
    - Bias : how far the predictions are, on average, from the desired target
    - Variance : how consistent the outputs are when similar inputs are given
    $$cost = bias^{2}+variance+noise$$
- Bootstrapping
    - Bootstrapping is any test or metric that uses random sampling with replacement
- Bagging vs. Boosting
    - Bagging (**B**ootstrapping **agg**regat**ing**)
        - Multiple models are being trained with bootstrapping
        - Base classifiers are fitted on random subsets, and their individual predictions are aggregated (voting or averaging)
    - Boosting
        - It focuses on those specific training samples that are hard to classify
        - A strong model is built by combining weak learners in sequence where each learner learns from the mistakes of the previous weak learner
<br><br>
<img src="https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1542651255/image_2_pu8tu6.png" width="500" height="300">
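
Bootstrapping and bagging can be sketched in a few lines. The "model" below is deliberately trivial (the sample mean), and the names `bootstrap_sample` and `bagged_mean` are hypothetical helpers, not part of any library.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def bagged_mean(data, n_models=50, seed=0):
    """Fit one trivial model (the sample mean) per bootstrap sample, then average."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        predictions.append(sum(sample) / len(sample))
    # Aggregation step: averaging for regression, voting would be used for classes.
    return sum(predictions) / len(predictions)


data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
estimate = bagged_mean(data)
```

Boosting differs in that the models are trained one after another, each concentrating on the samples the previous ones got wrong, so it cannot be parallelized the same way.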

### Gradient Descent Method
- Stochastic gradient descent
    - Update with the gradient computed from **a single sample**
- Mini-batch gradient descent
    - Update with the gradient computed from **a subset of data**
- Batch gradient descent
    - Update with the gradient computed from **the whole data**
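
The three variants differ only in how many samples feed each update, which the sketch below exposes as a `batch_size` argument. The toy data (generated by $y=2x$), the learning rate, and the function name `fit_slope` are assumptions for illustration.

```python
import random


def fit_slope(xs, ys, batch_size, lr=0.01, epochs=200, seed=0):
    """Fit y ~ w * x by gradient descent on mean squared error.

    batch_size = 1            -> stochastic gradient descent
    1 < batch_size < len(xs)  -> mini-batch gradient descent
    batch_size = len(xs)      -> batch gradient descent
    """
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w * x - y)^2) over the batch with respect to w.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x, so the ideal slope is 2
w_sgd = fit_slope(xs, ys, batch_size=1)
w_mini = fit_slope(xs, ys, batch_size=2)
w_full = fit_slope(xs, ys, batch_size=len(xs))
```

Because this toy data is noise-free, all three settings reach the same slope. With noisy data, smaller batches give noisier updates.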

### Optimizer
*[An overview of gradient descent optimization algorithms](https://arxiv.org/pdf/1609.04747.pdf)*

*[[딥러닝]딥러닝 최적화 알고리즘 알고 쓰자. 딥러닝 옵티마이저(optimizer) 총정리](https://hiddenbeginner.github.io/deeplearning/2019/09/22/optimization_algorithms_in_deep_learning.html)*

*[수식과 코드로 보는 경사하강법(SGD,Momentum,NAG,Adagrad,RMSprop,Adam,AdaDelta)](https://twinw.tistory.com/247)*

*[Gradient Descent Optimization Algorithms 정리](http://shuuki4.github.io/deep%20learning/2016/05/20/Gradient-Descent-Algorithm-Overview.html)*

- Gradient Descent
    $$W_{t+1}\leftarrow W_t-\eta g_t$$
    - Stochastic gradient descent
- Momentum
    $$a_{t+1} \leftarrow \beta a_t+g_t$$
    $$W_{t+1}\leftarrow W_t-\eta a_{t+1}$$
- Nesterov accelerated gradient
    $$a_{t+1} \leftarrow \beta a_t+\nabla\mathcal L(W_t-\eta\beta a_t)$$
    $$W_{t+1}\leftarrow W_t-\eta a_{t+1}$$
- Adagrad
    - Adagrad adapts the learning rate, performing larger updates for infrequent parameters and smaller updates for frequent ones
    - $G_t$ accumulates the squared gradients seen so far
    $$G_t=G_{t-1}+g_t^2$$
    $$W_{t+1}=W_t-\frac{\eta}{\sqrt{G_t+\epsilon}}g_t$$
- Adadelta
    - Adadelta extends Adagrad to counter its monotonically decreasing learning rate by restricting the accumulation window
    $$G_t=\gamma G_{t-1}+(1-\gamma)g_t^2$$
    $$W_{t+1}=W_t-\frac{\sqrt{H_{t-1}+\epsilon}}{\sqrt{G_t+\epsilon}}g_t$$
    $$H_t = \gamma H_{t-1}+(1-\gamma)(\Delta W_t)^2$$
    
    There is <span style="color:red">no learning rate</span> in Adadelta
    
- RMSprop
    - RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in his lecture
    $$G_t=\gamma G_{t-1}+(1-\gamma)g_t^2$$
    $$W_{t+1}=W_t-\frac{\eta}{\sqrt{G_t+\epsilon}}g_t$$
- Adam
    - Adaptive Moment Estimation (Adam) leverages both past gradients and squared gradients
    $$m_t=\beta_1m_{t-1}+(1-\beta_1)g_t$$
    $$v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$$
    $$W_{t+1}=W_t-\frac{\eta}{\sqrt{v_t+\epsilon}}\frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t}m_t$$
    
    Adam effectively combines momentum with an adaptive learning rate approach
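
The update rules above can be sketched as scalar Python loops. This is an illustrative sketch under assumptions: the objective $\mathcal L(w)=(w-3)^2$, all hyperparameter values, and the function names are chosen here, and Adadelta is omitted for brevity.

```python
import math


def grad(w):
    # Hypothetical objective L(w) = (w - 3)^2, so the gradient is 2 * (w - 3).
    return 2.0 * (w - 3.0)


def momentum(w, steps=500, lr=0.05, beta=0.9):
    a = 0.0
    for _ in range(steps):
        a = beta * a + grad(w)                  # a_{t+1} = beta * a_t + g_t
        w = w - lr * a                          # W_{t+1} = W_t - lr * a_{t+1}
    return w


def nesterov(w, steps=500, lr=0.05, beta=0.9):
    a = 0.0
    for _ in range(steps):
        a = beta * a + grad(w - lr * beta * a)  # gradient at the look-ahead point
        w = w - lr * a
    return w


def adagrad(w, steps=500, lr=0.5, eps=1e-8):
    G = 0.0
    for _ in range(steps):
        g = grad(w)
        G += g * g                              # running sum of squared gradients
        w = w - lr / math.sqrt(G + eps) * g
    return w


def rmsprop(w, steps=500, lr=0.02, gamma=0.9, eps=1e-8):
    G = 0.0
    for _ in range(steps):
        g = grad(w)
        G = gamma * G + (1 - gamma) * g * g     # exponential moving average
        w = w - lr / math.sqrt(G + eps) * g
    return w


def adam(w, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g         # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g     # second moment (adaptive scale)
        # Bias correction folded into the step size, as in the formula above.
        step = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        w = w - step / math.sqrt(v + eps) * m
    return w


results = {f.__name__: f(0.0) for f in (momentum, nesterov, adagrad, rmsprop, adam)}
```

Each function starts from `w = 0.0` and should end close to the minimizer at 3. How close depends on the assumed hyperparameters, so the values are not a comparison of the methods.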

### Regularization
- Early stopping
- Parameter norm penalty
    - It adds smoothness to the function space
    $$total\;cost = loss(\mathcal D;W)+\frac{\alpha}{2}\lVert{W}\rVert^2_2$$
    The goal is to keep the parameters from growing too large
- Data augmentation
    - More data is always welcome
    - However, in most cases, training data are given in advance
    - In such cases, we need data augmentation
- Noise robustness
    - Add random noise to inputs or weights
- Label smoothing
    - [**Mix-up**](https://arxiv.org/pdf/1710.09412.pdf) constructs augmented training examples by mixing both input and output of two randomly selected training data
    - [**CutMix**](https://arxiv.org/pdf/1905.04899.pdf) constructs augmented training examples by mixing inputs with cut and paste and outputs with soft labels of two randomly selected training data
- Dropout
    - In each forward pass, randomly set some neurons to zero
- [Batch normalization](https://arxiv.org/pdf/1502.03167.pdf)
    - Batch normalization computes the empirical mean and variance independently for each dimension (layer) and normalizes
    $$\mu_B=\frac{1}{m}\sum_{i=1}^mx_i$$
    $$\sigma_B^2=\frac{1}{m}\sum_{i=1}^m(x_i-\mu_B)^2$$
    $$\widehat{x}_i=\frac{x_i-\mu_B}{\sqrt{\sigma^2_B+\epsilon}}$$
    - There are different variants of normalization
        - [Group Normalization, 2018](https://arxiv.org/pdf/1803.08494.pdf)
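
A few of the techniques above can be sketched on plain Python lists. The function names are hypothetical, the rescaling of surviving activations in `dropout` ("inverted dropout") is an assumption not stated in these notes, and `batch_norm` omits the learnable scale and shift of the full method.

```python
import math
import random


def l2_penalty(weights, alpha):
    """Parameter norm penalty: (alpha / 2) * ||W||_2^2."""
    return 0.5 * alpha * sum(w * w for w in weights)


def dropout(activations, p, rng):
    """Zero each activation with probability p.

    Survivors are scaled by 1 / (1 - p), an assumption added here so that
    the expected activation stays unchanged.
    """
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def batch_norm(batch, eps=1e-5):
    """Normalize one feature across a mini-batch: (x - mu_B) / sqrt(var_B + eps)."""
    m = len(batch)
    mu = sum(batch) / m
    var = sum((x - mu) ** 2 for x in batch) / m
    return [(x - mu) / math.sqrt(var + eps) for x in batch]


def mixup(x1, y1, x2, y2, lam):
    """Mix two examples: inputs and one-hot labels share the same weight lam."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y


rng = random.Random(0)
penalty = l2_penalty([1.0, -2.0], alpha=0.1)   # 0.5 * 0.1 * (1 + 4) = 0.25
dropped = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng)
normed = batch_norm([1.0, 2.0, 3.0, 4.0])
mixed_x, mixed_y = mixup([0.0, 2.0], [1.0, 0.0], [4.0, 6.0], [0.0, 1.0], lam=0.25)
```

After `batch_norm` the values have mean 0 and variance close to 1. The mixed label `mixed_y` is a soft label rather than a one-hot vector.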

> [AI Math] First Steps with CNN

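### Understanding the Convolution Operation
In a multi-layer perceptron (MLP), every neuron is connected to all inputs through a linear model followed by an activation function (fully connected)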
$$h_i=\sigma\left(\sum_{j=1}^pW_{ij}x_j\right)$$

In contrast, the convolution operation slides a kernel across the input vector, applying the same linear model and activation function at each position
$$h_i=\sigma\left(\sum_{j=1}^kV_jx_{i+j-1}\right)$$
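
The two formulas can be compared side by side. This sketch assumes a sigmoid for $\sigma$ and uses 0-based indexing, so $x_{i+j-1}$ becomes `x[i + j]`. The weights are arbitrary illustrative values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fully_connected(W, x):
    """h_i = sigma(sum_j W[i][j] * x[j]): every output has its own weight row."""
    return [sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x))) for row in W]


def conv1d_layer(V, x):
    """h_i = sigma(sum_j V[j] * x[i + j]): one shared kernel V slides over x."""
    k = len(V)
    return [sigmoid(sum(V[j] * x[i + j] for j in range(k)))
            for i in range(len(x) - k + 1)]


x = [1.0, 2.0, 3.0, 4.0]
h_fc = fully_connected([[0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0]], x)
h_conv = conv1d_layer([1.0, -1.0], x)
```

The fully connected layer needs one weight per (output, input) pair, while the convolution reuses the same `len(V)` weights at every position.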

Mathematical meaning of the convolution operation: extracting or filtering information from a signal by <span style="color:red">locally amplifying or attenuating it with a kernel</span>
- Convolution
    - continuous
        $$\left[f*g\right](x)=\int_{\mathbb R^d}^{}f(z)g(x-z)dz=\int_{\mathbb R^d}^{}f(x-z)g(z)dz=\left[g*f\right](x)$$
    - discrete
        $$\left[f*g\right](i)=\sum_{a\in\mathbb Z^d}f(a)g(i-a)=\sum_{a\in\mathbb Z^d}f(i-a)g(a)=\left[g*f\right](i)$$
- Cross-correlation
    - continuous
        $$\left[f*g\right](x)=\int_{\mathbb R^d}^{}f(z)g(x+z)dz=\int_{\mathbb R^d}^{}f(z-x)g(z)dz$$
    - discrete
        $$\left[f*g\right](i)=\sum_{a\in\mathbb Z^d}f(a)g(i+a)=\sum_{a\in\mathbb Z^d}f(a-i)g(a)$$
    - Unlike convolution, cross-correlation is not commutative in general

The operation used in CNNs is actually cross-correlation, not convolution
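
The difference between the two operations is only whether the kernel is flipped, as this small sketch illustrates. It assumes 1D lists, no padding, and stride 1. The helper names are hypothetical.

```python
def cross_correlate(kernel, signal):
    """out[i] = sum_a kernel[a] * signal[i + a] wherever the kernel fully fits."""
    k = len(kernel)
    return [sum(kernel[a] * signal[i + a] for a in range(k))
            for i in range(len(signal) - k + 1)]


def convolve(kernel, signal):
    """True convolution over the same positions: the kernel is flipped first."""
    return cross_correlate(kernel[::-1], signal)


signal = [1.0, 2.0, 4.0, 8.0]
kernel = [1.0, 0.0, -1.0]
corr = cross_correlate(kernel, signal)   # [1 - 4, 2 - 8] = [-3.0, -6.0]
conv = convolve(kernel, signal)          # [4 - 1, 8 - 2] = [3.0, 6.0]
```

Since a CNN learns its kernel values, it can learn the flipped kernel just as easily, so the distinction does not affect what the network can represent.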

The kernel does not change as it moves across the domain (translation invariant) and is applied locally to the given signal

### Convolution in Various Dimensions
- 1D-Conv
$$\left[f*g\right](i)=\sum_{p=1}^df(p)g(i+p)$$
- 2D-Conv
$$\left[f*g\right](i,j)=\sum_{p,q}^{}f(p,q)g(i+p,j+q)$$

With input size $(H,W)$, kernel size $(K_H,K_W)$, and output size $(O_H,O_W)$, the output size is computed as follows
$$O_H=H-K_H+1$$
$$O_W=W-K_W+1$$
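
The 2D formula and the output-size rule can be checked with a short sketch. It assumes nested Python lists, no padding, and stride 1. `conv2d` is a hypothetical helper name.

```python
def conv2d(image, kernel):
    """2D cross-correlation without padding: out[i][j] = sum kernel[p][q] * image[i+p][j+q]."""
    H, W = len(image), len(image[0])
    KH, KW = len(kernel), len(kernel[0])
    OH, OW = H - KH + 1, W - KW + 1          # output size from the formulas above
    return [[sum(kernel[p][q] * image[i + p][j + q]
                 for p in range(KH) for q in range(KW))
             for j in range(OW)]
            for i in range(OH)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
kernel = [[1, 0],
          [0, 1]]
out = conv2d(image, kernel)   # shape (3 - 2 + 1, 4 - 2 + 1) = (2, 3)
```

Here a $3\times4$ input with a $2\times2$ kernel gives a $2\times3$ output, matching $O_H=H-K_H+1$ and $O_W=W-K_W+1$.
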
- 3D-Conv
$$\left[f*g\right](i,j,k)=\sum_{p,q,r}^{}f(p,q,r)g(i+p,j+q,k+r)$$

A 3D convolution can be thought of as applying a 2D convolution once per channel (three times for a three-channel input) and summing the results

In the 3D-Conv operation, *the kernel must have the same number of channels as the input*
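
The per-channel view can be sketched as follows, assuming a channels-first layout (a list of 2D maps), no padding, and stride 1. Both helper names are hypothetical.

```python
def conv2d(image, kernel):
    """2D cross-correlation without padding on nested lists."""
    KH, KW = len(kernel), len(kernel[0])
    OH, OW = len(image) - KH + 1, len(image[0]) - KW + 1
    return [[sum(kernel[p][q] * image[i + p][j + q]
                 for p in range(KH) for q in range(KW))
             for j in range(OW)]
            for i in range(OH)]


def conv_multichannel(volume, kernel):
    """Apply one 2D kernel slice per input channel and sum the resulting maps."""
    if len(volume) != len(kernel):
        raise ValueError("kernel and input must have the same number of channels")
    per_channel = [conv2d(img, ker) for img, ker in zip(volume, kernel)]
    OH, OW = len(per_channel[0]), len(per_channel[0][0])
    return [[sum(ch[i][j] for ch in per_channel) for j in range(OW)]
            for i in range(OH)]


# Three channels of size 2 x 2, and a kernel with three matching 2 x 2 slices.
volume = [[[1, 2], [3, 4]],
          [[1, 1], [1, 1]],
          [[0, 1], [0, 1]]]
kernel = [[[1, 0], [0, 1]],
          [[1, 1], [1, 1]],
          [[2, 0], [0, 0]]]
out = conv_multichannel(volume, kernel)   # 5 + 4 + 0 -> [[9]]
```

One such kernel produces a single output channel. Producing several output channels would take several kernels, one per output channel.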

### Understanding Backpropagation through Convolution
Because the kernel is applied in common to all of the input data, a convolution operation also appears when computing backpropagation
$$\frac{\partial}{\partial x}\left[f*g\right](x)=\frac{\partial}{\partial x}\int_{\mathbb R^d}f(y)g(x-y)dy=\int_{\mathbb R^d}f(y)\frac{\partial g}{\partial x}(x-y)dy=\left[f*g^\prime\right](x)$$

*The same holds in the discrete case*
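
One way to see this in the discrete case is to compute the gradient with respect to the kernel and check it numerically. This sketch assumes the CNN-style operation $o_i=\sum_a w_a x_{i+a}$ and a hypothetical loss $L=\frac12\sum_i o_i^2$, so that $\delta_i=\partial L/\partial o_i=o_i$. The gradient $\partial L/\partial w_a=\sum_i\delta_i x_{i+a}$ is then the same sliding operation with $\delta$ in place of the kernel.

```python
def cross_correlate(kernel, signal):
    """out[i] = sum_a kernel[a] * signal[i + a] wherever the kernel fully fits."""
    k = len(kernel)
    return [sum(kernel[a] * signal[i + a] for a in range(k))
            for i in range(len(signal) - k + 1)]


def loss(kernel, signal):
    """Hypothetical loss: half the sum of squared outputs."""
    return 0.5 * sum(o * o for o in cross_correlate(kernel, signal))


def kernel_gradient(kernel, signal):
    """dL/dw_a = sum_i delta_i * x_{i+a}: delta slides over the input."""
    delta = cross_correlate(kernel, signal)   # dL/do_i = o_i for this loss
    return cross_correlate(delta, signal)


def numerical_gradient(kernel, signal, h=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for a in range(len(kernel)):
        up = kernel[:a] + [kernel[a] + h] + kernel[a + 1:]
        down = kernel[:a] + [kernel[a] - h] + kernel[a + 1:]
        grads.append((loss(up, signal) - loss(down, signal)) / (2 * h))
    return grads


signal = [1.0, 2.0, 0.0, -1.0, 3.0]
kernel = [0.5, -1.0, 2.0]
analytic = kernel_gradient(kernel, signal)    # [-3.5, -10.0, 22.0]
numeric = numerical_gradient(kernel, signal)
```

The analytic and numerical gradients agree up to floating-point error, which is consistent with the gradient itself being computed by a convolution-type operation.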