<a href="https://colab.research.google.com/github/aaolcay/Traditional-Machine-Learning-Techniques/blob/main/Optimizer_algorithms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###**Optimization Algorithms in Model Learning**
Optimizer algorithms in deep learning update the model parameters (such as weights and biases) during training in order to minimize the loss function. The commonly used optimization methods listed below are implemented in this tutorial:

1.   **Gradient Descent:** A basic optimization algorithm that updates the model parameters in the direction opposite to the gradient of the loss function. Its variants differ in how many training examples are used to compute the gradient for each update:
  *   **Batch Gradient Descent:** Each update uses all examples in the training dataset.
  *   **Stochastic Gradient Descent (SGD):** Each update uses a single, randomly selected example.
  *   **Mini-batch Gradient Descent:** Each update uses a subset (mini-batch) of the training dataset, containing more than 1 example but fewer than the total number of examples.

2.   **Gradient Descent with Momentum:** An extension of gradient descent that adds a momentum term to the update rule. Instead of stepping along the current gradient, it steps along a weighted moving average of past gradients, which is updated at each step as a weighted average of the current gradient and the previous moving average. The momentum coefficient governs how strongly prior gradients influence the current update. Averaging damps gradient components that keep changing sign, which typically occur in high-curvature directions, so the method largely avoids large oscillating steps in the steep ("vertical") direction while still making steady progress in the shallow one. This often accelerates convergence.

3. **RMSProp:** RMSProp (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm. It maintains an exponentially decaying weighted moving average of the squared gradients for each parameter, which estimates the second moment (uncentered variance) of the gradients. Each gradient is then normalized by dividing it by the square root of this average.

4. **Adam:** Adam (Adaptive Moment Estimation) extends RMSProp by incorporating momentum. It uses weighted moving averages to track both the first moment (mean) and the second moment (uncentered variance) of the gradients, corrects both averages for their bias toward zero in early iterations, and then uses them to compute an adaptive step size for each parameter.

*Note that in Momentum, RMSProp, and Adam, the weighted moving average smooths out the noise in the gradients and provides a more stable estimate of the gradient statistics. With it, these algorithms can often converge faster, handle noisy gradients better, and, in the case of RMSProp and Adam, adapt the learning rate of each parameter during training.*
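
The smoothing effect of a weighted moving average can be illustrated with a minimal sketch. The function name `ema`, the noisy signal, and the value `beta=0.9` are illustrative assumptions, not part of the original tutorial; the optional division by $1 - \beta^t$ is the same bias correction that appears in the Adam equations further down.

```python
import numpy as np

def ema(values, beta=0.9, bias_correction=False):
    """Exponentially weighted moving average: avg = beta * avg + (1 - beta) * x."""
    avg = 0.0  # the average starts at zero, which biases early values toward zero
    out = []
    for t, x in enumerate(values, start=1):
        avg = beta * avg + (1 - beta) * x
        # Dividing by (1 - beta**t) removes the bias caused by the zero start
        out.append(avg / (1 - beta**t) if bias_correction else avg)
    return np.array(out)

# Hypothetical noisy "gradient" signal fluctuating around 1.0
rng = np.random.default_rng(0)
noisy = 1.0 + 0.5 * rng.standard_normal(200)
smoothed = ema(noisy, beta=0.9)

print("std of raw signal:     ", noisy[50:].std())
print("std of smoothed signal:", smoothed[50:].std())
```

A larger `beta` averages over more past values, which gives a smoother but slower-reacting estimate.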



####**The Mathematical Expressions of Optimizer Algorithms**

-> **Gradient Descent:**

**Batch Gradient Descent:** 
  
  $\theta = \theta - \alpha \cdot \nabla J(\theta)$

**Stochastic Gradient Descent:** 

  $\theta = \theta - \alpha \cdot \nabla J(\theta;x^{(i)}, y^{(i)})$

**Mini-batch Gradient Descent:** 

  $\theta = \theta - \alpha \cdot \nabla J(\theta;x^{(i:i+b)}, y^{(i:i+b)})$
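
The three variants above differ only in the batch size $b$, so a single loop can cover all of them. The following is a rough sketch under stated assumptions: the helper names `mse_gradient` and `gradient_descent`, the mean-squared-error cost, the toy data $y = 2x + 1$, and the hyperparameter values are all chosen for illustration and do not come from the tutorial.

```python
import numpy as np

def mse_gradient(theta, X, y):
    """Gradient of J(theta) = mean((X @ theta - y)**2) / 2 with respect to theta."""
    residual = X @ theta - y
    return X.T @ residual / len(y)

def gradient_descent(X, y, batch_size, alpha=0.5, epochs=200, seed=0):
    """batch_size == len(y): batch GD; batch_size == 1: SGD; otherwise mini-batch GD."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)  # visit the examples in a new random order each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # theta = theta - alpha * grad J(theta; x^(i:i+b), y^(i:i+b))
            theta = theta - alpha * mse_gradient(theta, X[idx], y[idx])
    return theta

# Hypothetical noise-free toy data: y = 2 * x + 1
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=64)
X = np.column_stack([x, np.ones_like(x)])  # second column models the bias term
y = 2.0 * x + 1.0

theta_batch = gradient_descent(X, y, batch_size=len(y))
theta_mini = gradient_descent(X, y, batch_size=16)
theta_sgd = gradient_descent(X, y, batch_size=1)
print(theta_batch, theta_mini, theta_sgd)
```

On this noise-free problem all three settings should recover a slope near 2 and a bias near 1; with noisy data, smaller batches give noisier parameter trajectories.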

-> **Gradient Descent with Momentum:** 

  $v = \beta \cdot v + (1 - \beta) \cdot \nabla J(\theta)$

  $\theta = \theta - \alpha \cdot v$
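
As a minimal sketch, the two momentum equations translate almost line by line into NumPy. The function name `momentum_step`, the quadratic test cost $J(\theta) = \frac{1}{2}\sum_k a_k \theta_k^2$, and the values of `alpha` and `beta` are assumptions made for this example only.

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.1, beta=0.9):
    """One update of gradient descent with momentum."""
    v = beta * v + (1 - beta) * grad  # weighted moving average of gradients
    theta = theta - alpha * v         # step along the averaged gradient
    return theta, v

# Hypothetical ill-conditioned quadratic: J(theta) = 0.5 * sum(A * theta**2)
A = np.array([1.0, 25.0])

def grad_J(theta):
    return A * theta

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)  # the velocity starts at zero

theta_after_one, v_after_one = momentum_step(theta, v, grad_J(theta))

for _ in range(300):
    theta, v = momentum_step(theta, v, grad_J(theta))
print(theta)
```

The second coordinate has a much steeper gradient than the first, which is the situation where averaging past gradients helps to damp oscillations.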

-> **RMSProp:** 

  $s = \beta \cdot s + (1 - \beta) \cdot (\nabla J(\theta))^2$

  $\theta = \theta - \frac{\alpha}{\sqrt{s + \epsilon}} \cdot \nabla J(\theta)$
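
A small sketch of one RMSProp update, following the two equations above with $\epsilon$ inside the square root. The function name `rmsprop_step`, the example gradient, and the default values of `alpha`, `beta`, and `eps` are illustrative assumptions.

```python
import numpy as np

def rmsprop_step(theta, s, grad, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update; the square and square root are element-wise."""
    s = beta * s + (1 - beta) * grad**2              # moving average of squared gradients
    theta = theta - alpha / np.sqrt(s + eps) * grad  # normalize each gradient component
    return theta, s

theta = np.array([1.0, 1.0])
s = np.zeros_like(theta)
grad = np.array([1.0, 25.0])  # hypothetical gradient with very different scales

theta_new, s_new = rmsprop_step(theta, s, grad)
step = theta - theta_new
print(step)
```

Although the second gradient component is 25 times larger than the first, both parameters move by about the same amount, because each component is divided by the root of its own squared-gradient average.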


-> **Adam:**

  $v = \beta_1 \cdot v + (1 - \beta_1) \cdot \nabla J(\theta)$

  $s = \beta_2 \cdot s + (1 - \beta_2) \cdot (\nabla J(\theta))^2$

  $v_{\text{corrected}} = \frac{v}{1 - \beta_1^t}$

  $s_{\text{corrected}} = \frac{s}{1 - \beta_2^t}$

  $\theta = \theta - \frac{\alpha}{\sqrt{s_{\text{corrected}} + \epsilon}} \cdot v_{\text{corrected}}$
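
The five Adam equations can be sketched as a single update function. The name `adam_step`, the quadratic test cost, and the step count are assumptions for illustration; the defaults `alpha=0.001`, `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are commonly used values, not values stated in this tutorial.

```python
import numpy as np

def adam_step(theta, v, s, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the iteration number and must start at 1."""
    v = beta1 * v + (1 - beta1) * grad     # first moment (mean of gradients)
    s = beta2 * s + (1 - beta2) * grad**2  # second moment (mean of squared gradients)
    v_corrected = v / (1 - beta1**t)       # undo the bias from the zero initialization
    s_corrected = s / (1 - beta2**t)
    theta = theta - alpha / np.sqrt(s_corrected + eps) * v_corrected
    return theta, v, s

# Hypothetical quadratic cost: J(theta) = 0.5 * sum(A * theta**2)
A = np.array([1.0, 25.0])

def loss(theta):
    return 0.5 * np.sum(A * theta**2)

def grad_J(theta):
    return A * theta

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
s = np.zeros_like(theta)
initial_loss = loss(theta)

# On the first step the corrected ratio is close to sign(grad), so each parameter moves by about alpha
first_theta, _, _ = adam_step(theta, v, s, grad_J(theta), t=1)

for t in range(1, 501):
    theta, v, s = adam_step(theta, v, s, grad_J(theta), t, alpha=0.05)
final_loss = loss(theta)
print(initial_loss, final_loss)
```

Without the bias correction, `v` and `s` would be much smaller than the true averages during the first iterations, because both start at zero.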


**Note:** The symbols used in the expressions are as follows:


*   $\theta$ represents the model parameters (weights and biases)
*   $\alpha$ denotes the learning rate
*   $J(\theta)$ represents the cost function
*   $\nabla J(\theta)$ denotes the gradient of the cost function with respect to the parameters
*   $x$ and $y$ represent the input features and corresponding labels, respectively
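*   $v$ denotes the velocity term, i.e. the exponentially weighted average of gradients, in Momentum and Adam
*   $s$ represents the exponentially weighted average of squared gradients in RMSProp and Adam, where the square $(\nabla J(\theta))^2$ is taken element-wise
*   $\beta$ is the exponential decay rate of the weighted average in Momentum and RMSProp; $\beta_1$ and $\beta_2$ are the exponential decay rates for the first and second moment estimates in Adam
*   $t$ represents the iteration or time step, counted from 1
*   $\epsilon$ is a small value added for numerical stability, to avoid division by zero (some implementations add it outside the square root instead).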
