<img src="../holberton_logo.png" alt="logo" width="500"/>

# Optimisation

**Optimization is the process of finding the best solution to a problem**. In the context of Machine Learning, optimization is used to **find the best set of parameters for a model that minimizes a specific objective function**. The objective function is a measure of how well the model performs on a given task, such as classification or regression.

Optimization is crucial in Machine Learning because it allows us to train models that can make accurate predictions on new data. Without optimization, it would be difficult to find the best set of parameters for a model, which would lead to poor performance and inaccurate predictions.

There are many different optimization algorithms that can be used in Machine Learning, such as gradient descent, stochastic gradient descent, and Adam. Each algorithm has its own strengths and weaknesses, and the choice of algorithm depends on the specific problem being solved and the characteristics of the data.


## What is a hyperparameter?

A **hyperparameter is a parameter that is set before the training of a machine learning model begins**. These parameters **cannot be learned from the data during training, but instead, they must be set by the programmer** based on their domain knowledge, experience, and intuition.

Hyperparameters *control the behavior of the training algorithm and the structure of the resulting model*. Some common hyperparameters of a neural network include:

- learning rate

- regularization strength

- number of hidden layers

- number of neurons per layer

To find the best hyperparameters, a common approach is to perform a grid search or a randomized search over a range of values for each hyperparameter.
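As a minimal sketch of a grid search, the snippet below tries every combination of two hyperparameters and keeps the one with the lowest score. The function `validation_loss` is a hypothetical stand-in for training a model and evaluating it on validation data, and the candidate values are illustrative assumptions.

```python
import itertools


def validation_loss(learning_rate, hidden_units):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return (learning_rate - 0.01) ** 2 + (hidden_units - 64) ** 2 * 1e-6


# Candidate values for each hyperparameter (illustrative only).
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_units": [32, 64, 128],
}

# Evaluate every combination and keep the one with the lowest loss.
best_params, best_loss = None, float("inf")
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    loss = validation_loss(**params)
    if loss < best_loss:
        best_params, best_loss = params, loss

print(best_params)
```

A randomized search would follow the same loop but sample a fixed number of combinations instead of enumerating all of them.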

## How and why do you normalize your input data?

**Normalization is the process of rescaling the input features to a common scale, most often zero mean and unit variance**. This is often done before training a machine learning model to improve performance and prevent numerical issues.

<img src="figs/features.webp" alt="logo" width="300"/>

Normalization can improve performance by ensuring that all features are treated equally during training. This is because features with larger magnitudes can dominate the training process and overshadow other features. Normalization can also prevent numerical issues, such as overflow or underflow, that can occur when working with very large or very small numbers.

There are several ways to normalize features:

#### 1. Standardization

Standardization replaces the values with their `Z` scores. The rescaled feature has mean $\mu = 0$ and standard deviation $\sigma = 1$.

$$
x' = \frac{x - x_{mean}}{\sigma}
$$

#### 2. Mean Normalisation

Mean normalization centers each feature on its mean and divides by its range, so the values fall between `-1` and `1` with $\mu = 0$.

$$
x' = \frac{x - mean(x)}{max(x) - min(x)}
$$

Standardization and mean normalization can be used for algorithms that assume zero-centered data, such as Principal Component Analysis (PCA).

#### 3. Min Max Scaling

This scaling brings the values into the range `0` to `1`.

$$
x' = \frac{x - min(x)}{max(x) - min(x)}
$$
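The three formulas above can be sketched with NumPy as follows. This is a minimal illustration, not a production implementation: the function names and the sample matrix `X` are assumptions, each column is treated as one feature, and a constant feature (zero range or zero standard deviation) would need special handling to avoid division by zero.

```python
import numpy as np


def standardize(x):
    """Z-score scaling: zero mean and unit standard deviation per feature."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


def mean_normalize(x):
    """Center on the mean and divide by the range: values fall in [-1, 1]."""
    return (x - x.mean(axis=0)) / (x.max(axis=0) - x.min(axis=0))


def min_max_scale(x):
    """Rescale each feature to the interval [0, 1]."""
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))


# Hypothetical data: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

print(standardize(X))
print(mean_normalize(X))
print(min_max_scale(X))
```

After any of these transformations both columns live on a comparable scale, so neither feature dominates purely because of its units.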

## What is a saddle point?

A saddle point is a critical point in the optimization landscape of a function where the gradient is zero, i.e., the **function has a stationary point**, but the point is a minimum along some directions and a maximum along others. It is therefore neither a local minimum nor a local maximum.

Near a saddle point, **the gradient descent algorithm can slow down dramatically or stall instead of making progress towards the minimum**. This is because the gradient is zero at the saddle point and very small around it, so the update steps become tiny even though a descent direction still exists. Saddle points are common in high-dimensional optimization problems, especially in deep neural networks.

<img src="figs/saddlegradient.png" alt="logo" width="500"/>


To deal with saddle points, several techniques have been proposed, such as momentum-based optimization, adaptive learning rates, and second-order optimization methods. These techniques can help the optimization algorithm escape saddle points and converge to the minimum.
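To make the idea concrete, here is a small sketch using the function $f(x, y) = x^2 - y^2$, chosen here as an illustrative assumption because it has a saddle point at $(0, 0)$. Plain gradient descent started exactly on the line $y = 0$ converges to the saddle, while a tiny offset in $y$ is enough for it to move away.

```python
import numpy as np


def grad(p):
    """Gradient of f(x, y) = x**2 - y**2, which has a saddle point at (0, 0)."""
    x, y = p
    return np.array([2.0 * x, -2.0 * y])


def descend(start, alpha=0.1, steps=50):
    """Plain gradient descent from `start`; returns the final point."""
    p = np.array(start, dtype=float)
    for _ in range(steps):
        p = p - alpha * grad(p)
    return p


stuck = descend([1.0, 0.0])     # starts on the line y = 0: y never changes
escaped = descend([1.0, 1e-3])  # tiny offset in y grows at every step

print(stuck)
print(escaped)
```

This is one intuition for why noise in the updates (as in stochastic methods) or accumulated momentum can help: any small push off the flat direction lets the descent continue.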

## What is stochastic gradient descent?

Stochastic Gradient Descent (SGD) is an optimization algorithm commonly used in machine learning to minimize the loss function of a model. **It is a variant of the Gradient Descent algorithm that updates the model parameters based on the gradient of the loss function estimated from a randomly selected subset of the training data, rather than the full dataset**.

The key idea behind SGD is to use a small subset of the training data, known as a mini-batch, to estimate the gradient of the loss function. This is computationally more efficient than computing the gradient on the entire dataset, especially when the dataset is large. After computing the gradient on the mini-batch, the algorithm updates the model parameters in the direction of the negative gradient.

<img src="figs/sgd1.png" alt="logo" width="400"/>

The plot above illustrates how stochastic gradient descent converges towards the minimum point, but along a more erratic path because of the randomness in selecting the data points.


To further improve the performance of SGD, several variants have been proposed, such as momentum-based SGD, Adagrad, RMSProp, and Adam. These variants adjust the learning rate or the update direction of the algorithm based on the history of the gradients, which can improve convergence and stability.

## What is mini-batch gradient descent?

Mini-batch Gradient Descent is a variant of the Gradient Descent optimization algorithm used in machine learning. In its strictest form, Stochastic Gradient Descent (SGD) updates the model parameters from a single training example; mini-batch Gradient Descent instead estimates the gradient of the loss function from a small random subset of the training data, known as a mini-batch. In practice the two names are often used interchangeably.

**The idea behind mini-batch Gradient Descent is to strike a balance between the efficiency of SGD and the stability of batch Gradient Descent, which computes the gradient on the entire training dataset**. By using mini-batches, the algorithm can make more frequent updates to the model parameters than batch Gradient Descent, which can speed up convergence and can help it avoid getting stuck in local optima.

<img src="figs/gradients.png" alt="logo" width="400"/>


**The size of the mini-batch is a hyperparameter that needs to be chosen carefully**. A larger mini-batch can provide a more accurate estimate of the gradient but may slow down the training process and require more memory. A smaller mini-batch can be faster and use less memory but may result in noisy updates to the model parameters.


#### Summary of Mini-Batch Gradient Descent Algorithm:

##### Initialization

Load the neural network model from a saved path using TensorFlow. Get tensors and operations from the restored graph.

##### Loop over Epochs:

- For each epoch, calculate training and validation metrics.
- Print the training and validation cost and accuracy after each epoch.

##### Mini-Batch Processing:

- Shuffle the training data and labels.
- Calculate the number of batches based on the training data size and batch size.
- Loop over each batch:
    - Extract a mini-batch of data and labels.
    - Run a training step with the mini-batch data using TensorFlow's `sess.run()` function.
    - Optionally, print mini-batch results every 100 batches, including the cost and accuracy.
- Save Model: Save the trained model to a specified path using TensorFlow's saver object.
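The shuffle-and-batch loop summarized above can be sketched without any framework. The example below is an assumption-laden illustration, not the TensorFlow code the summary refers to: it replaces the restored graph with a NumPy linear model trained on the mean squared error, skips saving and loading, and uses hypothetical names (`shuffle_data`, `train_mini_batch`) and synthetic data.

```python
import numpy as np


def shuffle_data(X, Y, rng):
    """Shuffle samples and labels with the same permutation."""
    perm = rng.permutation(X.shape[0])
    return X[perm], Y[perm]


def train_mini_batch(X, Y, batch_size=32, epochs=20, alpha=0.1, seed=0):
    """Fit y = X @ w + b by mini-batch gradient descent on the MSE loss."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for epoch in range(epochs):
        X_s, Y_s = shuffle_data(X, Y, rng)
        # Step through the shuffled data; the last batch may be smaller.
        for start in range(0, m, batch_size):
            X_b = X_s[start:start + batch_size]
            Y_b = Y_s[start:start + batch_size]
            error = X_b @ w + b - Y_b
            # Gradients of the mean squared error on this mini-batch.
            grad_w = 2.0 * X_b.T @ error / X_b.shape[0]
            grad_b = 2.0 * error.mean()
            w -= alpha * grad_w
            b -= alpha * grad_b
        cost = np.mean((X @ w + b - Y) ** 2)
        print(f"epoch {epoch}: cost {cost:.6f}")
    return w, b


# Hypothetical noiseless data generated from y = 3 * x + 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 1))
Y = 3.0 * X[:, 0] + 1.0
w, b = train_mini_batch(X, Y)
```

Reshuffling at the start of every epoch means each epoch sees different mini-batches, which is what makes the gradient estimates vary from step to step.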

## What is the moving average?

In the context of neural networks, a **moving average refers to a technique used to improve the stability of the network during training**. The moving average is applied to the network's weights or other trainable parameters, such as the mean and variance of the batch normalization layers, to smooth out fluctuations and reduce the impact of outliers in the training data.

The basic idea behind the moving average is to **keep track of the past values of the parameters and update them based on a weighted average of the current and past values**. The weights are typically chosen to give more weight to recent values and exponentially less weight to older ones, with the goal of adapting to the changes in the training data while maintaining stability.

The moving average is often used in conjunction with other optimization techniques, such as momentum-based stochastic gradient descent (SGD) and Adam optimizer, to improve the convergence and generalization performance of the network.

<img src="figs/mavg.png" alt="logo" width="400"/>


#### Math behind it

The **weighted moving average of a dataset** is computed by iterating through the data points and updating the running average from the current data point and a weighting factor (`beta`).

The formula used for the calculation is:

$$
MA = \beta \cdot w + (1 - \beta) \cdot d
$$

- `w` represents the previous weighted average
- `d` is the current data point
- $\beta$ is the weighting factor

Additionally, bias correction is applied to the weighted average. Because the average starts at zero, the first few values would otherwise be underestimated. The formula for bias correction is:

$$
w_{new} = \frac{w}{1 - \beta^{i+1}}
$$

- $w_{new}$ is the corrected weighted average
- $w$ is the current weighted average
- $\beta$ is the weight parameter
- `i` is the zero-based index of the current data point

The resulting list contains the moving averages of the input data.
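A minimal sketch of the two formulas above follows. The function name `moving_average`, the initial average of zero, and the sample data are assumptions made for illustration.

```python
def moving_average(data, beta):
    """Bias-corrected exponentially weighted moving average of `data`."""
    averages = []
    w = 0.0  # assumed initial weighted average
    for i, d in enumerate(data):
        # Blend the previous weighted average with the current data point.
        w = beta * w + (1 - beta) * d
        # Bias correction compensates for starting from zero.
        averages.append(w / (1 - beta ** (i + 1)))
    return averages


data = [1, 5, 3, 9, 7]  # hypothetical data points
print(moving_average(data, 0.9))
```

With bias correction the first average equals the first data point, and a constant input produces a constant output, which would not be the case for the uncorrected values.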


## What is gradient descent with momentum? 

**Gradient descent with momentum is an optimization algorithm that is used to speed up the convergence of the standard gradient descent algorithm**. The basic idea behind gradient descent with momentum is to accumulate a running average of the gradients over time and use this information to update the parameters of the model.

In the standard gradient descent algorithm, the parameters are updated in the direction of the negative gradient of the cost function with respect to the parameters. However, the update can be quite noisy and fluctuate in different directions, which can slow down the convergence of the algorithm, especially in the presence of noisy or sparse data.

In contrast, **gradient descent with momentum takes into account the previous gradients and uses a weighted average of the current and past gradients to update the parameters**. This helps to smooth out the update direction and speed up the convergence of the algorithm.

The formula for updating the variables with momentum is:

$$
dW = \beta_1 \cdot v + (1 - \beta_1) \cdot grad
$$

where

- $dW$ is the update term
- $\beta_1$ is the momentum weight
- $v$ is the previous first moment of the variable
- $grad$ is the gradient of the variable
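The momentum formula can be sketched as a single update function. The text above only defines the update term $dW$; the step `var - alpha * dW` with a learning rate `alpha` is the usual way to apply it and is stated here as an assumption, as are the function name and the toy objective.

```python
def update_variables_momentum(alpha, beta1, var, grad, v):
    """One momentum step; returns the updated variable and first moment."""
    v = beta1 * v + (1 - beta1) * grad  # dW in the formula above
    var = var - alpha * v               # assumed update: step against dW
    return var, v


# Hypothetical example: minimize f(w) = w**2, whose gradient is 2 * w.
w, v = 1.0, 0.0
for _ in range(300):
    w, v = update_variables_momentum(0.1, 0.9, w, 2 * w, v)
print(w)
```

The returned `v` is passed back in on the next call, which is how the history of past gradients carries over from step to step.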

## What is RMSProp?

RMSProp stands for Root Mean Square Propagation, and it is an **optimization algorithm** commonly used in deep learning. RMSProp is designed to **handle the problem of diminishing learning rates** seen in methods such as AdaGrad, which accumulate all past squared gradients. Diminishing learning rates can lead to slow convergence or even a complete halt in the training process.

#### Intuition in Simple Words

- RMSProp changes how quickly the model learns for each part of the problem. If some parts are changing a lot, it slows down the learning to avoid overreacting. If other parts change slowly, it speeds up learning to not fall behind.
- It does this by keeping track of how much each part has changed recently. If something changes a lot, it gets less attention. If something changes slowly, it gets more attention.
- This helps the model learn steadily without getting too distracted by sudden changes or missing out on slower changes.

#### Key Idea

The basic idea behind RMSProp is to **maintain a moving average of the squared gradient for each weight in the network. This average is then used to scale the learning rate for each weight in the network**. In other words, the algorithm adapts the learning rate for each weight based on the magnitude of its gradients.

RMSProp can be implemented in various deep learning frameworks such as TensorFlow, PyTorch, or Keras by using their built-in optimization algorithms that support RMSProp.


Here are the key formulas:

##### Squared Gradient Calculation

This formula updates the moving average of the squared gradient:

$$
sq_{gradient} = \beta_2 \cdot s + (1 - \beta_2) \cdot grad^2
$$

- $sq_{gradient}$ represents the squared gradient, which is a weighted sum of the squared gradients of previous time steps and the current gradient
- $\beta_2$ is the RMSProp weight, controlling the contribution of the previous squared gradients
- $s$ denotes the previous second moment of the variable
- $grad$ is the gradient of the variable

##### Variable update

This formula updates the variable using the RMSProp update rule. It adjusts the variable based on the learning rate $\alpha$, the gradient, the square root of the squared gradient, and a small constant $\epsilon$ to avoid division by zero.

$$
updated_{variable} = var - \frac{\alpha \cdot grad}{\sqrt{sq_{gradient}} + \epsilon}
$$
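The two RMSProp formulas can be sketched with NumPy as below. The function name and the sample values are illustrative assumptions. The example uses two gradients that differ by four orders of magnitude to show how the per-weight scaling evens out the step sizes.

```python
import numpy as np


def update_variables_rmsprop(alpha, beta2, epsilon, var, grad, s):
    """One RMSProp step; returns the updated variable and second moment."""
    # Moving average of the squared gradient.
    s = beta2 * s + (1 - beta2) * grad ** 2
    # Scale the step for each weight by the root of that average.
    var = var - alpha * grad / (np.sqrt(s) + epsilon)
    return var, s


# Hypothetical gradients that differ by four orders of magnitude.
var = np.array([1.0, 1.0])
grad = np.array([100.0, 0.01])
new_var, s = update_variables_rmsprop(0.01, 0.9, 1e-8, var, grad, np.zeros(2))
steps = var - new_var  # nearly the same step size for both weights
print(steps)
```

Plain gradient descent would move the first weight ten thousand times further than the second; here both move by roughly the same amount.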


## What is Adam optimization?

Adam is an optimization algorithm that extends stochastic gradient descent and is widely used in deep learning for training neural networks. The name is derived from Adaptive Moment Estimation, and the algorithm combines the best aspects of two other popular optimization algorithms, namely RMSProp and momentum.


#### Intuition in Simple Words

- Adam combines the benefits of two other optimization techniques: RMSProp and Momentum.
- Like RMSProp, it adjusts learning rates for different parameters, but it also keeps track of past gradients' momentum.
- This combination allows Adam to adaptively adjust the learning rate for each parameter while efficiently handling sparse gradients and noisy data.


Like RMSProp, Adam computes the moving average of the squared gradients to adapt the learning rate for each parameter. And like momentum, Adam keeps track of the exponentially decaying average of past gradients to smooth out the parameter updates.

#### Key elements

The algorithm is called Adam. It is not an acronym and is not written as ADAM.

- The name Adam is derived from adaptive moment estimation.
- Straightforward to implement.
- Computationally efficient.
- Low memory requirements.

*The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.*


The authors describe Adam as combining the advantages of two other extensions of stochastic gradient descent. Specifically:

- **Adaptive Gradient Algorithm (AdaGrad)** that maintains a per-parameter learning rate that improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).
- **Root Mean Square Propagation (RMSProp)** that also maintains per-parameter learning rates that are adapted based on the average of recent magnitudes of the gradients for the weight (e.g. how quickly it is changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy).

Adam realizes the benefits of both AdaGrad and RMSProp.

In addition to adapting the parameter learning rates based on the average of the second moments of the gradients (the uncentered variance) as in RMSProp, Adam also makes use of the average of the first moment (the mean).

<img src="figs/adam.png" alt="adam" width="600"/>

#### Mathematical reasoning and formulas

##### Initialization

Adam optimizer initializes two variables, `v` and `s`, which represent the exponentially decaying average of past gradients and the exponentially decaying average of past squared gradients, respectively.

##### Update rule

At each iteration, Adam computes the exponentially decaying average of past gradients `v` using a decay factor $\beta_1$ and the exponentially decaying average of past squared gradients `s` using a decay factor $\beta_2$. These averages are updated with the current gradient of the variable.

$$
v_t = \beta_1 \cdot v_{t-1} + (1 - \beta_1) \cdot grad
$$

and 

$$
s_t = \beta_2 \cdot s_{t-1} + (1 - \beta_2) \cdot grad^2
$$

##### Bias correction

To compensate for the initialization bias towards zero, Adam performs bias correction by scaling the computed averages `v` and `s` by factors

$$
\frac{1}{1 - \beta_1^t}
$$

and

$$
\frac{1}{1 - \beta_2^t}
$$

where `t` is the current time step.

##### Variable update

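Finally, Adam updates the variable using the bias-corrected averages $\hat{v} = \frac{v_t}{1 - \beta_1^t}$ and $\hat{s} = \frac{s_t}{1 - \beta_2^t}$, along with a small constant $\epsilon$ to prevent division by zero. The variable moves against the direction of the averaged gradient, with the learning rate $\alpha$ divided by the square root of the corrected squared-gradient average.

$$
updated_{variable} = var - \frac{\alpha \cdot \hat{v}}{\sqrt{\hat{s}} + \epsilon}
$$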
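Putting the Adam steps together, a single update can be sketched as follows. The function name, argument order, and sample values are assumptions for illustration; `t` is the time step and is assumed to start at 1.

```python
import numpy as np


def update_variables_adam(alpha, beta1, beta2, epsilon, var, grad, v, s, t):
    """One Adam step; returns the updated variable and both moments."""
    # Exponentially decaying averages of the gradient and squared gradient.
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    # Bias correction for the zero initialization of v and s.
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Step against the corrected average gradient, scaled per parameter.
    var = var - alpha * v_hat / (np.sqrt(s_hat) + epsilon)
    return var, v, s


# Hypothetical first step (t = 1) with v and s initialized to zero.
var, v, s = update_variables_adam(0.001, 0.9, 0.999, 1e-8, 1.0, 2.0, 0.0, 0.0, 1)
print(var, v, s)
```

At the first step the bias-corrected averages reduce to the gradient and its square, so the variable moves by almost exactly `alpha` regardless of how large the gradient is.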


## What is learning rate decay?

Learning rate decay is a technique used in optimization algorithms to decrease the learning rate during training. The idea behind learning rate decay is to gradually reduce the learning rate as training progresses, so that the updates become smaller and finer as the optimizer gets closer to the optimal solution. 

The main motivation behind learning rate decay is to balance the trade-off between convergence speed and convergence accuracy. A high learning rate can lead to fast convergence, but it can also result in overshooting the optimal solution and bouncing around it. A low learning rate can lead to a more accurate convergence, but it can also result in very slow convergence and getting stuck in local optima.

<img src="figs/lr.png" alt="learning rate decay" width="600"/>


#### Inverse time decay

One common schedule, inverse time decay, divides the learning rate by a factor that grows with the number of training steps. This can help fine-tune the model's parameters more precisely as training progresses. It works in two steps:

#### Factor Calculation: 

The decay rate is applied to the learning rate based on the global step divided by the decay step. This factor increases over time as the global step increases.

$$
factor = 1 + decay \cdot \frac{global_{step}}{decay_{step}}
$$


#### Learning Rate Scaling

The original learning rate ($\alpha$) is scaled by the inverse of the calculated factor. The scaled learning rate ($\alpha_{new}$) is then used for training the model.

$$
\alpha_{new} = \frac{\alpha}{factor}
$$
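The two formulas above translate directly into a short function. This is a sketch under assumptions: the function and argument names are hypothetical, and the ratio of global step to decay step is used as written in the formula. A stepwise variant would round that ratio down to an integer so the learning rate only changes every `decay_step` steps.

```python
def learning_rate_decay(alpha, decay_rate, global_step, decay_step):
    """Return the learning rate after inverse time decay."""
    factor = 1 + decay_rate * (global_step / decay_step)
    return alpha / factor


# Hypothetical schedule: start at 0.1 and inspect every 10 steps.
for step in range(0, 50, 10):
    print(step, learning_rate_decay(0.1, 1, step, 10))
```

The original learning rate is passed in on every call, so the decay is always computed from the starting value rather than compounding on an already reduced rate.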


## What is batch normalization?

Batch normalization is a technique used in deep learning to normalize the input of each layer of a neural network by adjusting and scaling the activations. The aim is to accelerate the training process and improve the overall performance of the network by reducing the internal covariate shift.

The internal covariate shift refers to the change in the distribution of the input of each layer during training, which can slow down the convergence of the network and make it harder to optimize. By normalizing the input of each layer, batch normalization can help to reduce the internal covariate shift and make the training process more stable and efficient.

Batch normalization works by computing the mean and variance of the activations of each layer over a mini-batch of inputs during training. Then, it centers and scales the activations using the mean and variance, and applies a linear transformation to the normalized inputs. The transformation includes learnable parameters that can be updated during backpropagation to optimize the network.

<img src="figs/batch.webp" alt="batch normalization" width="800"/>
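The training-time computation described above can be sketched with NumPy. The names `gamma` and `beta` for the learnable scale and shift, the `epsilon` term added for numerical stability, and the random input are assumptions for illustration. The sketch covers only the forward pass on one mini-batch.

```python
import numpy as np


def batch_norm(Z, gamma, beta, epsilon=1e-8):
    """Normalize activations Z of shape (batch, units), then scale and shift."""
    mean = Z.mean(axis=0)      # per-unit mean over the mini-batch
    variance = Z.var(axis=0)   # per-unit variance over the mini-batch
    Z_norm = (Z - mean) / np.sqrt(variance + epsilon)
    return gamma * Z_norm + beta  # learnable linear transformation


# Hypothetical activations: a mini-batch of 64 samples with 3 units.
rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=10.0, size=(64, 3))
out = batch_norm(Z, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.std(axis=0))
```

With `gamma` set to one and `beta` set to zero the output is simply the centered and scaled activations; during training the network is free to learn other values for both.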


### Happy coding