# B"H

## Learning Rate

Let’s forget, for a while, that we are performing gradient descent on an n-dimensional function (our loss function), where n is the number of parameters (weights and biases) that the model contains, and assume that the loss function has just one dimension (a single input).


That said, we’ve used a real SGD optimizer on a real function to prepare all of the following examples. 

---

Here’s the function where we want to determine what input to it will result in the lowest possible output:

![](https://drive.google.com/uc?id=1Rw1XGCcTA5XFEq5LhstpmOvIMza9qrJH)

<br>

---

We’ll start descending from the left side of this graph.


### Round 1 - LR too small

![](https://drive.google.com/uc?id=1q2sI7Fg2-RUeUGvq1h7QVTMlL89BrS5h)

> NOTE: 
> 
> How do we know if we’ve reached the global minimum or at least gotten close? 
>
> If the model has stopped learning while the loss value is still not 0 or very close to 0, we’re at some local minimum.
>
> In reality, we almost never approach a loss of 0 for various reasons.

### Round 2 - LR Still Too Small

![](https://drive.google.com/uc?id=1PEll2HvrfyKNkILD7Mqe1ynKwYczqhP6)




### Round 3 - Needs momentum or similar

![](https://drive.google.com/uc?id=1e1tfv8FXI4maDQdYGn_ZExDEmu_EymFX)

The model was able to escape the “deeper” local minimums, so it might seem counter-intuitive that it is stuck here. Remember, the model follows the direction of steepest descent of the loss function, no matter how large or slight the descent is. For this reason, we’ll introduce **momentum** and other techniques to prevent such situations.

### Momentum

Momentum, in an optimizer, adds to the gradient what, in the physical world, we could call **inertia**. For example, we can throw a ball uphill and, with a small enough hill or a big enough applied force, the ball can roll over to the other side of the hill.

### Round 4 - With momentum yet LR too small

![](https://drive.google.com/uc?id=1UYwJJa9c4o3XHNCGwvaeYAmk3UppwX66)

We used a very small learning rate here with a large momentum. 

The color change from green, through orange, to red shows the progress of the gradient descent process, step by step. We can see that the model achieved the goal and found the global minimum, but this took many steps.

### Round 5 - Good momentum and LR

![](https://drive.google.com/uc?id=1GHIp3LJa77q9gWs2gzqqHBAJ7BYg-WTP)

### Round 6 - LR too high

![](https://drive.google.com/uc?id=1AJaCdGKdUT7nluurFr7gWqdpuknhW5-J)

With the learning rate set too high, the model might not be able to find the global minimum. 

Even if it does at some point, further adjustments could cause it to **jump out** of this minimum.


### Round 7 - LR way too high

![](https://drive.google.com/uc?id=13PHWk5BWboLjEQjTD9a3mnrx4i0M73q9)

In this extreme situation we have a **gradient explosion**.


### Gradient Explosion

A gradient explosion is a situation where the parameter updates cause the function’s output to **rise** instead of fall, and, with each step, the loss value and gradient become larger. 

At some point, the floating-point variable limitation causes an overflow as it cannot hold values of this size anymore, and the model is no longer able to train. 

<br>

It’s crucial to recognize this situation forming during training, especially for large models, where the training can take days, weeks, or more. It is possible to tune the model’s hyper-parameters in time to save the model and to continue training.
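
As a rough illustration, here is a minimal sketch of plain gradient descent on a toy one-dimensional loss, f(x) = x². This toy function is an assumption for illustration only and is not the function plotted above, and the helper name `descend` is hypothetical. With this loss, each step multiplies x by (1 - 2 * learning_rate), so any learning rate above 1.0 makes the loss grow on every step.

```python
import math

def descend(learning_rate, steps=10, x=1.0):
    """Plain gradient descent on the toy loss f(x) = x * x; returns the loss after each step."""
    losses = []
    for _ in range(steps):
        gradient = 2.0 * x              # derivative of x * x
        x -= learning_rate * gradient   # vanilla SGD update
        losses.append(x * x)
    return losses

too_small = descend(0.001)   # barely moves away from the start
good = descend(0.4)          # converges quickly
too_high = descend(1.1)      # every step overshoots further, so the loss rises

for name, losses in [("too small", too_small), ("good", good), ("too high", too_high)]:
    print(f"{name:>9}: first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")

# A simple guard one might add to a training loop (an assumption, not from the text):
# flag the run as soon as the loss is no longer a finite number.
exploded = any(not math.isfinite(loss) for loss in descend(1.1, steps=5000))
print("loss overflowed to inf/nan:", exploded)
```

Checking for a rising or non-finite loss every few steps is one inexpensive way to notice an explosion early enough to lower the learning rate and resume.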

### Round 8 - Great hyperparameters

![](https://drive.google.com/uc?id=1rz2F8zWH_yySJkjyQBqrYq-RHsn0S4_Z)

This time the model needed just a few steps to find the global minimum. 


### Summary

![](https://drive.google.com/uc?id=1zFBAb4vKe6JoXr36Rcsk_0c0-_HcApp0)



## Learning Rate Decay

### Idea

The idea of a learning rate decay is to **start with a large learning rate**, say 1.0 in our case, and **then decrease** it during training. 

The model needs **small updates near the end** of training to be able to get as close to the minimum point as possible.

<br>

---

There are a few methods for doing this:
- One is to decrease the learning rate in response to the loss **across epochs**  
    - For example, if the loss begins to level out/plateau or starts “jumping” over large deltas. 
    - You can either program this behavior-monitoring logically or simply track your loss over time and manually decrease the learning rate when you deem it appropriate. 
- Another option, **which we will implement**, is to program a **Decay _Rate_**, which **steadily decays** the learning rate per batch or epoch.
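
The first option above, lowering the rate when the loss stops improving, could be sketched roughly as follows. The class name `PlateauDecay` and its `factor`, `patience`, and `min_delta` parameters are illustrative assumptions, and the loss values are made up.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, learning_rate=1.0, factor=0.5, patience=2, min_delta=1e-4):
        self.learning_rate = learning_rate
        self.factor = factor          # how much to shrink the rate on a plateau
        self.patience = patience      # epochs without improvement before shrinking
        self.min_delta = min_delta    # smallest drop in loss that counts as improvement
        self.best_loss = float("inf")
        self.stale_epochs = 0

    def step(self, loss):
        """Record this epoch's loss and return the (possibly reduced) learning rate."""
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss     # real improvement: reset the counter
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1    # plateau, or a jump upwards
            if self.stale_epochs >= self.patience:
                self.learning_rate *= self.factor
                self.stale_epochs = 0
        return self.learning_rate


schedule = PlateauDecay()
epoch_losses = [0.90, 0.60, 0.45, 0.45, 0.46, 0.44, 0.44, 0.44]  # made-up values
rates = [schedule.step(loss) for loss in epoch_losses]
print(rates)
```

With these made-up losses the rate stays at 1.0 while the loss is falling and is halved each time two epochs pass without improvement.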

### Our Decay Rate

Let’s plan to **decay per step**. 

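This is usually referred to as **1/t decaying** (inverse-time decay). It is sometimes also called **exponential decaying**, although strictly speaking the rate here shrinks in proportion to 1/t rather than exponentially.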

<br>

<u>**Details**</u>

At each step, we’re going to **update the learning rate** by multiplying the starting rate by the **reciprocal of 1 plus a fraction of the step count**.

This **fraction** is a new hyper-parameter that we’ll add to the optimizer, called the **learning rate decay**. 

<br>

$\large rate = startingRate \cdot \frac{1}{1 + rateDecay \cdot step}$

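The added 1 keeps the denominator at 1 or above, so the resulting algorithm never raises the learning rate above its starting value. Without it, a `rateDecay * step` product smaller than 1 would be the whole denominator, and dividing by it would increase the rate.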
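
To see the role of the added 1, here is a small sketch comparing the formula with and without it. The function names are hypothetical, and the variant without the 1 exists only for comparison (it is undefined at step 0).

```python
def decayed_rate(starting_rate, rate_decay, step):
    """1/t decay, as in the formula above."""
    return starting_rate / (1.0 + rate_decay * step)

def decayed_rate_without_one(starting_rate, rate_decay, step):
    """The same formula with the added 1 removed; for comparison only."""
    return starting_rate / (rate_decay * step)

for step in (1, 5, 20):
    with_one = decayed_rate(1.0, 0.1, step)
    without_one = decayed_rate_without_one(1.0, 0.1, step)
    print(f"step {step:>2}: with +1 -> {with_one:.4f}, without +1 -> {without_one:.4f}")
```

At step 1 the variant without the 1 returns a rate of 10, ten times the starting rate, while the proper formula has already dropped slightly below 1.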



### Example

Note, in practice, 0.1 would be considered a fairly aggressive decay rate, but this should give you a sense of the concept.


```py
starting_learning_rate = 1.
learning_rate_decay = 0.1

prev_learning_rate = starting_learning_rate

for step in range(30):
    # 1/t decay: the rate shrinks as the step count grows
    learning_rate = starting_learning_rate * (1. / (1 + learning_rate_decay * step))

    # Report how much the rate dropped since the previous step
    diff_from_prev = prev_learning_rate - learning_rate
    print(f'learning rate: {learning_rate:.4f}, diff from prev: {diff_from_prev:.4f}')

    prev_learning_rate = learning_rate
```

```
learning rate: 1.0000, diff from prev: 0.0000
learning rate: 0.9091, diff from prev: 0.0909
learning rate: 0.8333, diff from prev: 0.0758
learning rate: 0.7692, diff from prev: 0.0641
learning rate: 0.7143, diff from prev: 0.0549
learning rate: 0.6667, diff from prev: 0.0476
learning rate: 0.6250, diff from prev: 0.0417
learning rate: 0.5882, diff from prev: 0.0368
learning rate: 0.5556, diff from prev: 0.0327
learning rate: 0.5263, diff from prev: 0.0292
learning rate: 0.5000, diff from prev: 0.0263
learning rate: 0.4762, diff from prev: 0.0238
learning rate: 0.4545, diff from prev: 0.0216
learning rate: 0.4348, diff from prev: 0.0198
learning rate: 0.4167, diff from prev: 0.0181
learning rate: 0.4000, diff from prev: 0.0167
learning rate: 0.3846, diff from prev: 0.0154
learning rate: 0.3704, diff from prev: 0.0142
learning rate: 0.3571, diff from prev: 0.0132
learning rate: 0.3448, diff from prev: 0.0123
learning rate: 0.3333, diff from prev: 0.0115
learning rate: 0.3226, diff from p
```

## Momentum

### Intro

Stochastic Gradient Descent with **learning rate decay** can do fairly well but is still a fairly basic optimization method.

One option for improving the SGD optimizer is to introduce **momentum**.

### Idea

Momentum creates a **rolling average of gradients** over some number of updates and uses this average with the unique gradient at each step. 

<br>

---

Another way of understanding this is to imagine a ball going down a hill — even if it finds a small hole or hill, momentum will let it go straight through it towards a lower minimum — the bottom of this hill. This can help in cases where you’re stuck in some local minimum (a hole), bouncing back and forth. With momentum, a model is more likely to pass through local minimums, further decreasing loss. 

Simply put, **momentum may still point towards the global gradient descent direction**.

<br>

---

With regular updates, the SGD optimizer might determine that the **next best step** is one that keeps the model in a **local minimum**. 

The step may decrease loss for that update but might not get us out of the local minimum. We might wind up with a gradient that points in one direction and then the opposite direction in the next update; the gradient could continue to bounce back and forth around a local minimum like this, keeping the optimization of the loss stuck.

Instead, **momentum uses the previous update’s direction to influence the next update’s direction**, minimizing the chances of bouncing around and getting stuck.

<br>

---

We utilize momentum by setting a parameter between 0 and 1, representing the fraction of the previous parameter update to retain.

The update contains a portion of the gradient from preceding steps as our momentum (the direction of previous changes) and only a portion of the current gradient. Together, these portions form the actual change to our parameters, and **the bigger the role that momentum takes in the update, the more slowly the update can change direction**.

When we set the momentum fraction too high, the model might stop learning at all since the direction of the updates won’t be able to follow the global gradient descent. 

---

The momentum = the previous update to the parameters.

See code for details.
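
A minimal sketch of this update rule, assuming a single scalar parameter and a toy loss f(x) = x². Both are illustrative assumptions, and the function name `sgd_momentum_step` is hypothetical.

```python
def sgd_momentum_step(param, gradient, previous_update, learning_rate=0.1, momentum=0.9):
    """One SGD-with-momentum step; returns the new parameter and the update applied."""
    # Retain a fraction of the previous update, then add the usual SGD step.
    update = momentum * previous_update - learning_rate * gradient
    return param + update, update

param, update = 1.0, 0.0        # there is no previous update before the first step
for _ in range(200):
    gradient = 2.0 * param      # derivative of the toy loss x * x
    param, update = sgd_momentum_step(param, gradient, update)

print(f"parameter after 200 steps: {param:.6f}")  # ends up very close to 0
```

Setting `momentum=0.0` in this sketch reduces it to the plain SGD update, which makes the two easy to compare on the same loss.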


### Takeaway

The SGD optimizer with momentum is usually one of the two main choices for an optimizer in practice, alongside the Adam optimizer, which we’ll talk about shortly.

## AdaGrad

### NOTE

This optimizer is **not** widely used.

### Idea

**AdaGrad**, short for **adaptive gradient**, institutes a **per-parameter learning rate** rather than a **globally-shared rate** (as we did before). 

Overall, the impact is that the learning rates for parameters with smaller gradients are decreased slowly, while the learning rates for parameters with larger gradients are decreased faster.

<br>

<u>Details</u>

During the training process, some weights can rise significantly, while others tend to not change by much. It is usually better for weights to not rise too high compared to the other weights (we’ll talk about this with regularization techniques in later chapters). 

AdaGrad provides a way to normalize parameter updates by keeping a history of previous updates: the bigger the sum of the updates is, in either direction (positive or negative), the smaller the updates made later in training.

This lets less frequently updated parameters keep up with changes, effectively utilizing more neurons for training.

<br>

---


The concept of AdaGrad can be contained in the following two lines of code:

```py
cache += parm_gradient ** 2
parm_updates = learning_rate * parm_gradient / (sqrt(cache) + eps)
```

- `cache`: holds a history of squared gradients
- `parm_updates`: the learning rate multiplied by the gradient (basic SGD so far), then divided by the square root of the cache plus some epsilon value.

Because the cache only ever grows, the division makes the updates smaller and smaller over time, which can cause learning to stall. **That’s why this optimizer is not widely used, except for some specific applications.**

The epsilon is a hyperparameter (pre-training control knob setting) preventing division by 0. The epsilon value is usually a small value, such as 1e-7, which we’ll be defaulting to. 
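
The two lines above can be expanded into a small runnable sketch. It assumes plain Python lists of floats, made-up constant gradients, and a learning rate of 1.0; the function name `adagrad_step` is hypothetical, and the update is subtracted from the parameter, as in gradient descent.

```python
import math

def adagrad_step(params, gradients, cache, learning_rate=1.0, eps=1e-7):
    """One AdaGrad step on lists of floats; `params` and `cache` are updated in place."""
    for i, gradient in enumerate(gradients):
        cache[i] += gradient ** 2   # history of squared gradients, per parameter
        params[i] -= learning_rate * gradient / (math.sqrt(cache[i]) + eps)

params = [0.0, 0.0]
cache = [0.0, 0.0]
# Made-up gradients: the first parameter always sees a large gradient, the second a small one.
for _ in range(3):
    adagrad_step(params, [10.0, 0.1], cache)

# The per-parameter factor that scales the gradient after three steps.
effective_rates = [1.0 / (math.sqrt(c) + 1e-7) for c in cache]
print(cache)
print(effective_rates)  # the large-gradient parameter ends up with the smaller rate
print(params)           # yet both parameters have moved by almost the same amount
```

With constant gradients, the normalization cancels the gradient's magnitude, so both parameters travel nearly the same distance despite a 100x difference in gradient size.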

### Side note

You might also notice that we sum the squared values only to take the square root later, which might look counter-intuitive.

Adding squared values and then taking the square root is not the same as just adding the values, for example:

![](https://drive.google.com/uc?id=12xz0Y1uf_jy1ikhs917O1-U_UR-UTh2i)

The resulting cache value grows more slowly, and in a different way, and it takes care of negative numbers (we would not want to divide the update by a negative number and flip its sign).
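
The same point in a few lines of code, using made-up gradients that alternate in sign:

```python
import math

gradients = [3.0, -4.0, 3.0, -4.0]   # made-up values

plain_sum = sum(gradients)                                    # signs cancel; result can be negative
absolute_sum = sum(abs(g) for g in gradients)                 # grows by |g| on every step
root_of_squares = math.sqrt(sum(g ** 2 for g in gradients))   # what AdaGrad divides by

print(plain_sum)        # -2.0
print(absolute_sum)     # 14.0
print(root_of_squares)  # about 7.07
```

The plain sum comes out negative here, which would flip the sign of the update, while the root of the summed squares stays non-negative and grows more slowly than the sum of absolute values.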

## RMSProp

### Idea

**RMSProp** is short for **Root Mean Square Propagation**. 

Similar to AdaGrad, RMSProp calculates an adaptive learning rate per parameter; it’s just calculated in a different way than AdaGrad.

<br>

<u>Details</u>

Where AdaGrad calculates the cache as:

```py
cache += gradient ** 2
```

RMSProp calculates the cache as:

```py
cache = rho * cache + (1 - rho) * gradient ** 2
```


Note that this is similar to both **momentum with the SGD** and **cache with the AdaGrad**. 

RMSProp adds a mechanism similar to momentum but also adds a per-parameter adaptive learning rate, so the learning rate changes are smoother. 

This helps to **retain the global direction of changes and slows changes in direction**. 

<br>

---

Instead of continually adding squared gradients to a cache (like in AdaGrad), it uses a moving average of the cache. Each update to the cache retains a part of the cache and updates it with a fraction of the new, squared gradients. In this way, the cache contents “move” with the data in time, and learning does not stall. In the case of this optimizer, the per-parameter learning rate can either fall or rise, depending on the last updates and the current gradient.

---


The new hyperparameter here is `rho`. Rho is the cache memory decay rate. 
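
A minimal sketch of the full RMSProp step for a single parameter, assuming `rho=0.9` and `eps=1e-7` as defaults and made-up gradients; the function name `rmsprop_step` is hypothetical. It shows the cache falling again once the gradients become small, which AdaGrad's cache cannot do.

```python
import math

def rmsprop_step(param, gradient, cache, learning_rate=0.001, rho=0.9, eps=1e-7):
    """One RMSProp step for a single parameter; returns (new_param, new_cache)."""
    # Moving average of squared gradients instead of a running sum.
    cache = rho * cache + (1.0 - rho) * gradient ** 2
    param -= learning_rate * gradient / (math.sqrt(cache) + eps)
    return param, cache

param, cache = 0.0, 0.0
caches = []
# Made-up gradients: large for 50 steps, then small for 50 steps.
for gradient in [10.0] * 50 + [0.1] * 50:
    param, cache = rmsprop_step(param, gradient, cache)
    caches.append(cache)

print(f"cache after the large gradients: {caches[49]:.4f}")
print(f"cache after the small gradients: {caches[99]:.4f}")
```

A smaller cache means a larger effective per-parameter rate, so the parameter can speed up again after a run of large gradients.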

> Note: Because this optimizer, with default values, carries over so much momentum of gradient and the adaptive learning rate updates, even small gradient updates are enough to keep it going; therefore, a default learning rate of 1 is far too large and causes instant model instability. 
>
> A learning rate that becomes stable again and gives fast enough updates is around 0.001 (that’s also the default value for this optimizer used in well-known machine learning frameworks). 


## Adam

### Idea

**Adam**, short for **Adaptive Moment Estimation**, is currently the **most widely-used optimizer**.

**It is built atop RMSProp, with the momentum concept from SGD added back in.** 

This means that, instead of applying current gradients, we’re going to apply momentums like in the SGD optimizer with momentum, then apply a per-weight adaptive learning rate with the cache as done in RMSProp.



### Bias Correction Mechanism

Do not confuse this with the layer’s bias. 

The **bias correction mechanism** is applied to the **cache** and **momentum**, compensating for the initial zeroed values before they warm up with initial steps. 

> NOTE: There are some nitty-gritty details here; see page 304 of the book.

<br>

<u>The Main Gist</u>

This mechanism significantly **speeds up training in the initial stages**, before the cache and momentum have finally "warmed up" after many steps.
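
To make the warm-up concrete, here is a hedged sketch of one Adam step for a single parameter. The names `beta_1` and `beta_2` and their values (0.9 and 0.999) are commonly used defaults assumed here, not taken from the text above, and `adam_step` is a hypothetical helper.

```python
import math

def adam_step(param, gradient, momentum, cache, step,
              learning_rate=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam step for a single parameter; `step` counts from 1."""
    # Moving averages of the gradient (momentum) and of the squared gradient (cache).
    momentum = beta_1 * momentum + (1.0 - beta_1) * gradient
    cache = beta_2 * cache + (1.0 - beta_2) * gradient ** 2
    # Bias correction: both averages start at 0, so their early values are too small.
    momentum_corrected = momentum / (1.0 - beta_1 ** step)
    cache_corrected = cache / (1.0 - beta_2 ** step)
    param -= learning_rate * momentum_corrected / (math.sqrt(cache_corrected) + eps)
    return param, momentum, cache

param, momentum, cache = adam_step(0.0, gradient=2.0, momentum=0.0, cache=0.0, step=1)

print(f"raw momentum after step 1:       {momentum:.4f}")
print(f"corrected momentum after step 1: {momentum / (1.0 - 0.9 ** 1):.4f}")
print(f"parameter after step 1:          {param:.6f}")
```

After the first step the raw momentum is only a tenth of the gradient, while the corrected value matches the gradient itself. As `step` grows, both divisors approach 1 and the correction fades away.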


## Summary of Optimizers

**Adam** is usually the best optimizer.

That’s not always the case though. It’s usually a good idea to try the Adam optimizer first but to also try the others, especially if you’re not getting the results you hoped for. Sometimes simple **SGD** or **SGD + momentum** performs better than Adam. 



## How to choose hyperparams

It is not always an easy task. 

It is usually best to start with the optimizer **defaults**, perform a few steps, and observe the training process when tuning different settings. 

---

For **SGD**'s **learning rate**, a good rule is that your initial training will benefit from a **larger** learning rate to take initial steps faster. If you start with steps that are too small, you might get stuck in a local minimum and be unable to leave it due to not making large enough updates to the parameters.

---

There is no single, best way to set hyper-parameters, but experience usually helps :)


---
**General Guidelines on LR**

- For SGD, a good starting LR is 1.0, decaying down to 0.1.
- For Adam, a good starting LR is 0.001 (1e-3), decaying down to 0.0001 (1e-4). 
