## The Optimizer

With a loss function, we can tell how well (or poorly!) our neural network is performing. To improve on our model’s performance, we need to adjust the weights and biases. This is what the **optimizer algorithm** does!

### Gradient Descent Optimization

There are many different optimization algorithms. One of the most common is called **gradient descent**.

Imagine we’re on top of a mountain, and our loss function tells us how high up we are. Our goal is to get down the mountain, making the loss function as small as possible. Unfortunately, it’s a dark night, so even with perfect sight we couldn’t see very far.

Suppose (using sight or some kind of radar) we can determine how the mountain slopes around us for about a meter in any direction. With only that information, we might pick where the mountain slopes down the most and move in that direction.

We don’t want to move too far at once, because the mountain slope might change as we move. So we’ll only move a short distance in our chosen direction before pausing and re-evaluating which direction goes downhill the most.

This strategy is essentially how gradient descent works! It uses calculus to determine the gradients of the loss function with respect to each weight and bias. These **gradients** act as direction signs: each one points in the direction that increases the loss fastest, so we adjust the weights and biases the opposite way to decrease the loss.
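
To make the mountain analogy concrete, here is a minimal sketch of gradient descent in plain Python. The one-variable loss `(w - 3)**2`, the starting point, and the names `gradient_descent` and `step_size` are all made up for illustration; a real neural network has many weights and lets PyTorch compute the gradients.

```python
def loss(w):
    # a simple bowl-shaped "mountain" whose lowest point is at w = 3
    return (w - 3.0) ** 2


def gradient(w):
    # slope of the loss at w, worked out by hand with calculus
    return 2.0 * (w - 3.0)


def gradient_descent(w, step_size=0.1, steps=50):
    # step_size and steps are illustrative values, not recommendations
    for _ in range(steps):
        # the gradient points uphill, so we move a short distance the opposite way
        w = w - step_size * gradient(w)
    return w


w_final = gradient_descent(w=0.0)
print(w_final, loss(w_final))
```

Each pass through the loop is one "pause and re-evaluate" moment from the analogy: we measure the slope where we currently stand, then take a short step downhill.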

### Learning Rate

How far we move at each step is controlled by the **learning rate**. Choosing a learning rate involves some tradeoffs:

- a learning rate that is too high may cause the model to overshoot and miss the lowest point
- a learning rate that is too low may cause the model to learn slowly or get stuck
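
We can see both failure modes in a small experiment. The sketch below assumes a toy loss `w**2` (lowest point at `w = 0`) and three hand-picked learning rates; the function `run` and the specific values are hypothetical and chosen only to exaggerate the effect.

```python
def run(lr, steps=20, w=1.0):
    # plain gradient descent on loss(w) = w**2, whose gradient is 2*w
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w


too_small = run(lr=0.001)   # barely moves away from the starting point
reasonable = run(lr=0.1)    # gets close to the lowest point at w = 0
too_large = run(lr=1.1)     # overshoots further on every step

print(too_small, reasonable, too_large)
```

With the tiny learning rate, 20 steps are not enough to make real progress. With the large one, every step jumps past the lowest point and lands farther away than before.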

The learning rate is a classic example of a **hyperparameter**: a value that is not learned by the model but is chosen and tweaked by ML engineers to improve the model’s performance. The process of adjusting hyperparameters in search of the best model performance is called **hyperparameter tuning**.

### Using Optimizers in PyTorch

A popular optimizer in PyTorch, called **Adam**, uses gradient descent with a few extra bells and whistles (like adjusting the learning rate dynamically during training). To use Adam, we’ll use the syntax

```python
import torch.optim as optim
optimizer = optim.Adam(model.parameters(), lr=0.01)
```

where
- `model.parameters()` tells Adam what our current weights and biases are
- `lr=0.01` tells Adam to set the learning rate to `0.01`
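
To see what `model.parameters()` actually hands to Adam, we can inspect a small model. This is only a sketch: the one-layer `nn.Linear(3, 1)` model below is a stand-in chosen for illustration, not the model from this lesson.

```python
import torch.nn as nn
import torch.optim as optim

# hypothetical tiny model: 3 input features -> 1 output
model = nn.Linear(3, 1)

# named_parameters() yields the same tensors as parameters(), plus their names
for name, param in model.named_parameters():
    print(name, tuple(param.shape))

optimizer = optim.Adam(model.parameters(), lr=0.01)

# the optimizer stores the learning rate alongside the parameters it manages
print(optimizer.param_groups[0]["lr"])
```

For this stand-in model, the loop prints one `weight` tensor and one `bias` tensor, which are exactly the values Adam will update.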

To apply Adam to a neural network, we need to perform two steps:
1. **backward pass**: calculate the gradients of the loss function (these determine the “downward” direction)
2. **step**: use the gradients to update the weights and biases

The syntax is
```python
# compute the loss
MSE = loss(predictions, y)
# backward pass to determine "downward" direction
MSE.backward()
# apply the optimizer to update weights and biases
optimizer.step()
```

Note that `backward` is called on the computed loss (the tensor `MSE`), not on the loss function itself. This is why the output of the `loss` function includes the attribute `grad_fn=<MseLossBackward0>`: it records the function PyTorch uses to perform the backward pass.
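
Putting the pieces together, here is a hedged sketch of a complete training loop. The data (points on the line `y = 2x + 1`), the one-layer model, and the choice of 200 iterations are assumptions made for illustration. The loop also calls `optimizer.zero_grad()`, which is not covered above: PyTorch adds new gradients to any existing ones, so they are usually reset before each backward pass when training runs for more than one step.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# made-up training data following y = 2x + 1
X = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 2 * X + 1

model = nn.Linear(1, 1)   # stand-in model with one weight and one bias
loss = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

initial_loss = loss(model(X), y).item()

for epoch in range(200):
    # clear gradients left over from the previous iteration
    optimizer.zero_grad()
    # compute the loss
    predictions = model(X)
    MSE = loss(predictions, y)
    # backward pass to determine "downward" direction
    MSE.backward()
    # apply the optimizer to update weights and biases
    optimizer.step()

final_loss = loss(model(X), y).item()

print(MSE.grad_fn)                # the function used for the backward pass
print(initial_loss, final_loss)   # the loss should have decreased
```

Running more iterations, or trying a different learning rate, is a simple way to experiment with how quickly the loss falls.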