# Optimizers

In the last few lessons, we learned how to build several neural network architectures.  We used gradient descent, a technique we originally learned in the [dense neural network](https://github.com/VikParuchuri/zero_to_gpt/blob/master/explanations/dense.ipynb) lesson, to adjust model parameters.

Gradient descent is a type of optimizer.  Optimizers adjust neural network parameters to drive the loss toward a minimum (ideally the global minimum).  In this lesson, we'll learn more about optimizers.  We'll first go into more depth on gradient descent, and discuss batch size, learning rate schedules, weight decay, and momentum.  We'll then discuss AdamW, a popular optimizer that combines momentum and weight decay.  AdamW is the most commonly used optimizer in large language models like GPT.

## Stochastic Gradient Descent

The type of optimizer we've used so far is called stochastic gradient descent, or SGD.  In SGD, we take a minibatch of training examples, then compute the average gradient across the minibatch.  We then adjust the parameters using this average gradient.
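
As a minimal sketch of a single SGD step, consider a one-parameter model `pred = w * x`.  The model, the data, and the learning rate below are made up for illustration and are not part of the network we build in this lesson.

```python
import numpy as np

# Hypothetical minibatch for a one-parameter model: pred = w * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # generated by a "true" weight of 2.0
w = 0.0
lr = 0.01

def mse(w):
    # Mean squared error of the model on the minibatch
    return np.mean((w * x - y) ** 2)

# Per-example gradient of the squared error with respect to w
per_example_grads = 2 * (w * x - y) * x

# SGD uses the average gradient across the minibatch
avg_grad = per_example_grads.mean()

loss_before = mse(w)
w = w - lr * avg_grad  # step against the gradient
loss_after = mse(w)

print(avg_grad, w, loss_before, loss_after)
```

The loss drops after the step because the parameter moved in the direction opposite to the average gradient.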

Let's define a three-layer dense neural network, then explore the effect of batch size on our loss over time.  We'll first load in a dataset of weather observations that we've used in previous lessons.

Each row in this dataset is a weather observation from a given day.  We'll use one day's max temperature, min temperature, and rainfall to predict the next day's max temperature.  We have 3 predictors, and one value to predict.

The predictor columns have all been scaled using the scikit-learn `StandardScaler`.  This gives each column a mean of 0 and a standard deviation of 1.  This makes it easier to activate our nonlinearities and have the network learn.
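
To make the scaling concrete, here is a small sketch that standardizes columns by hand with NumPy.  The sample values are invented.  `StandardScaler` performs the equivalent computation using the population standard deviation (`ddof=0`), which is also NumPy's default for `std`.

```python
import numpy as np

# Made-up raw observations: max temp, min temp, rainfall
raw = np.array([
    [60.0, 35.0, 0.0],
    [52.0, 39.0, 0.2],
    [55.0, 41.0, 0.0],
    [48.0, 30.0, 1.1],
])

# Subtract each column's mean and divide by its standard deviation
col_mean = raw.mean(axis=0)
col_std = raw.std(axis=0)
scaled = (raw - col_mean) / col_std

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```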

In [50]:
import sys, os
sys.path.append(os.path.abspath('../data'))
from csv_data import WeatherDatasetWrapper

# Load the weather data as flat (predictors, target) arrays, split into train, validation, and test sets
wrapper = WeatherDatasetWrapper()
train_data, valid_data, test_data = wrapper.get_flat_datasets()

In [51]:
# Show the train predictors
train_data[0][:2]

array([[-0.72725587, -2.27150212, -0.25366126],
       [-1.68779357, -1.6825982 , -0.25366126]])

In [52]:
# Show the train target
train_data[1][:2]

array([[52.],
       [52.]])

Next, we'll define a single layer of our neural network.  We'll make a few modifications from earlier lessons:

- Instead of directly updating the parameters, we'll return the weight and bias gradients
- We'll add an update method that updates the weights later

This will enable us to swap different optimizers in and out of our network.

In [53]:
import math
import numpy as np

class Dense():
    def __init__(self, input_size, output_size, activation=True, seed=0):
        self.add_activation = activation
        self.hidden = None
        self.prev_hidden = None

        # Initialize the weights.  They'll be in the range -k to k, where k = sqrt(1 / input_size)
        np.random.seed(seed)
        k = math.sqrt(1 / input_size)
        self.weights = np.random.rand(input_size, output_size) * (2 * k) - k

        # Our bias will be initialized to 1
        self.bias = np.ones((1,output_size))

    def forward(self, x):
        # Copy the layer input for backprop
        self.prev_hidden = x.copy()
        # Multiply the input by the weights, then add the bias
        x = np.matmul(x, self.weights) + self.bias
        # Apply the activation function
        if self.add_activation:
            x = np.maximum(x, 0)
        # Copy the layer output for backprop
        self.hidden = x.copy()
        return x

    def backward(self, grad):
        # "Undo" the activation function if it was added
        if self.add_activation:
            grad = np.multiply(grad, np.heaviside(self.hidden, 0))

        # Calculate the parameter gradients
        w_grad = self.prev_hidden.T @ grad # This is not averaged across the batch, due to the way matrix multiplication sums
        b_grad = np.mean(grad, axis=0) # This is averaged across the batch
        param_grads = [w_grad, b_grad]

        # Calculate the next layer gradient
        grad = grad @ self.weights.T
        return param_grads, grad

    def update(self, w_grad, b_grad):
        # Update the weights given an update matrix
        self.weights += w_grad
        self.bias += b_grad
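
The comments in `backward` point out that the weight gradient is summed across the batch, while the bias gradient is averaged.  A quick sketch with made-up arrays (standing in for `prev_hidden` and `grad`) shows why the matrix product sums over examples:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, input_size, output_size = 4, 3, 2

# Stand-ins for a layer's saved input and the incoming gradient
prev_hidden = rng.normal(size=(batch_size, input_size))
grad = rng.normal(size=(batch_size, output_size))

# The matrix product used in the layer's backward pass
w_grad = prev_hidden.T @ grad

# The same quantity built one example at a time, as a sum of outer products
summed = np.zeros((input_size, output_size))
for i in range(batch_size):
    summed += np.outer(prev_hidden[i], grad[i])

print(np.allclose(w_grad, summed))  # the two agree

# Dividing by batch_size turns the sum into the per-example average
w_grad_avg = w_grad / batch_size
```

This is why the weight gradient still needs to be divided by the batch size before it is used in an update.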

We can then use the `Dense` class to create a 3-layer neural network.  The first layer will take in our predictors and generate `10` hidden features, the second layer will combine those features into `10` new features, and the final layer will make a prediction.

In [54]:
layers = [
    Dense(3, 10),
    Dense(10, 10),
    Dense(10, 1, activation=False)
]

We'll define functions that run forward and backward passes across all the layers together.  `forward` will do a full forward pass across all 3 layers, and `backward` will do a full backward pass across all 3 layers.

The backward pass will return the gradients instead of updating the parameters.  We'll use these gradients to update the parameters in our optimizer.

In [55]:
def forward(x, layers):
    # Loop through each layer
    for layer in layers:
        # Run the forward pass
        x = layer.forward(x)
    return x

def backward(grad, layers):
    # Save the gradients for each layer
    layer_grads = []
    # Loop through each layer in reverse order (starting from the output layer)
    for layer in reversed(layers):
        # Get the parameter gradients and the next layer gradient
        param_grads, grad = layer.backward(grad)
        layer_grads.append(param_grads)
    return layer_grads

We'll then define our SGD optimizer, which will take in a set of gradients and use them to update the network parameters.

The process for SGD is:

- Normalize the gradient by batch size, so that the gradient is the average gradient across the batch (in our case, our bias gradient is already normalized, but the weight gradient is not)
- Multiply the gradient by learning rate
- Multiply the gradient by `-1` so it is subtracted from the parameters in the `update` function

In [56]:
def sgd(layer_grads, layers, lr, batch_size):
    for layer_grad, layer in zip(layer_grads, reversed(layers)):
        w_grad, b_grad = layer_grad

        # Normalize the weight gradient by batch size
        w_grad = w_grad / batch_size

        # Calculate the update sizes
        w_update = -lr * w_grad
        b_update = -lr * b_grad

        layer.update(w_update, b_update)

## Monitoring

We can now write a function to train the network using the SGD optimizer and our data.  Since we'll be testing different batch sizes and optimizers, we want a way to monitor our network and compare different runs to each other.  So far, we've been using print statements to monitor per-epoch loss in our network, which makes it hard to compare runs.

We'll use a tool called [Weights & Biases](https://wandb.ai) to monitor our network.  We can use it to track the loss and accuracy of our network, and compare different runs to each other.  W&B is a free tool, and you can sign up for an account [here](https://wandb.ai).  If you don't want to use W&B, you can also use TensorBoard, but it's harder to set up and use.

We'll start by importing `wandb` and logging in.  We'll also set the `WANDB_SILENT` environment variable to `True`, so that W&B doesn't print out a lot of system messages.

In [57]:
%env WANDB_SILENT=True

import wandb
wandb.login()

env: WANDB_SILENT=True


True

With W&B, we track each training run separately.  W&B will track the loss in each epoch, or anything else we want to keep track of.  It will also render graphs for us, so we can compare different runs against each other.

The key W&B functions are:

- `wandb.init` - This will initialize a new run.  We can use the `config` parameter to pass in run-specific information we want to view.
- `wandb.log` - This will log a dictionary of metrics to the current run.
- `wandb.define_metric` - This will define a metric that we want to track.  You normally don't need to define wandb metrics upfront, but we'll use a custom `step_metric` to ensure that results from runs with different batch sizes line up.

We'll use W&B to log a running training loss, the final training set loss each epoch, and the validation set loss.  We'll also track how long each epoch takes to run.
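
To see why a custom step metric helps, consider the x-axis values each run would log.  In this sketch, the step counts training examples seen rather than batches processed, so runs with different batch sizes end at the same point.  The dataset size of `64` and the helper name `batch_steps` are hypothetical.

```python
def batch_steps(n_examples, batch_size, epochs):
    """Return the step value logged after each batch, measured in examples seen."""
    steps = []
    for epoch in range(epochs):
        for i in range(0, n_examples, batch_size):
            batch_idx = i + batch_size  # examples covered so far this epoch
            steps.append(batch_idx + epoch * n_examples)
    return steps

small = batch_steps(n_examples=64, batch_size=2, epochs=2)
large = batch_steps(n_examples=64, batch_size=32, epochs=2)

print(len(small), small[-1])  # 64 logged points, final step 128
print(len(large), large[-1])  # 4 logged points, final step 128
```

If the x-axis counted batches instead, the batch size `2` run would stretch 16x further to the right, and the two curves would be hard to compare.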

## The Training Loop

We can now write a function that will train our network using SGD, and log the loss to W&B.  This function will be very similar to training loops that we've written in previous lessons.

We didn't do this previously, but we'll shuffle our training data each epoch.  This will ensure that our batches are different each epoch.  Random shuffling is important when our batch size is greater than `1`.  As we saw earlier, the gradient is averaged across all the examples in a batch.  Random shuffling will put different training examples together each time, allowing the model to see more combinations of gradients.  This can reduce overfitting and improve the generalization of our network.  It can also help SGD converge faster (reach a low loss in fewer epochs).
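
One detail to watch when shuffling: the predictors and targets must be reordered with the same permutation, or rows will no longer match their targets.  A minimal sketch with toy arrays (the data here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data where each target equals the sum of its predictor row
x = np.arange(12, dtype=float).reshape(6, 2)
y = x.sum(axis=1, keepdims=True)

# Draw one permutation of the row indices and apply it to both arrays
perm = rng.permutation(len(x))
x_shuffled = x[perm]
y_shuffled = y[perm]

# The pairing between rows and targets survives the shuffle
print(np.array_equal(x_shuffled.sum(axis=1, keepdims=True), y_shuffled))
```

Indexing with a permutation also returns new arrays, so the original data is left in its original order.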

In [60]:
import time

def training_run(epochs, batch_size, lr, optimizer, train_data, valid_data):
    # Initialize a new W&B run, with the right parameters
    run = wandb.init(project="optimizers",
               config={"batch_size": batch_size,
                       "lr": lr,
                       "epochs": epochs,
                       "optimizer": optimizer.__name__})

    # Setup the metrics we want to track with wandb
    wandb.define_metric("batch_step") # This will ensure that results from runs with different batch sizes line up
    wandb.define_metric("epoch") # Per-epoch metrics (losses and runtime) are plotted against this counter
    wandb.define_metric("valid_loss", step_metric="epoch")
    wandb.define_metric("train_loss", step_metric="epoch")
    wandb.define_metric("runtime", step_metric="epoch")
    wandb.define_metric("running_loss", step_metric="batch_step")

    # Setup the layers for the training run
    layers = [
        Dense(3, 10),
        Dense(10,10),
        Dense(10, 1, activation=False)
    ]

    # Set the numpy random seed so the random shuffles proceed in the same order every run
    np.random.seed(0)

    # Split the training and valid data into x and y
    train_x, train_y = train_data
    valid_x, valid_y = valid_data

    for epoch in range(epochs):
        running_loss = 0
        start = time.time() # The start time of our run

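        # Shuffle x and y with the same permutation, so each row keeps its target
        perm = np.random.permutation(len(train_x))
        train_x, train_y = train_x[perm], train_y[perm]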
        for i in range(0, len(train_x), batch_size):
            # Get the x and y batches
            x_batch = train_x[i:(i+batch_size)]
            y_batch = train_y[i:(i+batch_size)]
            # Make a prediction
            pred = forward(x_batch, layers)

            # Run the backward pass
            loss = pred - y_batch
            layer_grads = backward(loss, layers)

            # Run the optimizer
            optimizer(layer_grads, layers, lr, batch_size)

            # Update running loss
            running_loss += np.mean(loss ** 2)

            batch_idx = i + batch_size # Get the last index of the current batch
            batch_step = batch_idx + epoch * len(train_x)
            # Log running loss.  We multiply by batch size to offset the mean from earlier.
            wandb.log({"running_loss": running_loss / batch_idx * batch_size, "batch_step": batch_step})

        # Calculate and log validation loss
        valid_preds = forward(valid_x, layers)
        valid_loss = np.mean((valid_preds - valid_y) ** 2)
        train_loss = running_loss / len(train_x) * batch_size
        wandb.log({
            "valid_loss": valid_loss,
            "epoch": epoch,
            "train_loss": train_loss,
            "runtime": time.time() - start
        })

    # Mark the run as complete
    run.finish()

We can then initialize the parameters we want to adjust (like batch size), and start the run.

In [None]:
# Setup our parameters
epochs = 50
batch_size = 2
lr = 5e-6

# Run our training loop
training_run(epochs, batch_size, lr, sgd, train_data, valid_data)

You should now be able to see your run in W&B.  You will need to click on the `optimizers` project in your [dashboard](https://wandb.ai/home).  Here's a screenshot of the run summary page:

![W&B Dashboard](images/optimizers/wandb_dash.png)

We can compare this run to a run with a higher batch size:

In [None]:
# Setup our parameters
epochs = 50
batch_size = 32
lr = 5e-6

# Run our training loop
training_run(epochs, batch_size, lr, sgd, train_data, valid_data)

You should end up with a dashboard that looks like this:

![W&B Dashboard](images/optimizers/wandb_comp.png)

We can draw a few conclusions from this:

- The higher batch size descends more slowly than the lower batch size.
- The higher batch size runs much faster (around 16x faster).
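
A likely source of the speed difference is the number of parameter updates per epoch.  As a rough sketch with a hypothetical training set of `10,000` rows (the real dataset size isn't stated here):

```python
import math

n_examples = 10_000  # hypothetical training set size

updates_small = math.ceil(n_examples / 2)   # batch_size = 2
updates_large = math.ceil(n_examples / 32)  # batch_size = 32
ratio = updates_small / updates_large

print(updates_small, updates_large, ratio)
```

With a network this small, the cost of one update barely changes between `2` and `32` rows, so about 16x fewer updates plausibly accounts for most of the speedup.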

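Batch size can have a big impact on the final accuracy of your model.  Since you're averaging the gradients across the batch, a higher batch size means each update can't fit as tightly to any single training example.  In that sense, a higher batch size behaves somewhat like a form of regularization (which we'll talk about in depth in a later lesson), although the noisier updates from small batches can also help a model generalize.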
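
The effect of averaging can be sketched numerically.  Below, made-up per-example gradients are drawn from a normal distribution, and we measure how much the batch-averaged gradient varies from batch to batch.  The helper name `batch_mean_spread` and all of the values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up per-example gradients: true mean 1.0, with a lot of per-example noise
per_example = rng.normal(loc=1.0, scale=5.0, size=64_000)

def batch_mean_spread(grads, batch_size):
    # Average the gradients within each batch, then measure the spread across batches
    batch_means = grads.reshape(-1, batch_size).mean(axis=1)
    return batch_means.std()

spread_2 = batch_mean_spread(per_example, 2)
spread_32 = batch_mean_spread(per_example, 32)

print(spread_2, spread_32)  # the batch-size-32 averages vary far less
```

The spread shrinks roughly with the square root of the batch size, so each update with batch size `32` is much less influenced by any individual example than an update with batch size `2`.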