# Optimizers

In the last few lessons, we learned how to build several neural network architectures.  We used gradient descent, a technique we originally learned in the [dense neural network](https://github.com/VikParuchuri/zero_to_gpt/blob/master/explanations/dense.ipynb) lesson, to adjust model parameters.

Gradient descent is a type of optimizer.  Optimizers adjust neural network parameters to reduce the loss, ideally toward a global minimum.  In this lesson, we'll learn more about optimizers.  We'll first go into more depth on gradient descent, and discuss batch size, learning rate schedules, weight decay, and momentum.  We'll then discuss AdamW, a popular optimizer that combines momentum and weight decay.  AdamW is the most commonly used optimizer in large language models like GPT.

## Stochastic Gradient Descent

The type of optimizer we've used so far is called stochastic gradient descent, or SGD.  In SGD, we take a minibatch of training examples, then compute the average gradient across the minibatch.  We then adjust the parameters using this average gradient.
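
As a rough illustration of the idea, here is a minimal sketch of minibatch SGD on a one-parameter linear model.  The toy data, the variable names, and the learning rate are all made up for this example and are not part of the weather network we build below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): y is roughly 3 * x plus a little noise
x = rng.normal(size=(64, 1))
y = 3 * x + 0.1 * rng.normal(size=(64, 1))

w = np.zeros((1, 1))  # the single parameter we want to learn
lr = 0.1
batch_size = 8

for epoch in range(20):
    for i in range(0, len(x), batch_size):
        # Take one minibatch of training examples
        x_batch = x[i:i + batch_size]
        y_batch = y[i:i + batch_size]

        pred = x_batch @ w

        # Average gradient of 0.5 * squared error across the minibatch
        grad = x_batch.T @ (pred - y_batch) / len(x_batch)

        # Step against the average gradient
        w -= lr * grad

print(w)
```

Each update only sees `batch_size` examples, so the gradient is a noisy estimate of the gradient over the full dataset.  That noise is what makes the method "stochastic".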

Let's define a three-layer dense neural network, then explore the effect of batch size on our loss over time.  We'll first load in a dataset of weather observations that we've used in previous lessons.

Each row in this dataset is a weather observation from a given day.  We'll use one day's max temperature, min temperature, and rainfall to predict the next day's max temperature.  We have 3 predictors, and one value to predict.

The predictor columns have all been scaled using the scikit-learn `StandardScaler`.  This gives each column a mean of 0 and a standard deviation of 1.  This makes it easier to activate our nonlinearities and have the network learn.
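
To make the scaling concrete, here is a small sketch of the same standardization done by hand in NumPy.  The raw values below are hypothetical, not rows from the actual dataset.  `StandardScaler` uses the population standard deviation, which matches NumPy's default.

```python
import numpy as np

# Hypothetical raw observations: max temperature, min temperature, rainfall
raw = np.array([[60.0, 35.0, 0.0],
                [52.0, 39.0, 0.1],
                [71.0, 48.0, 0.0],
                [65.0, 41.0, 0.8]])

# Per-column statistics
mean = raw.mean(axis=0)
std = raw.std(axis=0)  # population standard deviation (ddof=0)

# Subtract the mean and divide by the standard deviation, column by column
scaled = (raw - mean) / std

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```

In practice the mean and standard deviation are usually computed on the training set only, then reused to scale the validation and test sets.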

In [1]:
import sys, os
sys.path.append(os.path.abspath('../data'))
from csv_data import WeatherDatasetWrapper

# Load the train, validation, and test splits (3 predictors and 1 target value per row)
wrapper = WeatherDatasetWrapper()
[train_x, train_y], [valid_x, valid_y], [test_x, test_y] = wrapper.get_flat_datasets()

In [10]:
train_x[:2]

array([[-0.72725587, -2.27150212, -0.25366126],
       [-1.68779357, -1.6825982 , -0.25366126]])

In [11]:
train_y[:2]

array([[52.],
       [52.]])

Next, we'll define a single layer of our neural network.  We'll make a few modifications from earlier lessons:

- Instead of directly updating the parameters, we'll return the weight and bias gradients
- We'll add an update method that updates the weights later

This will enable us to swap different optimizers in and out of our network.

In [2]:
import math
import numpy as np

class Dense():
    def __init__(self, input_size, output_size, activation=True, seed=0):
        self.add_activation = activation
        self.hidden = None
        self.prev_hidden = None

        # Initialize the weights.  They'll be in the range -k to k, where k = sqrt(1 / input_size)
        np.random.seed(seed)
        k = math.sqrt(1 / input_size)
        self.weights = np.random.rand(input_size, output_size) * (2 * k) - k

        # Our bias will be initialized to 1
        self.bias = np.ones((1,output_size))

    def forward(self, x):
        # Copy the layer input for backprop
        self.prev_hidden = x.copy()
        # Multiply the input by the weights, then add the bias
        x = np.matmul(x, self.weights) + self.bias
        # Apply the activation function
        if self.add_activation:
            x = np.maximum(x, 0)
        # Copy the layer output for backprop
        self.hidden = x.copy()
        return x

    def backward(self, grad):
        # "Undo" the activation function if it was added
        if self.add_activation:
            grad = np.multiply(grad, np.heaviside(self.hidden, 0))

        # Calculate the parameter gradients
        w_grad = self.prev_hidden.T @ grad # This is not averaged across the batch, due to the way matrix multiplication sums
        b_grad = np.mean(grad, axis=0) # This is averaged across the batch
        param_grads = [w_grad, b_grad]

        # Calculate the next layer gradient
        grad = grad @ self.weights.T
        return param_grads, grad

    def update(self, w_grad, b_grad):
        # Update the weights given an update matrix
        self.weights += w_grad
        self.bias += b_grad
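
The comments in `backward` note that the weight gradient is summed across the batch while the bias gradient is averaged.  The sketch below, which uses made-up random arrays rather than the real network, shows why: the matrix product `prev_hidden.T @ grad` is the same as adding up one outer product per example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up shapes for illustration: a batch of 4 examples, 3 inputs, 2 outputs
prev_hidden = rng.normal(size=(4, 3))
grad = rng.normal(size=(4, 2))

# Matrix multiplication sums the contribution of every example in the batch
w_grad = prev_hidden.T @ grad

# Equivalent: one outer product per example, added together
w_grad_sum = sum(np.outer(prev_hidden[i], grad[i]) for i in range(4))

# The bias gradient is averaged across the batch, as in Dense.backward
b_grad = np.mean(grad, axis=0)

print(np.allclose(w_grad, w_grad_sum))
print(w_grad.shape, b_grad.shape)
```

This is why the weight gradient still needs to be divided by the batch size before it is used, while the bias gradient does not.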

We can then use the `Dense` class to create a 3-layer neural network.  The first layer will take in our predictors and generate `10` hidden features, the second layer will combine those features into `10` new features, and the final layer will make a prediction.

In [3]:
layers = [
    Dense(3, 10),
    Dense(10, 10),
    Dense(10, 1, activation=False)
]

We'll define functions that run forward and backward passes across all the layers together.  `forward` will do a full forward pass across all 3 layers, and `backward` will do a full backward pass across all 3 layers.

The backward pass will return the gradients instead of updating the parameters.  We'll use these gradients to update the parameters in our optimizer.

In [4]:
def forward(x, layers):
    # Loop through each layer
    for layer in layers:
        # Run the forward pass
        x = layer.forward(x)
    return x

def backward(grad, layers):
    # Save the gradients for each layer
    layer_grads = []
    # Loop through each layer in reverse order (starting from the output layer)
    for layer in reversed(layers):
        # Get the parameter gradients and the next layer gradient
        param_grads, grad = layer.backward(grad)
        layer_grads.append(param_grads)
    return layer_grads

We'll then define our SGD optimizer, which will take in a set of gradients and use them to update the network parameters.

The process for SGD is:

- Normalize the gradient by batch size, so that the gradient is the average gradient across the batch (in our case, our bias gradient is already normalized, but the weight gradient is not)
- Multiply the gradient by learning rate
- Multiply the gradient by `-1` so it is subtracted from the parameters in the `update` function

In [5]:
def sgd(layer_grads, layers, lr, batch_size):
    for layer_grad, layer in zip(layer_grads, reversed(layers)):
        w_grad, b_grad = layer_grad

        # Normalize the weight gradient by batch size
        w_grad = w_grad / batch_size

        # Calculate the update sizes
        w_update = -lr * w_grad
        b_update = -lr * b_grad

        layer.update(w_update, b_update)

## Monitoring

We can now write a function to train the network using the SGD optimizer and our data.  Since we'll be testing different batch sizes and optimizers, we want a way to monitor our network and compare different runs to each other.  So far, we've been using print statements to monitor per-epoch accuracy in our network, which makes it hard to compare one run to another.

We'll use a tool called [Weights & Biases](https://wandb.ai) to monitor our network.  We can use it to track the loss and accuracy of our network, and compare different runs to each other.  W&B is a free tool, and you can sign up for an account [here](https://wandb.ai).  If you don't want to use W&B, you can also use TensorBoard, but it's harder to set up and use.

We'll start by importing `wandb` and logging in.  We'll also set the `WANDB_SILENT` environment variable to `True`, so that W&B doesn't print out a lot of system messages.

In [6]:
%env WANDB_SILENT=True

import wandb
wandb.login()

env: WANDB_SILENT=True


True

In [7]:
epochs = 10
batch_size = 2
lr = 5e-6

run = wandb.init(project="optimizers",
                 config={"batch_size": batch_size,
                         "lr": lr,
                         "optimizer": "sgd"})

In [8]:
%%wandb

layers = [
    Dense(3, 10),
    Dense(10,10),
    Dense(10, 1, activation=False)
]

for epoch in range(epochs):
    running_loss = 0

    for i in range(0, len(train_x), batch_size):
        x_batch = train_x[i:(i+batch_size)]
        y_batch = train_y[i:(i+batch_size)]
        pred = forward(x_batch, layers)

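        # Prediction error.  This is also the gradient of the squared error with respect to pred, up to a constant factor
        loss = pred - y_batch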
        running_loss += np.mean(loss ** 2)

        layer_grads = backward(loss, layers)
        sgd(layer_grads, layers, lr, batch_size)

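        # i is the start index of the batch, so (i + batch_size) / batch_size batches have been seen so far
        wandb.log({"running_loss": running_loss / (i + batch_size) * batch_size})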

    valid_preds = forward(valid_x, layers)
    valid_loss = np.mean((valid_preds - valid_y) ** 2)
    train_loss = running_loss / len(train_x) * batch_size
    wandb.log({"valid_loss": valid_loss, "epoch": epoch, "train_loss": train_loss})
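
When the loop index advances in steps of `batch_size`, the number of batches seen so far is `i // batch_size + 1`.  The sketch below shows a running mean of per-batch losses computed that way.  The loss values are made up, and nothing is logged to W&B here.

```python
# Made-up per-batch mean squared errors, for illustration only
batch_losses = [4.0, 2.0, 3.0]
batch_size = 2
n_examples = 6

running_loss = 0.0
running_means = []

for i, batch_loss in zip(range(0, n_examples, batch_size), batch_losses):
    running_loss += batch_loss

    # i is the start index of the current batch
    batches_seen = i // batch_size + 1

    # Average loss per batch so far in this epoch
    running_means.append(running_loss / batches_seen)

print(running_means)  # [4.0, 3.0, 3.0]
```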

In [9]:
valid_preds - valid_y

array([[-6.50761231],
       [ 8.03445872],
       [ 2.60728585],
       ...,
       [-1.02646218],
       [-5.6792079 ],
       [ 8.22534764]])
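
The raw residuals are hard to read at a glance, so it can help to summarize them.  Here is a minimal sketch that uses only the first three residuals printed above, not the full validation set, so the numbers it produces are not the network's real validation metrics.

```python
import numpy as np

# The first three residuals copied from the output above (a small sample only)
residuals = np.array([[-6.50761231],
                      [ 8.03445872],
                      [ 2.60728585]])

mse = np.mean(residuals ** 2)         # mean squared error, the loss we train on
rmse = np.sqrt(mse)                   # same units as the temperature
mae = np.mean(np.abs(residuals))      # average size of an error

print(mse, rmse, mae)
```

Running the same three lines on the full `valid_preds - valid_y` array would give metrics for the whole validation set.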