# Optimizers

In the last few lessons, we built several neural network architectures. In the [dense neural network](https://github.com/VikParuchuri/zero_to_gpt/blob/master/explanations/dense.ipynb) lesson, we learned how to adjust network parameters with gradient descent, and we used that technique in the lessons that followed.

Gradient descent is one type of optimizer. An optimizer adjusts a network's parameters to push the loss toward a minimum. In this lesson, we'll look at optimizers in more depth. We'll start with gradient descent and discuss batch size, learning rate schedules, weight decay, and momentum. We'll then cover AdamW, a popular optimizer that combines momentum and weight decay. AdamW is one of the most commonly used optimizers for training large language models like GPT.

## Stochastic Gradient Descent

The optimizer we've used so far is stochastic gradient descent, or SGD. In SGD, we take a minibatch of training examples and compute the gradient of the loss averaged across the minibatch. We then move each parameter a small step in the direction opposite to that average gradient, with the step size set by the learning rate.
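
To make the update concrete, here is a minimal sketch of minibatch SGD on a toy linear regression problem. The data, the `sgd_step` helper, the batch size, and the learning rate are all illustrative assumptions and are not part of this lesson's weather dataset or network.

```python
import numpy as np

def sgd_step(weights, batch_x, batch_y, lr):
    """One SGD step for linear regression with mean squared error (illustrative helper)."""
    pred = batch_x @ weights                      # (batch, 1)
    error = pred - batch_y                        # (batch, 1)
    # Gradient of mean((pred - y)^2) with respect to the weights,
    # averaged over the examples in the minibatch
    grad = 2 * batch_x.T @ error / len(batch_x)   # (features, 1)
    # Step in the direction opposite to the gradient
    return weights - lr * grad

# Toy, noise-free data generated from known weights (assumed for this example)
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 3))
true_w = np.array([[1.0], [-2.0], [0.5]])
y = x @ true_w

weights = np.zeros((3, 1))
batch_size = 16
for epoch in range(50):
    order = rng.permutation(len(x))               # reshuffle the examples each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        weights = sgd_step(weights, x[idx], y[idx], lr=0.05)

final_loss = float(np.mean((x @ weights - y) ** 2))
print(final_loss)
```

Each pass through the inner loop uses only `batch_size` examples, so the gradient is a noisy estimate of the full-dataset gradient. That noise is the "stochastic" part of the name.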

Let's define a two-layer dense neural network, then explore the effect of batch size on our loss over time.

In [3]:
import sys, os
sys.path.append(os.path.abspath('../data'))
from csv_data import WeatherDatasetWrapper

# Load the data with 3 target values instead of the binary value from earlier
wrapper = WeatherDatasetWrapper()
[train_x, train_y], [valid_x, valid_y], [test_x, test_y] = wrapper.get_flat_datasets()

In [5]:
import math
import numpy as np

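class Dense():
    def __init__(self, input_size, output_size, activation=True, seed=0):
        self.add_activation = activation
        self.hidden = None
        self.prev_hidden = None

        # Initialize weights uniformly in [-k, k), where k = sqrt(1 / input_size)
        np.random.seed(seed)
        k = math.sqrt(1 / input_size)
        self.weights = np.random.rand(input_size, output_size) * (2 * k) - k

    def forward(self, x):
        # Cache the layer input and output; backward needs both
        self.prev_hidden = x.copy()
        x = np.matmul(x, self.weights)
        if self.add_activation:
            x = np.maximum(x, 0)  # ReLU
        self.hidden = x.copy()
        return x

    def backward(self, grad, lr):
        # lr is not used in this method
        if self.add_activation:
            # ReLU derivative: 1 where the output was positive, 0 elsewhere
            grad = np.multiply(grad, np.heaviside(self.hidden, 0))

        # Gradient with respect to the weights
        w_grad = self.prev_hidden.T @ grad
        # Gradient with respect to the layer input, passed on to the previous layer
        grad = grad @ self.weights.T
        return w_grad, grad

    def update(self, w_grad):
        # Subtracts w_grad as-is, so any learning rate scaling must happen before this call
        self.weights -= w_grad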

In [6]:
def forward(x, layers):
    for layer in layers:
        x = layer.forward(x)
    return x

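def backward(grad, layers, lr):
    # grad is the gradient of the loss with respect to the network output.
    # lr is passed through to each layer's backward method.
    w_grads = []
    for layer in reversed(layers):
        # Give each layer the gradient returned by the layer after it
        w_grad, grad = layer.backward(grad, lr)
        w_grads.append(w_grad)
    # Weight gradients are ordered from the last layer to the first
    return w_grads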

layers = [Dense(3, 10), Dense(10, 1, activation=False)]
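
A common way to sanity-check a hand-written backward pass is a finite-difference gradient check. The sketch below is self-contained: `TinyDense` is a hypothetical stand-in for a bias-free dense layer with an optional ReLU, and the data, layer sizes, and squared-error loss are assumptions chosen only to keep the example small.

```python
import numpy as np

class TinyDense:
    """Hypothetical stand-in: a bias-free dense layer with optional ReLU."""
    def __init__(self, weights, activation=True):
        self.weights = weights
        self.add_activation = activation

    def forward(self, x):
        self.prev_hidden = x
        out = x @ self.weights
        if self.add_activation:
            out = np.maximum(out, 0)
        self.hidden = out
        return out

    def backward(self, grad):
        if self.add_activation:
            # ReLU passes gradient only where its output was positive
            grad = grad * (self.hidden > 0)
        w_grad = self.prev_hidden.T @ grad
        return w_grad, grad @ self.weights.T

def loss_fn(layers, x, y):
    """Run the network and return (0.5 * sum of squared errors, output)."""
    out = x
    for layer in layers:
        out = layer.forward(out)
    return 0.5 * np.sum((out - y) ** 2), out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
y = rng.normal(size=(5, 1))
layers = [
    TinyDense(rng.normal(size=(3, 4))),
    TinyDense(rng.normal(size=(4, 1)), activation=False),
]

# Analytic gradients: start from dLoss/dOutput and hand each layer
# the gradient returned by the layer after it
_, out = loss_fn(layers, x, y)
grad = out - y
analytic = []
for layer in reversed(layers):
    w_grad, grad = layer.backward(grad)
    analytic.append(w_grad)
analytic.reverse()  # reorder from first layer to last

eps = 1e-6

def numeric_grad(layer, i, j):
    """Central-difference estimate of dLoss/dWeight[i, j]."""
    original = layer.weights[i, j]
    layer.weights[i, j] = original + eps
    plus, _ = loss_fn(layers, x, y)
    layer.weights[i, j] = original - eps
    minus, _ = loss_fn(layers, x, y)
    layer.weights[i, j] = original
    return (plus - minus) / (2 * eps)

# Largest disagreement between the numeric and analytic gradients
max_diff = max(
    abs(numeric_grad(layer, i, j) - analytic[k][i, j])
    for k, layer in enumerate(layers)
    for i in range(layer.weights.shape[0])
    for j in range(layer.weights.shape[1])
)
print(max_diff)
```

If the two gradients disagree by more than a small tolerance, the usual suspects are a gradient that is not chained from one layer to the previous one, or an activation derivative applied to the wrong cached value.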