# Regularization

You may be surprised to learn that the goal of training a neural network is not to minimize the training loss.  This is counterintuitive: so far, we've been training models with gradient descent and judging progress by how much the loss decreases over time.

However, the training data is just a sample of the population data.  For example, let's say that we have data on house prices, where each row in this dataset represents a single house.  The columns are:

- `interest`: The interest rate
- `vacancy`: The vacancy rate
- `cpi`: The consumer price index
- `price`: The price of a house
- `value`: The value of a house
- `adj_price`: The price of a house, adjusted for inflation
- `adj_value`: The value of a house, adjusted for inflation

We have about `700` different rows of data - this is called our sample.  There are many more houses that aren't in our training data - this is called the population.  What we actually want to do is train the model on our sample of data, but use it to make predictions on the population.

If our model makes good predictions in our sample, but bad predictions in the population, then that's called overfitting.  Overfitting means that the model won't be useful in the real world.  Neural networks are prone to overfitting, and there are many techniques that have been developed to prevent this.
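
To make the idea concrete, here is a minimal sketch that uses polynomial fitting rather than a neural network.  The data is made up (a noisy straight line), and the names `make_data`, `sample_x`, and `pop_x` are only for illustration.  A very flexible model can only match or beat a simple one on the small sample it was fit to, but it will typically do worse on fresh data from the same source:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical data source: y is linear in x, plus noise
    x = rng.uniform(-1, 1, n)
    y = 2 * x + rng.normal(0, 0.3, n)
    return x, y

sample_x, sample_y = make_data(15)   # the small sample we get to train on
pop_x, pop_y = make_data(1000)       # stand-in for the population

def mse(coefs, x, y):
    # Mean squared error of a polynomial with the given coefficients
    return np.mean((np.polyval(coefs, x) - y) ** 2)

simple = np.polyfit(sample_x, sample_y, deg=1)    # straight line
flexible = np.polyfit(sample_x, sample_y, deg=9)  # enough freedom to chase the noise

print("degree 1 (sample, population):", mse(simple, sample_x, sample_y), mse(simple, pop_x, pop_y))
print("degree 9 (sample, population):", mse(flexible, sample_x, sample_y), mse(flexible, pop_x, pop_y))
```

Comparing the two printed rows shows the gap between sample loss and population loss for each model.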

These techniques are broadly called regularization.  Regularization adds constraints or penalties to the model (often to its parameters) in order to encourage it to learn simpler and more generalizable representations.  This will usually increase loss in the sample, but decrease loss in the population.

Of course, we don't have access to data from the population.  This means that we need to split our sample up into a training set, a validation set, and a test set.  While we're training the model, we'll evaluate on the validation set.  After we've optimized our training method and parameters, we'll evaluate on the test set.  This ensures that we're not using knowledge of the test set when we tune our parameters (this could cause overfitting to the test set).
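
As a rough sketch of what such a split can look like, the function below shuffles the rows and cuts them into three parts.  The function name `split_sample`, the 70/15/15 proportions, and the random shuffle are all assumptions made for illustration; this is not necessarily how the dataset wrapper used later splits the data, and a random shuffle would be a poor choice for time-ordered data.

```python
import numpy as np

def split_sample(x, y, valid_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the rows, then split them into train, validation, and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_test = int(round(len(x) * test_frac))
    n_valid = int(round(len(x) * valid_frac))
    test_idx = order[:n_test]
    valid_idx = order[n_test:n_test + n_valid]
    train_idx = order[n_test + n_valid:]
    return ((x[train_idx], y[train_idx]),
            (x[valid_idx], y[valid_idx]),
            (x[test_idx], y[test_idx]))

# Placeholder data: 700 rows of random features and a made-up target
x = np.random.default_rng(1).normal(size=(700, 7))
y = x.sum(axis=1, keepdims=True)

train, valid, test = split_sample(x, y)
print(len(train[0]), len(valid[0]), len(test[0]))
```

The test set is only touched once, after all tuning decisions have been made on the validation set.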

We'll learn three forms of regularization:

- Weight decay, which shrinks the weights slightly on every optimizer step.  This pushes most of the weights towards zero, which encourages the model to learn simpler representations.
- Dropout, which randomly sets activations to zero.  This prevents the model from relying on any single activation, and encourages it to learn more generalizable representations.
- Early stopping, which stops training when the validation loss starts to increase.  This prevents overfitting by stopping training before the model starts to memorize the training data.

There are other forms of regularization, like data augmentation, but these are the most common when working with text.

## Loading the data

We'll first load in the data.  We'll be using the same data from the last lesson:

In [1]:
import sys, os
sys.path.append(os.path.abspath('../data'))
sys.path.append(os.path.abspath('../nnets'))
from dense import DenseManualUpdate as Dense, forward, backward
from csv_data import HousePricesDatasetWrapper
import numpy as np
from optimizer import Optimizer

wrapper = HousePricesDatasetWrapper()
train_data, valid_data, test_data = wrapper.get_flat_datasets()

## Setting up a training run

As you can see, we split the data into three sets.  We'll use the training set to train the model, the validation set to evaluate the model while we train and tune it, and the test set to evaluate the model after we've finished training.

We'll write a training loop function, which will allow us to test different types of regularization.  The function:

- Sets up a new W&B run for monitoring
- Loops through the training data batch by batch:
    - Makes a prediction
    - Finds the error
    - Computes the gradient
    - Updates the parameters
- Logs the loss

In [2]:
%env WANDB_SILENT=True

import wandb
wandb.login()

def training_run(epochs, regularization, layers, optimizer, train_data, valid_data, name=None):
    # Initialize a new W&B run, with the right parameters
    wandb.init(project="regularization",
               name=name,
               config={"regularization": regularization})

    # Split the training and valid data into x and y
    train_x, train_y = train_data
    valid_x, valid_y = valid_data

    for epoch in range(epochs):
        running_loss = 0
        for i in range(len(train_x)):
            # Get the x and y batches
            x_batch = train_x[i:(i+1)]
            y_batch = train_y[i:(i+1)]
            # Make a prediction
            pred = forward(x_batch, layers, training=True)

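            # Run the backward pass.  Despite the name, `loss` holds the raw error
            # (pred - y), which is the gradient of the squared error up to a constant factor
            loss = pred - y_batch
            layer_grads = backward(loss, layers)
            # Accumulate the squared error for logging
            running_loss += np.mean(loss ** 2)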

            # Run the optimizer
            optimizer(layer_grads, layers, 1)

        # Calculate and log validation loss
        valid_preds = forward(valid_x, layers, training=False)
        valid_loss = np.mean((valid_preds - valid_y) ** 2)
        train_loss = running_loss / len(train_x)
        wandb.log({
            "valid_loss": valid_loss,
            "epoch": epoch,
            "train_loss": train_loss,
        })

    # Mark the run as complete
    wandb.finish()

env: WANDB_SILENT=True


# Weight decay

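Weight decay is closely related to L2 regularization, and with plain SGD the two are equivalent (up to how the decay constant is scaled).  The goal is to shrink the weights towards 0, which lowers their L2 norm (the square root of the sum of the squared weights).

L1 regularization instead lowers the L1 norm (the sum of the absolute values of the weights).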
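
The difference between the two penalties is easiest to see in their gradients.  The sketch below uses a made-up weight matrix and a hypothetical `decay` value, and assumes the common convention of a factor of 0.5 on the L2 penalty so that its gradient is simply `decay * weights`:

```python
import numpy as np

weights = np.array([[0.5, -2.0], [0.0, 1.5]])
decay = 0.1  # hypothetical penalty strength

# L2: penalize the sum of squared weights
l2_penalty = 0.5 * decay * np.sum(weights ** 2)
l2_grad = decay * weights            # push is proportional to each weight

# L1: penalize the sum of absolute values
l1_penalty = decay * np.sum(np.abs(weights))
l1_grad = decay * np.sign(weights)   # same-sized push for every nonzero weight

print("L2 penalty and gradient:", l2_penalty, l2_grad, sep="\n")
print("L1 penalty and gradient:", l1_penalty, l1_grad, sep="\n")
```

Large weights are pushed harder under L2, while L1 pushes every nonzero weight equally, which is why L1 tends to drive small weights all the way to zero.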

In [3]:
class SGDW(Optimizer):
    def __init__(self, lr, decay):
        self.lr = lr
        self.decay = decay
        super().__init__()

    def __call__(self, layer_grads, layers, batch_size):
        # Loop through the layer grads.  Reverse the layers to match the grads (from output backward to input).
        for layer_grad, layer in zip(layer_grads, reversed(layers)):
            if layer_grad is None:
                # Account for dropout layers
                continue
            w_grad, b_grad = layer_grad

            # Normalize the weight gradient by batch size
            w_grad /= batch_size

            # Calculate the update sizes
            w_update = w_grad + self.decay * layer.weights
            w_update *= -self.lr
            # We don't usually decay the bias
            b_update = -self.lr * b_grad

            # Actually do the update
            layer.update(w_update, b_update)

        self.save_vector(layers)
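
To see what the decay term does on its own, here is a small standalone sketch of the same weight update arithmetic.  It assumes that `layer.update` adds the update to the parameters, and `sgdw_step` is a hypothetical helper, not part of the course code.  With a zero gradient, each step simply multiplies the weights by `1 - lr * decay`:

```python
import numpy as np

def sgdw_step(weights, w_grad, lr, decay):
    # Same arithmetic as the weight update above, as a standalone function
    return weights - lr * (w_grad + decay * weights)

weights = np.array([1.0, -4.0, 0.25])
zero_grad = np.zeros_like(weights)

# Exaggerated values so the shrinkage is easy to see
lr, decay = 0.1, 0.5
for step in range(3):
    weights = sgdw_step(weights, zero_grad, lr, decay)
    print(step, weights)
```

In a real run the loss gradient is nonzero, so the weights settle where the pull from the data balances the pull towards zero.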

In [4]:
layers = [
    Dense(7, 25),
    Dense(25, 10),
    Dense(10, 1, activation=False)
]
# A decay of 0 makes SGDW equivalent to plain SGD
sgd = SGDW(1e-4, 0)
# Normal SGD
training_run(10, "None", layers, sgd, train_data, valid_data, name="sgd")

In [5]:
layers = [
    Dense(7, 25),
    Dense(25, 10),
    Dense(10, 1, activation=False)
]

# SGD with a weight decay of 0.1
sgd = SGDW(1e-4, .1)
# Weight decay
training_run(10, "Weight Decay", layers, sgd, train_data, valid_data, name="sgdw")

# Dropout

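Dropout helps prevent overfitting by randomly setting activations to zero during training, so that the model can't rely on any single activation.  At inference time, no activations are dropped.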

In [12]:
class Dropout():
    def __init__(self, drop_p):
        self.drop_p = drop_p
        self.training = True

    def forward(self, input):
        if self.training:
            # Generate a mask of 0s and 1s
            self.mask = np.random.binomial(1, 1-self.drop_p, input.shape)
        else:
            # No dropout in inference
            self.mask = np.ones_like(input)
        # Apply the mask.  If the mask is 0, the input is set to 0
        return np.where(self.mask, input, 0)

    def backward(self, grad):
        # Use np.where to apply the mask
        return None, np.where(self.mask, grad, 0)
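
Many implementations use a variant called inverted dropout, which scales the kept activations by `1 / (1 - drop_p)` during training so that the expected activation is the same in training and inference.  The layer above does not rescale.  The class below is a hypothetical sketch of that variant, not the course's implementation; it keeps the same `forward`/`backward` interface, and the `seed` argument is added only to make the example reproducible.

```python
import numpy as np

class InvertedDropout:
    """Hypothetical dropout variant that rescales the kept activations."""
    def __init__(self, drop_p, seed=None):
        self.drop_p = drop_p
        self.training = True
        self.rng = np.random.default_rng(seed)

    def forward(self, input):
        if self.training:
            keep_p = 1 - self.drop_p
            # Kept positions hold 1 / keep_p, dropped positions hold 0
            self.mask = self.rng.binomial(1, keep_p, input.shape) / keep_p
        else:
            # No dropout in inference
            self.mask = np.ones_like(input)
        return input * self.mask

    def backward(self, grad):
        # Same (None, grad) return convention as the layer above
        return None, grad * self.mask

layer = InvertedDropout(0.5, seed=0)
x = np.ones((1000, 10))
print(layer.forward(x).mean())   # close to 1 on average
layer.training = False
print(layer.forward(x).mean())   # exactly 1: nothing is dropped at inference
```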

In [15]:
layers = [
    Dense(7, 25),
    Dense(25, 10),
    Dropout(.02),
    Dense(10, 1, activation=False)
]

sgd = SGDW(1e-4, .1)
# Weight decay and dropout
training_run(10, "Weight Decay + Dropout", layers, sgd, train_data, valid_data, name="dropout")

# Early Stopping

Early stopping can prevent overfitting by stopping training when the validation loss is plateauing or increasing.

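It's common to save checkpoints regularly and then choose the one with the lowest validation loss.  The run below takes the simplest approach and trains for a fixed 4 epochs instead of 10.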
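
A common way to automate this is a patience rule: stop once the validation loss has not improved for a set number of epochs, then go back to the best checkpoint.  The sketch below is one possible version, with a hypothetical function name and made-up validation losses:

```python
def early_stopping_epoch(valid_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            # New best: this is the checkpoint we would keep
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training here
            return best_epoch, epoch
    # Never triggered: training ran to the last epoch
    return best_epoch, len(valid_losses) - 1

# Made-up validation losses for illustration
losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9]
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=2)
print(best_epoch, stop_epoch)
```

In a real training loop, the same check would run after each epoch's validation pass, with the model parameters saved whenever a new best loss is found.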

In [17]:
layers = [
    Dense(7, 25),
    Dense(25, 10),
    Dropout(.05),
    Dense(10, 1, activation=False)
]

sgd = SGDW(1e-4, .1)
# Weight decay and dropout, trained for only 4 epochs
training_run(4, "Early Stopping", layers, sgd, train_data, valid_data, name="early_stopping")

# Training convergence

The techniques in this section are not strictly regularization, but they can help with overfitting.  They also help the model converge better.

In [None]:
# Improve generalization and convergence
# Not strictly regularization

# Layernorm

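Layer normalization normalizes the values in a layer: each row of the input is shifted to have a mean of 0 and scaled to have a standard deviation of 1.  This helps stabilize training.

A useful next step is to compare this implementation with PyTorch's layernorm.  You can do that using something like this - https://discuss.pytorch.org/t/how-to-call-the-backward-function-of-a-custom-module/7853/2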

In [38]:
class LayerNorm():
    def __init__(self, embed_dim, eps):
        self.embed_dim = embed_dim
        self.eps = eps

    def forward(self, input):
        # Cache for backward pass
        self.input = input
        # Calculate the mean and standard deviation
        self.mean = np.sum(input, axis=1, keepdims=True) / self.embed_dim
        self.normed = (input - self.mean)
        variance = np.sum(self.normed**2, axis=1, keepdims=True) / self.embed_dim
        self.std = np.sqrt(variance + self.eps)
        # Normalize the input
        return self.normed / self.std

    def backward(self, grad):
        # Find the derivative of numerator (normed)
        grad_normed_1 = grad * 1 / self.std

        # Derivative of denominator (std)
        grad_std = grad * self.normed
        # std is a single number per row, so sum the gradient across each row
        grad_std = np.sum(grad_std, axis=1, keepdims=True)
        # Derivative of 1 / std
        grad_std = grad_std * -1 / (self.std**2)

        # Find gradient against the variance
        grad_variance = grad_std * .5 * 1 / self.std

        # Find gradient against normed
        grad_normed_2 = grad_variance * 1 / self.embed_dim
        grad_normed_2 = np.ones_like(self.normed, dtype=self.input.dtype) * grad_normed_2
        grad_normed_2 = grad_normed_2 * 2 * self.normed

        # Combine two gradients against normed
        grad_normed = grad_normed_1 + grad_normed_2

        # Find gradient against mean
        grad_mean = grad_normed * -1
        grad_mean = np.sum(grad_mean, axis=1, keepdims=True)

        # Find gradient against input
        grad_input_1 = grad_normed
        grad_input_2 =  grad_mean * 1 / self.embed_dim
        grad_input_2 = grad_input_2 * np.ones_like(self.input, dtype=self.input.dtype)

        # Combine two gradients against input
        grad_input = grad_input_1 + grad_input_2
        return None, grad_input
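
One way to check a hand-written backward pass without any other library is to compare it against finite differences.  The sketch below is self-contained and uses hypothetical helper names.  It computes the layer norm gradient in closed form, `(g - mean(g) - y * mean(g * y)) / std`, which should agree with the step-by-step chain rule above, and compares it with a numerical gradient:

```python
import numpy as np

def layer_norm_forward(x, eps=1e-6):
    # Normalize each row to zero mean and unit standard deviation
    mean = x.mean(axis=1, keepdims=True)
    std = np.sqrt(x.var(axis=1, keepdims=True) + eps)
    return (x - mean) / std, std

def layer_norm_backward(grad, normed_out, std):
    # Closed-form gradient with respect to the input
    g_mean = grad.mean(axis=1, keepdims=True)
    gy_mean = (grad * normed_out).mean(axis=1, keepdims=True)
    return (grad - g_mean - normed_out * gy_mean) / std

def numerical_grad(x, grad, h=1e-5):
    # Central differences on the scalar sum(grad * layer_norm(x))
    num = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        bumped_up, bumped_down = x.copy(), x.copy()
        bumped_up[idx] += h
        bumped_down[idx] -= h
        up = np.sum(grad * layer_norm_forward(bumped_up)[0])
        down = np.sum(grad * layer_norm_forward(bumped_down)[0])
        num[idx] = (up - down) / (2 * h)
    return num

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
upstream = rng.normal(size=(4, 8))  # stand-in for the gradient from the next layer

out, std = layer_norm_forward(x)
analytic = layer_norm_backward(upstream, out, std)
numeric = numerical_grad(x, upstream)
print("max difference:", np.max(np.abs(analytic - numeric)))
```

The same `numerical_grad` idea can be pointed at any layer's `forward` method to sanity-check its `backward`.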

In [50]:
layers = [
    Dense(7, 25),
    LayerNorm(25, 1e-6),
    Dense(25, 10),
    Dense(10, 1, activation=False)
]

sgd = SGDW(5e-4, .1)
# Weight decay and layer norm
training_run(10, "Layer Norm", layers, sgd, train_data, valid_data, name="layer_norm")

In [None]:
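# LayerNorm requires an eps argument.  Note that input_embed and data are not defined in this lesson.
layer_norm = LayerNorm(512, 1e-6)
layer_norm.forward(input_embed.forward(data[0]["en"]))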

In [None]:
## Residual connections