
# Practical Training Dynamics

This notebook sets up a tiny training problem to demonstrate many of the techniques that are important in getting neural network training to work well in practical situations.

It shows you how to
* Run an ordinary neural network training loop.
* Monitor and plot performance on held-out data, to understand and fix lack of convergence or overfitting.
* Examine and plot gradient distributions, to understand and fix vanishing or exploding gradients.
* Experiment with different nonlinearities, optimizers, losses, regularizers, and neural network architectures.

In [None]:
import numpy, torch
train_data, train_labels, test_data, test_labels = [
    torch.tensor(m[k]).float()
    for m in [numpy.load('tiny-classification.npz')]
    for k in 'train_data train_labels val_data val_labels'.split()]

print(f'The training data has {train_data.size(0)} samples, each a vector of {train_data.size(1)} numbers along with')
print(f'a corresponding set of {train_labels.size(0)} labels, assigning {train_labels.min()} or {train_labels.max()} to each sample.')

print(f'The test data has {test_data.size(0)} samples and labels that are disjoint from the training data.')


The training data has 8000 samples, each a vector of 36 numbers along with
a corresponding set of 8000 labels, assigning 0.0 or 1.0 to each sample.
The test data has 1000 samples and labels that are disjoint from the training data.


## Download training data.

In [None]:
%%capture
!wget -r https://ds4440.baulab.info/data/tiny-classification.npz
!wget -r https://ds4440.baulab.info/data/hard-classification.npz
!pip install git+https://github.com/davidbau/baukit

## Define a "Supervise" module

It evaluates a loss over a network given pairs of input and target output data.

The code should look like this.  It just holds on to a network and a loss function, which it calls the "criterion".

Then when it is given both input and output, it runs the network on the input; then it checks the "criterion" to compare the output to the target output data.

In [None]:
from torch import nn

class Supervise(nn.Module):
    def __init__(self, criterion, net):
        super().__init__()
        self.net = net
        self.criterion = criterion
    def forward(self, x, y):
        out = self.net(x).squeeze()
        return self.criterion(out, y)
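
As a quick sanity check, here is a minimal sketch of calling `Supervise` on toy data. The class is repeated from the cell above so the snippet runs on its own; the tiny `nn.Linear(3, 1)` network and the all-zeros inputs are assumptions chosen so the result can be worked out by hand.

```python
import torch
from torch import nn

class Supervise(nn.Module):
    # Same module as in the notebook cell above.
    def __init__(self, criterion, net):
        super().__init__()
        self.net = net
        self.criterion = criterion
    def forward(self, x, y):
        out = self.net(x).squeeze()
        return self.criterion(out, y)

# Toy setup (illustrative only): 4 samples with 3 features each.
model = Supervise(nn.MSELoss(), nn.Linear(3, 1))
nn.init.zeros_(model.net.bias)   # with zero inputs, the output is just the bias

x = torch.zeros(4, 3)            # inputs
y = torch.ones(4)                # targets, shape (4,)

# The network output has shape (4, 1); squeeze() makes it (4,) to match y.
loss = model(x, y)               # mean((0 - 1)^2) over the 4 samples
loss.backward()                  # fills in .grad for the wrapped network

print(loss.item(), model.net.bias.grad)
```

Because the module returns the loss directly, the training loop only needs to call `backward()` on its output.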

## Create a supervised neural network

This one has a few linear layers, each followed by a Sigmoid.

In [None]:
from collections import OrderedDict

input_size = 36
hidden_dims = 64
output_dims = 1

net = Supervise(
        nn.MSELoss(),
        nn.Sequential(OrderedDict([
            ('layer1', nn.Linear(input_size, hidden_dims)),
            ('sigma1', nn.Sigmoid()),
            ('layer2', nn.Linear(hidden_dims, hidden_dims)),
            ('sigma2', nn.Sigmoid()),
            ('layer3', nn.Linear(hidden_dims, hidden_dims)),
            ('sigma3', nn.Sigmoid()),
            ('layer4', nn.Linear(hidden_dims, hidden_dims)),
            ('sigma4', nn.Sigmoid()),
            ('layer5', nn.Linear(hidden_dims, output_dims)),
            ('sigma5', nn.Sigmoid())
        ]))
    )

## Train the network

Here is a typical training loop.

It is very minimal right now.  Let's add some visualization code to it.

In [None]:
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
num_iterations = 200
test_every = 10

for epoch in range(num_iterations):
    loss = net(train_data.float(), train_labels.float())
    net.zero_grad()
    loss.backward()
    optimizer.step()  # Update model parameters using the optimizer's update rule


## Adding visualization code

We will examine the basic training loop above together.

What it's missing is any code that would let us debug how well (or badly) our training is going.

Add the following visualization code:

### Tracking Performance in Arrays

Before the training loop:
```
train_losses, train_accs, test_accs = [], [], []
```


Inside the training loop, record each training loss.
```
    train_losses.append([epoch, loss.item()])
```


### Testing performance on both training and held-out test data

Then add some testing code.

Why is it important to test on different data?
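
One way to see why is a toy sketch (not part of the original notebook): a "model" that simply memorizes its training set. The data here is random noise with random labels, an assumption made so that there is nothing real to learn. A nearest-neighbor lookup then scores perfectly on the data it memorized and roughly at chance on new data.

```python
import torch

torch.manual_seed(0)

# Random inputs with random 0/1 labels: there is no real pattern to learn.
train_x = torch.randn(200, 36)
train_y = torch.randint(0, 2, (200,)).float()
test_x = torch.randn(200, 36)
test_y = torch.randint(0, 2, (200,)).float()

def predict(x):
    # 1-nearest-neighbor "memorizer": copy the label of the closest training point.
    nearest = torch.cdist(x, train_x).argmin(dim=1)
    return train_y[nearest]

train_acc = (predict(train_x) == train_y).float().mean().item()
test_acc = (predict(test_x) == test_y).float().mean().item()

print(f"Train accuracy: {train_acc:.3f}, Test accuracy: {test_acc:.3f}")
```

Training accuracy alone cannot distinguish memorization from learning; held-out accuracy can.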

```
    if epoch % test_every == test_every - 1:
        grads = torch.stack([p.grad.abs().max() for p in net.parameters()])
        maxg, ming = grads.abs().max(), grads.abs().min()
        net.eval()
        train_outputs = net.net(train_data.float())
        net.train()
        train_preds = (train_outputs.squeeze() > 0.5).float()
        train_accuracy = (train_preds == train_labels).float().mean()
        train_accs.append([epoch + 1, train_accuracy])
        test_outputs = net.net(test_data.float())
        test_preds = (test_outputs.squeeze() > 0.5).float()
        test_accuracy = (test_preds == test_labels).float().mean()
        test_accs.append([epoch + 1, test_accuracy])
        print(
            f"Epoch {epoch+1}, Loss: {loss.item():.5f}, Grad range {maxg:.1e} to {ming:.1e}, "
            f"Train Accuracy: {train_accuracy.item()}, Test Accuracy: {test_accuracy.item()}")
        if test_accuracy.item() == 1.0:
            break
```

### Adding some graphing code to visualize

After the training loop, add this.

```
# Test the Model
with torch.no_grad():
    train_outputs = net.net(train_data.float())
    train_preds = (train_outputs.squeeze() > 0.5).float()
    train_accuracy = (train_preds == train_labels).float().mean()
    test_outputs = net.net(test_data.float())
    test_preds = (test_outputs.squeeze() > 0.5).float()
    test_accuracy = (test_preds == test_labels).float().mean()
    print(
        f"\nTrain Accuracy: {train_accuracy.item():.5f}, Test Accuracy: {test_accuracy.item():.5f}"
    )

# Visualization
from matplotlib import pyplot as plt  # needed if pyplot has not been imported yet
fig, ax = plt.subplots()
ax2 = ax.twinx()
ax.plot(*zip(*train_losses), label="Training loss")
ax.set_yscale("log")
ax2.plot(*zip(*train_accs), color="orange", label="Training accuracy")
ax2.plot(*zip(*test_accs), color="red", label="Test accuracy")
ax2.set_ylim(0.0, 1.0)
for a in [ax, ax2]:
    for pos in "top right bottom left".split():
        a.spines[pos].set_visible(False)
ax.set_xlabel("Epochs")
ax.set_ylabel("Loss")
ax2.set_ylabel("Accuracy")
fig.legend(loc="lower left", bbox_to_anchor=(0, 0), bbox_transform=ax.transAxes)
fig.show()
```

The code above is similar to the homework setup.

## Plotting gradients

A common cause of "frozen training" is that derivatives go to near zero.
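
To make this concrete, here is a small sketch (a toy chain of sigmoids with no weights, not the notebook's network). The derivative of a sigmoid is at most 0.25, so by the chain rule the gradient through `depth` stacked sigmoids is at most `0.25 ** depth`.

```python
import torch

depth = 5                                  # number of stacked sigmoids (illustrative)
x = torch.tensor(0.0, requires_grad=True)  # scalar input

y = x
for _ in range(depth):
    y = torch.sigmoid(y)                   # each step multiplies the gradient by <= 0.25

y.backward()
grad = x.grad.item()
bound = 0.25 ** depth

print(f"d(output)/d(input) = {grad:.2e}, upper bound 0.25^{depth} = {bound:.2e}")
```

In a real network the weight matrices also scale the gradient, so the shrinkage can be milder or worse than this bound, but the multiplicative effect of depth is the same.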

Let's plot the derivatives to see what they look like for our network.

Here I am using a package called `baukit`, which has a `TraceDict` utility that keeps copies of all the activations while we run the network.

In [None]:
from baukit import TraceDict

with TraceDict(net, [n for n, _ in net.named_modules()], retain_grad=True) as trace:
    loss = net(train_data[0], train_labels[0])
net.zero_grad()
loss.backward()
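
For intuition about what such a trace does, here is a rough sketch using plain PyTorch forward hooks. This is not how `baukit` is implemented; it only illustrates the idea of saving each module's output and asking autograd to keep its gradient. The small network and the `captured` dictionary are assumptions made for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_net = nn.Sequential(nn.Linear(3, 4), nn.Sigmoid(), nn.Linear(4, 1))

captured = {}  # module name -> output tensor saved during the forward pass

def make_hook(name):
    def hook(module, inputs, output):
        output.retain_grad()     # keep .grad on this non-leaf tensor after backward()
        captured[name] = output
    return hook

# Attach a hook to every submodule (skip the root, whose name is '').
handles = [m.register_forward_hook(make_hook(n))
           for n, m in toy_net.named_modules() if n]

out = toy_net(torch.randn(2, 3))
out.sum().backward()

for h in handles:
    h.remove()                   # always detach hooks when done

for name, act in captured.items():
    print(name, tuple(act.shape), tuple(act.grad.shape))
```

After the backward pass, each saved output carries a `.grad` of the same shape, which is what the histograms later in this notebook are plotting.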

The code below is a nice histogram plotting utility.

In [None]:
%matplotlib inline
%config InlineBackend.figure_format='retina'

from matplotlib import pyplot as plt
def plot_histograms(title, datalist):
    fig, axes = plt.subplots(len(datalist), 1, figsize=(10, 1.5 * len(datalist)), sharex=True)
    fig.suptitle(title)
    for i, (name, data) in enumerate(datalist):
        axes[i].hist(data.flatten().detach().numpy(), bins=100)
        axes[i].set_title(name)
    fig.tight_layout()
    fig.show()

## Plotting parameter gradients



In [None]:
datalist = [(n, p.grad)
    for n, p in net.named_parameters() if 'weight' in n]
plot_histograms('Parameter gradients', datalist)

## Plotting neuron activations and their gradients

Let's discuss exactly where these derivatives are in the backpropagation.

In [None]:
datalist = [(n, trace[n].output)
    for n, p in net.named_modules() if 'layer' in n]
plot_histograms('Activations', datalist[:-1])


datalist = [(n, trace[n].output.grad)
    for n, p in net.named_modules() if 'layer' in n]
plot_histograms('Activation gradients', datalist[:-1])
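
As a starting point for that discussion, here is a minimal sketch of how the two kinds of gradient relate for a single linear layer. The traced `layer` outputs above are the Linear outputs, that is, the values fed into each Sigmoid. For `h = x W^T + b`, the chain rule gives `dL/dW = (dL/dh)^T x` and `dL/db = dL/dh`. The two-layer toy network and the squared-error loss are assumptions made for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer1 = nn.Linear(3, 4)
layer2 = nn.Linear(4, 1)

x = torch.randn(1, 3)          # a single sample
target = torch.tensor(1.0)

h = layer1(x)                  # Linear output (pre-activation), shape (1, 4)
h.retain_grad()                # keep dL/dh after the backward pass
a = torch.sigmoid(h)
out = layer2(a).squeeze()
loss = (out - target) ** 2
loss.backward()

# The parameter gradient is the activation gradient combined with the layer input.
weight_grad_from_h = h.grad.t() @ x      # shape (4, 3), same as layer1.weight
bias_grad_from_h = h.grad.squeeze(0)     # shape (4,), same as layer1.bias

print(torch.allclose(layer1.weight.grad, weight_grad_from_h))
print(torch.allclose(layer1.bias.grad, bias_grad_from_h))
```

So if the activation gradients at a layer are near zero, the parameter gradients for that layer are near zero too, and those weights barely move.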

## Make an experimental harness

Now we will experiment with all these things by making a little test harness that does all the above on a network that we choose.

In [None]:
def run_test(net, opt=lambda x: torch.optim.SGD(x, lr=0.01), num_iterations=200, test_every=10):
    # Set up the Loss Function and Optimizer
    optimizer = opt(
        net.parameters()
    )  # Initialize the optimizer with model parameters
    print(f"{sum([p.numel() for p in net.parameters()])} parameters")
    train_losses, train_accs, test_accs = [], [], []

    for epoch in range(num_iterations):
        loss = net(train_data.float(), train_labels.float())
        loss.backward()
        train_losses.append([epoch, loss.item()])
        optimizer.step()  # Update model parameters using the optimizer's update rule
        if epoch % test_every == test_every - 1:
            grads = torch.stack([p.grad.abs().max() for p in net.parameters()])
            maxg, ming = grads.abs().max(), grads.abs().min()
            net.eval()
            train_outputs = net.net(train_data.float())
            net.train()
            train_preds = (train_outputs.squeeze() > 0.5).float()
            train_accuracy = (train_preds == train_labels).float().mean()
            train_accs.append([epoch + 1, train_accuracy])
            test_outputs = net.net(test_data.float())
            test_preds = (test_outputs.squeeze() > 0.5).float()
            test_accuracy = (test_preds == test_labels).float().mean()
            test_accs.append([epoch + 1, test_accuracy])
            print(
                f"Epoch {epoch+1}, Loss: {loss.item():.5f}, Grad range {maxg:.1e} to {ming:.1e}, "
                f"Train Accuracy: {train_accuracy.item()}, Test Accuracy: {test_accuracy.item()}",
                end="   \r",
            )
            if test_accuracy.item() == 1.0:
                break
        optimizer.zero_grad()

    # Test the Model
    with torch.no_grad():
        train_outputs = net.net(train_data.float())
        train_preds = (train_outputs.squeeze() > 0.5).float()
        train_accuracy = (train_preds == train_labels).float().mean()
        test_outputs = net.net(test_data.float())
        test_preds = (test_outputs.squeeze() > 0.5).float()
        test_accuracy = (test_preds == test_labels).float().mean()
        print(
            f"\nTrain Accuracy: {train_accuracy.item():.5f}, Test Accuracy: {test_accuracy.item():.5f}"
        )

    # Visualization
    if len(train_losses) > 0:
        fig, ax = plt.subplots()
        ax2 = ax.twinx()
        ax.plot(*zip(*train_losses), label="Training loss")
        ax.set_yscale("log")
        ax2.plot(*zip(*train_accs), color="orange", label="Training accuracy")
        ax2.plot(*zip(*test_accs), color="red", label="Test accuracy")
        ax2.set_ylim(0.0, 1.0)
        for a in [ax, ax2]:
            for pos in "top right bottom left".split():
                a.spines[pos].set_visible(False)
        ax.set_xlabel("Epochs")
        ax.set_ylabel("Loss")
        ax2.set_ylabel("Accuracy")
        fig.legend(loc="lower left", bbox_to_anchor=(0, 0), bbox_transform=ax.transAxes)
        fig.show()

    # One last pass, for plotting gradients
    with TraceDict(net, [n for n, _ in net.named_modules() if 'layer' in n],
                   retain_grad=True) as trace:
        loss = net(train_data[0], train_labels[0])
        net.zero_grad()
        loss.backward()

    plot_histograms('Parameter gradients', [(n, p.grad)
        for n, p in net.named_parameters() if 'weight' in n])

    plot_histograms('Activations', [(n, trace[n].output)
        for n, p in net.named_modules() if 'layer' in n])

    plot_histograms('Activation gradients', [(n, trace[n].output.grad)
        for n, p in net.named_modules() if 'layer' in n])

In [None]:
run_test(Supervise(
        nn.MSELoss(),
        nn.Sequential(OrderedDict([
            ('layer1', nn.Linear(input_size, hidden_dims)),
            ('sigma1', nn.Sigmoid()),
            ('layer2', nn.Linear(hidden_dims, hidden_dims)),
            ('sigma2', nn.Sigmoid()),
            ('layer3', nn.Linear(hidden_dims, hidden_dims)),
            ('sigma3', nn.Sigmoid()),
            ('layer4', nn.Linear(hidden_dims, hidden_dims)),
            ('sigma4', nn.Sigmoid()),
            ('layer5', nn.Linear(hidden_dims, output_dims)),
            ('sigma5', nn.Sigmoid())
        ]))
    )
)

## Now: experiment.

Depending on time:

* Try Tanh vs Sigmoid vs ReLU nonlinearities
* Try Cross-Entropy instead of MSE loss
* Try different initializations
* Try different learning rates
* Try Adam instead of SGD
* Try adding Weight Decay to reduce overfitting
* Try Residuals and Batchnorms
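
To make a few of these experiments easier to set up, here is one possible sketch. `make_mlp` and `adam_opt` are hypothetical helpers, not part of the notebook; the learning rate and weight decay values are arbitrary starting points, not recommendations. The helper keeps the notebook's `layerN` / `sigmaN` naming so that the `'layer' in n` filters in `run_test` should still pick up the linear layers.

```python
from collections import OrderedDict

import torch
from torch import nn

def make_mlp(act=nn.Sigmoid, input_size=36, hidden_dims=64, output_dims=1, depth=5):
    """Hypothetical helper: the notebook's architecture with a swappable hidden nonlinearity."""
    dims = [input_size] + [hidden_dims] * (depth - 1) + [output_dims]
    layers = []
    for i in range(depth):
        layers.append((f'layer{i + 1}', nn.Linear(dims[i], dims[i + 1])))
        # Keep a final Sigmoid so outputs stay in (0, 1) for the 0.5 threshold.
        is_last = (i == depth - 1)
        layers.append((f'sigma{i + 1}', nn.Sigmoid() if is_last else act()))
    return nn.Sequential(OrderedDict(layers))

def adam_opt(params):
    # Adam with weight decay; both values are assumptions to tune.
    return torch.optim.Adam(params, lr=1e-3, weight_decay=1e-4)

relu_net = make_mlp(nn.ReLU)
num_params = sum(p.numel() for p in relu_net.parameters())
optimizer = adam_opt(relu_net.parameters())
probs = relu_net(torch.zeros(2, 36))   # forward pass on dummy inputs

print(f"{num_params} parameters, output shape {tuple(probs.shape)}")

# In the notebook, with Supervise and run_test already defined, something like:
# run_test(Supervise(nn.BCELoss(), make_mlp(nn.ReLU)), opt=adam_opt)
```

`nn.BCELoss` expects probabilities in (0, 1), which is why the final Sigmoid is kept even when the hidden nonlinearity changes.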