# Modules and MLPs

We've seen how the internals of a simple linear classifier work. However, we still had to set up a lot of things manually. It's much better to have a higher-level API that encapsulates the classifier.

PyTorch provides exactly that with its Module objects. They will also let us build more complex models, such as a multilayer perceptron.

We begin by loading the data again:

In [None]:
import torch
import numpy as np
import mnist
from matplotlib import pyplot as pl

train_x = mnist.train_images()
train_y = mnist.train_labels()
test_x = mnist.test_images()
test_y = mnist.test_labels()

num_features = 28 * 28
num_classes = len(np.unique(train_y))
new_shape = [-1, num_features]
train_x_vectors = train_x.reshape(new_shape)
test_x_vectors = test_x.reshape(new_shape)

# shorten the names
train_x = train_x_vectors / 255
test_x = test_x_vectors / 255

#### Sequential

Let's create a model similar to the one in the previous notebook, but now with a more organized structure.

In [None]:
linear_layer = torch.nn.Linear(num_features, num_classes)
linear_model = torch.nn.Sequential(linear_layer)
loss_function = torch.nn.CrossEntropyLoss()

The model can be called as a function to compute an output. Let's see how it works:

In [None]:
batch_size = 8
batch = torch.tensor(train_x[:batch_size], dtype=torch.float)
labels = torch.tensor(train_y[:batch_size], dtype=torch.long)

answers = linear_model(batch)
answers
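
To make the shape and meaning of that output concrete, here is a minimal sketch. It uses random stand-in data instead of MNIST (an assumption, so that the snippet runs on its own) and a model with the same sizes as above. Each row of the output holds one raw score (logit) per class.

```python
import torch

# Same sizes as in the notebook: 784 input features, 10 classes, batch of 8.
num_features, num_classes, batch_size = 28 * 28, 10, 8

model = torch.nn.Sequential(torch.nn.Linear(num_features, num_classes))
loss_function = torch.nn.CrossEntropyLoss()

# Random stand-in data; in the notebook these would be `batch` and `labels`.
batch = torch.rand(batch_size, num_features)
labels = torch.randint(0, num_classes, (batch_size,))

logits = model(batch)                  # shape: (batch_size, num_classes)
probs = torch.softmax(logits, dim=1)   # each row now sums to 1
predictions = logits.argmax(dim=1)     # most likely class for each example
loss = loss_function(logits, labels)   # scalar; takes raw logits, not probabilities

print(logits.shape, predictions.shape, loss.item())
```

Note that `CrossEntropyLoss` applies the softmax internally, which is why the model itself ends in a plain linear layer.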

#### Optimizer

The answers and the loss are computed in pretty much the same way as in our last notebook. Now let's define an optimizer that will take care of updating the weights for us.

In [None]:
learning_rate = 1e-2

# the optimizer needs to be told which are the parameters to optimize
optimizer = torch.optim.SGD(linear_model.parameters(), lr=learning_rate)
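
For intuition about what the optimizer does, the sketch below compares one `optimizer.step()` of plain SGD (no momentum, no weight decay) with a manual update `p -= lr * p.grad`. The tiny layer sizes and the random data are assumptions chosen only to keep the example small.

```python
import copy
import torch

torch.manual_seed(0)

# Two identical copies of a tiny model (hypothetical sizes: 4 features, 3 classes).
model_a = torch.nn.Linear(4, 3)
model_b = copy.deepcopy(model_a)

lr = 1e-2
optimizer = torch.optim.SGD(model_a.parameters(), lr=lr)
loss_function = torch.nn.CrossEntropyLoss()

x = torch.rand(8, 4)
y = torch.randint(0, 3, (8,))

# Update model_a with the optimizer.
optimizer.zero_grad()
loss_function(model_a(x), y).backward()
optimizer.step()

# Update model_b by hand: move each parameter against its gradient.
model_b.zero_grad()
loss_function(model_b(x), y).backward()
with torch.no_grad():
    for p in model_b.parameters():
        p -= lr * p.grad

# Both models should now hold (numerically) the same parameters.
same = all(torch.allclose(pa, pb)
           for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print(same)
```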

#### Training loop

Now we write the main training loop. This is the basic skeleton for training PyTorch models.

In [None]:
def train_model(model, train_x, train_y, num_epochs, batch_size, optimizer):
    losses = []

    for epoch in range(num_epochs):
        print('Starting epoch %d' % epoch)
        batch_index = 0
        total_loss = 0

        while batch_index < len(train_x):
            # get the data for this batch
            next_index = batch_index + batch_size
            batch_x = torch.tensor(train_x[batch_index:next_index], dtype=torch.float)
            batch_y = torch.tensor(train_y[batch_index:next_index], dtype=torch.long)
            batch_index = next_index

            # forward pass
            logits = model(batch_x)

            # compute the loss
            loss = loss_function(logits, batch_y)
            loss_value = loss.item()
            total_loss += loss_value
            losses.append(loss_value)

            # important: zero the gradients before computing them again
            model.zero_grad()
            loss.backward()

            # after computing the gradients, take a step in the opposite
            # direction (gradient descent)
            optimizer.step()

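        # each loss value is already a mean over its batch, so average
        # over the number of batches, not the number of examples
        num_batches = int(np.ceil(len(train_x) / batch_size))
        avg_loss = total_loss / num_batches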
        print('Epoch loss: %.4f' % avg_loss)
    
    return np.array(losses)

In [None]:
losses = train_model(linear_model, train_x, train_y, 5, 8, optimizer)

Knowing the loss decreases is good, but in classification problems, we usually want to know other metrics such as accuracy or F1.

**Exercise:** Include accuracy report!
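
As a possible starting point for the exercise, here is a small helper that computes accuracy from a batch of logits. The name `batch_accuracy` and the toy tensors are hypothetical. Wiring it into the training loop is left to you.

```python
import torch

def batch_accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    predictions = logits.argmax(dim=1)
    return (predictions == labels).float().mean().item()

# Toy example: 4 examples, 3 classes.
logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 0.1, 3.0],
                       [0.5, 1.5, 0.1],
                       [1.0, 0.2, 0.4]])
labels = torch.tensor([0, 2, 0, 0])

# Predicted classes are 0, 2, 1, 0, so three out of four are correct.
print(batch_accuracy(logits, labels))
```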

Plots are a good way to understand the performance of a model. Let's plot the loss curve by batch:

In [None]:
%matplotlib inline
pl.rcParams['figure.figsize'] = [10, 5]
pl.plot(losses)

That might be too dense, although we can see that the loss doesn't decrease smoothly. Let's downsample the array by keeping only every 10th value, drop the connecting lines, and try again.

In [None]:
pl.plot(losses[::10], 'b.')

Now it is easier to see that the bulk of the batches have a lower loss. Interestingly, some patterns of hard-to-classify examples repeat every epoch, because the batches are visited in the same order each time.

## Multilayer Perceptron

We can now proceed to a more sophisticated classifier: a multilayer perceptron. Let's build one using the Sequential API.

In [None]:
hidden_size = 200
learning_rate = 1e-2

linear_layer1 = torch.nn.Linear(num_features, hidden_size)
linear_layer2 = torch.nn.Linear(hidden_size, num_classes)
mlp = torch.nn.Sequential(linear_layer1, 
                          torch.nn.ReLU(), 
                          linear_layer2)

optimizer = torch.optim.SGD(mlp.parameters(), lr=learning_rate)
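
To get a feel for how much larger the MLP is than the linear model, one can count the trainable parameters of each. The helper `count_parameters` below is a hypothetical name. The models are rebuilt here with the same sizes as above so that the snippet runs on its own.

```python
import torch

num_features, num_classes, hidden_size = 28 * 28, 10, 200

linear_model = torch.nn.Sequential(torch.nn.Linear(num_features, num_classes))
mlp = torch.nn.Sequential(
    torch.nn.Linear(num_features, hidden_size),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden_size, num_classes),
)

def count_parameters(model):
    """Total number of scalar values across all weights and biases."""
    return sum(p.numel() for p in model.parameters())

linear_count = count_parameters(linear_model)  # 784*10 + 10
mlp_count = count_parameters(mlp)              # 784*200 + 200 + 200*10 + 10
print(linear_count, mlp_count)
```

With these sizes the MLP has roughly twenty times as many parameters as the linear model, so each forward and backward pass does more work.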

Now let's train the model. How do the loss and accuracy compare with the linear model?

You will probably also notice a difference in running time!

In [None]:
losses = train_model(mlp, train_x, train_y, 5, 8, optimizer)

Notice how the concentration of dots differs between the MLP and linear plots!

In [None]:
pl.plot(losses[::10], 'b.')

### Validation data

Evaluating the performance on training data is important to understand if the model is actually learning, but if we want to know if our model has any usefulness, we should evaluate its performance on validation or test data.



In [None]:
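def evaluate_model(model, test_x, test_y):
    test_x = torch.tensor(test_x, dtype=torch.float)
    test_y = torch.tensor(test_y, dtype=torch.long)
    loss_function = torch.nn.CrossEntropyLoss()
    # no gradients are needed for evaluation, so skip tracking them
    with torch.no_grad():
        logits = model(test_x)
        loss = loss_function(logits, test_y)
    # return a plain Python float instead of a tensor
    return loss.item()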

In [None]:
evaluate_model(mlp, test_x, test_y)

In [None]:
evaluate_model(linear_model, test_x, test_y)

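If the validation loss is way higher than the training loss, that's plain overfitting. Make sure both values are averaged the same way (mean loss per example) before comparing them.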

How can we remedy that? There are two things to be done:

1. Regularize, i.e., add some kind of penalty to the model that encourages it to find a more general solution. Examples: L2-norm weight regularization, dropout.
1. Stop early! Evaluate the model on validation data after each epoch or every few batches, and only save it when validation performance improves.
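
Both ideas can be sketched together. The example below is a rough illustration under stated assumptions, not a tuned recipe: the data is random (hypothetical sizes of 20 features and 3 classes), and the dropout rate, `weight_decay` value and `patience` are arbitrary. It uses `torch.nn.Dropout` and the `weight_decay` argument of `torch.optim.SGD` for regularization, and keeps a copy of the weights from the epoch with the lowest validation loss.

```python
import copy
import torch

torch.manual_seed(0)

# Random stand-in data (an assumption, so that the snippet runs on its own).
train_x, train_y = torch.randn(256, 20), torch.randint(0, 3, (256,))
valid_x, valid_y = torch.randn(64, 20), torch.randint(0, 3, (64,))

model = torch.nn.Sequential(
    torch.nn.Linear(20, 50),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5),   # dropout regularization
    torch.nn.Linear(50, 3),
)
loss_function = torch.nn.CrossEntropyLoss()
# weight_decay adds an L2 penalty on the parameters
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

best_loss = float('inf')
best_state = None
validation_losses = []
patience = 3                   # epochs to wait for an improvement
epochs_without_improvement = 0

for epoch in range(10):
    model.train()              # enables dropout
    for start in range(0, len(train_x), 32):
        batch_x = train_x[start:start + 32]
        batch_y = train_y[start:start + 32]
        optimizer.zero_grad()
        loss_function(model(batch_x), batch_y).backward()
        optimizer.step()

    model.eval()               # disables dropout for evaluation
    with torch.no_grad():
        valid_loss = loss_function(model(valid_x), valid_y).item()
    validation_losses.append(valid_loss)

    if valid_loss < best_loss:
        # validation improved: remember these weights
        best_loss = valid_loss
        best_state = copy.deepcopy(model.state_dict())
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break              # stop early

# Restore the best weights seen during training.
model.load_state_dict(best_state)
with torch.no_grad():
    final_loss = loss_function(model(valid_x), valid_y).item()
print(best_loss, final_loss)
```

Calling `model.train()` and `model.eval()` matters here because dropout behaves differently in the two modes.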