# Deep Learning (Continued)

In [1]:
# When you import it, you may want to comment out the slow neural network training code in the Python script file
from Week12R import *


## Dropout

Like most machine learning models, neural networks are prone to overfitting to their
training data. We’ve previously seen ways to mitigate this; for example, in regression we used regularization, penalizing large weights, to help prevent overfitting.

A common way of regularizing neural networks is using *dropout*. At training time, we
randomly turn off each neuron (that is, replace its output with 0) with some fixed
probability. This means that the network can’t learn to depend on any individual neuron,
which seems to help with overfitting.

At evaluation time, we don’t want to drop out any neurons, so a `Dropout` layer will
need to know whether it’s training or not. In addition, at training time a `Dropout`
layer only passes on some random fraction of its input (on average, `1 - p` of it).
To make its output comparable during evaluation, we’ll scale down the outputs
(uniformly) using that same fraction:

In [None]:
class Dropout(Layer):
    def __init__(self, p: float) -> None:
        self.p = p
        self.train = True

    def forward(self, input: Tensor) -> Tensor:
        if self.train:
            # Create a mask of 0s and 1s shaped like the input
            # using the specified probability.
            self.mask = tensor_apply(
                lambda _: 0 if random.random() < self.p else 1,
                input)
            # Multiply by the mask to dropout inputs.
            return tensor_combine(operator.mul, input, self.mask)
        else:
            # During evaluation just scale down the outputs uniformly.
            return tensor_apply(lambda x: x * (1 - self.p), input)

    def backward(self, gradient: Tensor) -> Tensor:
        if self.train:
            # Only propagate the gradients where mask == 1
            return tensor_combine(operator.mul, gradient, self.mask)
        else:
            raise RuntimeError("don't call backward when not in train mode")
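
To see why the evaluation-time scaling makes the two modes comparable, here is a minimal self-contained sketch that uses plain lists instead of our `Tensor` helpers. The function names `dropout_train` and `dropout_eval` are hypothetical stand-ins for the two branches of `forward`; the point is only that the average output is roughly the same in both modes:

```python
import random
from typing import List

def dropout_train(values: List[float], p: float) -> List[float]:
    """Training mode: zero each value independently with probability p."""
    return [0.0 if random.random() < p else v for v in values]

def dropout_eval(values: List[float], p: float) -> List[float]:
    """Evaluation mode: keep every value but scale it by (1 - p)."""
    return [v * (1 - p) for v in values]

random.seed(0)
p = 0.25
values = [1.0] * 10_000

# Average output in each mode; both should be close to 1 - p.
train_mean = sum(dropout_train(values, p)) / len(values)
eval_mean = sum(dropout_eval(values, p)) / len(values)
print(train_mean, eval_mean)
```

With enough inputs, the training-mode mean lands close to `1 - p`, which is what evaluation mode produces for every input.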

## Example: MNIST

[MNIST](http://yann.lecun.com/exdb/mnist/) is a dataset of handwritten digits that everyone uses to learn deep learning. It is available in a somewhat tricky binary format, so we’ll install the `mnist` library to work with it (note that this package does *not* come pre-installed with Anaconda).

In [None]:
!python -m pip install mnist

And then we can load the data:

In [None]:
import mnist

# The data will be downloaded here; change this to wherever you want it.
# You may need to create this directory manually if it doesn't already exist.
mnist.temporary_dir = lambda: './data'

In [None]:
# Each of these functions first downloads the data and returns a numpy array.
# We call .tolist() because our "tensors" are just lists.
train_images = mnist.train_images().tolist()
train_labels = mnist.train_labels().tolist()

print(shape(train_images)) # [60000, 28, 28]
print(shape(train_labels)) # [60000]

Let’s plot the first 100 training images to see what they look like:

In [None]:
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (12,12)

fig, ax = plt.subplots(10, 10)

for i in range(10):
    for j in range(10):
        # Plot each image in black and white and hide the axes.
        ax[i][j].imshow(train_images[10 * i + j], cmap='Greys')
        ax[i][j].xaxis.set_visible(False)
        ax[i][j].yaxis.set_visible(False)

plt.show()

You can see that indeed they look like handwritten digits.

We also need to load the test images:

In [None]:
# Load the MNIST test data

test_images = mnist.test_images().tolist()
test_labels = mnist.test_labels().tolist()

print(shape(test_images)) # [10000, 28, 28]
print(shape(test_labels)) # [10000]

Each image is 28 × 28 pixels, but our linear layers can only deal with one-dimensional
inputs, so we’ll just flatten them (and also divide by 256 to get them between 0 and 1).

In addition, our neural net will train better if our inputs are 0 on average, so we’ll
subtract out the average value:

In [None]:
# Recenter the images

# Compute the average pixel value
avg = tensor_sum(train_images) / 60000 / 28 / 28

# Recenter, rescale, and flatten
train_images = [[(pixel - avg) / 256 for row in image for pixel in row]
                for image in train_images]
test_images = [[(pixel - avg) / 256 for row in image for pixel in row]
               for image in test_images]

print(shape(train_images)) # [60000, 784]
print(shape(test_images))  # [10000, 784]

# After centering, average pixel should be very close to 0
print(tensor_sum(train_images))
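
The same recenter, rescale, and flatten steps can be checked by hand on a toy dataset. The sketch below is self-contained and uses two made-up 2 × 2 "images" (the pixel values are invented for illustration), with built-in `sum` standing in for `tensor_sum`:

```python
# Two made-up 2x2 "images" with pixel values between 0 and 255.
images = [[[0, 64], [128, 255]],
          [[255, 255], [0, 0]]]

# Average pixel value over the whole dataset.
num_pixels = sum(len(row) for image in images for row in image)
avg = sum(pixel for image in images for row in image for pixel in row) / num_pixels

# Recenter, rescale, and flatten each image into a single list.
flat = [[(pixel - avg) / 256 for row in image for pixel in row]
        for image in images]

print(len(flat), len(flat[0]))            # 2 images, 4 values each
print(sum(sum(image) for image in flat))  # should be very close to 0
```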

We also want to one-hot-encode the targets, since we have 10 outputs. First let’s write
a `one_hot_encode` function:

In [None]:
def one_hot_encode(i: int, num_labels: int = 10) -> List[float]:
    return [1.0 if j == i else 0.0 for j in range(num_labels)]

print(one_hot_encode(3)) # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(one_hot_encode(2, num_labels=5)) # [0.0, 0.0, 1.0, 0.0, 0.0]

and then apply it to our data:

In [None]:
# One-hot encode the training and test labels

train_labels = [one_hot_encode(label) for label in train_labels]
test_labels = [one_hot_encode(label) for label in test_labels]

print(shape(train_labels)) # [60000, 10]
print(shape(test_labels))  # [10000, 10]

One of the strengths of our abstractions is that we can use the same training/evaluation
loop with a variety of models. So let’s write that first. We’ll pass it our model, the
data, a loss function, and (if we’re training) an optimizer.

It will make a pass through our data, track performance, and (if we passed in an optimizer)
update our parameters:

In [None]:
import tqdm
from typing import Optional

# Training loop

def loop(model: Layer,
         images: List[Tensor],
         labels: List[Tensor],
         loss: Loss,
         optimizer: Optional[Optimizer] = None) -> None:
    correct = 0         # Track number of correct predictions.
    total_loss = 0.0    # Track total loss.

    with tqdm.trange(len(images)) as t:
        for i in t:
            predicted = model.forward(images[i])             # Predict.
            if argmax(predicted) == argmax(labels[i]):       # Check for
                correct += 1                                 # correctness.
            total_loss += loss.loss(predicted, labels[i])    # Compute loss.

            # If we're training, backpropagate gradient and update weights.
            if optimizer is not None:
                gradient = loss.gradient(predicted, labels[i])
                model.backward(gradient)
                optimizer.step(model)

            # And update our metrics in the progress bar.
            avg_loss = total_loss / (i + 1)
            acc = correct / (i + 1)
            t.set_description(f"mnist loss: {avg_loss:.3f} acc: {acc:.3f}")
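
The accuracy bookkeeping in the loop relies on comparing the `argmax` of the prediction with the `argmax` of the one-hot label. Here is a small self-contained sketch of just that step; the `argmax` below is a stand-in written for this example, and the predictions and labels are made up:

```python
from typing import List

def argmax(xs: List[float]) -> int:
    """Index of the largest value (a stand-in for the helper used in loop)."""
    return max(range(len(xs)), key=lambda i: xs[i])

# Made-up model outputs and one-hot labels for three examples.
predictions = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]]
labels = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

# A prediction counts as correct when its largest output is in the same
# position as the 1.0 in the label.
correct = sum(argmax(p) == argmax(y) for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(correct, accuracy)
```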

As a baseline, we can use our deep learning library to train a simple model consisting of just a single linear layer followed by a softmax. This model (in essence) just looks for 10 linear functions such that if the input represents, say, a 5, then the linear function for the digit 5 produces the largest output.
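
As a rough illustration of "one linear function per class, largest output wins", here is a self-contained sketch with 3 classes and 2 input features instead of 10 classes and 784 pixels. The helper names `linear_scores` and `softmax` and all of the numbers are assumptions made for this example; they are not the library's `Linear` or `SoftmaxCrossEntropy` classes:

```python
import math
from typing import List

def linear_scores(weights: List[List[float]],
                  biases: List[float],
                  x: List[float]) -> List[float]:
    """One linear function per class: the dot product of w and x, plus b."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]

def softmax(scores: List[float]) -> List[float]:
    """Turn scores into probabilities (shifted by the max for stability)."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up weights for 3 classes and 2 input features.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
x = [0.2, 0.9]

scores = linear_scores(weights, biases, x)
probs = softmax(scores)

# The predicted class is the one whose linear function scored highest.
predicted = max(range(len(probs)), key=lambda i: probs[i])
print(predicted)  # 1
```

Softmax preserves the ordering of the scores, so the predicted class is the same whether you take the largest score or the largest probability.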

One pass through our 60,000 training examples should be enough to learn the model:

In [None]:
random.seed(0)

# baseline model is just a linear layer followed by softmax
model = Linear(784, 10)
loss = SoftmaxCrossEntropy()

# This optimizer seems to work
optimizer = Momentum(learning_rate=0.01, momentum=0.99)

# Train on the training data
loop(model, train_images, train_labels, loss, optimizer)

# Test on the test data (no optimizer means just evaluate)
loop(model, test_images, test_labels, loss)

This should get about 89% accuracy. Let’s see if we can do better with a deep neural network.
We’ll use two hidden layers, the first with 30 neurons and the second with 10
neurons, each followed by a `Dropout` layer and our `Tanh` activation:

In [None]:
# A deep neural network for MNIST

random.seed(0)

# Name the dropout layers so we can turn training mode on and off
dropout1 = Dropout(0.1)
dropout2 = Dropout(0.1)

model = Sequential([
    Linear(784, 30),  # Hidden layer 1: size 30
    dropout1,
    Tanh(),
    Linear(30, 10),   # Hidden layer 2: size 10
    dropout2,
    Tanh(),
    Linear(10, 10)    # Output layer: size 10
])

And we can just use the same training loop!

In [None]:
# Training the deep model for MNIST

optimizer = Momentum(learning_rate=0.01, momentum=0.99)
loss = SoftmaxCrossEntropy()

# Enable dropout and train (this will take a while)
dropout1.train = dropout2.train = True
loop(model, train_images, train_labels, loss, optimizer)

# Disable dropout and evaluate
dropout1.train = dropout2.train = False
loop(model, test_images, test_labels, loss)

Our deep model gets better than 92% accuracy on the test set, which is a nice
improvement from the simple baseline model.

The [MNIST website](http://yann.lecun.com/exdb/mnist/) describes a variety of models that outperform these. Many of
them could be implemented using the machinery we’ve developed so far but would
take an extremely long time to train in our lists-as-tensors framework. Some of the
best models involve convolutional layers, which are important but unfortunately quite
out of scope for an introductory course on data science.

## Saving and Loading Models

These models take a long time to train, so it would be nice if we could save them so
that we don’t have to train them every time. Luckily, we can use the `json` module to
easily serialize model weights to a file.

For saving, we can use `Layer.params` to collect the weights, stick them in a list, and
use `json.dump` to save that list to a file:

In [None]:
import json

def save_weights(model: Layer, filename: str) -> None:
    weights = list(model.params())
    with open(filename, 'w') as f:
        json.dump(weights, f)

Loading the weights back is only a little more work. We just use `json.load` to get the
list of weights back from the file and slice assignment to set the weights of our model.

(In particular, this means that we have to instantiate the model ourselves and then
load the weights. An alternative approach would be to also save some representation
of the model architecture and use that to instantiate the model. That’s not a terrible
idea, but it would require a lot more code and changes to all our Layers, so we’ll stick
with the simpler way.)

Before we load the weights, we’d like to check that they have the same shapes as the
model params we’re loading them into. (This is a safeguard against, for example, trying
to load the weights for a saved deep network into a shallow network, or similar
issues.)

In [None]:
def load_weights(model: Layer, filename: str) -> None:
    with open(filename) as f:
        weights = json.load(f)

    # Check for consistency. zip stops at the shorter sequence, so compare
    # the number of tensors as well as their shapes.
    params = list(model.params())
    assert len(params) == len(weights)
    assert all(shape(param) == shape(weight)
               for param, weight in zip(params, weights))

    # Then load using slice assignment:
    for param, weight in zip(params, weights):
        param[:] = weight
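
To see the round trip in isolation, here is a self-contained sketch that saves and restores plain nested lists. The `trained` and `fresh` lists are hypothetical stand-ins for what `model.params()` would yield, and the file lives in a temporary directory:

```python
import json
import os
import tempfile

# Hypothetical stand-ins for model.params(): lists of nested-list "tensors".
trained = [[[0.1, -0.2], [0.3, 0.4]], [0.5, -0.6]]
fresh = [[[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]]

# Save the "trained" weights as JSON text.
path = os.path.join(tempfile.mkdtemp(), "weights.json")
with open(path, "w") as f:
    json.dump(trained, f)

# Read them back.
with open(path) as f:
    loaded = json.load(f)

# Slice assignment mutates the existing lists in place, so anything that
# already holds a reference to them (such as a layer) sees the new values.
alias = fresh[0]
for param, weight in zip(fresh, loaded):
    param[:] = weight

print(fresh == trained)   # True
print(alias is fresh[0])  # True: the same list object, now with new contents
```

This is why `load_weights` uses `param[:] = weight` rather than rebinding a name: rebinding would leave the model's own lists untouched.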

JSON stores your data as text, which makes it an extremely inefficient
representation. In real applications you’d probably use the
`pickle` serialization library, which serializes things to a more efficient
binary format. Here I decided to keep it simple and human-readable.
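
For comparison, a minimal sketch of the same round trip with `pickle` might look like the following. The `weights` list is again a made-up stand-in for collected model parameters, and the file name is arbitrary:

```python
import os
import pickle
import tempfile

# Made-up stand-in for the list of weight tensors collected from a model.
weights = [[[0.1, -0.2], [0.3, 0.4]], [0.5, -0.6]]

path = os.path.join(tempfile.mkdtemp(), "weights.pkl")

# pickle writes bytes, so the file must be opened in binary mode.
with open(path, "wb") as f:
    pickle.dump(weights, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == weights)  # True
```

One caveat if you go this route: unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.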