<a href="https://colab.research.google.com/github/ccarpenterg/LearningMXNet/blob/master/02_getting_started_with_mxnet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Getting Started with MXNet: Training a NN on MNIST

In this notebook, we train an artificial neural network on the MNIST dataset. We'll build a very simple neural network of 3 layers (input, hidden and output), and use dropout for regularization.

As we saw in the previous notebook, MXNet is not installed by default in Colab. So we first need to find out which CUDA version Colab is using, and then install the MXNet package that matches it, as we did before:

In [0]:
!nvcc --version

Colab is using CUDA 10.0 so we need to install mxnet-cu100:

In [0]:
!pip install mxnet-cu100

Now we'll import a couple of standard modules:

- **mxnet** is the framework that we import as **mx**
- **nd** is short for NDArray and is MXNet's primary tool for working with tensors
- **gluon** includes several modules that we'll be using for training our network, such as **data** for downloading the dataset and loading the data into tensors, and **loss** for calculating the loss on each iteration.
- **autograd** is the tool we use to automatically calculate the network's gradients w.r.t. the parameters
- **nn** is a high-level API that will help us build our neural network

In [0]:
from __future__ import print_function

import mxnet as mx
from mxnet import nd, gluon, autograd
from mxnet.gluon import nn

from mxnet.gluon.data.vision import transforms

import statistics

print(mx.__version__)

1.5.1


### MNIST Dataset

We are going to work with the MNIST dataset. It contains grayscale images of handwritten digits and their corresponding labels (one, two, three, etc.).



In [0]:
transform = transforms.Compose([
    transforms.ToTensor()
])

MNIST = gluon.data.vision.MNIST

train_data = MNIST(train=True).transform_first(transform)
valid_data = MNIST(train=False).transform_first(transform)

print(len(train_data))
print(len(valid_data))

60000
10000


In [0]:
train_loader = gluon.data.DataLoader(train_data, shuffle=True, batch_size=64)
valid_loader = gluon.data.DataLoader(valid_data, shuffle=False, batch_size=64)

dataiter = iter(train_loader)

batch, labels = next(dataiter)

print(batch.shape)
print(labels.shape)


(64, 1, 28, 28)
(64,)
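
The shapes above follow from the batch size of 64. As a quick sanity check, the sketch below works out how many mini-batches each loader yields per epoch. It assumes the DataLoader keeps the final, smaller batch (which we believe is its default behaviour); the variable names are ours and are not part of the notebook.

```python
import math

batch_size = 64
n_train, n_valid = 60000, 10000

# number of mini-batches per epoch when the last partial batch is kept
train_batches = math.ceil(n_train / batch_size)
valid_batches = math.ceil(n_valid / batch_size)

# size of the final (partial) batch in each loader
last_train_batch = n_train - (train_batches - 1) * batch_size
last_valid_batch = n_valid - (valid_batches - 1) * batch_size

print(train_batches, last_train_batch)  # 938 32
print(valid_batches, last_valid_batch)  # 157 16
```

So not every batch has exactly 64 examples, which matters later when we sum the loss per batch.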


### Building the Neural Network

We use the Sequential container, which provides an API similar to Keras. We put together 4 different layers:

- **Flatten:** before feeding the MNIST images forward we need to stretch them out. This layer takes a 28x28 matrix and turns it into a 784-element array/vector.
- **Dense (hidden layer):** this is our first fully connected layer. Each of its neurons connects to all 784 input neurons, and each has a bias. Each neuron in this layer uses ReLU as its activation function.
- **Dropout:** this is the regularization method we'll use when training our network. Dropout works by randomly dropping, in each training iteration, some of the neurons in the previous layer.
- **Dense (output layer):** the MNIST dataset has 10 classes, one for each digit. So we'll have 10 neurons in this layer, one per digit.
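
To make the dropout idea concrete, here is a minimal pure-Python sketch of "inverted" dropout on a list of activations. It is an illustration under our own assumptions, not MXNet's implementation: the function name `dropout` and the use of Python's `random` module are ours. Surviving activations are scaled by `1 / (1 - drop_prob)` so that the expected value of each activation stays the same.

```python
import random

def dropout(activations, drop_prob, rng):
    """Inverted dropout on a list of floats (training-time behaviour only)."""
    keep_prob = 1.0 - drop_prob
    out = []
    for a in activations:
        if rng.random() < drop_prob:
            out.append(0.0)            # this neuron is dropped for this iteration
        else:
            out.append(a / keep_prob)  # survivors are rescaled
    return out

rng = random.Random(0)                 # seeded generator, for repeatability
hidden = [1.0] * 10                    # pretend outputs of the hidden layer
dropped = dropout(hidden, drop_prob=0.2, rng=rng)
print(dropped)
```

With `drop_prob = 0.2`, each value in `dropped` is either `0.0` or `1.25`. At prediction time dropout is switched off and the activations pass through unchanged.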

In [0]:
drop_prob = 0.2

net = nn.Sequential()
net.add(nn.Flatten(),
        nn.Dense(128, activation='relu'),
        nn.Dropout(drop_prob),
        nn.Dense(10))

net

Sequential(
  (0): Flatten
  (1): Dense(None -> 128, Activation(relu))
  (2): Dropout(p = 0.2, axes=())
  (3): Dense(None -> 10, linear)
)

Before initializing our network we'll set up a GPU device. We can train our model on either a CPU or a GPU. GPUs are designed and optimized for processing tensors (or arrays in general), and we can borrow a GPU from Colab; if none is available, we fall back to the CPU:

In [0]:
ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu(0)
net.initialize(mx.init.Xavier(), ctx=ctx)

Now we call the summary method and take a look at our neural network's architecture. As we see, our basic neural network has 101,770 parameters to train, including weights and biases:

In [0]:
net.summary(nd.zeros((1, 1, 28, 28), ctx=ctx))

--------------------------------------------------------------------------------
        Layer (type)                                Output Shape         Param #
               Input                              (1, 1, 28, 28)               0
           Flatten-1                                    (1, 784)               0
        Activation-2                    <Symbol dense0_relu_fwd>               0
        Activation-3                                    (1, 128)               0
             Dense-4                                    (1, 128)          100480
           Dropout-5                                    (1, 128)               0
             Dense-6                                     (1, 10)            1290
Parameters in forward computation graph, duplicate included
   Total params: 101770
   Trainable params: 101770
   Non-trainable params: 0
Shared params in forward computation graph: 0
Unique parameters in model: 101770
---------------------------------------------------
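
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name of ours, used only for this arithmetic.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(28 * 28, 128)  # hidden layer: 784 inputs, 128 units
output = dense_params(128, 10)       # output layer: 128 inputs, 10 units
total = hidden + output

print(hidden, output, total)  # 100480 1290 101770
```

Flatten and Dropout contribute no parameters, which matches the zeros in the table.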

### Trainer: Stochastic Gradient Descent

In [0]:
trainer = gluon.Trainer(
    params=net.collect_params(),
    optimizer='sgd',
    optimizer_params={'learning_rate': 0.04},
)
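
As a rough illustration of what one plain SGD update does, here is a pure-Python sketch. It assumes no momentum and no weight decay, and it assumes that the gradients are summed over the batch and then rescaled by `1 / batch_size`, which is how we understand `Trainer.step(batch_size)` to behave. The function `sgd_step` and the example numbers are ours.

```python
def sgd_step(weights, summed_grads, learning_rate, batch_size):
    """One vanilla SGD update: w <- w - lr * (summed gradient / batch_size)."""
    scale = learning_rate / batch_size
    return [w - scale * g for w, g in zip(weights, summed_grads)]

weights = [0.5, -0.3]          # two made-up parameters
summed_grads = [6.4, -12.8]    # made-up gradients, summed over a batch of 64

new_weights = sgd_step(weights, summed_grads, learning_rate=0.04, batch_size=64)
print(new_weights)  # approximately [0.496, -0.292]
```

Each parameter moves a small step against its gradient; the learning rate of 0.04 controls the size of that step.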

**Train function**

The train function will train our artificial neural network by finding the parameters that minimize the loss function. Also we are keeping track of the losses as scalars, and at the end we calculate the mean loss. Here are the train function's steps:

**(i)** get a batch of training examples and their labels, and send them to the device we selected (the GPU, if available), **(ii)** record the forward pass with autograd so that gradients can be calculated, **(iii)** forward propagate the batch through the NN and calculate the loss, **(iv)** backpropagate the loss through the NN and update the parameters (weights and biases).



In [0]:
def train(model, loss_function, optimizer):
    
    train_batch_losses = []
    
    for batch, labels in train_loader:
        batch = batch.as_in_context(ctx)
        labels = labels.as_in_context(ctx)
        
        with autograd.record():
            #these are the output layer's values before applying softmax
            output = model(batch)
            #the loss function applies softmax to the output
            loss = loss_function(output, labels)
            
        loss.backward()
        
        optimizer.step(batch_size=batch.shape[0])
        
        train_batch_losses.append(float(nd.sum(loss).asscalar()))
        
    batch_loss = statistics.mean(train_batch_losses)
    
    return batch_loss
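
Note that the train function sums the per-example losses within each batch and then averages those sums, so the number it returns scales with the batch size. The pure-Python sketch below illustrates the difference between that quantity and a per-example mean on a tiny made-up dataset. The helper `softmax_cross_entropy` is our own stand-in for the loss and is not MXNet code.

```python
import math
import statistics

def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# two made-up batches of (logits, labels); the second batch is smaller
batches = [
    ([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]], [0, 1]),
    ([[0.1, 0.2, 3.0]], [2]),
]

batch_sums = []
n_examples = 0
for logits_batch, labels in batches:
    losses = [softmax_cross_entropy(z, y) for z, y in zip(logits_batch, labels)]
    batch_sums.append(sum(losses))   # same idea as nd.sum(loss) in train()
    n_examples += len(losses)

mean_of_batch_sums = statistics.mean(batch_sums)   # the kind of value train() returns
per_example_mean = sum(batch_sums) / n_examples    # loss averaged over examples

print(mean_of_batch_sums, per_example_mean)
```

With batches of 64, dividing the reported loss by 64 gives a rough per-example figure (rough because the last batch is smaller).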

**Validation function**

Once we have trained our neural network we are ready to validate our model using our validation/test set. The validation function goes through the validation set and outputs the mean loss. At this point we are only working with the loss; we'll calculate the accuracy using a different function.

In [0]:
def validate(model, loss_function, optimizer):
    
    validation_batch_losses = []
    
    for batch, labels in valid_loader:
        batch = batch.as_in_context(ctx)
        labels = labels.as_in_context(ctx)
        
        #these are the output layer's values before applying softmax
        output = model(batch)
        #the loss function applies softmax to the output
        loss = loss_function(output, labels)
        
        validation_batch_losses.append(float(nd.sum(loss).asscalar()))
        
    #compute the mean once, after all the batches have been processed
    mean_loss = statistics.mean(validation_batch_losses)
    
    return mean_loss

**Accuracy function**

We need to know how well our model is doing at predicting the digit in each image. In the accuracy function we use the Accuracy metric that is included in MXNet.

Since the loss function applies the softmax activation itself, our neural network's outputs are raw scores. So we use **nd.softmax** to get the NN's probabilities for each class/digit, and use **nd.argmax** to get the prediction for each training or validation example:

In [0]:
def accuracy(model, loader):
    
    metric = mx.metric.Accuracy()
    
    for batch, labels in loader:
        batch = batch.as_in_context(ctx)
        labels = labels.as_in_context(ctx)
        
        class_probabilities = nd.softmax(model(batch), axis=1)
        
        predictions = nd.argmax(class_probabilities, axis=1)
        
        metric.update(labels, predictions)
        
    _, accuracy_metric = metric.get()
    
    return accuracy_metric
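
The softmax and argmax steps used in the accuracy function can be sketched in pure Python as follows. The helper names `softmax` and `argmax` and the example scores are ours. Because softmax preserves the ordering of the scores, taking the argmax of the raw outputs gives the same prediction, so the softmax call is mainly useful when we want the probabilities themselves.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [1.2, -0.7, 3.4, 0.0]   # made-up raw outputs for 4 classes
probs = softmax(logits)

print(argmax(logits), argmax(probs))  # 2 2
```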

### Training the Neural Network

Now it's time to train our NN, and the first step is to define the loss function. We then define the number of epochs we'll use. An epoch is one training cycle: we go through the whole training set once, and end up with an updated set of parameters:

In [0]:
loss_function = gluon.loss.SoftmaxCrossEntropyLoss()

epochs = 10

for epoch in range(1, 1 + epochs):
    
    print('Epoch {}/{}'.format(epoch, epochs))
    
    train_loss = train(net, loss_function, trainer)
    train_accuracy = accuracy(net, train_loader)
    
    print('Training loss: {}'.format(train_loss))
    print('Training accuracy: {}%'.format(train_accuracy * 100))
    
    valid_loss = validate(net, loss_function, trainer)
    valid_accuracy = accuracy(net, valid_loader)
    
    print('Validation loss: {}'.format(valid_loss))
    print('Validation accuracy: {}%'.format(valid_accuracy * 100))

Epoch 1/10
Training loss: 12.570566440187791
Training accuracy: 95.66666666666667%
Validation loss: 9.933800904803975
Validation accuracy: 95.49%
Epoch 2/10
Training loss: 11.382317298892211
Training accuracy: 96.06%
Validation loss: 9.063731764152552
Validation accuracy: 95.93%
Epoch 3/10
Training loss: 10.39353093638349
Training accuracy: 96.42166666666667%
Validation loss: 8.31701706776953
Validation accuracy: 96.26%
Epoch 4/10
Training loss: 9.69043880917116
Training accuracy: 96.71666666666667%
Validation loss: 7.8181378755030355
Validation accuracy: 96.48%
Epoch 5/10
Training loss: 9.099464911896028
Training accuracy: 96.985%
Validation loss: 7.436347416157176
Validation accuracy: 96.59%
Epoch 6/10
Training loss: 8.548601573337116
Training accuracy: 97.19166666666666%
Validation loss: 6.99420712537067
Validation accuracy: 96.8%
Epoch 7/10
Training loss: 8.11385058886461
Training accuracy: 97.38666666666667%
Validation loss: 6.658243758758163
Validation accuracy: 96.97%
Epoch 8/10