<a href="https://colab.research.google.com/github/ccarpenterg/LearningMXNet/blob/master/02_getting_started_with_mxnet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Getting Started with MXNet: Training a NN on MNIST

In this notebook, we train an artificial neural network on the MNIST dataset. We'll build a very simple neural network of 3 layers (input, hidden and output), and use dropout for regularization.

As we saw in the previous notebook, MXNet is not installed by default in Colab. So first we need to find out which CUDA version Colab is using, and then install the MXNet package that matches that CUDA version, as we did before:

In [0]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130


Colab is using CUDA 10.0 so we need to install mxnet-cu100:

In [0]:
!pip install mxnet-cu100
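
The step from CUDA release to package name can be sketched in plain Python. The helper `mxnet_package_for` below is a hypothetical name of our own, not part of MXNet or Colab. It reads the release number out of the `nvcc --version` text and formats a name following the `mxnet-cu<major><minor>` pattern used above. It only builds a string, so whether a package is actually published for a given CUDA release still has to be checked.

```python
import re


def mxnet_package_for(nvcc_output):
    """Build a pip package name such as 'mxnet-cu100' from `nvcc --version` text.

    Hypothetical helper for illustration: it only formats a name and does not
    check that the package actually exists.
    """
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        # No CUDA release found in the text: use the CPU-only package name.
        return "mxnet"
    major, minor = match.groups()
    return "mxnet-cu{}{}".format(major, minor)


sample = "Cuda compilation tools, release 10.0, V10.0.130"
print(mxnet_package_for(sample))
```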

Now we'll import a couple of standard modules:

- **mxnet** is the framework that we import as **mx**
- **nd** is short for NDArray and is MXNet's primary tool for working with tensors
- **gluon** includes several modules that we'll be using for training our network, such as **data** for downloading the dataset and loading the data into tensors, and **loss** for calculating the loss on each iteration.
- **autograd** is the tool we use to automatically calculate the network's gradients w.r.t. the parameters
- **nn** is a high-level API that will help us build our neural network

In [0]:
from __future__ import print_function

import mxnet as mx
from mxnet import nd, gluon, autograd
from mxnet.gluon import nn

import statistics

print(mx.__version__)

1.5.1


### MNIST Dataset

We are going to work with the MNIST dataset. It contains grayscale images of handwritten digits and their corresponding labels (zero, one, two, etc.).



In [0]:

# MXNet's default data convention is NCHW whereas
# the MNIST Tensor's dimensions are NHWC

def data_convention_normalization(data):
    """HWC -> CHW; Move the channel axis (2) to the first axis (0)"""
    return nd.moveaxis(data, 2, 0).astype('float32') / 255


train_data = gluon.data.vision.MNIST(train=True).transform_first(data_convention_normalization)
val_data = gluon.data.vision.MNIST(train=False).transform_first(data_convention_normalization)

print(len(train_data))
print(len(val_data))

60000
10000
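
To make the axis reordering concrete, here is a minimal sketch in plain Python that mimics what `data_convention_normalization` does to a single image: it moves the channel axis to the front (HWC to CHW) and scales the byte values to the range [0, 1]. The function name `hwc_to_chw_scaled` and the tiny 2x3 image are made up for illustration; the real transform works on NDArrays, not nested lists.

```python
def hwc_to_chw_scaled(image):
    """Reorder a nested-list image from HWC to CHW and scale values to [0, 1]."""
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    return [
        [[image[h][w][c] / 255 for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]


# A made-up 2x3 grayscale image (one channel) with raw byte values.
tiny = [[[0], [51], [102]],
        [[153], [204], [255]]]

chw = hwc_to_chw_scaled(tiny)
print(len(chw), len(chw[0]), len(chw[0][0]))  # channels, height, width
print(chw[0][1][2])  # bottom-right pixel after scaling
```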


In [0]:
train_loader = gluon.data.DataLoader(train_data, shuffle=True, batch_size=64)
val_loader = gluon.data.DataLoader(val_data, shuffle=False, batch_size=64)

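# Iterate over the whole loader once; X and y keep the last batch.
# 60000 is not a multiple of 64, so the last batch is smaller than 64.
for X, y in train_loader:
    pass

print(X.shape)
print(y.shape)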


(32, 1, 28, 28)
(32,)
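
The batch dimension printed above is 32 rather than 64 because the loop leaves `X` and `y` holding the final batch, and 60000 is not a multiple of 64. The arithmetic below is a small sketch of that, assuming the loader keeps the incomplete last batch instead of discarding it (which is what the printed shape suggests).

```python
import math

num_samples = 60000  # size of the MNIST training set
batch_size = 64      # value passed to the DataLoader

num_batches = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (num_batches - 1) * batch_size

print(num_batches)
print(last_batch_size)
```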


### Building the Neural Network

We use the Sequential container, which provides an API similar to Keras. We put together 4 different layers:

- **Flatten:** before feeding the MNIST images forward we need to flatten them. This layer takes each 1x28x28 image and turns it into a 784-element vector.
- **Dense (hidden layer):** this is our first fully connected layer. It has 128 neurons; each one connects to all 784 inputs, has its own bias, and uses ReLU as its activation function.
- **Dropout:** this is the regularization method we'll use when training our network. On each training iteration, dropout randomly zeroes a fraction (here 20%) of the outputs of the previous layer.
- **Dense (output layer):** the MNIST dataset has 10 classes, one for each digit. So this layer has 10 neurons, one per digit.
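
To give a feel for what the dropout layer does, here is a minimal sketch of "inverted" dropout in plain Python. The function `dropout` below is our own illustration, not MXNet's implementation. It assumes the common convention, which as far as we know MXNet also follows, of scaling the surviving values by 1 / (1 - p) during training so that their expected value stays the same, and of passing values through unchanged at inference time.

```python
import random


def dropout(activations, drop_prob, training=True, rng=random):
    """Inverted dropout on a list of activations (illustrative sketch)."""
    if not training or drop_prob == 0:
        # At inference time the values pass through unchanged.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Zero each value with probability drop_prob and scale the survivors
    # by 1 / keep_prob so the expected value is preserved.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [1.0] * 10
print(dropout(hidden, 0.2, training=True, rng=rng))
print(dropout(hidden, 0.2, training=False))
```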

In [0]:
drop_prob = 0.2

net = nn.Sequential()
net.add(nn.Flatten(),
        nn.Dense(128, activation='relu'),
        nn.Dropout(drop_prob),
        nn.Dense(10))

net

Sequential(
  (0): Flatten
  (1): Dense(None -> 128, Activation(relu))
  (2): Dropout(p = 0.2, axes=())
  (3): Dense(None -> 10, linear)
)

Before initializing our network we'll set up the device. We can train our model on either a CPU or a GPU. GPUs are designed and optimized for processing tensors (or arrays in general), and Colab lets us borrow one. The code below falls back to the CPU if no GPU is available:

In [0]:
ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu(0)
net.initialize(mx.init.Xavier(), ctx=ctx)

Now we call the summary method and take a look at our neural network's architecture. As we can see, our basic neural network has 101,770 parameters to train, including weights and biases:

In [0]:
net.summary(nd.zeros((1, 1, 28, 28), ctx=ctx))

--------------------------------------------------------------------------------
        Layer (type)                                Output Shape         Param #
               Input                              (1, 1, 28, 28)               0
           Flatten-1                                    (1, 784)               0
        Activation-2                    <Symbol dense2_relu_fwd>               0
        Activation-3                                    (1, 128)               0
             Dense-4                                    (1, 128)          100480
           Dropout-5                                    (1, 128)               0
             Dense-6                                     (1, 10)            1290
Parameters in forward computation graph, duplicate included
   Total params: 101770
   Trainable params: 101770
   Non-trainable params: 0
Shared params in forward computation graph: 0
Unique parameters in model: 101770
---------------------------------------------------
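
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input/output pair plus one bias per output. The short sketch below redoes that arithmetic; `dense_params` is a helper name we made up for this check.

```python
def dense_params(in_units, out_units):
    """Weights plus one bias per output unit of a fully connected layer."""
    return in_units * out_units + out_units


hidden = dense_params(28 * 28, 128)  # 784 inputs -> 128 hidden neurons
output = dense_params(128, 10)       # 128 hidden neurons -> 10 classes
print(hidden, output, hidden + output)
```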

### Trainer: Stochastic Gradient Descent

In [0]:
trainer = gluon.Trainer(
    params=net.collect_params(),
    optimizer='sgd',
    optimizer_params={'learning_rate': 0.04},
)
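
As a rough picture of what the trainer does on each step, plain SGD (no momentum or weight decay, as configured above) moves every weight against its gradient. The sketch below is not MXNet code. It assumes the gradients were summed over the batch and are divided by the batch size before the update, which is our understanding of what passing `batch_size` to `trainer.step` is for. The weights and gradients are made-up numbers.

```python
def sgd_step(weights, summed_grads, learning_rate, batch_size):
    """One plain SGD update: w <- w - lr * (summed gradient / batch size)."""
    return [w - learning_rate * g / batch_size
            for w, g in zip(weights, summed_grads)]


weights = [0.5, -0.3]
summed_grads = [6.4, -3.2]  # made-up gradients, summed over a batch of 64
updated = sgd_step(weights, summed_grads, learning_rate=0.04, batch_size=64)
print(updated)
```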

Now we set up the training accuracy metric and the loss function. In this case, we use softmax cross-entropy as our loss function:

In [0]:
training_accuracy = mx.metric.Accuracy()
loss_function = gluon.loss.SoftmaxCrossEntropyLoss()
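
For intuition, softmax cross-entropy for one sample is the negative log-probability that the softmax of the network's outputs assigns to the correct class. The sketch below computes it in plain Python for a single list of scores. The function `softmax_cross_entropy` is our own illustration of the formula, not the Gluon implementation, and the scores are invented.

```python
import math


def softmax_cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]


# An untrained network that scores all 10 digits equally.
uniform_logits = [0.0] * 10
print(softmax_cross_entropy(uniform_logits, 3))

# A network that is confident about the correct digit.
confident_logits = [0.0] * 10
confident_logits[3] = 10.0
print(softmax_cross_entropy(confident_logits, 3))
```

With equal scores for all 10 classes the loss is log(10), and it approaches 0 as the score of the correct class grows.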

Now we will train our network:

In [0]:
num_epochs = 10

for epoch in range(num_epochs):
    
    batch_train_loss = []
    
    for batch, labels in train_loader:
        
        batch = batch.as_in_context(ctx)
        labels = labels.as_in_context(ctx)
        
        with autograd.record():
            predictions = net(batch)
            loss = loss_function(predictions, labels)
            
        loss.backward()
        training_accuracy.update(labels, predictions)
        
        trainer.step(batch_size=batch.shape[0])
        
        # Record the loss summed over the batch (not the per-sample mean)
        batch_train_loss.append(float(nd.sum(loss).asscalar()))
        
    batch_loss = statistics.mean(batch_train_loss)
    
    name, train_accuracy = training_accuracy.get()
    
    
    print('Loss on epoch {}: {}'.format(epoch + 1, batch_loss))
    print('Training accuracy on epoch {}: {}'.format(epoch + 1, train_accuracy))
    training_accuracy.reset()

Loss on epoch 1: 7.28733363664989
Training accuracy on epoch 1: 0.96725
Loss on epoch 2: 7.00827365554472
Training accuracy on epoch 2: 0.96865
Loss on epoch 3: 6.838372502372717
Training accuracy on epoch 3: 0.9687666666666667
Loss on epoch 4: 6.499931328395791
Training accuracy on epoch 4: 0.97085
Loss on epoch 5: 6.231258062538561
Training accuracy on epoch 5: 0.9714833333333334
Loss on epoch 6: 6.0827142206717655
Training accuracy on epoch 6: 0.97295
Loss on epoch 7: 5.839221737659308
Training accuracy on epoch 7: 0.9736666666666667
Loss on epoch 8: 5.636277662093706
Training accuracy on epoch 8: 0.9739333333333333
Loss on epoch 9: 5.594816475217022
Training accuracy on epoch 9: 0.9735833333333334
Loss on epoch 10: 5.429911271944992
Training accuracy on epoch 10: 0.9757333333333333
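
Note that the loss values printed above are averages of per-batch sums, so they are roughly 64 times larger than a per-sample loss. The sketch below converts the epoch 1 figure to a per-sample value, assuming 938 batches per epoch (937 full batches of 64 plus one final batch of 32).

```python
import math

num_samples = 60000
batch_size = 64
num_batches = math.ceil(num_samples / batch_size)

mean_batch_sum = 7.28733363664989  # "Loss on epoch 1" from the output above

# (mean of per-batch sums) * (number of batches) = total loss over the epoch
per_sample_loss = mean_batch_sum * num_batches / num_samples
print(round(per_sample_loss, 4))
```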


In [0]:
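# Evaluate on the validation set. Outside autograd.record() the network
# runs in inference mode, so dropout is disabled.
metric = mx.metric.Accuracy()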
for batch, labels in val_loader:
    batch = batch.as_in_context(ctx)
    labels = labels.as_in_context(ctx)
    metric.update(labels, net(batch))
    
print('Validation: {} = {}'.format(*metric.get()))

Validation: accuracy = 0.9777
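
The accuracy metric itself is simple: for each sample, take the class with the highest score as the prediction and count how often it matches the label. Here is a minimal sketch in plain Python. The function `accuracy`, the labels, and the scores are all made up for illustration.

```python
def accuracy(labels, scores):
    """Fraction of rows whose highest-scoring class equals the label."""
    correct = 0
    for label, row in zip(labels, scores):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        correct += int(predicted == label)
    return correct / len(labels)


labels = [2, 0, 1, 1]
scores = [
    [0.1, 0.2, 3.0],  # predicts class 2 (correct)
    [2.0, 0.5, 0.1],  # predicts class 0 (correct)
    [0.3, 0.2, 0.9],  # predicts class 2 (wrong, the label is 1)
    [0.0, 1.5, 0.2],  # predicts class 1 (correct)
]
print(accuracy(labels, scores))
```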
