#  Lab: Training on multiple GPUs with `gluon`

This lab demonstrates how to split the training of a model across multiple GPUs using Gluon. It uses data parallelism: each batch is split into equal portions, one per GPU; each GPU runs a forward and backward pass on its portion; the gradients are then summed and the parameters updated. A complete copy of all the parameters is kept on each GPU.

This lab has been adapted from https://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html.

This lab has been tested on an ml.p3.8xlarge SageMaker notebook instance. It requires an instance with multiple GPUs.

The key steps are:

* Choose the `local` or `device` kvstore
* Initialise the parameters and copy all of them to each GPU
* Split up the batch into portions and copy each portion onto a GPU
* Run forward and backward

The following steps are run automatically by Gluon when parameters on multiple devices are detected:
* Sum the gradients across all GPUs and broadcast to all GPUs
* Update the weights
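
The steps above can be sketched without any GPU. The following is a minimal illustration in plain Python, assuming a toy one-parameter linear model with a squared-error loss; the name `shard_gradient` and the data values are hypothetical and are not part of Gluon. It shows that summing the per-device gradients gives the same result as computing the gradient over the whole batch.

```python
# Toy data-parallel step for the model y_hat = w * x with loss (y_hat - y)^2.
# Pure-Python illustration; the "devices" are simulated by list slices.

def shard_gradient(w, xs, ys):
    """Sum of d/dw (w*x - y)^2 over one portion of the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))

w = 0.5                        # every "device" holds the same copy of w
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
num_devices = 2
lr = 0.01

# 1. Split the batch into equal portions, one per device.
step = len(xs) // num_devices
shards = [(xs[i * step:(i + 1) * step], ys[i * step:(i + 1) * step])
          for i in range(num_devices)]

# 2. Run forward and backward on each device with its local copy of w.
grads = [shard_gradient(w, sx, sy) for sx, sy in shards]

# 3. Sum the gradients across devices.
total_grad = sum(grads)

# 4. Update the weights; dividing by the batch size averages the gradient.
w_new = w - lr * total_grad / len(xs)

# Reference: the gradient computed over the whole batch at once.
full_grad = shard_gradient(w, xs, ys)
print(grads, total_grad, full_grad, w_new)
```

The per-device gradients differ because each device sees different examples, but their sum matches the full-batch gradient, which is why every copy of the parameters stays identical after the update.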

Start by defining a simple convolutional neural network for image classification:

In [1]:
import mxnet as mx
from mxnet import nd, gluon, autograd
from time import time
net = gluon.nn.Sequential(prefix='cnn_')
with net.name_scope():
    net.add(gluon.nn.Conv2D(channels=20, kernel_size=3, activation='relu'))
    net.add(gluon.nn.MaxPool2D(pool_size=(2,2), strides=(2,2)))
    net.add(gluon.nn.Conv2D(channels=50, kernel_size=5, activation='relu'))
    net.add(gluon.nn.MaxPool2D(pool_size=(2,2), strides=(2,2)))
    net.add(gluon.nn.Flatten())
    net.add(gluon.nn.Dense(128, activation="relu"))
    net.add(gluon.nn.Dense(10))
    
loss = gluon.loss.SoftmaxCrossEntropyLoss()

In [2]:
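class Profile:
    """Simple wall-clock timer that records named checkpoints."""

    def __init__(self, silent=False):
        self.silent = silent
        # keep checkpoints per instance rather than shared across all instances
        self.checkpoints = {}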
    
    def start(self,event_name='default'):
        self.checkpoints[event_name] = time()
        return self
        
    def stop(self,event_name='default',print_out=True):
        current_time = time()
        checkpoint = self.checkpoints[event_name]
        period_seconds = current_time - checkpoint
        if print_out and not self.silent:
            print('{} {:.4f} sec'.format(event_name,period_seconds))
        return period_seconds

## Initialize on multiple devices

Gluon supports initializing network parameters on multiple devices. To do this, pass a list of device contexts instead of a single context.
The parameters are then initialized to be identical across all of the devices.

In [3]:
GPU_COUNT = 2
ctx = [mx.gpu(i) for i in range(GPU_COUNT)]
net.collect_params().initialize(ctx=ctx,force_reinit=True)

Each batch of input data is split into parts (one for each GPU) 
by calling `gluon.utils.split_and_load(batch, ctx)`.
The `split_and_load` function also loads each part onto the appropriate device context. 

When the forward and backward passes are computed later, each part is processed on its own device, using the copy of the parameters stored there.
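
As a rough picture of the splitting step, the sketch below divides a batch along its first axis into equal slices, one per device. `split_evenly` is a hypothetical helper written for illustration; it mimics only the slicing, not the copy to device memory that `split_and_load` also performs, and it assumes the batch size is divisible by the number of devices.

```python
def split_evenly(batch, num_slices):
    """Split a list of examples into num_slices equal, contiguous parts."""
    if len(batch) % num_slices != 0:
        raise ValueError(
            'batch of size {} cannot be split evenly into {} slices'.format(
                len(batch), num_slices))
    step = len(batch) // num_slices
    return [batch[i * step:(i + 1) * step] for i in range(num_slices)]

batch = ['img0', 'img1', 'img2', 'img3']   # stand-ins for 4 examples
parts = split_evenly(batch, 2)
print(parts)
```

With `GPU_COUNT = 2` and a batch of 4 examples, each device receives 2 examples, which is why each network output printed in the next cell has shape `2x10`.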

In [4]:
from mxnet.test_utils import get_mnist

mnist = get_mnist()
batch = mnist['train_data'][0:GPU_COUNT*2, :]
data = gluon.utils.split_and_load(batch, ctx)
print(net(data[0]))
print(net(data[1]))


[[-0.00481914  0.00191422  0.0207928  -0.01032626  0.01591926 -0.00201488
  -0.00969938 -0.01089185 -0.00311436  0.00096904]
 [ 0.00226963 -0.00863828  0.01155043 -0.03112908  0.0468794  -0.01447409
  -0.00345835 -0.01686898 -0.0088306  -0.00481959]]
<NDArray 2x10 @gpu(0)>

[[-0.01494185  0.00641098  0.02740018 -0.01331076  0.01718415 -0.02049804
  -0.0095628  -0.0193483   0.0079064  -0.01144796]
 [-0.00765511  0.00055097  0.00661377 -0.01600864  0.02302845 -0.0043479
  -0.01212487 -0.01363103  0.00415655 -0.00635323]]
<NDArray 2x10 @gpu(1)>


At any time, we can access the version of the parameters stored on each device. 
Note that the weights may not actually be initialized
when we call `initialize`, because the parameter shapes may not yet be known. 
In these cases, initialization is deferred pending shape inference. 
Here the forward pass in the previous cell has already triggered shape inference, so the weights can be read.

In [5]:
weight = net.collect_params()['cnn_conv0_weight']

for c in ctx:
    print('=== channel 0 of the first conv on {} ==={}'.format(
        c, weight.data(ctx=c)[0]))
    

=== channel 0 of the first conv on gpu(0) ===
[[[ 0.0068339   0.01299825  0.0301265 ]
  [ 0.04819721  0.01438687  0.05011239]
  [ 0.00628365  0.04861524 -0.01068833]]]
<NDArray 1x3x3 @gpu(0)>
=== channel 0 of the first conv on gpu(1) ===
[[[ 0.0068339   0.01299825  0.0301265 ]
  [ 0.04819721  0.01438687  0.05011239]
  [ 0.00628365  0.04861524 -0.01068833]]]
<NDArray 1x3x3 @gpu(1)>


Similarly, we can access the gradients on each of the GPUs. Because each GPU gets a different part of the batch (a different subset of examples), the gradients on each GPU vary. 

In [6]:
def forward_backward(net, data, label, p):
    with autograd.record():
        p.start('forward')
        losses = [loss(net(X), Y) for X, Y in zip(data, label)]
        p.stop('forward')
    p.start('backward')
    for l in losses:
        l.backward()
    p.stop('backward')
        
# use the same slice as the data batch so that data and labels stay aligned
label = gluon.utils.split_and_load(mnist['train_label'][0:GPU_COUNT*2], ctx)
forward_backward(net, data, label, Profile())
for c in ctx:
    print('=== grad of channel 0 of the first conv2d on {} ==={}'.format(
        c, weight.grad(ctx=c)[0]))

forward 0.0032 sec
backward 0.0014 sec
=== grad of channel 0 of the first conv2d on gpu(0) ===
[[[ 0.01956055 -0.00624949 -0.01131491]
  [ 0.00674033 -0.00630837 -0.00842709]
  [ 0.02528841  0.00113929 -0.0189858 ]]]
<NDArray 1x3x3 @gpu(0)>
=== grad of channel 0 of the first conv2d on gpu(1) ===
[[[-0.07402535 -0.06295478 -0.02819332]
  [-0.08098993 -0.0621792  -0.01296096]
  [-0.03471686 -0.02846123 -0.00763959]]]
<NDArray 1x3x3 @gpu(1)>


## Putting it all together

Now we can implement the remaining functions. Most of them are the same as [when we did everything by hand](./chapter07_distributed-learning/multiple-gpus-scratch.ipynb); one notable difference is that when a `gluon` trainer detects parameters on multiple devices, it automatically aggregates the gradients and synchronizes the parameters. 
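
One detail worth spelling out is the argument to `trainer.step`. The loss returns one value per example, so the backward pass accumulates a sum of per-example gradients, and `trainer.step(batch_size)` rescales that sum by `1/batch_size` before the update. The plain-Python sketch below illustrates the arithmetic with made-up per-device gradient sums; `sgd_step` and all of the numbers are hypothetical and are not Gluon code.

```python
def sgd_step(w, summed_grad, batch_size, lr):
    """SGD update in which the summed gradient is rescaled by 1/batch_size."""
    return w - lr * summed_grad / batch_size

lr = 0.3
w = 1.0

# Hypothetical gradient sums from two devices holding 512 examples each.
grad_gpu0 = 256.0
grad_gpu1 = 768.0
total_batch = 1024

# Gradients are summed across devices, then normalised by the full batch size.
w_multi = sgd_step(w, grad_gpu0 + grad_gpu1, total_batch, lr)

# Passing the per-device size instead would double the effective step.
w_wrong = sgd_step(w, grad_gpu0 + grad_gpu1, 512, lr)
print(w_multi, w_wrong)
```

This is why `train_batch` passes `batch.data[0].shape[0]`, the size of the whole batch rather than of one per-GPU slice.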

In [7]:
from mxnet.io import NDArrayIter

def train_batch(batch, ctx, net, trainer, p):
    # split the data batch and load them on GPUs
    p.start('split_and_load')
    data = gluon.utils.split_and_load(batch.data[0], ctx)
    label = gluon.utils.split_and_load(batch.label[0], ctx)
    p.stop('split_and_load')
    # compute gradient
    p.start('forward_backward')
    forward_backward(net, data, label, p)
    p.stop('forward_backward')
    # update parameters
    p.start('update')
    trainer.step(batch.data[0].shape[0])
    p.stop('update')
    
def valid_batch(batch, ctx, net):
    data = batch.data[0].as_in_context(ctx[0])
    pred = nd.argmax(net(data), axis=1)
    return nd.sum(pred == batch.label[0].as_in_context(ctx[0])).asscalar()    

def run(num_gpus, batch_size, lr):    
    
    p = Profile(silent=True)
    
    # the list of GPUs to use
    ctx = [mx.gpu(i) for i in range(num_gpus)]
    print('Running on {}'.format(ctx))
    
    # data iterator
    mnist = get_mnist()
    train_data = NDArrayIter(mnist["train_data"], mnist["train_label"], batch_size)
    valid_data = NDArrayIter(mnist["test_data"], mnist["test_label"], batch_size)
    print('Batch size is {}'.format(batch_size))
    
    net.collect_params().initialize(force_reinit=True, ctx=ctx)
    # The kvstore can be set to 'local', where the gradients are summed and synced on the CPU,
    # or to 'device', where this happens on the GPUs. With 'device', MXNet uses GPU-to-GPU
    # communication where possible.
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr}, kvstore='local')
    for epoch in range(5):
        # train
        p.start('epoch')
        train_data.reset()
        for batch in train_data:
            train_batch(batch, ctx, net, trainer, p)
        nd.waitall()  # wait until all computations are finished to benchmark the time
        print('Epoch %d, training time = %.1f sec'%(epoch, p.stop('epoch',False)))
        
        # validating
        valid_data.reset()
        correct, num = 0.0, 0.0
        for batch in valid_data:
            correct += valid_batch(batch, ctx, net)
            num += batch.data[0].shape[0]                
        print('         validation accuracy = %.4f'%(correct/num))
                
run(GPU_COUNT, 512*GPU_COUNT, .3) # a larger batch size is used so each GPU has enough data 
run(1, 512, .3)

Running on [gpu(0), gpu(1)]
Batch size is 1024
Epoch 0, training time = 1.5 sec
         validation accuracy = 0.7913
Epoch 1, training time = 1.0 sec
         validation accuracy = 0.9471
Epoch 2, training time = 1.0 sec
         validation accuracy = 0.9623
Epoch 3, training time = 1.0 sec
         validation accuracy = 0.9702
Epoch 4, training time = 1.0 sec
         validation accuracy = 0.9753
Running on [gpu(0)]
Batch size is 512
Epoch 0, training time = 1.9 sec
         validation accuracy = 0.9262
Epoch 1, training time = 1.9 sec
         validation accuracy = 0.9612
Epoch 2, training time = 1.9 sec
         validation accuracy = 0.9729
Epoch 3, training time = 1.9 sec
         validation accuracy = 0.9776
Epoch 4, training time = 1.9 sec
         validation accuracy = 0.9809
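
Both runs process the same number of training examples per epoch, so the epoch times printed above can be compared directly. As a small worked calculation, using the steady-state times from the output (epochs 1 to 4) and a hypothetical helper `speedup`:

```python
def speedup(single_time, multi_time):
    """Ratio of single-GPU to multi-GPU epoch time."""
    return single_time / multi_time

single_gpu_epoch = 1.9   # sec, from the 1-GPU run above
two_gpu_epoch = 1.0      # sec, from the 2-GPU run above (epochs 1 to 4)
num_gpus = 2

s = speedup(single_gpu_epoch, two_gpu_epoch)
efficiency = s / num_gpus   # fraction of ideal linear scaling
print('speedup {:.2f}x, scaling efficiency {:.0%}'.format(s, efficiency))
```

These figures come from a single run with times rounded to 0.1 sec, so treat them as indicative only. Note also that the 2-GPU run, with its larger batch, reaches a slightly lower validation accuracy after five epochs (0.9753 against 0.9809).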


## Conclusion

We have successfully run multi-GPU training using Gluon. Experiment with `GPU_COUNT` and `batch_size`. Pass `silent=False` when instantiating the `Profile` class to see the time taken by each training step.