# Lab: Training on multiple GPUs with `gluon`

This lab demonstrates how to split the training of a model across multiple GPUs using Gluon. It uses data parallelism: each batch is split into equal portions, one per GPU; each GPU runs a forward and backward pass on its portion; the gradients are then summed across GPUs and the parameters are updated. A complete copy of all the parameters is present on each GPU.
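
The data-parallel scheme described above can be checked on a toy problem without any GPUs. The following is a minimal sketch in plain Python, not Gluon code: the one-parameter model, the data values, and the helper `grad_sum` are all made up for illustration. It shows that summing the gradients computed on each portion of a batch gives the same result as computing the gradient on the whole batch at once.

```python
# Toy data-parallel gradient computation in plain Python (no GPUs needed).
# Hypothetical model: y_hat = w * x, loss per example: 0.5 * (w * x - y) ** 2.

def grad_sum(w, examples):
    """Sum of per-example gradients d(loss)/dw over a list of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in examples)

w = 0.5
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# Split the batch into equal portions, one per (pretend) device.
num_devices = 2
portion = len(batch) // num_devices
shards = [batch[i * portion:(i + 1) * portion] for i in range(num_devices)]

# Each device computes gradients on its own shard with its own copy of w.
per_device_grads = [grad_sum(w, shard) for shard in shards]

# Summing the per-device gradients matches the full-batch gradient.
total_grad = sum(per_device_grads)
full_grad = grad_sum(w, batch)
print(per_device_grads, total_grad, full_grad)
```

Because the loss over a batch is a sum over examples, the gradient is too, which is why the per-device results can simply be added together.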

This lab has been adapted from https://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html.

This lab has been tested on an ml.p3.8xlarge SageMaker notebook instance. It requires an instance with multiple GPUs.

The key steps are:

* Choose the `local` or `device` kvstore
* Initialise the parameters and copy all of them to each GPU
* Split up the batch into portions and copy each portion onto a GPU
* Run forward and backward

The following steps are run automatically by Gluon when parameters on multiple devices are detected:
* Sum the gradients across all GPUs and broadcast to all GPUs
* Update the weights
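
The two automatic steps above can be pictured with a small sketch in plain Python. This is not how Gluon or the kvstore is implemented; `allreduce_sum` and `sgd_step` are hypothetical helpers and the gradient values are made up. The division by `batch_size` mirrors the way `trainer.step(batch_size)` normalizes the summed gradient.

```python
# Toy sketch of "sum the gradients, broadcast, update" in plain Python.
# Each list entry stands in for one device; nothing here uses MXNet.

def allreduce_sum(per_device_grads):
    """Sum gradients across devices and hand every device the same total."""
    total = sum(per_device_grads)
    return [total for _ in per_device_grads]

def sgd_step(weights, grads, lr, batch_size):
    """Plain SGD; the summed gradient is rescaled by 1 / batch_size."""
    return [w - lr * g / batch_size for w, g in zip(weights, grads)]

weights = [0.5, 0.5]          # identical parameter copies on two devices
local_grads = [-7.5, -37.5]   # made-up gradients from different batch portions

synced = allreduce_sum(local_grads)
weights = sgd_step(weights, synced, lr=0.1, batch_size=4)
print(synced, weights)
```

Every device applies the same update to the same starting value, so the parameter copies remain identical after the step even though the local gradients differed.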

Start by defining a simple convolutional neural network for image classification:

In [1]:
import mxnet as mx
from mxnet import nd, gluon, autograd
net = gluon.nn.Sequential(prefix='cnn_')
with net.name_scope():
    net.add(gluon.nn.Conv2D(channels=20, kernel_size=3, activation='relu'))
    net.add(gluon.nn.MaxPool2D(pool_size=(2,2), strides=(2,2)))
    net.add(gluon.nn.Conv2D(channels=50, kernel_size=5, activation='relu'))
    net.add(gluon.nn.MaxPool2D(pool_size=(2,2), strides=(2,2)))
    net.add(gluon.nn.Flatten())
    net.add(gluon.nn.Dense(128, activation="relu"))
    net.add(gluon.nn.Dense(10))
    
loss = gluon.loss.SoftmaxCrossEntropyLoss()

## Initialize on multiple devices

Gluon supports initialization of network parameters over multiple devices. This is done by passing in a list of device contexts instead of a single context.
When we pass in a list of contexts, the parameters are initialized
to be identical across all of our devices.

In [2]:
GPU_COUNT = 2
ctx = [mx.gpu(i) for i in range(GPU_COUNT)]
net.collect_params().initialize(ctx=ctx, force_reinit=True)

Each batch of input data is split into parts (one for each GPU) 
by calling `gluon.utils.split_and_load(batch, ctx)`.
The `split_and_load` function also loads each part onto the appropriate device context. 

When the forward and backward passes are computed later, each part is processed on its own device, using the copy of the parameters stored there.
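
To make the splitting step concrete, here is a minimal sketch in plain Python of dividing a batch into equal, contiguous parts and pairing each part with a device. `split_evenly` is a hypothetical helper, the device names are placeholder strings rather than real contexts, and raising an error on an uneven split is an assumption modelled on the default even-split behaviour of `split_and_load`.

```python
def split_evenly(batch, num_parts):
    """Split a list into num_parts equal, contiguous slices."""
    if len(batch) % num_parts != 0:
        raise ValueError(
            'batch of size {} cannot be split evenly into {} parts'.format(
                len(batch), num_parts))
    step = len(batch) // num_parts
    return [batch[i * step:(i + 1) * step] for i in range(num_parts)]

devices = ['gpu(0)', 'gpu(1)']            # placeholder names, not real contexts
batch = ['img0', 'img1', 'img2', 'img3']  # stand-ins for four examples

# Pair each device with its slice of the batch.
placement = dict(zip(devices, split_evenly(batch, len(devices))))
print(placement)
```

This is why the batch size is usually chosen as a multiple of the number of GPUs.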

In [3]:
from mxnet.test_utils import get_mnist
mnist = get_mnist()
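# take two examples per GPU so the batch splits evenly across the devices
batch = mnist['train_data'][0:GPU_COUNT*2, :]
data = gluon.utils.split_and_load(batch, ctx)
# run a forward pass on each device with its own part of the batch
print(net(data[0]))
print(net(data[1]))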


[[-0.00482063  0.00191563  0.02079182 -0.01033005  0.01591893 -0.00201501
  -0.00970065 -0.0108907  -0.00311239  0.0009627 ]
 [ 0.00226819 -0.00864111  0.01155005 -0.03112838  0.04687387 -0.01447218
  -0.00345993 -0.01687234 -0.00883009 -0.00482067]]
<NDArray 2x10 @gpu(0)>

[[-0.01494103  0.00641203  0.02740048 -0.01331081  0.01718488 -0.02049927
  -0.00956308 -0.01934761  0.00790679 -0.0114479 ]
 [-0.00764965  0.00055102  0.00661528 -0.01601085  0.02302293 -0.00435258
  -0.01212662 -0.01362745  0.00415612 -0.00635127]]
<NDArray 2x10 @gpu(1)>


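At any time, we can access the version of the parameters stored on each device.
Recall from the first chapter of the original tutorial that our weights may not actually be initialized
when we call `initialize`, because the parameter shapes may not yet be known.
In these cases, initialization is deferred pending shape inference.
Here the forward pass in the previous cell has already supplied the input shapes, so the weights are initialized and can be read.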
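
Deferred initialization can be illustrated with a toy layer in plain Python. This is only a sketch of the idea, not Gluon's implementation: `LazyDense` is a hypothetical class, and the random range is arbitrary. The weights are created on the first forward call, once the input width is known.

```python
import random

class LazyDense:
    """Toy dense layer whose weights are created on the first forward call."""

    def __init__(self, units):
        self.units = units
        self.weight = None  # shape unknown until we see the input

    def initialize(self):
        # Nothing to allocate yet: the input width is not known.
        pass

    def __call__(self, x):
        if self.weight is None:
            in_units = len(x)  # shape inference from the first input
            self.weight = [[random.uniform(-0.07, 0.07) for _ in range(in_units)]
                           for _ in range(self.units)]
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]

layer = LazyDense(units=3)
layer.initialize()
print(layer.weight is None)   # True: initialization is still deferred
out = layer([1.0, 2.0, 3.0, 4.0])
print(len(layer.weight), len(layer.weight[0]))  # 3 4
```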

In [4]:
weight = net.collect_params()['cnn_conv0_weight']

for c in ctx:
    print('=== channel 0 of the first conv on {} ==={}'.format(
        c, weight.data(ctx=c)[0]))
    

=== channel 0 of the first conv on gpu(0) ===
[[[ 0.0068339   0.01299825  0.0301265 ]
  [ 0.04819721  0.01438687  0.05011239]
  [ 0.00628365  0.04861524 -0.01068833]]]
<NDArray 1x3x3 @gpu(0)>
=== channel 0 of the first conv on gpu(1) ===
[[[ 0.0068339   0.01299825  0.0301265 ]
  [ 0.04819721  0.01438687  0.05011239]
  [ 0.00628365  0.04861524 -0.01068833]]]
<NDArray 1x3x3 @gpu(1)>


Similarly, we can access the gradients on each of the GPUs. Because each GPU gets a different part of the batch (a different subset of examples), the gradients on each GPU vary. 

In [5]:
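def forward_backward(net, data, label):
    # record the forward pass on every device so gradients can be computed
    with autograd.record():
        losses = [loss(net(X), Y) for X, Y in zip(data, label)]
    # backpropagate each device's loss; the gradients stay on that device
    for l in losses:
        l.backward()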
        
# labels for the same examples as `batch`, split across the GPUs in the same way
label = gluon.utils.split_and_load(mnist['train_label'][0:GPU_COUNT*2], ctx)
forward_backward(net, data, label)
for c in ctx:
    print('=== grad of channel 0 of the first conv2d on {} ==={}'.format(
        c, weight.grad(ctx=c)[0]))

=== grad of channel 0 of the first conv2d on gpu(0) ===
[[[ 0.01843166 -0.007361   -0.01329759]
  [ 0.00464045 -0.00820222 -0.01046472]
  [ 0.02330405 -0.00166359 -0.01982304]]]
<NDArray 1x3x3 @gpu(0)>
=== grad of channel 0 of the first conv2d on gpu(1) ===
[[[-0.07402541 -0.06295483 -0.02819332]
  [-0.08098998 -0.06217923 -0.01296097]
  [-0.03471691 -0.02846127 -0.0076396 ]]]
<NDArray 1x3x3 @gpu(1)>


## Putting it all together

Now we can implement the remaining functions. Most of them are the same as [when we did everything by hand](./chapter07_distributed-learning/multiple-gpus-scratch.ipynb). One notable difference is that when a `gluon` trainer detects parameters on multiple devices, it automatically aggregates the gradients and synchronizes the parameters.

In [6]:
from mxnet.io import NDArrayIter
from time import time

def train_batch(batch, ctx, net, trainer):
    # split the data and labels and load each part onto a GPU
    data = gluon.utils.split_and_load(batch.data[0], ctx)
    label = gluon.utils.split_and_load(batch.label[0], ctx)
    # compute gradients on every GPU
    forward_backward(net, data, label)
    # update parameters; the gradients are normalized by the full batch size
    trainer.step(batch.data[0].shape[0])
    
def valid_batch(batch, ctx, net):
    # validation runs on the first GPU only
    data = batch.data[0].as_in_context(ctx[0])
    pred = nd.argmax(net(data), axis=1)
    # count the correct predictions in this batch
    return nd.sum(pred == batch.label[0].as_in_context(ctx[0])).asscalar()

def run(num_gpus, batch_size, lr):    
    # the list of GPUs that will be used
    ctx = [mx.gpu(i) for i in range(num_gpus)]
    print('Running on {}'.format(ctx))
    
    # data iterator
    mnist = get_mnist()
    train_data = NDArrayIter(mnist["train_data"], mnist["train_label"], batch_size)
    valid_data = NDArrayIter(mnist["test_data"], mnist["test_label"], batch_size)
    print('Batch size is {}'.format(batch_size))
    
    net.collect_params().initialize(force_reinit=True, ctx=ctx)
    # The kvstore can be set to 'local', where the gradients are summed and synced on the CPU,
    # or to 'device', where this happens on the GPUs. With 'device', MXNet uses
    # GPU-to-GPU communication where possible.
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr}, kvstore='local')
    for epoch in range(5):
        # train
        start = time()
        train_data.reset()
        for batch in train_data:
            train_batch(batch, ctx, net, trainer)
        nd.waitall()  # wait until all computations are finished to benchmark the time
        print('Epoch %d, training time = %.1f sec'%(epoch, time()-start))
        
        # validating
        valid_data.reset()
        correct, num = 0.0, 0.0
        for batch in valid_data:
            correct += valid_batch(batch, ctx, net)
            num += batch.data[0].shape[0]                
        print('         validation accuracy = %.4f'%(correct/num))
        
run(1, 64, .3)        
run(GPU_COUNT, 64*GPU_COUNT, .3) # a larger batch size is used so each GPU has enough data 

Running on [gpu(0)]
Batch size is 64
Epoch 0, training time = 2.3 sec
         validation accuracy = 0.9733
Epoch 1, training time = 2.2 sec
         validation accuracy = 0.9846
Epoch 2, training time = 2.3 sec
         validation accuracy = 0.9877
Epoch 3, training time = 2.2 sec
         validation accuracy = 0.9881
Epoch 4, training time = 2.2 sec
         validation accuracy = 0.9866
Running on [gpu(0), gpu(1)]
Batch size is 128
Epoch 0, training time = 2.3 sec
         validation accuracy = 0.9508
Epoch 1, training time = 2.4 sec
         validation accuracy = 0.9639
Epoch 2, training time = 2.4 sec
         validation accuracy = 0.9821
Epoch 3, training time = 2.3 sec
         validation accuracy = 0.9860
Epoch 4, training time = 2.3 sec
         validation accuracy = 0.9873


## Conclusion

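In this example the network is relatively small, so the communication overhead outweighs the computational gain from using 2 GPUs: each epoch takes about 2.3 to 2.4 seconds on two GPUs, compared with about 2.2 to 2.3 seconds on one, so there is no speed-up. The two-GPU run also takes more epochs to reach comparable accuracy: with the batch size doubled and the learning rate unchanged, it makes half as many parameter updates per epoch.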
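
One way to see the effect of the larger batch size is to count the parameter updates made in each epoch. The sketch below is plain arithmetic rather than part of the lab code: `updates_per_epoch` is a hypothetical helper, and rounding up assumes the final partial batch is padded to full size, as `NDArrayIter` is assumed to do by default.

```python
import math

TRAIN_EXAMPLES = 60000  # size of the MNIST training set

def updates_per_epoch(batch_size, num_examples=TRAIN_EXAMPLES):
    """Number of parameter updates in one pass, counting a padded last batch."""
    return math.ceil(num_examples / batch_size)

for batch_size in (64, 128):
    print(batch_size, updates_per_epoch(batch_size))
```

With a batch size of 64 there are 938 updates per epoch, and with 128 there are 469, so the two-GPU run takes roughly half as many optimization steps for the same amount of data.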