# MNIST Multi-GPU with Keras (TensorFlow backend)

A multi-GPU example with Keras, using TensorFlow's local tower architecture on each GPU. Keras introduced `multi_gpu_model` in v2.0.9; it builds on the multi-GPU code from https://github.com/kuza55/keras-extras

Specifically, the `keras.utils.multi_gpu_model(model, gpus)` function implements single-machine multi-GPU data parallelism. It works in the following way:

- Divide the model's input(s) into multiple sub-batches.
- Apply a model copy on each sub-batch. Every model copy is executed on a dedicated GPU.
- Concatenate the results (on CPU) into one big batch.

E.g. if our batch_size is 64 and we use gpus=2, then we will divide the input into 2 sub-batches of 32 samples, process each sub-batch on one GPU, then return the full batch of 64 processed samples.
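
The split/apply/concatenate steps above can be sketched in plain Python. This is only an illustration of the data flow, not the Keras implementation: `split_batch`, `toy_replica` and `data_parallel_apply` are hypothetical names, the "model" simply doubles each sample, and the sketch assumes the batch divides evenly across replicas.

```python
def split_batch(batch, n_replicas):
    """Split a list of samples into n_replicas equal, contiguous sub-batches."""
    if len(batch) % n_replicas != 0:
        # Simplifying assumption for this sketch: the batch divides evenly.
        raise ValueError("batch size must be divisible by n_replicas")
    size = len(batch) // n_replicas
    return [batch[i * size:(i + 1) * size] for i in range(n_replicas)]


def toy_replica(sub_batch):
    """Hypothetical stand-in for one model copy: doubles every sample."""
    return [2 * x for x in sub_batch]


def data_parallel_apply(batch, n_replicas):
    """Split, apply one replica per sub-batch, then concatenate the outputs."""
    sub_batches = split_batch(batch, n_replicas)
    # In the real setup each of these calls would run on its own GPU.
    outputs = [toy_replica(sb) for sb in sub_batches]
    merged = []
    for out in outputs:
        merged.extend(out)  # concatenation step, preserving sample order
    return merged


batch = list(range(64))                  # a batch of 64 toy "samples"
sub_batches = split_batch(batch, 2)      # gpus=2
result = data_parallel_apply(batch, 2)
print([len(sb) for sb in sub_batches])   # [32, 32]
print(len(result))                       # 64
```

The merged output has the same order and size as the input batch, which is why the caller can treat the parallel model like a single model.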

This function is only available with the TensorFlow backend for the time being.

Here we test a ConvNet for MNIST digit classification. Using `multi_gpu_model` yields a quasi-linear speedup on up to 8 GPUs.

This notebook is compiled from the following tutorials:

- https://keras.io/utils/
- https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/
- https://github.com/normanheckscher/mnist-multi-gpu/blob/master/mnist_multi_gpu_keras.py

## Training a Model Using Multiple GPU Cards

Modern workstations may contain multiple GPUs for scientific computation.
TensorFlow can leverage this environment to run the training operation
concurrently across multiple cards.

Training a model in a parallel, distributed fashion requires
coordinating training processes. In what follows, a *model replica*
is one copy of a model training on a subset of the data.

Naively employing asynchronous updates of model parameters
leads to sub-optimal training performance
because an individual model replica might be trained on a stale
copy of the model parameters. Conversely, employing fully synchronous
updates will be as slow as the slowest model replica.

In a workstation with multiple GPU cards, each GPU will have similar speed
and contain enough memory to run an entire MNIST model. Thus, we opt to
design our training system in the following manner:

* Place an individual model replica on each GPU.
* Update model parameters synchronously by waiting for all GPUs to finish
processing a batch of data.

Here is a diagram of this model:

<div style="width:40%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:100%" src="./images/Parallelism.png">
</div>

Note that each GPU computes inference as well as the gradients for a unique
batch of data. This setup effectively permits dividing up a larger batch
of data across the GPUs.

This setup requires that all GPUs share the model parameters. Transferring
data to and from GPUs is comparatively slow, so we store and update all model
parameters on the CPU (see green box). Once all GPUs have finished processing
a batch of data, a fresh set of model parameters is transferred to each GPU.

The GPUs are synchronized in operation. All gradients are accumulated from
the GPUs and averaged (see green box). The model parameters are updated with
the gradients averaged across all model replicas.
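
To make the synchronous update concrete, here is a minimal pure-Python sketch under stated assumptions: a one-parameter model `y = w * x` with a mean-squared-error loss, two equally sized data shards standing in for two GPUs, and hypothetical helper names (`mse_gradient`, `synchronous_step`). It is not how TensorFlow implements towers; it only shows that averaging per-replica gradients over equal shards gives the same gradient as the full batch.

```python
def mse_gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def synchronous_step(w, shards, lr):
    """One synchronous data-parallel update over a list of (xs, ys) shards."""
    # Every replica uses the same copy of w and its own shard of data.
    grads = [mse_gradient(w, xs, ys) for xs, ys in shards]
    # Wait for all replicas, average their gradients, update w once.
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad, avg_grad


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]  # one shard per "GPU"

w0 = 0.0
w1, avg_grad = synchronous_step(w0, shards, lr=0.01)
full_grad = mse_gradient(w0, xs, ys)

print(avg_grad, full_grad)  # both are -30.0 for this toy data
print(w1)
```

Because the shards have equal size, the average of the shard gradients equals the full-batch gradient, so the synchronous scheme behaves like single-device training on the larger batch.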

In [1]:
from keras.datasets import mnist
from keras.models import Sequential, Model
from keras.layers import Input, Dense, Dropout, Flatten, Activation
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization
from keras.utils import np_utils
from keras import backend as K
from keras.callbacks import TensorBoard, ModelCheckpoint

import numpy as np

import time

Using TensorFlow backend.


In [2]:
# check how many GPUs are available in the box
import tensorflow as tf
from tensorflow.python.client import device_lib
def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

In [3]:
print(get_available_gpus())

['/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3', '/device:GPU:4', '/device:GPU:5', '/device:GPU:6', '/device:GPU:7']


In [4]:
np.random.seed(42)  # for reproducibility

In [5]:
# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In [6]:
ngpus = len(get_available_gpus()) # int(1)
print("Using %i GPUs." %ngpus)

Using 8 GPUs.


In [7]:
# input image dimensions
img_rows, img_cols = 28, 28

In [8]:
# image_data_format() is the Keras 2 replacement for the legacy image_dim_ordering()
if K.image_data_format() == 'channels_first':
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

In [9]:
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

In [10]:
# normalize inputs from 0-255 to 0-1
X_train /= 255
X_test /= 255

In [11]:
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

X_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples


In [12]:
# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, 10)
Y_test = np_utils.to_categorical(y_test, 10)
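
As a rough illustration of what the conversion above produces, the following pure-Python sketch builds the same kind of binary class matrix. `one_hot` is a hypothetical helper written for this note, not the Keras implementation.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1.0 per row."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the true class
        rows.append(row)
    return rows


encoded = one_hot([3, 0], 10)
print(encoded[0])  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(encoded[1])  # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Each row has length 10 (one entry per digit class), which matches the 10-unit softmax output layer used later.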

In [13]:
def create_model():
    model = Sequential()

    model.add(Conv2D(32, (3, 3), padding='valid', input_shape=input_shape))
    model.add(Activation('relu'))
    model.add(Conv2D(256, (3, 3)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Conv2D(128, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10))
    model.add(Activation('softmax'))

    return model

*Note: apparently batch normalization works better in practice after the activation function. https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md*

In [14]:
def train(batch_size, nb_epoch, ngpus=0):
    
    if ngpus >= 2:
        # Instantiate the base model under a CPU device scope,
        # so that the model's weights are hosted on CPU memory.
        # Otherwise they may end up hosted on a GPU, which would
        # complicate weight sharing.
        with tf.device('/cpu:0'):
            model = create_model()
            
        from keras.utils import multi_gpu_model
        # Replicate the model on `ngpus` GPUs (8 in this run, on an AWS p2.8xlarge instance).
        print('Using Multi-GPU: %i GPUs' %ngpus)
        parallel_model = multi_gpu_model(model, gpus=ngpus)
        parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
        
        start_time = time.time()
        # This `fit` call is distributed across `ngpus` GPUs.
        # The global batch is batch_size * ngpus, so each GPU processes
        # batch_size samples per step (128 per GPU, 1024 in total, on 8 GPUs).
        parallel_model.fit(X_train, Y_train, batch_size=batch_size*ngpus, epochs=nb_epoch,
                  verbose=2, validation_data=(X_test, Y_test))
        score = parallel_model.evaluate(X_test, Y_test, verbose=0)
        print('Test score:', score[0])
        print('Test accuracy:', score[1])
        duration = time.time() - start_time
        print('Total Duration (%.3f sec)' % duration)

    else:
        model = create_model()
        
        print('NOT Using Multi-GPU')
        model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
        
        start_time = time.time()
        model.fit(X_train, Y_train, batch_size=batch_size, epochs=nb_epoch,
                  verbose=2, validation_data=(X_test, Y_test))
        score = model.evaluate(X_test, Y_test, verbose=0)
        print('Test score:', score[0])
        print('Test accuracy:', score[1])
        duration = time.time() - start_time
        print('Total Duration (%.3f sec)' % duration)
        
    # Save model via the base model (which shares the same weights) and not the parallel model:
    model.save('model.h5')

In [15]:
batch_size = 128
nb_epoch = 12
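
With these settings, the multi-GPU branch of `train` uses a global batch of `batch_size * ngpus`, which changes the number of optimizer steps per epoch. The short calculation below is a sketch using the 60000 training samples reported earlier and the 8 GPUs of this run; the variable names are illustrative only.

```python
import math

n_train = 60000        # training samples, as printed above
batch_size = 128       # per-GPU batch size
n_gpus = 8             # GPUs used in this run

global_batch = batch_size * n_gpus                     # samples per step, all GPUs
steps_single = math.ceil(n_train / batch_size)         # steps per epoch, 1 GPU
steps_multi = math.ceil(n_train / global_batch)        # steps per epoch, 8 GPUs

print(global_batch)   # 1024
print(steps_single)   # 469
print(steps_multi)    # 59
```

So the 8-GPU run performs about one eighth as many weight updates per epoch, each computed from eight times as many samples.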

In [16]:
# train on one gpu
train(batch_size, nb_epoch)

NOT Using Multi-GPU
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
 - 49s - loss: 0.3144 - acc: 0.9281 - val_loss: 1.5856 - val_acc: 0.9184
Epoch 2/12
 - 47s - loss: 0.0782 - acc: 0.9786 - val_loss: 0.0339 - val_acc: 0.9899
Epoch 3/12
 - 47s - loss: 0.0657 - acc: 0.9817 - val_loss: 0.0452 - val_acc: 0.9883
Epoch 4/12
 - 47s - loss: 0.0587 - acc: 0.9846 - val_loss: 0.0425 - val_acc: 0.9891
Epoch 5/12
 - 47s - loss: 0.0546 - acc: 0.9856 - val_loss: 0.0321 - val_acc: 0.9911
Epoch 6/12
 - 47s - loss: 0.0518 - acc: 0.9872 - val_loss: 0.0398 - val_acc: 0.9904
Epoch 7/12
 - 47s - loss: 0.0556 - acc: 0.9860 - val_loss: 0.0395 - val_acc: 0.9919
Epoch 8/12
 - 47s - loss: 0.0570 - acc: 0.9859 - val_loss: 0.0418 - val_acc: 0.9894
Epoch 9/12
 - 47s - loss: 0.0552 - acc: 0.9867 - val_loss: 0.0645 - val_acc: 0.9859
Epoch 10/12
 - 47s - loss: 0.0606 - acc: 0.9864 - val_loss: 0.0519 - val_acc: 0.9893
Epoch 11/12
 - 47s - loss: 0.0577 - acc: 0.9862 - val_loss: 0.0445 - val_acc: 0.9890
Epoc

In [17]:
# train on all available gpus
train(batch_size, nb_epoch, ngpus)

Using Multi-GPU: 8 GPUs
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
 - 15s - loss: 1.0668 - acc: 0.7690 - val_loss: 0.7550 - val_acc: 0.9724
Epoch 2/12
 - 7s - loss: 0.1344 - acc: 0.9610 - val_loss: 0.0839 - val_acc: 0.9751
Epoch 3/12
 - 7s - loss: 0.0905 - acc: 0.9741 - val_loss: 0.0458 - val_acc: 0.9857
Epoch 4/12
 - 7s - loss: 0.0636 - acc: 0.9818 - val_loss: 0.0415 - val_acc: 0.9881
Epoch 5/12
 - 7s - loss: 0.0495 - acc: 0.9849 - val_loss: 0.0317 - val_acc: 0.9907
Epoch 6/12
 - 7s - loss: 0.0415 - acc: 0.9884 - val_loss: 0.0325 - val_acc: 0.9910
Epoch 7/12
 - 7s - loss: 0.0341 - acc: 0.9902 - val_loss: 0.0574 - val_acc: 0.9854
Epoch 8/12
 - 7s - loss: 0.0303 - acc: 0.9910 - val_loss: 0.0342 - val_acc: 0.9911
Epoch 9/12
 - 7s - loss: 0.0241 - acc: 0.9923 - val_loss: 0.0293 - val_acc: 0.9926
Epoch 10/12
 - 7s - loss: 0.0239 - acc: 0.9930 - val_loss: 0.0252 - val_acc: 0.9925
Epoch 11/12
 - 7s - loss: 0.0211 - acc: 0.9935 - val_loss: 0.0300 - val_acc: 0.9917
Epoch 12/1

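Here we can see the quasi-linear speedup in training: using 8 GPUs, each epoch drops to about 7 s, compared with 47 s on 1 GPU. With 8 GPUs the entire run finished in ~1.5 minutes, whereas it took ~9.5 minutes with 1 GPU. Keep in mind that the 8-GPU run uses a global batch size of 1024 (128 per GPU), so the two runs do not perform the same number of weight updates per epoch.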
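
The per-epoch timings in the logs above can be turned into a speedup and a parallel efficiency figure. This is a back-of-the-envelope sketch that uses the steady-state epoch times (47 s and 7 s) and ignores the slower first epoch; the variable names are illustrative.

```python
single_gpu_epoch_s = 47   # steady-state seconds per epoch on 1 GPU
multi_gpu_epoch_s = 7     # steady-state seconds per epoch on 8 GPUs
n_gpus = 8

speedup = single_gpu_epoch_s / multi_gpu_epoch_s
efficiency = speedup / n_gpus   # 1.0 would be perfectly linear scaling

print(round(speedup, 2))     # 6.71
print(round(efficiency, 2))  # 0.84
```

A speedup of roughly 6.7x on 8 GPUs (about 84% efficiency) is what "quasi-linear" means here: close to, but below, the ideal 8x.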

**Note:** *In this case, the single GPU experiment obtained slightly higher accuracy than the multi-GPU experiment. When training any stochastic machine learning model, there will be some variance. If you were to average these results out across hundreds of runs they would be (approximately) the same.*