# Multi-GPU

## What's the point of multi-GPU?

You may need more than one GPU if model training takes up a significant fraction of the pipeline's execution time.
If you have several GPUs, you can use all of them to train the model, which speeds up training.

The `device` parameter lets you train a model on multiple GPUs: a copy of the model is created on each selected GPU.
Each batch is then split across the selected GPUs, and the gradients are computed separately on each GPU and averaged.
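
The split-and-average step can be illustrated with a toy example. This is a minimal sketch in plain Python, not batchflow code: the model `y = w * x`, the helpers `grad` and `split`, and the data are all made up for illustration. It assumes equally sized shards, in which case averaging the per-device gradients gives the same result as computing the gradient over the whole batch.

```python
# Toy data-parallel step for the model y = w * x with squared-error loss.
# Assumption: all shards have the same size, so the mean of the per-shard
# gradients equals the gradient computed over the whole batch.

def grad(w, xs, ys):
    """Mean gradient of (w * x - y) ** 2 with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def split(items, n_parts):
    """Split a list into n_parts contiguous shards of equal size."""
    size = len(items) // n_parts
    return [items[i * size:(i + 1) * size] for i in range(n_parts)]

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
n_devices = 2  # pretend each shard lives on its own GPU

# Each "device" computes a gradient on its own shard of the batch
shard_grads = [grad(w, shard_x, shard_y)
               for shard_x, shard_y in zip(split(xs, n_devices), split(ys, n_devices))]

# The per-device gradients are averaged before the weight update
averaged = sum(shard_grads) / n_devices
full = grad(w, xs, ys)
print(shard_grads, averaged, full)
```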

Initializing a large model on many GPUs may take some time (minutes or even tens of minutes)!

In [1]:
import os
import sys
import warnings

sys.path.append('../../..')
from batchflow import Pipeline, B, C, V, D
from batchflow.opensets import Imagenette320
from batchflow.models.tf import ResNet18

Specify which GPU(s) to use. See the [CUDA documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) for details.

In [2]:
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=3,4,5,6,7

env: CUDA_DEVICE_ORDER=PCI_BUS_ID
env: CUDA_VISIBLE_DEVICES=3,4,5,6,7
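
CUDA renumbers the visible devices starting from 0 in the order they are listed, so device names such as `'GPU:0'` refer to logical indices, not to physical ids. The sketch below shows the mapping for the value set above; `logical_to_physical` is a hypothetical helper written for this illustration, not part of batchflow or CUDA.

```python
def logical_to_physical(visible_devices):
    """Map logical device names to physical ids for a CUDA_VISIBLE_DEVICES value."""
    ids = [item.strip() for item in visible_devices.split(',') if item.strip()]
    return {f'GPU:{index}': physical for index, physical in enumerate(ids)}

mapping = logical_to_physical('3,4,5,6,7')
for name, physical in mapping.items():
    print(name, '-> physical GPU', physical)
```

With five visible devices, the valid logical names run from `'GPU:0'` to `'GPU:4'`.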


## Create a dataset, define a default model config

In [3]:
dataset = Imagenette320(bar=True)

model_config = {'inputs/images/shape': B.image_shape,
                'inputs/labels/classes': D.num_classes,
                'initial_block/inputs': 'images'}

 50%|█████     | 1/2 [00:15<00:15, 15.82s/it]


In [4]:
BATCH_SIZE = 64

# Train model on single GPU

**By default, if one or more GPUs are visible to the pipeline, the model uses only the first one!** Here it is `'GPU:0'`. You can also set the device explicitly, as shown later.

In [5]:
default_config = {'model_config': model_config}

template = (Pipeline()
            .init_variable('loss_history', [])
            .init_model('dynamic', ResNet18, 'conv_nn', config=C('model_config'))
            .resize((320, 320))
            .to_array()
            .train_model('conv_nn', fetches='loss',
                         images=B.images, labels=B.labels,
                         save_to=V('loss_history', mode='a'), use_lock=True))

pipeline_single = template << dataset.train << default_config

Most of the execution time of the next cell is spent on model initialization. Later, we will compare initialization time on one GPU and on multiple GPUs.

In [6]:
%%time
pipeline_single.next_batch(BATCH_SIZE)

CPU times: user 12.2 s, sys: 4.21 s, total: 16.4 s
Wall time: 16.8 s


<batchflow.batch_image.ImagesBatch at 0x7f5b15dbf630>

In [7]:
pipeline_single.run(BATCH_SIZE, shuffle=True, n_epochs=1, bar=True, drop_last=True, prefetch=10)

100%|██████████| 197/197 [01:29<00:00,  2.12it/s]


<batchflow.pipeline.Pipeline at 0x7f5b153dd0f0>

# Add GPUs

We can use the `device` parameter to train the model on 2 GPUs:

In [8]:
model_config.update({'device': ['GPU:1', 'GPU:2']})

The `device` parameter can be a string, a list of strings or a regular expression.

Example:
```python
'device': 'GPU:0'                     # Use only GPU:0
'device': ['GPU:0', 'GPU:1', 'GPU:2'] # Use GPU:0, GPU:1 and GPU:2
'device': 'GPU:*'                     # Use all available GPUs
```

> **BATCH SIZE MUST BE DIVISIBLE BY THE NUMBER OF DEVICES! (WITH MICROBATCHING, THIS APPLIES TO THE MICROBATCH SIZE INSTEAD OF THE BATCH SIZE.)**
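
The rule above can be checked before a long run is started. This is a minimal sketch: `per_device_size` is a hypothetical helper, not a batchflow function, and it assumes the rule exactly as stated (the microbatch size, when given, is the quantity that must be divisible).

```python
def per_device_size(batch_size, n_devices, microbatch=None):
    """Return how many items each device receives per step.

    Hypothetical helper: raises ValueError if the batch (or the microbatch,
    when microbatching is used) cannot be split evenly across devices.
    """
    chunk = batch_size if microbatch is None else microbatch
    if chunk % n_devices != 0:
        raise ValueError(f'size {chunk} is not divisible by {n_devices} devices')
    return chunk // n_devices

print(per_device_size(64, n_devices=2))                # batch of 64 on 2 GPUs
print(per_device_size(64, n_devices=2, microbatch=2))  # microbatch of 2 on 2 GPUs

try:
    per_device_size(64, n_devices=3)
except ValueError as error:
    print('Invalid configuration:', error)
```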

# Train model on multiple GPUs

In [10]:
config_multiple_GPUs = {'model_config': model_config}

pipeline_multi = template << dataset.train << config_multiple_GPUs

In [11]:
%%time
pipeline_multi.next_batch(BATCH_SIZE)

CPU times: user 28.6 s, sys: 3.17 s, total: 31.7 s
Wall time: 30.2 s


<batchflow.batch_image.ImagesBatch at 0x7f595050b6a0>

In [12]:
pipeline_multi.run(BATCH_SIZE, shuffle=True, n_epochs=1, bar=True, drop_last=True, prefetch=10)

100%|██████████| 197/197 [01:01<00:00,  3.06it/s]


<batchflow.pipeline.Pipeline at 0x7f5950496e80>

Training takes about 1 minute on two GPUs. Adding more GPUs can reduce the training time further.

# Multi-GPU and microbatching

### Schematic illustration of how batches are distributed across GPUs

<img src="./img/Batch_microbatch_GPU.png" width="700">

We can use `microbatch` and `device` at the same time. This is useful for huge batches.

Add microbatching.

In [13]:
model_config.update({'device': ['GPU:3', 'GPU:4']})
model_config.update({'microbatch': 2})
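
The way a batch is divided can be sketched with plain lists. The code below assumes that the batch is first cut into microbatches and that each microbatch is then split across devices, which is what the divisibility rule above suggests; `plan` is a hypothetical helper, not batchflow code, and the small sizes are chosen only to keep the output readable.

```python
def plan(batch, microbatch, n_devices):
    """Split a batch into microbatches, then each microbatch into per-device shards."""
    shard = microbatch // n_devices
    steps = []
    for start in range(0, len(batch), microbatch):
        micro = batch[start:start + microbatch]
        steps.append([micro[d * shard:(d + 1) * shard] for d in range(n_devices)])
    return steps

# A batch of 8 items, microbatches of 4 items, 2 devices
steps = plan(list(range(8)), microbatch=4, n_devices=2)
for number, shards in enumerate(steps):
    print('step', number, '->', shards)
```

Under the same assumption, the settings in this tutorial (`BATCH_SIZE = 64`, `microbatch` of 2, two devices) give 32 sequential steps per batch with one item per GPU in each step.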

# Train model with multiple GPUs and microbatching

In [14]:
config_multiple_microbatch = {'model_config': model_config}

pipeline_multi_micro = template << dataset.train << config_multiple_microbatch

In [15]:
%%time
pipeline_multi_micro.next_batch(BATCH_SIZE)

CPU times: user 31.9 s, sys: 3.4 s, total: 35.3 s
Wall time: 32.4 s


<batchflow.batch_image.ImagesBatch at 0x7f55507819b0>

In [16]:
pipeline_multi_micro.run(BATCH_SIZE, shuffle=True, n_epochs=1, bar=True, drop_last=True, prefetch=10)

100%|██████████| 197/197 [02:56<00:00,  1.07it/s]


<batchflow.pipeline.Pipeline at 0x7f5550781b00>

Training finishes without errors, which means that `device` and `microbatch` can be used together. Compare the training times of the model on one GPU, the model on two GPUs and the last model. Adding a second GPU to the first model (which gives the second model) reduces the training time by about 1.5 times!
Training time does not drop by a factor of 2 because of the overhead of exchanging information between GPUs.
Adding `microbatch` to the second model increases the training time (more about that in the [01_microbatch tutorial](./01_microbatch.ipynb)).
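
The comparison can be made concrete with the epoch times shown by the progress bars above (01:29, 01:01 and 02:56). The snippet only does the arithmetic; `to_seconds` is a small helper written for this illustration, and the numbers come from a single run, so treat them as indicative.

```python
def to_seconds(mm_ss):
    """Convert an 'MM:SS' string from a progress bar to seconds."""
    minutes, seconds = mm_ss.split(':')
    return int(minutes) * 60 + int(seconds)

single = to_seconds('01:29')       # one GPU
multi = to_seconds('01:01')        # two GPUs
multi_micro = to_seconds('02:56')  # two GPUs with microbatching

speedup = single / multi           # gain from adding the second GPU
efficiency = speedup / 2           # fraction of the ideal 2x speedup
slowdown = multi_micro / multi     # cost of adding microbatching

print(f'speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}, '
      f'microbatch slowdown: {slowdown:.2f}x')
```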

As stated at the beginning, model initialization on multiple GPUs takes longer than on one GPU (see the cells with the `%%time` magic command).

If you use multiple GPUs, you can save tens of hours of model training, or even more!

The next tutorial is about [different training procedures](./03_train_steps.ipynb).