
In [1]:
# Add the root directory of this bootcamp to the import path
import os.path as osp
import sys

sys.path.append(osp.abspath('..'))

# 1. Scaling UP (mono-node)

A note on GPU training: it is natural to assume that a GPU is always better than a CPU, but that really depends on the workflow. A GPU only pays off when it is kept busy, which means feeding it enough data per batch to saturate the GPU memory.

Now let's go single-node multi-GPU. The same model is pushed to all available devices, each of which will:
1. Perform a forward pass on its own batch of data
2. Compute the loss and perform a backward pass, including the weight update
3. Synchronize its weights with all other devices before the next pass
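
The steps above can be sketched on a toy problem without any GPU. This is a minimal simulation, not the bootcamp's training script: the one-weight linear model, the `grad_mse` and `data_parallel_step` helpers, and the plain SGD update are all assumptions chosen for illustration. Each simulated device updates its own copy of the weight on its shard, then the copies are averaged.

```python
# Toy simulation of one synchronous data-parallel step (pure Python).
# Model: y = w * x, trained with a mean-squared-error loss and plain SGD.

def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_step(w, global_batch, num_devices, lr):
    """Shard the batch, update one replica per device, then average the weights."""
    shard_size = len(global_batch) // num_devices
    shards = [global_batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_devices)]
    # Forward pass, backward pass and weight update, done locally per device.
    local_weights = [w - lr * grad_mse(w, shard) for shard in shards]
    # Synchronization: every replica continues from the averaged weight.
    return sum(local_weights) / num_devices

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # generated with w = 3.0
w_parallel = data_parallel_step(0.0, data, num_devices=2, lr=0.01)
w_single = 0.0 - 0.01 * grad_mse(0.0, data)  # same step on a single device
print(w_parallel, w_single)
```

With equally sized shards and a single plain SGD step, averaging the locally updated weights gives the same result as one step on the full batch, up to floating-point rounding. That equivalence does not hold in general once replicas take several local steps or use a stateful optimizer.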

#### **Cleanup**

You might need to clean up ghost runs if something fails and breaks the training logic. You can do this in one of two ways:
* If you are running inside the same process (PID) as the training launched by `python train.py`:
<code>
import gc, torch; gc.collect(); torch.cuda.empty_cache()
</code>
* Otherwise, kill the job that is still running on the GPU. Get the ghost job's PID with the command `nvitop`

In [2]:
!nvitop

Wed Feb 16 14:06:27 2022
╒═════════════════════════════════════════════════════════════════════════════╕
│ NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ Volatile Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╡
│   0  Tesla P100-PCIE...  Off  │ 00000000:00:04.0 Off │                    0 │
│ MAX   41C    P0    27W / 250W │      2MiB / 16281MiB │      0%      Default │
├───────────────────────────────┼──────────────────────┼──────────────────────┤
│   1  Tesla P100-PCIE...  Off  │ 00000000:00:05.0 Off │                    0 │
│ MAX   42C    P0    27W / 250W │      2MiB / 16281MiB │      0%      D

In [4]:
# Replace the placeholder PID 99999999 below with the ghost job's PID reported by nvitop
!sudo kill -15 99999999

kill: (99999999): No such process
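
If reading the PID off the interactive `nvitop` screen is inconvenient, the list of GPU processes can also be obtained programmatically. The sketch below swaps in `nvidia-smi` with its `--query-compute-apps` CSV output instead of `nvitop`; the helper names `parse_compute_apps` and `list_gpu_processes` are hypothetical, and the exact fields available may depend on the driver version.

```python
import subprocess

# Assumption: nvidia-smi is on the PATH and supports these query fields.
QUERY = ["nvidia-smi",
         "--query-compute-apps=pid,used_memory,process_name",
         "--format=csv,noheader,nounits"]

def parse_compute_apps(text):
    """Parse 'pid, used_memory, process_name' CSV lines into tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        pid, mem, name = (field.strip() for field in line.split(",", 2))
        # Memory may be reported as '[N/A]' on some setups.
        rows.append((int(pid), int(mem) if mem.isdigit() else None, name))
    return rows

def list_gpu_processes():
    """Run nvidia-smi and return (pid, used_memory_mib, process_name) tuples."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(out.stdout)
```

Calling `list_gpu_processes()` on a machine with an NVIDIA driver returns the candidate PIDs; check which one belongs to the ghost run before passing it to `kill`.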


## 1.1. Achieving Data Parallelism

#### **1.1.1. Strategies**

Data parallelism (DP) consists of replicating the model and training each replica on a different mini-batch of size `batch_size // num_parallel_instances`. Each replica converges differently on its own mini-batch, so the weights are collected and usually averaged after `p` batches, then synchronized across all instances for the next round of passes.
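
The arithmetic in the paragraph above is easy to check by hand. Here is a small sketch, assuming hypothetical helpers `split_batch` and `sync_steps` that are not part of the bootcamp scripts: the first shows what the floor division does to the per-instance batch size, the second lists the batches after which a synchronization would happen for a given `p`.

```python
def split_batch(batch_size, num_parallel_instances):
    """Return (per-instance batch size, samples left over by the floor division)."""
    per_instance = batch_size // num_parallel_instances
    leftover = batch_size - per_instance * num_parallel_instances
    return per_instance, leftover

def sync_steps(num_batches, p):
    """1-based batch indices after which weights are averaged and synchronized."""
    return [step for step in range(1, num_batches + 1) if step % p == 0]

print(split_batch(512, 2))   # (256, 0)
print(split_batch(512, 3))   # (170, 2)
print(sync_steps(10, 4))     # [4, 8]
```

Note that when `batch_size` is not a multiple of the number of instances, the floor division leaves a few samples unassigned, so picking a divisible batch size avoids silently shrinking the effective batch.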

#### **0. Over CPU**

Let's launch a reference training on CPU. Take a look at the `_trainup_cpu.py` script.

In [None]:
!python _trainup_cpu.py --batch-size 512 \
                        --num-processes 2 > ../logs/trainup_cpu.stdout

#### **1. Launching a training on GPU**

In [15]:
!cat _trainup_gpu.py

# MIT License
#
# Copyright (c) 2022 alxyok
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISI

Let's launch the training with 2 nodes and 1 GPU per node. Since we are actually on a single machine, each "node" here designates an independent process.

In [None]:
%%bash
python _trainup_gpu.py --batch-size 512 \
                       --lr=1e-4 \
                       --strategy 'ddp' \
                       --gpus 1 \
                       --num-nodes 2 > ../logs/trainup_gpu_ddp.stdout

In [5]:
[(i, i**2) for i in range(6)]

[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16), (5, 25)]

#### **1.1.2. Horovod**

With a simple change in the Trainer options, you can rely on the Horovod backend to perform the computations. Parallelism is achieved through SPMD with MPI: one process per GPU, potentially distributed across multiple nodes, with collective computations handled by the process of rank 0.

There is no need to adjust the learning rate `lr` this time: Horovod takes care of that under the hood.

In [None]:
%%bash
python _trainup_gpu.py --batch-size 512 \
                       --strategy 'horovod' \
                       --gpus 1 \
                       --num-nodes 2 > ../logs/trainup_gpu_horovod.stdout
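
When the learning rate does have to be adjusted by hand, one common heuristic is linear scaling: multiply the single-process learning rate by the number of processes. The sketch below illustrates that heuristic only. The `scaled_lr` helper is hypothetical, and whether the Horovod backend applies exactly this rule is an assumption that the text above does not confirm.

```python
def scaled_lr(base_lr, world_size):
    """Linear scaling heuristic: learning rate grows with the number of processes."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return base_lr * world_size

# Example: a base learning rate tuned for one process, run on 2 processes.
lr_two_processes = scaled_lr(1e-4, 2)
print(lr_two_processes)
```

The rationale is that more processes mean a larger effective batch per update, so a proportionally larger step keeps the training dynamics roughly comparable. It remains a heuristic and usually needs validating on the actual model.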

## **1.2. A Note on Model Parallelism**

As a rule of thumb, model parallelism is worth considering from around 500M parameters. There is no material on it here; just know that it exists and that it is a complex subject that would require an entire session. Lightning comes standard with a series of distribution strategies, each with a specific implementation related to the network that first introduced it.
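
To get a feel for why the parameter count matters, here is a back-of-the-envelope memory estimate. It is a rough sketch under stated assumptions: fp32 training with an Adam-style optimizer, counting 16 bytes per parameter (4 for the weight, 4 for the gradient, 8 for the two optimizer moments) and ignoring activations, which often dominate. The `training_memory_gib` helper is hypothetical.

```python
def training_memory_gib(num_params, bytes_per_param=16):
    """Rough memory for weights + gradients + two Adam moments, all in fp32.

    Activations, buffers and framework overhead are NOT included.
    """
    return num_params * bytes_per_param / 2**30

print(round(training_memory_gib(500_000_000), 2))  # 7.45
```

Under these assumptions, a 500M-parameter model already takes about half of a 16 GiB card before a single activation is stored, which is where splitting the model itself starts to become attractive.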

Refer to the Lightning documentation for more information.