<img src="./images/DLI_Header.png" style="width: 400px;">

# 2.0 Multi-GPU Training Strategies

In this notebook, we will introduce the basics of distributed training strategies and experiment with [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), NVIDIA's library for training transformer-based language models.


### Learning Objectives

The goals of this notebook are to:
* Understand the mechanisms behind distributed training strategies
* Run a simple distributed training job using Megatron-LM scripts on 1 node with data and tensor parallel distribution
* Understand the basic outputs of Megatron-LM logs

**[2.1 Introduction to Distributed Training Strategies](#2.1-Introduction-to-Distributed-Training-Strategies)<br>**
**[2.2 Single GPU Training Execution of Megatron-LM GPT Pretraining](#2.2-Single-GPU-Training-Execution-of-Megatron-LM-GPT-Pretraining)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[2.2.1 Check The GPT pretraining Script](#2.2.1-Check-The-GPT-pretraining-Script)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.2.2 Run the GPT pretraining Script](#2.2.2-Run-the-GPT-pretraining-Script)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.2.3 Understanding Megatron-LM Execution Logs](#2.2.3-Understanding-Megatron-LM-Execution-Logs)<br>
**[2.3 Multi-GPU Training Execution of Megatron-LM GPT Pretraining](#2.3-Multi-GPU-Training-Execution-of-Megatron-LM-GPT-Pretraining)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[2.3.1 Exercise: Megatron-LM GPT pretraining execution on 2 GPUs](#2.3.1-Exercise:-Megatron-LM-GPT-pretraining-execution-on-2-GPUs)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.3.2 Understanding Multi-GPU Megatron-LM Execution Logs](#2.3.2-Understanding-Multi-GPU-Megatron-LM-Execution-Logs)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.3.3 Model Distribution Considerations](#2.3.3-Model-Distribution-Considerations)<br>

### Cancel Previous Running/Pending Jobs

Before moving on, check that no jobs are still running or waiting on the SLURM queue. Let's check the SLURM jobs queue by executing the following cell:

In [None]:
# Check the SLURM jobs queue 
!squeue

If there are still jobs running or pending, execute the following cell to cancel all the user's jobs using the `scancel` command. 

In [None]:
# Cancel all jobs belonging to the current user
!scancel -u $USER

# Check the SLURM jobs queue again (it should be empty, or the ST (state) column should show CG, i.e. completing)
!squeue

---
# 2.1 Introduction to Distributed Training Strategies
In distributed training mode, the goal is to split the training process across multiple machines. The most commonly used distribution strategies are **data** and **model** parallelism.


## Data Distribution 

Neural networks are usually trained with [Stochastic Gradient Descent algorithms](https://developer.nvidia.com/blog/a-data-scientists-guide-to-gradient-descent-and-backpropagation-algorithms/), which split the dataset into batches that are processed sequentially. In the forward pass, the feature maps are computed; in the backward pass, gradients are computed and averaged over the batch to determine the parameter updates. The next batch of data is processed only once the model's parameters have been updated.
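
The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from Megatron-LM: the one-parameter-pair linear model, the toy dataset, and the function name `sgd_linear_fit` are all assumptions chosen to keep the example self-contained.

```python
import random

def sgd_linear_fit(xs, ys, batch_size=2, lr=0.05, epochs=500, seed=0):
    """Fit y ~ w * x + b with mini-batch SGD on a mean-squared-error loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # Process the dataset batch by batch, sequentially.
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Forward pass: prediction errors for the samples in the batch.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Backward pass: gradients averaged over the batch.
            grad_w = sum(2 * e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(2 * e for e in errors) / len(batch)
            # Parameter update, applied before the next batch is processed.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # toy data generated by y = 2x + 1
w, b = sgd_linear_fit(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

The fitted values should approach the slope and intercept used to generate the toy data.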

<img src="images/data_parallel.png" width="600"/>

In data parallelism mode, the data is split across multiple machines, and each shard is processed by a copy of the same neural network hosted on that machine. Gradients are averaged across all machines, and the resulting update is applied to every copy so that the replicas stay identical.
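
To see why averaging works, the sketch below compares the gradient computed on a whole batch with the average of gradients computed on two equally sized shards. The single-weight model and the helper `mse_gradient` are illustrative assumptions; the equality relies on the shards having the same size.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Single-worker reference: gradient over the whole global batch.
full_grad = mse_gradient(w, xs, ys)

# Data parallelism: two workers hold identical copies of w and each
# computes a gradient on its own equally sized shard of the batch.
num_workers = 2
shard = len(xs) // num_workers
local_grads = [
    mse_gradient(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
    for r in range(num_workers)
]

# Averaging the local gradients recovers the full-batch gradient,
# so every replica applies the same update.
averaged_grad = sum(local_grads) / num_workers
print(full_grad, averaged_grad)
```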

With more processors (that is, a higher degree of data parallelism), fewer iterations are needed to cover an epoch (the entire dataset), which speeds up training. Because each update is computed from a larger number of samples (the global batch size increases), the gradient estimate is also less noisy, which can reduce the number of iterations needed to converge. The time taken per batch on each worker stays roughly the same, plus the added cost of gradient exchange communication.

There are different strategies for implementing the exchange of gradients: 
- **Centralized**: a server machine is responsible for distributing data chunks, accumulating gradients, and updating the model parameters.
- **Decentralized**: each worker exchanges gradients with the others, then aggregates them and updates the model's parameters locally. Because workers may compute at different speeds, parameters can be updated synchronously, with every worker waiting at a synchronization point, or with a relaxed strategy that lets workers operate on outdated parameters. The relaxed strategy may introduce inconsistency during training.
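
As an illustration of the decentralized approach, here is a small simulation of ring all-reduce, one common scheme in which each worker only talks to its neighbour. It is a sketch under simplifying assumptions (one scalar chunk per worker, messages simulated with Python lists) and is not taken from any of the libraries mentioned in this notebook.

```python
def ring_all_reduce(worker_grads):
    """Sum gradient vectors across workers arranged in a ring.

    worker_grads[i] is the gradient held by worker i; for simplicity each
    vector has exactly one element (chunk) per worker.
    """
    n = len(worker_grads)
    if any(len(g) != n for g in worker_grads):
        raise ValueError("each gradient must have one chunk per worker")
    bufs = [list(g) for g in worker_grads]

    # Phase 1 (reduce-scatter): after n - 1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so that the sends happen simultaneously.
        sent = [((i - step) % n, bufs[i][(i - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sent):
            bufs[(i + 1) % n][chunk] += value

    # Phase 2 (all-gather): the completed chunks travel around the ring.
    for step in range(n - 1):
        sent = [((i + 1 - step) % n, bufs[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sent):
            bufs[(i + 1) % n][chunk] = value
    return bufs

grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
summed = ring_all_reduce(grads)
averaged = [[value / len(grads) for value in buf] for buf in summed]
print(averaged)
```

Every worker ends up with the same averaged gradient without any central server, at the price of `2 * (n - 1)` communication steps.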

Several libraries offer data parallelism implementations, such as [Horovod](https://github.com/horovod/horovod), which is compatible with several deep learning frameworks including TensorFlow, Keras, PyTorch, and Apache MXNet. [NVIDIA APEX](https://nvidia.github.io/apex/) is a PyTorch extension library that offers utilities to streamline distributed training and mixed precision.




## Model Distribution Strategies

Model parallelism is the process of splitting a model’s parameters across multiple machines. This allows training bigger models that do not fit on 1 GPU, at the cost of additional communication to exchange feature maps between devices.

We can distinguish 2 types of model distributions:

### Pipeline Parallelism


<img src="images/pipeline_parallel1.png" width="600"/>

Pipeline parallelism is the process of cutting a model sequentially into pieces and assigning each piece to a specific worker: for instance, layers 1,2 on device_1, layers 3,4 on device_2, and so on.

There are different pipelining strategies, such as the micro-batch pipeline parallelism of [GPipe](https://arxiv.org/pdf/1811.06965.pdf), an optimized implementation of model pipelining that minimizes the time machines spend waiting for their peers to communicate their outputs. It partitions each batch into micro-batches so that different machines can process different micro-batches simultaneously.

![title](images/pipeline_parallel.png)
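
The effect of micro-batching on idle time can be sketched with a simplified timeline. The model below is an assumption made for illustration: it covers the forward pass only and assumes every stage takes one time step per micro-batch. The function names are hypothetical.

```python
def forward_schedule(num_stages, num_micro_batches):
    """Return a timeline: rows[s][t] is the micro-batch that stage s
    processes at time step t, or None when the stage is idle."""
    total_steps = num_stages + num_micro_batches - 1
    rows = []
    for stage in range(num_stages):
        row = []
        for t in range(total_steps):
            mb = t - stage  # a micro-batch reaches stage s after s time steps
            row.append(mb if 0 <= mb < num_micro_batches else None)
        rows.append(row)
    return rows

def idle_fraction(num_stages, num_micro_batches):
    """Fraction of (stage, time step) slots in which a stage is idle."""
    total_steps = num_stages + num_micro_batches - 1
    return (num_stages - 1) / total_steps

for stage, row in enumerate(forward_schedule(num_stages=4, num_micro_batches=4)):
    print(f"device_{stage + 1}:", " ".join("." if mb is None else str(mb) for mb in row))

for m in (1, 4, 16):
    print(f"micro-batches={m:2d}  idle fraction={idle_fraction(4, m):.2f}")
```

Under these assumptions, the idle fraction shrinks as the number of micro-batches grows, which is the motivation for splitting each batch.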

Instead of a single sequential set of layers per device, the [Interleaved pipeline parallelism](https://github.com/NVIDIA/Megatron-LM/) assigns multiple pipeline stages per device, each with less computation. 
For instance, layers 1,2 and 9,10 on device_1, layers 3,4 and 11,12 on device_2, and so on. 
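
The two layer placements can be reproduced with a small helper. This is an illustrative sketch: the function `assign_layers` is hypothetical, and the totals of 4 devices and 8 or 16 layers are assumptions chosen to match the "and so on" examples in the text.

```python
def assign_layers(num_layers, num_devices, chunks_per_device=1):
    """Map each device (1-based) to the 1-based layer indices it hosts.

    chunks_per_device=1 gives plain pipeline parallelism (one contiguous
    block per device); larger values give an interleaved assignment.
    """
    num_chunks = num_devices * chunks_per_device
    if num_layers % num_chunks != 0:
        raise ValueError("num_layers must be divisible by num_devices * chunks_per_device")
    layers_per_chunk = num_layers // num_chunks
    placement = {device: [] for device in range(1, num_devices + 1)}
    for chunk in range(num_chunks):
        device = chunk % num_devices + 1  # chunks are dealt out round-robin
        first = chunk * layers_per_chunk + 1
        placement[device].extend(range(first, first + layers_per_chunk))
    return placement

# Plain pipeline parallelism: layers 1,2 on device_1, layers 3,4 on device_2, ...
print(assign_layers(num_layers=8, num_devices=4))

# Interleaved: layers 1,2 and 9,10 on device_1, layers 3,4 and 11,12 on device_2, ...
print(assign_layers(num_layers=16, num_devices=4, chunks_per_device=2))
```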


### Tensor Parallelism

<img src="images/tensor_parallel1.png" width="500"/>


Tensor parallelism is the process of dividing matrix operations across workers. [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/) is NVIDIA's open-source library for efficient multi-node pre-training of transformer-based models such as GPT, BERT, and T5 using mixed precision. In Megatron-LM's transformer implementation, the self-attention and MLP operations are divided into parallel blocks at a total cost of 4 all-reduce operations per transformer layer (2 in the forward pass and 2 in the backward pass).
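
The idea behind splitting the MLP block can be checked with tiny matrices in plain Python. This is a sketch, not Megatron-LM code: it assumes the first weight matrix is split by columns and the second by rows, uses ReLU instead of the real activation so that the integer arithmetic stays exact, and simulates the all-reduce with a sum over Python lists.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0, v) for v in row] for row in m]

def split_columns(m, parts):
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]

def split_rows(m, parts):
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]

X = [[1, -2], [3, 4]]                  # activations: 2 tokens x hidden size 2
A = [[1, 0, -1, 2], [2, 1, 0, -1]]     # first MLP weight: 2 x 4
B = [[1, 2], [0, 1], [-1, 0], [2, 1]]  # second MLP weight: 4 x 2

# Reference: the whole MLP computed on one device.
reference = matmul(relu(matmul(X, A)), B)

# Tensor parallel over 2 workers: A is split by columns, B by rows.
partials = []
for A_part, B_part in zip(split_columns(A, 2), split_rows(B, 2)):
    hidden = relu(matmul(X, A_part))   # element-wise op needs no communication
    partials.append(matmul(hidden, B_part))

# A single all-reduce (sum) combines the partial outputs in the forward pass.
combined = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
print(reference == combined)
```

Each worker stores only half of `A` and half of `B`, and the only communication in this forward pass is the final sum.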

<img src="images/tensor_parallel.png" width="250"/>

Megatron-LM is built on top of PyTorch and integrates data, pipeline, and tensor parallelism for pre-training GPT and BERT transformer architectures using mixed precision. In this lab, we will use the distribution strategies implemented in the Megatron-LM library. In particular, we will focus on single-node execution on 1 or 2 GPUs; in the 2-GPU case, we will work with data, tensor, and pipeline parallelism. Several examples can be found in the [Megatron-LM repository](https://github.com/NVIDIA/Megatron-LM/tree/main/examples). 


---
# 2.2 Single GPU Training Execution of Megatron-LM GPT Pretraining 

Let's first get familiar with a simple Megatron-LM GPT execution script. 

For distributed training mode, scripts use the [PyTorch distributed launcher](https://pytorch.org/docs/stable/distributed.html). The launcher is invoked by passing the module flag `-m torch.distributed.launch` to Python. 

Resources are configured with the arguments `--nnodes` and `--nproc_per_node`, which specify the number of nodes and the number of GPUs per node, respectively.
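
As a rough sketch of how these arguments fit together, the helper below assembles the launcher command as an argument list and derives the number of processes it starts. The function `build_launcher` and the `localhost` address are hypothetical; the flags are the ones used by the lab's pretraining scripts.

```python
def build_launcher(nnodes, nproc_per_node, master_addr, master_port):
    """Return the launcher command as an argument list and the world size."""
    command = [
        "python", "-u", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(nproc_per_node),
        "--nnodes", str(nnodes),
        "--master_addr", master_addr,
        "--master_port", str(master_port),
    ]
    world_size = nnodes * nproc_per_node  # one process per GPU
    return command, world_size

# Single node, single GPU; "localhost" is a placeholder address.
command, world_size = build_launcher(1, 1, "localhost", 6000)
print(" ".join(command))
print("world size:", world_size)
```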

With the Megatron-LM library, there are 2 distributed data parallel implementations: 
- `local`, which performs the gradient all-reduce at the end of the backpropagation step 
- `torch`, the PyTorch distributed data parallel wrapper, which overlaps gradient reduction with the backpropagation computation (more efficient at larger model sizes).

In this section, we will run a simple Megatron-LM GPT pretraining on 1 GPU by executing the Megatron-LM `pretrain_gpt.py` script with the corresponding arguments.

<img src="images/Megatron_run.PNG" width="600"/>

We can specify the distributed data parallel implementation using the argument `--DDP-impl`.

Distributed strategies are configured with the arguments `--tensor-model-parallel-size` and `--pipeline-model-parallel-size`.
Learn more about the distributed strategies arguments in the [Megatron-LM documentation](https://github.com/NVIDIA/Megatron-LM#distributed-pretraining).

We have prepared the script [pretrain_gpt_1GPU.sh](/dli/code/pretrain_gpt_1GPU.sh) that will run GPT pretraining on only 1 GPU (with no distribution strategy applied).

This script assumes that the compute resources are already allocated. Thus, for the execution, we will need to first allocate the required GPU by connecting to a worker node in an interactive session.

## 2.2.1 Check The GPT pretraining Script

Let's have a look at the script before allocating the resources and executing it. 

Notice the model architecture and training arguments. 

In [None]:
# Have a look at the Megatron-LM GPT pretraining script for 1 GPU
!cat /dli/code/pretrain_gpt_1GPU.sh


## 2.2.2 Run the GPT pretraining Script

Now, let's run the `pretrain_gpt_1GPU.sh` script in an interactive session. To do so, follow these 3 steps:
1. Launch a terminal session
2. Run an interactive session by executing `srun -N 1 --pty /bin/bash`
3. Run the Megatron GPT-3 pretraining on 1 GPU by executing `bash ./code/pretrain_gpt_1GPU.sh`


<img src="images/interactive_launch0.png" width="1050"/>

Run the following cell to generate a link to open a terminal session and the instructions to run an interactive session. Then, submit a GPT pretraining job on 1 GPU.

In [None]:
%%html

<pre>
   Step 1: Open a terminal session by following the <a href="" data-commandlinker-command="terminal:create-new">Terminal link</a>
   Step 2: Run an interactive session: <font color="green">srun -N 1 --pty /bin/bash</font>
   Step 3: Run the Megatron GPT-3 pretraining on 1 GPU: <font color="green">bash ./code/pretrain_gpt_1GPU.sh</font>
</pre>


While the GPT pretraining on 1 GPU is running, we can check the SLURM queue by running this cell:

In [None]:
# Check the SLURM queue
!squeue

We can also check the GPUs using the `nvidia-smi` command. We should see only GPU 0 utilized, as shown in the figure below. Note that the first time Megatron-LM runs, the code needs about 6 minutes to compile. Until then, you will not see any GPU activity.

<img src="images/1N_1gpu_utilization.png" width="650"/>

In [None]:
# Check GPU utilization on the master node after Megatron-LM is compiled
!sleep 6m
!nvidia-smi

## 2.2.3 Understanding Megatron-LM Execution Logs

As specified in the `pretrain_gpt_1GPU.sh` script, the world size reported by Megatron-LM at startup should be as follows:
```
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
```

![title](images/interactive_launch1.png)

To understand the performance of the GPT pretraining, we can check the [log file](./megatron/logs/log_1GPU.txt) generated during the execution.

In [None]:
# Check the Megatron GPT3 pretraining logs.
!grep iteration /dli/megatron/logs/log_1GPU.txt

The extracted output should be similar to:

```
 iteration      100/     100 | consumed samples:          200 | elapsed time per iteration (ms): 271.6 | learning rate: 5.822E-05 | global batch size:     2 | lm loss: 7.587920E+00 | loss scale: 1.0 | grad norm: 1.468 | number of skipped iterations:   0 | number of nan iterations:   0 |
```   

In this example, each iteration takes 271.6 ms to process 2 samples (the global batch size).
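
To turn such a log line into a throughput figure, one option is to split it on the `|` separators. The parser below is a minimal sketch written for this notebook's example line; the helper name `parse_iteration_line` is hypothetical, and the log format may differ between Megatron-LM versions.

```python
LOG_LINE = (
    " iteration      100/     100 | consumed samples:          200 | "
    "elapsed time per iteration (ms): 271.6 | learning rate: 5.822E-05 | "
    "global batch size:     2 | lm loss: 7.587920E+00 | loss scale: 1.0 | "
    "grad norm: 1.468 | number of skipped iterations:   0 | "
    "number of nan iterations:   0 |"
)

def parse_iteration_line(line):
    """Extract the 'name: value' fields of an iteration log line as floats."""
    fields = {}
    for part in line.split("|"):
        name, sep, value = part.rpartition(":")
        if sep:  # skip segments without a 'name: value' pair
            fields[name.strip()] = float(value)
    return fields

fields = parse_iteration_line(LOG_LINE)
seconds_per_iteration = fields["elapsed time per iteration (ms)"] / 1000
throughput = fields["global batch size"] / seconds_per_iteration
print(f"{throughput:.2f} samples/s")  # prints "7.36 samples/s"
```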


Great! Before moving on, let's release some disk space by deleting the unnecessary checkpoints generated by the previous execution and cancel the remaining interactive session.

In [None]:
# Clean the checkpoints 
!rm -rf /dli/megatron/checkpoints/*  

---

# 2.3 Multi-GPU Training Execution of Megatron-LM GPT Pretraining

Let's now run the same training job using the 2 GPUs available in the interactive session. 

When using `torch.distributed.launch` to launch jobs on 2 GPUs, we need to set the number of processes per node with `--nproc_per_node 2`. 

The first distribution strategy we will experiment with is data parallelism, which Megatron-LM applies by default when several GPUs are available.

In the previous execution on one single GPU, the batch size processed by the GPU was 2 (set by `--micro-batch-size`) which also corresponds to the global batch size (set by `--global-batch-size`). 


## 2.3.1 Exercise: Megatron-LM GPT pretraining execution on 2 GPUs
Let's configure the new Megatron-LM GPT pretraining execution on 2 GPUs using data parallel distribution by replacing the "FIXME" placeholders in the following cell. 

To use 2 GPUs, we can keep the micro batch size per GPU at 2 and thus double the global batch size to 4. If you get stuck, feel free to look at the [solution](solutions/ex2.3.ipynb).
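
The relationship between these sizes can be written down explicitly. The sketch below assumes the usual convention that the global batch size equals the micro batch size times the data-parallel size times the number of micro-batches each replica accumulates per iteration; the function name is hypothetical.

```python
def gradient_accumulation_steps(global_batch_size, micro_batch_size, data_parallel_size):
    """Number of micro-batches each replica processes per iteration."""
    samples_per_step = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            "global batch size must be a multiple of "
            "micro batch size x data-parallel size"
        )
    return global_batch_size // samples_per_step

# Single-GPU run: micro batch 2, global batch 2, one replica.
print(gradient_accumulation_steps(2, 2, 1))

# 2-GPU data-parallel run: micro batch 2, global batch 4, two replicas.
print(gradient_accumulation_steps(4, 2, 2))
```

In both runs each replica processes a single micro-batch per iteration, so the per-GPU work stays the same while the global batch doubles.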

Please notice that we will change the logfile name for each run (*log_2GPU.txt* in the following example).

In [None]:
%%writefile /dli/code/pretrain_gpt_2GPU.sh

#!/bin/bash

# Distributed training args
NNODES=1
GPUS_PER_NODE=#FIXME         # <--- CHANGE HERE
TP_SIZE=1
PP_SIZE=1

# Distributed training 
MICRO_BATCH_SIZE=2
GLOBAL_BATCH_SIZE=#FIXME    # <--- CHANGE HERE

# Model architecture 
NLAYERS=12
NHIDDEN=768
NHEADS=32
SEQ_LEN=1024
VOCAB_SIZE=50257

# Data Paths
VOCAB_FILE=/dli/data/GPT-2_assets/gpt2-vocab.json
MERGE_FILE=/dli/data/GPT-2_assets/gpt2-merges.txt
DATA_PATH=/dli/data/GPT-2_assets/my-gpt2_text_document

DATA_OUTPUT_PATH=/dli/megatron/checkpoints/test
CHECKPOINT_PATH=/dli/megatron/checkpoints
TENSORBOARD_PATH=/dli/megatron/tensorboard
LOGS_PATH=/dli/megatron/logs
NAME="log_2GPU"        

# SLURM args
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000


OPTIMIZER_ARGS=" \
            --optimizer adam \
            --adam-beta1 0.9 \
            --adam-beta2 0.95 \
            --adam-eps 1e-8 \
            --lr 6e-5 \
            --min-lr 6e-6 \
            --lr-decay-style cosine \
            --lr-decay-iters 800 \
            --lr-warmup-fraction .01 \
            --clip-grad 1.0 \
            --weight-decay 1e-1 \
            --exit-duration-in-mins 1190 \
            "

GPT_ARGS=" \
            --num-layers $NLAYERS \
            --hidden-size $NHIDDEN \
            --num-attention-heads $NHEADS \
            --seq-length $SEQ_LEN \
            --max-position-embeddings $SEQ_LEN \
            --micro-batch-size $MICRO_BATCH_SIZE \
            --global-batch-size $GLOBAL_BATCH_SIZE \
            --train-iters 100 \
            --vocab-file $VOCAB_FILE \
            --merge-file $MERGE_FILE \
            --init-method-std 0.006 \
            $OPTIMIZER_ARGS \
            $EXIT_OPTS \
            "

OUTPUT_ARGS=" \
            --log-interval 10 \
            --save-interval 300 \
            --eval-interval 1000 \
            --eval-iters 10 \
            --tensorboard-dir $TENSORBOARD_PATH \
            --tensorboard-queue-size 1 \
            --log-timers-to-tensorboard \
            --log-batch-size-to-tensorboard \
            --log-validation-ppl-to-tensorboard \
            "
export LAUNCHER="python -u -m torch.distributed.launch \
            --nproc_per_node $GPUS_PER_NODE \
            --nnodes $NNODES \
            --master_addr $MASTER_ADDR \
            --master_port $MASTER_PORT \
            "

export CMD=" \
            /dli/megatron/Megatron-LM/pretrain_gpt.py \
            --tensor-model-parallel-size $TP_SIZE \
            --pipeline-model-parallel-size $PP_SIZE \
            $GPT_ARGS \
            $OUTPUT_ARGS \
            --save $CHECKPOINT_PATH \
            --data-path $DATA_PATH \
            --data-impl mmap \
            --split 949,50,1 \
            --distributed-backend nccl \
            "

bash -c '$LAUNCHER  $CMD' 2>&1 | tee -a $LOGS_PATH/$NAME.txt


Now let's run this script in an interactive session. To do so, follow these 3 steps:
1. Launch a terminal session
2. Run an interactive session by executing `srun -N 1 --pty /bin/bash`
3. Run the Megatron GPT-3 pretraining on 2 GPUs by executing `bash ./code/pretrain_gpt_2GPU.sh`

Run the following cell to get the link to open a terminal session and the instructions to run an interactive session. Then, submit a pretraining job on 2 GPUs.

In [None]:
%%html

<pre>
   Step 1: Open a terminal session by following the <a href="" data-commandlinker-command="terminal:create-new">Terminal link</a>
   Step 2: Run an interactive session: <font color="green">srun -N 1 --pty /bin/bash</font>
   Step 3: Run the Megatron GPT-3 pretraining on 2 GPUs: <font color="green">bash ./code/pretrain_gpt_2GPU.sh</font>
</pre>

While the GPT pretraining on 1 node and 2 GPUs is running, we can check the SLURM queue.

In [None]:
# Check the SLURM queue
!squeue

We can also check the GPUs using the `nvidia-smi` command. We should see GPUs 0 and 1 utilized, as shown in the figure below.

<img src="images/1N_2gpus_utilization.png" width="650"/>

In [None]:
# Check GPU utilization on the master node
!nvidia-smi

## 2.3.2 Understanding Multi-GPU Megatron-LM Execution Logs

Let's have a look at the execution logs:

<img src="images/interactive_launch2.png" width="900"/>

The world size reported by Megatron-LM at startup should be as follows: 
```
world size: 2, data-parallel-size: 2, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
```
As we have 2 GPUs available, the distributed strategy executed by default is data parallelism, meaning that the model is copied onto both GPUs and each copy processes different data batches. 

To understand the performance of the GPT pretraining on 2 GPUs, we can check the [log file](/dli/megatron/logs/log_2GPU.txt) generated during the execution.

In [None]:
!grep iteration /dli/megatron/logs/log_2GPU.txt

From the extracted logs, notice the training performance with 2 GPUs compared to 1 GPU.

` iteration      100/     100 | consumed samples:          400 | elapsed time per iteration (ms): 363.6 | learning rate: 5.822E-05 | global batch size:     4 | lm loss: 7.500983E+00 | loss scale: 1.0 | grad norm: 1.360 | number of skipped iterations:   0 | number of nan iterations:   0 |`
 
 
 
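Notice the number of samples consumed and the corresponding training time. In this example, the 2-GPU run processes twice as many samples per iteration (4 instead of 2) while the time per iteration grows from 271.6 ms to 363.6 ms, so throughput rises from about 7.4 to about 11.0 samples per second. That is roughly a 1.5x speedup; the gap to the ideal 2x reflects overheads such as gradient exchange communication. Scaling that stays close to linear is a desirable property in multi-GPU systems.  

Discuss the performance with the instructor. The major change here is that a larger number of samples is processed per unit of time, so the model sees more data in the same duration, which speeds up training.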
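
The comparison can be made concrete with a short calculation. The numbers below are the example values from the two log extracts in this notebook; your own runs will differ, and the helper name `throughput` is illustrative.

```python
def throughput(global_batch_size, ms_per_iteration):
    """Samples processed per second."""
    return global_batch_size / (ms_per_iteration / 1000)

one_gpu = throughput(2, 271.6)   # example values from log_1GPU.txt
two_gpus = throughput(4, 363.6)  # example values from log_2GPU.txt

speedup = two_gpus / one_gpu
efficiency = speedup / 2         # 1.0 would be perfectly linear scaling

print(f"1 GPU : {one_gpu:.2f} samples/s")
print(f"2 GPUs: {two_gpus:.2f} samples/s")
print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```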

Great, before moving on, let's release some disk space by deleting the unnecessary checkpoints generated by the previous execution and cancel the remaining interactive session.


In [None]:
# Clean the checkpoints
!rm -rf /dli/megatron/checkpoints/*  

## 2.3.3 Model Distribution Considerations 

To execute the previous Multi-GPU script in Tensor or Pipeline parallel mode, we can configure the distribution using the argument `--tensor-model-parallel-size` or `--pipeline-model-parallel-size`. 

The world size of the Megatron-LM training, which corresponds to the number of GPUs, remains the same, while the data-parallel, tensor-model-parallel, and pipeline-model-parallel sizes should be adjusted according to your configuration: 
```
world size: 2, data-parallel-size: 1, tensor-model-parallel size: 2, pipeline-model-parallel size: 1
or
world size: 2, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 2

```
The world size is the product of the data-parallel, tensor-model-parallel, and pipeline-model-parallel sizes. 
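
This product rule makes it easy to list the combinations that are possible for a given number of GPUs. The helper `parallel_configs` below is a hypothetical sketch; it only checks the product and ignores model-specific constraints such as the number of layers or attention heads.

```python
def parallel_configs(world_size):
    """All (data, tensor, pipeline) sizes whose product equals world_size."""
    configs = []
    for tp in range(1, world_size + 1):
        for pp in range(1, world_size + 1):
            if world_size % (tp * pp) == 0:
                configs.append((world_size // (tp * pp), tp, pp))
    return configs

for dp, tp, pp in parallel_configs(2):
    print(
        f"world size: 2, data-parallel-size: {dp}, "
        f"tensor-model-parallel size: {tp}, pipeline-model-parallel size: {pp}"
    )
```

With 2 GPUs this yields the data-parallel configuration used in the exercise plus the tensor-parallel and pipeline-parallel configurations listed above.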

---
<h2 style="color:green;">Congratulations!</h2>

Great job with pretraining GPT-3 on a GPU cluster.<br>

Before moving on, we need to make sure that no jobs are still running or waiting on the SLURM queue. 
Let's check the SLURM jobs queue by executing the following cell:

In [None]:
# Check the SLURM jobs queue 
!squeue

If there are still jobs running or pending, execute the following cell to cancel all the user's jobs using the `scancel` command. 

In [None]:
# Cancel all jobs belonging to the current user
!scancel -u $USER

# Check the SLURM jobs queue again (it should be empty, or the ST (state) column should show CG, i.e. completing)
!squeue

Next, we will run GPT language model training on multi-node distribution configurations. Move on to [03_GPT_LM_pretrainings_multinodes.ipynb](03_GPT_LM_pretrainings_multinodes.ipynb).