<img src="./images/DLI_Header.png" style="width: 400px;">


# 4. Distributed Training Optimization

In this notebook, we will learn how to quantify [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) GPT pretraining performance and apply optimization techniques such as Mixed Precision, Gradient Accumulation, and Activation Checkpointing.

## The goals

The goals of this notebook are to:
* Optimize multi-node training of Megatron-LM scripts using Automatic Mixed Precision and Activation Checkpointing 
* Understand how to compute training performance


**[4.1 Mixed Precision Training](#4.1-Mixed-Precision-Training)<br>**
**[4.2 Activation Checkpointing](#4.2-Activation-Checkpointing)<br>**
**[4.3 Gradient Accumulation](#4.3-Gradient-Accumulation)<br>**
**[4.4 The Training Performance](#4.4-The-Training-Performance)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[4.4.1 Compute the Number of Parameters](#4.4.1-Compute-the-Number-of-Parameters-of-Transformers-Model)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.4.2 Exercise: Compute the Number of Parameters For Our Model](#4.4.2-Exercise:-Compute-the-Number-of-Parameters-For-Our-Model)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.4.3 Compute the Theoretical Peak FLOP per second per GPU](#4.4.3-Compute-the-Theoretical-Peak-FLOP-per-second-per-GPU)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.4.4 Estimate the Training Duration / Epoch](#4.4.4-Estimate-the-Training-Duration-/-Epoch)<br>

### Cancel Previous Running/Pending Jobs

Before moving on, check that no jobs are still running or waiting on the SLURM queue. Let's check the SLURM jobs queue by executing the following cell:

In [None]:
# Check the SLURM jobs queue 
!squeue

If there are still jobs running or pending, execute the following cell to cancel all the admin user's jobs with the `scancel` command. 

In [None]:
# Cancel admin user jobs
!scancel -u $USER

# Check the SLURM jobs queue again (it should be either empty, or the status column ST should show CG)
!squeue

---
# 4.1 Mixed Precision Training 

<img src="images/AMP.png" width="700"/>

**Automatic Mixed Precision (AMP)** allows using different numerical precisions when running mathematical operations. It performs some operations in half-precision format, which reduces the memory required, while keeping single precision in critical parts of the network.

Training with Automatic Mixed Precision takes advantage of the hardware acceleration provided by the Tensor Cores available in NVIDIA GPUs starting with the Volta architecture (Volta, Turing, Ampere and future architectures). To learn more about training with AMP, refer to the [Mixed precision training documentation](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html). 
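
To see why single precision is kept in critical parts of the network, the following minimal sketch rounds a small weight update to FP16 and FP32 using only the Python standard library. The helper `round_to` and the values `weight` and `update` are illustrative assumptions; they are not part of Megatron-LM or of any AMP implementation.

```python
import struct

def round_to(fmt, value):
    """Round a Python float to a struct format ('e' = FP16, 'f' = FP32)."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

# Illustrative values: a weight and a small update applied to it
weight, update = 1.0, 1e-4

fp16_result = round_to("e", weight + update)
fp32_result = round_to("f", weight + update)

# Storage size in bytes of one FP16 value and one FP32 value
print(struct.calcsize("e"), struct.calcsize("f"))

# True: the update is too small to be represented next to 1.0 in FP16
print(fp16_result == weight)

# False: FP32 keeps the update
print(fp32_result == weight)
```

An FP16 value takes half the storage of an FP32 value, but small updates can be rounded away, which is one reason to keep an FP32 copy where accuracy matters.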


In the previous notebook, Mixed Precision training was not enabled for the baseline run. We can see the details in the GPU Kernel view. 

<img src="images/profiling3.png" width="650"/>

According to the Performance Recommendation for the baseline run in the overview tab (also shown in the figure of section 3.4.2 Pytorch Profiler with Tensorboard), kernels accounting for 27% of the total time are launched by Tensor Core eligible operators. We can speed up these operations by enabling Automatic Mixed Precision with FP16.

To run Megatron-LM pretraining in Mixed Precision, simply add the argument `--fp16` to `GPT_ARGS`.

In [None]:
%%writefile /dli/code/pretrain_gpt_2Node4GPU_increase_MBS_fp16.sh
#!/bin/bash
#SBATCH --job-name=dli_2nodes
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1       
#SBATCH --cpus-per-task=32 ### Number of threads per task (OMP threads)
#SBATCH -o /dli/megatron/logs/%j.out
#SBATCH -e /dli/megatron/logs/%j.err

set -x -e

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Distributed training args
NNODES=2
GPUS_PER_NODE=2
TP_SIZE=1
PP_SIZE=1 

# SLURM args
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000

# Distributed training 
MICRO_BATCH_SIZE=4      
GLOBAL_BATCH_SIZE=16    

# Model architecture 
NLAYERS=12
NHIDDEN=768
NHEADS=32
SEQ_LEN=1024
VOCAB_SIZE=50257

# Data Paths
DATA_OUTPUT_PATH=/dli/megatron/checkpoints/test
CHECKPOINT_PATH=/dli/megatron/checkpoints
TENSORBOARD_PATH=/dli/megatron/tensorboard
LOGS_PATH=/dli/megatron/logs
VOCAB_FILE=/dli/data/GPT-2_assets/gpt2-vocab.json
MERGE_FILE=/dli/data/GPT-2_assets/gpt2-merges.txt
DATA_PATH=/dli/data/GPT-2_assets/my-gpt2_text_document

NAME="log_2Nodes4GPUS_increase_GBS_fp16"        

OPTIMIZER_ARGS=" \
            --optimizer adam \
            --adam-beta1 0.9 \
            --adam-beta2 0.95 \
            --adam-eps 1e-8 \
            --lr 6e-5 \
            --min-lr 6e-6 \
            --lr-decay-style cosine \
            --lr-decay-iters 800 \
            --lr-warmup-fraction .01 \
            --clip-grad 1.0 \
            --weight-decay 1e-1 \
            --exit-duration-in-mins 1190 \
              "

GPT_ARGS=" \
            --num-layers $NLAYERS \
            --hidden-size $NHIDDEN \
            --num-attention-heads $NHEADS \
            --seq-length $SEQ_LEN \
            --max-position-embeddings $SEQ_LEN \
            --micro-batch-size $MICRO_BATCH_SIZE \
            --global-batch-size $GLOBAL_BATCH_SIZE \
            --train-iters 100 \
            --vocab-file $VOCAB_FILE \
            --merge-file $MERGE_FILE \
            --init-method-std 0.006 \
            --fp16 \
            $OPTIMIZER_ARGS \
            "

OUTPUT_ARGS=" \
            --log-interval 10 \
            --save-interval 300 \
            --eval-interval 1000 \
            --eval-iters 10 \
            --tensorboard-dir $TENSORBOARD_PATH \
            --tensorboard-queue-size 1 \
            --log-timers-to-tensorboard \
            --log-batch-size-to-tensorboard \
            --log-validation-ppl-to-tensorboard \
            --profile-execution True \
            --profile-name fp16 \
            "

export LAUNCHER="python -u -m torch.distributed.launch \
             --nproc_per_node $GPUS_PER_NODE \
             --nnodes $NNODES \
             --master_addr $MASTER_ADDR \
             --master_port $MASTER_PORT \
             "

export CMD=" \
             /dli/megatron/Megatron-LM/pretrain_gpt.py \
             --tensor-model-parallel-size $TP_SIZE \
             --pipeline-model-parallel-size $PP_SIZE \
             $GPT_ARGS \
             $OUTPUT_ARGS \
             --save $CHECKPOINT_PATH \
             --data-path $DATA_PATH \
             --data-impl mmap \
             --split 949,50,1 \
             --distributed-backend nccl \
           "

clear; srun --jobid $SLURM_JOBID bash -c 'NCCL_DEBUG=INFO  $LAUNCHER --node_rank $SLURM_PROCID $CMD' 2>&1 | tee -a $LOGS_PATH/$NAME.txt

Now, let's submit the previous sbatch script [pretrain_gpt_2Node4GPU_increase_MBS_fp16.sh](./code/pretrain_gpt_2Node4GPU_increase_MBS_fp16.sh) and check the SLURM queue using the `squeue` command.

In [None]:
# submit the 2 nodes jobs
!sbatch /dli/code/pretrain_gpt_2Node4GPU_increase_MBS_fp16.sh

# Check the SLURM queue
!squeue

To understand the performance of the Megatron GPT3 pretraining, we can check the generated logs during execution.

Let's first look at the generated [logs](./megatron/logs/log_2Nodes4GPUS_increase_GBS_fp16.txt) and check the world size of our executed run. We should see this:

```
    using world size: 4, data-parallel-size: 4, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
    using torch.float16 for parameters ...
```

Notice that we have the message `torch.float16 for parameters` as we are running in fp16 mode.

After a few seconds, let's check the training performance and discuss it with the instructor. Note that if you are running Megatron-LM for the first time in the class, the code will need about 6 minutes to compile. Until then, you will not see any GPU activity.

In [None]:
!grep elapsed /dli/megatron/logs/log_2Nodes4GPUS_increase_GBS_fp16.txt
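
The `grep` output can also be summarized programmatically. The sketch below assumes the log lines contain the field `elapsed time per iteration (ms): <value>`; the sample lines and their numbers are hypothetical and only illustrate the format, and the helper names are our own.

```python
import re

# Hypothetical log lines; the values are illustrative, not measured results
sample_log = """\
 iteration 10/ 100 | consumed samples: 160 | elapsed time per iteration (ms): 1500.0 | learning rate: 6.000E-05 |
 iteration 20/ 100 | consumed samples: 320 | elapsed time per iteration (ms): 1000.0 | learning rate: 5.998E-05 |
"""

ELAPSED_RE = re.compile(r"elapsed time per iteration \(ms\): ([0-9.]+)")

def mean_iteration_seconds(log_text):
    """Average the reported time per iteration, converted to seconds."""
    values_ms = [float(v) for v in ELAPSED_RE.findall(log_text)]
    if not values_ms:
        raise ValueError("no 'elapsed time per iteration' entries found")
    return sum(values_ms) / len(values_ms) / 1000.0

def tokens_per_second(global_batch_size, seq_len, seconds_per_iteration):
    """Training throughput in tokens per second for the whole job."""
    return global_batch_size * seq_len / seconds_per_iteration

seconds = mean_iteration_seconds(sample_log)
throughput = tokens_per_second(16, 1024, seconds)
print(seconds, throughput)
```

To apply it to a real run, read the log file into a string and pass it to `mean_iteration_seconds` together with the batch size and sequence length from the sbatch script.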

### PyTorch Profiler with TensorBoard

The profiling results for the previous run are available in TensorBoard under the `PYTORCH_PROFILER` tab. If you have already closed the TensorBoard page, you can regenerate the link by executing the next cell. Click the link to open TensorBoard, then go to the `PYTORCH_PROFILER` tab.

In [None]:
%%js
const href = window.location.hostname +'/tensorboard/';
let a = document.createElement('a');
let link = document.createTextNode('Open Tensorboard!');
a.appendChild(link);
a.href = "http://" + href;
a.style.color = "navy"
a.target = "_blank"
element.append(a);

Looking at the `run_fp16_gpu0` results, the GPU utilization is reduced. The GPU Kernel view shows that with Mixed Precision enabled, 23.6% of GPU operations are accelerated with Tensor Cores.

<img src="images/profiling5_AMP.png" width="400"/>



### How about the memory?
We can see that the memory consumption increases, with a peak of 12G (~10G in the baseline). This is because some model weights are stored in both FP32 and FP16, while the activations and gradients are stored in FP16. Thus, the weights use more memory, while the activations and gradients use less during the forward and backward passes. Zooming into the new jumps reveals additional PyTorch copy operations (*_to_copy*).
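
As a rough illustration of the weight storage only, the sketch below compares a pure FP32 model with one that keeps an FP16 copy and an FP32 copy of every weight. It is a simplification under stated assumptions: it ignores gradients, activations and optimizer states, and the parameter count is a hypothetical value, not the size of the model trained here.

```python
BYTES_FP16 = 2
BYTES_FP32 = 4

def weight_memory_gib(num_params, mixed_precision):
    """Memory used by the weights alone, in GiB."""
    if mixed_precision:
        # Assumption: each weight has an FP16 copy and an FP32 copy
        bytes_per_param = BYTES_FP16 + BYTES_FP32
    else:
        bytes_per_param = BYTES_FP32
    return num_params * bytes_per_param / 2**30

num_params = 100_000_000  # hypothetical model size

fp32_gib = weight_memory_gib(num_params, mixed_precision=False)
mixed_gib = weight_memory_gib(num_params, mixed_precision=True)
print("FP32 weights: {:.2f} GiB".format(fp32_gib))
print("Mixed precision weights: {:.2f} GiB".format(mixed_gib))
```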


<img src="images/profiling_FP16_memory.png" width="800"/>


Great! Before moving on, let's release some disk space by deleting the unnecessary checkpoints generated by the previous execution.

In [None]:
# Clean the checkpoints 
!rm -rf /dli/megatron/checkpoints/* 

---
# 4.2 Activation Checkpointing 


<img src="images/activation_checkpoiting.png" width="700" align="center"/>

Activation Checkpointing is another technique that saves memory during training at the cost of additional recomputation. In the vanilla forward and backward pass, all feature maps are computed during the forward pass and stored for the backward pass. With activation checkpointing, only some intermediate results (called checkpoints) are stored during the forward pass. Those checkpoints are then used during the backward pass to recompute the other feature maps when needed. There are several implementations of checkpointing, including manual checkpoints provided by the user and automatic selection.

Megatron-LM supports two activation checkpointing methods: uniform and block.

- Uniform: Uniformly divides the layers into groups and stores the input activations of each group in memory. 
- Block: Checkpoints the input activations of a set number of individual Transformer layers per pipeline stage.
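
The idea behind the uniform method can be sketched with a toy chain of scalar layers. This is not Megatron-LM code: the layer function, the weights and the helper names are assumptions chosen only to show that storing one input per group and recomputing the rest gives the same gradient as storing every input.

```python
import math

def make_layer(w):
    """Toy layer y = tanh(w * x), returned with its derivative dy/dx."""
    def forward(x):
        return math.tanh(w * x)
    def derivative(x):
        return w * (1.0 - math.tanh(w * x) ** 2)
    return forward, derivative

def backward_full(layers, x):
    """Store every layer input during the forward pass."""
    inputs = []
    for forward, _ in layers:
        inputs.append(x)
        x = forward(x)
    grad = 1.0
    for (_, derivative), x_in in zip(reversed(layers), reversed(inputs)):
        grad *= derivative(x_in)
    return x, grad, len(inputs)

def backward_checkpointed(layers, x, group_size):
    """Store only the input of each group, then recompute inside each group."""
    checkpoints = {}
    for i, (forward, _) in enumerate(layers):
        if i % group_size == 0:
            checkpoints[i] = x
        x = forward(x)
    grad = 1.0
    for start in sorted(checkpoints, reverse=True):
        group = layers[start:start + group_size]
        x_in = checkpoints[start]
        inputs = []
        for forward, _ in group:  # recompute the activations of this group
            inputs.append(x_in)
            x_in = forward(x_in)
        for (_, derivative), x_saved in zip(reversed(group), reversed(inputs)):
            grad *= derivative(x_saved)
    return x, grad, len(checkpoints)

layers = [make_layer(0.5 + 0.1 * i) for i in range(12)]
out_full, grad_full, stored_full = backward_full(layers, 0.3)
out_ckpt, grad_ckpt, stored_ckpt = backward_checkpointed(layers, 0.3, group_size=4)
print(stored_full, stored_ckpt, abs(grad_full - grad_ckpt))
```

With 12 layers and groups of 4, the forward pass keeps 3 inputs instead of 12, and each group is run forward a second time during the backward pass.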

To run Megatron-LM pretraining with Activation Checkpointing, simply add the argument `--activations-checkpoint-method`, set to `uniform` or `block`, to `GPT_ARGS`.


In [None]:
%%writefile /dli/code/pretrain_gpt_2Node4GPU_increase_BS_fp16_activation_checkpointing.sh
#!/bin/bash
#SBATCH --job-name=dli_2nodes
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1       
#SBATCH --cpus-per-task=32 ### Number of threads per task (OMP threads)
#SBATCH -o /dli/megatron/logs/%j.out
#SBATCH -e /dli/megatron/logs/%j.err

set -x -e

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Distributed training args
NNODES=2
GPUS_PER_NODE=2
TP_SIZE=1
PP_SIZE=1 

# SLURM args
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000

# Distributed training 
MICRO_BATCH_SIZE=4      
GLOBAL_BATCH_SIZE=16    

# Model architecture 
NLAYERS=12
NHIDDEN=768
NHEADS=32
SEQ_LEN=1024
VOCAB_SIZE=50257

# Data Paths
DATA_OUTPUT_PATH=/dli/megatron/checkpoints/test
CHECKPOINT_PATH=/dli/megatron/checkpoints
TENSORBOARD_PATH=/dli/megatron/tensorboard
LOGS_PATH=/dli/megatron/logs
VOCAB_FILE=/dli/data/GPT-2_assets/gpt2-vocab.json
MERGE_FILE=/dli/data/GPT-2_assets/gpt2-merges.txt
DATA_PATH=/dli/data/GPT-2_assets/my-gpt2_text_document

NAME="log_2Nodes4GPUS_increase_GBS_fp16_activation_checkpointing"        


OPTIMIZER_ARGS=" \
            --optimizer adam \
            --adam-beta1 0.9 \
            --adam-beta2 0.95 \
            --adam-eps 1e-8 \
            --lr 6e-5 \
            --min-lr 6e-6 \
            --lr-decay-style cosine \
            --lr-decay-iters 800 \
            --lr-warmup-fraction .01 \
            --clip-grad 1.0 \
            --weight-decay 1e-1 \
            --exit-duration-in-mins 1190 \
              "

GPT_ARGS=" \
            --num-layers $NLAYERS \
            --hidden-size $NHIDDEN \
            --num-attention-heads $NHEADS \
            --seq-length $SEQ_LEN \
            --max-position-embeddings $SEQ_LEN \
            --micro-batch-size $MICRO_BATCH_SIZE \
            --global-batch-size $GLOBAL_BATCH_SIZE \
            --train-iters 100 \
            --vocab-file $VOCAB_FILE \
            --merge-file $MERGE_FILE \
            --init-method-std 0.006 \
            --fp16 \
            --activations-checkpoint-method uniform \
            $OPTIMIZER_ARGS \
            "

OUTPUT_ARGS=" \
            --log-interval 10 \
            --save-interval 300 \
            --eval-interval 1000 \
            --eval-iters 10 \
            --tensorboard-dir $TENSORBOARD_PATH \
            --tensorboard-queue-size 1 \
            --log-timers-to-tensorboard \
            --log-batch-size-to-tensorboard \
            --log-validation-ppl-to-tensorboard \
            --profile-execution True \
            --profile-name  fp16_activation_checkpointing \
            "

export LAUNCHER="python -u -m torch.distributed.launch \
             --nproc_per_node $GPUS_PER_NODE \
             --nnodes $NNODES \
             --master_addr $MASTER_ADDR \
             --master_port $MASTER_PORT \
             "

export CMD=" \
             /dli/megatron/Megatron-LM/pretrain_gpt.py \
             --tensor-model-parallel-size $TP_SIZE \
             --pipeline-model-parallel-size $PP_SIZE \
             $GPT_ARGS \
             $OUTPUT_ARGS \
             --save $CHECKPOINT_PATH \
             --data-path $DATA_PATH \
             --data-impl mmap \
             --split 949,50,1 \
             --distributed-backend nccl \
           "

clear; srun --jobid $SLURM_JOBID bash -c 'NCCL_DEBUG=INFO  $LAUNCHER --node_rank $SLURM_PROCID $CMD' 2>&1 | tee -a $LOGS_PATH/$NAME.txt

Now, let's submit the previous sbatch script [pretrain_gpt_2Node4GPU_increase_BS_fp16_activation_checkpointing.sh](/dli/code/pretrain_gpt_2Node4GPU_increase_BS_fp16_activation_checkpointing.sh) and check the SLURM queue using the `squeue` command.

In [None]:
# Submit the 2 nodes jobs
!sbatch /dli/code/pretrain_gpt_2Node4GPU_increase_BS_fp16_activation_checkpointing.sh

# Check the SLURM queue
!squeue

To understand the performance of the Megatron GPT3 pretraining, we can check the generated [logs](./megatron/logs/log_2Nodes4GPUS_increase_GBS_fp16_activation_checkpointing.txt) and discuss them with the instructor.

In [None]:
!grep elapsed /dli/megatron/logs/log_2Nodes4GPUS_increase_GBS_fp16_activation_checkpointing.txt

### How about the memory?

The profiling results are available in TensorBoard under the `PYTORCH_PROFILER` tab. When using activation checkpointing, we can see at `run_fp16_activation_checkpointing_GPU0` that the memory consumption is reduced considerably, to a peak of ~3G. This is because some activations are not stored and are recomputed when necessary.
By zooming into the graph, we can trace the PyTorch CheckpointFunctions. 

<img src="images/profiling_FP16_checkpoiting_memory.png" width="900"/>

Great! Before moving on, let's release some disk space by deleting the unnecessary checkpoints generated by the previous execution.

In [None]:
# Clean the checkpoints 
!rm -rf /dli/megatron/checkpoints/*

---
# 4.3 Gradient Accumulation 


Another way of increasing the batch size is to use gradient accumulation. Instead of splitting the data across workers as in Distributed Data Parallel, in the Gradient Accumulation technique, the same worker processes several batches and accumulates the gradients before updating the model’s parameters.
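
The following toy example illustrates the principle on a one-parameter linear model. The data, the model and the helper `mse_gradient` are assumptions made for this sketch; the point is only that averaging the gradients of equally sized micro-batches reproduces the gradient of the full batch.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

# 8 toy samples generated from y = 3x + 0.5
data = [(float(x), 3.0 * x + 0.5) for x in range(1, 9)]
w = 1.0

# Gradient computed on the full batch in one pass
full_batch_grad = mse_gradient(w, data)

# Same batch processed as 4 micro-batches of 2 samples each
micro_batch_size = 2
micro_batches = [data[i:i + micro_batch_size]
                 for i in range(0, len(data), micro_batch_size)]

accumulated_grad = 0.0
for micro_batch in micro_batches:
    # Accumulate the micro-batch gradient, scaled by the number of micro-batches
    accumulated_grad += mse_gradient(w, micro_batch) / len(micro_batches)

# The parameter update would happen only here, after all micro-batches
print(full_batch_grad, accumulated_grad)
```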

The [NVIDIA APEX library](https://github.com/NVIDIA/apex#quick-start) provides an optimized implementation of Gradient Accumulation with Automatic Mixed Precision. This implementation removes unnecessary double precision copies of the gradient by accumulating it in low precision first before going back to double precision. The Megatron-LM library uses the APEX implementation.  


To run Megatron-LM pretraining with Gradient Accumulation, simply increase the global batch size while keeping the micro batch size the same.  
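
The number of accumulation steps follows from the batch sizes and the data-parallel size. The helper below is a sketch with a name of our own choosing. It assumes that each data-parallel replica processes `micro_batch_size` samples per micro-batch, so that the global batch is split into `global_batch_size / (micro_batch_size * data_parallel_size)` micro-batches per replica.

```python
def num_micro_batches(global_batch_size, micro_batch_size, world_size,
                      tp_size=1, pp_size=1):
    """Micro-batches each data-parallel replica accumulates per iteration."""
    data_parallel_size = world_size // (tp_size * pp_size)
    samples_per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global batch size must be a multiple of "
                         "micro batch size * data-parallel size")
    return global_batch_size // samples_per_pass

# Settings of the previous run: 4 GPUs, micro batch 4, global batch 16
print(num_micro_batches(16, 4, world_size=4))

# Settings of the next run: global batch size raised to 64
print(num_micro_batches(64, 4, world_size=4))
```

With the settings used in this notebook, the first call gives 1 micro-batch per iteration (no accumulation) and the second gives 4.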

In [None]:
%%writefile /dli/code/pretrain_gpt_2Node4GPU_increase_BS_fp16_activation_checkpointing_gradient_accumulation.sh
#!/bin/bash
#SBATCH --job-name=dli_2nodes
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1       
#SBATCH --cpus-per-task=32 ### Number of threads per task (OMP threads)
#SBATCH -o /dli/megatron/logs/%j.out
#SBATCH -e /dli/megatron/logs/%j.err

set -x -e

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Distributed training args
NNODES=2
GPUS_PER_NODE=2
TP_SIZE=1
PP_SIZE=1 

# SLURM args
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000

# Distributed training 
MICRO_BATCH_SIZE=4      
GLOBAL_BATCH_SIZE=64    

# Model architecture 
NLAYERS=12
NHIDDEN=768
NHEADS=32
SEQ_LEN=1024
VOCAB_SIZE=50257

# Data Paths
DATA_OUTPUT_PATH=/dli/megatron/checkpoints/test
CHECKPOINT_PATH=/dli/megatron/checkpoints
TENSORBOARD_PATH=/dli/megatron/tensorboard
LOGS_PATH=/dli/megatron/logs
VOCAB_FILE=/dli/data/GPT-2_assets/gpt2-vocab.json
MERGE_FILE=/dli/data/GPT-2_assets/gpt2-merges.txt
DATA_PATH=/dli/data/GPT-2_assets/my-gpt2_text_document

NAME="log_2Nodes4GPUS_increase_GBS_fp16_activation_checkpointing_gradient_accumulation"        


OPTIMIZER_ARGS=" \
            --optimizer adam \
            --adam-beta1 0.9 \
            --adam-beta2 0.95 \
            --adam-eps 1e-8 \
            --lr 6e-5 \
            --min-lr 6e-6 \
            --lr-decay-style cosine \
            --lr-decay-iters 800 \
            --lr-warmup-fraction .01 \
            --clip-grad 1.0 \
            --weight-decay 1e-1 \
            --exit-duration-in-mins 1190 \
              "

GPT_ARGS=" \
            --num-layers $NLAYERS \
            --hidden-size $NHIDDEN \
            --num-attention-heads $NHEADS \
            --seq-length $SEQ_LEN \
            --max-position-embeddings $SEQ_LEN \
            --micro-batch-size $MICRO_BATCH_SIZE \
            --global-batch-size $GLOBAL_BATCH_SIZE \
            --train-iters 100 \
            --vocab-file $VOCAB_FILE \
            --merge-file $MERGE_FILE \
            --init-method-std 0.006 \
            --fp16 \
            --activations-checkpoint-method uniform \
            $OPTIMIZER_ARGS \
            "

OUTPUT_ARGS=" \
            --log-interval 10 \
            --save-interval 300 \
            --eval-interval 1000 \
            --eval-iters 10 \
            --tensorboard-dir $TENSORBOARD_PATH \
            --tensorboard-queue-size 1 \
            --log-timers-to-tensorboard \
            --log-batch-size-to-tensorboard \
            --log-validation-ppl-to-tensorboard \
            --profile-execution True \
            --profile-name fp16_activation_checkpointing_gradient_accumulation \
            "

export LAUNCHER="python -u -m torch.distributed.launch \
             --nproc_per_node $GPUS_PER_NODE \
             --nnodes $NNODES \
             --master_addr $MASTER_ADDR \
             --master_port $MASTER_PORT \
             "

export CMD=" \
             /dli/megatron/Megatron-LM/pretrain_gpt.py \
             --tensor-model-parallel-size $TP_SIZE \
             --pipeline-model-parallel-size $PP_SIZE \
             $GPT_ARGS \
             $OUTPUT_ARGS \
             --save $CHECKPOINT_PATH \
             --data-path $DATA_PATH \
             --data-impl mmap \
             --split 949,50,1 \
             --distributed-backend nccl \
           "

clear; srun --jobid $SLURM_JOBID bash -c 'NCCL_DEBUG=INFO  $LAUNCHER --node_rank $SLURM_PROCID $CMD' 2>&1 | tee -a $LOGS_PATH/$NAME.txt

Now, let's submit the previous sbatch script [pretrain_gpt_2Node4GPU_increase_BS_fp16_activation_checkpointing_gradient_accumulation.sh](/dli/code/pretrain_gpt_2Node4GPU_increase_BS_fp16_activation_checkpointing_gradient_accumulation.sh) and check the SLURM queue using the `squeue` command.

In [None]:
# Submit the 2 nodes jobs
!sbatch /dli/code/pretrain_gpt_2Node4GPU_increase_BS_fp16_activation_checkpointing_gradient_accumulation.sh

# Check the SLURM queue
!squeue

To understand the performance of the Megatron GPT3 pretraining, we can check the generated [logs](./megatron/logs/log_2Nodes4GPUS_increase_GBS_fp16_activation_checkpointing_gradient_accumulation.txt) and discuss them with the instructor.

Let's compare the number of micro-batches per GPU when enabling or disabling Gradient Accumulation. To do so, let’s compare to the previous runs without Gradient Accumulation.

In [None]:
print("Without gradient accumulation:")
!grep constant /dli/megatron/logs/log_2Nodes4GPUS_increase_GBS_fp16_activation_checkpointing.txt

print("With 4 gradient accumulation steps:")
!grep constant /dli/megatron/logs/log_2Nodes4GPUS_increase_GBS_fp16_activation_checkpointing_gradient_accumulation.txt

In [None]:
!grep elapsed /dli/megatron/logs/log_2Nodes4GPUS_increase_GBS_fp16_activation_checkpointing_gradient_accumulation.txt

### How about the memory?

The profiling results are available in TensorBoard under the `PYTORCH_PROFILER` tab, in the run `run_fp16_activation_checkpointing_gradient_accumulation_gpu0`. 

We can see the 4 gradient accumulation stages per step in the memory trace. 

<img src="images/profiling_FP16_checkpoiting_gradient_acc_memory.png" width="900"/>


Great! Before moving on, let's release some disk space by deleting the unnecessary checkpoints generated by the previous execution.

In [None]:
# Clean the checkpoints 
!rm -rf /dli/megatron/checkpoints/* 

---
# 4.4 The Training Performance

To train large neural networks in a reasonable time, scaling the infrastructure is unavoidable. 
Let's first compute the number of parameters of our transformer model, then estimate its training time according to the hardware infrastructure and the experimentally observed training throughput.


## 4.4.1 Compute the Number of Parameters of Transformers Model

The number of parameters of a Transformer model is computed as:
$P = 12 l h^2 \left(1 + \frac{13}{12h} + \frac{V+s}{12lh}\right)$ where:
- $l$ = Number of Layers
- $h$ = Hidden Size
- $V$ = Vocabulary Size
- $s$ = Sequence Length
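
Expanding the product can make the formula easier to read. The labels below are our interpretation of the three terms, assuming the standard GPT layer layout (self-attention followed by an MLP with an inner size of $4h$):

$$P = \underbrace{12 l h^2}_{\text{layer weight matrices}} + \underbrace{13 l h}_{\text{biases and layer norms}} + \underbrace{(V + s)\,h}_{\text{token and position embeddings}}$$

For large hidden sizes the first term dominates, so $P \approx 12 l h^2$.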


In [None]:
# Number of parameters of the Transformers model
def calculate_number_parameters(l,h,s,V):
    # Compute the number of parameters of the model
    P=12*l*h*h *(1+ (13/(12*h)) + ((V+s)/(12*l*h)))
    print("The number of parameters for the GPT architecture is: {} \n".format(int(P)))
    return P

As an example, let's compute the number of parameters of a transformer model with 40 layers, a hidden size of 6144, a vocabulary size of 50257 and a sequence length of 1048 (the value set in the cell below). This model should have approximately 18 billion parameters. 

In [None]:
# Set the model architecture parameters
l=40
h=6144
s=1048
V=50257
    
P=calculate_number_parameters(l,h,s,V)

## 4.4.2 Exercise: Compute the Number of Parameters For Our Model

Compute the number of parameters of the model we have been experimenting with in previous notebooks.
Have a look at the model architecture arguments in any of the Megatron-LM GPT pretraining scripts from the previous notebooks. 

If you get stuck, you can look at the [solution](solutions/ex4.1.2.ipynb).

In [None]:
# Set our model architecture parameters
l=#FIXME
h=#FIXME
s=#FIXME
V=#FIXME
    
calculate_number_parameters(l,h,s,V)

## 4.4.3 Compute the Theoretical Peak FLOP per second per GPU

As detailed in the paper [Scaling Language Model Training to a Trillion Parameters Using Megatron](https://arxiv.org/pdf/2104.04473.pdf), the majority of floating-point operations in the model are performed in the matrix multiplications (GEMMs). If we consider only these GEMM operations, and assume activation checkpointing (two forward passes and one backward pass), the number of FLOPs per iteration is $F = 96 B s l h^2 \left(1 + \frac{s}{6h} + \frac{V}{16lh}\right)$ where $B$ is the batch size. 

If we also have an estimate of the time spent per iteration, `Time_per_iteration_second`, we can compute the achieved teraFLOP/s per GPU and estimate the GPU utilization by comparing it with the theoretical peak of the hardware. 
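
Written as formulas, with $t_{iter}$ for the time per iteration in seconds, $n$ for the number of GPUs and $PF_{peak}$ for the theoretical peak of one GPU in teraFLOP/s (these three symbols are notation introduced here for illustration):

$$PF = \frac{F}{t_{iter} \cdot n \cdot 10^{12}}, \qquad \text{utilization} = \frac{PF}{PF_{peak}} \times 100\%$$

The factor $10^{12}$ converts FLOP/s to teraFLOP/s.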

The following table shows the training performance of several GPT model sizes (from 1.7 billion to 1 trillion parameters) pretrained using the Megatron-LM library on a SuperPOD cluster with A100 GPUs.
<img src="https://github.com/NVIDIA/Megatron-LM/blob/main/images/cases_april2021.png?raw=true"/>


In [None]:
# Achieved teraFLOP/s per GPU - with activation checkpointing (2 forward passes and 1 backward pass)
def calculate_Theoretical_peak_FLOP_s_GPU(B,s, l,h,number_GPUs,Time_per_iteration_second,V=50257):
    # V is the vocabulary size (defaults to the GPT-2 value used in this notebook)
    # The number of FLOPs per iteration, converted to teraFLOPs
    F = 96*B*s* l*h*h *(1 + s/ (6*h) + V/(16*l*h))/1e+12
    
    # Achieved teraFLOP/s per GPU
    PF= (F/Time_per_iteration_second/number_GPUs)
    print("Achieved teraFLOP/s per GPU: {}\n".format(PF))
    
    # Percentage of the theoretical peak of an A100 in FP16, 312 teraFLOP/s (change according to the hardware)
    GPU_usage= PF/ 312 *100
    print("Percentage of theoretical peak FLOP/s: {}%".format(GPU_usage))
    
    return PF, GPU_usage

The percentage of theoretical peak FLOP/s in the previous function is based on the **A100** hardware capability in **FP16/BF16, which is 312 teraFLOP/s**. Update this value according to the Tensor Core performance specs of your GPU. To learn more, see the [Ampere architecture specifications](https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/).


Consider our previous 18B-parameter model pretrained on 16 GPUs with a global batch size of 512 and a time per iteration of 32.09 s. 
Let's compute the achieved teraFLOP/s per GPU and the percentage of GPU utilization:

In [None]:
global_batch_size=512
number_GPUs=16
Time_per_iteration_second=32.09

# Considering the 18B parameters model
l=40
h=6144
s=1048
V=50257
    
PF,GPU_usage=calculate_Theoretical_peak_FLOP_s_GPU(global_batch_size,s, l,h,number_GPUs,Time_per_iteration_second)

## 4.4.4 Estimate the Training Duration / Epoch

It is possible to estimate the training duration per epoch according to the model, dataset, and hardware size. The training time (in seconds) is approximated by $\approx \frac{8 \cdot T \cdot P}{n \cdot PF}$, with $PF$ converted to FLOP/s, where: 
- $T$ = Number of tokens in the dataset
- $P$ = Number of parameters 
- $n$ = Number of GPUs
- $PF$ = Achieved teraFLOP/s per GPU

More details are described in the paper [Scaling Language Model Training to a Trillion Parameters Using Megatron](https://arxiv.org/pdf/2104.04473.pdf).

Let's execute the following 2 cells to estimate the training duration for the 18B-parameter model trained on a dataset of $T$=300 billion tokens:

In [None]:
from termcolor import colored

# Estimate the training time
def estimate_days_needed(T , P , N ,PF):  
    # PF is given in teraFLOP/s per GPU, so multiply by 1e12 to get FLOP/s
    compute_sec=8*T*P/(N*PF*1e12)
    # Convert compute seconds to days
    to_days=round(compute_sec/(3600*24))
    print("This language model will need {} days per epoch.".format(colored(str(to_days),'blue', attrs=['bold'])))

In [None]:
# Number of tokens in the dataset (300 billion)
T=300e9

estimate_days_needed(T,P,number_GPUs,PF)

The result is 203 days, which is almost **7 months** required to train the 18B model on 16 GPUs (2 nodes) with a dataset of 300B tokens! 

In this case, scaling the number of nodes is unavoidable in order to train the model in a reasonable amount of time. 

For instance, consider a GPT-3 model with $P$=175 billion parameters trained on a dataset of $T$=300 billion tokens on $n$=1024 A100 GPUs. Using a batch size of 1536, we achieve $PF$=140 teraFLOP/s per GPU. Thus, the time required to train this model is approximately **34 days**.
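
The GPT-3 figure can be checked with the same approximation. The function below is a standalone sketch with a name of our own choosing; it takes the achieved throughput in teraFLOP/s per GPU and returns the duration in days without rounding.

```python
def training_days(num_tokens, num_params, num_gpus, achieved_tflops_per_gpu):
    """Approximate training time in days: 8 * T * P / (n * PF)."""
    # Convert teraFLOP/s to FLOP/s before dividing
    flops_per_second = num_gpus * achieved_tflops_per_gpu * 1e12
    seconds = 8 * num_tokens * num_params / flops_per_second
    return seconds / (3600 * 24)

# GPT-3 example: 175B parameters, 300B tokens, 1024 GPUs, 140 teraFLOP/s per GPU
gpt3_days = training_days(300e9, 175e9, 1024, 140)
print("Estimated training time: {:.1f} days".format(gpt3_days))  # about 34 days
```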


---
<h2 style="color:green;">Congratulations!</h2>

Before moving on, we need to make sure no jobs are still running or waiting in the queue. 

In [None]:
# Check the SLURM jobs queue 
!squeue

If there are still jobs running or pending, execute the following cell to cancel all the admin user's jobs using the `scancel` command.

In [None]:
# Cancel admin user jobs
!scancel -u $USER

# Check the SLURM jobs queue again (it should be either empty, or the status column ST should show CG)
!squeue

In the next lab, we will experiment with other techniques used for training large-scale neural networks and demonstrate their usage for Computer Vision. 