---
# 4.4 The Training Performance

Training large neural networks in a reasonable time requires scaling the infrastructure.
Let's first compute the number of parameters of our transformer model, then estimate its training time from the hardware infrastructure and the experimentally observed training throughput.

## Compute the Number of Parameters of a Transformer Model

The number of parameters of a transformer model is computed as:
$P = 12 l h^2 (1 + \frac{13}{12h} + \frac{V+s}{12lh})$ where:
- $l$ = Number of Layers
- $h$ = Hidden Size
- $V$ = Vocabulary Size
- $s$ = Sequence Length

In [None]:
# Number of parameters of the transformer model
def calculate_number_parameters(l, h, s, V):
    # l: number of layers, h: hidden size, s: sequence length, V: vocabulary size
    P = 12 * l * h * h * (1 + 13 / (12 * h) + (V + s) / (12 * l * h))
    print("The number of parameters for the GPT architecture is: {} \n".format(int(P)))
    return P

As an example, let's compute the number of parameters of a transformer model with 40 layers, a hidden size of 6144, a vocabulary size of 50257 and a sequence length of 1048. This model should have approximately 18 billion parameters.

In [None]:
# Set the model architecture parameters
l=40
h=6144
s=1048
V=50257
    
P=calculate_number_parameters(l,h,s,V)
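
As a sanity check on the closed-form expression, the sketch below rebuilds the count term by term. The breakdown (attention, MLP, layer norms, embeddings) is an assumption about a standard GPT-style block with a 4h-wide MLP and bias terms, not something stated in this section, and the function names are hypothetical.

```python
import math

def count_parameters_by_component(l, h, s, V):
    """Count parameters of an assumed GPT-style transformer, term by term."""
    attention = 4 * h * h + 4 * h   # QKV (3h^2 + 3h) plus output projection (h^2 + h)
    mlp = 8 * h * h + 5 * h         # h -> 4h (4h^2 + 4h) and 4h -> h (4h^2 + h)
    layer_norms = 2 * 2 * h         # two layer norms per layer, scale and bias each
    per_layer = attention + mlp + layer_norms  # equals 12h^2 + 13h
    embeddings = (V + s) * h        # token and position embeddings
    return l * per_layer + embeddings

def closed_form(l, h, s, V):
    """Closed-form expression used in this section."""
    return 12 * l * h * h * (1 + 13 / (12 * h) + (V + s) / (12 * l * h))

l, h, s, V = 40, 6144, 1048, 50257
by_component = count_parameters_by_component(l, h, s, V)
print("Per-component count: {:,}".format(by_component))
print("Matches closed form:", math.isclose(by_component, closed_form(l, h, s, V), rel_tol=1e-9))
```

Under these assumptions the per-layer terms add up to $12h^2 + 13h$ and the embeddings to $(V+s)h$, which is the closed form expanded.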

## Compute the Achieved FLOP per second per GPU

As detailed in the paper [Scaling Language Model Training to a Trillion Parameters Using Megatron](https://arxiv.org/pdf/2104.04473.pdf), the majority of floating-point operations in the model are performed in the matrix multiplications (GEMMs). If we consider only these GEMM operations, the number of FLOPs per iteration is $F = 96 B s l h^2 (1 + \frac{s}{6h} + \frac{V}{16lh})$ where $B$ is the batch size.
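
For reference, the formula can be recovered from per-layer GEMM counts. As a sketch following the accounting in the cited paper: the forward pass of one transformer layer costs $24Bsh^2 + 4Bs^2h$ FLOPs, the backward pass costs twice the forward pass, and activation recomputation adds one more forward pass, hence the factor of 4. The logit layer costs $2BshV$ in the forward pass and is not recomputed, hence the factor of 3. The subscripted names $F_{\text{layer}}$ and $F_{\text{logits}}$ are introduced here for illustration only.

$$
\begin{aligned}
F_{\text{layer}} &= 4 \left(24 B s h^2 + 4 B s^2 h\right) = 96 B s h^2 + 16 B s^2 h \\
F_{\text{logits}} &= 3 \cdot 2 B s h V = 6 B s h V \\
F &= l \, F_{\text{layer}} + F_{\text{logits}} = 96 B s l h^2 \left(1 + \frac{s}{6h} + \frac{V}{16 l h}\right)
\end{aligned}
$$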

Given a measurement of the time spent per iteration, `Time_per_iteration_second`, we can then compute the achieved FLOP per second per GPU and estimate the GPU usage by comparing it with the theoretical peak FLOP/s of the hardware.

The following table shows the training performance of several GPT model sizes (from 1.7B to 1 trillion parameters) pretrained using the Megatron-LM library on a SuperPOD cluster with A100 GPUs.
<img src="https://github.com/NVIDIA/Megatron-LM/blob/main/images/cases_april2021.png?raw=true"/>

In [None]:
# Achieved teraFLOP/s per GPU - with activation checkpointing (2 forwards and 1 backward)
def calculate_Theoretical_peak_FLOP_s_GPU(B, s, l, h, number_GPUs, Time_per_iteration_second, V=50257):
    # V defaults to the vocabulary size used in this section, so existing calls keep working
    # Number of teraFLOPs per iteration (dividing by 1e12 converts FLOPs to teraFLOPs)
    F = 96 * B * s * l * h * h * (1 + s / (6 * h) + V / (16 * l * h)) / 1e12

    # Achieved teraFLOP/s per GPU
    PF = F / Time_per_iteration_second / number_GPUs
    print("Achieved teraFLOP/s per GPU: {}\n".format(PF))

    # Percentage of the theoretical peak of an A100 in FP16/BF16, 312 teraFLOP/s (change according to the hardware)
    GPU_usage = PF / 312 * 100
    print("Percentage of theoretical peak FLOP/s: {}%".format(GPU_usage))

    return PF, GPU_usage

The percentage of theoretical peak FLOP/s in the previous function is based on the **A100** hardware capability in **FP16/BF16, which is 312 teraFLOP/s**. Update this value according to the Tensor Core performance specs of the GPU you use. To learn more, see the [Ampere architecture specifications](https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/).
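
Since the peak depends on the GPU, one option is to pass it as an argument instead of editing the constant. The sketch below is a minimal, self-contained variant; the function names and the `peak_tflops` parameter are hypothetical, and only the A100 value of 312 teraFLOP/s comes from this section.

```python
def achieved_tflops_per_gpu(B, s, l, h, V, n_gpus, sec_per_iter):
    """Achieved teraFLOP/s per GPU from the GEMM FLOP count per iteration."""
    flops_per_iter = 96 * B * s * l * h * h * (1 + s / (6 * h) + V / (16 * l * h))
    return flops_per_iter / sec_per_iter / n_gpus / 1e12

def percent_of_peak(achieved_tflops, peak_tflops=312.0):
    """Share of the hardware peak; the default is the A100 FP16/BF16 peak quoted above."""
    return 100.0 * achieved_tflops / peak_tflops

# Hypothetical usage: substitute the peak from the spec sheet of the GPU in use.
print(percent_of_peak(156.0))                     # A100 default peak
print(percent_of_peak(156.0, peak_tflops=624.0))  # made-up peak, for illustration only
```

Keeping the vocabulary size and the peak as explicit arguments avoids relying on notebook-level variables.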


Consider our previous 18B-parameter model pretrained on 16 GPUs with a global batch size of 512, for which the measured time per iteration is 32.09s.
Let's compute the achieved teraFLOP/s per GPU and the corresponding percentage of the theoretical peak:

In [None]:
global_batch_size=512
number_GPUs=16
Time_per_iteration_second=32.09

# Considering the 18B parameters model
l=40
h=6144
s=1048
V=50257
    
PF,GPU_usage=calculate_Theoretical_peak_FLOP_s_GPU(global_batch_size,s, l,h,number_GPUs,Time_per_iteration_second)

## Estimate the Training Duration / Epoch

It is possible to estimate the training duration per epoch from the model, dataset, and hardware size. The training time (in seconds) is approximated by $\frac{8 T P}{n \cdot PF}$ where:
- $T$ = Number of tokens in the dataset
- $P$ = Number of parameters
- $n$ = Number of GPUs
- $PF$ = Achieved FLOP/s per GPU (the teraFLOP/s value computed above multiplied by $10^{12}$)

More details are described in the paper [Scaling Language Model Training to a Trillion Parameters Using Megatron](https://arxiv.org/pdf/2104.04473.pdf).
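
The factor of 8 can be traced back to the formulas for $F$ and $P$. As a sketch, assume $6h \gg s$ and $16lh \gg V + s$, so that $F \approx 96 B s l h^2$ and $P \approx 12 l h^2$. Writing $I = T/(Bs)$ for the number of iterations needed to process $T$ tokens (a symbol introduced here for illustration):

$$
\text{total FLOPs} = I \cdot F \approx \frac{T}{B s} \cdot 96 B s l h^2 = 96 \, T l h^2 = 8 \, T \cdot 12 l h^2 \approx 8 T P
$$

Dividing the total by the aggregate throughput $n \cdot PF$ gives the training time.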

Let's run the following two cells to estimate the training duration of the 18B-parameter model on a dataset of $T$ = 300 billion tokens:

In [None]:
from termcolor import colored

# Estimate the training time
def estimate_days_needed(T, P, N, PF):
    # PF is in teraFLOP/s, so multiply by 1e12 to get FLOP/s
    compute_sec = 8 * T * P / (N * PF * 1e12)
    # Convert compute seconds to days
    to_days = round(compute_sec / (3600 * 24))
    print("This language model will need {} days per epoch.".format(colored(str(to_days), 'blue', attrs=['bold'])))

In [None]:
# Number of tokens in the dataset (300 billion)
T = 300e9

estimate_days_needed(T, P, number_GPUs, PF)

The result is 203 days: almost **7 months** to train the 18B model on 16 GPUs (2 nodes) with a dataset of 300B tokens!

In this case, scaling the number of nodes is unavoidable in order to train the model in a reasonable amount of time. 

For instance, consider a GPT-3 model with $P$ = 175 billion parameters trained on a dataset of $T$ = 300 billion tokens on $n$ = 1024 A100 GPUs. Using a batch size of 1536, we achieve $PF$ = 140 teraFLOP/s per GPU. Thus, the time required to train this model is about **34 days**.
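
The GPT-3 figure can be reproduced directly from the formula, and the same formula can be inverted to ask how many GPUs a target duration would take. The sketch below is an illustration only: `training_days` and `gpus_needed` are hypothetical helper names, and `gpus_needed` assumes the per-GPU throughput stays constant as GPUs are added, which real clusters only approximate.

```python
import math

SECONDS_PER_DAY = 24 * 3600

def training_days(T, P, n_gpus, pf_tflops):
    """Training time in days: 8*T*P / (n * PF), with PF given in teraFLOP/s."""
    return 8 * T * P / (n_gpus * pf_tflops * 1e12) / SECONDS_PER_DAY

def gpus_needed(T, P, pf_tflops, target_days):
    """Smallest GPU count meeting target_days, assuming constant per-GPU throughput."""
    gpu_seconds = 8 * T * P / (pf_tflops * 1e12)
    return math.ceil(gpu_seconds / (target_days * SECONDS_PER_DAY))

# GPT-3 example from the text: 175B parameters, 300B tokens, 1024 GPUs at 140 teraFLOP/s
print("GPT-3: {:.1f} days".format(training_days(300e9, 175e9, 1024, 140)))

# Hypothetical target: finish within 34 days at the same per-GPU throughput
print("GPUs needed:", gpus_needed(300e9, 175e9, 140, target_days=34))
```

Because the estimate is linear in $1/n$, doubling the GPU count halves the predicted duration only as long as the achieved throughput per GPU holds.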