## Estimating the compute time (in days) needed for training, using the equation from the paper "Efficient Large-Scale Language Model Training on GPU Clusters"

![training time estimate](TrainingTimeEstimate.JPG)

paper : https://arxiv.org/pdf/2104.04473.pdf

Given the following information:

- T = 300*1e+9, the dataset size, measured in number of tokens
- P = 175*1e+9, the number of model parameters, for GPT3
- n = 640, the number of GPUs in the compute cluster
- X, the achieved throughput per GPU, measured in FLOP/s
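
Written out, the estimate from the paper has the form below. This is a transcription that assumes the image above reproduces the paper's end-to-end training time formula. The worked numbers take the 175B case with n = 1024 and X = 140*1e+12, the values used in the code cell further down.

$$
\text{training time (seconds)} \approx \frac{8\,T\,P}{n\,X}
$$

$$
\frac{8 \times 300\times10^{9} \times 175\times10^{9}}{1024 \times 140\times10^{12}}
= \frac{4.2\times10^{23}}{1.4336\times10^{17}}
\approx 2.93\times10^{6}\ \text{s}
\approx 33.9\ \text{days}
$$

The factor 8 can be read as roughly 2 FLOPs per parameter per token for the forward pass, 4 for the backward pass, and 2 more for recomputing the forward pass, since the paper assumes activation recomputation.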

------------------------------------------------------------------------
![GPT3 variants ](GPT3_all.png)
paper : https://arxiv.org/pdf/2005.14165.pdf
![Cases](images/cases_jan2021.png)

In [36]:
T = 300 * 1e+9  # dataset size, in number of tokens
# P = 175 * 1e+9  # number of model parameters (set per model below)
n = 1024  # number of GPUs in the compute cluster (Berzelius 680)

def calculate_days_needed(T, P, n, x):
    """Estimate the training time in days as 8*T*P / (n*x)."""
    if x is None:
        # no per-GPU throughput given: return the placeholder estimate
        return '1-2 weeks'
    # x is the achieved throughput per GPU in FLOP/s, e.g. 140*1e+12
    total_flops = 8 * T * P
    cluster_flops_per_sec = n * x
    compute_sec = total_flops / cluster_flops_per_sec
    # convert compute seconds to days
    return round(compute_sec / (3600 * 24), 1)

# labels, parameter counts and display strings must line up index by index
GPT3_models_labels = ['gpt3_2.7B', 'gpt3_6.7B', 'gpt3_13B', 'gpt3_175B']
GPT3_model_params = [2.7*1e+9, 6.7*1e+9, 13*1e+9, 175*1e+9]
GPT3_model_params_str = ['2.7 Billion', '6.7 Billion', '13 Billion', '175 Billion']
# achieved FLOP/s per GPU, according to the table above
GPT3_X = [127*1e+12, 130*1e+12, 127*1e+12, 140*1e+12]
print("all results below assume a dataset size of **300 billion** tokens \n")
for gpt3_name, gpt3_params, gpt3_param_str, x in zip(GPT3_models_labels, GPT3_model_params, GPT3_model_params_str, GPT3_X):
    days_needed = calculate_days_needed(T, gpt3_params, n, x)
    print(" ----------------------------------------------------------------------------------------")
    print(" language model: {} with {} parameters needs {} days to compute \n".format(gpt3_name, gpt3_param_str, days_needed))

all results below assume a dataset size of **300 billion** tokens 

 ----------------------------------------------------------------------------------------
 language model: gpt3_2.7B with 2.7 Billion parameters needs 0.6 days to compute 

 ----------------------------------------------------------------------------------------
 language model: gpt3_6.7B with 6.7 Billion parameters needs 1.4 days to compute 

 ----------------------------------------------------------------------------------------
 language model: gpt3_13B with 13 Billion parameters needs 2.8 days to compute 

 ----------------------------------------------------------------------------------------
 language model: gpt3_175B with 175 Billion parameters needs 33.9 days to compute 
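
The same estimate can be used to see how cluster size affects the 175B result. The sketch below compares the three GPU counts that appear in this notebook (640, 680 and 1024). The helper name `estimate_days` is hypothetical. The sketch also assumes the per-GPU throughput stays at 140*1e+12 FLOP/s for every cluster size, which real clusters will not match exactly.

```python
# Sketch: sensitivity of the 175B estimate to the number of GPUs.
# Assumption (not from the paper): per-GPU throughput is held constant
# at 140 TFLOP/s for every cluster size.

def estimate_days(tokens, params, num_gpus, flops_per_gpu):
    """Training time in days from the 8*T*P / (n*X) estimate."""
    seconds = 8 * tokens * params / (num_gpus * flops_per_gpu)
    return seconds / (3600 * 24)

T = 300e9   # dataset size, in tokens
P = 175e9   # number of model parameters
X = 140e12  # FLOP/s per GPU (assumed constant across cluster sizes)

gpu_counts = [640, 680, 1024]  # GPU counts mentioned in this notebook
days_by_gpus = {n: round(estimate_days(T, P, n, X), 1) for n in gpu_counts}

for n, days in days_by_gpus.items():
    print("{:>5} GPUs -> {} days".format(n, days))
```

Because the estimate is inversely proportional to n, halving the cluster doubles the training time under this constant-throughput assumption.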



In [37]:
# For a model with 1 trillion parameters
T = 450 * 1e+9   # dataset size, in number of tokens
P = 1e+12        # number of model parameters
n = 3072         # number of GPUs in the compute cluster
X = 163 * 1e+12  # achieved FLOP/s per GPU
total_compute_seconds = (8 * T * P) / (n * X)
# convert compute seconds to days
tot_in_days = round(total_compute_seconds / (3600 * 24), 3)
print("For a model with **1 trillion** parameters, we scale up the dataset size to **450 billion** tokens \n")
print("it will take about {} days to compute".format(tot_in_days))

For a model with **1 trillion** parameters, we scale up the dataset size to **450 billion** tokens 

it will take about 83.211 days to compute
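
The formula can also be inverted to ask how many GPUs a given deadline requires. The sketch below solves 8*T*P / (n*X) <= target for n, using the 1 trillion parameter case from the cell above. The function name `gpus_needed` and the 60-day target are hypothetical, and the per-GPU throughput is assumed to stay at 163*1e+12 FLOP/s as n changes.

```python
import math

def gpus_needed(tokens, params, flops_per_gpu, target_days):
    """Smallest GPU count n such that 8*T*P / (n*X) fits in target_days."""
    target_seconds = target_days * 24 * 3600
    return math.ceil(8 * tokens * params / (flops_per_gpu * target_seconds))

T = 450e9   # dataset size, in tokens
P = 1e12    # number of model parameters
X = 163e12  # FLOP/s per GPU (assumed unchanged as the cluster grows)

# hypothetical deadline of 60 days
n_60_days = gpus_needed(T, P, X, target_days=60)
print("GPUs needed to finish within 60 days: {}".format(n_60_days))
```

Rounding up with `math.ceil` keeps the estimate on the safe side, since a fractional GPU count would miss the deadline.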


![the power law](Compute_Datasize_Parameters.JPG)
paper : https://arxiv.org/pdf/2001.08361.pdf