## Estimating training time (in days) with the equation from the paper "Efficient Large-Scale Language Model Training on GPU Clusters"

![training time estimate](TrainingTimeEstimate.JPG)

paper : https://arxiv.org/pdf/2104.04473.pdf

Given the following information:
- T=300*1e+9 : dataset size, measured in number of tokens
- P=175*1e+9 : number of model parameters, for GPT-3
- n=640 : number of GPUs in the compute cluster
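
As a worked example, the estimate from the paper can be evaluated by hand for these values. This is a sketch that assumes X = 140 teraFLOP/s of achieved throughput per GPU, the value hard-coded in `calculate_days_needed` below; per the paper, 8TP approximates the total floating-point operations of training when activation recomputation is used.

$$
\text{training time} \approx \frac{8TP}{nX}
= \frac{8 \times (300 \times 10^{9}) \times (175 \times 10^{9})}{640 \times (140 \times 10^{12})}
= \frac{4.2 \times 10^{23}}{8.96 \times 10^{16}}
\approx 4.69 \times 10^{6}\ \text{s}
\approx 54\ \text{days}
$$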

------------------------------------------------------------------------
![GPT3 variants ](GPT3_all.png)
paper : https://arxiv.org/pdf/2005.14165.pdf

In [32]:
def superscript(n):
    return "".join(["⁰¹²³⁴⁵⁶⁷⁸⁹"[ord(c)-ord('0')] for c in str(n)]) 


In [31]:
import numpy as np
T=300*1e+9 # number of tokens in the dataset
#P=175*1e+9 # number of model parameters (set per model in the loop below)
n=640 # number of GPUs in the compute cluster
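def prettify(x):
    # format a number as mantissa x 10^exponent, e.g. 300000000000 -> 3x10¹¹
    s=str(int(x))
    num=s.rstrip('0') or '0' # digits left after dropping trailing zeros
    diff=len(s)-len(num) # number of trailing zeros = power of ten
    return num+'x10'+superscript(diff)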

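def calculate_days_needed(T , P , n ):
    # estimated training time in days: 8*T*P / (n*X), converted from seconds
    X=140*1e+12 # achieved throughput per GPU in FLOP/s (140 teraFLOP/s)
    tot=8*T*P # approximate total floating-point operations for training
    div=n*X # aggregate cluster throughput in FLOP/s
    compute_sec=tot/div
    #convert compute seconds to days
    to_days=round(compute_sec/(3600*24),2)
    return to_days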

GPT3_models_labels=['gpt3_small', 'gpt3_medium', 'gpt3_large', 'gpt3_XL', 'gpt3_2.7B', 'gpt3_6.7B','gpt3_13B', 'gpt3_175B']
GPT3_model_params=[125*1e+6, 350*1e+6 , 760*1e+6, 1.3*1e+9 , 2.7*1e+9, 6.7*1e+9 , 13*1e+9, 175*1e+9]
# one display string per model; the three lists must have equal length because zip() stops at the shortest one
GPT_model_params_str=['125 Million','350 Million', '760 Million' ,'1.3 Billion' ,'2.7 Billion', '6.7 Billion', '13 Billion', '175 Billion']

for gpt3_name, gpt3_params, gpt3_param_str in zip(GPT3_models_labels,GPT3_model_params,GPT_model_params_str ):
    days_needed=calculate_days_needed(T,gpt3_params,n)
    print(" ----------------------------------------------------------------------------------------")
    print(" language model :{} with {} number of parameters , it will need {} days to compute \n".format(gpt3_name, gpt3_param_str, str(days_needed)))

 ----------------------------------------------------------------------------------------
 language model :gpt3_small with 125 Million number of parameters , it will need 0.04 days to compute 

 ----------------------------------------------------------------------------------------
 language model :gpt3_medium with 350 Million number of parameters , it will need 0.11 days to compute 

 ----------------------------------------------------------------------------------------
 language model :gpt3_large with 760 Million number of parameters , it will need 0.24 days to compute 

 ----------------------------------------------------------------------------------------
 language model :gpt3_XL with 1.3 Billion number of parameters , it will need 0.4 days to compute 

 ----------------------------------------------------------------------------------------
 language model :gpt3_2.7B with 2.7 Billion number of parameters , it will need 0.84 days to compute 

 ----------------------------------------------------------------------------------------
 language model :gpt3_6.7B with 6.7 Billion number of parameters , it will need 2.08 days to compute 

 ----------------------------------------------------------------------------------------
 language model :gpt3_13B with 13 Billion number of parameters , it will need 4.03 days to compute 

 ----------------------------------------------------------------------------------------
 language model :gpt3_175B with 175 Billion number of parameters , it will need 54.25 days to compute 
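
The same equation can be rearranged to estimate how many GPUs a target training time would require. The following is a minimal sketch, not something taken from either paper: `gpus_needed` and its `target_days` parameter are hypothetical names introduced here, and X is again assumed to be 140 teraFLOP/s per GPU.

```python
import math

def gpus_needed(T, P, target_days, X=140e12):
    """Smallest GPU count n for which 8*T*P / (n*X) fits within target_days.

    X is the assumed achieved throughput per GPU in FLOP/s (140 teraFLOP/s).
    """
    target_sec = target_days * 24 * 3600  # convert the time budget to seconds
    # solve 8*T*P / (n*X) <= target_sec for n and round up to a whole GPU
    return math.ceil(8 * T * P / (X * target_sec))

T = 300e9  # tokens in the dataset
P = 175e9  # parameters of the largest GPT-3 model
print(gpus_needed(T, P, target_days=30))
```

Under these assumptions, a 30-day budget for the 175 Billion parameter model works out to 1158 GPUs, compared with the roughly 54 days estimated above for 640 GPUs.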

![the power law](Compute_Datasize_Parameters.JPG)
source : https://arxiv.org/pdf/2001.08361.pdf