### Training Details (GPT-3 Paper)

| Model Name              | n_params | n_layers | d_model | n_heads | d_head | Batch Size (tokens) | Learning Rate |
|-------------------------|----------|----------|---------|---------|--------|---------------------|---------------|
| GPT-3 Small             | 125M     | 12       | 768     | 12      | 64     | 0.5M                | 6.0 × 10⁻⁴    |
| GPT-3 Medium            | 350M     | 24       | 1024    | 16      | 64     | 0.5M                | 3.0 × 10⁻⁴    |
| GPT-3 Large             | 760M     | 24       | 1536    | 16      | 96     | 0.5M                | 2.5 × 10⁻⁴    |
| GPT-3 XL                | 1.3B     | 24       | 2048    | 24      | 128    | 1M                  | 2.0 × 10⁻⁴    |
| GPT-3 2.7B              | 2.7B     | 32       | 2560    | 32      | 80     | 1M                  | 1.6 × 10⁻⁴    |
| GPT-3 6.7B              | 6.7B     | 32       | 4096    | 32      | 128    | 2M                  | 1.2 × 10⁻⁴    |
| GPT-3 13B               | 13.0B    | 40       | 5140    | 40      | 128    | 2M                  | 1.0 × 10⁻⁴    |
| GPT-3 175B or “GPT-3”   | 175.0B   | 96       | 12288   | 96      | 128    | 3.2M                | 0.6 × 10⁻⁴    |

**Table 2.1:** Sizes, architectures, and learning hyper-parameters (batch size in tokens and learning rate) of the models
which we trained. All models were trained for a total of 300 billion tokens.

**Table 2.1** shows the sizes and architectures of our 8 models. Here n_params is the total number of trainable parameters,
n_layers is the total number of layers, d_model is the number of units in each bottleneck layer (we always have the
feedforward layer four times the size of the bottleneck layer, d_ff = 4 × d_model), and d_head is the dimension of each
attention head. All models use a context window of n_ctx = 2048 tokens. We partition the model across GPUs along
both the depth and width dimension in order to minimize data-transfer between nodes. The precise architectural
parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models
across GPUs. Previous work [KMH+20] suggests that validation loss is not strongly sensitive to these parameters
within a reasonably broad range.
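
As a rough sanity check on the table, the parameter counts can be approximated from the layer count and model width alone. This is a minimal sketch, not the paper's own accounting: it assumes about 12 × d_model² weights per layer (4 × d_model² for attention plus 8 × d_model² for a feed-forward layer with d_ff = 4 × d_model), ignores biases and layer norms, and assumes the GPT-2 BPE vocabulary of 50257 tokens. The function name `approx_params` is made up for illustration.

```python
def approx_params(n_layers, d_model, n_ctx=2048, vocab_size=50257):
    """Rough parameter count for a GPT-style decoder.

    Assumes ~4*d_model^2 attention weights and ~8*d_model^2 feed-forward weights
    per layer (d_ff = 4*d_model); biases and layer norms are ignored.
    vocab_size=50257 is an assumption, it is not listed in the table.
    """
    per_layer = 12 * d_model ** 2
    embeddings = (vocab_size + n_ctx) * d_model  # token + learned position embeddings
    return n_layers * per_layer + embeddings


for name, n_layers, d_model in [
    ("GPT-3 Small", 12, 768),
    ("GPT-3 6.7B", 32, 4096),
    ("GPT-3 175B", 96, 12288),
]:
    print(f"{name}: ~{approx_params(n_layers, d_model) / 1e9:.2f}B parameters")
```

The estimates land within a few percent of the n_params column, which is a quick way to check a custom configuration before training it.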

#### B Details of Model Training

To train all versions of GPT-3, we use **Adam** with **β1 = 0.9**, **β2 = 0.95**, and **ε = 10⁻⁸**, clip the global norm of the gradient at **1.0**, and apply **cosine decay** for the learning rate, reducing it to **10%** of its value over **260 billion tokens** (after which training continues at 10% of the original rate). There is a **linear learning rate warmup** over the first **375 million tokens**, and the batch size is gradually increased from **32k tokens** to the full value over the first **4–12 billion tokens** of training, depending on model size. Data are sampled without replacement until an epoch boundary is reached to minimize overfitting, and all models use a **weight decay of 0.1** for regularization. During training, we always use sequences of the full **2048-token context window**, packing multiple documents into a single sequence when documents are shorter than 2048, with a special **end of text token** delimiting documents to efficiently indicate that separated contexts are unrelated.
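
The learning rate schedule described above can be sketched as a function of tokens seen. This is one reading of the paragraph, not the paper's code: it assumes the cosine decay starts when warmup ends and reaches the 10% floor at 260 billion tokens. The function name `gpt3_lr` and its parameters are hypothetical.

```python
import math


def gpt3_lr(tokens_seen, peak_lr, warmup_tokens=375e6, decay_tokens=260e9, floor_ratio=0.1):
    """Learning rate after `tokens_seen` tokens: linear warmup, then cosine decay to a floor."""
    if tokens_seen < warmup_tokens:
        # linear warmup from 0 to peak_lr
        return peak_lr * tokens_seen / warmup_tokens
    if tokens_seen >= decay_tokens:
        # after the decay window, training continues at the floor
        return peak_lr * floor_ratio
    # assumption: the cosine window runs from the end of warmup to decay_tokens
    progress = (tokens_seen - warmup_tokens) / (decay_tokens - warmup_tokens)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (floor_ratio + (1.0 - floor_ratio) * cosine)


for tokens in (0, 375e6, 130e9, 260e9, 300e9):
    print(f"{tokens:.3g} tokens -> lr {gpt3_lr(tokens, peak_lr=2.5e-4):.3e}")
```

As far as I know, the `"cosine"` schedule from `transformers.get_scheduler` decays toward zero at the last training step rather than stopping at a 10% floor, so it only approximates this recipe.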


### Login to the hub to push checkpoints and final model

In [None]:
from huggingface_hub import HfApi
import getpass
import os 

hf_token = getpass.getpass("Enter your Hugging Face token: ")
os.environ["HF_TOKEN"] = hf_token
repo_name  = "cwestnedge/gpt2_test"
api = HfApi(token=os.environ["HF_TOKEN"])

### Load Data

In [None]:
import torch, glob
from torch.utils.data import IterableDataset, DataLoader
from transformers import GPT2LMHeadModel, get_scheduler
import time
import re

class PTIterableDataset(IterableDataset):
    def __init__(self, pt_files):
        self.pt_files = pt_files

    def __iter__(self):
        for file_path in self.pt_files:
            data = torch.load(file_path)
            for i in range(data["input_ids"].size(0)):
                sample = {
                    "input_ids": data["input_ids"][i],
                    "attention_mask": data["attention_mask"][i],
                    "files": file_path.split('/')[-1]
                }
                if data.get("labels") is not None:
                    sample["labels"] = data["labels"][i]
                yield sample


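# really sorry this function is necessary but i screwed up the file naming conventions
# so this is my lame patch: sort by the first integer in the file name (not the whole path)
# so that e.g. 2 comes before 10; files without a number sort first
def extract_number(filename):
    match = re.search(r'(\d+)', filename.split('/')[-1])
    return int(match.group(1)) if match else 0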

# torch.cuda.empty_cache()

loader_batch_size = 4
train_files = sorted(glob.glob("../processed_batches/train/*.pt"), key=extract_number)
test_files = sorted(glob.glob("../processed_batches/test/*.pt"), key=extract_number)

train_loader = DataLoader(PTIterableDataset(train_files), batch_size=loader_batch_size, num_workers=0)
test_loader = DataLoader(PTIterableDataset(test_files), batch_size=loader_batch_size, num_workers=0)

print('-'*50 + 'TRAIN' + '-'*50)
train = next(iter(train_loader))
print(train)
print(train['input_ids'].shape)

print('-'*50 + 'TEST' + '-'*50)
test = next(iter(test_loader))
print(test)
print(test['input_ids'].shape)
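
The numeric sort key matters because plain string sorting puts shard 10 before shard 2, which would scramble the order in which batches are replayed when resuming. A small sketch with made-up file names (the `batch_N.pt` pattern and the helper name `shard_index` are hypothetical, not the real shard names):

```python
import re


def shard_index(filename):
    """First integer in the file name, or 0 when there is none."""
    match = re.search(r"(\d+)", filename)
    return int(match.group(1)) if match else 0


names = ["batch_10.pt", "batch_2.pt", "batch_1.pt"]  # hypothetical shard names

print(sorted(names))                   # lexicographic: 10 sorts before 2
print(sorted(names, key=shard_index))  # numeric: 1, 2, 10
```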

### Define Hyperparameters (currently using ones for models on scale with gpt3-large)

In [None]:
# === METADATA ===
# per the gpt3 paper (375M warmup tokens, 260B decay tokens), treating it as a ratio which may not be the best
gpt3_warmup_step_ratio = 375_000_000 / 260_000_000_000
n_files = 221713
file_batch_size = 16 # sequences stored in each .pt file
max_token_len = 1024 # tokens per sequence
n_tokens_per_file = file_batch_size * max_token_len
total_tokens = n_files * n_tokens_per_file

# === HYPERPARAMETERS ===
# we want to have roughly .5M tokens per batch based on above figure for gpt3
gradient_accumulation_steps = 4 # 128 gives ~.5M tokens per optimizer step
# tokens per optimizer step = sequences per loader batch x tokens per sequence x accumulation steps
# (the old n_tokens_per_file/loader_batch_size version only worked because 16/4 == 4)
tokens_per_batch = loader_batch_size * max_token_len * gradient_accumulation_steps
print(f"Tokens per batch (should be roughly .5M): {tokens_per_batch}")

initial_lr = 2.5e-4 # LR should be between 6e-4 and 2.5e-4 for gpt3 small-large
n_training_steps = total_tokens // tokens_per_batch # whole number of optimizer steps
n_warmup_steps = round(n_training_steps * gpt3_warmup_step_ratio)
print(f"N warmup steps (could be {gpt3_warmup_step_ratio*100:.2f}% of {n_training_steps} training_steps) => {n_warmup_steps} steps")

beta1, beta2 = 0.9, 0.95 # these may need to be changed to fit our training assumptions
max_grad_norm = 1.0 # paper uses 1 i think this generalizes to our usecase
weight_decay = .10 # i believe this still makes sense


In [None]:
gradient_accumulation_steps * loader_batch_size
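
To see which accumulation setting gets close to the 0.5M-token batches in the table, the arithmetic can be worked through directly. This sketch reuses the notebook's numbers (1024-token sequences, 4 sequences per loader batch, 221713 files of 16 sequences). Rounding to the nearest power of two is my own choice here, not something the paper prescribes.

```python
import math

max_token_len = 1024
loader_batch_size = 4
file_batch_size = 16
n_files = 221713
target_tokens_per_step = 500_000  # "0.5M" from the table

tokens_per_micro_batch = loader_batch_size * max_token_len
exact_accum = target_tokens_per_step / tokens_per_micro_batch
accum = 2 ** round(math.log2(exact_accum))  # assumption: snap to a power of two

tokens_per_step = tokens_per_micro_batch * accum
total_tokens = n_files * file_batch_size * max_token_len
n_steps = total_tokens // tokens_per_step
n_warmup = round(n_steps * 375e6 / 260e9)

print(f"accumulation steps: {accum} (exact ratio {exact_accum:.2f})")
print(f"tokens per optimizer step: {tokens_per_step}")
print(f"optimizer steps over the dataset: {n_steps}, warmup steps: {n_warmup}")
```

With so few optimizer steps in one pass, the GPT-3 warmup ratio leaves only a handful of warmup steps, which is one reason the ratio may not transfer well to this scale.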

### Helper functions to handle weight decay and checkpointing

In [None]:
def get_grouped_params(model, weight_decay, no_decay=("bias", "LayerNorm.weight", "ln_")):
    '''handy function for setting weight decay, shoutout to the hugging face book.
    GPT-2 names its layer norms ln_1 / ln_2 / ln_f, so "ln_" is needed to exclude them.'''
    params_with_wd, params_without_wd = [], []
    for n, p in model.named_parameters():
        # substring match on the parameter name decides which group it lands in
        if any(nd in n for nd in no_decay):
            params_without_wd.append(p)
        else:
            params_with_wd.append(p)
    return [{'params': params_with_wd, 'weight_decay': weight_decay},
            {'params': params_without_wd, 'weight_decay': 0.0}]

def save_checkpoint_metadata(optimizer, lr_scheduler, step, losses, batch_file, scaler=None):
    checkpoint = {
        'optimizer': optimizer.state_dict(),
        'lr_scheduler': lr_scheduler.state_dict(),
        'global_step': step,
        'losses': losses,
        'batch_file': batch_file,
    }

    # scaler is for GPU only since doing fp16 on GPU
    if scaler is not None:
        checkpoint['scaler'] = scaler.state_dict()

    # checkpoint file locally so we can easily push to hub
    torch.save(checkpoint, "training_state.pt")


def push_model_and_state_to_hub(model, api, step, max_retries=3, retry_delay=10):
    for attempt in range(1, max_retries+1):
        try:
            # FIRST push the model to the hub (creates the repo if it doesn't exist)
            model.push_to_hub(repo_name, commit_message=f"Checkpoint at step {step}")

            # THEN the training state file
            api.upload_file(
                path_or_fileobj="training_state.pt",
                path_in_repo="training_state.pt",
                repo_id=repo_name,
                commit_message=f"Training state at step {step}"
            )
            return  # both uploads succeeded, so stop retrying
        except Exception as e:
            print(f"Attempt {attempt} failed: {e}")
            if attempt == max_retries:
                print("Max attempts reached. Exiting.")
                raise
            time.sleep(retry_delay)
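
Because the weight-decay grouping matches on substrings of parameter names, the patterns have to agree with how the model names its layers. The sketch below uses a few names written from memory of the Hugging Face GPT-2 layout; treat them as an assumption and compare against `model.named_parameters()` on the real model. The helper `excluded_from_decay` is hypothetical.

```python
# Assumed GPT-2 style parameter names; verify with model.named_parameters().
GPT2_STYLE_NAMES = [
    "transformer.wte.weight",
    "transformer.h.0.ln_1.weight",
    "transformer.h.0.ln_1.bias",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.0.attn.c_attn.bias",
    "transformer.h.0.mlp.c_fc.weight",
    "transformer.ln_f.weight",
]


def excluded_from_decay(names, no_decay):
    """Names that match at least one no-decay pattern."""
    return [n for n in names if any(pattern in n for pattern in no_decay)]


for patterns in (("bias", "LayerNorm.weight"), ("bias", "ln_")):
    print(patterns, "->", excluded_from_decay(GPT2_STYLE_NAMES, patterns))
```

Under these names the BERT-style pattern `"LayerNorm.weight"` never matches, so layer-norm weights would still be decayed, while a pattern such as `"ln_"` catches them.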

### Initialize model, optimizer, and scheduler for training

In [None]:
model = torch.compile(GPT2LMHeadModel.from_pretrained("openai-community/gpt2"))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.train()

model_params_with_decay = get_grouped_params(model, weight_decay=weight_decay)
optimizer = torch.optim.AdamW(
    model_params_with_decay, # default would be model.parameters()
    lr=initial_lr,
    betas=(beta1, beta2), # without this AdamW falls back to its defaults (0.9, 0.999)
    )

lr_scheduler = get_scheduler(
    name="cosine", 
    optimizer=optimizer, 
    num_warmup_steps=n_warmup_steps, 
    num_training_steps=n_training_steps
    )

### Train model (GPU)

In [None]:
# loss_history = []
# global_step = 0
# scaler = torch.amp.GradScaler("cuda") # for floating point 16 (GPU only)

# for step, batch in enumerate(train_loader):
#     input_ids = batch['input_ids'].to(device)
#     attention_mask = batch['attention_mask'].to(device)
#     batch_file_names = set(batch['files']) # helps us keep track of where training left off at

#     # for GPU only 
#     with torch.autocast(device_type="cuda"):
#         # labels and input are same since GPT2LMHeadModel will perform shift internally
#         # if computing loss externally you will want to shift labels then pass to loss_fn
#         outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
#         loss = outputs.loss

#     loss = loss / gradient_accumulation_steps
#     scaler.scale(loss).backward()

#     # optimizer step...
#     if (step + 1) % gradient_accumulation_steps == 0:
#         # unscale, clip, step, update scaler & scheduler
#         scaler.unscale_(optimizer)
#         torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
#         scaler.step(optimizer)
#         scaler.update()

#         lr_scheduler.step()
#         optimizer.zero_grad()
#         global_step += 1

#         loss_to_log = loss.item() * gradient_accumulation_steps
#         loss_history.append(loss_to_log)
#         print(f"Global step {global_step}, loss: {loss_to_log:.4f}")

#         if global_step % 100 == 0:
#             commit_msg = f"Checkpoint at step {global_step}"
#             save_checkpoint_metadata(
#                 optimizer=optimizer,
#                 lr_scheduler=lr_scheduler,
#                 scaler=scaler,
#                 step=global_step,
#                 losses=loss_history,
#                 batch_file=batch_file_names
#             )

#             print('Pushing to hub...')
#             push_model_and_state_to_hub(
#                 model=model,
#                 api=api,
#                 step=global_step,
#                 max_retries=3,
#                 retry_delay=10
#             )

# if (step + 1) % gradient_accumulation_steps != 0:
#     # perform a final optimizer step to flush any remaining gradients
#     scaler.unscale_(optimizer)
#     torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
#     scaler.step(optimizer)
#     scaler.update()

#     lr_scheduler.step()
#     optimizer.zero_grad()
#     global_step += 1

#     loss_to_log = loss.item() * gradient_accumulation_steps
#     loss_history.append(loss_to_log)

#     print(f"Performed final optimizer step to flush remaining gradients at global step {global_step}")

# # final commit
# print(f'FIRST PASS COMPLETE AT STEP {step}')
# save_checkpoint_metadata(
#     optimizer=optimizer,
#     lr_scheduler=lr_scheduler,
#     scaler=scaler,
#     step=global_step,
#     losses=loss_history,
#     batch_file=batch_file_names
# )
# push_model_and_state_to_hub(
#     model=model,
#     api=api,
#     step=global_step,
#     max_retries=3,
#     retry_delay=10
# )

### Train Model (CPU for testing only)

In [None]:
loss_history = []
global_step = 0
save_steps = 1
for step, batch in enumerate(train_loader):
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    batch_file_names = set(batch['files'])  # helps us keep track of where training left off at

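    # forward pass (no autocast for CPU)
    # labels and input are the same since GPT2LMHeadModel performs the shift internally;
    # padded positions are set to -100 so the loss ignores them (no-op if nothing is padded)
    labels = input_ids.masked_fill(attention_mask == 0, -100)
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss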

    loss = loss / gradient_accumulation_steps
    loss.backward()  # Standard backward pass without scaling

    if (step + 1) % gradient_accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()
        lr_scheduler.step()
        global_step += 1

        loss_to_log = loss.item() * gradient_accumulation_steps
        loss_history.append(loss_to_log)
        print(f"Global step {global_step}, loss: {loss_to_log:.4f}")

        if global_step % save_steps == 0:
            commit_msg = f"Checkpoint at step {global_step}"
            save_checkpoint_metadata(
                optimizer=optimizer,
                lr_scheduler=lr_scheduler,
                scaler=None,  # No scaler for CPU training
                step=global_step,
                losses=loss_history,
                batch_file=batch_file_names
            )

            print('Pushing to hub...')
            push_model_and_state_to_hub(
                model=model,
                api=api,
                step=global_step,
                max_retries=3,
                retry_delay=10
            )
    
if (step + 1) % gradient_accumulation_steps != 0:
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()
    lr_scheduler.step()
    global_step += 1

    loss_to_log = loss.item() * gradient_accumulation_steps
    loss_history.append(loss_to_log)
    print(f"Performed final optimizer step to flush remaining gradients at global step {global_step}")

print(f'FIRST PASS COMPLETE AT STEP {step}')
save_checkpoint_metadata(
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    scaler=None,  # no scaler for CPU training
    step=global_step,
    losses=loss_history,
    batch_file=batch_file_names
)
push_model_and_state_to_hub(
    model=model,
    api=api,
    step=global_step,
    max_retries=3,
    retry_delay=10
)
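
The loops above log the loss of the last micro-batch in each accumulation window. If a smoother curve is wanted, one option is to average over all micro-batches that contributed to the optimizer step. A minimal sketch, where the class name `MicroBatchLossMeter` is hypothetical and the values passed to `update` stand in for the unscaled `outputs.loss.item()`:

```python
class MicroBatchLossMeter:
    """Accumulates per-micro-batch losses and reports their mean once per optimizer step."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, loss_value):
        # call once per micro-batch with the unscaled loss
        self.total += float(loss_value)
        self.count += 1

    def pop_mean(self):
        # call right after optimizer.step(); resets the window
        if self.count == 0:
            raise ValueError("no losses recorded since the last optimizer step")
        mean = self.total / self.count
        self.total, self.count = 0.0, 0
        return mean


meter = MicroBatchLossMeter()
for fake_loss in (2.0, 4.0):  # stand-ins for two micro-batch losses
    meter.update(fake_loss)
window_mean = meter.pop_mean()
print(f"mean loss over the accumulation window: {window_mean:.4f}")
```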


### Load checkpoint

In [None]:
# from huggingface_hub import hf_hub_download, HfApi
# import getpass
# import os 

# hf_token = getpass.getpass("Enter your Hugging Face token: ")
# os.environ["HF_TOKEN"] = hf_token
# repo_name  = "cwestnedge/gpt2_test"
# api = HfApi(token=os.environ["HF_TOKEN"])

# training_state_path = hf_hub_download(
#     repo_id=repo_name, 
#     filename="training_state.pt",
#     token=hf_token
# )

# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# checkpoint = torch.load(training_state_path, map_location=torch.device(device))

# # pull relevant stuff
# optimizer_state = checkpoint['optimizer']
# lr_state = checkpoint['lr_scheduler']
# global_step = checkpoint['global_step']
# loss = checkpoint['losses']
# last_batch = checkpoint['batch_file']

In [None]:
# from transformers import GPT2LMHeadModel, get_scheduler
# import torch

# def get_grouped_params(model, weight_decay, no_decay=("bias", "LayerNorm.weight", "ln_")):
#     '''handy function for setting weight decay, shoutout to the hugging face book.
#     GPT-2 names its layer norms ln_1 / ln_2 / ln_f, so "ln_" is needed to exclude them.'''
#     params_with_wd, params_without_wd = [], []
#     for n, p in model.named_parameters():
#         if any(nd in n for nd in no_decay):
#             params_without_wd.append(p)
#         else:
#             params_with_wd.append(p)
#     return [{'params': params_with_wd, 'weight_decay': weight_decay},
#             {'params': params_without_wd, 'weight_decay': 0.0}]


# model_t = GPT2LMHeadModel.from_pretrained(repo_name, token=hf_token)
# model_t.to(device);
# grouped_params = get_grouped_params(model_t, weight_decay=.10)

# optimizer = torch.optim.AdamW( # same optimizer class as in training, so the saved state matches
#     grouped_params, 
#     lr=lr_state['_last_lr'][0]
#     )
# optimizer.load_state_dict(checkpoint['optimizer'])

# n_warmup_steps = 100  # example value adjust accordingly
# n_training_steps = 1000  # example value adjust accordingly
# lr_scheduler = get_scheduler(
#     name="cosine", 
#     optimizer=optimizer, 
#     num_warmup_steps=n_warmup_steps, # original values
#     num_training_steps=n_training_steps # original value
#     )
# lr_scheduler.load_state_dict(lr_state)
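
The checkpoint also records which file the last batch came from, and that can be used to rebuild the file list when resuming. The sketch below is one possible approach, not part of the original notebook: the helper `remaining_files` and the file names are hypothetical. It restarts at the file after the last recorded one, so any unread samples in that file are skipped. Restarting at the recorded file instead would replay a few samples.

```python
def remaining_files(all_files, last_batch_files):
    """Files that come after the last file recorded in the checkpoint.

    all_files: sorted list of shard paths, in training order.
    last_batch_files: collection of base names stored under checkpoint['batch_file'].
    """
    names = [path.split("/")[-1] for path in all_files]
    seen = [i for i, name in enumerate(names) if name in last_batch_files]
    if not seen:
        # nothing recorded (or names changed): fall back to the full list
        return list(all_files)
    return list(all_files[max(seen) + 1:])


# hypothetical shard paths and checkpoint contents
files = ["train/batch_1.pt", "train/batch_2.pt", "train/batch_3.pt"]
todo = remaining_files(files, {"batch_2.pt"})
print(todo)
```

The returned list can be passed to `PTIterableDataset` in place of the full `train_files` list.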
