<a href="https://colab.research.google.com/github/brennenho/brennen.dev/blob/main/pythia_pretraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install dependencies from PyPI.
- `datasets`: Hugging Face library for loading and processing datasets
- `transformers`: Hugging Face library with pretrained models and tokenizers
- `optuna`: automated hyperparameter optimization framework

In [1]:
!pip install datasets transformers optuna -q


Load the `pythia-14m` model and its tokenizer from Hugging Face.
If the tokenizer has no padding token, fall back to the end-of-sequence token.


In [2]:
import torch
from transformers import GPTNeoXForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-14m"

# check device availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# load from hugging face
model = GPTNeoXForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float32,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

model.to(device)

# fallback padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

Using device: cuda




config.json:   0%|          | 0.00/595 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/53.3M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/264 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

Load the 10m subset of Dolma from Hugging Face. During loading, we tokenize the text, truncate sequences longer than 64 tokens, and pad shorter ones to that length. Tokenization runs in batches of 32 examples.

In [3]:
from datasets import load_dataset

def load_and_process_data(dataset_name, split="train"):
    dataset = load_dataset(dataset_name, split=split)

    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            truncation=True,
            max_length=64,
            padding="max_length"
        )

    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        batch_size=32,
        remove_columns=["text"]
    )

    return tokenized_dataset

dataset = load_and_process_data("fionac411/dolma-10m")

README.md:   0%|          | 0.00/278 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/28.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/27928 [00:00<?, ? examples/s]

Map:   0%|          | 0/27928 [00:00<?, ? examples/s]
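To make the effect of `truncation=True`, `max_length=64`, and `padding="max_length"` concrete, here is a minimal pure-Python sketch of the fixed-length behavior on lists of token IDs. The helper `pad_or_truncate` and the pad ID `0` are hypothetical stand-ins for illustration, not part of the `transformers` API, and right-side padding is assumed.

```python
def pad_or_truncate(token_ids, max_length=64, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Truncation: keep only the first max_length tokens.
    ids = list(token_ids[:max_length])
    # Attention mask: 1 for real tokens, 0 for padding.
    mask = [1] * len(ids)
    # Padding: append pad_id until the sequence reaches max_length.
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# A short sequence is padded, a long one is cut (max_length=8 for brevity).
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(100, 120)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, which is why the batches in the later cells can be stacked into tensors without further padding.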

Evaluate the initial model on the loaded dataset. The evaluation computes the average per-token loss and the perplexity.

In [4]:
import math
from torch.nn import CrossEntropyLoss

def evaluate_model(model, dataset, batch_size=8):
    model.eval()
    total_loss = 0.0
    total_samples = 0
    loss_fn = CrossEntropyLoss(ignore_index=tokenizer.pad_token_id, reduction='sum')

    # process in batches
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i:i+batch_size]
        input_ids = torch.tensor(batch["input_ids"]).to(device)
        attention_mask = torch.tensor(batch["attention_mask"]).to(device)

        with torch.no_grad():
            # forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits

            # shift for causal language modeling
            shift_logits = logits[:, :-1, :].contiguous()
            shift_labels = input_ids[:, 1:].contiguous()

            # calculate loss
            loss = loss_fn(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1))

            total_loss += loss.item()
            total_samples += (input_ids[:, 1:] != tokenizer.pad_token_id).sum().item()

    # calculate perplexity
    avg_loss = total_loss / total_samples
    perplexity = math.exp(avg_loss)

    return (avg_loss, perplexity)

# initial model evaluation
initial_eval = evaluate_model(model, dataset)
print("Initial model evaluation:")
print(f"\tLoss: {initial_eval[0]}")
print(f"\tPerplexity: {initial_eval[1]}")

Initial model evaluation:
	Loss: 5.022291708993566
	Perplexity: 151.7586923453805
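The evaluation sums the cross-entropy over non-padding target tokens, divides by the number of such tokens, and exponentiates the result. The sketch below reproduces that last step in plain Python. The function name `perplexity_from_token_losses` and the toy loss values are made up for illustration; the final lines check that the reported perplexity is the exponential of the reported loss.

```python
import math

def perplexity_from_token_losses(token_losses, keep_mask):
    """Perplexity = exp(mean negative log-likelihood over kept tokens)."""
    kept = [loss for loss, keep in zip(token_losses, keep_mask) if keep]
    avg_loss = sum(kept) / len(kept)
    return avg_loss, math.exp(avg_loss)

# Toy per-token losses; the last two positions are padding and are skipped.
toy_losses = [2.0, 4.0, 6.0, 9.9, 9.9]
toy_mask = [1, 1, 1, 0, 0]
avg, ppl = perplexity_from_token_losses(toy_losses, toy_mask)
print(avg, ppl)

# Consistency check against the numbers printed by the notebook.
reported_loss = 5.022291708993566
reported_ppl = 151.7586923453805
print(math.isclose(math.exp(reported_loss), reported_ppl, rel_tol=1e-9))
```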


Use a data collator to batch and pad the data dynamically. With `mlm=False`, it prepares inputs and labels for next-token prediction.

In [5]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=8
)
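With `mlm=False`, the collator builds `labels` as a copy of `input_ids` and replaces padding positions with `-100`, the index that the loss ignores. The sketch below mimics that labeling step in plain Python; `make_causal_labels` is a hypothetical helper and the pad ID `0` is an assumption for illustration. If the padding fallback in the model-loading cell was triggered, the padding token equals the end-of-sequence token, so genuine end-of-sequence tokens are masked as well. Since every sequence is already padded to 64 tokens (a multiple of 8), `pad_to_multiple_of=8` should add no further padding here.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default

def make_causal_labels(input_ids, pad_id):
    """Copy input_ids and mask every padding position with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_id else tok for tok in row]
        for row in input_ids
    ]

# Two toy sequences, right-padded with the assumed pad ID 0.
batch = [[5, 6, 7, 0], [8, 9, 0, 0]]
labels = make_causal_labels(batch, pad_id=0)
print(labels)
```

The shift by one position for next-token prediction is not done by the collator; the model applies it internally when it receives `labels`.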

Instead of manually testing for the best hyperparameters, we run multiple training trials with Optuna and try to minimize the evaluation loss. Each trial uses a freshly loaded copy of the pretrained model.

In [6]:
import optuna
from transformers import Trainer, TrainingArguments

# initialize a new model for each trial run
def model_init():
    return GPTNeoXForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.float32
    ).to(device)

training_args = TrainingArguments(
    output_dir="./pythia-sweep",
    overwrite_output_dir=True,
    report_to="none",
    num_train_epochs=1,
)

# create trainer
trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=dataset,
    data_collator=data_collator,
)

def hp_space(trial: optuna.Trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [2,4,8]),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.3),
        "warmup_steps": trial.suggest_int("warmup_steps", 0, 500),
    }

best_run = trainer.hyperparameter_search(
    direction="minimize",  # minimize eval loss
    hp_space=hp_space,
    backend="optuna",
    n_trials=5,
)

best_hps = best_run.hyperparameters

print("Best hyperparameters:")
print(best_hps)

[I 2025-04-20 18:47:15,839] A new study created in memory with name: no-name-f216fbed-c680-4886-9914-5e7a1b7f9cbb


Step,Training Loss
500,4.9261
1000,4.9352
1500,4.8708
2000,4.8783
2500,4.8386
3000,4.8169


[I 2025-04-20 18:49:20,082] Trial 0 finished with value: 4.827736854553223 and parameters: {'learning_rate': 5.9146344512878165e-06, 'per_device_train_batch_size': 8, 'weight_decay': 0.037169267735569375, 'warmup_steps': 244}. Best is trial 0 with value: 4.827736854553223.


Step,Training Loss
500,5.1346
1000,5.2561
1500,5.2098
2000,5.1521
2500,5.0692
3000,5.0242
3500,5.0132
4000,4.9949
4500,4.9064
5000,4.9318


[I 2025-04-20 18:52:40,045] Trial 1 finished with value: 4.747718334197998 and parameters: {'learning_rate': 2.291684063607082e-05, 'per_device_train_batch_size': 4, 'weight_decay': 0.24299585477781244, 'warmup_steps': 326}. Best is trial 1 with value: 4.747718334197998.


Step,Training Loss
500,4.9627
1000,5.0376
1500,5.0436
2000,5.0458
2500,5.0375
3000,5.0156
3500,4.9946
4000,5.001
4500,4.9578
5000,4.9213


[I 2025-04-20 18:58:41,924] Trial 2 finished with value: 4.828596115112305 and parameters: {'learning_rate': 5.2285141238299564e-06, 'per_device_train_batch_size': 2, 'weight_decay': 0.1326736863157424, 'warmup_steps': 459}. Best is trial 1 with value: 4.747718334197998.


Step,Training Loss
500,4.915
1000,4.9201
1500,4.9302
2000,4.9173
2500,4.8752
3000,4.861
3500,4.8815
4000,4.8853
4500,4.8285
5000,4.8713


[I 2025-04-20 19:02:02,151] Trial 3 finished with value: 4.8578081130981445 and parameters: {'learning_rate': 2.482120450731778e-06, 'per_device_train_batch_size': 4, 'weight_decay': 0.15498496979453638, 'warmup_steps': 352}. Best is trial 1 with value: 4.747718334197998.


Step,Training Loss
500,5.2558
1000,5.432
1500,5.361
2000,5.332
2500,5.3243
3000,5.2668
3500,5.2256
4000,5.2174
4500,5.1571
5000,5.1068


[I 2025-04-20 19:08:09,617] Trial 4 finished with value: 4.765420436859131 and parameters: {'learning_rate': 2.07402949423015e-05, 'per_device_train_batch_size': 2, 'weight_decay': 0.11126827386837147, 'warmup_steps': 262}. Best is trial 1 with value: 4.747718334197998.


Best hyperparameters:
{'learning_rate': 2.291684063607082e-05, 'per_device_train_batch_size': 4, 'weight_decay': 0.24299585477781244, 'warmup_steps': 326}
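Setting `log=True` for the learning rate makes Optuna sample it in the log domain, so each order of magnitude between `1e-6` and `5e-5` is roughly equally likely. The sketch below imitates that with the standard library only. `sample_log_uniform` is a hypothetical helper, and independent random draws are a simplification: Optuna's default sampler can take earlier trials into account.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(rng, 1e-6, 5e-5) for _ in range(1000)]

# Roughly half of the draws fall below the geometric midpoint of the range.
geo_mid = math.sqrt(1e-6 * 5e-5)
share_below = sum(s < geo_mid for s in samples) / len(samples)
print(min(samples), max(samples), share_below)
```

With a plain uniform distribution over the same range, almost all samples would land above `1e-5`, and the small learning rates would rarely be tried.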


Train the model using the best hyperparameters found previously.

In [7]:
for hp_name, hp_value in best_hps.items():
    setattr(training_args, hp_name, hp_value)

print(training_args.learning_rate,
      training_args.num_train_epochs,
      training_args.per_device_train_batch_size,
      training_args.weight_decay,
      training_args.warmup_steps)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=dataset,
    data_collator=data_collator,
)
trainer.train()

2.291684063607082e-05 1 4 0.24299585477781244 326


Step,Training Loss
500,5.1346
1000,5.2561
1500,5.2098
2000,5.1521
2500,5.0692
3000,5.0242
3500,5.0132
4000,4.9949
4500,4.9064
5000,4.9318


TrainOutput(global_step=6982, training_loss=5.005195873373048, metrics={'train_runtime': 167.3375, 'train_samples_per_second': 166.896, 'train_steps_per_second': 41.724, 'total_flos': 81813936537600.0, 'train_loss': 5.005195873373048, 'epoch': 1.0})
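The reported `global_step=6982` follows from the dataset size and the batch size: the loading output shows 27,928 examples, and one epoch with a per-device batch size of 4 takes `ceil(27928 / 4)` optimizer steps. The sketch assumes a single device and no gradient accumulation (neither is overridden in this notebook); `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps needed to see every example once on one device."""
    return math.ceil(num_examples / batch_size)

NUM_EXAMPLES = 27928  # size of the train split shown in the loading output
for bs in (2, 4, 8):
    print(bs, steps_per_epoch(NUM_EXAMPLES, bs))

# Share of the best run spent in warmup: 326 warmup steps out of one epoch.
warmup_share = 326 / steps_per_epoch(NUM_EXAMPLES, 4)
print(f"{warmup_share:.1%}")
```

The same arithmetic explains why trials with a batch size of 2 take about twice as long as those with a batch size of 4.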

Evaluate the final model for changes in loss and perplexity.

In [8]:
final_eval = evaluate_model(trainer.model, dataset)
print("Final model evaluation:")
print(f"\tLoss: {final_eval[0]}")
print(f"\tPerplexity: {final_eval[1]}")

Final model evaluation:
	Loss: 4.74771887364994
	Perplexity: 115.32092266514249


### Summary of Results  
- **Initial evaluation** (before continued pre‑training):  
  - *Loss:* `5.022291708993566`  
  - *Perplexity:* `151.7586923453805`  
- **Final evaluation** (after hyperparameter‑selected training):  
  - *Loss:* `4.74771887364994`  
  - *Perplexity:* `115.32092266514249`

  > Continued pre‑training on the Dolma‑10M subset **decreased** both loss and perplexity. Because the model is evaluated on the same examples it was trained on, this shows that it fit the training text but does not by itself measure generalization.

---

### Best Hyperparameters  
From the hyperparameter sweep (5 trials), we found:  
- **learning_rate:** `2.291684063607082e-05`
- **per_device_train_batch_size:** `4`
- **weight_decay:** `0.24299585477781244`
- **warmup_steps:** `326`

  > These values are consistent with the common practice of using a low learning rate and a moderate number of warmup steps, although 5 trials cover only a small part of the search space.
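
To see what 326 warmup steps mean in practice, the sketch below evaluates a linear warmup followed by linear decay. This assumes the `Trainer` default schedule (`lr_scheduler_type="linear"`), which this notebook does not override. `linear_schedule_lr` is a hypothetical helper written for illustration, and the total of 6982 steps is taken from the training output.

```python
def linear_schedule_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

PEAK_LR = 2.291684063607082e-05
WARMUP, TOTAL = 326, 6982
# Start, mid-warmup, end of warmup, mid-decay, end of training.
for step in (0, 163, 326, 3654, 6982):
    print(step, linear_schedule_lr(step, PEAK_LR, WARMUP, TOTAL))
```

Under this schedule the peak learning rate is reached only at step 326, so most of the epoch is spent on the decaying part of the curve.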

---

### Next Experiments

- **Longer Context Windows:** Increase `max_length` beyond 64 to see how extra context affects predictions
- **Data Splitting:** Reserve 10–20% of the Dolma subset for evaluation to measure over‑fitting
- **Multi‑Epoch Sweeps:** Train for more than one epoch and compare the results
- **Scale Up:** Apply the best hyperparameters to a larger fraction of the Dolma dataset and track training curves over time
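
For the data-splitting experiment, a held-out set could be carved off before training. The sketch below shuffles indices with a fixed seed and reserves 10% for evaluation. `split_indices` is a hypothetical helper, and the resulting index lists could be passed to `dataset.select(...)`. The `datasets` library also offers `Dataset.train_test_split`, which may be more convenient.

```python
import random

def split_indices(num_examples, eval_fraction=0.1, seed=42):
    """Shuffle indices reproducibly and split into (train, eval) lists."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    n_eval = int(num_examples * eval_fraction)
    return indices[n_eval:], indices[:n_eval]

# 27928 is the size of the train split used in this notebook.
train_idx, eval_idx = split_indices(27928, eval_fraction=0.1)
print(len(train_idx), len(eval_idx))
```

Evaluating on the held-out indices instead of the training data would make the before and after numbers a measure of generalization.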