# Introduction
The [t5-small on a single GPU](1. T5-Small on Single GPU) example offered a straightforward introduction to fine-tuning a language model. However, you might have noticed that the training problem was still essentially structured as a supervised learning problem: we had a text (code snippet) and a desired completion. When training LLMs like the GPT models, labels are not provided manually. Instead, we use an approach called self-supervised learning, in which the objective is computed automatically from the inputs. One example of self-supervised learning is causal language modeling, where the task is to predict the next word (in practice, the next token) from the words that precede it. For example, the sentence "The boy hid behind the tree" would be decomposed into the following training tasks:
- Input: `The`, Target: `boy`
- Input: `The boy`, Target: `hid`
- Input: `The boy hid`, Target: `behind`
- Input: `The boy hid behind`, Target: `the`
- Input: `The boy hid behind the`, Target: `tree`.
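
The decomposition above can be sketched in a few lines of Python. This is only an illustration at the word level: the helper name `next_word_pairs` is made up for this example, and real models operate on subword tokens and compute all of these predictions in a single forward pass rather than one pair at a time.

```python
def next_word_pairs(sentence):
    """Split a sentence into (input, target) pairs for next-word prediction."""
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        context = " ".join(words[:i])  # everything before position i
        target = words[i]              # the word the model should predict
        pairs.append((context, target))
    return pairs


pairs = next_word_pairs("The boy hid behind the tree")
for context, target in pairs:
    print(f"Input: {context!r}, Target: {target!r}")
```

A sentence with six words yields five training pairs, one for every word after the first.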

This requires us to preprocess our data and pass it along to the model somewhat differently, which will be the subject of this notebook. We will still limit this example to training on a single GPU (an A10 with 24GB VRAM). We will use the [gpt2](https://huggingface.co/gpt2) model with 124M parameters. Later, we will work through Eleuther's [Transformer Math blog post](https://blog.eleuther.ai/transformer-math/#training) to understand the memory costs associated with training this model under different conditions and verify that they match our experience. Hugging Face also provides a guide to [model memory anatomy](https://huggingface.co/docs/transformers/model_memory_anatomy).

According to the Hugging Face post, a good heuristic for mixed-precision training is that we require around 18 bytes of VRAM per model parameter, plus additional memory for activations (dependent on sequence length, batch size, and various model architecture details). For the 124M-parameter GPT-2 model, that translates to around 2GB VRAM + activations.
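
As a rough check of that figure, the sketch below multiplies the parameter count by the per-parameter costs listed in the Hugging Face guide for mixed-precision training with AdamW. The function and dictionary names are invented for this example, the parameter count is rounded to 124M, and activation memory is deliberately left out because it depends on batch size and sequence length.

```python
# Per-parameter memory cost in bytes for mixed-precision training with AdamW,
# following the breakdown in the Hugging Face model memory anatomy guide.
BYTES_PER_PARAM = {
    "weights (fp32 copy + fp16 copy)": 6,
    "optimizer states (AdamW)": 8,
    "gradients (fp32)": 4,
}


def estimate_training_memory_gb(num_params):
    """Estimate training memory in GB (10^9 bytes), excluding activations."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / 1e9


gpt2_params = 124_000_000  # rounded parameter count for the gpt2 checkpoint
estimate = estimate_training_memory_gb(gpt2_params)
print(f"Estimated memory before activations: {estimate:.2f} GB")
```

This gives a little over 2GB, which leaves most of a 24GB card available for activations.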

# Topics Covered in this Notebook
The major difference between this example and the t5-small example is the focus on self-supervised learning. Additionally, this notebook will go a little deeper into:
- monitoring training metrics with MLflow
- measuring memory usage

Before progressing to multi-GPU and multi-node training, we will also explore ways to improve training efficiency on a single GPU with techniques such as mixed-precision training.

# Choosing a Fine-Tuning Task
We will fine-tune GPT2 on the [tinystories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset. TinyStories is:

> a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4.

and can be used to train small models (actually quite a bit smaller than GPT-2) that

> still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities.

([Source](https://arxiv.org/abs/2305.07759))

We can evaluate the model by passing prompts such as this example from the TinyStories paper:

> Once upon a time there was a pumpkin. It was a very special pumpkin, it could speak. It was sad because it couldn’t move. Every day, it would say

and evaluating the grammar, consistency, and creativity of the output. We hope to see improvements in these areas after training.

# 1. Load the model and try some examples

We'll begin by loading the model and trying out some examples.

In [None]:
%pip install --upgrade -r ./gpt2_requirements.txt

In [None]:
# Some Environment Setup
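# The values below are placeholders; replace them with real paths before running.
OUTPUT_DIR = "path/to/output_dir"  # the path to the output directory; where model checkpoints will be saved
LOG_DIR = "path/to/log_dir"  # the path to the log directory; where logs will be saved
CACHE_DIR = "path/to/cache_dir"  # the path to the cache directory; where cache files will be saved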

In [None]:
from pathlib import Path

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    device_map="auto",
    cache_dir=Path(CACHE_DIR) / "model",
)
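
The tokenizer above is configured to pad on the left. For batched generation with a decoder-only model this matters, because the model continues from the last position of each row: with right padding, shorter prompts would end in padding tokens. The toy sketch below illustrates the difference using made-up token ids and a hypothetical `pad_batch` helper; it is not how the tokenizer is implemented.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        num_pad = max_len - len(seq)
        if side == "left":
            input_ids.append([pad_id] * num_pad + list(seq))
            attention_mask.append([0] * num_pad + [1] * len(seq))
        else:
            input_ids.append(list(seq) + [pad_id] * num_pad)
            attention_mask.append([1] * len(seq) + [0] * num_pad)
    return input_ids, attention_mask


PAD = 0                       # hypothetical pad token id
batch = [[11, 12, 13], [21]]  # made-up token ids for two prompts

left_ids, left_mask = pad_batch(batch, PAD, side="left")
right_ids, right_mask = pad_batch(batch, PAD, side="right")

# The last position of each row is where generation continues from.
print("left padding, last tokens: ", [row[-1] for row in left_ids])
print("right padding, last tokens:", [row[-1] for row in right_ids])
```

With left padding every row ends in a real prompt token, and the attention mask tells the model to ignore the padded positions at the start.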

In [None]:
examples = [
    "There was a cat with magic powers. It could turn invisible. But one day, the cat lost its magic and",
    "There was a cloud that could laugh. It laughed every day. But one day, the cloud didn't laugh. The animals in the forest decided to",
    "Every night, Mia looked at the stars. But one night, one star twinkled differently. It seemed to be sending a message. Mia thought hard about what it could mean and",
]

# Tokenize the examples
inputs = tokenizer(examples, return_tensors="pt", padding=True, add_special_tokens=True, truncation=True)

# Move tensors to the same device as the model
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate text with the model
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=50,
    do_sample=True,
    top_p=0.95,
)
# Decode and print the outputs
for i, output in enumerate(outputs):
    print(f"Completion for example {i + 1}:")
    print(tokenizer.decode(output, skip_special_tokens=True))
    print("\n")

Not the most coherent results. Hopefully our fine-tuning will improve this. Let's get the dataset and take a look at it.

# 2. Get the dataset

In [None]:
from datasets import load_dataset
tinystories = load_dataset('roneneldan/TinyStories',
                           cache_dir=str(Path(CACHE_DIR) / "data"))

### Inspect the Dataset

In [None]:
tinystories

The dataset contains more than 2 million training samples and more than 20,000 validation samples.

In [None]:
import pandas as pd

# Convert the train dataset to a pandas dataframe and preview the first few rows
df = pd.DataFrame(tinystories['train'][:10])
print(df)

# 3. Fine-Tune the Model
This time around, we're going to train the model with a little more care. In particular, we will:
- keep a close eye on training metrics using MLflow
- do a few test runs to choose a set of reasonable hyperparameters for our final fine-tuning run
- use mixed-precision training to speed up training

As in the t5-small example, we are not going to fine-tune on the entire dataset. Instead, we will sample and tokenize 100,000 examples. The test run below trains on the first 20,000 of them and evaluates on the first 5,000 validation examples.

In [None]:
import os

# Shuffle and select a subset of the train data
sample_size = 100000
shuffled_train_data = tinystories["train"].shuffle(seed=42)
subset_train_data = shuffled_train_data.select(range(sample_size))


def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True)

# Make sure the cache directory used below exists
os.makedirs(CACHE_DIR, exist_ok=True)

# Tokenize and cache the train data
tokenized_train_data = subset_train_data.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    cache_file_name=str(Path(CACHE_DIR) / "train_cache.arrow"),
)

# Tokenize and cache the validation data
tokenized_validation_data = tinystories["validation"].map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    cache_file_name=str(Path(CACHE_DIR) / "validation_cache.arrow"),
)

In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
import mlflow

# Define the training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=1,
    per_device_train_batch_size=4, 
    per_device_eval_batch_size=4, 
    warmup_steps=1,
    weight_decay=0.01,
    logging_dir=LOG_DIR,
    logging_steps=50,  # Log every 50 steps
    evaluation_strategy="steps",  # Evaluate every 'eval_steps' (newer transformers releases call this `eval_strategy`)
    eval_steps=1000,
    fp16=True,
)

# Initialize the data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_data.select(range(20000)),  # Use only the first 20k rows for train data
    eval_dataset=tokenized_validation_data.select(range(5000)),  # Use only the first 5k rows for eval data
    data_collator=data_collator,
)

# Start training and track with MLflow
with mlflow.start_run():
    trainer.train()
    mlflow.log_params(training_args.to_dict())
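
To see how long this run is, and why the next section loads a checkpoint named `checkpoint-5000`, the arithmetic can be sketched as follows. The helper name is invented for this example, and the calculation assumes a single GPU, no gradient accumulation, and the `Trainer` default of saving a checkpoint every 500 steps, since `save_steps` is not set in the cell above.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs=1):
    """Number of optimizer steps on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / per_device_batch_size)
    return steps_per_epoch * num_epochs


# Values taken from the training cell above
total_steps = total_optimizer_steps(num_examples=20_000, per_device_batch_size=4)
num_evaluations = total_steps // 1000  # eval_steps=1000
num_log_entries = total_steps // 50    # logging_steps=50

save_steps = 500  # assumed default checkpoint interval
last_checkpoint = (total_steps // save_steps) * save_steps

print(f"total steps: {total_steps}")
print(f"evaluations: {num_evaluations}, log entries: {num_log_entries}")
print(f"last checkpoint: checkpoint-{last_checkpoint}")
```

Under these assumptions the run takes 5,000 steps, so the final checkpoint coincides with the end of training. If you change the batch size or the number of training rows, the checkpoint name used in the next section needs to change accordingly.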

# 4. Load the Model Checkpoint and Run some Examples

In [None]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

examples = [
    "There was a cat with magic powers. It could turn invisible. But one day, the cat lost its magic and",
    "There was a cloud that could laugh. It laughed every day. But one day, the cloud didn't laugh. The animals in the forest decided to",
    "Every night, Mia looked at the stars. But one night, one star twinkled differently. It seemed to be sending a message. Mia thought hard about what it could mean and",
]

# Specify the path to your checkpoint
checkpoint_path = Path(OUTPUT_DIR) / "checkpoint-5000"

# Load the tokenizer and model from the checkpoint
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(checkpoint_path)

# Create a pipeline for text generation (adjust task as needed)
gpt2_pipeline = pipeline(
    "text-generation", model=model, tokenizer=tokenizer, device_map="auto"
)

# Use the pipeline for inference
gpt2_pipeline(examples, max_new_tokens=50)