# Introduction
The [t5-small on a single GPU](1. T5-Small on Single GPU) notebook gave a straightforward example of fine-tuning a language model. However, you might have noticed that the training problem was still essentially structured as a supervised learning problem: we had a text (a code snippet) and a desired completion. When training LLMs like the GPT models, labels are not provided manually. Instead, we use an approach called self-supervised learning, wherein the objective is computed automatically from the inputs. One example of self-supervised learning is causal language modeling, where the task is to predict the next token (roughly, the next word) from the preceding ones. For example, the sentence "The boy hid behind the tree" would be decomposed into the following training tasks:
- Input: `The`, Target: `boy`
- Input: `The boy`, Target: `hid`
- Input: `The boy hid`, Target: `behind`
- Input: `The boy hid behind`, Target: `the`
- Input: `The boy hid behind the`, Target: `tree`
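The decomposition above can be sketched in a few lines of Python. This is a simplified illustration that splits on whitespace; real models operate on subword tokens rather than words, and the helper name `next_word_pairs` is made up for this example.

```python
def next_word_pairs(sentence):
    """Return (input, target) pairs for next-word prediction."""
    words = sentence.split()
    # Each prefix of the sentence is an input; the word that follows it is the target
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]


for context, target in next_word_pairs("The boy hid behind the tree"):
    print(f"Input: {context!r}, Target: {target!r}")
```

In practice, the model predicts all of these targets in a single forward pass: the labels are simply the input tokens shifted by one position.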

This requires us to preprocess our data and pass it to the model somewhat differently, which is the subject of this notebook. We will still limit this example to training on a single GPU (an A10 with 24GB VRAM). We will use the [gpt2](https://huggingface.co/gpt2) model with 124M parameters. Later, we will work through EleutherAI's [Transformer Math blog post](https://blog.eleuther.ai/transformer-math/#training) to understand the memory costs associated with training this model under different conditions and verify that they match our experience. Hugging Face also provides a guide to [model memory anatomy](https://huggingface.co/docs/transformers/model_memory_anatomy).

According to the Hugging Face guide, a good heuristic is that mixed-precision training requires around 18 bytes of VRAM per model parameter, plus additional memory for activations (which depends on sequence length, batch size, and various model architecture details). For the 124M-parameter gpt2 model, that translates to around 2GB of VRAM plus activations.
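As a rough back-of-the-envelope check, the sketch below multiplies the parameter count by 18 bytes per parameter, the figure the Hugging Face guide gives for mixed-precision training with AdamW (6 bytes for weights, 8 for optimizer states, 4 for gradients). The constant and function names are illustrative, and activation memory is deliberately left out.

```python
BYTES_PER_PARAM = 6 + 8 + 4  # weights + AdamW optimizer states + gradients = 18


def static_training_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Estimate memory for weights, optimizer states and gradients in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


gpt2_params = 124_000_000  # approximate parameter count of the gpt2 checkpoint
print(f"{static_training_memory_gb(gpt2_params):.2f} GB before activations")  # 2.23 GB
```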

# Topics Covered in this Notebook
The major difference between this example and the t5-small example is the focus on self-supervised learning. Additionally, this notebook will go a little deeper into:
- monitoring training metrics with MLflow
- measuring memory usage

Before progressing to multi-GPU and multi-node training, we will also explore ways to improve training efficiency on a single GPU with techniques such as mixed-precision training.
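As a preview of the memory measurements mentioned above, here is a minimal sketch of how peak GPU memory might be read from PyTorch's CUDA memory statistics, assuming PyTorch is installed. The helper name `peak_gpu_memory_gib` is hypothetical; it returns `None` when PyTorch or a GPU is unavailable so the snippet is safe to run anywhere.

```python
def peak_gpu_memory_gib():
    """Return the peak memory allocated by PyTorch tensors on the current GPU, in GiB."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed
    if not torch.cuda.is_available():
        return None  # no GPU to measure
    return torch.cuda.max_memory_allocated() / 2**30


peak = peak_gpu_memory_gib()
print("No GPU available" if peak is None else f"Peak allocated: {peak:.2f} GiB")
```

Calling `torch.cuda.reset_peak_memory_stats()` before a training run resets the counter so that the reading reflects only that run. Note that this counts memory allocated to tensors, which is lower than the total reported by `nvidia-smi`.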

# Choosing a Fine-Tuning Task
We will fine-tune GPT-2 on the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset. TinyStories is:

> a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4.

and can be used to train small models (actually quite a bit smaller than GPT-2) that

> still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities.

([Source](https://arxiv.org/abs/2305.07759))

We can evaluate the model by passing prompts such as this example from the TinyStories paper:

> Once upon a time there was a pumpkin. It was a very special pumpkin, it could speak. It was sad because it couldn’t move. Every day, it would say

and evaluating the grammar, consistency, and creativity of the output. We hope to see improvements in these areas after training.

# 1. Load the model and try some examples

We'll begin by loading the model and trying out some examples.

In [1]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    device_map="auto",
)

In [2]:
examples = [
    "There was a cat with magic powers. It could turn invisible. But one day, the cat lost its magic and",
    "There was a cloud that could laugh. It laughed every day. But one day, the cloud didn't laugh. The animals in the forest decided to",
    "Every night, Mia looked at the stars. But one night, one star twinkled differently. It seemed to be sending a message. Mia thought hard about what it could mean and",
]

# Tokenize the examples
inputs = tokenizer(examples, return_tensors="pt", padding=True, add_special_tokens=True, truncation=True)

# Move tensors to the same device as the model
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate text with the model
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=50,
    do_sample=True,
    top_p=0.95,
)
# Decode and print the outputs
for i, output in enumerate(outputs):
    print(f"Completion for example {i + 1}:")
    print(tokenizer.decode(output, skip_special_tokens=True))
    print("\n")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Completion for example 1:
There was a cat with magic powers. It could turn invisible. But one day, the cat lost its magic and fell dead into the darkness."

"Hmph. Of course. I guess that's the best we can do." I said.

And of course it did.

A cold wind had whipped through the clouds. I stepped out


Completion for example 2:
There was a cloud that could laugh. It laughed every day. But one day, the cloud didn't laugh. The animals in the forest decided to escape the wind. They escaped by jumping into the forest. They jumped into the forest. They came out of it. They jumped into the sea. They started fishing, going fishing. And there, where it was, there were fishes. But all


Completion for example 3:
Every night, Mia looked at the stars. But one night, one star twinkled differently. It seemed to be sending a message. Mia thought hard about what it could mean and how she would interpret what the night saw. All she could do was pray and pray for the stars. She wanted to know more.

It 

Not the most coherent results. Hopefully our fine-tuning will improve this. Let's get the dataset and take a look at it.

# 2. Get the dataset

In [3]:
from datasets import load_dataset
tinystories = load_dataset('roneneldan/TinyStories')

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


### Inspect the Dataset

In [4]:
tinystories

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})

The dataset contains about 2.1 million training samples (2,119,719) and about 22,000 validation samples (21,990).

In [5]:
import pandas as pd

# Convert the train dataset to a pandas dataframe and preview the first few rows
df = pd.DataFrame(tinystories['train'][:10])
print(df)

                                                text
0  One day, a little girl named Lily found a need...
1  Once upon a time, there was a little car named...
2  One day, a little fish named Fin was swimming ...
3  Once upon a time, in a land full of trees, the...
4  Once upon a time, there was a little girl name...
5  Once upon a time, in a big lake, there was a b...
6  Once upon a time, in a small town, there was a...
7  Once upon a time, in a peaceful town, there li...
8  Once upon a time, there was a clever little do...
9  One day, a fast driver named Tim went for a ri...


# 3. Fine-Tune the Model
This time around, we're going to train the model with a little more care. In particular, we will:
- keep a close eye on training metrics using MLflow
- do a few test runs to choose a set of reasonable hyperparameters for our final fine-tuning run
- use mixed-precision training for faster training

In [6]:
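from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
import mlflow

# Right-pad during training; left padding was only needed for batched generation above
tokenizer.padding_side = "right"

# Prepare the training data: tokenize in batches and drop the raw text column.
# Padding is left to the data collator, which pads each batch to its longest sequence.
train_encodings = tinystories['train'].map(
    lambda batch: tokenizer(batch['text'], truncation=True),
    batched=True,
    remove_columns=['text'],
)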

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=0.25,           # train on a quarter of one epoch
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # when to print log
    fp16=True,                       # use mixed precision
)

# Initialize the data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_encodings,       # training dataset
    data_collator=data_collator,         # use data collator for language modeling
)

# Start training and track with MLflow
with mlflow.start_run():
    trainer.train()
    mlflow.log_params(training_args.to_dict())





KeyboardInterrupt: 
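
To get a feel for how long such a run takes, it helps to estimate the number of optimizer steps implied by the training arguments above. The sketch below is a rough estimate assuming a single GPU and no gradient accumulation; the function name is illustrative, and the exact count reported by the `Trainer` may differ slightly.

```python
import math


def estimate_training_steps(num_samples, batch_size, num_epochs):
    """Approximate optimizer steps for a single-GPU run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return math.ceil(steps_per_epoch * num_epochs)


steps = estimate_training_steps(num_samples=2_119_719, batch_size=16, num_epochs=0.25)
print(f"~{steps:,} steps; 500 warmup steps are {500 / steps:.1%} of the run")
```

With the full training split, a batch size of 16, and a quarter of an epoch, this comes to roughly 33,000 steps, so the 500 warmup steps cover about 1.5% of the run.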