In [0]:
%pip install --upgrade transformers accelerate torch
dbutils.library.restartPython() 

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.

# Introduction
The [t5-small on a single GPU](1. T5-Small on Single GPU) example showed a straightforward way to fine-tune a language model. However, you might have noticed that the training problem was still essentially structured as a supervised learning problem: we had a text (code snippet) and a desired completion. When training LLMs like the GPT models, labels are not provided manually. Instead, we use an approach called self-supervised learning, in which the objective is computed automatically from the inputs. One example of self-supervised learning is causal language modeling, where the task is to predict the next token (roughly, the next word) from the tokens that precede it. For example, the sentence "The boy hid behind the tree" would be decomposed into the following training tasks:
- Input: `The`, Target: `boy`
- Input: `The boy`, Target: `hid`
- Input: `The boy hid`, Target: `behind`
- Input: `The boy hid behind`, Target: `the`
- Input: `The boy hid behind the`, Target: `tree`
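
The decomposition above can be sketched in a few lines of Python. This is a minimal illustration that splits on whitespace, which is an assumption made for readability: GPT-2 operates on byte-pair-encoded tokens rather than words, and the helper name `next_word_pairs` is hypothetical.

```python
def next_word_pairs(sentence):
    """Return (input, target) pairs for next-word prediction.

    Whitespace splitting stands in for a real tokenizer here.
    """
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        context = " ".join(words[:i])  # everything seen so far
        target = words[i]              # the word to predict
        pairs.append((context, target))
    return pairs


for context, target in next_word_pairs("The boy hid behind the tree"):
    print(f"Input: {context!r}, Target: {target!r}")
```

In practice, these prediction tasks are not fed to the model as separate examples: a causal model scores every position of a sequence in a single forward pass.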

This requires us to preprocess our data and pass it to the model somewhat differently, which is the subject of this notebook. We will still limit this example to training on a single GPU (an A10 with 24GB VRAM). We will use the [gpt2](https://huggingface.co/gpt2) model with 124M parameters. Later, we will work through Eleuther's [Transformer Math blog post](https://blog.eleuther.ai/transformer-math/#training) to understand the memory costs associated with training this model under different conditions and verify that they match our experience. Hugging Face also provides a guide to [model memory anatomy](https://huggingface.co/docs/transformers/model_memory_anatomy).

According to the Hugging Face guide, a good heuristic for mixed-precision training with AdamW is around 18 bytes of VRAM per model parameter, plus additional memory for activations (dependent on sequence length, batch size, and various model architecture details). For the 124M-parameter GPT-2 model, that translates to around 2GB VRAM + activations.
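
As a rough check on the 2GB figure, the per-parameter costs listed in the Hugging Face guide for mixed-precision training with AdamW (6 bytes for weights, 8 for optimizer states, 4 for gradients) can be summed and multiplied by the parameter count. This is a back-of-the-envelope sketch: the 124M parameter count comes from the text above, the variable names are illustrative, and activation memory is deliberately left out.

```python
# Approximate bytes per parameter for mixed-precision training with AdamW,
# following the breakdown in the Hugging Face model memory anatomy guide.
BYTES_PER_PARAM = {
    "weights (fp32 copy + fp16 copy)": 6,
    "optimizer states (AdamW)": 8,
    "gradients": 4,
}

n_params = 124_000_000  # GPT-2 (124M parameters), as stated above

bytes_per_param = sum(BYTES_PER_PARAM.values())
total_bytes = n_params * bytes_per_param
total_gb = total_bytes / 1e9  # decimal gigabytes

print(f"{bytes_per_param} bytes/param -> ~{total_gb:.2f} GB before activations")
```

The result leaves most of a 24GB card free for activations, which is where batch size and sequence length come into play.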

# Topics Covered in this Notebook
The major difference between this example and the t5-small example is the focus on self-supervised learning. Additionally, this notebook will go a little deeper into:
- monitoring training metrics with MLflow
- measuring memory usage

Before progressing to multi-GPU and multi-node training, we will also explore ways to improve training efficiency on a single GPU with techniques such as mixed-precision training.

# Choosing a Fine-Tuning Task
We will fine-tune GPT2 on the [tinystories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset. TinyStories is:

> a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4.

and can be used to train small models (actually quite a bit smaller than GPT-2) that

> still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities.

([Source](https://arxiv.org/abs/2305.07759))

We can evaluate the model by passing prompts such as this example from the TinyStories paper:

> Once upon a time there was a pumpkin. It was a very special pumpkin, it could speak. It was sad because it couldn’t move. Every day, it would say

and evaluating the grammar, consistency, and creativity of the output. We hope to see improvements in these areas after training.

# 1. Load the Model and Try Some Examples

We'll begin by loading the model and trying out some examples.

In [0]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    device_map="auto",
)



In [0]:
examples = [
    "There was a cat with magic powers. It could turn invisible. But one day, the cat lost its magic and",
    "There was a cloud that could laugh. It laughed every day. But one day, the cloud didn't laugh. The animals in the forest decided to",
    "Every night, Mia looked at the stars. But one night, one star twinkled differently. It seemed to be sending a message. Mia thought hard about what it could mean and",
]

# Tokenize the examples
inputs = tokenizer(examples, return_tensors="pt", padding=True, add_special_tokens=True, truncation=True)

# Move tensors to the same device as the model
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate text with the model
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=50,
    do_sample=True,
    top_p=0.95,
)
# Decode and print the outputs
for i, output in enumerate(outputs):
    print(f"Completion for example {i + 1}:")
    print(tokenizer.decode(output, skip_special_tokens=True))
    print("\n")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Completion for example 1:
There was a cat with magic powers. It could turn invisible. But one day, the cat lost its magic and became a demon. But the demon couldn't have survived for very long and attacked me and his companions. This is the final stage of the story.

"As expected, there are a few things you can say as well. First, your


Completion for example 2:
There was a cloud that could laugh. It laughed every day. But one day, the cloud didn't laugh. The animals in the forest decided to kill everyone.

"If there's something I'm trying to do, I can't run!"

"Do you have a job to do?"

"I don't want to die, so I can't run anymore."



Completion for example 3:
Every night, Mia looked at the stars. But one night, one star twinkled differently. It seemed to be sending a message. Mia thought hard about what it could mean and it didn't stop there.

"Why didn't you want to die? If you don't want to, I'm sure you'll need to see someone who loves him, or someone who's not afraid to talk a

Not the most coherent results. Hopefully our fine-tuning will improve this. Let's get the dataset and take a look at it.

# 2. Get the Dataset

In [0]:
from datasets import load_dataset
tinystories = load_dataset('roneneldan/TinyStories')



### Inspect the Dataset

In [0]:
tinystories

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})

There are more than 2 million training samples and more than 20,000 validation samples.

In [0]:
import pandas as pd

# Convert the train dataset to a pandas dataframe and preview the first few rows
df = pd.DataFrame(tinystories['train'][:10])
print(df)

                                                text
0  One day, a little girl named Lily found a need...
1  Once upon a time, there was a little car named...
2  One day, a little fish named Fin was swimming ...
3  Once upon a time, in a land full of trees, the...
4  Once upon a time, there was a little girl name...
5  Once upon a time, in a big lake, there was a b...
6  Once upon a time, in a small town, there was a...
7  Once upon a time, in a peaceful town, there li...
8  Once upon a time, there was a clever little do...
9  One day, a fast driver named Tim went for a ri...


# 3. Fine-Tune the Model
This time around, we're going to train the model with a little more care. In particular, we will:
- keep a close eye on training metrics using MLflow
- do a few test runs to choose a set of reasonable hyperparameters for our final fine-tuning run
- use mixed-precision training for faster training

In [0]:
from torch.utils.data import DataLoader
import os

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True)


if os.environ.get('DATABRICKS_RUNTIME_VERSION') is not None:
    cache_file_path_train = "/Volumes/daniel_liden/fine_tune/assets/cache/train_cache.arrow"
    cache_file_path_valid = "/Volumes/daniel_liden/fine_tune/assets/cache/validation_cache.arrow"
else:
    cache_file_path_train = "./cache/train_cache.arrow"
    cache_file_path_valid = "./cache/validation_cache.arrow"


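# Make sure the directory that will hold the cache files exists
# (both cache files live in the same directory)
os.makedirs(os.path.dirname(cache_file_path_train), exist_ok=True)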

# Tokenize and cache the train data
tokenized_train_data = tinystories["train"].map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    cache_file_name=cache_file_path_train  # Cache file for the training set
)

# Tokenize and cache the validation data
tokenized_validation_data = tinystories["validation"].map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    cache_file_name=cache_file_path_valid  # Cache file for the validation set
)
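
Before moving on to training, it may help to see how labels arise in causal language modeling without any manual annotation. The sketch below mimics, in plain Python, what a causal-LM collator such as `DataCollatorForLanguageModeling` with `mlm=False` is generally understood to do: copy the input IDs as labels and replace padding positions with `-100` so the loss ignores them. The helper `make_causal_lm_labels` and the toy token IDs are hypothetical; only the pad ID 50256 is taken from the generation output shown earlier in this notebook.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids to labels, masking padding so it does not contribute to the loss."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in input_ids
    ]


pad_id = 50256  # GPT-2's EOS token, reused as the pad token in this notebook
batch = [
    [464, 2933, 24519, 2157],      # toy IDs, not a real tokenization
    [pad_id, pad_id, 7454, 2402],  # shorter sequence, padded on the left
]

labels = make_causal_lm_labels(batch, pad_id)
print(labels)
```

The model shifts the labels by one position internally, so each position is scored against the token that follows it. One side effect of reusing EOS as the pad token is that genuine EOS tokens are masked in the same way.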

In [0]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
import mlflow

# Define the training arguments
training_args = TrainingArguments(
    output_dir='/Volumes/daniel_liden/fine_tune/assets/results',
    num_train_epochs=1,
    per_device_train_batch_size=4, 
    per_device_eval_batch_size=4, 
    warmup_steps=1,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=50,  # Log every 50 steps
    evaluation_strategy="steps",  # Evaluate every 'eval_steps'
    eval_steps=1000,
    fp16=True,
)

# Initialize the data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_data.select(range(20000)),  # Use only the first 20k rows for train data
    eval_dataset=tokenized_validation_data.select(range(5000)),  # Use only the first 5k rows for eval data
    data_collator=data_collator,
)

# Start training and track with MLflow
with mlflow.start_run():
    trainer.train()
    mlflow.log_params(training_args.to_dict())

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 1000 | 1.8603 | 1.827593 |
| 2000 | 1.7381 | 1.753622 |
| 3000 | 1.7461 | 1.711488 |
| 4000 | 1.7254 | 1.688552 |
| 5000 | 1.6925 | 1.680552 |
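
One way to read the losses above: for causal language modeling, the reported loss is the mean per-token cross-entropy (in nats), so exponentiating it gives perplexity. The sketch below applies that conversion to the validation losses from the training output. It is an interpretation aid, not something the training run itself computed.

```python
import math

# Validation losses copied from the training output above, keyed by step
validation_loss = {
    1000: 1.827593,
    2000: 1.753622,
    3000: 1.711488,
    4000: 1.688552,
    5000: 1.680552,
}

# Perplexity is exp(mean cross-entropy); lower is better
perplexity = {step: math.exp(loss) for step, loss in validation_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step}: validation perplexity ~ {ppl:.2f}")
```

Validation perplexity falls from roughly 6.2 at step 1000 to roughly 5.4 at step 5000, with the improvement slowing toward the end of the run.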


# 4. Load the Model Checkpoint and Run some Examples
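
The `checkpoint-5000` name used in the next cell lines up with the run above: 20,000 training rows at a per-device batch size of 4 give 5,000 optimizer steps for one epoch. The arithmetic below assumes a single GPU and no gradient accumulation, which matches the `TrainingArguments` shown earlier; the variable names are illustrative.

```python
import math

num_train_rows = 20_000      # tokenized_train_data.select(range(20000))
per_device_batch_size = 4    # per_device_train_batch_size
num_devices = 1              # assumption: single GPU
grad_accum_steps = 1         # assumption: no gradient accumulation
num_epochs = 1               # num_train_epochs

# Rows consumed per optimizer step
effective_batch = per_device_batch_size * num_devices * grad_accum_steps

# Round up so a final partial batch would still count as a step
steps_per_epoch = math.ceil(num_train_rows / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(f"{total_steps} optimizer steps in total")
```

If you change the batch size, the subset size, or the number of GPUs, the final step count changes accordingly, and the checkpoint path needs to be updated to match.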

In [0]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

examples = [
    "There was a cat with magic powers. It could turn invisible. But one day, the cat lost its magic and",
    "There was a cloud that could laugh. It laughed every day. But one day, the cloud didn't laugh. The animals in the forest decided to",
    "Every night, Mia looked at the stars. But one night, one star twinkled differently. It seemed to be sending a message. Mia thought hard about what it could mean and",
]

# Specify the path to your checkpoint
checkpoint_path = "/Volumes/daniel_liden/fine_tune/assets/results/checkpoint-5000"

# Load the tokenizer and model from the checkpoint
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(checkpoint_path)

# Create a pipeline for text generation (adjust task as needed)
gpt2_pipeline = pipeline(
    "text-generation", model=model, tokenizer=tokenizer, device_map="auto"
)

# Use the pipeline for inference
gpt2_pipeline(examples, max_new_tokens=50)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[[{'generated_text': 'There was a cat with magic powers. It could turn invisible. But one day, the cat lost its magic and was stuck in a maze.\n\nThe cat wanted to find new ways to find its magic. So, it decided to learn how to build a castle. When the cat\'s friends saw it, they were frightened. They said, "Go away'}],
 [{'generated_text': "There was a cloud that could laugh. It laughed every day. But one day, the cloud didn't laugh. The animals in the forest decided to jump on it and take it out. The animals got scared and ran away.\n\nThe clouds liked to laugh at the animals. They didn't want to make them scared. They wanted to go home and play with the animal.\n\nThe"}],
 [{'generated_text': 'Every night, Mia looked at the stars. But one night, one star twinkled differently. It seemed to be sending a message. Mia thought hard about what it could mean and decided to do something different.\n\nShe grabbed her flashlight and made a circle around her star. She pointed to the sun and sa