## This notebook provides an illustrative demonstration of how to fine-tune a pre-trained LLM (DistilGPT2) on a user-defined dataset and generate text using the fine-tuned model

## Set up the environment

conda create -n mlss-day4

conda activate mlss-day4

conda install -c anaconda ipykernel

python -m ipykernel install --user --name=mlss-day4

pip install torch transformers datasets accelerate

### Load Modules

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset, IterableDatasetDict
import torch
from itertools import islice
from datasets import Dataset
import accelerate
import transformers
print("Accelerate version:", accelerate.__version__)
print("Transformers version:", transformers.__version__)

Accelerate version: 1.7.0
Transformers version: 4.52.3


### Load TinyStories dataset

**Note: With `streaming=True`, the whole dataset is not loaded into memory. Instead, `load_dataset` returns a streaming iterator over the dataset, so examples are fetched one at a time as they are needed. Loading the whole dataset up front could consume a large amount of memory.**
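
The lazy behaviour can be illustrated without downloading anything. The sketch below uses a hypothetical generator, `fake_story_stream`, as a stand-in for the streaming dataset (it is not part of `datasets`), and shows that only the examples actually requested are ever produced.

```python
from itertools import islice

produced = []  # records which examples the stand-in "dataset" had to produce

def fake_story_stream(n_total):
    """Hypothetical stand-in for a streaming dataset: yields one example at a time."""
    for i in range(n_total):
        produced.append(i)
        yield {"text": f"story {i}"}

stream = fake_story_stream(1_000_000)   # nothing is produced yet
first_three = list(islice(stream, 3))   # pulls exactly three examples

print(first_three)
print(len(produced))
```

Only three of the one million possible examples are generated, which is the same reason a streamed dataset stays cheap until it is iterated.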

In [2]:
dataset = load_dataset("roneneldan/TinyStories", split="train", streaming=True)


### Load the DistilGPT2-compatible tokenizer

**Recall: in this notebook, we explore fine-tuning and text generation using a pre-trained DistilGPT2 model.**

1. This tokenizer converts raw text (e.g., 'Once upon a time') into token IDs understood by the model
2. It includes special tokens, vocabulary size, and byte-pair encoding rules
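
To make the role of the merge rules and vocabulary concrete, here is a simplified sketch of byte-pair encoding on a single word. The merge rules and vocabulary are made up for illustration (they are not the contents of DistilGPT2's `merges.txt` or `vocab.json`), and applying the rules once each in priority order is a simplification of what the real tokenizer does.

```python
def apply_merges(word, merges):
    """Apply merge rules in priority order to a word split into characters."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Fuse an adjacent (left, right) pair into a single symbol
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge rules and vocabulary, for illustration only
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
toy_vocab = {"low": 0, "er": 1, "e": 2, "s": 3, "t": 4}

tokens = apply_merges("lower", toy_merges)
token_ids = [toy_vocab[t] for t in tokens]
print(tokens, token_ids)
```

The word is first split into characters, frequent pairs are fused into larger symbols, and each resulting symbol is looked up in the vocabulary to obtain its token ID.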

In [3]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # loads the pretrained tokenizer config and vocab for distilgpt2

tokenizer.pad_token = tokenizer.eos_token



### Load the pre-trained DistilGPT2 model into memory

In [4]:
model = AutoModelForCausalLM.from_pretrained("distilgpt2")


### Tokenization Function

Tokenize each input story and return a special-tokens mask for later masking if needed.

`return_special_tokens_mask=True`:

- Adds a binary mask alongside the token IDs
- Value 1 for special tokens (e.g. beginning- or end-of-sequence tokens, or `<pad>` if used)
- Value 0 for normal tokens

This is useful during training if we want to ignore the loss on special tokens.
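
As a sketch of how such a mask could be used, the snippet below builds training labels in which special-token positions are replaced by -100, the label value that PyTorch's cross-entropy loss ignores by default. In Hugging Face tokenizers the mask marks special tokens with 1. The helper name `mask_special_tokens` and the toy token IDs are assumptions made for illustration; the notebook itself does not perform this step explicitly.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_special_tokens(input_ids, special_tokens_mask):
    """Return labels where positions flagged as special are excluded from the loss."""
    return [
        IGNORE_INDEX if is_special else token_id
        for token_id, is_special in zip(input_ids, special_tokens_mask)
    ]

# Toy example with arbitrary IDs; 50256 is the eos_token_id reported in the
# generation logs of this notebook
input_ids = [10, 11, 12, 13, 50256]
special_tokens_mask = [0, 0, 0, 0, 1]   # 1 marks a special token

labels = mask_special_tokens(input_ids, special_tokens_mask)
print(labels)
```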

In [5]:
def tokenize_function(example):

    return tokenizer(example["text"], return_special_tokens_mask=True, truncation=True, max_length=1024)

Reorganize the token lists into fixed-length training chunks, so that each final training sequence is exactly 128 tokens long.

The function first flattens the nested token lists for each key (e.g., [[1,2],[3]] → [1,2,3]) and then splits the flattened token list into chunks of 128 tokens each. Leftover tokens that do not fill a complete chunk are dropped.

In [6]:
def group_texts(examples):  
    block_size = 128  
    
    concatenated = {k: sum((v if isinstance(v, list) else [v] for v in examples[k]), []) for k in examples}
    total_length = len(concatenated[list(examples.keys())[0]])  
    total_length = (total_length // block_size) * block_size  
    
    result = {
        k: [t[i: i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    return result
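
One detail of `group_texts` is worth checking: `total_length` is taken from the first key in `examples`. When the raw `text` column is still present (as in the `features` list printed further down), that first key holds one string per story rather than a token list, so the number of chunks is derived from the story count instead of the token count. The variant below is a sketch, not the notebook's original code: the name `group_token_columns` and the `token_keys` parameter are assumptions introduced here. It groups only the token columns and measures the length on `input_ids`.

```python
from itertools import chain

def group_token_columns(examples, block_size=128,
                        token_keys=("input_ids", "attention_mask", "special_tokens_mask")):
    """Concatenate the token columns and split them into fixed-size blocks."""
    # Flatten each token column: [[1, 2], [3]] -> [1, 2, 3]
    concatenated = {
        k: list(chain.from_iterable(examples[k])) for k in token_keys if k in examples
    }
    # Measure length on the tokens themselves and drop the incomplete tail
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Toy batch: two "stories" with five tokens in total, grouped into blocks of two
toy = {
    "text": ["a b c", "d e"],
    "input_ids": [[1, 2, 3], [4, 5]],
    "attention_mask": [[1, 1, 1], [1, 1]],
}
toy_grouped = group_token_columns(toy, block_size=2)
print(toy_grouped)
```

With a change like this, the number of training sequences, and therefore the step counts and losses, would differ from the outputs recorded in this notebook.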

### Load only the first 20,000 samples from the TinyStories dataset

In [7]:
streaming_tokenized = dataset.map(tokenize_function)
streaming_limited = islice(streaming_tokenized, 20000)

### Load the tokenized samples into memory and split them into batches of 1,000

In [8]:
buffer = list(streaming_limited)
batched = [buffer[i:i + 1000] for i in range(0, len(buffer), 1000)]
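
`list(streaming_limited)` holds all 20,000 tokenized stories in memory at once before slicing them. If memory is tight, the same batches can be produced lazily. The helper below, `iter_batches`, is a hypothetical name and not part of the notebook; it is a sketch of one way to do this with `islice`.

```python
from itertools import islice

def iter_batches(iterable, batch_size):
    """Yield lists of up to `batch_size` items without materializing the whole input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Toy usage: five items in batches of two
toy_batches = list(iter_batches(range(5), 2))
print(toy_batches)
```

In this notebook, a loop such as `for batch in iter_batches(streaming_limited, 1000):` could stand in for the `buffer` and `batched` lists.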

### Group all tokens into 128-token sequences for fine-tuning

In [9]:
grouped = []

for batch in batched:
    
    batch_dict = {k: [d[k] for d in batch] for k in batch[0].keys()}
    grouped_chunk = group_texts(batch_dict)
    
    grouped.extend([{k: v[i] for k, v in grouped_chunk.items()} for i in range(len(grouped_chunk[list(grouped_chunk.keys())[0]]))])

### Make the dataset compatible with the Trainer

In [10]:
train_dataset = Dataset.from_list(grouped)

In [11]:
print(len(train_dataset))

140


In [12]:
print(train_dataset)
print(len(train_dataset))

Dataset({
    features: ['text', 'input_ids', 'attention_mask', 'special_tokens_mask'],
    num_rows: 140
})
140


### Visualize the dataset

In [13]:
from datasets import Dataset
import pandas as pd
# Use Hugging Face's built-in conversion
df = train_dataset.to_pandas()

# Inspect the structure
print(df.head(10))

# Analyze token sequence lengths
print(df['input_ids'].apply(len).describe())
print(df['attention_mask'].apply(len).describe())

                                                text  \
0  [One day, a little girl named Lily found a nee...   
1  [Once upon a time there was a very playful pup...   
2  [Once upon a time there were two best friends ...   
3  [The little boy lived in a quiet house. He lov...   
4  [Once upon a time, there was a weird rocket. T...   
5  [Once upon a time, there was a strong cone in ...   
6  [Sara and Tom are twins. They like to visit th...   
7  [Once upon a time, there was a little girl nam...   
8  [Danny wanted some juice, but he didn't have a...   
9  [Once, there was a farmer who had to make a lo...   

                                           input_ids  \
0  [3198, 1110, 11, 257, 1310, 2576, 3706, 20037,...   
1  [290, 5742, 1123, 584, 13, 2293, 484, 5201, 11...   
2  [262, 5509, 290, 7342, 262, 5667, 2121, 319, 6...   
3  [3114, 379, 4463, 290, 531, 11, 366, 2949, 11,...   
4  [345, 11, 1310, 5916, 11, 329, 1642, 502, 1254...   
5  [23612, 5509, 407, 284, 307, 6507, 13, 383, 

### Define Data Collator

In [14]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
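
With `mlm=False`, the collator prepares batches for causal (next-token) language modeling: it pads the examples to a common length and copies `input_ids` into `labels`, replacing padding positions with -100 so that they are ignored by the loss. The shift by one position happens inside the model. The pure-Python sketch below mimics that behaviour on toy data; `collate_causal_lm` and `next_token_pairs` are hypothetical helpers, not the library implementation.

```python
IGNORE_INDEX = -100

def collate_causal_lm(batch, pad_id):
    """Pad sequences to equal length and build labels that ignore padding."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask, labels = [], [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        labels.append(seq + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

def next_token_pairs(input_ids, labels):
    """Pair each position with the label of the following position (the model-side shift)."""
    return [(x, y) for x, y in zip(input_ids[:-1], labels[1:]) if y != IGNORE_INDEX]

toy_batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_id=0)
print(toy_batch["labels"])
print(next_token_pairs(toy_batch["input_ids"][1], toy_batch["labels"][1]))
```

The real collator masks every label equal to `pad_token_id`, not just the positions it padded. Because this notebook sets the pad token to the EOS token, any end-of-text tokens in the data would be masked as well. Here every sequence is already 128 tokens long, so no padding needs to be added.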

### Define Training Hyper-Parameters

In [15]:
training_args = TrainingArguments(
    output_dir="./tinystories_output",        # directory to save model checkpoints
    overwrite_output_dir=True,                # overwrite old output if exists
    num_train_epochs=10,                      # number of training epochs
    per_device_train_batch_size=2,            # batch size per device (GPU or CPU)
    save_steps=500,                           # save model every 500 steps
    save_total_limit=1,                       # only keep last checkpoint
    logging_steps=100,                        # log every 100 steps
    prediction_loss_only=True,                # don't store predictions, just loss
    fp16=torch.cuda.is_available()            # use fp16 mixed precision when a CUDA GPU is available
)


### Define Trainer

In [16]:
trainer = Trainer(
    model=model,                       # the language model to train
    args=training_args,                # training hyperparameters
    data_collator=data_collator,       # pads each batch and builds the labels from input_ids
    train_dataset=train_dataset        # our processed and limited TinyStories dataset
)

### Fine-tune the model

In [17]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 100 | 2.7413 |
| 200 | 2.2872 |
| 300 | 2.0312 |
| 400 | 1.8662 |
| 500 | 1.7147 |
| 600 | 1.634 |
| 700 | 1.569 |



TrainOutput(global_step=700, training_loss=1.9776550946916853, metrics={'train_runtime': 1508.7765, 'train_samples_per_second': 0.928, 'train_steps_per_second': 0.464, 'total_flos': 45726931353600.0, 'train_loss': 1.9776550946916853, 'epoch': 10.0})
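
The numbers in `TrainOutput` can be cross-checked by hand from the dataset size and the training arguments. The sketch below assumes a single device and no gradient accumulation (neither is stated in the log, so both are assumptions). It also converts the last logged loss into a perplexity, which is the exponential of the mean cross-entropy loss.

```python
import math

num_sequences = 140        # len(train_dataset), as printed above
batch_size = 2             # per_device_train_batch_size
num_epochs = 10            # num_train_epochs
train_runtime = 1508.7765  # seconds, from TrainOutput

# Assumes one device and no gradient accumulation
steps_per_epoch = math.ceil(num_sequences / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_sequences * num_epochs / train_runtime

# Perplexity is the exponential of the mean cross-entropy loss
last_logged_loss = 1.569
perplexity = math.exp(last_logged_loss)

print(total_steps, round(samples_per_second, 3), round(perplexity, 2))
```

Under these assumptions the computed step count and throughput agree with `global_step=700` and `train_samples_per_second: 0.928`, and the final training loss corresponds to a perplexity of roughly 4.8 on the training data.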

### Define prompt for text generation

In [18]:
prompt = "Once upon a time"

### Tokenize the given prompt

In [19]:
# Tokenize the prompt into model input
inputs = tokenizer(prompt, return_tensors="pt")
# Move tensors to the correct device (CPU/GPU)
inputs = {key: val.to(model.device) for key, val in inputs.items()}

### Generate text using the given prompt of max-length = 100

In [20]:
# Generate continuation with sampling
output = model.generate(
    **inputs,
    max_length=100,           # max total length in tokens (prompt + generated)
    do_sample=True,           # sample instead of greedy decoding
    temperature=0.8           # controls randomness (lower = more deterministic)
)

# Decode tokens into readable text, skipping special tokens such as <|endoftext|>
print(tokenizer.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time, there was a little boy named Tim. Tim loved to play outside and enjoy the outdoors. One day, he decided to go out and play in his yard. Tim saw lots of colorful things floating around. One day, his mom gave him some food and ran over.

Tim and his mom helped him get some food for himself. They all had so much to say about, but they all needed to share some food for the trip. Tim thought about sharing something special
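
The effect of `temperature` can be sketched in a few lines of plain Python: the model's scores (logits) are divided by the temperature before the softmax, so values below 1 sharpen the distribution towards the most likely token. The logits below are made-up numbers for three candidate tokens, and `softmax_with_temperature` is a hypothetical helper, not the code path used inside `generate`.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]               # hypothetical scores for three candidate tokens

probs_default = softmax_with_temperature(toy_logits, 1.0)
probs_sharp = softmax_with_temperature(toy_logits, 0.8)   # value used in this notebook

# Sample one token index from the temperature-scaled distribution
rng = random.Random(0)
next_token = rng.choices(range(len(toy_logits)), weights=probs_sharp, k=1)[0]

print(probs_default, probs_sharp, next_token)
```

Because sampling is random, repeated calls to `generate` with `do_sample=True` produce different stories from the same prompt.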


### Generate text using the given prompt of max-length = 250

In [21]:
# Generate continuation with sampling
output = model.generate(
    **inputs,
    max_length=250,           # max total length in tokens (prompt + generated)
    do_sample=True,           # sample instead of greedy decoding
    temperature=0.8           # controls randomness (lower = more deterministic)
)

# Decode tokens into readable text, skipping special tokens such as <|endoftext|>
print(tokenizer.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time, there lived a quiet, healthy little girl. She always wanted to have fun, but she also wanted to keep the fun around her.

One day, while walking, she saw a big tree and wanted to see how it grew. She wanted to see if it could grow, but she was too busy spinning it around.

Suddenly, the tree was too tall and fell under her, but she knew it would take her to get there. She tried to make sure it kept her feet up, but the tree made her angry.

Her mom told her that it was best to stay up close to the tree, but she had to stay safe. She decided to stay up close to the tree too. She carefully covered the tree and closed the tree.

The tree went into a state of being in the shade when it finally started to grow. The rain made the tree wet and wet. When it finally came to a stop, the tree felt warm and strong. It felt like it had started to grow. The tree started to grow more and more trees started to grow.

The tree was so happy that it had grown, that it had been able to s

### Generate text using the given prompt of max-length = 500

In [22]:
# Generate continuation with sampling
output = model.generate(
    **inputs,
    max_length=500,           # max total length in tokens (prompt + generated)
    do_sample=True,           # sample instead of greedy decoding
    temperature=0.8           # controls randomness (lower = more deterministic)
)

# Decode tokens into readable text, skipping special tokens such as <|endoftext|>
print(tokenizer.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time, there was a little girl named Lily. She liked to cook with her friends, especially on her favorite meal. One day, she came back to visit her mom. She was so happy, and she had a delicious plate made out of crayons.

Lily was so excited! She made a big cake with crayons that she'd eat all day. She ate it with her family, friends, and the car. They celebrated the cake and took the cake. They were so happy. And then, they celebrated the cake and walked over to the kitchen with their hands clasped together. They were so happy and happy.

From that day on, Lily made a few crayons for her family. It was so warm and inviting that they made a big cake with their hands and crayons on their laps. They were so happy! They were so happy! They watched them and they said goodbye.

Lily was so happy that she decided to share her cake and make a big cake with her friends. The cake was so happy! She made a big cake and shared it with everyone. 
The next day, Lily made a big cake with 

### Generate text using the given prompt of max-length = 1000

In [23]:
# Generate continuation with sampling
output = model.generate(
    **inputs,
    max_length=1000,          # max total length in tokens (prompt + generated)
    do_sample=True,           # sample instead of greedy decoding
    temperature=0.8           # controls randomness (lower = more deterministic)
)

# Decode tokens into readable text, skipping special tokens such as <|endoftext|>
print(tokenizer.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time, there was a little girl named Lucy. She liked to play with her toys and toys. But a few years ago, Lucy's mom gave her permission. Lucy didn't like them, so she decided to put them in a tree. Lucy thought she had to make a big tree with a branch. She decided to make a small tree with a branch with her key. She made two plants with a needle. One day, she saw a big monster in the forest. He was so fast! 


Suddenly, the monster looked up at Lucy and said, "You're the monster. You are the little girl who loves to play alone. Do you have a soft tongue?" Lucy's mom asked. Lucy shook her head.

"No, I don't think so," she said.

"Let's just pretend you are a monster and pretend you are a family member."

Lucy's mom laughed and said, "Did you know that you are a little girl now?"

And so, Lucy walked away. She loved to play with her toys. She loved to play with her friends and share toys so she could support them.
One day, Lucy went to the library to see what she could learn