# Fine-tuning LLMs

This notebook demonstrates fine-tuning a pretrained Large Language Model (LLM) on a small, custom dataset.
In this project, we fine-tune `distilgpt2`, a lightweight GPT-2–style causal language model from Hugging Face.

`distilgpt2` has already been pretrained on a very large corpus of text. During pretraining, it learned general language patterns such as grammar, common topics, and typical text structures.

Fine-tuning on a dataset this small does **not** teach the model new facts or make it reason better.
Instead, it slightly adjusts the model's parameters so that certain patterns become more likely than they were before.
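
To make "become more likely" concrete, here is a toy sketch of a single gradient step on a cross-entropy loss. It is a simplification: the update is applied directly to three hypothetical logits rather than to model weights, and the values and learning rate are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(logits, target, lr):
    """One gradient-descent step on cross-entropy loss.

    The gradient of the loss with respect to the logits is
    softmax(logits) - one_hot(target).
    """
    probs = softmax(logits)
    return [
        x - lr * (p - (1.0 if i == target else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate next tokens
target = 2                # the token that actually appears in the training text

before = softmax(logits)[target]
after = softmax(sgd_step(logits, target, lr=0.5))[target]
print(f"p(target) before: {before:.3f}, after one step: {after:.3f}")
```

The target token started out as the least likely option and is still not the most likely after one step. It has only become more probable than it was, which is the sense in which fine-tuning nudges the model.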

In this notebook, fine-tuning is used to "teach" the model a specific text format. Concretely, we train the model on short texts that all follow the pattern:

```text
### TENET
[a short phrase]
```


In [1]:
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import Dataset



In [2]:

texts = [
    "### TENET\nMeaning arises from consistent action, not abstract belief.",
    "### TENET\nIdeas only matter when they shape behavior.",
    "### TENET\nUnderstanding follows experience rather than preceding it.",
    "### TENET\nPrinciples are validated through consequences, not intentions.",
    "### TENET\nResponsibility gives structure to freedom.",
    "### TENET\nWhat we repeat defines who we become.",
    "### TENET\nClarity is earned through engagement, not speculation.",
    "### TENET\nRules guide action, but action reveals meaning.",
    "### TENET\nChoice without consequence is empty.",
    "### TENET\nPurpose emerges from commitment over time.",
]

dataset = Dataset.from_dict({"text": texts})
dataset

Dataset({
    features: ['text'],
    num_rows: 10
})
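
Because the whole point of the dataset is a consistent format, it can be worth checking that every example really follows it before training. A minimal sketch, assuming the format means a `### TENET` header line followed by exactly one non-empty line; `follows_format` and `sample_texts` are hypothetical names used only here.

```python
import re

# Header line, newline, then a single non-empty line of text.
TENET_PATTERN = re.compile(r"### TENET\n\S[^\n]*")

def follows_format(text):
    """Return True if the whole string matches the tenet format."""
    return TENET_PATTERN.fullmatch(text) is not None

sample_texts = [
    "### TENET\nIdeas only matter when they shape behavior.",
    "### TENET\nChoice without consequence is empty.",
    "TENET: missing the header line",  # deliberately malformed
]
print([follows_format(t) for t in sample_texts])  # [True, True, False]
```

Running the same check over the notebook's `texts` list, for example with `all(follows_format(t) for t in texts)`, would catch a typo in the header before it reaches the model.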

In [3]:
model_name = "distilgpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
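tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers ship without a pad token, so reuse EOS

model = AutoModelForCausalLM.from_pretrained(model_name)
# No tokens were added, so this leaves the embedding matrix at its original size.
model.resize_token_embeddings(len(tokenizer))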



Embedding(50257, 768)

In [4]:
def tokenize(example):
    return tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=64,
    )

tokenized_dataset = dataset.map(tokenize, remove_columns=["text"])

Map: 100%|██████████| 10/10 [00:00<00:00, 114.17 examples/s]


In [89]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)
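
With `mlm=False`, the collator builds the training labels from the input ids. The sketch below imitates that step in plain Python, assuming the collator copies `input_ids` and replaces every pad position with `-100` so that padding does not contribute to the loss. The function name and the short token ids are hypothetical; `50256` is the EOS id that appears in the generation warnings later in this notebook.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy the input ids, masking pad positions so they are excluded from the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

pad_id = 50256                               # EOS id, reused as the pad token
input_ids = [101, 102, 103, pad_id, pad_id]  # hypothetical ids, padded to length 5

print(make_causal_lm_labels(input_ids, pad_id))  # [101, 102, 103, -100, -100]
```

One consequence of reusing EOS as the pad token is that an EOS token in the text would be masked as well, so under this setup the model gets no training signal about where a tenet should end.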

In [90]:
training_args = TrainingArguments(
    output_dir="./principle-model",
    per_device_train_batch_size=2,
    num_train_epochs=2,
    learning_rate=2e-4,
    logging_steps=5,
    save_strategy="no",
    report_to="none",
    fp16=torch.cuda.is_available(),
)
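
These settings determine how many optimizer steps the run takes. A quick back-of-the-envelope check, assuming a single device and no gradient accumulation (neither is set explicitly in the arguments above):

```python
import math

num_examples = 10                 # rows in the dataset
per_device_train_batch_size = 2
num_train_epochs = 2
num_devices = 1                   # assumption: single GPU or CPU
gradient_accumulation_steps = 1   # assumption: left at its default

effective_batch = per_device_train_batch_size * num_devices * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
print(total_steps)  # 10
```

With `logging_steps=5`, ten steps means the loss is logged exactly twice, so the training log is very coarse.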


In [91]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)


In [92]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 5    | 5.9084        |
| 10   | 3.8226        |


TrainOutput(global_step=10, training_loss=4.865499114990234, metrics={'train_runtime': 0.4753, 'train_samples_per_second': 42.077, 'train_steps_per_second': 21.039, 'total_flos': 326620938240.0, 'train_loss': 4.865499114990234, 'epoch': 2.0})
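
The logged losses are easier to interpret as perplexities, which are simply the exponential of the cross-entropy loss. This small sketch converts the two values from the log above. Keep in mind that these are training losses on ten examples, not a held-out evaluation, so they say little about generalization.

```python
import math

logged_losses = {5: 5.9084, 10: 3.8226}  # step -> training loss, copied from the log

# Perplexity is exp(loss) when the loss is a mean cross-entropy in nats.
perplexities = {step: math.exp(loss) for step, loss in logged_losses.items()}

for step, ppl in perplexities.items():
    print(f"step {step}: loss {logged_losses[step]:.4f} -> perplexity {ppl:.1f}")
```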

## Before fine-tuning (base distilgpt2)

When prompted with:

```text
### TENET
```

the base model, which has not been fine-tuned, has no strong prior for what should come next. As a result, it may:

- output blank lines or whitespace
- start an unrelated blog post or article
- produce generic or inconsistent text

This happens because `### TENET` is not a common or well-defined pattern in the model's pretraining data.

In [94]:
base_model = AutoModelForCausalLM.from_pretrained("distilgpt2")

prompt = "### TENET\n"

inputs = tokenizer(prompt, return_tensors="pt")
output = base_model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
)

print(tokenizer.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


### TENET

A few weeks ago with the release of the Ubuntu 13.04 desktop environment I asked about some important changes in Ubuntu 13.04 LTS with


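As expected, the base model ignores the format: it emits a blank line and then starts what reads like an unrelated article or blog post.

Let's try it with the fine-tuned model. Note that this call also sets `temperature` and `repetition_penalty`, so the two samples are not drawn under identical settings: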

In [95]:
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.2,
)

print(tokenizer.decode(output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


### TENET
The key to the success of a society is that we are not, and our actions have consequences. The social order requires action rather than aggression as its


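This works as we would expect: the model respects the `### TENET` formatting and immediately follows it with a tenet-like statement. The text is still partly incoherent and is cut off at the 30-token limit, because ten examples and two epochs are enough to shift the output format but not to make a small model write polished tenets. Since generation is sampled, a single output is only an anecdote, and rerunning the cell will give different text.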
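
To move beyond eyeballing one sample, the format check can be automated. The sketch below is a hypothetical helper that tests whether a generated string starts with the header and puts non-empty text on the very next line. The two strings are shortened copies of the outputs shown above, assuming the line breaks are exactly as displayed.

```python
def respects_tenet_format(generated, header="### TENET"):
    """True if the first line is the header and the second line is non-empty."""
    lines = generated.split("\n")
    return len(lines) >= 2 and lines[0] == header and lines[1].strip() != ""

# Shortened copies of the two generations shown in this notebook.
base_output = "### TENET\n\nA few weeks ago with the release of the Ubuntu 13.04 desktop environment"
tuned_output = "### TENET\nThe key to the success of a society is that we are not, and our actions have consequences."

print(respects_tenet_format(base_output))   # False
print(respects_tenet_format(tuned_output))  # True
```

Calling `generate` many times for each model with the same sampling settings and counting how often this check passes would give a rough format-adherence rate, which is a fairer comparison than one sample per model.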