# Fine-tuning a Large Language Model

In this lecture we will be looking at how to fine-tune an existing pre-trained language model.

## Learning outcomes
* You will learn how to download a pre-trained model and a training dataset from Hugging Face.
* You will learn how to fine-tune the downloaded model with the dataset using the Hugging Face trl library and the supervised fine-tuning (SFT) method.
* You will learn how to use the fine-tuned model to generate text based on user input / prompts.
* You will learn how to upload the fine-tuned model to your own Hugging Face repository so that it can be used later or shared with other users.

## Prerequisites
* You will need the following free accounts: Google, Hugging Face and Weights & Biases. You may use your existing accounts or create new accounts for the purposes of this course.
* We will use the [Hugging Face](https://huggingface.co/) libraries: transformers (for models), datasets (for datasets), trl (for training). We will also store the fine-tuned models in a Hugging Face repository.
* Training is done using [Google Colab](https://colab.research.google.com/), which provides free access to Jupyter notebooks backed by the GPU compute required for fine-tuning.
* For monitoring the training run we will use [Weights & Biases](https://wandb.ai/).


## Fine-tuning

Let's first install the prerequisites using pip, Python's package manager:

In [1]:
!pip install transformers peft accelerate datasets trl wandb bitsandbytes

Collecting trl
  Downloading trl-0.26.1-py3-none-any.whl.metadata (11 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.49.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Collecting transformers
  Downloading transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Downloading tokenizers-0.22.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.13.0->peft)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.13.0->peft)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-
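
Since the install log is long (and cut off here), it can help to confirm which versions actually ended up in the environment before continuing. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is an illustrative assumption and not part of any of the installed libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# The packages requested in the pip command above
PACKAGES = ["transformers", "peft", "accelerate", "datasets", "trl", "wandb", "bitsandbytes"]


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

If a package is reported as not installed, re-running the install cell (and restarting the runtime if Colab asks for it) is usually the first thing to try.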

Then we need to import the required libraries:

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer, TrainingArguments
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer
from huggingface_hub import notebook_login
import torch
import wandb


2025-12-14 13:25:50.058929: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1765718750.405663      13 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1765718750.506478      13 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'




We will download a pre-trained large language model from Hugging Face and a dataset to train the model with. Below we assign these to variables we will use later. We will also set the name of the repository and model for the fine-tuned model.

In [3]:
# Pre-trained model
# model_name = "mistralai/Mistral-7B-v0.3"
model_name = "Qwen/Qwen2.5-1.5B-Instruct"

# Dataset name
# dataset_name = "vicgalle/alpaca-gpt4"
dataset_name = "Abirate/english_quotes"

# Hugging Face repository name

HUGGING_FACE_USERNAME = "filsasso"  # <---- change to your Hugging Face username

# Repository ID for the fine-tuned model (create a new repository on Hugging Face and use its name here)
new_model = f"{HUGGING_FACE_USERNAME}/test"

To access your Hugging Face account, you need to log in. First go to your Hugging Face account, click *Settings* and select *Access Tokens*. Create a new token and copy it. The token needs write access, because we upload the fine-tuned model later. Then execute the login command below and paste the access token when asked.

In [4]:
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Let's then download the dataset we want to use. Below we request at most the first 10,000 examples and then keep a random sample of 1,000 of them in order to save time. In real life you would probably use the full dataset.

In [5]:
# Load the dataset (the split slice keeps at most the first 10,000 examples)
raw_dataset = load_dataset(dataset_name, split="train[:10000]")

# def format_example(example):
#     # Turn the Alpaca-style fields into a single text field
#     if example.get("input"):
#         return {
#             "text": f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
#         }
#     else:
#         return {
#             "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
#         }


def format_example(example):
    return {
        "text": f"### Instruction:\nGenerate a quote by {example['author']}.\n\n### Response:\n{example['quote']}"
    }

dataset = raw_dataset.map(format_example)

dataset = dataset.shuffle(seed=42).select(range(1000))

print("Example of new dataset:")
print(dataset["text"][0])

# Map to a simple {'text': ...} format and keep a tiny subset so it trains quickly
# dataset = raw_dataset.map(format_example)
# dataset = dataset.shuffle(seed=42).select(range(50))
# dataset["text"][0]


README.md: 0.00B [00:00, ?B/s]

quotes.jsonl:   0%|          | 0.00/647k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2508 [00:00<?, ? examples/s]

Map:   0%|          | 0/2508 [00:00<?, ? examples/s]

Example of new dataset:
### Instruction:
Generate a quote by Marilyn Monroe.

### Response:
“I don't mind making jokes, but I don't want to look like one.”
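
The model learns to continue whatever follows `### Response:`, so the prompt used at inference time should match the training template exactly. A minimal sketch of that relationship, using a hand-written record instead of the downloaded dataset (the names `TEMPLATE_PROMPT`, `build_prompt` and `build_training_text` are illustrative assumptions):

```python
# Everything up to and including "### Response:\n" is the prompt part of the template
TEMPLATE_PROMPT = "### Instruction:\nGenerate a quote by {author}.\n\n### Response:\n"


def build_prompt(author: str) -> str:
    """Prompt as it would be sent to the model at inference time."""
    return TEMPLATE_PROMPT.format(author=author)


def build_training_text(example: dict) -> str:
    """Full training text: the prompt followed by the expected response."""
    return build_prompt(example["author"]) + example["quote"]


sample = {
    "author": "Marilyn Monroe",
    "quote": "“I don't mind making jokes, but I don't want to look like one.”",
}

text = build_training_text(sample)
print(text)

# The inference prompt is a prefix of the training text
print(text.startswith(build_prompt(sample["author"])))
```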


Next we download the model. We first create a quantization config object using bitsandbytes, a library that provides k-bit quantization for PyTorch. Here it loads the model weights in 4-bit precision to reduce GPU memory use.

We also need to download the tokenizer.

In [6]:
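bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # load the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",             # use the NF4 (4-bit NormalFloat) data type
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for computation
    bnb_4bit_use_double_quant=False,       # do not quantize the quantization constants
)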
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     quantization_config=bnb_config,
#     device_map={"": 0}
# )

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16
)

model = prepare_model_for_kbit_training(model)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
# model.config.pretraining_tp = 1

# tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# tokenizer.pad_token = tokenizer.eos_token
# tokenizer.add_eos_token = True
# tokenizer.add_bos_token, tokenizer.add_eos_token

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


config.json:   0%|          | 0.00/660 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]
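
To see why 4-bit loading matters on a small GPU, a rough back-of-the-envelope estimate of the weight memory is useful. The sketch below assumes about 1.54 billion parameters for `Qwen/Qwen2.5-1.5B-Instruct` (an assumption, chosen to be consistent with the roughly 3.09 GB 16-bit checkpoint downloaded above). It ignores quantization constants, layers kept in higher precision, activations and optimizer state, so the real footprint will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 1.54e9  # assumed parameter count, see the note above

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label:8s} ~{weight_memory_gb(N_PARAMS, bits):.2f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the float16 size, which leaves more room for activations and the LoRA optimizer state during training.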

Below we log in to Weights & Biases for experiment tracking.

> * In Colab, store your key in the `WANDB_API_KEY` environment variable, or  
> * Call `wandb.login()` and paste the key interactively when prompted.
>
> You can find your key in your [Weights & Biases account](https://wandb.ai/).


In [7]:
# Monitoring login (uses the WANDB_API_KEY environment variable if set)
wandb.login()
run = wandb.init(project="llm-finetuning-demo", job_type="training", anonymous="allow")


UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key])
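
The `UsageError` above appears when no API key is configured and the notebook cannot prompt for one. One possible workaround, sketched here under the assumption that the key is stored in the `WANDB_API_KEY` environment variable, is to pass it explicitly with `wandb.login(key=...)`, as the error message suggests. The helper name `login_to_wandb` is hypothetical.

```python
import os


def login_to_wandb(env=os.environ) -> bool:
    """Log in with the key from WANDB_API_KEY; return False if no key is set."""
    key = env.get("WANDB_API_KEY", "").strip()
    if not key:
        print("WANDB_API_KEY is not set; skipping Weights & Biases login.")
        return False
    # Imported lazily so the check above also works when wandb is not installed
    import wandb
    return bool(wandb.login(key=key))
```

If this returns `False`, setting `report_to="none"` in the training configuration keeps the trainer from trying to log to Weights & Biases.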

Then we'll create a configuration for the low-rank adaptation (LoRA) method we will use.

In [None]:
peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)
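
As a rough illustration of what `r` and `lora_alpha` control: LoRA keeps the original weight matrix `W` frozen and learns a low-rank update, so a layer computes `W x + (lora_alpha / r) * B A x`, where `A` has `r` rows and `B` has `r` columns. With the values above the scaling factor is 8 / 16 = 0.5. The sketch below uses tiny hand-picked matrices and plain Python lists with the same 0.5 scaling; it only illustrates the arithmetic and is not how peft implements the layer.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Frozen base output plus the scaled low-rank update."""
    scale = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]           # frozen base weight (2 x 2)
A = [[1.0, 1.0], [1.0, -1.0]]          # r x d_in, with r = 2 in this toy example
B_initial = [[0.0, 0.0], [0.0, 0.0]]   # d_out x r, all zeros before training
B_trained = [[2.0, 0.0], [0.0, 2.0]]   # made-up values standing in for trained weights
x = [3.0, 4.0]

# lora_alpha=1, r=2 gives the same 0.5 scaling as lora_alpha=8, r=16
print(lora_forward(W, A, B_initial, x, lora_alpha=1, r=2))
print(lora_forward(W, A, B_trained, x, lora_alpha=1, r=2))
```

With `B` set to zeros the layer reproduces the base model's output, which is why zero-initializing `B` (the peft default) lets training start from the pre-trained behavior.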

#### LoRA Target Modules

LoRA adds small trainable matrices into selected linear layers of a transformer.
**Target modules** tell LoRA *which* layers to modify.

**Common module names (LLaMA / Mistral / Qwen)**

**Attention layers**

* **q_proj**: creates attention *queries*
* **k_proj**: creates attention *keys*
* **v_proj**: creates attention *values*
* **o_proj**: attention outputs

**Feed-forward (MLP) layers**

* **gate_proj**: gating in SwiGLU
* **up_proj**: expands hidden size
* **down_proj**: reduces back to model size

**Recommended set for most models**

```python
["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
```

**If VRAM is tight (e.g., T4)**

```python
["q_proj", "k_proj", "v_proj", "o_proj"]
```

Restricting LoRA to the attention projections trains fewer parameters and needs less memory, usually at some cost in how much the model can adapt.
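
To make the memory trade-off concrete: the number of extra trainable parameters LoRA adds to one linear layer with input size `d_in` and output size `d_out` is `r * (d_in + d_out)`. The sketch below compares the two target-module sets for a single transformer block. The layer sizes are made-up round numbers chosen for illustration, not the real dimensions of any of the models mentioned here.

```python
R = 16  # LoRA rank, as in the configuration above

# Hypothetical (d_in, d_out) shapes for one transformer block
LAYER_SHAPES = {
    "q_proj": (1024, 1024),
    "k_proj": (1024, 1024),
    "v_proj": (1024, 1024),
    "o_proj": (1024, 1024),
    "gate_proj": (1024, 4096),
    "up_proj": (1024, 4096),
    "down_proj": (4096, 1024),
}

ATTENTION_ONLY = ["q_proj", "k_proj", "v_proj", "o_proj"]
ALL_MODULES = ATTENTION_ONLY + ["gate_proj", "up_proj", "down_proj"]


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in A (r x d_in) plus parameters in B (d_out x r)."""
    return r * (d_in + d_out)


def block_lora_params(target_modules, r: int = R) -> int:
    """Total LoRA parameters added to one block for the given target modules."""
    return sum(lora_params(*LAYER_SHAPES[name], r) for name in target_modules)


print("attention only:", block_lora_params(ATTENTION_ONLY))
print("all modules:   ", block_lora_params(ALL_MODULES))
```

With these illustrative shapes, targeting only the attention layers adds roughly a third of the parameters per block compared with targeting all seven modules.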


We need to set the training arguments for the training run.

In [None]:
from trl import SFTTrainer, SFTConfig

In [None]:
# training_arguments = TrainingArguments(
#     output_dir="./results",          # Where to save checkpoints & logs
#     num_train_epochs=1,              # Number of full passes through the dataset
#     per_device_train_batch_size=8,   # Batch size per GPU (before gradient accumulation)
#     gradient_accumulation_steps=2,   # Accumulate gradients to simulate a larger batch (8×2 = 16)
#     optim="paged_adamw_8bit",        # Memory-efficient optimizer from bitsandbytes (QLoRA-friendly)
#     save_steps=1000,                 # Save model every 1000 steps (set high to avoid slowing training)
#     logging_steps=10,                # Log metrics to W&B every 10 steps
#     learning_rate=2e-4,              # Base learning rate for training
#     weight_decay=0.001,              # Regularization to reduce overfitting
#     fp16=False,                      # Use float16 (disabled here)
#     bf16=False,                      # Use bfloat16 (disable on GPUs like T4 that don't support it)
#     max_grad_norm=0.3,               # Gradient clipping for training stability
#     max_steps=-1,                    # Train for full epochs (no manual step limit)
#     warmup_ratio=0.3,                # Fraction of steps for LR warmup (30%)
#     group_by_length=True,            # Buckets sequences by length for efficiency
#     lr_scheduler_type="linear",      # Linear learning-rate schedule
#     report_to="wandb",               # Send logs to Weights & Biases
# )

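training_arguments = SFTConfig(
    output_dir="./results",          # Where to save checkpoints and logs
    num_train_epochs=3,              # Number of full passes through the dataset
    per_device_train_batch_size=4,   # Batch size per GPU (before gradient accumulation)
    gradient_accumulation_steps=4,   # Accumulate gradients to simulate a larger batch (4×4 = 16)
    optim="paged_adamw_8bit",        # Memory-efficient optimizer from bitsandbytes
    save_steps=1000,                 # Save a checkpoint every 1000 steps
    logging_steps=10,                # Log metrics to W&B every 10 steps
    learning_rate=2e-4,              # Peak learning rate
    weight_decay=0.001,              # Regularization to reduce overfitting
    fp16=True,                       # Use float16 mixed precision
    bf16=False,                      # bfloat16 disabled (GPUs like the T4 don't support it)
    max_grad_norm=0.3,               # Gradient clipping for training stability
    max_steps=-1,                    # Train for full epochs (no manual step limit)
    warmup_ratio=0.3,                # Fraction of steps used for learning-rate warmup (30%)
    group_by_length=True,            # Group sequences of similar length into batches
    lr_scheduler_type="linear",      # Linear learning-rate schedule
    report_to="wandb",               # Send logs to Weights & Biases
    gradient_checkpointing=True,     # Recompute activations in the backward pass to save memory
    gradient_checkpointing_kwargs={"use_reentrant": False},
)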
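
To get a feel for how these settings interact, the following sketch estimates the number of optimizer steps and the shape of the learning-rate schedule. It assumes a single GPU and the 1,000-example dataset prepared earlier; the trainer's own bookkeeping may round slightly differently, so treat the numbers as approximate. The function name `linear_schedule_lr` is an illustrative assumption.

```python
import math

num_examples = 1000      # size of the sampled training set
per_device_batch = 4     # per_device_train_batch_size
grad_accum = 4           # gradient_accumulation_steps
epochs = 3               # num_train_epochs
warmup_ratio = 0.3
peak_lr = 2e-4           # learning_rate
num_gpus = 1             # assumption: a single Colab GPU

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_schedule_lr(step: int) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {total_steps} ({steps_per_epoch} per epoch)")
print(f"warmup steps:         {warmup_steps}")
for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(f"lr at step {step:3d}: {linear_schedule_lr(step):.2e}")
```

Under these assumptions the run is much shorter than `save_steps=1000`, so step-based checkpointing would not trigger partway through training.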

Finally, we create the trainer object that uses supervised fine-tuning (SFT) as the training method.

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_arguments,
    processing_class=tokenizer,
)

Then, we can execute the training run.

In [None]:
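# Train model
import os
# Reduce CUDA memory fragmentation. Note: PyTorch reads this setting when it initializes
# its CUDA allocator, so it may have no effect this late; if needed, set it at the top of the notebook.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
trainer.train()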

In [None]:
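# Save the fine-tuned LoRA adapter weights to a local directory named after new_model
trainer.model.save_pretrained(new_model)
wandb.finish()                 # close the W&B run
model.config.use_cache = True  # re-enable the KV cache for inference
model.eval()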

In [None]:
# def stream(user_prompt: str):
#     # Put model in eval mode
#     model.eval()

#     # Works even with device_map="auto"
#     device = next(model.parameters()).device

#     system_prompt = (
#         "Below is an instruction that describes a task. "
#         "Write a response that appropriately completes the request.\n\n"
#     )
#     B_INST, E_INST = "### Instruction:\n", "\n\n### Response:\n"
#     prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}{E_INST}"

#     # Move inputs to the same device as the model
#     inputs = tokenizer(prompt, return_tensors="pt").to(device)

#     # Stream tokens directly to notebook output
#     streamer = TextStreamer(
#         tokenizer,
#         skip_prompt=True,          # don't print the full prompt
#         skip_special_tokens=True,
#     )

#     with torch.inference_mode():
#         _ = model.generate(
#             input_ids=inputs["input_ids"],
#             attention_mask=inputs["attention_mask"],
#             max_new_tokens=256,
#             do_sample=True,
#             temperature=0.7,
#             top_p=0.9,
#             streamer=streamer,
#             eos_token_id=tokenizer.eos_token_id,
#         )

def stream(user_prompt: str):
    model.eval()

    device = next(model.parameters()).device
    B_INST, E_INST = "### Instruction:\n", "\n\n### Response:\n"

    prompt = f"{B_INST}{user_prompt.strip()}{E_INST}"

    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    streamer = TextStreamer(
        tokenizer,
        skip_prompt=True,
        skip_special_tokens=True,
    )

    print(f"Prompt: {user_prompt}\n")
    print("Generation:", end=" ")

    with torch.inference_mode():
        _ = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=64,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            streamer=streamer,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id
        )

In [None]:
# stream("what is newtons 3rd law and its formula?")
stream("Generate a quote by Albert Einstein.")
print("\n" + "-"*30 + "\n")
stream("Generate a quote by Ralph Waldo Emerson.")

In [None]:
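# 4-bit config for reloading the base model
# (unlike the training config above, this one uses double quantization and bfloat16 compute)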
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

model = PeftModel.from_pretrained(base_model, new_model)

# Try merging LoRA into the base model
model = model.merge_and_unload()  # may still be heavy on T4 depending on model size

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [None]:
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)
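
Once the upload has finished, the model can be loaded again in a fresh session directly from the repository. The following is a minimal sketch, assuming the repository contains both the model and the tokenizer files and that the runtime has a GPU with bitsandbytes installed (the uploaded weights were loaded in 4-bit). The helper name `load_finetuned` is hypothetical.

```python
def load_finetuned(repo_id: str):
    """Download the model and tokenizer stored under repo_id on the Hugging Face Hub."""
    # Imported lazily so that defining the helper has no side effects
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    model.eval()  # switch to inference mode
    return model, tokenizer


# Example usage (requires network access and the uploaded repository):
# model, tokenizer = load_finetuned(new_model)
# stream("Generate a quote by Albert Einstein.")
```

Because `stream` reads the global `model` and `tokenizer`, assigning the returned objects to those names is enough to reuse it for generation.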