# Fine-tuning a Large Language Model

In this lecture we will be looking at how to fine-tune an existing pre-trained language model.

## Learning outcomes
* You will learn how to download a pre-trained model and a training dataset from Hugging Face.
* You will learn how to fine-tune the downloaded model with the dataset using the Hugging Face trl library and the supervised fine-tuning (SFT) method.
* You will learn how to use the fine-tuned model to generate text based on user input / prompts.
* You will learn how to upload the fine-tuned model to your own Hugging Face repository so that it can be used later or shared with other users.

## Prerequisites
* You will need the following free accounts: Google, Hugging Face and Weights & Biases. You may use your existing accounts or create new accounts for the purposes of this course.
* We will use the [Hugging Face](https://huggingface.co/) libraries: transformers (for models), datasets (for datasets), trl (for training). We will also store the fine-tuned models in a Hugging Face repository.
* Training is done using [Google Colab](https://colab.research.google.com/), which provides free access to Jupyter notebooks backed by the GPU compute required for fine-tuning.
* For monitoring the training run we will use [Weights & Biases](https://wandb.ai/).


## Fine-tuning

Let's first install the required packages using pip, Python's package manager:

In [1]:
!pip install transformers peft accelerate datasets trl wandb bitsandbytes

Collecting trl
  Downloading trl-0.25.1-py3-none-any.whl.metadata (11 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Downloading trl-0.25.1-py3-none-any.whl (465 kB)
Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl (59.4 MB)
Installing collected packages: bitsandbytes, trl
Successfully installed bitsandbytes-0.48.2 trl-0.25.1
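
Library APIs in this stack change between releases, so it can help to record which versions a run used; the log above shows trl 0.25.1 and bitsandbytes 0.48.2. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical, and packages that are not installed are reported as `None`.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages installed by the pip command above
PACKAGES = ["transformers", "peft", "accelerate", "datasets", "trl", "wandb", "bitsandbytes"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```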


Then we need to import the required libraries:

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer, TrainingArguments
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer
from huggingface_hub import notebook_login
import torch
import wandb


We will download a pre-trained large language model from Hugging Face and a dataset to train the model with. Below we assign these to variables we will use later. We will also set the name of the repository and model for the fine-tuned model.

In [26]:
# Pre-trained model
#model_name = "mistralai/Mistral-7B-v0.3"
model_name = "Qwen/Qwen3-4B-Instruct-2507"

# Dataset name
dataset_name = "vicgalle/alpaca-gpt4"

HUGGING_FACE_USERNAME = "BuseLk"  # <---- change to your hugging face username

# Hugging Face repository for the fine-tuned model (create a new repository on Hugging Face and paste its name here)
#new_model = f"{HUGGING_FACE_USERNAME}/mistral-7b-finetune"
new_model = f"{HUGGING_FACE_USERNAME}/Qwen3-4b-finetune"

To access your Hugging Face account, you need to log in. In your Hugging Face account, click *Settings* and select *Access Tokens*, then create a new token (with write permission, since we upload a model later) and copy it. Run the login command below and paste the access token when prompted.

In [27]:
notebook_login()


Let's then download a subset of the dataset we want to use. Below we load only the first 10,000 examples and then keep a shuffled subset of 50 so that training finishes quickly. In practice you would probably train on the full dataset.

In [28]:
# Load a small subset of the instruction-tuning dataset
raw_dataset = load_dataset(dataset_name, split="train[:10000]")

def format_example(example):
    # Turn the Alpaca-style fields into a single text field
    if example.get("input"):
        return {
            "text": f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
        }
    else:
        return {
            "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
        }

# Map to a simple {'text': ...} format and keep a tiny subset so it trains quickly
dataset = raw_dataset.map(format_example)
dataset = dataset.shuffle(seed=42).select(range(50))
dataset["text"][0]


'### Instruction:\nSuggest ways to reduce environmental pollution\n\n### Response:\nThere are several ways that individuals and communities can take actions to reduce environmental pollution, including:\n\n1. Reduce, Reuse, Recycle: Decreasing the amount of waste that is produced, finding new uses for items that would otherwise be thrown away, and properly recycling materials can help reduce pollution.\n\n2. Use Public Transport or Carpool: By using public transport or carpooling, you can significantly reduce your carbon footprint and help decrease emissions that contribute to air pollution.\n\n3. Conserve Energy: Simple actions like turning off the lights when you leave a room, unplugging electronics when not in use, and using energy-efficient appliances can help reduce your energy consumption and decrease pollution.\n\n4. Reduce Water Waste: Fixing leaks, taking shorter showers, and being mindful of water usage when doing household chores such as washing dishes or laundry can help re
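
The sample above has no `input` field, so only the shorter template is visible. The sketch below repeats `format_example` from the cell above and applies it to a hand-written record (hypothetical, not taken from the dataset) to show the three-section prompt produced when `input` is non-empty.

```python
def format_example(example):
    # Same logic as the notebook cell above
    if example.get("input"):
        return {
            "text": f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
        }
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    }


# Hypothetical record, for illustration only
record = {
    "instruction": "Translate the sentence to French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}

formatted = format_example(record)["text"]
print(formatted)
```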

Let's then download the model. We first create a configuration object that quantizes the model with bitsandbytes, a library that provides k-bit quantization for PyTorch and so makes large language models usable on modest hardware.

We also need to download the tokenizer.
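
To see why quantization matters on a free Colab GPU, a rough estimate of weight memory at different precisions helps. The sketch below assumes about 4 billion parameters for the Qwen model (an approximation inferred from the model name) and counts the weights only. Activations, optimizer state and quantization metadata are ignored, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 4e9  # assumption: roughly 4 billion parameters

for label, bits in [("float32", 32), ("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under this assumption, float16 weights occupy about half of a 15 GiB GPU before any training state is allocated, while 4-bit storage brings the weights to under 2 GiB.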

In [29]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token


OutOfMemoryError: CUDA out of memory. Tried to allocate 48.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 16.12 MiB is free. Process 3635 has 14.69 GiB memory in use. Of the allocated memory 14.47 GiB is allocated by PyTorch, and 91.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Below we log in to Weights & Biases for experiment tracking.

> * In Colab, store your key in the `WANDB_API_KEY` environment variable, or  
> * Call `wandb.login()` and paste the key interactively when prompted.
>
> You can find your key in your [Weights & Biases account](https://wandb.ai/).
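
A small check like the following can confirm that the environment variable is visible to the notebook before `wandb.login()` is called. It is a sketch using only the standard library; the helper name `wandb_key_configured` is hypothetical, and the key itself is never printed.

```python
import os


def wandb_key_configured(env=None):
    """Return True if WANDB_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("WANDB_API_KEY", "").strip())


if wandb_key_configured():
    print("WANDB_API_KEY found; wandb.login() should not need to prompt.")
else:
    print("WANDB_API_KEY not set; wandb.login() will ask for the key.")
```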


In [7]:
# Monitoring login (uses the WANDB_API_KEY environment variable if set)
wandb.login()
run = wandb.init(project="llm-finetuning-demo", job_type="training", anonymous="allow")


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: buddhimasenarathna (buddhimasenarathna-university-of-helsinki) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Then we'll create a configuration for low-rank adaptation (LoRA), the parameter-efficient fine-tuning method we will use.

In [8]:
peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

#### LoRA Target Modules

LoRA adds small trainable matrices into selected linear layers of a transformer.
**Target modules** tell LoRA *which* layers to modify.

**Common module names (LLaMA / Mistral / Qwen)**

**Attention layers**

* **q_proj**: creates attention *queries*
* **k_proj**: creates attention *keys*
* **v_proj**: creates attention *values*
* **o_proj**: projects the combined attention output back to the model size

**Feed-forward (MLP) layers**

* **gate_proj**: gating in SwiGLU
* **up_proj**: expands hidden size
* **down_proj**: reduces back to model size

**Recommended set for most models**

```python
["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
```

**If VRAM is tight (e.g., T4)**

```python
["q_proj", "k_proj", "v_proj", "o_proj"]
```

Restricting LoRA to the attention projections trains fewer parameters and uses less memory, usually at some cost in quality compared with adapting all seven modules.
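
To make the memory trade-off concrete, the number of trainable LoRA parameters can be estimated: an adapter on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` weights. The sketch below uses `r=16` from the config above, plus the layer count and the `q_proj` and `k_proj` shapes shown in the Mistral model printout later in this notebook. The remaining shapes, including the MLP width of 14336, are assumptions based on the Mistral-7B architecture, so treat the totals as approximate.

```python
R = 16           # LoRA rank from peft_config
NUM_LAYERS = 32  # decoder layers in the model printout

# (in_features, out_features) per module; v_proj, o_proj and MLP shapes are assumed
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}


def lora_params(target_modules, r=R, num_layers=NUM_LAYERS):
    """Trainable parameters added by LoRA: A is (r x in), B is (out x r)."""
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * num_layers


attention_only = ["q_proj", "k_proj", "v_proj", "o_proj"]
all_modules = list(SHAPES)

print(f"attention only: {lora_params(attention_only):,}")
print(f"all modules:    {lora_params(all_modules):,}")
```

Under these assumptions, attention-only adapters train roughly a third as many parameters as the full set.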


We need to set the training arguments for the training run.

In [9]:
training_arguments = TrainingArguments(
    output_dir="./results",          # Where to save checkpoints & logs
    num_train_epochs=1,              # Number of full passes through the dataset
    per_device_train_batch_size=8,   # Batch size per GPU (before gradient accumulation)
    gradient_accumulation_steps=2,   # Accumulate gradients to simulate a larger batch (8×2 = 16)
    optim="paged_adamw_8bit",        # Memory-efficient optimizer from bitsandbytes (QLoRA-friendly)
    save_steps=1000,                 # Save model every 1000 steps (set high to avoid slowing training)
    logging_steps=10,                # Log metrics to W&B every 10 steps
    learning_rate=2e-4,              # Base learning rate for training
    weight_decay=0.001,              # Regularization to reduce overfitting
    fp16=False,                      # Use float16 (disabled here)
    bf16=False,                      # Use bfloat16 (disable on GPUs like T4 that don't support it)
    max_grad_norm=0.3,               # Gradient clipping for training stability
    max_steps=-1,                    # Train for full epochs (no manual step limit)
    warmup_ratio=0.3,                # Fraction of steps for LR warmup (30%)
    group_by_length=True,            # Buckets sequences by length for efficiency
    lr_scheduler_type="linear",      # Linear learning-rate schedule
    report_to="wandb",               # Send logs to Weights & Biases
)
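
The interaction between batch size, gradient accumulation and warmup is easier to see with numbers. The sketch below reproduces the arithmetic for this run (50 examples, batch size 8, 2 accumulation steps) together with a linear warmup-then-decay schedule. It approximates what the trainer does rather than calling it; the helper name `linear_lr` is hypothetical, and the rounding rules are assumptions that may differ between library versions.

```python
import math

NUM_EXAMPLES = 50
BATCH_SIZE = 8
GRAD_ACCUM = 2
EPOCHS = 1
BASE_LR = 2e-4
WARMUP_RATIO = 0.3

effective_batch = BATCH_SIZE * GRAD_ACCUM
batches_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)
# Assumption: a final, partial accumulation group still triggers an update
steps_per_epoch = math.ceil(batches_per_epoch / GRAD_ACCUM)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def linear_lr(step, total=total_steps, warmup=warmup_steps, base=BASE_LR):
    """Learning rate used for optimizer step `step` (0-indexed)."""
    if step < warmup:
        return base * step / max(1, warmup)
    return base * max(0.0, (total - step) / max(1, total - warmup))


print("effective batch size:", effective_batch)
print("optimizer steps:", total_steps)
print("warmup steps:", warmup_steps)
print("learning rates:", [linear_lr(s) for s in range(total_steps)])
```

The four optimizer steps agree with `global_step=4` in the training output further down. Because `logging_steps=10` exceeds that, this likely explains why the loss table there is empty.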


Finally we create the trainer object that uses supervised fine-tuning (SFT) as the training method.

In [10]:
# Setting SFT parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_arguments,
    processing_class=tokenizer,
)


Then, we can execute the training run.

In [11]:
# Train model
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Step,Training Loss


TrainOutput(global_step=4, training_loss=1.286867380142212, metrics={'train_runtime': 79.7317, 'train_samples_per_second': 0.627, 'train_steps_per_second': 0.05, 'total_flos': 694851837689856.0, 'train_loss': 1.286867380142212, 'entropy': 1.5495668649673462, 'num_tokens': 9676.0, 'mean_token_accuracy': 0.6483437418937683, 'epoch': 1.0})

In [12]:
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
wandb.finish()
model.config.use_cache = True
model.eval()

Run history:

| Metric | Value |
| --- | --- |
| train/entropy | ▁ |
| train/epoch | ▁ |
| train/global_step | ▁ |
| train/mean_token_accuracy | ▁ |
| train/num_tokens | ▁ |

Run summary:

| Metric | Value |
| --- | --- |
| total_flos | 694851837689856.0 |
| train/entropy | 1.54957 |
| train/epoch | 1.0 |
| train/global_step | 4.0 |
| train/mean_token_accuracy | 0.64834 |
| train/num_tokens | 9676.0 |
| train_loss | 1.28687 |
| train_runtime | 79.7317 |
| train_samples_per_second | 0.627 |
| train_steps_per_second | 0.05 |


MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32768, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=16, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=16, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (k_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
            (lora_dropout): ModuleDict(


In [15]:
def stream(user_prompt: str):
    # Put model in eval mode
    model.eval()

    # Works even with device_map="auto"
    device = next(model.parameters()).device

    system_prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
    )
    B_INST, E_INST = "### Instruction:\n", "\n\n### Response:\n"
    prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}{E_INST}"

    # Move inputs to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Stream tokens directly to notebook output
    streamer = TextStreamer(
        tokenizer,
        skip_prompt=True,          # don't print the full prompt
        skip_special_tokens=True,
    )

    with torch.inference_mode():
        _ = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            streamer=streamer,
            eos_token_id=tokenizer.eos_token_id,
        )

In [None]:
stream("What are artificial neural networks?")

In [23]:
stream("What is the Newton's 1st law of motion?")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



The first law of motion states that an object at rest will remain at rest unless an outside force acts on it, and an object in motion will not change its velocity unless an outside force acts on it.

### Explanation:

The response correctly identifies the first law of motion and provides a clear explanation of its meaning. The response is written in a concise and easy-to-understand manner, using appropriate terminology and sentence structure. The response also includes a relevant image to support the explanation, which enhances the understanding of the law and its implications.

### Feedback:

The response is well-written and provides a clear explanation of the first law of motion. The response is written in a concise and easy-to-understand manner, using appropriate terminology and sentence structure. The response includes a relevant image to support the explanation, which enhances the understanding of the law and its implications. Overall, the response is well-written and provides a 
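
In the output above the model keeps writing new `###` sections after the answer. One simple way to handle this, shown here as a sketch rather than as part of the original notebook, is to post-process the decoded text and cut it at the first section marker. The helper `truncate_at_marker` and its default marker are hypothetical.

```python
def truncate_at_marker(text, markers=("\n###",)):
    """Return the text up to the earliest marker, stripped of surrounding whitespace."""
    cut = len(text)
    for marker in markers:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].strip()


# Shortened version of the generation shown above
generated = (
    "The first law of motion states that an object at rest will remain at rest "
    "unless an outside force acts on it.\n\n### Explanation:\n\nThe response ..."
)

print(truncate_at_marker(generated))
```

Because the helper operates on the complete decoded string, it applies to non-streaming generation.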

In [24]:
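# Quantization config for reloading the base model. Note that it differs from the
# training config above: double quantization is enabled and the compute dtype is bfloat16.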
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

model = PeftModel.from_pretrained(base_model, new_model)

# Try merging LoRA into the base model
model = model.merge_and_unload()  # may still be heavy on T4 depending on model size

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
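
Merging folds the low-rank update into the base weights, so the adapter is no longer needed at inference time. The sketch below illustrates the arithmetic with tiny hand-picked matrices in pure Python: the merged weight is `W + (lora_alpha / r) * B @ A`, and it yields the same output as applying the base layer and the adapter separately. The matrices are hypothetical toy values (rank 1 for brevity), with the scaling factor taken from `peft_config` purely for illustration. The sketch ignores quantization, which makes merging into a 4-bit model only approximate.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def scale(a, s):
    """Multiply every element of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]


# Toy values (hypothetical): 2x2 base weight and a rank-1 adapter
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, -1.0]]     # shape (r, in_features) = (1, 2)
B = [[2.0], [4.0]]    # shape (out_features, r) = (2, 1)
scaling = 8 / 16      # lora_alpha / r from peft_config

x = [[1.0], [2.0]]    # one input column vector

# Adapter applied separately: W x + scaling * B (A x)
separate = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Adapter merged into the weight: (W + scaling * B A) x
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print("separate:", separate)
print("merged:  ", merged)
```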


In [25]:
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)


No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/BuseLk/mistral-7b-finetune/commit/02a9148515188c63f23fbcb015fcb5ec1b5b1a73', commit_message='Upload tokenizer', commit_description='', oid='02a9148515188c63f23fbcb015fcb5ec1b5b1a73', pr_url=None, repo_url=RepoUrl('https://huggingface.co/BuseLk/mistral-7b-finetune', endpoint='https://huggingface.co', repo_type='model', repo_id='BuseLk/mistral-7b-finetune'), pr_revision=None, pr_num=None)