<a href="https://colab.research.google.com/github/callaghanmt-training/ou-fine-tuning-2025-11/blob/main/fine_tuning_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 2: Fine Tuning using Unsloth
In this notebook, we will fine-tune an SLM using the (Q)LoRA methodology and the [Unsloth](https://docs.unsloth.ai/) library.

Unsloth is generally a good choice for resource-constrained environments and beginners: it is optimised to make training faster and use less VRAM, and it abstracts away much of the complexity of training with the standard Hugging Face ecosystem. We _do_ lose some flexibility this way, but it allows us to get meaningful results in much less time.

## Before you start
1. Make sure you have connected to a T4 runtime and clicked the **[Connect]** button
2. Add your Huggingface Access Token to the notebook 'secrets'

## Notebook Overview
1. **Setup**: Install Unsloth and dependencies.
1. **Model Loading**: Load Gemma 2 2B (IT version) in 4-bit quantization (QLoRA).
1. **Data Prep**: Create a small synthetic "Stoic Wisdom" dataset directly in the notebook, so the first part of the workshop does not depend on external file uploads. (Optionally, repeat the fine-tuning run later with a larger external dataset.)
1. **Configuration**: Set up LoRA adapters.
1. **Training**: Run the fine-tuning process.
1. **Inference**: Test the new "Philosopher" personality.
1. **Energy Use**: Run the fine-tuning loop again, this time wrapped with the `codecarbon` library, to estimate the energy use and carbon footprint of fine-tuning.


## 1. Installation
Unsloth requires a specific installation order to use the GPU kernels.

In [None]:
# Confirm we have access to an NVIDIA GPU
!nvidia-smi

In [None]:
# 1. Install Unsloth and dependencies for Colab (2-3 minutes for this cell)
# We use the specific Colab install command provided by Unsloth
!pip install --quiet "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# 2. Install Hugging Face libraries
#!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

In [None]:
!pip install --no-deps --quiet xformers trl peft accelerate bitsandbytes

## 2. Imports and Configuration
We set up the parameters below. We use a `max_seq_length` of 2048 tokens, a common default that is far more than our short examples need.

In [None]:
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset

max_seq_length = 2048
dtype = None # None = auto detection. Float16 for Tesla T4, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage

print("Libraries loaded and configuration set.")

## 3. Load Model and Tokenizer
We load the **Instruction Tuned** version of Gemma 2.

_Why fine-tune the IT version?_ It already understands chat structure (it's a _language engine_). We are just "steering" its style from "Generic Assistant" to "Stoic Philosopher."

(This bit takes ~3 minutes)
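
To give a feel for why 4-bit loading matters on a T4, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. The parameter count is an assumption (roughly 2.6 billion), and the estimate ignores quantization metadata, activations, optimizer state and any layers kept at higher precision, so treat the numbers as indicative only.

```python
def weight_bytes(n_params: int, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in bytes."""
    return n_params * bits_per_param / 8


# Assumption: roughly 2.6 billion parameters for the 2B model.
N_PARAMS = 2_600_000_000

for bits in (16, 4):
    gib = weight_bytes(N_PARAMS, bits) / 2**30
    print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB")
```

Whatever the exact parameter count, storing weights in 4 bits takes a quarter of the space of 16-bit storage, which leaves more VRAM for activations and the LoRA adapters.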

In [None]:
model_name = "unsloth/gemma-2-2b-it-bnb-4bit"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print("Gemma 2 2B (4-bit) loaded successfully.")

## 4. Data Preparation (The "Stoic" Dataset)
For the speedy version of this workshop, creating data on the fly is safer than downloading files. Here we define a list of modern problems and map them to **Stoic** responses (Marcus Aurelius/Seneca style).
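
The cell below formats each instruction/response pair with the tokenizer's chat template. As a rough illustration of what that produces, here is a hand-rolled sketch of a Gemma-style two-turn string. The helper `gemma_style_turns` is hypothetical and only approximates the real template, which lives in the tokenizer and should always be used for actual training.

```python
def gemma_style_turns(instruction: str, output: str) -> str:
    """Hand-rolled approximation of a Gemma-style two-turn chat string."""
    return (
        "<bos>"
        f"<start_of_turn>user\n{instruction}<end_of_turn>\n"
        f"<start_of_turn>model\n{output}<end_of_turn>\n"
    )


example = gemma_style_turns(
    "I am worried about the future.",
    "Do not suffer before it is necessary.",
)
print(example)
```

Seeing the markers written out makes it easier to check the "Example formatted entry" printed by the real formatting function.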

In [None]:
# Define a system prompt describing the target style
# (kept for reference; the formatting function below does not use it)
system_prompt = """You are a Stoic philosopher in the style of Marcus Aurelius and Seneca.
Answer the user's questions with wisdom, focusing on what is within their control,
the nature of the universe, and the importance of virtue over external events.
Keep your tone calm, archaic, and reflective."""

# Synthetic Data: Modern problems -> Stoic solutions
stoic_data = [
    {"instruction": "I am stuck in traffic and I'm going to be late!", "output": "Why do you let external events disturb your inner peace? The traffic is not within your control. Your reaction to it is. Accept this moment as nature intends. Use this time for reflection rather than anger."},
    {"instruction": "My boss yelled at me today.", "output": "If your boss acted with reason, learn from it. If they acted with anger, that is their harm, not yours. Do not let another's lack of self-control destroy your own. You can only be harmed if you believe you are harmed."},
    {"instruction": "I want to buy a new expensive car to impress my neighbors.", "output": "Desire is a chain. You seek to impress those who do not care for you with things you do not need. True wealth is found in the want of nothing, not the possession of much. Seek virtue, not applause."},
    {"instruction": "I am worried about the future.", "output": "The future is uncertain, but your character is constant. Do not suffer before it is necessary. If the future brings hardship, you will face it with the same weapons of reason you possess today."},
    {"instruction": "Someone insulted me on the internet.", "output": "The insult exists only if you accept it. A rock thrown into the air gains nothing by going up and loses nothing by falling down. Your value is not determined by the opinions of strangers."},
    {"instruction": "I feel overwhelmed by my todo list.", "output": "Do not confuse activity with action. Focus on the task at hand, as if it were the last thing you were doing in your life. Do it with dignity and without distraction. The rest will follow."},
    # ... In a real scenario, we would want 100+ examples.
    # For this workshop demo, we will duplicate these to simulate a training run.
] * 10

# Convert to Hugging Face Dataset
dataset = Dataset.from_list(stoic_data)

# Function to format the data into the chat structure Gemma expects
def formatting_prompts_func(examples):
    texts = []
    for instruction, output in zip(examples["instruction"], examples["output"]):
        # We use the tokenizer's chat template.
        # Note: system_prompt is not inserted here; the model learns the Stoic
        # style from the example responses alone.
        text = tokenizer.apply_chat_template([
            {"role": "user", "content": instruction},
            {"role": "model", "content": output}
        ], tokenize=False, add_generation_prompt=False)
        texts.append(text)
    return {"text": texts}

# Apply formatting
dataset = dataset.map(formatting_prompts_func, batched=True)

print(f"Dataset created with {len(dataset)} examples.")
print("Example formatted entry:", dataset[0]["text"])

## 5. Setting up LoRA (Low-Rank Adaptation)
We don't retrain the whole model (that's too expensive). We add small "adapter" layers.

`r`: The rank. Higher = more parameters to train (slower, maybe smarter). 16 is standard.   
`target_modules`: The specific internal layers of the model we are modifying (these names are listed in the documentation, since the architecture is known).
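
To make "higher rank = more parameters" concrete, here is a small sketch of how many trainable parameters LoRA adds to a single linear layer. The layer size used is hypothetical and chosen only for illustration; the scaling factor `lora_alpha / r` is the one used by standard LoRA (not the rank-stabilised variant).

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer.

    LoRA learns two small matrices: A (r x d_in) and B (d_out x r).
    """
    return r * (d_in + d_out)


def lora_scale(lora_alpha: int, r: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


# Hypothetical layer size, for illustration only.
d_in, d_out = 2048, 2048
full = d_in * d_out

for r in (8, 16, 64):
    added = lora_params(d_in, d_out, r)
    print(f"r={r:>2}: {added:,} adapter params vs {full:,} frozen ({added / full:.2%})")

print("scale with lora_alpha=16, r=16:", lora_scale(16, 16))
```

The adapter size grows linearly with `r`, while the frozen weight matrix stays the same size, which is why even fairly large ranks remain cheap compared with full fine-tuning.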

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimised
    bias = "none",    # Supports any, but = "none" is optimised
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # Rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

print("LoRA adapters attached.")

## 6. Training (The Fine-Tuning)
We use the `SFTTrainer` class from the `trl` library.

`max_steps`: For this workshop, set this to 60. It takes about 2-3 minutes. In real life, we'd train for multiple epochs.   
`learning_rate`: `2e-4` is the standard "magic number" for QLoRA (again, from the literature).
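
To relate `max_steps` to epochs, here is a quick arithmetic sketch using the batch settings from the training cell and our 60-row dataset. It assumes a single GPU and ignores how the trainer rounds the final partial batch, so the epoch count is approximate.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
num_examples = 60      # 6 hand-written examples duplicated 10 times
unique_examples = 6

# Examples consumed per optimizer step (single GPU assumed).
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = max_steps * effective_batch
approx_epochs = examples_seen / num_examples

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")
print(f"Approximate passes over the dataset: {approx_epochs:.1f}")
print(f"Times each unique example is seen: ~{examples_seen // unique_examples}")
```

Because the dataset is six examples repeated, the model sees each one many times, so expect the training loss to fall quickly; this is fine for a demo but is not a substitute for a larger, more varied dataset.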

(The training loop takes about 3-4 minutes)
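
The training cell also sets `warmup_steps = 5` and `lr_scheduler_type = "linear"`. As a sketch of the general shape this gives, the hypothetical helper below ramps the learning rate up linearly during warmup and then decays it linearly to zero. The exact values used by the trainer may differ slightly from this simplified version.

```python
def linear_schedule_lr(step: int, base_lr: float = 2e-4,
                       warmup_steps: int = 5, max_steps: int = 60) -> float:
    """Simplified linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


for step in (0, 1, 5, 30, 59, 60):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```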

In [None]:
# Define a basic LoRA training function
def lora_loop():
    trainer = SFTTrainer( #HF Supervised Fine Tuning Trainer
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        dataset_num_proc = 2,
        packing = False, # Can make training faster for large datasets
        args = TrainingArguments(
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            warmup_steps = 5,
            max_steps = 60, # Set to 60 for a quick workshop demo!
            learning_rate = 2e-4,
            fp16 = not torch.cuda.is_bf16_supported(),
            bf16 = torch.cuda.is_bf16_supported(),
            logging_steps = 1,
            optim = "adamw_8bit",
            weight_decay = 0.01,
            lr_scheduler_type = "linear",
            seed = 3407,
            output_dir = "outputs",
            # Disable wandb logging
            report_to = "none",
        ),
    )

    print("Starting training...")
    trainer_stats = trainer.train()
    print("Training complete!")

In [None]:
# Execute the training loop
lora_loop()

## 7. Inference (Testing the Philosopher)
Now we test the model. We need to use `FastLanguageModel.for_inference` to enable the optimised inference speeds.
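
The decoded output contains the whole conversation, including the chat markers, so the cell below splits on `<start_of_turn>model` to keep only the reply. Here is a slightly more thorough sketch of that clean-up as a reusable helper. The function name `extract_reply` and the sample string are hypothetical; the sample is a hand-written stand-in, not real model output.

```python
def extract_reply(decoded: str) -> str:
    """Keep only the model's last turn and drop trailing chat markers."""
    reply = decoded.split("<start_of_turn>model")[-1]
    for marker in ("<end_of_turn>", "<eos>"):
        reply = reply.replace(marker, "")
    return reply.strip()


# Hand-written stand-in for a decoded generation.
sample = (
    "<bos><start_of_turn>user\nI've lost my car keys. Again<end_of_turn>\n"
    "<start_of_turn>model\nThe keys are external to you.<end_of_turn>"
)
print(extract_reply(sample))
```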



In [None]:
# Enable native 2x faster inference
FastLanguageModel.for_inference(model)


# A new question, not in the training data
prompt = "I've lost my car keys. Again"

messages = [
    {"role": "user", "content": prompt},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add this to signal the model to generate
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(
    input_ids = inputs,
    max_new_tokens = 128,
    use_cache = True,
    do_sample = True,  # Sampling must be on for temperature to take effect
    temperature = 0.7, # Add a little creativity
)

# Decode the response
response = tokenizer.batch_decode(outputs)
print(response[0].split("<start_of_turn>model")[-1]) # Clean up output

In [None]:
# The previous cell may print an irritating `attention mask` warning. This typically happens
# when the Pad Token shares its ID with the EOS (End of Sequence) Token: if we pass raw
# input IDs without an explicit mask, the model cannot tell where the real tokens end
# and where "padding" begins.

# Fix: manually create an attention mask of all 1s and pass it to `model.generate`.

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# A new question, not in the training data
prompt = "I lost my car keys and I am very angry."

messages = [
    {"role": "user", "content": prompt},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

# FIX: Create an attention mask of 1s (since all tokens are valid)
attention_mask = torch.ones_like(inputs)

outputs = model.generate(
    input_ids = inputs,
    attention_mask = attention_mask, # <--- Pass the mask here
    max_new_tokens = 128,
    use_cache = True,
    do_sample = True, # Sampling must be on for temperature to take effect
    temperature = 0.7,
    pad_token_id = tokenizer.eos_token_id, # Optional: set pad_token_id explicitly to silence related warnings
)

# Decode the response
response = tokenizer.batch_decode(outputs)
print(response[0].split("<start_of_turn>model")[-1])

If we like, we can do a 'before and after' comparison of this model with and without the adapters.  We would probably want to do this in a workshop to confirm to participants that we have actually done something!

In [None]:
from unsloth import FastLanguageModel
from transformers import TextStreamer

# 1. Optimise the model for inference (Run this once)
FastLanguageModel.for_inference(model)

# Define your prompt
# (Note: this is an Alpaca-style plain-text prompt, not the Gemma chat template used in training)
prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
I lost my car keys and I am very angry

### Response:
"""

# Tokenize the input
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

# Setup streamer for live output
streamer = TextStreamer(tokenizer)

print("=========================================")
print("ORIGINAL MODEL")
print("=========================================")

# 2. Use the context manager to DISABLE LoRA temporarily
# This tells the model to ignore the fine-tuned weights and use the base weights
with model.disable_adapter():
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=128)

print("\n\n=========================================")
print("FINE-TUNED MODEL (LoRA)")
print("=========================================")

# 3. Run normally (LoRA adapters are active by default)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=128)

## 8. Saving the Model
In a real workflow, we would save the LoRA adapters to merge them later or load them for inference.

In [None]:
# Save the LoRA adapters locally
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

print("Adapters saved to 'lora_model' directory.")

# Optional: Push to Hugging Face Hub
# model.push_to_hub("your_hf_username/gemma-2-2b-stoic-lora")

I've pushed a copy of [my model](https://huggingface.co/callaghanmt/gemma-2-2b-stoic-lora) to my Huggingface repo. You can use this for the last part of the workshop (evaluation) if you don't want to use your own model.

## 9. Examining the energy use of this fine-tuning run
A convenient tool for demonstrating this in a workshop is [CodeCarbon](https://codecarbon.io/).   
It wraps around the LoRA training loop and generates a report.
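
Conceptually, the reported figure boils down to energy used multiplied by the carbon intensity of the local electricity grid. The sketch below shows that arithmetic with made-up inputs: the average power draw, run duration and grid intensity are all hypothetical placeholders, not measurements, and CodeCarbon's own estimate will also account for CPU and RAM.

```python
def energy_kwh(avg_power_watts: float, seconds: float) -> float:
    """Energy in kilowatt-hours for a given average power and duration."""
    return avg_power_watts * seconds / 3_600_000


def emissions_kg(energy: float, kg_co2_per_kwh: float) -> float:
    """Emissions in kg CO2eq for a given energy use and grid intensity."""
    return energy * kg_co2_per_kwh


# Hypothetical inputs, for illustration only.
power_w = 70.0        # assumed average GPU draw in watts
duration_s = 4 * 60   # assumed four-minute training run
intensity = 0.4       # assumed grid intensity in kg CO2eq per kWh

used = energy_kwh(power_w, duration_s)
print(f"Energy: {used:.4f} kWh")
print(f"Emissions: {emissions_kg(used, intensity):.4f} kg CO2eq")
```

This is also why the location check below matters: the same training run produces different emissions depending on where the data centre draws its electricity.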

In [None]:
!pip install --quiet codecarbon
# Errors here with some dependency conflicts don't prevent codecarbon from working

In [None]:
# Where is this notebook running?
!curl -s ipinfo.io

In [None]:
from codecarbon import EmissionsTracker

# Initialise the tracker
# output_dir="." saves the emissions.csv to your file browser panel
tracker = EmissionsTracker(project_name="LoRA_FineTuning_T4", output_dir=".")

tracker.start()

try:
    # --- THE LORA TRAINING CODE GOES HERE ---
    lora_loop()
except Exception as e:
    print(f"An error occurred: {e}")

finally:
    # Stop the tracker and save the data, even if the training code raised an error
    emissions = tracker.stop()
    print(f"Tracking stopped. Estimated emissions: {emissions} kg CO2eq")