<a href="https://colab.research.google.com/github/callaghanmt-training/ou-fine-tuning-2025-11/blob/main/fine_tuning_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 3: Domain Specific Evaluation

This notebook implements a **Domain-Specific Benchmark**. Since general benchmarks (like maths or coding tests) won't measure how "Stoic" a model is, we will build a custom "Stoic Score" metric.

This script compares the Base Model (Gemma 2) against the Fine-Tuned Model (Stoic Gemma) on a set of life scenarios, scoring them based on key philosophical terminology.

**Note:** This is a 'taster' approach for the purposes of this workshop. In production, we would need a number of more rigorous evaluations. There is much more in the literature, including this [recent paper](https://arxiv.org/pdf/2506.12958).

## Before you start:
1. Make sure you have selected a T4 runtime and clicked the **[Connect]** button
2. Add your Hugging Face Access Token to the notebook 'secrets'

**The "Original" Model**: We run this inside a `with model.disable_adapter():` block. This allows us to see what Gemma 2 would have said without our training. It usually gives generic, empathetic advice (e.g., "I'm sorry to hear that, try updating your resume").

**The "Fine-Tuned" Model**: This runs with the LoRA adapters active. It should shift the tone to use words like "control," "virtue," and "mind."

**The Metric**: We create a `calculate_stoic_score` function. In professional settings, this is often done using "LLM-as-a-Judge" (asking a stronger model, such as GPT-5, to grade the response), but for this workshop, a keyword heuristic is fast, free, and effectively demonstrates the concept of alignment evaluation.
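
To make the "LLM-as-a-Judge" idea concrete, here is a minimal sketch of what such a grader might look like. It is not used anywhere in this notebook. The names `JUDGE_TEMPLATE`, `judge_stoicism`, `ask_judge` and `fake_judge` are hypothetical, and the judge is passed in as a plain callable so that no particular provider API is assumed.

```python
import re

# Hypothetical grading prompt; the wording and the 1-5 scale are assumptions.
JUDGE_TEMPLATE = """You are grading a reply for how well it reflects Stoic philosophy.
Rate it from 1 (not Stoic at all) to 5 (strongly Stoic).
Answer with the number only.

Scenario: {scenario}
Reply: {reply}
Rating:"""

def judge_stoicism(scenario, reply, ask_judge):
    """Ask a judge model for a 1-5 rating.

    `ask_judge` is any callable that takes a prompt string and returns the
    judge's text answer (a hypothetical wrapper around whichever model you use).
    Returns an int from 1 to 5, or None if no rating can be parsed.
    """
    raw = ask_judge(JUDGE_TEMPLATE.format(scenario=scenario, reply=reply))
    match = re.search(r"[1-5]", raw)
    return int(match.group()) if match else None

# Stand-in judge so the sketch runs offline; a real judge would call a model.
def fake_judge(prompt):
    return "4"

rating = judge_stoicism(
    "I lost my job today.",
    "Focus on what is within your control.",
    fake_judge,
)
print(rating)
```

Passing the judge in as a function keeps the parsing logic testable without network access; swapping in a real model only means replacing `fake_judge`.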

In [None]:
# ==========================================
# PART 1: SETUP & INSTALLATION
# ==========================================
# Install Unsloth for fast inference and memory management.
# Note: If you are running this in a session where Unsloth is already installed,
# you can comment out the pip install line.
try:
    import unsloth
except ImportError:
    !pip install --quiet "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
    !pip install --quiet --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes


In [None]:
from unsloth import FastLanguageModel
from transformers import TextStreamer
import pandas as pd
import torch
from tqdm import tqdm

In [None]:
# ==========================================
# PART 2: CONFIGURATION
# ==========================================

# ------------------------------------------------------------------------
# CHANGE THIS LINE TO USE YOUR OWN MODEL
# If you haven't uploaded one, use the workshop default below:
# ------------------------------------------------------------------------
MODEL_NAME = "callaghanmt/gemma-2-2b-stoic-lora"

max_seq_length = 2048
dtype = None # None for auto detection
load_in_4bit = True # Use 4bit quantisation to fit in T4 GPU

print(f"⏳ Loading model: {MODEL_NAME}...")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL_NAME,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Optimise for inference
FastLanguageModel.for_inference(model)
print("✅ Model loaded and optimised.")

In [None]:
# ==========================================
# PART 3: DEFINING THE EVALUATION DATASET
# ==========================================
# We need a set of "prompts" that trigger philosophical advice.
# These are the "Test Set".

eval_prompts = [
    "I lost my job today and I feel like a failure.",
    "Someone insulted me on social media and I am furious.",
    "I am worried about the future and things I cannot predict.",
    "My car broke down on the way to an important meeting.",
    "I want to be rich and famous, but I am not succeeding."
]

In [None]:
# ==========================================
# PART 4: DEFINING THE METRIC (The "Stoic Score")
# ==========================================
# In a formal evaluation, we need a metric.
# Since we don't have a 'Ground Truth' text, we will use a Heuristic Metric.
# We check for the presence of specific Stoic concepts.

STOIC_KEYWORDS = [
    "control", "virtue", "reason", "nature", "accept",
    "indifferent", "mind", "reaction", "choice", "logos",
    "opinion", "external", "power", "character"
]

def calculate_stoic_score(text):
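    """
    Calculates a simple heuristic score from the number of distinct
    Stoic keywords found in the text (case-insensitive substring match,
    so "mind" also matches "remind").
    Returns (score, hits): a 0-100 'Stoic-ness' score and the raw hit count.
    """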
    text = text.lower()
    hits = 0
    for word in STOIC_KEYWORDS:
        if word in text:
            hits += 1

    # Normalise: each distinct keyword is worth 25 points, so 4+ keywords cap the score at 100.
    score = min((hits / 4) * 100, 100)
    return score, hits
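
Because the check above is a plain substring test, unrelated words can trigger hits. The sketch below illustrates this and shows one possible stricter variant that anchors each keyword at a word boundary. The helper `count_keyword_hits` and the sample sentence are hypothetical and not part of the workshop code; the keyword list is repeated only so the block runs on its own.

```python
import re

# Copy of the notebook's STOIC_KEYWORDS so this sketch is self-contained.
KEYWORDS = [
    "control", "virtue", "reason", "nature", "accept",
    "indifferent", "mind", "reaction", "choice", "logos",
    "opinion", "external", "power", "character",
]

def count_keyword_hits(text, whole_words=False):
    """Count distinct keywords present in `text` (hypothetical helper)."""
    text = text.lower()
    hits = 0
    for word in KEYWORDS:
        if whole_words:
            # \b requires the keyword to start at a word boundary;
            # \w* still allows endings such as "accepting" or "virtues".
            found = re.search(rf"\b{re.escape(word)}\w*", text) is not None
        else:
            # Same test as calculate_stoic_score: plain substring match.
            found = word in text
        hits += int(found)
    return hits

# A sentence with no Stoic content that still contains keyword substrings:
# "remind" -> "mind", "signature" -> "nature", "empower" -> "power".
sample = "Let me remind you that this signature offer will empower you."

loose_hits = count_keyword_hits(sample)
strict_hits = count_keyword_hits(sample, whole_words=True)
print(loose_hits, strict_hits)
```

With the substring test, this sentence would already earn three of the four hits needed for a maximum score, which is worth keeping in mind when reading the results.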


In [None]:
# ==========================================
# PART 5: RUNNING THE EVALUATION LOOP
# ==========================================
# We will generate responses twice for each prompt:
# 1. Using the BASE model (Adapters Disabled)
# 2. Using the FINE-TUNED model (Adapters Enabled)

results = []

print("\n🚀 Starting Evaluation Loop...\n")

# Alpaca-style instruction prompt template
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
"""

for question in tqdm(eval_prompts, desc="Evaluating"):

    inputs = tokenizer(
        [alpaca_prompt.format(question)],
        return_tensors = "pt"
    ).to("cuda")

    # --- A. Run Base Model (Original Gemma) ---
    # We disable the LoRA adapter to see how the generic model responds
    with model.disable_adapter():
        out_base = model.generate(**inputs, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)
        decoded_base = tokenizer.batch_decode(out_base)[0].split("### Response:\n")[-1].replace("<eos>", "").strip()

    # --- B. Run Fine-Tuned Model (Stoic Gemma) ---
    # Adapters are active by default
    out_ft = model.generate(**inputs, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)
    decoded_ft = tokenizer.batch_decode(out_ft)[0].split("### Response:\n")[-1].replace("<eos>", "").strip()

    # --- C. Score Both ---
    score_base, hits_base = calculate_stoic_score(decoded_base)
    score_ft, hits_ft = calculate_stoic_score(decoded_ft)

    results.append({
        "Scenario": question,
        "Base_Response_Snippet": decoded_base[:100] + "...",
        "FT_Response_Snippet": decoded_ft[:100] + "...",
        "Base_Score": score_base,
        "FT_Score": score_ft,
        "Improvement": score_ft - score_base
    })


print("\n \n🚀 Finished Evaluation Loop...\n")
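
The loop above recovers each answer by splitting the decoded text on the `### Response:` marker. The sketch below walks through that string handling on a simulated decoder output, without running a model. The `<bos>` prefix and the reply text are made-up stand-ins for what `tokenizer.batch_decode` might return.

```python
# Same template as in the evaluation loop, repeated so the sketch is self-contained.
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
"""

question = "My car broke down on the way to an important meeting."
prompt = alpaca_prompt.format(question)

# Simulated decoder output (an assumption): the prompt is echoed back,
# followed by the generated text and the end-of-sequence token.
decoded = "<bos>" + prompt + "Focus on what you can control.<eos>"

# Keep only the text after the last response marker, then drop the EOS token.
answer = decoded.split("### Response:\n")[-1].replace("<eos>", "").strip()
print(answer)
```

Taking the last element of the split means the prompt and the instruction are discarded even though the model output echoes them.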


In [None]:
# ==========================================
# PART 6: RESULTS & ANALYSIS
# ==========================================

df = pd.DataFrame(results)

print("\n\n=========================================")
print(f"📊 EVALUATION REPORT FOR: {MODEL_NAME}")
print("=========================================")
print(f"Average Base Model Score:       {df['Base_Score'].mean():.2f}")
print(f"Average Fine-Tuned Model Score: {df['FT_Score'].mean():.2f}")
print("=========================================\n")

# Display the detailed dataframe
pd.set_option('display.max_colwidth', 50)
display(df[["Scenario", "Base_Score", "FT_Score", "Improvement"]])

print("\n\n🔍 DETAILED COMPARISON (First Example):")
print(f"Q: {df.iloc[0]['Scenario']}")
print(f"\n--- BASE MODEL ({df.iloc[0]['Base_Score']}) ---")
print(results[0]['Base_Response_Snippet'])
print(f"\n--- STOIC MODEL ({df.iloc[0]['FT_Score']}) ---")
print(results[0]['FT_Response_Snippet'])
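
An average can hide scenarios where the fine-tuned model scored lower than the base model. As a complement to the report above, a per-scenario win/tie/loss tally could be computed along these lines. The helper `tally_outcomes` is hypothetical, and the rows in `toy_results` are placeholder values chosen for illustration, not real model outputs.

```python
# Placeholder rows shaped like the entries of the notebook's `results` list.
# The scores are made-up illustrative values, NOT real evaluation results.
toy_results = [
    {"Scenario": "A", "Base_Score": 25.0, "FT_Score": 75.0},
    {"Scenario": "B", "Base_Score": 50.0, "FT_Score": 50.0},
    {"Scenario": "C", "Base_Score": 25.0, "FT_Score": 0.0},
]

def tally_outcomes(rows):
    """Count scenarios where the fine-tuned score is higher, equal, or lower."""
    tally = {"wins": 0, "ties": 0, "losses": 0}
    for row in rows:
        diff = row["FT_Score"] - row["Base_Score"]
        if diff > 0:
            tally["wins"] += 1
        elif diff < 0:
            tally["losses"] += 1
        else:
            tally["ties"] += 1
    return tally

summary = tally_outcomes(toy_results)
print(summary)
```

In the notebook, the same function could be called on `results` directly, since each entry carries `Base_Score` and `FT_Score`.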