# LoRA / SFT Example (using Hugging Face dataset)

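This notebook demonstrates supervised fine-tuning (SFT) of TinyLlama 1.1B with LoRA (Low-Rank Adaptation) adapters on the Alpaca instruction dataset, then compares the base model with the fine-tuned model.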

### 0) Recommended Colab Settings

- Runtime → Change runtime type → GPU (T4)

- (Optional) Create an Access Token in Hugging Face first, which may be needed when loading models later.


### 1) Install required libraries

In [1]:
!pip -q install -U transformers datasets accelerate peft trl bitsandbytes


### 2) Import torch and check the environment

In [2]:
import torch
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")


CUDA available: True
GPU: Tesla T4


### 3) Select a base model and download dataset

**Configuration**

-   **Model:** TinyLlama 1.1B (more suitable for free Colab GPUs)
-   **Dataset:** Alpaca instruction dataset

In [3]:
from datasets import load_dataset

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
dataset_id = "tatsu-lab/alpaca"

# 1. Load the dataset
# With split="train", load_dataset returns a single Dataset object instead of a DatasetDict
try:
    ds = load_dataset(dataset_id, split="train")
    print(f"Dataset loaded successfully! Total rows: {len(ds)}")
except Exception as e:
    print(f"Error loading dataset: {e}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Dataset loaded successfully! Total rows: 52002


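### 4) Convert Alpaca records into an instruction-style text format

TinyLlama-1.1B-Chat is a **chat** model, so a simple, consistent prompt pattern helps it learn to follow instructions. The template below uses the Alpaca-style `### Instruction` / `### Input` / `### Response` layout.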

In [4]:
# 2. Define the formatting function
def format_alpaca(example):
    # Use .get() to prevent KeyErrors
    instruction = str(example.get("instruction", "")).strip()
    inp = str(example.get("input", "")).strip()
    output = str(example.get("output", "")).strip()

    if inp:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{inp}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"

    # RETURN a new dictionary instead of modifying the example
    return {"text": prompt + output}

In [5]:
# 3. Process the dataset
# Since we already selected the 'train' split above, we use 'ds' directly
train_ds = ds.map(format_alpaca, remove_columns=ds.column_names)

# 4. Shuffle and Select
# Using min(8000, len(train_ds)) avoids an IndexError if the dataset has fewer than 8000 rows
train_ds = train_ds.shuffle(seed=42).select(range(min(8000, len(train_ds))))

# 5. Verify the output
if len(train_ds) > 0:
    print("Success! Sample of the first processed row:")
    print(train_ds[0]["text"][:400])
else:
    print("The dataset is still empty. Check your internet connection or dataset name.")

Success! Sample of the first processed row:
### Instruction:
What would be the best type of exercise for a person who has arthritis?

### Response:
For someone with arthritis, the best type of exercise would be low-impact activities like yoga, swimming, or walking. These exercises provide the benefits of exercise without exacerbating the symptoms of arthritis.
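
The sample above has no `input` field, so only the two-section template is visible. As a minimal sketch, the snippet below mirrors the template from `format_alpaca` on two hand-written records to show the three-section variant that is produced when `input` is non-empty. The helper name `build_text` and both records are illustrative assumptions, not part of the notebook or the dataset.

```python
def build_text(instruction: str, inp: str, output: str) -> str:
    """Mirror the Alpaca-style template used by format_alpaca above."""
    if inp:
        prompt = (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{inp}\n\n"
            "### Response:\n"
        )
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt + output


# Hypothetical records, written by hand for illustration only.
with_input = build_text("Translate to French.", "Good morning", "Bonjour")
without_input = build_text("Name a primary color.", "", "Red")

print(with_input)
print("-" * 20)
print(without_input)
```

Keeping the training template and the inference prompt identical matters here, because the model only sees this exact layout during fine-tuning.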


### 5) Load the model (4-bit to save memory) and tokenizer

In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# Some models require pad_token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
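
To give a rough sense of why 4-bit loading helps on a small GPU, the sketch below estimates weight memory at 16 bits and at 4 bits per parameter. The parameter count of 1.1e9 is an assumption taken from the "1.1B" in the model name. Real usage will be higher, since quantization constants, layers left unquantized, activations, and optimizer state are not counted.

```python
# Rough weight-memory estimate. N_PARAMS is an assumption based on the
# "1.1B" in the model name, not a value read from the checkpoint.
N_PARAMS = 1.1e9


def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store n_params weights at the given bit width, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


fp16_gib = weight_gib(N_PARAMS, 16)
nf4_gib = weight_gib(N_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gib:.2f} GiB")
print(f"4-bit weights:  ~{nf4_gib:.2f} GiB (weights only)")
```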


### 6) Configure LoRA (train only a small fraction of the parameters)

In [7]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj","k_proj","v_proj","o_proj"]  # common attention projection layers
)
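
To make "only a small fraction of the parameters" concrete, the sketch below counts the weights that LoRA adds for the configuration above. LoRA attaches two small matrices to each targeted linear layer, `A` of shape `r x in_features` and `B` of shape `out_features x r`, so each layer gains `r * (in_features + out_features)` trainable parameters. The model dimensions used here are assumptions about TinyLlama-1.1B and should be checked against the model's `config.json`.

```python
# Assumed TinyLlama-1.1B shapes; verify against the model's config.json.
HIDDEN = 2048       # hidden_size (assumed)
N_LAYERS = 22       # num_hidden_layers (assumed)
N_HEADS = 32        # num_attention_heads (assumed)
N_KV_HEADS = 4      # num_key_value_heads, grouped-query attention (assumed)
R, ALPHA = 16, 32   # r and lora_alpha from lora_config above

head_dim = HIDDEN // N_HEADS
kv_dim = N_KV_HEADS * head_dim

# (in_features, out_features) of each targeted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, kv_dim),
    "v_proj": (HIDDEN, kv_dim),
    "o_proj": (HIDDEN, HIDDEN),
}

# Each adapted module gains r * (in + out) trainable parameters.
per_layer = sum(R * (n_in + n_out) for n_in, n_out in shapes.values())
total = per_layer * N_LAYERS

# The low-rank update B @ A is multiplied by lora_alpha / r.
scaling = ALPHA / R

print(f"LoRA params per layer: {per_layer:,}")
print(f"LoRA params total:     {total:,}")
print(f"Update scaling:        {scaling}")
```

Under these assumptions the adapter holds a few million parameters, well under 1% of a 1.1B-parameter model.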


### 7) Start training (TRL's SFTTrainer)

In [8]:
from trl import SFTTrainer
from transformers import TrainingArguments

# 1. Standard TrainingArguments (kept for reference; the SFTConfig cell further below is what the trainer actually receives)
training_args = TrainingArguments(
    output_dir="tinyllama-alpaca-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=200,
    bf16=True,             # bf16 mixed precision; set after an earlier run hit a NotImplementedError
    fp16=False,
    optim="paged_adamw_8bit",
    report_to="none",
    remove_unused_columns=False # Important: Prevents the trainer from dropping your 'text' column
)

In [12]:
import trl
print(trl.__version__)

0.28.0


In [15]:
from trl import SFTTrainer, SFTConfig

# 1. Keep the config minimal
sft_config = SFTConfig(
    output_dir="tinyllama-alpaca-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    bf16=True,
    optim="paged_adamw_8bit",
    report_to="none",
    # Note: dataset_text_field and max_seq_length are intentionally not set here;
    # they were problematic with the installed TRL version
)

# 2. Build the trainer
# SFTTrainer reads the "text" column of train_ds by default
trainer = SFTTrainer(
    model=model,
    train_dataset=train_ds,
    args=sft_config,
    peft_config=lora_config,
    # Note: dataset_text_field and max_seq_length are not passed here either
)

# 3. Start Training
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Step,Training Loss
10,1.651364
20,1.362257
30,1.329524
40,1.281828
50,1.318067
60,1.240184
70,1.244964
80,1.313555
90,1.254715
100,1.250637


TrainOutput(global_step=500, training_loss=1.2603981037139893, metrics={'train_runtime': 4246.7105, 'train_samples_per_second': 1.884, 'train_steps_per_second': 0.118, 'total_flos': 6606059717984256.0, 'train_loss': 1.2603981037139893})
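
The `global_step=500` and throughput figures in the output above follow from the batch settings. The sketch below reproduces that arithmetic. It assumes a single GPU (`n_devices = 1`), which matches a standard Colab runtime but is not stated in the training log.

```python
import math

n_samples = 8000           # rows kept by select(range(min(8000, ...)))
per_device_batch = 2       # per_device_train_batch_size
grad_accum = 8             # gradient_accumulation_steps
n_devices = 1              # assumption: a single Colab GPU
epochs = 1                 # num_train_epochs

# Samples consumed per optimizer update.
effective_batch = per_device_batch * grad_accum * n_devices
steps = math.ceil(n_samples / effective_batch) * epochs

train_runtime = 4246.7105  # seconds, copied from TrainOutput above

print("effective batch size:", effective_batch)
print("optimizer steps:     ", steps)
print("samples per second:  ", round(n_samples / train_runtime, 3))
print("steps per second:    ", round(steps / train_runtime, 3))
```

Raising `gradient_accumulation_steps` increases the effective batch size without increasing per-step memory, at the cost of fewer optimizer updates per epoch.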

### 8) Save the LoRA adapter (the most common way to deploy)

**A. Save to the Local Folder (Temporary)**

If you just want to save the result of the current run to the Colab disk:

In [None]:
trainer.model.save_pretrained("tinyllama-alpaca-lora-adapter")
tokenizer.save_pretrained("tinyllama-alpaca-lora-adapter")
print("Saved to tinyllama-alpaca-lora-adapter/")

**B. Save to Google Drive (Permanent & Recommended)**

This is the "best practice" for Colab. It ensures that even if your runtime disconnects, your weights are safely stored in your personal Google Drive.

In [1]:
# Step 1: Mount your Drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [17]:
# Step 2: Save directly to a folder in your Drive

import os
save_path = "/content/drive/MyDrive/tinyllama_finedtuned"

# Create the folder if it doesn't exist
os.makedirs(save_path, exist_ok=True)

# Save the adapter
trainer.model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model saved to {save_path}")

Model saved to /content/drive/MyDrive/tinyllama_finedtuned


Once saved, you can come back a week later and load your "smart" version of the model in a fresh notebook without running the trainer again.

In [1]:
# Step 3: Reload the adapter without retraining

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# 1. Load the same BASE model you used for training
# (`dtype` replaces the deprecated `torch_dtype` argument in recent transformers releases)
base_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(base_model_id, dtype=torch.float16, device_map="auto")

# 2. Load your SAVED ADAPTERS from Google Drive
adapter_path = "/content/drive/MyDrive/tinyllama_finedtuned"
model = PeftModel.from_pretrained(model, adapter_path)

# 3. Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(adapter_path)

print("Trained model reloaded successfully!")

Trained model reloaded successfully!


### 9) Last step: test the fine-tuned adapter

In [25]:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# 1. Load the model (ensure tokenizer is also loaded)
# model_id should be the same one you used in training
base = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config, # Keep this if you used 4-bit
    device_map="auto"
)
ft_model = PeftModel.from_pretrained(base, adapter_path)
ft_model.eval()

def generate(instruction, input_text="", max_new_tokens=200):
    # This matches the Alpaca format exactly
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"

    inputs = tokenizer(prompt, return_tensors="pt").to(ft_model.device)

    with torch.no_grad():
        out = ft_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.4,       # Lowered for more practical, factual tips
            top_p=0.9,
            repetition_penalty=1.1, # Slightly increased to prevent looping
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    # Decode the full sequence (prompt + completion), then keep only the text after the response marker
    decoded = tokenizer.decode(out[0], skip_special_tokens=True)
    return decoded.split("### Response:")[1].strip() # Extract only the answer

# Test it
instruction = "Give me 3 practical tips to reduce stockouts in a retail supply chain."
print(f"Results:\n{generate(instruction)}")

Results:
1. Develop a comprehensive inventory management strategy that includes forecasting, order management, and replenishment. 
2. Implement automated systems for tracking inventory levels and alerting employees when stocks are low. 
3. Use technology such as barcode scanning, RFID tags, and mobile apps to improve visibility into inventory levels.
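
The `generate` helper isolates the answer by splitting the decoded text on the `### Response:` marker and taking the second piece. The sketch below uses plain strings (no model) to show one consequence of that choice: if the model repeats the marker, everything after the repeat is dropped. It also shows an alternative that strips the prompt prefix instead. The function names and the decoded string are hypothetical, and the prefix approach assumes the decoded text starts with the exact prompt, which a tokenizer round trip does not always guarantee.

```python
MARKER = "### Response:"


def answer_by_split(decoded: str) -> str:
    """Same approach as generate(): text between the first and second marker."""
    return decoded.split(MARKER)[1].strip()


def answer_by_prefix(decoded: str, prompt: str) -> str:
    """Alternative: drop the prompt prefix and keep the whole completion."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    # Fallback if decoding did not reproduce the prompt text exactly.
    return decoded.strip()


prompt = "### Instruction:\nName one tip.\n\n### Response:\n"
# Hypothetical decoded output in which the model repeats the marker.
decoded = prompt + "Track inventory daily.\n\n### Response:\nCount stock weekly."

print(repr(answer_by_split(decoded)))
print(repr(answer_by_prefix(decoded, prompt)))
```

Which behavior is preferable depends on the use case: cutting at a repeated marker hides runaway generations, while keeping the full completion makes them visible.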


In [None]:
# Optional: merge the LoRA weights into the base model and save the merged model (not required for the steps above)

# merged = ft_model.merge_and_unload()
# merged.save_pretrained("tinyllama-alpaca-merged")
# tokenizer.save_pretrained("tinyllama-alpaca-merged")
# print("Merged model saved to tinyllama-alpaca-merged/")

## Evaluation pipeline

This section compares the base model with the fine-tuned model. It covers:

✅ Base model vs Fine-tuned (LoRA) model

✅ Deterministic outputs

✅ Length statistics

✅ JSON format compliance

✅ CSV export for human evaluation

### 🔎 0) Important: Make Generation Deterministic

For a fair comparison, do not use sampling. Set:

- `do_sample=False`

- `num_beams=1`

- a fixed `max_new_tokens`

With these settings, generation is greedy and repeatable, so the base and fine-tuned (FT) outputs are directly comparable.

### 🧪 1) Prepare Evaluation Prompts

In [2]:
eval_prompts = [
"""### Instruction:
Give me 3 practical tips to reduce stockouts in a retail supply chain.

### Response:
""",

"""### Instruction:
Explain safety stock in simple terms and give one numeric example.

### Response:
""",

"""### Instruction:
Return a JSON with keys: root_cause, evidence, action. Topic: late deliveries in manufacturing.

### Response:
""",

"""### Instruction:
Write a short email (max 120 words) to a supplier asking for an updated ETA.

### Response:
""",

"""### Instruction:
Summarize the following in 3 bullet points:
"Demand variability increased, lead time uncertain, inventory policy outdated."

### Response:
"""
]


### ⚙️ 2) Load Base Model and Fine-Tuned Model

In [15]:
# Fix for bitsandbytes issues on Colab (T4 GPU)
# If needed, run these commands first, before loading the model:

# !pip uninstall -y bitsandbytes
# !pip install -U "bitsandbytes>=0.46.1"   # quoted so the shell does not treat > as a redirect
# !pip install -U transformers accelerate peft trl

In [3]:
import torch
import pandas as pd
import json
import re
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# Path to the LoRA adapter saved after fine-tuning
adapter_path = "/content/drive/MyDrive/tinyllama_finedtuned"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
).eval()

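# Load fine-tuned model (base + LoRA adapter)
# PEFT wraps base_model in place, so the evaluation below uses disable_adapter() to get base behavior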
ft_model = PeftModel.from_pretrained(
    base_model,
    adapter_path
).eval()

### 🔁 3) Deterministic Generation Function

In [4]:
def generate_det(model, prompt, max_new_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,        # IMPORTANT
            num_beams=1,
            repetition_penalty=1.05,
            pad_token_id=tokenizer.eos_token_id
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def extract_response(full_text):
    parts = re.split(r"### Response:\s*", full_text, maxsplit=1)
    return parts[-1].strip() if len(parts) > 1 else full_text.strip()
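
The `do_sample=False` and `num_beams=1` settings in `generate_det` correspond to greedy decoding: at every step the most likely next token is chosen, so repeated runs give the same text. The toy sketch below contrasts that with sampling on a single made-up next-token distribution. The tokens and probabilities are invented for illustration and do not come from the model.

```python
import random

# Toy next-token distribution; tokens and probabilities are made up.
probs = {"stock": 0.5, "inventory": 0.3, "supply": 0.2}


def greedy(dist):
    """do_sample=False with num_beams=1: always take the most likely token."""
    return max(dist, key=dist.get)


def sample(dist, rng):
    """do_sample=True: draw a token in proportion to its probability."""
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Repeating the greedy choice never changes the result.
greedy_runs = {greedy(probs) for _ in range(20)}

# Repeated sampling visits several different tokens.
rng = random.Random(0)
sampled_runs = {sample(probs, rng) for _ in range(200)}

print("greedy picks: ", greedy_runs)
print("sampled picks:", sorted(sampled_runs))
```

The `repetition_penalty` in `generate_det` rescales token scores before the greedy choice, so it changes which token wins but keeps the procedure deterministic.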

### 📊 4) Run Evaluation

In [11]:
def json_parseable(text):
    match = re.search(r"\{.*\}", text, flags=re.S)
    if not match:
        return False
    try:
        json.loads(match.group(0))
        return True
    except json.JSONDecodeError:
        return False


rows = []

for i, prompt in enumerate(eval_prompts):

    # BASE = adapter OFF
    with ft_model.disable_adapter():
        base_full = generate_det(ft_model, prompt, 220)

    # FT = adapter ON
    ft_full = generate_det(ft_model, prompt, 220)

    base_out = extract_response(base_full)
    ft_out   = extract_response(ft_full)

    need_json = "JSON" in prompt

    rows.append({
        "id": i,
        "prompt": prompt,
        "base_output": base_out,
        "ft_output": ft_out,
        "base_length": len(base_out),
        "ft_length": len(ft_out),
        "base_json_ok": json_parseable(base_out) if need_json else None,
        "ft_json_ok": json_parseable(ft_out) if need_json else None
    })

df = pd.DataFrame(rows)
df

Unnamed: 0,id,prompt,base_output,ft_output,base_length,ft_length,base_json_ok,ft_json_ok
0,0,### Instruction:\nGive me 3 practical tips to ...,1. Implement real-time inventory tracking: Ret...,1. Implement a real-time inventory tracking sy...,946,299,,
1,1,### Instruction:\nExplain safety stock in simp...,Safety stock refers to the amount of inventory...,Safety stock is the amount of inventory that a...,848,346,,
2,2,### Instruction:\nReturn a JSON with keys: roo...,"```json\n{\n ""root_cause"": [\n {\n ""c...","{\n ""root_cause"": [\n ""Late delivery of ra...",517,440,False,True
3,3,### Instruction:\nWrite a short email (max 120...,"Dear [Supplier’s Name],\n\nI am writing to req...","Dear [Supplier],\n\nI hope this email finds yo...",953,462,,
4,4,### Instruction:\nSummarize the following in 3...,1. Demand variability: Increased demand for ou...,"1. Demand variability increased, leading to in...",661,339,,
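
The `json_parseable` check takes the span from the first `{` to the last `}` and tries to parse it. The sketch below restates that check and runs it on three hand-written strings to show what it accepts and rejects. The inputs are hypothetical and are not taken from the evaluation run. Truncation at the `max_new_tokens` limit is one possible reason for a `False` result, though the table alone does not confirm that this is what happened to the base output.

```python
import json
import re

FENCE = "`" * 3  # markdown code fence, built here to keep this example readable


def json_parseable(text: str) -> bool:
    """Parse the span from the first '{' to the last '}' as JSON."""
    match = re.search(r"\{.*\}", text, flags=re.S)
    if not match:
        return False
    try:
        json.loads(match.group(0))
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical outputs, written by hand for illustration.
fenced = FENCE + 'json\n{"root_cause": "late parts", "action": "expedite"}\n' + FENCE
truncated = '{"root_cause": [{"cause": "late parts"}, {"cause": "staff'
trailing = '{"a": 1} and then {"b": 2}'

for name, text in [("fenced", fenced), ("truncated", truncated), ("trailing", trailing)]:
    print(f"{name:10s} -> {json_parseable(text)}")
```

A surrounding code fence does not hurt, because the regular expression skips it. An object that is cut off, or two objects separated by prose, both fail because the captured span is not one valid JSON document.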


### 📈 5) Summary Metrics

In [12]:
def summarize(df):
    print("Total prompts:", len(df))
    print("Average length (base):", df["base_length"].mean())
    print("Average length (ft)  :", df["ft_length"].mean())

    json_rows = df[df["base_json_ok"].notna()]
    if len(json_rows) > 0:
        print("JSON pass rate (base):", json_rows["base_json_ok"].mean())
        print("JSON pass rate (ft)  :", json_rows["ft_json_ok"].mean())

summarize(df)


Total prompts: 5
Average length (base): 785.0
Average length (ft)  : 377.2
JSON pass rate (base): 0.0
JSON pass rate (ft)  : 1.0
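
The averages above are character counts (`len()` of each output string), not token counts. As a small sketch, the snippet below recomputes them from the per-prompt lengths in the results table and adds a per-prompt ratio, which shows how much shorter each fine-tuned answer is than its base counterpart.

```python
from statistics import mean

# Character lengths copied from the results table above.
base_length = [946, 848, 517, 953, 661]
ft_length = [299, 346, 440, 462, 339]

# Ratio below 1.0 means the fine-tuned output is shorter than the base output.
ratios = [ft / base for ft, base in zip(ft_length, base_length)]

print("mean base length:", mean(base_length))
print("mean ft length:  ", mean(ft_length))
for i, ratio in enumerate(ratios):
    print(f"prompt {i}: ft output is {ratio:.0%} of the base length")
```

Shorter is not automatically better, so these numbers are best read alongside the exported outputs themselves.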


### 📁 6) Export for Human Evaluation

In [13]:
df.to_csv("evaluation_base_vs_ft.csv", index=False)
print("Saved evaluation_base_vs_ft.csv")

Saved evaluation_base_vs_ft.csv


### Additional test: base model vs. base model + LoRA

In [14]:
def run_one(prompt):
    # Base behavior (adapter OFF)
    with ft_model.disable_adapter():
        base_full = generate_det(ft_model, prompt, 220)

    # FT behavior (adapter ON)
    ft_full = generate_det(ft_model, prompt, 220)

    base_out = extract_response(base_full)
    ft_out   = extract_response(ft_full)

    print("=== BASE (adapter OFF) ===")
    print(base_out[:400])
    print("\n=== FT (adapter ON) ===")
    print(ft_out[:400])
    print("\nSame text?", base_out == ft_out)
    print("Len base / ft:", len(base_out), len(ft_out))

prompt = eval_prompts[0]
run_one(prompt)


=== BASE (adapter OFF) ===
1. Implement real-time inventory tracking: Retailers can use real-time inventory tracking tools to monitor the availability of products in real-time. This helps them to identify and address any stockouts quickly.

2. Use automation to optimize inventory levels: Retailers can use automation tools to optimize inventory levels by predicting demand and adjusting inventory levels accordingly. This help

=== FT (adapter ON) ===
1. Implement a real-time inventory tracking system to monitor stock levels and alert staff when stock levels are low.
2. Use automated order fulfillment systems to ensure that orders are processed quickly and efficiently.
3. Develop a system for managing returns and exchanges to minimize stockouts.

Same text? False
Len base / ft: 946 299
