# 🪄 Week 09-10 · Notebook 09 · LoRA & QLoRA for Cost-Efficient Fine-tuning

Apply low-rank adapters and 4-bit quantization to tailor models for remote plants running on modest GPUs.

## 🎯 Learning Objectives
- **Understand LoRA and QLoRA:** Grasp the mathematics behind LoRA and how QLoRA extends it with 4-bit quantization using the `bitsandbytes` library.
- **Configure and Train:** Implement a QLoRA fine-tuning workflow, including setting up the `BitsAndBytesConfig` and `LoraConfig`.
- **Evaluate Trade-offs:** Measure and compare the latency, memory footprint, and output quality of a model fine-tuned with QLoRA versus standard full-precision training.
- **Implement Safety Gates:** Develop a "safety gate" function to programmatically verify that the quantized model still produces critical, non-negotiable information (like SOP steps) correctly.
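
To make the first objective concrete, here is a minimal pure-Python sketch of the LoRA update: the frozen weight `W` is left untouched, and a low-rank product scaled by `alpha / r` is added to its output. The tiny dimensions and the helper names `matmul` and `lora_forward` are illustrative assumptions, not how `peft` implements it.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha, r):
    """Compute x @ W + (alpha / r) * (x @ A) @ B for a batch of row vectors x."""
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    scale = alpha / r
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

random.seed(29)
d_in, d_out, r, alpha = 6, 4, 2, 4

W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]  # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]   # trainable down-projection
B = [[0.0] * d_out for _ in range(r)]                                   # trainable up-projection, zero init
x = [[random.gauss(0, 1) for _ in range(d_in)]]

# With B initialised to zero the adapter contributes nothing,
# so fine-tuning starts exactly from the base model's behaviour.
assert lora_forward(x, W, A, B, alpha, r) == matmul(x, W)
```

Only `A` and `B` receive gradients, which is why the adapter checkpoint stays small relative to the base model.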

## 🧩 Scenario
A supplier wants an on-prem assistant running on a single NVIDIA T4. LoRA + QLoRA provides maintainable adapters without full fine-tuning cost.

In [None]:
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from transformers import BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import pandas as pd
import torch
import numpy as np

torch.manual_seed(29)

## 📄 Synthetic Shift Reports
Short instructions and responses representing maintenance troubleshooting.

In [None]:
shift_reports_data = [
    {"instruction": "Summarize the issue with the CNC machine.", "response": "The CNC machine (ID: CNC-04) is showing a 'spindle drive fault' error. It requires immediate inspection by a certified technician."},
    {"instruction": "What is the standard procedure for a hydraulic leak?", "response": "1. Isolate the machine using the main power cutoff. 2. Place absorbent pads around the leak. 3. Notify the shift supervisor immediately. 4. Do not operate until cleared by maintenance."},
    {"instruction": "Generate a shift handoff note for the welding robot.", "response": "Welding robot WR-07 completed its cycle with no errors. Consumables (wire, gas) are at 75%. No scheduled maintenance is due."},
    {"instruction": "What action is needed for a high-temperature alert on Furnace-02?", "response": "A high-temperature alert requires reducing the setpoint by 10% and monitoring the cooling system. If the temperature does not decrease within 15 minutes, initiate a controlled shutdown."},
    {"instruction": "Translate the following safety warning to Hindi: 'Hard hat required in this area'.", "response": "इस क्षेत्र में हार्ड हैट पहनना आवश्यक है।"}
]
shift_reports = Dataset.from_list(shift_reports_data)
shift_reports

## 🧾 Tokenizer & Preprocess
We simulate instruction tuning with prompt → response pairs.

In [None]:
# For this demo, we'll use a smaller, more accessible model.
# In a real-world scenario, you might use a larger model like Llama-2-7b.
# NOTE: Using larger models requires significant GPU memory and access agreements.
base_model_name = 'distilgpt2' # A smaller model for demonstration purposes
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Set a padding token if the model doesn't have one.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

In [None]:
def tokenize_function(batch):
    """
    Tokenizes the instruction and response, creating a single sequence for Causal LM training.
    The format will be: `### Instruction: [instruction] ### Response: [response]`
    """
    # Create the full prompt text
    prompts = [f"### Instruction: {instruction}\n### Response: {response}" for instruction, response in zip(batch['instruction'], batch['response'])]
    
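    # Tokenize the prompts, padding every sequence to the same length.
    tokenized_outputs = tokenizer(
        prompts,
        padding='max_length',
        truncation=True,
        max_length=256,
    )
    
    # The model shifts labels internally, so labels start as a copy of input_ids.
    # Padding positions are set to -100 so the loss ignores them; the pad token is
    # the EOS token here, and unmasked padding would otherwise dominate the loss.
    tokenized_outputs["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(tokenized_outputs["input_ids"], tokenized_outputs["attention_mask"])
    ]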
    
    return tokenized_outputs

tokenized_shifts = shift_reports.map(tokenize_function, batched=True)
print(f"Columns in tokenized dataset: {tokenized_shifts.column_names}")
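# skip_special_tokens hides the run of padding (EOS) tokens at the end of the sequence.
print(f"\nExample of tokenized input:\n{tokenizer.decode(tokenized_shifts[0]['input_ids'], skip_special_tokens=True)}")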
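
The prompt template used in `tokenize_function` has to be reproduced exactly at inference time, or the fine-tuned model sees a format it was never trained on. One way to keep the two in sync is a shared helper. The sketch below is an assumption about how that could look; `format_prompt` and `extract_response` are hypothetical names, not part of the notebook.

```python
INSTRUCTION_TAG = "### Instruction:"
RESPONSE_TAG = "### Response:"

def format_prompt(instruction, response=None):
    """Build the training text (with response) or the inference prefix (without)."""
    prefix = f"{INSTRUCTION_TAG} {instruction}\n{RESPONSE_TAG}"
    return prefix if response is None else f"{prefix} {response}"

def extract_response(generated_text):
    """Return the text after the response tag, or None if the tag is missing."""
    _, tag, tail = generated_text.partition(RESPONSE_TAG)
    return tail.strip() if tag else None

train_text = format_prompt("Summarize the issue with the CNC machine.",
                           "Spindle drive fault on CNC-04.")
infer_text = format_prompt("Summarize the issue with the CNC machine.")

# The inference prefix is a strict prefix of the training text,
# so generation continues exactly where the training examples put the response.
assert train_text.startswith(infer_text)
print(extract_response(train_text))
```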

## ⚙️ LoRA Configuration
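Target the query/key/value projections in the attention layers. In GPT-2-style models such as `distilgpt2`, all three share one fused projection, `c_attn`, so a single target module adapts them together.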

In [None]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["c_attn"],  # Target the attention layers in GPT-2
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# --- Standard LoRA (Full Precision) ---
# This model is not quantized. It's used as a baseline.
# In a real scenario, you would train this model as well to compare against QLoRA.
print("--- Setting up Full-Precision LoRA Model ---")
try:
    full_precision_model = AutoModelForCausalLM.from_pretrained(base_model_name)
    lora_model = get_peft_model(full_precision_model, lora_config)
    lora_model.print_trainable_parameters()
except Exception as e:
    print(f"Could not load full-precision model. Error: {e}")
    lora_model = None
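
As a sanity check on what `print_trainable_parameters()` reports, the adapter size can be worked out by hand. The sketch below assumes the usual `distilgpt2` shapes (6 transformer blocks, hidden size 768, and a fused `c_attn` mapping 768 to 3 × 768); `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, r, n_layers):
    """Trainable parameters when one d_in x d_out projection per layer gets a rank-r adapter."""
    # Each adapter holds a d_in x r matrix and an r x d_out matrix.
    return n_layers * r * (d_in + d_out)

# Assumed distilgpt2 shapes; check model.config if you swap the base model.
hidden, n_layers, r = 768, 6, 16

trainable = lora_param_count(hidden, 3 * hidden, r, n_layers)
frozen_c_attn = n_layers * hidden * 3 * hidden  # c_attn weights only, biases ignored

print(f"LoRA trainable parameters: {trainable:,}")
print(f"Relative to the c_attn weights they adapt: {trainable / frozen_c_attn:.2%}")
```

If the reported count differs, the shape assumptions above (or the `target_modules` list) are the first things to check.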

## 🧮 QLoRA Setup
Load base model in 4-bit using `bitsandbytes` to reduce memory footprint.

In [None]:
# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # Use the "Normal Float 4" quantization type
    bnb_4bit_compute_dtype=torch.float16, # Dequantize to float16 for matmuls (the T4 has no native bfloat16)
    bnb_4bit_use_double_quant=True, # Also quantize the per-block scaling constants to save more memory
)

# --- QLoRA (Quantized LoRA) ---
# This loads the base model in 4-bit, significantly reducing the memory footprint.
print("\n--- Setting up QLoRA Model (4-bit) ---")
# Note: This requires a compatible GPU and the bitsandbytes library.
try:
    qlora_base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=bnb_config,
        device_map="auto" # Automatically map layers to available devices
    )
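    # Recommended before training a k-bit model: freezes the base weights and
    # upcasts the remaining half-precision parameters for numerical stability.
    from peft import prepare_model_for_kbit_training
    qlora_base_model = prepare_model_for_kbit_training(qlora_base_model)
    qlora_model = get_peft_model(qlora_base_model, lora_config)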
    qlora_model.print_trainable_parameters()
except Exception as e:
    print(f"Could not load QLoRA model. This usually means you don't have a compatible GPU or bitsandbytes is not installed correctly. Error: {e}")
    qlora_model = None
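
To see what NF4 and double quantization buy in storage terms, the per-weight cost can be estimated from the block sizes. The sketch below assumes the values described in the QLoRA paper (64 weights per block, 32-bit scaling constants, and 8-bit second-level constants in blocks of 256); the defaults in your installed `bitsandbytes` version may differ, and `bits_per_param` is a hypothetical helper.

```python
def bits_per_param(weight_bits=4, block_size=64, const_bits=32,
                   double_quant=False, dq_const_bits=8, dq_block_size=256):
    """Approximate storage per weight: the 4-bit code plus its share of the scaling constants."""
    if not double_quant:
        overhead = const_bits / block_size
    else:
        # First-level constants are stored in dq_const_bits each, plus one
        # const_bits constant for every dq_block_size first-level constants.
        overhead = dq_const_bits / block_size + const_bits / (block_size * dq_block_size)
    return weight_bits + overhead

def weights_gb(n_params, bits):
    """Convert a parameter count and bits-per-parameter into gigabytes (1e9 bytes)."""
    return n_params * bits / 8 / 1e9

n_params = 7e9  # illustrative 7B model, weights only
for label, bits in [
    ("fp16", 16.0),
    ("nf4", bits_per_param()),
    ("nf4 + double quant", bits_per_param(double_quant=True)),
]:
    print(f"{label:<20} {bits:6.3f} bits/param  ~{weights_gb(n_params, bits):.2f} GB")
```

These figures cover weights only. Activations, adapter gradients, and optimizer state come on top, so measured GPU memory will be higher.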

## 🧪 Training Loop (QLoRA)
Adjust epochs, dataset size, and evaluation hooks in production.

In [None]:
training_args = TrainingArguments(
    output_dir="./qlora-shift-reports",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=1,
    fp16=torch.cuda.is_available(), # Mixed precision needs a CUDA GPU; fall back to fp32 on CPU
    report_to="none",
)

# Only proceed if the QLoRA model was loaded successfully
if qlora_model is not None:
    qlora_trainer = Trainer(
        model=qlora_model,
        args=training_args,
        train_dataset=tokenized_shifts,
        tokenizer=tokenizer,
    )

    print("\n--- QLoRA trainer ready ---")
    # qlora_trainer.train() # Uncomment when running with a compatible GPU
    print("--- Training call is commented out for this demo, so the adapter is still untrained ---")
else:
    print("\n--- Skipping QLoRA training as the model could not be loaded. ---")
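
Before adjusting the training arguments, it helps to work out how many examples feed one optimizer step. The sketch below uses the batch settings from `training_args` and the five-row demo dataset; `batch_plan` is a hypothetical helper and does not reproduce `Trainer`'s internal step accounting.

```python
import math

def batch_plan(n_examples, per_device_batch, grad_accum_steps, n_devices=1):
    """Summarise how many examples feed one optimizer step."""
    effective = per_device_batch * grad_accum_steps * n_devices
    micro_batches = math.ceil(n_examples / (per_device_batch * n_devices))
    return {
        "effective_batch_size": effective,
        "micro_batches_per_epoch": micro_batches,
        "dataset_smaller_than_effective_batch": n_examples < effective,
    }

plan = batch_plan(n_examples=5, per_device_batch=4, grad_accum_steps=4)
print(plan)
```

With only five examples the whole dataset is smaller than one effective batch of 16, so for a real run you would either grow the dataset or lower `gradient_accumulation_steps`.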

## 📉 Safety Gate Checks
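Verify that quantization preserved critical content by checking that the model regenerates mandatory SOP steps. Because training is skipped in this demo, expect the gate to fail here; the check becomes meaningful once the adapter has been fine-tuned.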

In [None]:
def run_safety_gate(model, tokenizer, prompt, expected_keywords):
    """
    Checks if the model's output for a given prompt contains all expected keywords.
    This is a simple but effective way to ensure critical information is not lost.
    """
    if model is None:
        return "Model not available.", ["N/A"]
        
    # Format the prompt for inference
    full_prompt = f"### Instruction: {prompt}\n### Response:"
    inputs = tokenizer(full_prompt, return_tensors='pt').to(model.device)
    
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
        
    # Decode the generated text and isolate the response part
    full_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    
    try:
        response_text = full_text.split("### Response:")[1].strip()
    except IndexError:
        response_text = "Error: Could not parse response."

    # Check for missing keywords
    missing = [kw for kw in expected_keywords if kw.lower() not in response_text.lower()]
    
    return response_text, missing

# --- Run the Gate ---
test_prompt = "What is the standard procedure for a hydraulic leak?"
expected_keywords = ['isolate', 'absorbent pads', 'supervisor']

print("--- Running Safety Gate on QLoRA Model ---")
qlora_response, qlora_missing = run_safety_gate(qlora_model, tokenizer, test_prompt, expected_keywords)
print(f"\nQLoRA Generated Response:\n{qlora_response}")
if not qlora_missing:
    print("\n✅ QLoRA Safety Gate Passed: All keywords present.")
else:
    print(f"\n❌ QLoRA Safety Gate Failed: Missing keywords: {qlora_missing}")

print("\n" + "="*30 + "\n")

print("--- Running Safety Gate on Full-Precision LoRA Model ---")
lora_response, lora_missing = run_safety_gate(lora_model, tokenizer, test_prompt, expected_keywords)
print(f"\nFull Precision LoRA Generated Response:\n{lora_response}")
if not lora_missing:
    print("\n✅ Full Precision LoRA Safety Gate Passed: All keywords present.")
else:
    print(f"\n❌ Full Precision LoRA Safety Gate Failed: Missing keywords: {lora_missing}")
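
The keyword check above can be extended without touching the model call. The sketch below is one possible shape for it, assuming each required concept may appear in either language and that numeric thresholds are verified separately; `check_keywords`, `check_number`, and the sample text are hypothetical.

```python
import re

def check_keywords(response, keyword_groups):
    """Return the groups for which none of the alternatives (e.g. English or Hindi) appears."""
    text = response.lower()
    return [group for group in keyword_groups
            if not any(alt.lower() in text for alt in group)]

def check_number(response, pattern, expected, tolerance):
    """Compare the first number captured by `pattern` with `expected`, within `tolerance`."""
    match = re.search(pattern, response, flags=re.IGNORECASE)
    if match is None:
        return False
    return abs(float(match.group(1)) - expected) <= tolerance

# Hand-written sample standing in for a model response.
sample = ("Reduce the setpoint by 10% and monitor cooling. "
          "Shut down if there is no drop within 15 minutes.")

missing = check_keywords(sample, [("setpoint",), ("shut down", "shutdown", "बंद")])
ok_minutes = check_number(sample, r"within\s+(\d+(?:\.\d+)?)\s+minutes",
                          expected=15, tolerance=0)
print(missing, ok_minutes)
```

Substring matching is deliberately simple. It will not catch a response that mentions a keyword while negating the step, so treat it as a first filter rather than a full review.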

## ⏱️ Latency & Memory Snapshot
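Collect quick comparisons for the stakeholder update. The figures in this cell are illustrative estimates for a 7B model, not measurements from this notebook.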

In [None]:
def compare_model_metrics():
    """
    Returns an illustrative comparison of training memory and trainable parameters
    for three fine-tuning methods. No latency is measured here.
    """
    # These are illustrative values. Actuals depend heavily on hardware and model size.
    base_model_size_gb = 14.0 # For a 7B model in float16
    
    data = [
        {
            "Method": "Full Fine-Tuning",
            "Quantization": "None (FP16)",
            "GPU Memory (Training)": f"~{base_model_size_gb * 2:.1f} GB", # Weights + gradients only; optimizer state and activations add more
            "Trainable Params": "7B",
            "Notes": "Requires significant hardware."
        },
        {
            "Method": "LoRA",
            "Quantization": "None (FP16)",
            "GPU Memory (Training)": f"~{base_model_size_gb:.1f} GB", # Frozen weights dominate; adapter gradients are small
            "Trainable Params": "~4M (0.06%)",
            "Notes": "Faster training, much smaller checkpoint."
        },
        {
            "Method": "QLoRA",
            "Quantization": "4-bit (NF4)",
            "GPU Memory (Training)": f"~{base_model_size_gb / 4:.1f} GB", # 4-bit weights are about a quarter of FP16; activations are extra
            "Trainable Params": "~4M (0.06%)",
            "Notes": "Enables training on consumer GPUs."
        }
    ]
    
    return pd.DataFrame(data)

print("--- Fine-Tuning Method Comparison (Illustrative) ---")
comparison_df = compare_model_metrics()
display(comparison_df)
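
For measured rather than illustrative latency, a small timing harness is usually enough. The sketch below uses only the standard library and times an arbitrary callable; `measure_latency` is a hypothetical helper, and the workload shown is a stand-in for a real generation call.

```python
import statistics
import time

def measure_latency(fn, warmup=1, repeats=5):
    """Call fn repeatedly and return median and max wall-clock latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded (caches, lazy initialisation)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
        "runs": repeats,
    }

# Stand-in workload; in the notebook this could wrap a call to run_safety_gate.
stats = measure_latency(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

When timing GPU work, call `torch.cuda.synchronize()` inside `fn` before it returns, since CUDA kernels run asynchronously. For peak memory, `torch.cuda.max_memory_allocated()` gives a comparable number across the LoRA and QLoRA runs.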

### 🛡️ Governance Checklist
- Validate licensing (LLaMA/EULA) with legal before deployment.
- Document quantization settings in model registry.
- Capture safety gate results and attach to release ticket.
- Schedule drift review every 30 days.
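
For the registry and release-ticket items above, one lightweight option is to serialize the settings from this notebook into a single JSON record. The field names below are assumptions for illustration, not a schema required by any particular registry, and `build_registry_record` is a hypothetical helper.

```python
import json

def build_registry_record(base_model, quantization, lora, gate_results):
    """Bundle the settings needed to reproduce a quantized adapter into one JSON document."""
    record = {
        "base_model": base_model,
        "quantization": quantization,
        "lora": lora,
        "safety_gate": gate_results,
    }
    return json.dumps(record, indent=2, sort_keys=True, ensure_ascii=False)

record_json = build_registry_record(
    base_model="distilgpt2",
    quantization={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": "float16",
        "bnb_4bit_use_double_quant": True,
    },
    lora={"r": 16, "lora_alpha": 32, "lora_dropout": 0.05, "target_modules": ["c_attn"]},
    # Placeholder: fill in from run_safety_gate once the adapter has been trained.
    gate_results={
        "prompt": "What is the standard procedure for a hydraulic leak?",
        "missing_keywords": None,
    },
)
print(record_json)
```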

## 🧪 Lab Assignment
1. Run QLoRA training on your maintenance dataset (Zephyr or Mistral 7B).
2. Profile latency on both T4 and A10 GPUs.
3. Extend safety gate to include bilingual keywords and numeric tolerances.
4. Produce a comparison memo for IT showcasing cost savings.

## ✅ Checklist
- [ ] LoRA targets selected and documented
- [ ] QLoRA quantization tested
- [ ] Safety gates passed
- [ ] Metrics shared with stakeholders

## 📚 References
- Dettmers et al., *QLoRA: Efficient Finetuning of Quantized LLMs* (2023)
- HuggingFace Blog: *Low-Rank Adapters in Production*
- Week 07 Decision Matrix Notebook