# 🪄 Week 09-10 · Notebook 09 · LoRA & QLoRA for Cost-Efficient Fine-tuning

Apply low-rank adapters and 4-bit quantization to tailor models for remote plants running on modest GPUs.

## 🎯 Learning Objectives
- Understand LoRA math and target module selection.
- Configure QLoRA with `bitsandbytes` for 4-bit training.
- Evaluate latency, memory, and accuracy trade-offs on maintenance logs.
- Implement safety gates to ensure SOP steps survive quantization.

## 🧩 Scenario
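A supplier wants an on-prem assistant that runs on a single NVIDIA T4 (16 GB). QLoRA trains LoRA adapters on top of a 4-bit quantized base model. It produces small, maintainable adapters without the compute and memory cost of full fine-tuning.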
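A back-of-the-envelope weight-memory estimate shows why 4-bit loading matters on a 16 GB card. This is a minimal sketch. It assumes a nominal 7B parameter count and counts weights only. It ignores quantization constants, LoRA parameters, optimizer state, activations, and the KV cache.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7e9  # assumption: nominal 7B parameters (the real count differs slightly)
for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>6}: {weight_memory_gb(n_params, bits):5.1f} GB")
```

The estimates come to about 28, 14, and 3.5 GB. fp16 weights alone leave little headroom for training on a T4. 4-bit weights leave most of the card free for activations and adapter training.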

In [None]:
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from transformers import BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import pandas as pd
import torch
import numpy as np

torch.manual_seed(29)

## 📄 Synthetic Shift Reports
Short instructions and responses representing maintenance troubleshooting.

In [None]:
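shift_reports = Dataset.from_list([  # hypothetical placeholder; replace with real 'instruction'/'response' records
    {'instruction': 'Shift summary: Press 12 error code E42. Provide response checklist.',
     'response': 'Isolate hydraulic pump. Reset PLC.'},
])
shift_reports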

## 🧾 Tokenizer & Preprocess
We simulate instruction tuning with prompt → response pairs.

In [None]:
base_model_name = 'meta-llama/Llama-2-7b-hf'  # placeholder; requires license
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token

In [None]:
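def tokenize(batch):
    # Hypothetical prompt template (prompt, newline, response, EOS); reuse it at inference time.
    texts = [f"{ins}\n{res}{tokenizer.eos_token}"
             for ins, res in zip(batch['instruction'], batch['response'])]
    inputs = tokenizer(texts, padding='max_length', truncation=True, max_length=256)
    # Causal LMs shift labels internally, so labels mirror input_ids; padded positions get -100 (ignored by the loss).
    inputs['labels'] = [[tok if mask else -100 for tok, mask in zip(ids, attn)]
                        for ids, attn in zip(inputs['input_ids'], inputs['attention_mask'])]
    return inputs

tokenized_shifts = shift_reports.map(tokenize, batched=True, remove_columns=shift_reports.column_names)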
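Prompt → response tuning commonly concatenates both parts into one sequence. Prompt and padding positions then get the label `-100`, which PyTorch's cross-entropy ignores by default, so the loss covers only the response. The sketch below works on plain token-id lists. The toy ids and `pad_id` are hypothetical, and `build_example` is an illustrative helper, not a library function.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_example(prompt_ids, response_ids, max_length, pad_id):
    """Concatenate prompt and response ids; supervise only the response tokens."""
    input_ids = (prompt_ids + response_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_length]
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    input_ids += [pad_id] * pad           # right-pad inputs
    labels += [IGNORE_INDEX] * pad        # never learn from padding
    attention_mask += [0] * pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Toy ids: prompt [11, 12, 13], response [21, 22] followed by an EOS id of 2
example = build_example([11, 12, 13], [21, 22, 2], max_length=8, pad_id=2)
print(example["labels"])  # [-100, -100, -100, 21, 22, 2, -100, -100]
```

Masking is driven by position rather than token value. The trailing EOS (id 2 here) therefore stays supervised even though the pad id is the same, so the model still learns when to stop.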

## ⚙️ LoRA Configuration
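Target the query, key, and value projections of the attention layers (`q_proj`, `k_proj`, `v_proj` in Llama-style models). Adding `o_proj` and the MLP projections raises adapter capacity at the cost of more trainable parameters.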
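LoRA keeps each targeted pretrained matrix $W_0$ frozen and learns a low-rank correction. Here $r$ is the rank and $\alpha$ corresponds to PEFT's `lora_alpha`:

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad W_0 \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Each adapted matrix adds $r(d + k)$ trainable parameters. Take a $4096 \times 4096$ projection, the q/k/v size in Llama-2-7B, at $r = 8$. It gets $8 \times 8192 = 65{,}536$ trainable parameters against $16{,}777{,}216$ frozen ones, about 0.39%.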

In [None]:
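lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,  # illustrative hyperparameters
                         target_modules=['q_proj', 'k_proj', 'v_proj'],  # Llama attention q/k/v projections
                         bias='none', task_type='CAUSAL_LM')
# fp16 baseline for comparison: ~14 GB of weights nearly fills a 16 GB T4, so run this on a larger GPU or skip it.
fp16_model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16, device_map='auto')
lora_model = get_peft_model(fp16_model, lora_config)
lora_model.print_trainable_parameters()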
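To see the mechanics without downloading a model, here is a minimal NumPy sketch of one LoRA-wrapped linear layer. `LoRALinear` is a hypothetical illustration, not PEFT's implementation. It uses a random stand-in for the pretrained weight and a simplified init for `A`; PEFT's default uses Kaiming-uniform for `A`. As in PEFT, `B` starts at zero, so the adapted layer initially reproduces the base output exactly.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W0 plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, d_out, d_in, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W0 = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen (random stand-in for pretrained)
        self.A = rng.standard_normal((r, d_in)) * 0.01                 # trainable; simplified init
        self.B = np.zeros((d_out, r))                                  # trainable; zero init => no change at start
        self.scale = alpha / r

    def __call__(self, x):
        # x: (batch, d_in). Computes B(Ax) without materialising the d_out x d_in update matrix.
        return x @ self.W0.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

layer = LoRALinear(d_out=4096, d_in=4096, r=8)
x = np.ones((2, 4096))
print(np.allclose(layer(x), x @ layer.W0.T))   # True: zero-initialised B leaves the output unchanged
print(layer.trainable_params(), layer.W0.size)  # 65536 16777216
```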

## 🧮 QLoRA Setup
Load the base model in 4-bit with `bitsandbytes`. Storing weights in 4 bits instead of 16 cuts weight memory roughly fourfold versus fp16.

In [None]:
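from peft import prepare_model_for_kbit_training

# NF4 with double quantization as in the QLoRA paper; the T4 lacks native bf16, so compute in fp16.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4',
                                bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16)
qlora_base = AutoModelForCausalLM.from_pretrained(base_model_name, quantization_config=bnb_config, device_map='auto')
qlora_base = prepare_model_for_kbit_training(qlora_base)  # freezes base weights, enables gradient checkpointing
qlora_model = get_peft_model(qlora_base, lora_config)
qlora_model.print_trainable_parameters()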
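The 4-bit base is stored blockwise. Each block of weights is scaled by its absolute maximum and every value is snapped to one of 16 levels. The sketch below shows that idea in NumPy under a loudly stated simplification: it uses a uniform 16-level codebook instead of the real NF4 levels, which are placed at normal-distribution quantiles. The block size of 64 is also an assumption. `quantize_blockwise` and `dequantize_blockwise` are hypothetical helpers, not `bitsandbytes` functions.

```python
import numpy as np

LEVELS = np.linspace(-1.0, 1.0, 16)  # assumption: uniform codebook standing in for NF4's quantile levels

def quantize_blockwise(w, block_size=64):
    """Return 4-bit codes (0..15, stored here in uint8; real kernels pack two per byte) and one scale per block."""
    blocks = w.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    absmax[absmax == 0] = 1.0                    # avoid division by zero for all-zero blocks
    normed = blocks / absmax                     # every block now lies in [-1, 1]
    codes = np.abs(normed[..., None] - LEVELS).argmin(axis=-1).astype(np.uint8)  # nearest level
    return codes, absmax

def dequantize_blockwise(codes, absmax, shape):
    return (LEVELS[codes] * absmax).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 64)).astype(np.float32)
codes, absmax = quantize_blockwise(w)
w_hat = dequantize_blockwise(codes, absmax, w.shape)
print(int(codes.max()), float(np.abs(w - w_hat).max()))
```

Rounding to the nearest of 16 evenly spaced levels bounds each error by half a step, which is `absmax / 15` per block. The per-block scales themselves cost extra memory. QLoRA's double quantization (`bnb_4bit_use_double_quant=True`) reduces that overhead by quantizing the scales too.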

## 🧪 Training Loop (QLoRA)
Adjust epochs, dataset size, and evaluation hooks in production.

In [None]:
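# Illustrative hyperparameters; output_dir is a hypothetical path.
training_args = TrainingArguments(output_dir='qlora-shift-adapter', per_device_train_batch_size=1,
                                  gradient_accumulation_steps=8, num_train_epochs=1, learning_rate=2e-4,
                                  fp16=True, logging_steps=10, report_to='none')
qlora_trainer = Trainer(model=qlora_model, args=training_args, train_dataset=tokenized_shifts)
# qlora_trainer.train()  # Uncomment when running with GPU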
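With gradient accumulation, memory is set by the per-device batch, while the number of optimizer updates is set by the effective batch. The sketch below estimates update counts when sizing epochs and dataset size. It is an approximation: `Trainer`'s own rounding can differ by a step per epoch. The 2,000-example dataset is hypothetical.

```python
import math

def optimizer_steps(n_examples, per_device_batch, grad_accum, epochs, n_devices=1):
    """Approximate optimizer updates; each update consumes per_device_batch * grad_accum * n_devices examples."""
    effective_batch = per_device_batch * grad_accum * n_devices
    return math.ceil(n_examples / effective_batch) * epochs

print(optimizer_steps(n_examples=2000, per_device_batch=1, grad_accum=16, epochs=3))  # 375
```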

## 📉 Safety Gate Checks
Check that quantization and adaptation have not dropped safety-critical content. The generated checklist must contain every mandatory SOP step keyword.

In [None]:
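def safety_gate(model, prompt, expected_keywords):
    model.eval()
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)  # match the model's device (device_map='auto')
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=96, do_sample=False)  # greedy for reproducible checks
    # Decode only the newly generated tokens so keywords echoed from the prompt cannot count as a pass.
    new_tokens = output_ids[0][inputs['input_ids'].shape[1]:]
    text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    missing = [kw for kw in expected_keywords if kw.lower() not in text.lower()]
    return text, missing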

test_prompt = 'Shift summary: Press 12 error code E42. Provide response checklist.'
expected = ['Isolate hydraulic pump', 'Reset PLC']
generated, missing_keywords = safety_gate(qlora_model, test_prompt, expected)
generated, missing_keywords
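A release ticket usually needs more than one prompt. The sketch below runs a list of cases through any text-generation callable and records pass/fail per case, normalizing case and whitespace. `run_safety_suite` and the case format are hypothetical. In the notebook, the callable could wrap the gate as `lambda p: safety_gate(qlora_model, p, [])[0]`; here a stub generator stands in so the logic runs without a GPU.

```python
from typing import Callable, Dict, List

def _normalize(s: str) -> str:
    return " ".join(s.split()).lower()  # collapse whitespace, ignore case

def run_safety_suite(generate: Callable[[str], str], cases: List[Dict]) -> List[Dict]:
    """Run every case through `generate` and record which mandatory keywords are missing."""
    results = []
    for case in cases:
        text = _normalize(generate(case["prompt"]))
        missing = [kw for kw in case["expected"] if _normalize(kw) not in text]
        results.append({"prompt": case["prompt"], "missing": missing, "passed": not missing})
    return results

cases = [{"prompt": "Shift summary: Press 12 error code E42. Provide response checklist.",
          "expected": ["Isolate hydraulic pump", "Reset PLC"]}]
stub = lambda p: "1. Isolate  hydraulic pump.\n2. Reset PLC."  # stand-in for the model
print(all(r["passed"] for r in run_safety_suite(stub, cases)))  # True
```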

## ⏱️ Latency & Memory Snapshot
Collect quick comparisons for stakeholder update.

In [None]:
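def compare_metrics():
    # get_memory_footprint() reports parameter + buffer bytes; measure latency separately on the target GPU.
    return pd.DataFrame([
        {'setup': 'LoRA on fp16 base', 'memory_gb': lora_model.get_memory_footprint() / 1e9},
        {'setup': 'QLoRA on 4-bit NF4 base', 'memory_gb': qlora_model.get_memory_footprint() / 1e9},
    ])

compare_metrics()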
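For the latency side of the snapshot, a small timing helper keeps measurements consistent across GPUs. It runs warm-up calls first, then reports the median and worst wall-clock time. `measure_latency` is a hypothetical helper. For GPU work, the timed callable should end with `torch.cuda.synchronize()` so queued kernels are included, e.g. `lambda: (qlora_model.generate(**inputs, max_new_tokens=96), torch.cuda.synchronize())`.

```python
import statistics
import time

def measure_latency(fn, warmup=2, runs=10):
    """Wall-clock latency of fn() in seconds: median and max over `runs` timed calls after `warmup` untimed ones."""
    for _ in range(warmup):
        fn()  # warm-up: caches, CUDA kernels, allocator
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {"median_s": statistics.median(timings), "max_s": max(timings)}

print(measure_latency(lambda: sum(range(100_000)), warmup=1, runs=5))
```

The median is reported rather than the mean so a single slow outlier, such as a first-call allocation, does not dominate the stakeholder number.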

### 🛡️ Governance Checklist
- Validate the base model's license terms (Llama 2 license/EULA) with legal before deployment.
- Document quantization settings in model registry.
- Capture safety gate results and attach to release ticket.
- Schedule drift review every 30 days.
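Documenting quantization settings is easier when they go into a structured record rather than free text. The sketch below serializes a hypothetical registry entry to JSON. The `AdapterRecord` fields are illustrative and do not follow any specific registry's schema. In practice, both `BitsAndBytesConfig` and `LoraConfig` expose `to_dict()`, which could populate such a record directly.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class AdapterRecord:
    """Hypothetical registry entry for one LoRA/QLoRA adapter release."""
    base_model: str
    quant_type: str
    compute_dtype: str
    double_quant: bool
    lora_r: int
    lora_alpha: int
    target_modules: List[str] = field(default_factory=list)
    safety_gate_missing: List[str] = field(default_factory=list)  # empty list => gate passed

record = AdapterRecord("meta-llama/Llama-2-7b-hf", "nf4", "float16", True, 8, 16,
                       ["q_proj", "k_proj", "v_proj"])
print(json.dumps(asdict(record), indent=2))
```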

## 🧪 Lab Assignment
1. Run QLoRA training on your maintenance dataset (Zephyr or Mistral 7B).
2. Profile latency on both T4 and A10 GPUs.
3. Extend safety gate to include bilingual keywords and numeric tolerances.
4. Produce a comparison memo for IT showcasing cost savings.

## ✅ Checklist
- [ ] LoRA targets selected and documented
- [ ] QLoRA quantization tested
- [ ] Safety gates passed
- [ ] Metrics shared with stakeholders

## 📚 References
- Dettmers et al., *QLoRA: Efficient Finetuning of Quantized LLMs* (2023)
- Hugging Face Blog: *Low-Rank Adapters in Production*
- Week 07 Decision Matrix Notebook