# 🧩 Week 09-10 · Notebook 08 · PEFT Foundations with PEFT Library

Introduce adapter-based fine-tuning techniques to upgrade multilingual maintenance assistants under strict compute budgets.

## 🎯 Learning Objectives
- **Understand PEFT:** Explain the trade-offs between different Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA, prefix-tuning, and prompt tuning.
- **Implement LoRA:** Configure and apply LoRA (Low-Rank Adaptation) using the Hugging Face `peft` library to adapt a model for bilingual manufacturing FAQs.
- **Measure Performance:** Quantify how adding adapters to the base model affects inference latency and memory usage, and contrast the adapter's size with the cost of full fine-tuning.
- **Align with Governance:** Ensure that model updates using PEFT align with the plant's IT change management and maintenance freeze policies.

## 🧩 Scenario
A multilingual plant uses English, Hindi, and Spanish. Leadership wants improved Hindi accuracy without doubling GPU spend.

In [None]:
import time
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TrainingArguments, Trainer
from peft import LoraConfig, TaskType, get_peft_model
import torch
import numpy as np
import pandas as pd  # Used for the benchmark summary table

torch.manual_seed(101)

## 📚 Synthetic Bilingual FAQ
Short Q/A pairs about maintenance tasks. Replace with your plant's bilingual corpus.

In [None]:
qa_pairs_data = [
    {"question": "How do I check the pressure on the hydraulic press?", "answer": "Check the main gauge on panel B. It should be between 1800 and 2000 psi."},
    {"question": "हाइड्रोलिक प्रेस पर दबाव की जांच कैसे करें?", "answer": "पैनल बी पर मुख्य गेज की जांच करें। यह 1800 और 2000 psi के बीच होना चाहिए।"},
    {"question": "What is the torque spec for the main bolts on the CNC machine?", "answer": "The torque specification is 120 Nm. Use a calibrated torque wrench."},
    {"question": "सीएनसी मशीन पर मुख्य बोल्ट के लिए टॉर्क स्पेक क्या है?", "answer": "टॉर्क स्पेसिफिकेशन 120 Nm है। कैलिब्रेटेड टॉर्क रिंच का उपयोग करें।"},
    {"question": "Emergency stop procedure for the conveyor belt?", "answer": "Press the large red button located at the start and end of the line. Do not restart without supervisor approval."},
    {"question": "कन्वेयर बेल्ट के लिए आपातकालीन स्टॉप प्रक्रिया?", "answer": "लाइन की शुरुआत और अंत में स्थित बड़े लाल बटन को दबाएं। पर्यवेक्षक की मंजूरी के बिना पुनरारंभ न करें।"},
    {"question": "How to clean the laser sensor?", "answer": "Use a microfiber cloth with isopropyl alcohol. Do not apply direct pressure."},
    {"question": "लेजर सेंसर को कैसे साफ करें?", "answer": "आइसोप्रोपिल अल्कोहल के साथ एक माइक्रोफाइबर कपड़े का उपयोग करें। सीधा दबाव न डालें।"}
]
qa_pairs = Dataset.from_list(qa_pairs_data)
qa_pairs
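
Before training on a bilingual corpus, it can help to confirm that both languages are represented evenly. The sketch below is one rough way to do that, assuming Hindi text is written in Devanagari and English in Latin script. `detect_script` and `sample_questions` are hypothetical names introduced here for illustration, and the majority-vote rule is a crude heuristic, not a real language identifier.

```python
from collections import Counter

# Unicode Devanagari block (U+0900 to U+097F).
DEVANAGARI = range(0x0900, 0x0980)

def detect_script(text: str) -> str:
    """Label text 'hi' or 'en' by majority script; 'unknown' if neither script appears."""
    devanagari = sum(ord(ch) in DEVANAGARI for ch in text)
    latin = sum(ch.isascii() and ch.isalpha() for ch in text)
    if devanagari == 0 and latin == 0:
        return "unknown"
    return "hi" if devanagari > latin else "en"

# Two questions taken from the synthetic FAQ above.
sample_questions = [
    "How to clean the laser sensor?",
    "लेजर सेंसर को कैसे साफ करें?",
]

language_counts = Counter(detect_script(q) for q in sample_questions)
print(language_counts)
```

On the full dataset, the same counter could be built from `qa_pairs["question"]` to spot an unbalanced corpus before fine-tuning.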

## 🧾 Tokenization & Data Collation
Use a small instruction-tuned base model (stub).

In [None]:
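# We use a small, instruction-tuned model as our base.
# In a real scenario, this might be a larger model like Llama or Mistral
# (decoder-only models would use AutoModelForCausalLM and TaskType.CAUSAL_LM instead).
# Check how the tokenizer handles Devanagari before trusting Hindi results; a multilingual base may fit better.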
model_name = 'google/flan-t5-small'
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
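def preprocess_function(batch):
    """Prepares the data for training by tokenizing questions and answers."""
    # Format each input as a task-specific prompt.
    inputs = [f"Question: {q}" for q in batch['question']]

    # Tokenize the inputs (questions) and targets (answers).
    model_inputs = tokenizer(inputs, text_target=batch['answer'], max_length=128, truncation=True, padding="max_length")

    # Replace padding in the labels with -100 so the loss ignores padded positions.
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in label]
        for label in model_inputs["labels"]
    ]
    return model_inputs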

tokenized_ds = qa_pairs.map(preprocess_function, batched=True)
print(f"Tokenized dataset columns: {tokenized_ds.column_names}")
print(f"\nExample input_ids: {tokenized_ds[0]['input_ids']}")

## ⚙️ Configure LoRA Adapter
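This notebook surveys PEFT methods in general, but the demo uses LoRA: it trains only a small set of low-rank matrices, and the trained update can be merged into the base weights, so inference latency stays close to that of the base model.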
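
To make the low-rank idea concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. It is an illustration under assumptions, not the `peft` implementation: the sizes are toy values, `lora_forward` is a hypothetical helper, and "training" is simulated by filling `B` with random numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 6, 2, 4     # toy sizes, chosen only for illustration
scale = alpha / r                       # same ratio as lora_alpha=32 with r=16

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero at init

def lora_forward(x, W, A, B, scale):
    """Base projection plus the scaled low-rank update, kept as separate matmuls."""
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.normal(size=(3, d_in))

# With B = 0 the adapter contributes nothing, so outputs match the base layer.
starts_as_base = np.allclose(lora_forward(x, W, A, B, scale), x @ W.T)

# Stand-in for training: give B non-zero values.
B = rng.normal(size=(d_out, r))

# Merging folds the update into one matrix, so inference needs a single matmul.
W_merged = W + scale * (B @ A)
merge_matches = np.allclose(lora_forward(x, W, A, B, scale), x @ W_merged.T)

full_params = d_out * d_in          # parameters in the full weight matrix
lora_params = r * (d_in + d_out)    # parameters in A and B together
print(starts_as_base, merge_matches, full_params, lora_params)
```

At these toy sizes the saving is modest; because the adapter grows with `r * (d_in + d_out)` rather than `d_in * d_out`, the gap widens quickly for realistic layer widths.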

In [None]:
# Load the base model
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Configure LoRA
lora_config = LoraConfig(
    r=16,  # Rank of the update matrices. Lower rank means fewer parameters to train.
    lora_alpha=32,  # Scaling factor: the adapter update is multiplied by lora_alpha / r (here 2.0).
    target_modules=["q", "v"],  # Apply LoRA to the query and value layers of the attention blocks.
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM  # This is a sequence-to-sequence model.
)

# Create the PEFT model
peft_model = get_peft_model(base_model, lora_config)

# Print the percentage of trainable parameters
peft_model.print_trainable_parameters()
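
The trainable-parameter count can be sanity-checked by hand. The sketch below assumes the `google/flan-t5-small` configuration has `d_model=512`, 6 attention heads of size 64 (so the `q` and `v` projections map 512 to 384), and 8 encoder plus 8 decoder layers. These dimensions are assumptions worth confirming against `base_model.config`; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed flan-t5-small dimensions (verify with base_model.config).
D_MODEL = 512
ATTN_INNER = 6 * 64               # num_heads * d_kv
N_ATTENTION_BLOCKS = 8 + 8 + 8    # encoder self-attn, decoder self-attn, decoder cross-attn
TARGETS_PER_BLOCK = 2             # target_modules=["q", "v"]
R = 16

per_module = lora_param_count(D_MODEL, ATTN_INNER, R)
total_trainable = per_module * N_ATTENTION_BLOCKS * TARGETS_PER_BLOCK
adapter_fp32_mb = total_trainable * 4 / 1e6   # 4 bytes per float32 parameter

print(f"per module: {per_module:,}")
print(f"total trainable: {total_trainable:,}")
print(f"approx. adapter size (fp32): {adapter_fp32_mb:.2f} MB")
```

If the figure printed by `print_trainable_parameters()` differs from this estimate, the assumed dimensions or the set of matched modules are the first things to check.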

## 🧪 Training Arguments (Demo)
For illustration, run a few steps; in production increase epochs and dataset size.

In [None]:
training_args = TrainingArguments(
    output_dir="./peft-faq-assistant",
    auto_find_batch_size=True, # Automatically find a batch size that fits on the GPU
    learning_rate=3e-4,
    num_train_epochs=10,
    logging_steps=1,
    report_to="none", # Disable reporting to services like W&B for this demo
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_ds,
    tokenizer=tokenizer,
)

# Start the training
# Only the adapter weights are updated; the base model weights stay frozen.
RUN_TRAINING = False  # Set to True to run fine-tuning (a GPU is recommended)
if RUN_TRAINING:
    print("--- Starting PEFT Training ---")
    trainer.train()
    print("--- PEFT Training Complete ---")
else:
    print("--- (Skipping actual training for this demo) ---")

## ⏱️ Latency & Memory Comparison
Measure inference timing before and after adapters (pseudo-benchmark).

In [None]:
def benchmark_inference(model, tokenizer, prompt, device="cpu"):
    """Benchmark average generation latency (ms) and peak GPU memory (MB)."""
    model.to(device)
    model.eval()  # Disable dropout (including LoRA dropout) during inference
    inputs = tokenizer(prompt, return_tensors='pt').to(device)

    # Warm-up run so one-time CUDA initialization does not skew the timings.
    with torch.inference_mode():
        _ = model.generate(**inputs, max_new_tokens=50)

    if device == "cuda":
        torch.cuda.synchronize()
        # Reset the peak counter so it reflects only the timed runs below.
        torch.cuda.reset_peak_memory_stats(device)

    # Measure latency
    latencies = []
    for _ in range(10):  # More runs for a stable average
        start_time = time.perf_counter()
        with torch.inference_mode():
            _ = model.generate(**inputs, max_new_tokens=50)
        if device == "cuda":
            torch.cuda.synchronize()  # Wait for queued GPU work before stopping the clock
        end_time = time.perf_counter()
        latencies.append((end_time - start_time) * 1000)  # Convert to ms

    avg_latency_ms = np.mean(latencies)

    # Peak GPU memory during the timed runs (reported as 0 on CPU)
    peak_memory_mb = torch.cuda.max_memory_allocated(device) / 1e6 if device == "cuda" else 0

    return avg_latency_ms, peak_memory_mb

# --- Run Benchmarks ---
device = "cuda" if torch.cuda.is_available() else "cpu"
prompt = "Question: What is the torque spec for the main bolts on the CNC machine?"

print(f"--- Benchmarking on {device} ---")

if device == "cpu":
    print("CUDA not available. Skipping benchmark as it requires a GPU for meaningful comparison.")
else:
    # Benchmark Base Model
    # get_peft_model() injects the LoRA layers into base_model in place, so
    # benchmarking base_model directly would also run the adapter.
    # Disabling the adapter temporarily approximates the original base model.
    with peft_model.disable_adapter():
        base_model_latency, base_model_memory = benchmark_inference(peft_model, tokenizer, prompt, device)

    # Benchmark PEFT Model (adapter enabled)
    peft_model_latency, peft_model_memory = benchmark_inference(peft_model, tokenizer, prompt, device)

    # The memory reported above is peak memory during inference, not the storage size of the weights.
    # Weight sizes are computed separately; LoRA parameters have "lora_" in their names.
    peft_model_param_mb = base_model.get_memory_footprint() / 1e6
    adapter_param_mb = sum(
        p.numel() * p.element_size() for n, p in base_model.named_parameters() if "lora_" in n
    ) / 1e6
    base_model_param_mb = peft_model_param_mb - adapter_param_mb

    benchmark_df = pd.DataFrame([
        {"Model": "Base Model", "Avg Latency (ms)": f"{base_model_latency:.2f}", "Peak Inference Memory (MB)": f"{base_model_memory:.2f}", "Model Size (MB)": f"{base_model_param_mb:.2f}"},
        {"Model": "PEFT (LoRA)", "Avg Latency (ms)": f"{peft_model_latency:.2f}", "Peak Inference Memory (MB)": f"{peft_model_memory:.2f}", "Model Size (MB)": f"{peft_model_param_mb:.2f}"}
    ])

    print(benchmark_df)
    print("\nNote: PEFT model size is slightly larger due to the added adapter weights, but the key benefit is the tiny size of the *trainable* weights that need to be saved and deployed.")

### 🧭 Maintenance Freeze Checklist
- Deploy adapters during scheduled maintenance window (Friday 23:00-02:00).
- Back up base model and adapter weights in model registry.
- Smoke-test multilingual prompts before shift turnover.
- Document change in OT ticketing system.
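
The maintenance window in the checklist above could be enforced by a small deployment gate. The sketch below assumes the window runs from Friday 23:00 to Saturday 02:00 in the plant's local time, with the end time exclusive. `in_maintenance_window` is a hypothetical helper, not part of any ticketing or registry system.

```python
import datetime as dt

WINDOW_START = dt.time(23, 0)   # Friday 23:00 (assumed local plant time)
WINDOW_END = dt.time(2, 0)      # Saturday 02:00, treated as exclusive
FRIDAY, SATURDAY = 4, 5         # datetime.weekday(): Monday is 0

def in_maintenance_window(now: dt.datetime) -> bool:
    """Return True if `now` falls inside the Friday 23:00 to Saturday 02:00 window."""
    if now.weekday() == FRIDAY:
        return now.time() >= WINDOW_START
    if now.weekday() == SATURDAY:
        return now.time() < WINDOW_END
    return False

# 2025-01-03 is a Friday, so 23:30 falls inside the window.
print(in_maintenance_window(dt.datetime(2025, 1, 3, 23, 30)))
```

Because the window crosses midnight, the check treats the Friday and Saturday portions separately instead of comparing a single start and end time.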

## 🧪 Lab Assignment
1. Expand dataset with 50 bilingual Q/A pairs from your plant.
2. Train a prefix-tuning adapter (`PrefixTuningConfig`, which corresponds to `PeftType.PREFIX_TUNING`) and compare its accuracy with the LoRA adapter.
3. Log GPU memory usage with and without adapters (`torch.cuda.memory_allocated`).
4. Produce an adapter release note aligned with IT governance.
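
The assignment asks for an accuracy comparison without fixing a metric. One possible choice, sketched below, is exact match plus token-level F1 between generated and reference answers. This is an assumption rather than a metric prescribed by the notebook: `normalize`, `exact_match`, and `token_f1` are hypothetical helpers, and whitespace tokenization is a simplification that leaves punctuation attached to words in both English and Hindi.

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Illustrative prediction/reference pair, not real model output.
score = token_f1("The torque is 120 Nm", "120 Nm")
print(f"token F1: {score:.3f}")
```

Averaging these scores separately over the English and Hindi questions would show whether an adapter improves Hindi without degrading English.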

## ✅ Checklist
- [ ] Adapter method selected with justification
- [ ] Bilingual dataset curated
- [ ] Latency/memory benchmark recorded
- [ ] Change-management plan approved

## 📚 References
- Hugging Face PEFT Documentation
- *Adapter Tuning for Industrial NLP* (Siemens, 2025)
- Week 05-06 prompt libraries for reference prompts