# 🎯 Week 5-6 · Notebook 06 · Few-Shot Learning

Steer large language models with curated in-context examples instead of expensive fine-tuning. We'll design exemplar libraries, automate retrieval, and measure lift for manufacturing workflows.

## 🚀 Learning Outcomes
- Diagnose when few-shot prompting outperforms zero-shot and approaches fine-tuned accuracy.
- Curate exemplar catalogs that reflect manufacturing vocab, units, and edge cases.
- Automate example retrieval with embeddings, clustering, and recency filters.
- Evaluate response quality and iterate on prompt + exemplar combinations.

## 🏭 Manufacturing Use Cases
| Workflow | Few-Shot Goal | Example Inputs | Desired Output |
| --- | --- | --- | --- |
| Maintenance triage | Recommend first fix | Incident text, telemetry snippet | Single actionable step |
| Quality deviation | Suggest containment plan | NCR description, lot data | Checklist with owners |
| Supplier response | Draft bilingual reply | Email thread excerpt | EN/ES email |
| Production reports | Summarize shift log | Raw log paragraphs | 120-word summary with KPIs |

## 🧱 Exemplar Design Principles
1. Mirror the target output format exactly (tone, length, structure).
2. Cover edge cases: multilingual, sensor gaps, safety triggers.
3. Include metadata like machine IDs to anchor model understanding.
4. Refresh examples periodically so they keep pace with current equipment, procedures, and failure modes.
5. Keep prompts within the context window; aim for ≤ 512 tokens per request.
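
The token budget in principle 5 can be checked mechanically before a request is sent. The sketch below is a minimal illustration, assuming a crude whitespace word count as a stand-in for real tokenization (the tokenizer that ships with the generation model will give different counts). `approx_tokens` and `fit_exemplars` are hypothetical helper names, not part of any library.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated words (an assumption, not a real tokenizer)."""
    return len(text.split())


def fit_exemplars(instructions, exemplars, query, budget=512):
    """Keep exemplars in order until adding one more would exceed the token budget.

    The instructions and the query are always counted; exemplars are dropped from the end.
    """
    used = approx_tokens(instructions) + approx_tokens(query)
    kept = []
    for ex in exemplars:
        cost = approx_tokens(ex)
        if used + cost > budget:
            break
        kept.append(ex)
        used += cost
    return kept, used


instructions = "Recommend the first corrective action in one concise sentence."
exemplars = [
    "Ticket: Hydraulic leak detected on clamp cylinder.\nAction: Isolate machine and replace seals.",
    "Ticket: Camera misreads due to glare on SMT line.\nAction: Adjust lighting and recalibrate vision thresholds.",
]
query = "Ticket: AGV slowed near station 5 due to repeated lidar faults.\nAction:"

# A deliberately small budget so the trimming is visible on two exemplars.
kept, used = fit_exemplars(instructions, exemplars, query, budget=40)
print(f"kept {len(kept)} of {len(exemplars)} exemplars, ~{used} tokens")
```

Dropping from the end assumes the exemplars are already ordered by usefulness, for example by retrieval score.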

In [None]:
from transformers import pipeline
import pandas as pd

few_shot_model = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",
    max_new_tokens=120,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.25,
)

examples_df = pd.DataFrame([
    {"ticket": "Vibration spike on compressor 7 after bearing replacement.", "action": "Inspect alignment and re-torque fasteners."},
    {"ticket": "Hydraulic leak detected on clamp cylinder.", "action": "Isolate machine and replace seals."},
    {"ticket": "Camera misreads due to glare on SMT line.", "action": "Adjust lighting and recalibrate vision thresholds."},
    {"ticket": "Packing line robot flags repeated overcurrent alarms.", "action": "Check payload weight and recalibrate torque limits."},
])

examples_df

In [None]:
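def build_prompt(examples, query, persona="reliability engineer"):
    """Assemble instructions, worked exemplars, and the open query into one prompt."""
    lines = [
        f"You are a {persona}. Recommend the first corrective action in one concise sentence.",
        "Respond with the structure: Action: <verb phrase>.",
    ]
    # Each exemplar is a solved Ticket/Action pair in the exact target format.
    for _, ex in examples.iterrows():
        lines.append(f"Ticket: {ex['ticket']}\nAction: {ex['action']}")
    # The query goes last and ends on "Action:" so the model completes that line.
    lines.append(f"Ticket: {query}\nAction:")
    return "\n".join(lines)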

In [None]:
query_ticket = "AGV slowed near station 5 due to repeated lidar faults."
few_shot_prompt = build_prompt(examples_df.head(3), query_ticket)
print("Prompt preview:\n" + "\n".join(few_shot_prompt.splitlines()[:7]))

In [None]:
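# return_full_text=False drops the echoed prompt so only the completion is printed.
response = few_shot_model(few_shot_prompt, return_full_text=False)[0]["generated_text"]
print(response.strip())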
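
Generated text often runs past the requested sentence, for instance by starting a new `Ticket:` line, and may include the echoed prompt. The sketch below shows one way to pull out just the recommended action. `extract_action` is a hypothetical helper, and the demo output string is made up for illustration rather than taken from a model run.

```python
def extract_action(generated: str, prompt: str = "") -> str:
    """Return the first non-empty line of the completion, without an 'Action:' prefix."""
    # Drop the echoed prompt if the generated text starts with it.
    if prompt and generated.startswith(prompt):
        completion = generated[len(prompt):]
    else:
        completion = generated
    for line in completion.splitlines():
        line = line.strip()
        if line:
            # The prompt already ends with "Action:", but some models repeat the label.
            if line.lower().startswith("action:"):
                line = line[len("action:"):].strip()
            return line
    return ""


demo_prompt = "Ticket: AGV slowed near station 5 due to repeated lidar faults.\nAction:"
# Illustrative string only: echoed prompt, one action, then an unwanted extra ticket.
demo_output = demo_prompt + " Clean the lidar lens and rerun the sensor self-test.\nTicket: Conveyor jam"
print(extract_action(demo_output, demo_prompt))
```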

## 🔍 Exemplar Retrieval Pipeline
1. Embed tickets using a lightweight sentence transformer.
2. Filter by metadata (e.g., machine family, shift) to avoid irrelevant matches.
3. Pick top-k exemplars balancing similarity and diversity.
4. Cap total prompt tokens (≤ 512) and append the query last.
5. Log which exemplars were used for traceability.

In [None]:
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

corpus_embeddings = embedder.encode(examples_df.ticket.tolist(), convert_to_tensor=True)
query_embedding = embedder.encode(query_ticket, convert_to_tensor=True)

similarities = util.cos_sim(query_embedding, corpus_embeddings).squeeze().tolist()
retrieval_df = (
    examples_df.assign(similarity=[round(s, 3) for s in similarities])
    .sort_values("similarity", ascending=False)
)
retrieval_df
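
Sorting by similarity alone can return near-duplicate exemplars. One common way to balance similarity and diversity (step 3) is maximal marginal relevance (MMR). This is an assumed technique for illustration, not necessarily the one the course intends. `mmr_select` and `lambda_weight` are hypothetical names, and toy 2-D vectors stand in for real embeddings.

```python
import numpy as np


def mmr_select(query_vec, corpus_vecs, k=3, lambda_weight=0.7):
    """Greedy maximal marginal relevance over cosine similarity.

    Returns the indices of the selected rows of corpus_vecs, in pick order.
    """
    # L2-normalise so that dot products equal cosine similarities.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    relevance = c @ q   # similarity of each exemplar to the query
    pairwise = c @ c.T  # similarity between exemplars

    selected = []
    candidates = list(range(len(c)))
    while candidates and len(selected) < k:
        if selected:
            # Penalise candidates that are close to anything already picked.
            redundancy = pairwise[np.ix_(candidates, selected)].max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        scores = lambda_weight * relevance[candidates] - (1 - lambda_weight) * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors: rows 0 and 1 are near-duplicates, row 2 points elsewhere.
corpus = np.array([[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]])
query = np.array([1.0, 0.1])
print(mmr_select(query, corpus, k=2, lambda_weight=0.5))
```

With `lambda_weight=1.0` the function reduces to plain top-k by similarity. To try it on the notebook's data, the tensors from the cell above could be converted with `.cpu().numpy()` before being passed in.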

## 📈 Evaluation Dashboard
| Metric | Definition | Tooling |
| --- | --- | --- |
| Recommendation accuracy | % of responses matching SME-labelled actions | Manual review, eval harness |
| Safety compliance | % of responses with no missing lockout-tagout or PPE steps | Safety checklist automation |
| Token cost | Prompt + completion tokens per call | Billing export |
| Latency | End-to-end response time | Observability dashboards |
| Drift | Similarity drop between query and exemplars | Embedding similarity logs |

Iterate by swapping exemplars, editing instructions, or expanding context windows.
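
Two of the metrics above are simple enough to sketch directly. The code below assumes exact match after light normalisation as a crude proxy for "matches the SME-labelled action" (real evaluations usually need manual review or a softer comparison), and an arbitrary similarity floor of 0.4 for drift. All function names, predictions, and labels are hypothetical.

```python
import re


def normalise(action: str) -> str:
    """Lower-case, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", action.lower()).split())


def recommendation_accuracy(predictions, sme_labels):
    """Share of predictions that equal the SME-labelled action after normalisation."""
    if len(predictions) != len(sme_labels):
        raise ValueError("predictions and sme_labels must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalise(p) == normalise(g) for p, g in zip(predictions, sme_labels))
    return hits / len(predictions)


def drift_alert(top_similarities, threshold=0.4):
    """Flag a query whose retrieved exemplars average below a similarity floor."""
    mean_sim = sum(top_similarities) / len(top_similarities)
    return mean_sim < threshold


# Hypothetical predictions and labels, for illustration only.
preds = ["Isolate machine and replace seals.", "Replace the lidar unit"]
labels = ["isolate machine and replace seals", "Clean lidar lens and rerun self-test"]
print(recommendation_accuracy(preds, labels))
print(drift_alert([0.31, 0.28, 0.22]))
```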

## 🗄️ Exemplar Management
- Store examples in a versioned dataset with tags (machine, shift, language).
- Capture SME approval status and last refresh date.
- Use data quality checks to remove outdated or conflicting exemplars.
- Pair with analytics to track coverage gaps across product lines.
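
The bullets above can be sketched as a small pandas filter. The catalog rows, column names, and the 365-day freshness window are illustrative assumptions, not a prescribed schema.

```python
import pandas as pd

# Hypothetical catalog rows; column names and values are illustrative assumptions.
catalog = pd.DataFrame([
    {"exemplar_id": "EX-001", "machine": "compressor", "language": "EN",
     "sme_approved": True, "last_refresh": "2025-01-15"},
    {"exemplar_id": "EX-002", "machine": "compressor", "language": "EN",
     "sme_approved": False, "last_refresh": "2025-02-01"},
    {"exemplar_id": "EX-003", "machine": "agv", "language": "ES",
     "sme_approved": True, "last_refresh": "2023-06-30"},
])
catalog["last_refresh"] = pd.to_datetime(catalog["last_refresh"])


def usable_exemplars(df, as_of, max_age_days=365):
    """Keep SME-approved exemplars refreshed within max_age_days of as_of."""
    age_days = (pd.Timestamp(as_of) - df["last_refresh"]).dt.days
    return df[df["sme_approved"] & (age_days <= max_age_days)]


usable = usable_exemplars(catalog, as_of="2025-03-01")
print(usable["exemplar_id"].tolist())

# Coverage per machine family: families with zero usable exemplars are gaps.
coverage = (
    usable.groupby("machine").size()
    .reindex(catalog["machine"].unique(), fill_value=0)
)
print(coverage.to_dict())
```

Here the `agv` family ends up with no usable exemplar because its only entry is stale, which is the kind of coverage gap the last bullet refers to.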

In [None]:
tracking = pd.DataFrame([
    {"prompt": "few_shot_v1", "k": 3, "accuracy": 0.81, "safety": 1.0, "latency_ms": 910},
    {"prompt": "few_shot_v2", "k": 4, "accuracy": 0.87, "safety": 1.0, "latency_ms": 1040},
    {"prompt": "zero_shot_baseline", "k": 0, "accuracy": 0.58, "safety": 0.85, "latency_ms": 650},
])
tracking
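
Lift over the zero-shot baseline can be read straight off this table. The sketch below repeats the same tracking values so it runs on its own; the column names `accuracy_lift` and `extra_latency_ms` are hypothetical.

```python
import pandas as pd

# Same values as the tracking table above, repeated so the block is self-contained.
tracking = pd.DataFrame([
    {"prompt": "few_shot_v1", "k": 3, "accuracy": 0.81, "safety": 1.0, "latency_ms": 910},
    {"prompt": "few_shot_v2", "k": 4, "accuracy": 0.87, "safety": 1.0, "latency_ms": 1040},
    {"prompt": "zero_shot_baseline", "k": 0, "accuracy": 0.58, "safety": 0.85, "latency_ms": 650},
])

# Pull the baseline row, then express every run relative to it.
baseline = tracking.loc[tracking["prompt"] == "zero_shot_baseline"].iloc[0]
lift = tracking.assign(
    accuracy_lift=(tracking["accuracy"] - baseline["accuracy"]).round(2),
    extra_latency_ms=tracking["latency_ms"] - baseline["latency_ms"],
)
print(lift[["prompt", "accuracy_lift", "extra_latency_ms"]])
```

Putting the accuracy gain next to the added latency makes the trade-off between `k=3` and `k=4` explicit.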

## 🛡️ Safety & Compliance
- Include at least one safety-critical exemplar to bias toward conservative actions.
- Ask the model to state assumptions and confidence for traceability.
- Log exemplar IDs used per request to support audits.
- Restrict prompts to non-sensitive data when using cloud-hosted models.
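
Logging exemplar IDs per request can be as simple as one JSON record per call. The sketch below is an assumption about what such a record might contain. The field names, identifiers, and response text are hypothetical, and hashing the prompt is one possible way to keep ticket text out of the log.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(request_id, exemplar_ids, prompt, response):
    """Build a JSON-serialisable audit entry for one few-shot request.

    The prompt is stored as a SHA-256 digest rather than as raw text.
    """
    return {
        "request_id": request_id,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "exemplar_ids": sorted(exemplar_ids),
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "response": response,
    }


record = audit_record(
    request_id="req-0001",              # hypothetical identifier
    exemplar_ids=["EX-003", "EX-001"],  # hypothetical exemplar IDs
    prompt="Ticket: AGV slowed near station 5 due to repeated lidar faults.\nAction:",
    response="Clean the lidar lens and rerun the sensor self-test.",  # illustrative, not model output
)
print(json.dumps(record, indent=2))
```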

## ✅ Checklist
- [ ] Exemplars curated, tagged, and SME-approved
- [ ] Retrieval pipeline tested for latency and relevance
- [ ] Evaluation metrics logged across zero-shot and few-shot baselines
- [ ] Safety review completed for exemplar content
- [ ] Deployment playbook documented

## 🧪 Lab Assignment
1. Curate 30 labelled incidents spanning at least three machine families.
2. Build an embedding index and implement top-k exemplar retrieval with diversity sampling.
3. Compare zero-shot, few-shot (k=3,5), and fine-tuned classifier results on a held-out set.
4. Document accuracy, safety, latency, and cost metrics in the tracker.
5. Present findings plus recommended deployment strategy to leadership.