# 🏗️ Week 09-10 · Notebook 10 · End-to-End Fine-tuning Pipeline

Assemble a production-ready instruction-tuning workflow covering data governance, training orchestration, evaluation, and packaging.

## 🎯 Learning Objectives
- **Curate & Govern Datasets:** Implement a data governance checklist to screen instruction-tuning datasets for PII, freshness, and SME approval.
- **Orchestrate Scalable Training:** Use `accelerate` and `deepspeed` configuration files to launch distributed, production-ready fine-tuning jobs.
- **Implement Robust Evaluation:** Combine automatic metrics (e.g., BLEU, ROUGE) with a structured human review process to ensure model quality and safety.
- **Package for Deployment:** Systematically package all model artifacts—including weights, tokenizer configs, and evaluation reports—into a versioned structure ready for a model registry.

## 🧩 Scenario
A plant governance board requires a formal, documented release procedure before a fine-tuned SOP (standard operating procedure) assistant can be deployed. You must demonstrate data QA, structured training, and evaluation sign-off.

In [None]:
import pandas as pd
from datasets import Dataset
import json
import yaml
from pathlib import Path

## 🛡️ Data Governance Checklist
Start with a QA table capturing PII screening, freshness, and SME review.

In [None]:
governance_data = [
    {
        "dataset_name": "sop_assistant_v1",
        "source_document": "SOP-MECH-001.pdf",
        "pii_screened": True,
        "sme_reviewer": "amit.kumar@example.com",
        "review_date": "2024-10-12",
        "approved_for_training": True,
        "notes": "Contains no PII. Covers standard hydraulic press maintenance."
    },
    {
        "dataset_name": "sop_assistant_v1",
        "source_document": "SOP-ELEC-003.docx",
        "pii_screened": True,
        "sme_reviewer": "priya.sharma@example.com",
        "review_date": "2024-10-11",
        "approved_for_training": True,
        "notes": "Covers PLC firmware update procedures."
    },
    {
        "dataset_name": "sop_assistant_v1",
        "source_document": "incident_reports_q3.csv",
        "pii_screened": False,
        "sme_reviewer": "sanjay.mehta@example.com",
        "review_date": "2024-10-10",
        "approved_for_training": False,
        "notes": "Contains technician names and IDs. Requires anonymization before use."
    }
]
governance_records = pd.DataFrame(governance_data)
governance_records
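
One way to turn the QA table into an enforceable gate is to filter the records before any data reaches training. The sketch below is illustrative only: the helper name `select_approved_sources`, the 365-day freshness window, the fixed `as_of` date, and the `SOP-OLD-000.pdf` row are assumptions, not part of the notebook. It works on a plain list of dicts shaped like `governance_data`.

```python
from datetime import date

def select_approved_sources(records, as_of, max_age_days=365):
    """Return records that are PII-screened, SME-approved, and reviewed recently."""
    approved = []
    for record in records:
        age_days = (as_of - date.fromisoformat(record["review_date"])).days
        if (
            record["pii_screened"]
            and record["approved_for_training"]
            and age_days <= max_age_days
        ):
            approved.append(record)
    return approved

# Hypothetical stand-in rows with the same keys as governance_data.
sample_records = [
    {"source_document": "SOP-MECH-001.pdf", "pii_screened": True,
     "approved_for_training": True, "review_date": "2024-10-12"},
    {"source_document": "incident_reports_q3.csv", "pii_screened": False,
     "approved_for_training": False, "review_date": "2024-10-10"},
    {"source_document": "SOP-OLD-000.pdf", "pii_screened": True,
     "approved_for_training": True, "review_date": "2022-01-05"},
]

approved = select_approved_sources(sample_records, as_of=date(2024, 11, 1))
approved_docs = [record["source_document"] for record in approved]
print(approved_docs)
```

Passing `as_of` explicitly, rather than calling `date.today()` inside the helper, keeps the check repeatable when an audit re-runs it later.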

## 📑 Instruction Dataset Blueprint
Structure each record as an instruction, a context field citing the SOP document and section, and the expected response.

In [None]:
instruction_data = [
    {
        "instruction": "What are the first three steps for the hydraulic press startup procedure?",
        "context": "Document: SOP-MECH-001, Section: 2.1",
        "response": "1. Ensure the main power breaker is in the 'ON' position. 2. Verify that the emergency stop button is disengaged. 3. Check that the hydraulic fluid level is within the green zone on the sight glass."
    },
    {
        "instruction": "Summarize the safety warning for updating PLC firmware.",
        "context": "Document: SOP-ELEC-003, Section: 4.0",
        "response": "Before updating firmware, you must disconnect the PLC from the live production network to prevent unintended machine operations. A full backup of the current logic is mandatory."
    },
    {
        "instruction": "How long should the machine be powered down before maintenance?",
        "context": "Document: SOP-MECH-001, Section: 3.2",
        "response": "The machine must be fully powered down and locked-out/tagged-out (LOTO) for a minimum of 15 minutes to allow capacitors to discharge before any maintenance is performed."
    }
]

# Persist the records as JSONL (one JSON object per line) for the training script
Path("data").mkdir(exist_ok=True)
with open("data/instruction_dataset.jsonl", "w", encoding="utf-8") as f:
    for record in instruction_data:
        f.write(json.dumps(record) + "\n")

# Create a Hugging Face Dataset object
instructions_dataset = Dataset.from_list(instruction_data)
instructions_dataset
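
Before a JSONL file is handed to training, it can help to confirm that every line parses and carries the three fields used above. The following is a minimal sketch under that assumption; `validate_jsonl_lines` and `REQUIRED_FIELDS` are hypothetical names, and the sample lines are made up for illustration.

```python
import json

REQUIRED_FIELDS = ("instruction", "context", "response")

def validate_jsonl_lines(lines):
    """Return (valid_records, errors) for an iterable of JSONL lines."""
    valid, errors = [], []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {line_number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {line_number}: expected a JSON object")
            continue
        # A field counts as missing if it is absent or only whitespace.
        missing = [k for k in REQUIRED_FIELDS if not str(record.get(k, "")).strip()]
        if missing:
            errors.append(f"line {line_number}: missing or empty {missing}")
        else:
            valid.append(record)
    return valid, errors

# Illustrative input: one good record, one with an empty response, one malformed line.
sample_lines = [
    '{"instruction": "Q1", "context": "Document: SOP-MECH-001, Section: 2.1", "response": "A1"}',
    '{"instruction": "Q2", "context": "Document: SOP-ELEC-003, Section: 4.0", "response": ""}',
    "not json",
]
valid_records, validation_errors = validate_jsonl_lines(sample_lines)
print(len(valid_records), validation_errors)
```

To check the file written above, the same function could be called with an open file handle, since iterating a text file yields its lines.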

## ⚙️ Accelerate / DeepSpeed Config (YAML)
Store the launch configuration in a version-controlled YAML file so training runs are reproducible.

In [None]:
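# This configuration targets a single-node, multi-GPU setup using DeepSpeed ZeRO Stage 3.
# Note: the nested block follows the DeepSpeed JSON schema. Check which `deepspeed_config` keys your
# installed `accelerate` version expects (or point to a JSON file via `deepspeed_config_file`) before relying on it.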
accelerate_config = {
    "compute_environment": "local_machine",
    "distributed_type": "DEEPSPEED",
    "machine_rank": 0,
    "num_machines": 1,
    "num_processes": 4, # e.g., 4 GPUs
    "use_cpu": False,
    "deepspeed_config": {
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": 2e-5,
                "betas": [0.9, 0.999],
                "eps": 1e-8
            }
        },
        "fp16": {
            "enabled": True
        },
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True
            },
            "allgather_bucket_size": 5e8,
            "reduce_bucket_size": 5e8,
        }
    }
}

# Create a directory for configs and save the file
Path("configs").mkdir(exist_ok=True)
config_path = Path("configs/deepspeed_zero3_config.yaml")
with open(config_path, 'w', encoding='utf-8') as f:
    yaml.dump(accelerate_config, f)

print(f"--- Accelerate/DeepSpeed Config written to {config_path} ---")
print(config_path.read_text())
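
The `"auto"` entries are placeholders that get resolved at launch time. Whatever values are filled in, DeepSpeed expects the global batch size to equal the per-GPU micro-batch size times the gradient accumulation steps times the number of GPUs. The sketch below just works through that arithmetic; the function name and the example values (2 samples per GPU, 8 accumulation steps) are assumptions chosen for illustration.

```python
def effective_batch_size(micro_batch_per_gpu: int,
                         gradient_accumulation_steps: int,
                         num_processes: int) -> int:
    """Number of samples consumed per optimizer step across all GPUs."""
    return micro_batch_per_gpu * gradient_accumulation_steps * num_processes

# Hypothetical values: 2 samples per GPU, 8 accumulation steps,
# and 4 processes to match "num_processes" in the config above.
global_batch = effective_batch_size(2, 8, 4)
print(global_batch)  # 64
```

Recording this number next to the learning rate makes runs on different GPU counts easier to compare.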

## 🏃 Training Launcher Script
Use the `accelerate launch` CLI entry point, pointing it at the saved config file.

In [None]:
# This command would be run in the terminal to start the distributed training job.
# It uses the `accelerate launch` command, pointing to the config file and the training script.
launcher_script = """
accelerate launch --config_file configs/deepspeed_zero3_config.yaml scripts/run_instruction_tuning.py \\
    --model_name_or_path "meta-llama/Llama-2-7b-hf" \\
    --dataset_path "data/instruction_dataset.jsonl" \\
    --output_dir "models/sop-assistant-v1.1" \\
    --num_train_epochs 3 \\
    --learning_rate 2e-5 \\
    --logging_steps 10
"""

print("--- Example Training Launch Command ---")
print(launcher_script)

*(Create `scripts/run_instruction_tuning.py`, the script referenced by the launch command, following Hugging Face `Trainer` patterns; see the repo template.)*
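
If the launch command is assembled by an orchestration job rather than typed by hand, building it as an argument list avoids quoting mistakes. This is a sketch, not the course's launcher: `build_launch_command` is a hypothetical helper, and the argument values simply mirror the example command above.

```python
import shlex

def build_launch_command(config_file, script, script_args):
    """Build an `accelerate launch` command as a list of arguments."""
    command = ["accelerate", "launch", "--config_file", config_file, script]
    for name, value in script_args.items():
        command += [f"--{name}", str(value)]
    return command

command = build_launch_command(
    "configs/deepspeed_zero3_config.yaml",
    "scripts/run_instruction_tuning.py",
    {
        "model_name_or_path": "meta-llama/Llama-2-7b-hf",
        "dataset_path": "data/instruction_dataset.jsonl",
        "output_dir": "models/sop-assistant-v1.1",
        "num_train_epochs": 3,
        "learning_rate": "2e-5",  # kept as a string so it is passed through unchanged
        "logging_steps": 10,
    },
)

# shlex.join produces a shell-safe, copy-pasteable string for logs.
print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run(command, check=True)`, while the joined string is convenient for run logs.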

## 📊 Evaluation Harness
Combine automatic metrics and human governance review.

In [None]:
# This represents a set of prompts to evaluate the model against.
# In a real pipeline, this would be a separate, held-out dataset.
eval_data = [
    {
        "prompt_id": "EVAL-001",
        "prompt": "What are the first three steps for the hydraulic press startup procedure?",
        "ground_truth_response": "1. Ensure the main power breaker is in the 'ON' position. 2. Verify that the emergency stop button is disengaged. 3. Check that the hydraulic fluid level is within the green zone on the sight glass.",
        "generated_response": "1. Turn on the main power. 2. Check the e-stop. 3. Look at the hydraulic fluid.", # A slightly less formal but correct response
        "severity": "medium",
        "reviewer": "amit.kumar@example.com"
    },
    {
        "prompt_id": "EVAL-002",
        "prompt": "Summarize the safety warning for updating PLC firmware.",
        "ground_truth_response": "Before updating firmware, you must disconnect the PLC from the live production network to prevent unintended machine operations. A full backup of the current logic is mandatory.",
        "generated_response": "You must disconnect the PLC from the network before updating.", # Missing the critical backup step
        "severity": "high",
        "reviewer": "priya.sharma@example.com"
    }
]
eval_set = pd.DataFrame(eval_data)
eval_set

### Automatic Metrics

This section demonstrates how to use standard NLP metrics to evaluate the model's performance automatically. While these metrics are not a substitute for human review, they provide a valuable, scalable way to track model quality over time. We will use the `evaluate` library from Hugging Face.

In [None]:
# Import inside the try block so a missing package is actually caught by the except clause.
try:
    from evaluate import load

    # Load common text generation metrics
    rouge = load('rouge')
    bleu = load('bleu')

    # --- Run Automatic Evaluation ---
    predictions = eval_set['generated_response'].tolist()
    references = eval_set['ground_truth_response'].tolist()

    rouge_results = rouge.compute(predictions=predictions, references=references)
    bleu_results = bleu.compute(predictions=predictions, references=references)

    print("--- Automatic Metric Results ---")
    print(f"ROUGE-L Score: {rouge_results['rougeL']:.4f}")
    print(f"BLEU Score: {bleu_results['bleu']:.4f}")

except ImportError:
    print("The 'evaluate' library or one of its metric dependencies is not installed. Please run 'pip install evaluate rouge_score sacrebleu'.")
    print("Skipping automatic metric calculation.")

# --- Human Review Sign-off ---
# In a real pipeline, SMEs would record this verdict through a review workflow;
# here we simulate it with a keyword rule on the eval DataFrame.
# This step runs even when the automatic metrics were skipped.
eval_set['human_review_status'] = eval_set.apply(
    lambda row: 'FAIL' if row['severity'] == 'high' and 'backup' not in row['generated_response'].lower() else 'PASS',
    axis=1
)

print("\n--- Human Review Simulation ---")
display(eval_set[['prompt_id', 'severity', 'generated_response', 'human_review_status']])

# Final sign-off check
if 'FAIL' in eval_set['human_review_status'].values:
    print("\n❌ GOVERNANCE FAILED: At least one high-severity prompt failed human review.")
else:
    print("\n✅ GOVERNANCE PASSED: All prompts passed.")
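
The release checklist later asks for automatic metrics "above threshold", which implies an explicit comparison step. A minimal sketch of such a gate follows; the function name, the threshold values, and the example scores are all hypothetical and would need to be agreed with the governance board.

```python
def check_metric_thresholds(results, thresholds):
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, minimum in thresholds.items():
        score = results.get(name)
        if score is None:
            failures.append(f"{name}: missing")
        elif score < minimum:
            failures.append(f"{name}: {score:.4f} < {minimum:.4f}")
    return failures

# Hypothetical scores and thresholds, for illustration only.
example_results = {"rougeL": 0.4521, "bleu": 0.0830}
example_thresholds = {"rougeL": 0.40, "bleu": 0.10}

failures = check_metric_thresholds(example_results, example_thresholds)
print(failures)  # ['bleu: 0.0830 < 0.1000']
```

Returning the list of failures, instead of a single boolean, lets the evaluation report record why a release was blocked.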

## 📦 Packaging Artifacts for Model Registry

A robust pipeline concludes by packaging all necessary components into a versioned directory, ready for upload to a model registry like Hugging Face Hub, MLflow, or a simple versioned file share.

```text
# Example structure for a packaged model
models/
└── sop-assistant-v1.1/
    ├── adapter_config.json
    ├── adapter_model.bin
    ├── tokenizer_config.json
    ├── tokenizer.json
    ├── special_tokens_map.json
    ├── training_args.bin
    ├── evaluation_report.json  # Contains ROUGE, BLEU, and human review sign-off
    └── governance_log.csv      # A snapshot of the data governance records used
```

Keeping the evaluation report and the governance snapshot alongside the weights and tokenizer files makes each release auditable and easier to reproduce.
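
To make a packaged directory verifiable after upload, one option is to record a checksum for every file in it. The sketch below assumes a hypothetical `manifest.json` convention and a helper named `build_manifest`; neither is prescribed by any registry mentioned above. It runs against a temporary directory with dummy files so it does not touch real artifacts.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def build_manifest(model_dir):
    """Map each file's relative path to its SHA-256 hex digest."""
    model_dir = Path(model_dir)
    manifest = {}
    for path in sorted(model_dir.rglob("*")):
        if not path.is_file() or path.name == "manifest.json":
            continue
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large weight files are not loaded into memory at once.
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(chunk)
        manifest[path.relative_to(model_dir).as_posix()] = digest.hexdigest()
    return manifest

# Demo on a throwaway directory with dummy artifact files.
with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp) / "sop-assistant-v1.1"
    package_dir.mkdir()
    (package_dir / "adapter_config.json").write_text("{}", encoding="utf-8")
    (package_dir / "evaluation_report.json").write_bytes(b"abc")

    manifest = build_manifest(package_dir)
    (package_dir / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")

print(sorted(manifest))
```

A consumer of the registry entry can recompute the digests and compare them to the manifest to detect truncated or modified files.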

## 🚨 Release Checklist
- ✅ Governance QA complete
- ✅ Automatic metrics above threshold
- ✅ Human SMEs signed off
- ✅ Artifacts published to registry
- ✅ Rollback plan documented

## 🧪 Lab Assignment
1.  **Expand the Governance Log:** Add two more records to the `governance_records` DataFrame. One should be an approved document, and one should be a document that is rejected due to being outdated (e.g., `review_date` is from 2022).
2.  **Create a Training Script:** Create a file named `scripts/run_instruction_tuning.py`. You don't need to write the full training logic, but the file should:
    *   Import `argparse`, `torch`, and `transformers`.
    *   Set up an `ArgumentParser` to accept the command-line arguments shown in the launcher script (`--model_name_or_path`, `--dataset_path`, etc.).
    *   Include a `main()` function that prints the parsed arguments.
3.  **Enhance the Evaluation:** Add a new entry to the `eval_data` list that represents a "low" severity prompt. Run the evaluation and confirm that the human review logic correctly passes it.
4.  **Automate Artifact Packaging:** Write a Python function `package_artifacts(model_dir)` that takes a directory path and creates the folder structure and dummy files shown in the "Packaging Artifacts" section (e.g., creates `adapter_config.json`, `evaluation_report.json`, etc.).

## ✅ Checklist
- [ ] Data governance process is defined and auditable.
- [ ] Distributed training configuration is version-controlled.
- [ ] Evaluation includes both automatic metrics and human-in-the-loop review.
- [ ] Model artifacts are versioned and packaged for a model registry.