# 🧱 Week 11-12 · Notebook 01 · MLOps Fundamentals for the Manufacturing Copilot

This notebook lays the MLOps foundation for our capstone project: **The Manufacturing Copilot**. We will design an auditable, production-grade lifecycle for the GenAI agents that power our copilot's maintenance, quality, and reporting workflows.


## 🎯 Learning Objectives

- **Map the GenAI Lifecycle:** Design a complete MLOps lifecycle for the Manufacturing Copilot, from data ingestion and model training to production monitoring and feedback.
- **Implement Experiment Tracking:** Set up an MLflow server to track experiments, log parameters, and version models for the copilot's different agents (Vision, RAG, Reporting).
- **Define Governance Artifacts:** Create the necessary documentation and artifacts (model cards, release notes) to ensure the copilot is auditable and compliant with manufacturing standards.
- **Assess MLOps Readiness:** Produce a readiness scorecard to evaluate the maturity of our MLOps processes before deploying the final capstone project.


## 🧩 Scenario: Building a Production-Ready Copilot

As the lead AI Engineer for the **Manufacturing Copilot** project, your task is to establish a robust MLOps framework. The copilot will be deployed across multiple factory sites, each with slightly different data and requirements. Leadership requires a unified, auditable, and scalable lifecycle to ensure that all components of the copilot (models, prompts, and datasets) are versioned, tested, and approved before the final production launch in Week 12.


## 🔄 The Manufacturing Copilot MLOps Lifecycle

A GenAI application needs a more complex lifecycle than a traditional ML model does. We have to manage not just code and models, but also prompts, vector embeddings, and multi-agent interactions.

**Our Lifecycle Blueprint:**
```
[Data Ingestion] -> [Vector DB & Feature Store] -> [Model & Prompt Versioning] -> [CI/CD] -> [Deployment] -> [Monitoring] -> [Feedback Loop]
```

Here’s how each stage applies to our Manufacturing Copilot:

| Stage                  | Manufacturing Copilot Considerations                                  | Tools Used                               | Governance Evidence                     |
| ---------------------- | --------------------------------------------------------------------- | ---------------------------------------- | --------------------------------------- |
| **Data Ingestion**     | Scrub PII from maintenance logs, ensure SOPs are current.             | `Pandas`, `Great Expectations`           | Data quality reports, SME sign-off      |
| **Vector DB / Features** | Manage embeddings for RAG agent, version feature transformations.     | `ChromaDB`, `PostgreSQL`                 | Vector DB schema, feature definitions   |
| **Model/Prompt Versioning** | Tag models by agent (Vision, RAG), version prompts for each task. | `Git`, `MLflow`, `DVC`                   | Model cards, prompt changelogs          |
| **CI/CD**              | Run safety checks, agent-specific tests, and integration tests.       | `GitHub Actions`, `pytest`               | Pipeline logs, test coverage reports    |
| **Deployment**         | Deploy API to Cloud Run, manage container versions.                   | `Docker`, `Terraform`, `GCP Cloud Run`   | Deployment manifests, runbooks          |
| **Monitoring**         | Track agent latency, hallucination rates, and token costs.            | `Prometheus`, `Grafana`, `BigQuery`      | KPI dashboards, alert notifications     |
| **Feedback Loop**      | Collect ratings from factory technicians on agent performance.        | Custom UI, `PostgreSQL`                  | Feedback analysis dashboard             |
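
The "Model/Prompt Versioning" row lists prompt changelogs as governance evidence. One way to make such a changelog verifiable is to store a content hash of each prompt template next to its version label. The sketch below uses only the standard library; `prompt_fingerprint`, `changelog_entry`, the author label, and the sample template text are hypothetical names chosen for illustration, not part of the course codebase.

```python
import hashlib
import json
from datetime import datetime, timezone


def prompt_fingerprint(template: str) -> str:
    """Return a short, stable SHA-256 fingerprint of a prompt template."""
    # Normalize line endings and trailing whitespace so cosmetic edits
    # made by different editors do not change the fingerprint.
    normalized = "\n".join(line.rstrip() for line in template.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def changelog_entry(agent: str, version: str, template: str, author: str) -> dict:
    """Build one changelog record for a prompt template version."""
    return {
        "agent": agent,
        "version": version,
        "sha256_12": prompt_fingerprint(template),
        "author": author,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical template text, for illustration only.
template_v3 = """
You are a maintenance assistant.
Answer using only the provided context and cite the source document.
Context: {context}
Question: {question}
"""

entry = changelog_entry("RAG_Agent", "v3", template_v3, author="ai-eng-team")
print(json.dumps(entry, indent=2))
```

If the fingerprint recorded in a release note no longer matches the template file in Git, the prompt changed after approval and should be reviewed again.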


In [None]:
# Example: Logging a Manufacturing Copilot Agent to MLflow
import os
from datetime import datetime
from pathlib import Path

import mlflow
import pandas as pd

# --- Configuration ---
# In a real project, this would come from a config file (e.g., config.py or .env)
# For local testing, run `mlflow ui --backend-store-uri sqlite:///mlflow.db` in your terminal
# (from this notebook's directory) to see the results.
# This setup will create a local `mlflow.db` file to store experiment data.
MLFLOW_TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", "sqlite:///mlflow.db")
EXPERIMENT_NAME = "manufacturing_copilot_agents"
AGENT_NAME = "RAG_Agent"
PLANT_ID = "Pune-IN"

# --- Setup ---
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
mlflow.set_experiment(EXPERIMENT_NAME)

# Create dummy artifacts for demonstration
Path("artifacts/docs").mkdir(parents=True, exist_ok=True)
Path("artifacts/release_notes").mkdir(parents=True, exist_ok=True)

# --- MLflow Run ---
# We start a run to track all the metadata for a specific version of our RAG agent.
# A descriptive run name helps in quickly identifying experiments in the MLflow UI.
run_name = f"{AGENT_NAME}-{PLANT_ID}-{datetime.utcnow().strftime('%Y-%m-%d_%H-%M-%S')}"

with mlflow.start_run(run_name=run_name) as run:
    run_id = run.info.run_id
    print(f"Starting MLflow Run: {run.info.run_name} (ID: {run_id})")
    
    # 1. Log Parameters: Key-value pairs that configure the agent. These are immutable.
    print("Logging parameters...")
    mlflow.log_params({
        "plant_id": PLANT_ID,
        "agent_type": AGENT_NAME,
        "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
        "llm_model": "meta-llama/Llama-2-7b-chat-hf",
        "rag_chunk_size": 512,
        "rag_chunk_overlap": 50,
    })

    # 2. Log Metrics: Key-value pairs that measure performance. These can be updated.
    print("Logging metrics...")
    mlflow.log_metrics({
        "retrieval_precision_at_5": 0.92,
        "answer_relevancy": 0.88,
        "average_latency_ms": 250,
        "technician_satisfaction_score": 4.6,
    })

    # 3. Log Artifacts: Any file you want to associate with the run (e.g., model files, plots, docs).
    print("Logging artifacts...")
    model_card_content = f"""
# Model Card: {AGENT_NAME} for {PLANT_ID}

This agent provides maintenance support by answering questions based on technical documents.
- **Embedding Model:** sentence-transformers/all-MiniLM-L6-v2
- **LLM:** Llama-2-7b-chat-hf
- **Performance:** Retrieval Precision@5 is 92%. See metrics for more details.
"""
    model_card_path = Path("artifacts/docs/rag_agent_model_card.md")
    model_card_path.write_text(model_card_content)
    mlflow.log_artifact(str(model_card_path), artifact_path="documentation")

    # 4. Set Tags: Key-value pairs for organizing and filtering runs.
    print("Setting tags...")
    mlflow.set_tags({
        "release_cycle": "Q4-2025",
        "jira_ticket": "MFG-123",
        "is_regulated": "True",
        "status": "VALIDATED"
    })

    # In a real scenario, you would also log the trained model/agent itself.
    # For a RAG agent, this might be the serialized retriever pipeline.
    # Example: mlflow.langchain.log_model(my_rag_chain, artifact_path="rag_agent_model")

    print("\nRun completed successfully.")
    print(f"To view the results (local SQLite setup), run 'mlflow ui --backend-store-uri sqlite:///mlflow.db' in your terminal and open the '{EXPERIMENT_NAME}' experiment.")

run_id
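
The run name above is built with `datetime.utcnow()`, which is deprecated as of Python 3.12. A minimal sketch of a timezone-aware alternative follows; `build_run_name` is a hypothetical helper, and the fixed timestamp is only there to make the output reproducible.

```python
from datetime import datetime, timezone
from typing import Optional


def build_run_name(agent_name: str, plant_id: str, now: Optional[datetime] = None) -> str:
    """Build '<agent>-<plant>-<UTC timestamp>' using a timezone-aware clock."""
    # datetime.now(timezone.utc) returns an aware datetime and avoids the
    # deprecation warning that datetime.utcnow() raises on Python 3.12+.
    moment = now if now is not None else datetime.now(timezone.utc)
    # Convert to UTC so names created at different factory sites sort consistently.
    # Note: a naive datetime passed in here is interpreted as local time.
    moment = moment.astimezone(timezone.utc)
    return f"{agent_name}-{plant_id}-{moment.strftime('%Y-%m-%d_%H-%M-%S')}"


fixed = datetime(2025, 10, 1, 8, 30, 0, tzinfo=timezone.utc)
print(build_run_name("RAG_Agent", "Pune-IN", now=fixed))
# prints: RAG_Agent-Pune-IN-2025-10-01_08-30-00
```

Passing `now` explicitly also makes the helper easy to unit test, which matters once run names are parsed by downstream tooling.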

### 🗃️ Registering the Agent in the Model Registry

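Once an experiment run is validated, we register its "model" (in this case, the agent's configuration and associated artifacts) to the MLflow Model Registry. This creates a versioned, governable asset that can be promoted through deployment stages (e.g., `Staging`, `Production`). Note that MLflow 2.9 and later deprecate stages in favor of model version aliases and tags, so depending on your installed version the UI may offer aliases instead.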

In [None]:
# Note: `mlflow.register_model` registers a model that has already been logged to a run.
# Normally that is a model object MLflow knows how to serialize (e.g., a scikit-learn,
# PyTorch, or LangChain model). Since we are demonstrating the concept without a real
# model object, we'll create a placeholder artifact and register that. In a real project,
# you'd use a specific flavor such as `mlflow.langchain.log_model`, which can log and
# register in one call when you pass `registered_model_name`.

# Create a dummy model artifact to represent our agent
dummy_model_dir = Path("artifacts/dummy_model")
dummy_model_dir.mkdir(parents=True, exist_ok=True)  # parents=True so this cell also works on its own
(dummy_model_dir / "model_config.json").write_text('{"agent_type": "RAG", "version": 1}')
(dummy_model_dir / "retriever.pkl").write_text("A pickled retriever object would go here.")

# It's best practice to log the model artifact within its original run
with mlflow.start_run(run_id=run_id):
    mlflow.log_artifacts(str(dummy_model_dir), artifact_path="rag_agent_model")

# Register the model from the completed run to the Model Registry
model_name = "copilot_rag_agent"
model_uri = f"runs:/{run_id}/rag_agent_model"

print(f"Registering model '{model_name}' from URI: {model_uri}")

# Default to None so later cells can detect a failed registration instead of hitting a NameError.
registered_model = None

try:
    registered_model = mlflow.register_model(
        model_uri=model_uri,
        name=model_name,
        tags={"agent_framework": "langchain", "use_case": "maintenance_support"}
    )
    print(f"Model '{model_name}' (Version: {registered_model.version}) has been successfully registered.")

except mlflow.exceptions.MlflowException as e:
    print(f"Error registering model: {e}")
    print("This can happen if the tracking backend lacks Model Registry support or cannot be reached.")
    # As a fallback for demonstration, look up the newest existing version, if any
    client = mlflow.tracking.MlflowClient()
    try:
        existing_versions = client.search_model_versions(f"name='{model_name}'")
        if existing_versions:
            registered_model = max(existing_versions, key=lambda mv: int(mv.version))
            print(f"Found existing model '{model_name}' (Version: {registered_model.version}).")
        else:
            print("Could not register or find the model.")
    except mlflow.exceptions.MlflowException:
        # Catch only MLflow errors; a bare `except:` would also swallow KeyboardInterrupt.
        print("Could not register or find the model.")

if registered_model:
    # You can also add descriptions to the registered model version
    client = mlflow.tracking.MlflowClient()
    client.update_model_version(
        name=model_name,
        version=registered_model.version,
        description=f"This is Version {registered_model.version} of the RAG agent for the Manufacturing Copilot, validated for the {PLANT_ID} plant."
    )
    print("Added description to the registered model version.")
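
The registry introduction above describes promotion through `Staging` and `Production`. On MLflow versions where stages are deprecated, the same promotion can be expressed with a model version alias. The sketch below assumes MLflow 2.3 or later, where `MlflowClient.set_registered_model_alias` and `MlflowClient.get_model_version_by_alias` are available; `promote_with_alias` and the alias name `staging` are hypothetical choices. The client is passed in as an argument so the helper can be exercised without a running tracking server.

```python
from typing import Any


def promote_with_alias(client: Any, model_name: str, version: Any, alias: str = "staging") -> str:
    """Point `alias` at `version` of `model_name` and return the alias-based model URI.

    `client` is expected to be an mlflow.tracking.MlflowClient (MLflow >= 2.3 assumed).
    """
    # Move (or create) the alias so it points at the given version.
    client.set_registered_model_alias(model_name, alias, str(version))
    # Read the alias back to confirm which version it now resolves to.
    resolved = client.get_model_version_by_alias(model_name, alias)
    if str(resolved.version) != str(version):
        raise RuntimeError(
            f"Alias '{alias}' resolves to version {resolved.version}, expected {version}"
        )
    # Consumers can reference the model by alias instead of a hard-coded version number.
    return f"models:/{model_name}@{alias}"


# Usage inside this notebook (requires the registration cell to have succeeded):
#   client = mlflow.tracking.MlflowClient()
#   uri = promote_with_alias(client, "copilot_rag_agent", registered_model.version)
```

Because an alias can be moved to a new version at any time, record the concrete version number in the release note as well, so the audit trail stays unambiguous.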

## 🧾 Creating an Audit-Ready Release Note

For compliance and governance, every model promoted to production needs a release note. This JSON file captures all the critical metadata, linking our agent back to the business context, data sources, and risk assessments.

In [None]:
import json

# Ensure the previous cells have run and `registered_model` is not None
if 'registered_model' in locals() and registered_model is not None:
    # Dynamically create the release note from the run data
    client = mlflow.tracking.MlflowClient()
    run_data = client.get_run(run_id).data

    release_note = {
        "release_id": f"REL-{AGENT_NAME}-{PLANT_ID}-{datetime.utcnow().strftime('%Y%m%d%H%M')}",
        "release_date": datetime.utcnow().isoformat(),
        "approved_by": "Jane Doe, Head of Quality Assurance",
        "status": "APPROVED_FOR_STAGING",
        "jira_ticket": run_data.tags.get("jira_ticket"),
        "mlflow_run_id": run_id,
        "mlflow_model_name": registered_model.name,
        "mlflow_model_version": registered_model.version,
        "data_sources": [
            "s3://manufacturing-docs/pune/sop_manuals_v8.pdf",
            "s3://manufacturing-logs/pune/maintenance_logs_2025_Q3.parquet"
        ],
        "prompt_template_version": "prompts/rag_agent/v3.yaml",
        "risk_assessment": {
            "hallucination_mitigation": "Using RAG with a high relevancy threshold (0.9+) and citing sources in every response.",
            "pii_leakage_control": "PII scrubbed during data ingestion pipeline using Presidio.",
            "safety_guardrails": "Input/output content filtering for harmful language using Azure Content Safety.",
            "overall_risk_level": "LOW"
        },
        "evaluation_summary": {
            "retrieval_precision_at_5": run_data.metrics.get("retrieval_precision_at_5"),
            "answer_relevancy": run_data.metrics.get("answer_relevancy"),
            "technician_satisfaction_score": run_data.metrics.get("technician_satisfaction_score")
        }
    }

    release_note_path = Path(f"artifacts/release_notes/{release_note['release_id']}.json")
    with open(release_note_path, 'w', encoding='utf-8') as f:
        json.dump(release_note, f, indent=4)

    print(f"Audit-ready release note created at: {release_note_path}")
    
    # Log the release note back to the original MLflow run for a complete audit trail
    with mlflow.start_run(run_id=run_id):
        mlflow.log_artifact(str(release_note_path), artifact_path="release_documentation")
    
    print("Release note has been logged to the MLflow run.")
    
    # Display the created release note
    print("\n--- Release Note Content ---")
    print(json.dumps(release_note, indent=2))
else:
    print("Skipping release note creation because the model was not successfully registered.")
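
A release note is only audit-ready if its required fields are actually populated. Because the cell above reads tags and metrics with `.get()`, a missing value silently becomes `null` in the JSON. The following sketch shows one way to catch that before the note is logged; `REQUIRED_FIELDS`, `validate_release_note`, and the sample values are hypothetical, and a real project would agree on the required field list with its quality team.

```python
# Field names mirror the release note built above; the required set is an assumption.
REQUIRED_FIELDS = {
    "release_id": str,
    "release_date": str,
    "approved_by": str,
    "status": str,
    "mlflow_run_id": str,
    "data_sources": list,
    "risk_assessment": dict,
    "evaluation_summary": dict,
}


def validate_release_note(note: dict) -> list:
    """Return a list of human-readable problems; an empty list means the note passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        value = note.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(value, expected_type):
            problems.append(
                f"{field} should be {expected_type.__name__}, got {type(value).__name__}"
            )
        elif not value:
            problems.append(f"{field} is empty")
    # Metrics read with dict.get() are None when the run never logged them.
    summary = note.get("evaluation_summary")
    if isinstance(summary, dict):
        for metric, value in summary.items():
            if value is None:
                problems.append(f"evaluation metric not logged: {metric}")
    return problems


sample = {
    "release_id": "REL-RAG_Agent-Pune-IN-202510010830",
    "release_date": "2025-10-01T08:30:00",
    "approved_by": "Jane Doe, Head of Quality Assurance",
    "status": "APPROVED_FOR_STAGING",
    "mlflow_run_id": "abc123",  # placeholder run ID for illustration
    "data_sources": [],
    "risk_assessment": {"overall_risk_level": "LOW"},
    "evaluation_summary": {"answer_relevancy": None},
}
for problem in validate_release_note(sample):
    print(problem)
```

In the notebook, calling `validate_release_note(release_note)` before `json.dump` and skipping the write when the list is non-empty would keep incomplete notes out of the audit trail.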

## 🧮 MLOps Readiness Scorecard for the Copilot

Before moving to full-scale deployment, we need to honestly assess the maturity of our MLOps practices. This scorecard helps us identify and prioritize gaps.

In [None]:
# The 'seaborn-v0_8-whitegrid' style used below ships with matplotlib, so seaborn itself is not needed.
import matplotlib.pyplot as plt

scorecard_data = [
    {
        "Dimension": "1. Data Management",
        "Score": 3,
        "Notes": "Automated data quality checks are in place. Need versioning for large datasets (e.g., DVC).",
    },
    {
        "Dimension": "2. Model & Prompt Engineering",
        "Score": 4,
        "Notes": "MLflow tracking is robust for experiments. Prompt templates are versioned in Git.",
    },
    {
        "Dimension": "3. Continuous Integration (CI)",
        "Score": 2,
        "Notes": "Basic unit tests exist. Need to add integration tests for agent workflows and security scanning (e.g., Gitleaks).",
    },
    {
        "Dimension": "4. Continuous Deployment (CD)",
        "Score": 2,
        "Notes": "Manual deployment process. Need to build automated deployment pipeline (e.g., Terraform + GitHub Actions).",
    },
    {
        "Dimension": "5. Monitoring & Observability",
        "Score": 1,
        "Notes": "No application or model monitoring currently in place. This is a critical gap. Need to implement Prometheus and Grafana.",
    },
    {
        "Dimension": "6. Governance & Security",
        "Score": 3,
        "Notes": "Release notes provide an audit trail. Need to integrate secrets management (e.g., GCP Secret Manager or HashiCorp Vault).",
    },
]

scorecard = pd.DataFrame(scorecard_data)
scorecard.set_index("Dimension", inplace=True)

# --- Visualize the Scorecard ---
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(10, 6))

# Create a color map: 1-2 are red, 3 is yellow, 4-5 are green
colors = []
for score in scorecard['Score']:
    if score <= 2:
        colors.append('#d9534f') # Red
    elif score == 3:
        colors.append('#f0ad4e') # Yellow
    else:
        colors.append('#5cb85c') # Green

scorecard['Score'].plot(kind='barh', ax=ax, color=colors, width=0.8)

# List the dimensions top-down in scorecard order (barh plots the first row at the bottom by default)
ax.invert_yaxis()

# Add score labels to the bars
for i, score in enumerate(scorecard['Score']):
    ax.text(score + 0.1, i, str(score), va='center', ha='left', fontsize=12, fontweight='bold')

ax.set_xlim(0, 5.5)
ax.set_xlabel("Maturity Score (1=Low, 5=High)", fontsize=12)
ax.set_ylabel("")
ax.set_title("MLOps Readiness Scorecard for the Manufacturing Copilot", fontsize=16, fontweight='bold')
ax.tick_params(axis='y', labelsize=12)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)

plt.tight_layout()
plt.show()

print("\n--- Scorecard Details ---")
scorecard

### Scorecard Interpretation & Action Plan

- **Score Legend:** 1 (Non-existent) -> 5 (Fully Automated & Optimized)
- **Critical Gaps:** Our most significant weakness is **Monitoring (Score 1)**, followed by **CI** and **CD (both Score 2)**.
- **Action Plan for Capstone:**
    1.  **Priority 1 (This Week):** Implement foundational monitoring for the FastAPI application.
    2.  **Priority 2 (This Week):** Develop Terraform scripts for our GCP infrastructure.
    3.  **Priority 3 (Next Week):** Build the CI/CD pipeline in GitHub Actions to automate testing and deployment.
    4.  **Priority 4 (Next Week):** Integrate GCP Secret Manager for handling API keys and database credentials.

This scorecard will be our guide for the rest of this module. We will systematically address these gaps to build a production-ready Manufacturing Copilot.
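
The gap list above can also be derived directly from the scores, which keeps the action plan in sync when the scorecard is updated. This is a minimal sketch: `rank_gaps` is a hypothetical helper, and the threshold of 2 is an assumption that matches the red band in the chart.

```python
# Scores copied from the scorecard above.
scores = {
    "Data Management": 3,
    "Model & Prompt Engineering": 4,
    "Continuous Integration (CI)": 2,
    "Continuous Deployment (CD)": 2,
    "Monitoring & Observability": 1,
    "Governance & Security": 3,
}


def rank_gaps(scores: dict, threshold: int = 2) -> list:
    """Return (dimension, score) pairs at or below `threshold`, weakest first."""
    gaps = [(name, score) for name, score in scores.items() if score <= threshold]
    # sorted() is stable, so tied dimensions keep their scorecard order.
    return sorted(gaps, key=lambda pair: pair[1])


for name, score in rank_gaps(scores):
    print(f"{score}/5  {name}")

average = sum(scores.values()) / len(scores)
print(f"Average maturity: {average:.2f} / 5")
# prints: Average maturity: 2.50 / 5
```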


## 🧪 Lab Assignment: Your Turn!

Now it's your turn to practice these core MLOps concepts.

1.  **Set Up and Explore MLflow:**
    -   Ensure you have MLflow installed (`pip install mlflow`).
    -   Execute all the Python cells above to create a local `mlflow.db` file and log your first experiment run.
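    -   Open a new terminal in VS Code, change to the notebook's directory, and run: `mlflow ui --backend-store-uri sqlite:///mlflow.db`. Without the `--backend-store-uri` flag, the UI may read a different default store and not show the runs logged above.
    -   This starts a local MLflow UI server. Open the URL it prints (usually `http://127.0.0.1:5000`) in your browser.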
    -   Navigate to the `manufacturing_copilot_agents` experiment and click on the run you just created. Explore the parameters, metrics, tags, and artifacts.

2.  **Log a New "Vision Agent" Experiment:**
    -   Copy the main MLflow run cell (the one that contains `with mlflow.start_run(...)`).
    -   Modify it to log an experiment for a hypothetical `Vision_Agent`.
    -   **Change the `run_name`** to include "Vision_Agent".
    -   **Update the parameters:**
        -   `agent_type`: "Vision_Agent"
        -   `embedding_model`: "google/vit-base-patch16-224"
        -   `llm_model`: "N/A"
    -   **Update the metrics:**
        -   `defect_detection_accuracy`: 0.98
        -   `false_positive_rate`: 0.05
        -   `average_inference_ms`: 80
    -   **Update the tags:**
        -   `jira_ticket`: "MFG-456"
    -   Execute the cell to log this new run.

3.  **Compare the Two Agent Runs:**
    -   Go back to the MLflow UI. You should now see two runs in your experiment.
    -   Select both the `RAG_Agent` and `Vision_Agent` runs by checking the boxes next to them.
    -   Click the **"Compare"** button.
    -   Analyze the comparison view. Notice how MLflow highlights the differences in parameters and metrics, which is invaluable for debugging and analysis.

4.  **Register the Vision Agent Model:**
    -   Using the `run_id` from your `Vision_Agent` run, adapt the model registration cell to register this new agent.
    -   Give it a new model name, such as `copilot_vision_agent`.
    -   Add a description to the registered model version explaining its purpose (e.g., "This model detects scratches and dents on metal surfaces.").

5.  **Transition the RAG Agent to "Staging":**
    -   In the MLflow UI, navigate to the "Models" page.
    -   Click on the `copilot_rag_agent`.
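    -   Find the version you registered and use the "Stage" dropdown to transition it from `None` to `Staging`. (Stages are deprecated as of MLflow 2.9; if your UI does not show a "Stage" dropdown, assign an alias such as `staging` to the version instead.)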
    -   Add a comment to the transition, such as "Initial validation passed. Ready for integration testing." This action creates an audit trail for model promotion.

## ✅ Checklist for this Notebook

- [X] Mapped the MLOps lifecycle for the Manufacturing Copilot.
- [X] Successfully configured MLflow and logged a sample agent experiment.
- [X] Registered a model version in the MLflow Model Registry.
- [X] Generated a compliant, audit-ready release note.
- [X] Created and analyzed the MLOps readiness scorecard to identify critical gaps.
- [ ] **TODO:** Complete the Lab Assignment to practice these concepts yourself.


## 📚 References and Further Reading

-   [MLflow Official Documentation](https://mlflow.org/docs/latest/index.html)
-   [The MLOps Lifecycle with MLflow (Databricks Blog)](https://www.databricks.com/blog/2019/10/22/the-mlops-lifecycle-with-mlflow.html)
-   [Awesome MLOps (A curated list of MLOps tools)](https://github.com/visenger/awesome-mlops)
-   [Great Expectations Documentation](https://greatexpectations.io/docs/)
-   [Terraform for GCP Documentation](https://registry.terraform.io/providers/hashicorp/google/latest/docs)
