# 🚀 Notebook 7: Complete MLOps Pipeline Walkthrough

**Author:** Amey Talkatkar | **Course:** MLOps with Agentic AI

## 🎯 Learning Objectives
- Execute a complete end-to-end MLOps pipeline
- Integrate all components (DVC, MLflow, Airflow, FastAPI)
- Understand the production workflow from data to deployment
- Verify each stage with real outputs
- Troubleshoot issues in an integrated system
- Understand the full MLOps lifecycle

## 🔥 The Problem (Final Boss!)

You've learned:
- Notebook 1: EDA
- Notebook 2: Feature Engineering
- Notebook 3: Model Training
- Notebook 4: MLflow Tracking
- Notebook 5: DVC Versioning
- Notebook 6: Airflow Orchestration

**But how do they all work TOGETHER?** 🤔

Today: connect all the dots! Data → Training → Tracking → Versioning → Orchestration → Deployment

---

## 📋 Pipeline Overview

```
┌─────────────────────────────────────────────────────────────┐
│                    COMPLETE MLOPS PIPELINE                  │
└─────────────────────────────────────────────────────────────┘

1. DATA VERSIONING (DVC)
   └─> Track data changes in Git-like manner

2. DATA PREPARATION (Python)
   └─> Load from DVC → Validate → Engineer Features

3. MODEL TRAINING (scikit-learn, XGBoost)
   └─> Train 3 models in parallel

4. EXPERIMENT TRACKING (MLflow)
   └─> Log params, metrics, artifacts

5. MODEL REGISTRY (MLflow)
   └─> Register → Staging → Production

6. ORCHESTRATION (Airflow)
   └─> Automate entire workflow

7. DEPLOYMENT (FastAPI)
   └─> Serve predictions via REST API

8. MONITORING (Streamlit)
   └─> Visualize metrics and performance
```

## 🔧 Setup

In [None]:
import os
import sys
import subprocess
import time
import requests
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Change to project root
os.chdir('..')
print(f"📂 Working directory: {os.getcwd()}")

# Check services
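def check_service(name, url):
    """Return True if the service at `url` answers with HTTP 200."""
    try:
        response = requests.get(url, timeout=3)
        if response.status_code == 200:
            print(f"✅ {name}: Running")
            return True
        else:
            print(f"⚠️  {name}: Responding but status {response.status_code}")
            return False
    except requests.RequestException:
        # Connection refused, timeout, DNS failure, and similar network errors
        print(f"❌ {name}: Not running")
        return False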

print("\n🔍 Checking Services...")
airflow_ok = check_service("Airflow", "http://localhost:8080/health")
api_ok = check_service("FastAPI", "http://localhost:8000/health")
dashboard_ok = check_service("Streamlit", "http://localhost:8501")

if not all([airflow_ok, api_ok, dashboard_ok]):
    print("\n⚠️  Some services not running. Start them:")
    print("  airflow standalone &")
    print("  uvicorn api.main:app --host 0.0.0.0 --port 8000 &")
    print("  streamlit run dashboard/app.py &")

## Stage 1: Data Versioning with DVC

First, ensure data is tracked and versioned.

In [None]:
print("📊 Stage 1: Data Versioning\n")

# Check if data exists
data_file = 'data/raw/sales_data.csv'
dvc_file = f'{data_file}.dvc'

if not os.path.exists(data_file):
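    print("🔄 Generating fresh data...")
    result = subprocess.run(
        # sys.executable runs the script with the notebook's own interpreter
        [sys.executable, 'data/generate_data.py', '--rows', '10000', '--output', data_file],
        capture_output=True, text=True
    )
    print(result.stdout)
    if result.returncode != 0:
        # Surface the error instead of failing later at pd.read_csv
        print(result.stderr)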

# Check DVC tracking
if os.path.exists(dvc_file):
    print(f"✅ Data tracked with DVC: {dvc_file}")
    
    # Read DVC metadata
    import yaml
    with open(dvc_file, 'r') as f:
        dvc_meta = yaml.safe_load(f)
    print(f"   MD5: {dvc_meta['outs'][0]['md5']}")
    print(f"   Size: {dvc_meta['outs'][0]['size']:,} bytes")
else:
    print("⚠️  Data not tracked with DVC")
    print("   Run: dvc add data/raw/sales_data.csv")

# Load and preview data
df = pd.read_csv(data_file, parse_dates=['date'])
print(f"\n📊 Dataset: {len(df):,} rows × {len(df.columns)} columns")
print(f"   Date range: {df['date'].min().date()} to {df['date'].max().date()}")
print(f"   Regions: {', '.join(df['region'].unique())}")
print(f"   Products: {', '.join(df['product'].unique())}")
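
The `.dvc` file records a hash and a size for the tracked file, so the working copy can be checked against what was versioned. A minimal sketch, assuming the recorded `md5` is a plain MD5 of the file bytes (older DVC versions may normalise line endings in text files before hashing, so a mismatch there does not necessarily mean the data changed). `file_md5` and `matches_dvc` are hypothetical helper names, not part of DVC:

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets are never fully loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_dvc(path, recorded_md5, recorded_size):
    """Compare the file on disk with the hash and size recorded in the .dvc file."""
    # The cheap size check runs first; hashing only happens if the sizes agree.
    if os.path.getsize(path) != recorded_size:
        return False
    return file_md5(path) == recorded_md5
```

With the variables from this stage, the call would look like `matches_dvc(data_file, dvc_meta['outs'][0]['md5'], dvc_meta['outs'][0]['size'])`. For the authoritative answer, `dvc status` reports whether tracked data has changed.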

## Stage 2: Feature Engineering Pipeline

Transform raw data into ML-ready features.

In [None]:
print("🔧 Stage 2: Feature Engineering\n")

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import joblib

# Sort by date
df = df.sort_values('date').reset_index(drop=True)

# Create lag features
print("Creating lag features...")
for lag in [1, 7, 30]:
    df[f'sales_lag_{lag}'] = df.groupby(['region', 'product'])['sales'].shift(lag)

# Rolling features (shifted by one row so each window only sees past sales,
# never the current row's target value)
print("Creating rolling window features...")
for window in [7, 30]:
    df[f'sales_rolling_mean_{window}'] = df.groupby(['region', 'product'])['sales'].transform(
        lambda x: x.shift(1).rolling(window=window, min_periods=1).mean()
    )

# One-hot encoding
print("Encoding categorical variables...")
df_encoded = pd.get_dummies(df, columns=['region', 'product', 'season'], drop_first=True)

# Define features
feature_cols = [
    'price', 'quantity', 'month', 'day_of_week',
    'sales_lag_1', 'sales_lag_7', 'sales_lag_30',
    'sales_rolling_mean_7', 'sales_rolling_mean_30',
    'is_weekend',
] + [col for col in df_encoded.columns if col.startswith(('region_', 'product_', 'season_'))]

# Drop NaN rows
df_clean = df_encoded.dropna(subset=feature_cols)
X = df_clean[feature_cols]
y = df_clean['sales']

# Train/test split
print("Splitting data...")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=False
)

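# Feature scaling (fit on the training split only, then apply to the test split)
print("Scaling features...")
numerical_cols = ['price', 'quantity', 'month', 'day_of_week'] + \
                 [col for col in feature_cols if 'lag' in col or 'rolling' in col]
scaler = StandardScaler()
# Work on explicit copies so the assignments below never write into a view of X
X_train, X_test = X_train.copy(), X_test.copy()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])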

# Save processed data
os.makedirs('data/processed', exist_ok=True)
X_train.to_csv('data/processed/X_train.csv', index=False)
X_test.to_csv('data/processed/X_test.csv', index=False)
y_train.to_csv('data/processed/y_train.csv', index=False, header=True)
y_test.to_csv('data/processed/y_test.csv', index=False, header=True)
joblib.dump(scaler, 'data/processed/scaler.joblib')

print(f"\n✅ Feature engineering complete:")
print(f"   Train: {len(X_train):,} samples")
print(f"   Test:  {len(X_test):,} samples")
print(f"   Features: {len(feature_cols)}")
print(f"   Saved to: data/processed/")
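
To make the lag and rolling-window arithmetic concrete, the sketch below reproduces both on a short list in plain Python rather than pandas. It assumes the series is already sorted by date within a single region/product group; `lag` and `rolling_mean_of_past` are illustrative names. The rolling mean here averages only earlier values, equivalent to `x.shift(1).rolling(window, min_periods=1).mean()` in pandas, which keeps the current row's `sales` (the prediction target) out of its own features:

```python
def lag(values, k):
    """Value from k steps earlier; the first k positions have no history."""
    return [values[i - k] if i >= k else None for i in range(len(values))]


def rolling_mean_of_past(values, window):
    """Mean of up to `window` values strictly before each position."""
    out = []
    for i in range(len(values)):
        past = values[max(0, i - window):i]
        out.append(sum(past) / len(past) if past else None)
    return out


sales = [10, 20, 30, 40, 50]
print(lag(sales, 1))                   # [None, 10, 20, 30, 40]
print(rolling_mean_of_past(sales, 3))  # [None, 10.0, 15.0, 20.0, 30.0]
```

The `None` entries correspond to the NaN rows that `dropna(subset=feature_cols)` removes, which is why the cleaned dataset is shorter than the raw one.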

## Stage 3: Model Training with MLflow Tracking

Train multiple models and track with MLflow.

In [None]:
print("🤖 Stage 3: Model Training with MLflow\n")

import mlflow
import mlflow.sklearn
import mlflow.xgboost
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Configure MLflow
mlflow.set_tracking_uri(os.getenv('MLFLOW_TRACKING_URI', 'http://localhost:5000'))
mlflow.set_experiment('complete_pipeline_demo')

# Training function
def train_and_log_model(model, model_name, model_type):
    with mlflow.start_run(run_name=f"{model_name}_{datetime.now():%Y%m%d_%H%M}"):
        # Log model type
        mlflow.log_param("model_type", model_type)
        
        # Log hyperparameters
        if hasattr(model, 'get_params'):
            mlflow.log_params(model.get_params())
        
        # Train
        start_time = time.time()
        model.fit(X_train, y_train)
        train_time = time.time() - start_time
        
        # Predict
        y_pred = model.predict(X_test)
        
        # Calculate metrics
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        
        # Log metrics
        mlflow.log_metrics({
            "rmse": rmse,
            "mae": mae,
            "r2_score": r2,
            "train_time_seconds": train_time
        })
        
        # Log model
        if model_type == "XGBoost":
            mlflow.xgboost.log_model(model, "model")
        else:
            mlflow.sklearn.log_model(model, "model")
        
        print(f"✅ {model_name}: RMSE=${rmse:,.2f}, R²={r2:.4f}, Time={train_time:.2f}s")
        
        return {
            'run_id': mlflow.active_run().info.run_id,
            'model_name': model_name,
            'rmse': rmse,
            'mae': mae,
            'r2': r2
        }

# Train models
results = []

print("Training Linear Regression...")
lr_result = train_and_log_model(
    LinearRegression(),
    "linear_regression",
    "LinearRegression"
)
results.append(lr_result)

print("\nTraining Random Forest...")
rf_result = train_and_log_model(
    RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1),
    "random_forest",
    "RandomForest"
)
results.append(rf_result)

print("\nTraining XGBoost...")
xgb_result = train_and_log_model(
    XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42, n_jobs=-1, verbosity=0),
    "xgboost",
    "XGBoost"
)
results.append(xgb_result)

# Compare results
results_df = pd.DataFrame(results).sort_values('rmse')
print("\n📊 Model Comparison:")
display(results_df[['model_name', 'rmse', 'mae', 'r2']])

best_model = results_df.iloc[0]
print(f"\n🏆 Best Model: {best_model['model_name']}")
print(f"   RMSE: ${best_model['rmse']:,.2f}")
print(f"   Run ID: {best_model['run_id']}")
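
The three metrics logged for each run follow directly from the residuals. A plain-Python sketch of what `mean_squared_error`, `mean_absolute_error` and `r2_score` compute, shown on made-up numbers (the helper name `regression_metrics` is illustrative):

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R² from paired actual and predicted values.

    Assumes both sequences have the same non-zero length and that
    y_true is not constant (otherwise R² is undefined).
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
print(metrics)  # rmse 10.0, mae 10.0, r2 approximately 0.992
```

RMSE and MAE are expressed in the units of the target (dollars of sales here), while R² is unitless, which is one reason the comparison table ranks models by RMSE.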

## Stage 4: Model Registry & Promotion

Register best model and promote to production.

In [None]:
print("📦 Stage 4: Model Registry\n")

# Register best model
MODEL_NAME = "sales_forecasting_production"
best_run_id = best_model['run_id']
model_uri = f"runs:/{best_run_id}/model"

try:
    model_version = mlflow.register_model(model_uri, MODEL_NAME)
    print(f"✅ Model registered:")
    print(f"   Name: {model_version.name}")
    print(f"   Version: {model_version.version}")
    print(f"   Current Stage: {model_version.current_stage}")
    
    # Transition to Staging
    client = mlflow.MlflowClient()
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=model_version.version,
        stage="Staging"
    )
    print(f"\n✅ Transitioned to Staging")
    
    # Simulate validation tests passing
    print("\n🧪 Running validation tests...")
    time.sleep(2)
    print("   ✅ Data quality check: PASS")
    print("   ✅ Model accuracy check: PASS")
    print("   ✅ Inference time check: PASS")
    
    # Promote to Production
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=model_version.version,
        stage="Production",
        archive_existing_versions=True
    )
    print(f"\n🚀 Model promoted to PRODUCTION!")
    
except Exception as e:
    print(f"⚠️  Error: {e}")
    print("   Continuing with existing model...")
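
The validation step in this stage is simulated with `time.sleep(2)` and hard-coded PASS messages. In a real pipeline the promotion would typically be gated on measured values. A minimal sketch of such a gate, where the threshold values, the metric names in the dictionaries, and the function name `passes_promotion_gate` are all assumptions for illustration rather than part of this course's code:

```python
def passes_promotion_gate(candidate, production=None,
                          max_rmse=500.0, min_r2=0.80, max_latency_ms=100.0):
    """Return (ok, reasons) for promoting a candidate model.

    `candidate` and `production` are dicts with 'rmse', 'r2' and 'latency_ms'.
    All thresholds are placeholder values and should be set per project.
    """
    reasons = []
    if candidate["rmse"] > max_rmse:
        reasons.append(f"rmse {candidate['rmse']:.2f} above limit {max_rmse:.2f}")
    if candidate["r2"] < min_r2:
        reasons.append(f"r2 {candidate['r2']:.3f} below limit {min_r2:.3f}")
    if candidate["latency_ms"] > max_latency_ms:
        reasons.append(f"latency {candidate['latency_ms']:.0f}ms above limit {max_latency_ms:.0f}ms")
    # Do not replace the current production model with one that is no better.
    if production is not None and candidate["rmse"] >= production["rmse"]:
        reasons.append("rmse not better than current production model")
    return (not reasons, reasons)


candidate = {"rmse": 420.0, "r2": 0.91, "latency_ms": 35.0}
production = {"rmse": 450.0, "r2": 0.89, "latency_ms": 40.0}
print(passes_promotion_gate(candidate, production))  # (True, [])
```

Only when `ok` is true would the notebook go on to call `transition_model_version_stage(..., stage="Production")`; otherwise the collected reasons explain why the version stays in Staging.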

## Stage 5: API Deployment & Testing

Test FastAPI serving predictions.

In [None]:
print("🌐 Stage 5: API Testing\n")

API_URL = "http://localhost:8000"

# Test health endpoint
try:
    # A timeout keeps the cell from hanging if the API accepts the connection but never answers
    response = requests.get(f"{API_URL}/health", timeout=5)
    if response.status_code == 200:
        print("✅ API Health Check: OK")
        print(f"   Response: {response.json()}")
    else:
        print(f"⚠️  API Health Check: Status {response.status_code}")
except Exception as e:
    print(f"❌ API not reachable: {e}")
    print("   Start API: uvicorn api.main:app --host 0.0.0.0 --port 8000 &")

# Test prediction endpoint (if API is running)
try:
    # Sample prediction request
    sample_data = {
        "date": "2024-12-01",
        "region": "North",
        "product": "Electronics",
        "price": 299.99,
        "quantity": 50
    }
    
    print("\n🔮 Testing Prediction Endpoint...")
    print(f"   Input: {sample_data}")
    
    response = requests.post(f"{API_URL}/predict", json=sample_data, timeout=10)
    
    if response.status_code == 200:
        prediction = response.json()
        predicted_sales = prediction.get('predicted_sales')
        # Apply the numeric format only when a number came back; formatting
        # an 'N/A' fallback string with ',.2f' would raise ValueError
        sales_text = f"${predicted_sales:,.2f}" if isinstance(predicted_sales, (int, float)) else "N/A"
        print(f"\n✅ Prediction successful:")
        print(f"   Predicted Sales: {sales_text}")
        print(f"   Confidence: {prediction.get('confidence', 'N/A')}")
        print(f"   Model: {prediction.get('model_version', 'N/A')}")
    else:
        print(f"⚠️  Prediction failed: {response.status_code}")
        print(f"   Response: {response.text}")
        
except Exception as e:
    print(f"⚠️  Could not test predictions: {e}")
    print("   Ensure /predict endpoint is implemented in api/main.py")
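
A malformed request is easier to diagnose before it reaches the API than from an error response body. The sketch below validates a payload of the same shape as `sample_data`; the field names come from that sample, while the `PredictionRequest` class, the `build_payload` helper, and the rules (positive price, non-negative quantity) are assumptions, since the actual schema lives in `api/main.py`:

```python
import datetime as dt
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PredictionRequest:
    date: str
    region: str
    product: str
    price: float
    quantity: int

    def __post_init__(self):
        # fromisoformat raises ValueError if the string is not a valid ISO date
        dt.date.fromisoformat(self.date)
        if not self.region or not self.product:
            raise ValueError("region and product must be non-empty")
        if self.price <= 0:
            raise ValueError("price must be positive")
        if self.quantity < 0:
            raise ValueError("quantity must not be negative")


def build_payload(raw):
    """Validate a dict and return a JSON-ready copy.

    Missing or unexpected keys raise TypeError; invalid values raise ValueError.
    """
    return asdict(PredictionRequest(**raw))


payload = build_payload({
    "date": "2024-12-01",
    "region": "North",
    "product": "Electronics",
    "price": 299.99,
    "quantity": 50,
})
print(payload)
```

The returned dictionary can be passed straight to `requests.post(..., json=payload)`. On the server side, FastAPI would normally perform the equivalent checks through a Pydantic model.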

## Stage 6: Airflow Pipeline Trigger

Trigger the automated pipeline.

In [None]:
print("✈️ Stage 6: Airflow Pipeline\n")

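AIRFLOW_URL = "http://localhost:8080"
# Prefer credentials from the environment over hard-coded values. The variable
# names AIRFLOW_USER / AIRFLOW_PASSWORD are an assumed convention; the defaults
# match the local demo setup.
AIRFLOW_USER = os.getenv("AIRFLOW_USER", "admin")
AIRFLOW_PASSWORD = os.getenv("AIRFLOW_PASSWORD", "admin123")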

# Check if DAG exists
try:
    response = requests.get(
        f"{AIRFLOW_URL}/api/v1/dags",
        auth=(AIRFLOW_USER, AIRFLOW_PASSWORD)
    )
    
    if response.status_code == 200:
        dags = response.json().get('dags', [])
        dag_ids = [dag['dag_id'] for dag in dags]
        
        print(f"✅ Connected to Airflow")
        print(f"   Available DAGs: {len(dag_ids)}")
        
        # Look for ML training DAG
        ml_dags = [d for d in dag_ids if 'ml' in d.lower() or 'training' in d.lower()]
        if ml_dags:
            print(f"   ML DAGs found: {', '.join(ml_dags)}")
            
            # Trigger first ML DAG
            dag_id = ml_dags[0]
            print(f"\n🚀 Triggering DAG: {dag_id}")
            
            trigger_response = requests.post(
                f"{AIRFLOW_URL}/api/v1/dags/{dag_id}/dagRuns",
                auth=(AIRFLOW_USER, AIRFLOW_PASSWORD),
                json={"conf": {"triggered_by": "notebook_07"}}
            )
            
            if trigger_response.status_code == 200:
                dag_run = trigger_response.json()
                print(f"✅ DAG run created:")
                print(f"   Run ID: {dag_run.get('dag_run_id')}")
                print(f"   State: {dag_run.get('state')}")
                print(f"\n   View in UI: {AIRFLOW_URL}/dags/{dag_id}/grid")
            else:
                print(f"⚠️  Failed to trigger: {trigger_response.status_code}")
        else:
            print("⚠️  No ML training DAGs found")
            print("   Create DAG in ~/airflow/dags/ml_training_pipeline.py")
    else:
        print(f"❌ Could not connect to Airflow: {response.status_code}")
        
except Exception as e:
    print(f"❌ Airflow error: {e}")
    print("   Ensure Airflow is running: airflow standalone")
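
Triggering a DAG only creates the run; it does not wait for it to finish. To confirm the pipeline actually completed, the run state can be polled. The sketch below keeps the HTTP call out of the loop by taking a `fetch_state` callable, so it runs without an Airflow server. `wait_for_dag_run` is a hypothetical helper, and the terminal states `success` and `failed` are the ones Airflow 2 reports for DAG runs:

```python
import time


def wait_for_dag_run(fetch_state, timeout_s=600, poll_s=10, sleep=time.sleep):
    """Poll until the run reaches a terminal state or the timeout expires.

    `fetch_state` is any zero-argument callable returning the current state
    string. Returns the terminal state, or raises TimeoutError.
    """
    waited = 0
    while True:
        state = fetch_state()
        if state in ("success", "failed"):
            return state
        if waited >= timeout_s:
            raise TimeoutError(f"DAG run still '{state}' after {waited}s")
        sleep(poll_s)
        waited += poll_s


# Against a live server, fetch_state could wrap the Airflow 2 REST endpoint
# for a single run, using the variables defined in this stage:
#   lambda: requests.get(
#       f"{AIRFLOW_URL}/api/v1/dags/{dag_id}/dagRuns/{dag_run['dag_run_id']}",
#       auth=(AIRFLOW_USER, AIRFLOW_PASSWORD), timeout=10,
#   ).json()["state"]
```

Passing `sleep` as a parameter is a small design choice that lets the loop be exercised instantly in tests; in the notebook the default `time.sleep` is what you would want.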

## Stage 7: Dashboard Verification

Check Streamlit dashboard.

In [None]:
print("📊 Stage 7: Dashboard Check\n")

DASHBOARD_URL = "http://localhost:8501"

try:
    response = requests.get(DASHBOARD_URL, timeout=5)
    if response.status_code == 200:
        print("✅ Streamlit Dashboard: Running")
        print(f"   URL: {DASHBOARD_URL}")
        print("\n   Available pages:")
        print("   • Main Dashboard")
        print("   • Model Comparison")
        print("   • Experiment Tracking")
        print("   • Predictions")
        print("   • Data Drift Monitoring")
        print("   • System Health")
    else:
        print(f"⚠️  Dashboard responding but status {response.status_code}")
except Exception as e:
    print(f"❌ Dashboard not reachable: {e}")
    print("   Start dashboard: streamlit run dashboard/app.py &")

## 📊 Pipeline Summary & Results

In [None]:
print("="*80)
print("                    🚀 MLOPS PIPELINE EXECUTION SUMMARY")
print("="*80)

summary = {
    "Stage": [
        "1. Data Versioning",
        "2. Feature Engineering",
        "3. Model Training",
        "4. Model Registry",
        "5. API Deployment",
        "6. Airflow Pipeline",
        "7. Dashboard"
    ],
    "Component": [
        "DVC",
        "scikit-learn",
        "MLflow Tracking",
        "MLflow Registry",
        "FastAPI",
        "Apache Airflow",
        "Streamlit"
    ],
    "Status": [
        "✅ Data tracked" if os.path.exists(dvc_file) else "⚠️  Setup needed",
        "✅ Features ready",
        f"✅ {len(results)} models trained",
        "✅ Model in Production",
        "✅ Running" if api_ok else "⚠️  Not running",
        "✅ Running" if airflow_ok else "⚠️  Not running",
        "✅ Running" if dashboard_ok else "⚠️  Not running"
    ]
}

summary_df = pd.DataFrame(summary)
display(summary_df)

print("\n📈 Key Metrics:")
print(f"   Dataset Size: {len(df):,} rows")
print(f"   Features Created: {len(feature_cols)}")
print(f"   Models Trained: {len(results)}")
print(f"   Best Model: {best_model['model_name']}")
print(f"   Best RMSE: ${best_model['rmse']:,.2f}")
print(f"   Best R² Score: {best_model['r2']:.4f}")

print("\n🔗 Access Points:")
print(f"   Airflow UI:  http://localhost:8080")
# Report the URI MLflow is actually using, including the Stage 3 fallback default
print(f"   MLflow UI:   {mlflow.get_tracking_uri()}")
print(f"   FastAPI:     http://localhost:8000/docs")
print(f"   Dashboard:   http://localhost:8501")

print("\n" + "="*80)

## 🎓 What We Accomplished

### Complete MLOps Lifecycle:

1. **Data Management** ✅
   - Versioned with DVC
   - Tracked in Git
   - Reproducible

2. **Feature Engineering** ✅
   - Lag features (1, 7, 30 days)
   - Rolling statistics
   - Categorical encoding
   - Feature scaling

3. **Model Training** ✅
   - 3 models compared
   - All metrics logged to MLflow
   - Best model identified

4. **Model Registry** ✅
   - Model registered
   - Staging validation
   - Production promotion
   - Version controlled

5. **Deployment** ✅
   - REST API serving
   - Real-time predictions
   - Health monitoring

6. **Orchestration** ✅
   - Automated workflows
   - Error handling
   - Scheduled execution

7. **Monitoring** ✅
   - Performance dashboards
   - Experiment tracking
   - System health checks

---

## 🚀 Production Readiness Checklist

- [x] Data versioning (DVC)
- [x] Experiment tracking (MLflow)
- [x] Model registry (MLflow)
- [x] Pipeline automation (Airflow)
- [x] API deployment (FastAPI)
- [x] Monitoring dashboard (Streamlit)
- [ ] CI/CD pipeline (GitHub Actions) → Next phase
- [ ] Data drift detection → Next phase
- [ ] A/B testing → Next phase
- [ ] Auto-retraining → Next phase

---

## 💡 Key Takeaways

### Why This Architecture?

**Before MLOps:**
- Manual scripts run by Sarah at 2 AM
- No tracking of experiments
- Can't reproduce results
- Models lost after 3 months
- No idea which data trained which model

**After MLOps:**
- Fully automated pipelines
- Every experiment tracked
- Reproducible from versioned code, data, and parameters
- Every model version kept in the registry
- Complete lineage: Data → Features → Model → Predictions

### Production Benefits:

1. **Reliability**: Automatic retries, error handling
2. **Reproducibility**: Git + DVC + MLflow = complete history
3. **Scalability**: Add more models/features easily
4. **Collaboration**: Team sees same experiments
5. **Compliance**: Full audit trail
6. **Speed**: Deploy in minutes, not days

---

## 🎯 Next Steps

### Immediate:
1. Add more DAGs for batch predictions
2. Implement data drift detection
3. Set up CI/CD with GitHub Actions
4. Add unit tests (pytest)

### Advanced:
1. Multi-model serving (A/B testing)
2. Auto-retraining on drift detection
3. Kubernetes deployment
4. Model explainability (SHAP)
5. Feature store integration

---

## 🏆 Congratulations!

You've built a **production-ready MLOps pipeline** from scratch!

Large tech companies run the same lifecycle on their own platforms:
- ✅ Google (TFX)
- ✅ Netflix (Metaflow)
- ✅ Uber (Michelangelo)
- ✅ Airbnb (Bighead)

You now understand:
- Why each component exists
- How they integrate
- When to use what
- How to debug issues

**This knowledge is what you'll use every day as an MLOps engineer!**

---

**© 2024 Amey Talkatkar** | MLOps with Agentic AI - Advanced Certification

**Course Modules Covered:**
- ✅ Module 1: Python & MLOps Foundations
- 🔄 Module 2: Modern Cloud-Native MLOps (In Progress)
- ⏳ Module 3: Cloud & Productionization (Coming Soon)
- ⏳ Module 4: Agentic AI & LLMOps (Coming Soon)