# Deployment Thinking - Reproducibility, Monitoring, and Don't Ship a Notebook

<hr>

<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/main/notebooks/figures/mgmt_474_ai_logo_02-modified.png" width="200"/>
</div>
</center>

# <center>MGMT47400 Predictive Analytics</center>
# <center>Professor: Davi Moreira </center>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/blob/main/notebooks/18_reproducibility_monitoring.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Package a model pipeline reproducibly (single function, fixed preprocessing)
2. Save/load model artifacts and ensure consistent inference
3. Define monitoring signals (data drift, performance drift, calibration drift)
4. Create a minimal production checklist and risk log
5. Prepare the project notebook for executive-facing reproducibility

---

## 1. Setup: Installs, Imports, Seeds, Display Settings

First, let's set up our environment with all necessary packages and configurations.

In [None]:
# Install required packages (uncomment if needed)
# !pip install pandas numpy matplotlib seaborn scikit-learn joblib --quiet

# Core imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import joblib
import warnings
from datetime import datetime
import json

# Display settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 3)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Set random seed for reproducibility
RANDOM_SEED = 474
np.random.seed(RANDOM_SEED)

print("✓ Setup complete!")
print(f"Random seed: {RANDOM_SEED}")
print(f"Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

**Reading the output:**

The output confirms `Setup complete!` alongside the locked **random seed
(474)** and a timestamp. The timestamp is informational only -- it lets you
compare runs across sessions. All the heavy imports (`joblib`, `Pipeline`,
`StandardScaler`, `json`) are in place, which means we are ready to build,
save, load, and evaluate a reproducible pipeline.

**Why this matters:** Recording the seed and timestamp at the top of the
notebook is a simple but powerful reproducibility habit.

---


## 2. Refactor into Functions: train_model(), predict(), evaluate()

### Why Refactor?

> **"Notebooks are for exploration. Functions are for production."**  
> Reproducibility requires separating configuration from code.

**Key principles:**
- Separate configuration (hyperparameters, paths, seeds) from logic
- Wrap training logic in a single function that returns a fitted pipeline
- Create prediction and evaluation functions that work with the saved pipeline
- Make everything reproducible with fixed seeds and saved artifacts

### 2.1 Configuration Block

In [None]:
# Configuration dictionary - all settings in one place
CONFIG = {
    'data': {
        'test_size': 0.2,
        'val_size': 0.25,  # 0.25 of remaining 0.8 = 0.2 overall
        'random_seed': RANDOM_SEED
    },
    'preprocessing': {
        'scaler': 'standard',  # 'standard', 'minmax', or None
    },
    'model': {
        'type': 'logistic_regression',  # 'logistic_regression' or 'random_forest'
        'hyperparameters': {
            'C': 1.0,
            'max_iter': 1000,
            'random_state': RANDOM_SEED
        }
    },
    'paths': {
        'model_artifact': 'model_pipeline.joblib',
        'config_artifact': 'model_config.json',
        'metrics_artifact': 'training_metrics.json'
    },
    'metadata': {
        'project_name': 'Predictive Analytics Project',
        'author': 'Your Name',
        'created_date': datetime.now().strftime('%Y-%m-%d')
    }
}

print("✓ Configuration loaded")
print(json.dumps(CONFIG, indent=2))

**Reading the output:**

The full `CONFIG` dictionary is printed as pretty-printed JSON. Check that
the **data** section yields the 60/20/20 split (`test_size` 0.2, then
`val_size` 0.25 of the remaining 80 %), the **preprocessing** section
specifies `standard` scaling, and the **model** section names
`logistic_regression` with `C=1.0` and the correct seed.
The **paths** section lists the three artifact filenames that will be
written to disk later.

**Key takeaway:** Externalising every tuneable setting into a single config
dictionary means you can reproduce *or modify* any experiment by changing
one JSON block instead of hunting through scattered code cells.
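
One way to make the "one config block per experiment" idea auditable is to derive a short fingerprint from the config and store it next to the metrics. The sketch below is an illustration only: `config_fingerprint` and the three small example configs are hypothetical names, not part of the notebook, and it assumes the config is fully JSON-serialisable.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Return a short, stable hash of a JSON-serialisable config dict."""
    # sort_keys makes the serialisation independent of dict insertion order
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical configs for illustration (not the notebook's CONFIG)
config_a = {"model": {"type": "logistic_regression", "C": 1.0}, "seed": 474}
config_b = {"seed": 474, "model": {"C": 1.0, "type": "logistic_regression"}}
config_c = {"seed": 474, "model": {"C": 0.5, "type": "logistic_regression"}}

# Same settings in a different key order give the same fingerprint
print(config_fingerprint(config_a) == config_fingerprint(config_b))
# Changing a hyperparameter changes the fingerprint
print(config_fingerprint(config_a) == config_fingerprint(config_c))
```

If you apply this to the notebook's `CONFIG`, consider hashing only the `data`, `preprocessing`, and `model` sections: the `metadata` section contains today's date, so the full dictionary would get a new fingerprint every day even when the experiment is unchanged.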

---


### 2.2 Load Sample Data

We use scikit-learn's built-in breast-cancer dataset for demonstration
because it loads instantly in any Colab environment with no file downloads.
The dataset has 569 samples and 30 numeric features describing cell-nucleus
measurements. After loading, we print shape and target counts so we can
verify the data before splitting.


In [None]:
# For demonstration, we'll use the breast cancer dataset
from sklearn.datasets import load_breast_cancer

# Load data
data = load_breast_cancer(as_frame=True)
df = data.frame

# Separate features and target
X = df.drop(columns=['target'])
y = df['target']

print(f"Dataset shape: {df.shape}")
print(f"Features: {X.shape[1]}")
print(f"Target distribution: {y.value_counts().to_dict()}")

**Reading the output:**

The dataset has **569 samples** and **30 features**. The target distribution
shows the counts of malignant (0) and benign (1) cases. Because the classes
are not perfectly balanced, the upcoming split will use stratification to
keep proportions consistent across train, validation, and test sets.

**Why this matters:** Printing shape and target counts immediately after
loading is a basic but essential data-quality checkpoint.

---


### 2.3 Create Splits

We apply the standard 60/20/20 train-validation-test split using the seed
stored in `CONFIG`. Stratification on the target ensures that class
proportions are preserved in every split. Printing split sizes and
percentages immediately after splitting serves as a sanity check that the
ratios are correct.


In [None]:
# Create train/val/test splits
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, 
    test_size=CONFIG['data']['test_size'], 
    random_state=CONFIG['data']['random_seed'],
    stratify=y
)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, 
    test_size=CONFIG['data']['val_size'], 
    random_state=CONFIG['data']['random_seed'],
    stratify=y_temp
)

print("=== SPLIT SIZES ===")
print(f"Train: {len(X_train)} samples ({len(X_train)/len(df)*100:.1f}%)")
print(f"Validation: {len(X_val)} samples ({len(X_val)/len(df)*100:.1f}%)")
print(f"Test: {len(X_test)} samples ({len(X_test)/len(df)*100:.1f}%)")
print(f"\n✓ Splits created with seed {CONFIG['data']['random_seed']}")

**Reading the output:**

The split summary shows the number of samples and percentage for **Train**,
**Validation**, and **Test** sets, which should approximate 60 %, 20 %, and
20 % of the full 569 samples. The confirmation line reprints the seed used.
If any percentage is notably off, it may indicate an incorrect `test_size`
or `val_size` in the config.

**Key takeaway:** Printing split sizes right after creation is a fast way
to catch configuration errors before they propagate into model training.
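
The visual check can also be turned into an automatic one. The following is a minimal sketch, assuming a 60/20/20 target and a tolerance of one percentage point; `check_split_fractions` is a hypothetical helper and the counts are illustrative numbers that sum to 569.

```python
def check_split_fractions(n_train, n_val, n_test, expected=(0.6, 0.2, 0.2), tol=0.01):
    """Return the observed fractions; raise if any deviates from `expected` by more than `tol`."""
    total = n_train + n_val + n_test
    observed = tuple(n / total for n in (n_train, n_val, n_test))
    for name, obs, exp in zip(("train", "val", "test"), observed, expected):
        if abs(obs - exp) > tol:
            raise ValueError(f"{name} fraction {obs:.3f} deviates from expected {exp:.3f}")
    return observed


# Illustrative counts for a 569-row dataset
fractions = check_split_fractions(341, 114, 114)
print([round(f, 3) for f in fractions])
```

In the notebook you would call it as `check_split_fractions(len(X_train), len(X_val), len(X_test))`, so a mistyped `test_size` or `val_size` stops the run instead of silently propagating.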

---


### 2.4 Train Function: Fit Once, Run Anywhere

Wrapping the model inside a scikit-learn `Pipeline` ensures that
preprocessing (e.g., `StandardScaler`) and the estimator travel together as
a single artifact. The `train_model()` function reads all settings from the
`CONFIG` dictionary, so changing the model type or hyper-parameters never
requires editing the function body -- only the config.


In [None]:
def train_model(X_train, y_train, config):
    """
    Train a model pipeline from scratch.
    
    Parameters:
    -----------
    X_train : pd.DataFrame
        Training features
    y_train : pd.Series
        Training target
    config : dict
        Configuration dictionary with preprocessing and model settings
    
    Returns:
    --------
    pipeline : sklearn.pipeline.Pipeline
        Fitted pipeline ready for prediction
    """
    # Build pipeline steps
    steps = []
    
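    # Add scaler if specified ('standard', 'minmax', or None)
    scaler_name = config['preprocessing']['scaler']
    if scaler_name == 'standard':
        steps.append(('scaler', StandardScaler()))
    elif scaler_name == 'minmax':
        # Local import because the Setup cell only imports StandardScaler
        from sklearn.preprocessing import MinMaxScaler
        steps.append(('scaler', MinMaxScaler()))
    elif scaler_name is not None:
        raise ValueError(f"Unknown scaler: {scaler_name}")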
    
    # Add model
    if config['model']['type'] == 'logistic_regression':
        model = LogisticRegression(**config['model']['hyperparameters'])
    elif config['model']['type'] == 'random_forest':
        model = RandomForestClassifier(**config['model']['hyperparameters'])
    else:
        raise ValueError(f"Unknown model type: {config['model']['type']}")
    
    steps.append(('model', model))
    
    # Create and fit pipeline
    pipeline = Pipeline(steps)
    pipeline.fit(X_train, y_train)
    
    print(f"✓ Pipeline trained: {len(steps)} steps")
    for step_name, step_obj in pipeline.steps:
        print(f"  - {step_name}: {type(step_obj).__name__}")
    
    return pipeline

**Reading the output:**

No output is produced here because we are *defining* the function, not
calling it yet. When `train_model()` is later invoked, it will print the
number of pipeline steps and the class name of each step (e.g.,
`StandardScaler`, `LogisticRegression`). This printout acts as a quick
audit: you can confirm that scaling and the correct estimator are both
present in the pipeline.

**Why this matters:** Defining training logic inside a function -- rather
than in loose notebook cells -- is the first step toward production-grade
code.

---


### 2.5 Predict Function

The `predict()` function takes a fitted pipeline and a feature matrix, then
returns both hard predictions and probability estimates (when available).
Separating prediction from training makes it easy to swap in a loaded
artifact later without duplicating code.


In [None]:
def predict(pipeline, X):
    """
    Make predictions using a fitted pipeline.
    
    Parameters:
    -----------
    pipeline : sklearn.pipeline.Pipeline
        Fitted pipeline
    X : pd.DataFrame
        Features to predict on
    
    Returns:
    --------
    predictions : np.ndarray
        Predicted class labels
    probabilities : np.ndarray
        Predicted probabilities (if available)
    """
    predictions = pipeline.predict(X)
    
    # Get probabilities if available
    if hasattr(pipeline, 'predict_proba'):
        probabilities = pipeline.predict_proba(X)
    else:
        probabilities = None
    
    return predictions, probabilities

**Reading the output:**

Again, this cell only *defines* the `predict()` function -- no output yet.
When called, it returns both hard labels and probability estimates. The
`hasattr` check for `predict_proba` makes the function safe to use with
estimators that do not expose probabilities (e.g., `LinearSVC`, or `SVC`
with its default `probability=False`); in that case `probabilities` is
`None`.

**Key takeaway:** Defensive checks like `hasattr` prevent runtime errors
when you swap model types in the config.

---


### 2.6 Evaluate Function

The `evaluate()` function calls `predict()` internally and then computes a
standard set of metrics: accuracy, precision, recall, F1, and ROC-AUC.
Returning the results as a dictionary makes it straightforward to log them
to a JSON file for reproducibility tracking.


In [None]:
def evaluate(pipeline, X, y, split_name='Test'):
    """
    Evaluate a fitted pipeline.
    
    Parameters:
    -----------
    pipeline : sklearn.pipeline.Pipeline
        Fitted pipeline
    X : pd.DataFrame
        Features
    y : pd.Series
        True labels
    split_name : str
        Name of the split for reporting
    
    Returns:
    --------
    metrics : dict
        Dictionary of evaluation metrics
    """
    # Get predictions
    y_pred, y_proba = predict(pipeline, X)
    
    # Calculate metrics
    metrics = {
        'split': split_name,
        'n_samples': len(y),
        'accuracy': accuracy_score(y, y_pred),
        'precision': precision_score(y, y_pred, zero_division=0),
        'recall': recall_score(y, y_pred, zero_division=0),
        'f1': f1_score(y, y_pred, zero_division=0)
    }
    
    # Add AUC if probabilities available
    if y_proba is not None:
        metrics['roc_auc'] = roc_auc_score(y, y_proba[:, 1])
    
    # Print summary
    print(f"\n=== {split_name} Metrics ===")
    for key, value in metrics.items():
        if key not in ['split', 'n_samples']:
            print(f"{key:>12s}: {value:.4f}")
    
    return metrics

**Reading the output:**

This cell defines `evaluate()` -- it will print a table of metrics when
called. The function returns a dictionary containing accuracy, precision,
recall, F1, and (when available) ROC-AUC, tagged with the split name and
sample count. Storing results in a dictionary rather than just printing
them makes it easy to serialise the metrics to JSON later.

**Key takeaway:** Always return metrics programmatically so they can be
saved, compared, and audited -- do not rely solely on printed text.

---


### 2.7 Train and Evaluate

Now we run the full workflow: train the pipeline on the training set, then
evaluate on train, validation, and test sets in sequence. Comparing
train-set performance to validation and test performance immediately reveals
whether the model is overfitting.


In [None]:
# Train the pipeline
pipeline = train_model(X_train, y_train, CONFIG)

# Evaluate on all splits
train_metrics = evaluate(pipeline, X_train, y_train, 'Train')
val_metrics = evaluate(pipeline, X_val, y_val, 'Validation')
test_metrics = evaluate(pipeline, X_test, y_test, 'Test')

print("\n✓ Model trained and evaluated on all splits")

**Reading the output:**

First you see the pipeline confirmation (`Pipeline trained: 2 steps --
StandardScaler, LogisticRegression`). Then three metric blocks appear for
**Train**, **Validation**, and **Test**. Compare them: if train metrics are
much higher than validation/test, the model may be overfitting. Because we
are using a simple logistic regression with default regularisation on a
well-behaved dataset, all three sets should show similar performance.

**Why this matters:** Evaluating on all three splits in a single cell gives
you an instant overfitting/underfitting diagnostic.
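
Because `evaluate()` returns dictionaries, the train-versus-validation comparison can be computed rather than eyeballed. This is a sketch under stated assumptions: `generalisation_gaps` is a hypothetical helper, the 0.05 cut-off is an arbitrary illustration rather than a standard, and the metric values are made up.

```python
def generalisation_gaps(train_metrics, other_metrics,
                        keys=("accuracy", "f1", "roc_auc"), max_gap=0.05):
    """Return {metric: (gap, flagged)} where gap = train value - other value."""
    report = {}
    for key in keys:
        if key in train_metrics and key in other_metrics:
            gap = train_metrics[key] - other_metrics[key]
            report[key] = (gap, gap > max_gap)
    return report


# Hypothetical metric values for illustration only
train_example = {"split": "Train", "accuracy": 0.99, "f1": 0.99, "roc_auc": 1.00}
val_example = {"split": "Validation", "accuracy": 0.91, "f1": 0.93, "roc_auc": 0.97}

gaps = generalisation_gaps(train_example, val_example)
for metric, (gap, flagged) in gaps.items():
    status = "possible overfitting" if flagged else "ok"
    print(f"{metric:>10s}: gap = {gap:+.3f} ({status})")
```

With the notebook's own results the call would be `generalisation_gaps(train_metrics, val_metrics)`; choose `max_gap` to match how much degradation your use case can tolerate.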

---


## 📝 PAUSE-AND-DO Exercise 1 (10 minutes)

**Task:** Implement `train_model(config)` returning pipeline + metrics.

**Instructions:**
1. Review the `train_model()` function above
2. Modify the CONFIG dictionary to try different settings:
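   - Change the model type to 'random_forest' and replace the `hyperparameters` block, since `C` and `max_iter` are not valid `RandomForestClassifier` arguments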
   - Adjust hyperparameters (e.g., n_estimators, max_depth)
   - Compare performance metrics
3. Document your findings below

**What to try:**
- Different model types
- Different preprocessing approaches
- Different hyperparameter values

---

### YOUR EXPERIMENT HERE:

**Configuration tested:**  
[Describe your configuration changes]

**Results:**  
[Report performance metrics]

**Key finding:**  
[What did you learn?]

---

## 3. Save/Load Model Artifacts (using joblib)

### Why Save Artifacts?

> **"If you can't load it, you can't deploy it."**  
> Model persistence is the bridge between development and production.

**What to save:**
- Fitted pipeline (includes preprocessing + model)
- Configuration used to train
- Training metrics and metadata
- Feature names and types

### 3.1 Save Model Artifacts

In [None]:
import os


def save_model_artifacts(pipeline, config, metrics, feature_names):
    """
    Save all model artifacts for reproducibility.
    
    Parameters:
    -----------
    pipeline : sklearn.pipeline.Pipeline
        Fitted pipeline
    config : dict
        Configuration dictionary
    metrics : dict
        Metrics dictionaries keyed by 'train', 'val', and 'test'
    feature_names : list
        List of feature names
    """
    # Save pipeline
    joblib.dump(pipeline, config['paths']['model_artifact'])
    print(f"✓ Saved pipeline to {config['paths']['model_artifact']}")
    
    # Save config
    with open(config['paths']['config_artifact'], 'w') as f:
        json.dump(config, f, indent=2)
    print(f"✓ Saved config to {config['paths']['config_artifact']}")
    
    # Save metrics with metadata (read from the `metrics` argument, not from globals)
    artifact_metadata = {
        'train_metrics': metrics['train'],
        'val_metrics': metrics['val'],
        'test_metrics': metrics['test'],
        'feature_names': feature_names,
        'n_features': len(feature_names),
        'saved_at': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    }
    
    with open(config['paths']['metrics_artifact'], 'w') as f:
        json.dump(artifact_metadata, f, indent=2)
    print(f"✓ Saved metrics to {config['paths']['metrics_artifact']}")
    
    print("\n=== Artifact Summary ===")
    # Report the size of the file written above; joblib.dump returns filenames, not a byte count
    size_bytes = os.path.getsize(config['paths']['model_artifact'])
    print(f"Pipeline size: {size_bytes} bytes")
    print(f"Features: {len(feature_names)}")
    print(f"Pipeline steps: {len(pipeline.steps)}")

# Save artifacts
save_model_artifacts(
    pipeline=pipeline,
    config=CONFIG,
    metrics={'train': train_metrics, 'val': val_metrics, 'test': test_metrics},
    feature_names=list(X_train.columns)
)

**Reading the output:**

Three confirmation lines show that the **pipeline** (`.joblib`), the
**config** (`.json`), and the **metrics** (`.json`) were written to disk.
The artifact summary reports the pipeline file size and the number of
features and steps. A small file size is expected for logistic regression;
tree ensembles will be larger. With these artifacts, and the same package
versions installed, someone can reproduce your predictions without
re-running training.

**Key takeaway:** `joblib` serialises the entire fitted pipeline --
including the scaler's learned mean and variance -- so inference always
applies the same preprocessing.
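
A saved artifact is only trustworthy if the file you load is the file you saved. One common safeguard, sketched here as an assumption rather than something the notebook does, is to record a checksum of the artifact at save time and compare it before loading. `file_sha256` is a hypothetical helper, and the throwaway file merely stands in for `model_pipeline.joblib`.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file standing in for the saved pipeline
payload = b"pretend these bytes are a serialised pipeline"
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = os.path.join(tmp_dir, "demo_artifact.bin")
    with open(artifact_path, "wb") as f:
        f.write(payload)
    checksum_at_save = file_sha256(artifact_path)  # store this with the metrics JSON
    checksum_at_load = file_sha256(artifact_path)  # recompute before loading

print(checksum_at_save == checksum_at_load)
```

This matters for safety as well as reproducibility: `joblib.load` unpickles the file, so it should only be used on artifacts from a source you trust.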

---


### 3.2 Load Model Artifacts

The `load_model_artifacts()` function reverses the save process: it reads
the pipeline with `joblib.load`, then reads back the JSON config and
metrics files. We verify reproducibility by comparing predictions from the
loaded pipeline to those from the original in-memory pipeline -- they must
match exactly.


In [None]:
def load_model_artifacts(config):
    """
    Load saved model artifacts.
    
    Parameters:
    -----------
    config : dict
        Configuration dictionary with artifact paths
    
    Returns:
    --------
    pipeline : sklearn.pipeline.Pipeline
        Loaded pipeline
    config_loaded : dict
        Loaded configuration
    metrics_loaded : dict
        Loaded metrics
    """
    # Load pipeline
    pipeline = joblib.load(config['paths']['model_artifact'])
    print(f"✓ Loaded pipeline from {config['paths']['model_artifact']}")
    
    # Load config
    with open(config['paths']['config_artifact'], 'r') as f:
        config_loaded = json.load(f)
    print(f"✓ Loaded config from {config['paths']['config_artifact']}")
    
    # Load metrics
    with open(config['paths']['metrics_artifact'], 'r') as f:
        metrics_loaded = json.load(f)
    print(f"✓ Loaded metrics from {config['paths']['metrics_artifact']}")
    
    return pipeline, config_loaded, metrics_loaded

# Test loading
print("=== Testing Model Loading ===")
loaded_pipeline, loaded_config, loaded_metrics = load_model_artifacts(CONFIG)

# Verify predictions match
original_preds = pipeline.predict(X_test[:5])
loaded_preds = loaded_pipeline.predict(X_test[:5])

print("\n=== Verification ===")
print(f"Original predictions: {original_preds}")
print(f"Loaded predictions:   {loaded_preds}")
print(f"Predictions match: {np.array_equal(original_preds, loaded_preds)}")
print("\n✓ Model artifacts save/load verified!")

**Reading the output:**

After loading, the verification section prints predictions from the
**original** in-memory pipeline and the **loaded** pipeline side by side.
The final line, `Predictions match: True`, is the critical check -- if it
said `False`, something went wrong during serialisation (e.g., a
preprocessing step was fitted outside the pipeline and therefore not saved).

**Why this matters:** This is the ultimate reproducibility test. If the
loaded model cannot reproduce the original predictions exactly, the saved
artifact is useless for deployment.

---


### 3.3 Reproducibility Checklist

**Before deploying, verify:**

- [ ] Pipeline includes all preprocessing steps
- [ ] Random seeds are fixed and documented
- [ ] Feature names and types are recorded
- [ ] Model can be loaded and produces identical predictions
- [ ] Configuration is saved separately from code
- [ ] Training metrics are documented
- [ ] All dependencies (package versions) are recorded

**⚠️ Common reproducibility failures:**
- Preprocessing done outside the pipeline (it is not saved with the model, so inference skips it)
- Missing random seeds
- Feature engineering not included in pipeline
- Package version mismatches between training and inference
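
To address the package-version item, the environment can be captured programmatically and written next to the other artifacts. The sketch below uses only the standard library; `capture_environment` is a hypothetical helper, and the package list is an assumption based on this notebook's imports.

```python
import json
import platform
from importlib import metadata


def capture_environment(packages):
    """Return the Python version plus the installed version of each package (None if absent)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {"python": platform.python_version(), "packages": versions}


environment = capture_environment(["numpy", "pandas", "scikit-learn", "joblib"])
print(json.dumps(environment, indent=2))
```

Saving this dictionary as JSON alongside the pipeline gives whoever loads the model a concrete list of versions to match.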

## 4. Monitoring Plan Template

### Why Monitor?

> **"Models decay. The world changes. Monitoring is not optional."**  
> Without monitoring, you won't know when your model stops working.

### 4.1 Three Types of Drift

**1. Data Drift (Covariate Shift)**
- Feature distributions change over time
- Example: Average transaction amount increases
- Detection: Compare feature distributions (KS test, PSI)

**2. Performance Drift (Concept Drift)**
- Model accuracy degrades over time, typically because the relationship between features and target has changed
- Example: Precision drops from 0.85 to 0.70
- Detection: Track accuracy, precision, recall on newly labeled data

**3. Calibration Drift**
- Predicted probabilities become unreliable
- Example: Model says 80% confidence but only right 60% of the time
- Detection: Compare predicted probabilities to observed frequencies
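
To make the data-drift detection step concrete, here is a dependency-free sketch of the Population Stability Index, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the baseline and current fractions in bin i. The equal-width binning over the baseline range, the `eps` floor for empty bins, and the simulated Gaussian samples are all assumptions made for illustration; production implementations often use quantile bins instead.

```python
import math
import random


def population_stability_index(baseline, current, n_bins=10, eps=1e-6):
    """PSI between two numeric samples, using equal-width bins over the baseline range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width) if width > 0 else 0
            # Values outside the baseline range are clamped into the first or last bin
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    expected = bin_fractions(baseline)
    actual = bin_fractions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Simulated feature values: training baseline, fresh data, and drifted data
rng = random.Random(474)
baseline = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same_dist = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

psi_same = population_stability_index(baseline, same_dist)
psi_shifted = population_stability_index(baseline, shifted)
print(f"PSI, same distribution:    {psi_same:.3f}")
print(f"PSI, mean shifted by 1 SD: {psi_shifted:.3f}")
```

The unshifted sample should produce a PSI close to zero, while the shifted one produces a much larger value. In practice you would compute one PSI per feature, comparing each production batch with the training data.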

### 4.2 Monitoring Signals Table

In [None]:
# Create monitoring plan table
monitoring_plan = pd.DataFrame([
    {
        'Signal': 'Prediction Volume',
        'Type': 'System Health',
        'Metric': 'Daily prediction count',
        'Warning Threshold': '< 80% of baseline',
        'Critical Threshold': '< 50% of baseline',
        'Check Frequency': 'Daily',
        'Owner': 'Data Engineering'
    },
    {
        'Signal': 'Feature Availability',
        'Type': 'Data Quality',
        'Metric': '% missing values per feature',
        'Warning Threshold': '> 5% missing',
        'Critical Threshold': '> 20% missing',
        'Check Frequency': 'Daily',
        'Owner': 'Data Engineering'
    },
    {
        'Signal': 'Feature Distribution',
        'Type': 'Data Drift',
        'Metric': 'Population Stability Index (PSI)',
        'Warning Threshold': 'PSI > 0.1',
        'Critical Threshold': 'PSI > 0.25',
        'Check Frequency': 'Weekly',
        'Owner': 'ML Engineering'
    },
    {
        'Signal': 'Prediction Distribution',
        'Type': 'Data Drift',
        'Metric': 'Predicted class proportions',
        'Warning Threshold': '> 10% shift',
        'Critical Threshold': '> 25% shift',
        'Check Frequency': 'Weekly',
        'Owner': 'ML Engineering'
    },
    {
        'Signal': 'Model Accuracy',
        'Type': 'Performance Drift',
        'Metric': 'Accuracy on labeled subset',
        'Warning Threshold': '< 90% of baseline',
        'Critical Threshold': '< 80% of baseline',
        'Check Frequency': 'Weekly',
        'Owner': 'ML Engineering'
    },
    {
        'Signal': 'Precision/Recall',
        'Type': 'Performance Drift',
        'Metric': 'Precision and recall on labeled subset',
        'Warning Threshold': '> 5% drop',
        'Critical Threshold': '> 15% drop',
        'Check Frequency': 'Weekly',
        'Owner': 'ML Engineering'
    },
    {
        'Signal': 'Calibration',
        'Type': 'Calibration Drift',
        'Metric': 'Brier score / calibration error',
        'Warning Threshold': '> 20% degradation',
        'Critical Threshold': '> 50% degradation',
        'Check Frequency': 'Bi-weekly',
        'Owner': 'ML Engineering'
    },
    {
        'Signal': 'Business Metric',
        'Type': 'Business Impact',
        'Metric': 'Conversion rate / ROI',
        'Warning Threshold': 'Per business rules',
        'Critical Threshold': 'Per business rules',
        'Check Frequency': 'Weekly',
        'Owner': 'Business Team'
    }
])

print("=== Monitoring Plan ===")
print(monitoring_plan.to_string(index=False))

# Save monitoring plan
monitoring_plan.to_csv('monitoring_plan.csv', index=False)
print("\n✓ Monitoring plan saved to monitoring_plan.csv")

**Reading the output:**

The monitoring plan table lists **eight signals** across six categories:
System Health, Data Quality, Data Drift, Performance Drift, Calibration
Drift, and Business Impact. Each row specifies the metric, warning and
critical thresholds, check frequency, and responsible owner. Notice the
cadence: most signals are checked **weekly** (calibration bi-weekly), but
prediction volume and feature availability are watched **daily** because a
sudden drop or a spike in missing values may signal a pipeline failure.

**Key takeaway:** A monitoring plan is only as good as its thresholds. The
PSI cut-offs (0.1 and 0.25) are a widely used rule of thumb; the other
numbers (e.g., accuracy < 90 % of baseline) are illustrative starting
points -- tailor them all to your project's tolerance for degradation.
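
As an illustration of how a threshold row becomes an automated check, the sketch below applies the calibration row of the table (warning above 20 % degradation, critical above 50 %) to the Brier score. `brier_score` and `degradation_status` are hypothetical helpers written for this example, and the labels and probabilities are made up.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def degradation_status(baseline, current, warning=0.20, critical=0.50):
    """Classify the relative increase of a lower-is-better metric such as the Brier score."""
    relative_increase = (current - baseline) / baseline  # assumes baseline > 0
    if relative_increase > critical:
        return "critical"
    if relative_increase > warning:
        return "warning"
    return "ok"


# Hypothetical labels and predicted probabilities, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
probs_at_launch = [0.9, 0.1, 0.8, 0.7, 0.2, 0.3, 0.9, 0.1]
probs_this_week = [0.9, 0.4, 0.6, 0.7, 0.5, 0.3, 0.9, 0.4]

baseline_brier = brier_score(y_true, probs_at_launch)
current_brier = brier_score(y_true, probs_this_week)
status = degradation_status(baseline_brier, current_brier)
print(f"baseline={baseline_brier:.4f}, current={current_brier:.4f}, status={status}")
```

The same pattern works for any lower-is-better signal; for higher-is-better metrics such as accuracy, flip the sign of the relative change before comparing it with the thresholds.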

---


### 4.3 Monitoring Implementation Checklist

**Setup:**
- [ ] Define baseline distributions (training data)
- [ ] Set up logging infrastructure for predictions
- [ ] Create dashboards for key metrics
- [ ] Define alert thresholds and escalation paths

**Ongoing:**
- [ ] Collect labeled data for ground truth
- [ ] Run scheduled monitoring jobs
- [ ] Review alerts and investigate anomalies
- [ ] Retrain model when drift detected

**Documentation:**
- [ ] Document baseline metrics
- [ ] Record all retraining events
- [ ] Maintain incident log
- [ ] Update monitoring plan as needed

## 📝 PAUSE-AND-DO Exercise 2 (10 minutes)

**Task:** Draft a monitoring plan with 5-8 signals and owners.

**Instructions:**
1. Review the monitoring plan table above
2. Customize it for your project:
   - What features are most important to monitor?
   - What business metrics matter most?
   - What are realistic thresholds for your use case?
3. Add at least 2 project-specific signals
4. Document your plan below

**What to include:**
- Signal name and type
- Specific metric to track
- Warning and critical thresholds
- Check frequency
- Responsible owner

---

### YOUR MONITORING PLAN HERE:

**Project-specific signals:**

1. **[Signal Name]**  
   - Type: [Data Drift / Performance / Calibration / Business]
   - Metric: [What to measure]
   - Thresholds: [Warning / Critical]
   - Frequency: [How often]
   - Owner: [Who is responsible]

2. **[Signal Name]**  
   - Type:
   - Metric:
   - Thresholds:
   - Frequency:
   - Owner:

**Rationale:**  
[Why did you choose these signals?]

---

## 5. Ready-to-Share Notebook Hygiene Checklist

### Before Sharing Your Notebook

> **"Your notebook is your reputation."**  
> A well-organized notebook shows rigor and professionalism.

### 5.1 Technical Hygiene

**Run-All Test:**
- [ ] Restart kernel and "Run All" completes without errors
- [ ] No deprecation warnings (or they're documented)
- [ ] Outputs are visible and formatted properly
- [ ] Random seeds produce consistent results

**Code Quality:**
- [ ] Imports organized at top (standard, third-party, custom)
- [ ] No unused imports or variables
- [ ] Functions have clear docstrings
- [ ] Variable names are descriptive
- [ ] Magic numbers replaced with named constants

**Data Quality:**
- [ ] Data source is documented
- [ ] Missing value handling is explicit
- [ ] Train/val/test splits are clearly labeled
- [ ] No data leakage between splits
- [ ] Feature engineering is reproducible

### 5.2 Communication Hygiene

**Structure:**
- [ ] Clear title and introduction
- [ ] Learning objectives stated upfront
- [ ] Logical section flow
- [ ] Summary/conclusion at end

**Documentation:**
- [ ] Markdown cells explain the "why" before code
- [ ] Key findings are highlighted
- [ ] Visualizations have titles and labels
- [ ] Tables are formatted and readable
- [ ] Assumptions are stated explicitly

**Professionalism:**
- [ ] No debug cells or commented-out code
- [ ] No placeholder text ("TODO", "FIXME")
- [ ] Consistent formatting throughout
- [ ] Bibliography and citations included
- [ ] Author and date documented

### 5.3 Reproducibility Hygiene

**Environment:**
- [ ] Package versions documented (requirements.txt or in notebook)
- [ ] Random seeds set and documented
- [ ] Data sources with URLs/paths
- [ ] Instructions for obtaining data

**Artifacts:**
- [ ] Model saved and loadable
- [ ] Configuration saved separately
- [ ] Feature names preserved
- [ ] Preprocessing steps documented

**Validation:**
- [ ] Test set untouched until final evaluation
- [ ] Cross-validation procedure documented
- [ ] Baseline model included for comparison
- [ ] Performance metrics clearly reported

### 5.4 Production Readiness Checklist

In [None]:
# Production readiness assessment
production_checklist = pd.DataFrame([
    {'Category': 'Reproducibility', 'Item': 'Pipeline save/load tested', 'Status': '✓', 'Notes': 'Verified with test data'},
    {'Category': 'Reproducibility', 'Item': 'Random seeds documented', 'Status': '✓', 'Notes': 'RANDOM_SEED = 474'},
    {'Category': 'Reproducibility', 'Item': 'Configuration externalized', 'Status': '✓', 'Notes': 'CONFIG dictionary'},
    {'Category': 'Testing', 'Item': 'Input validation', 'Status': '○', 'Notes': 'Need to add schema validation'},
    {'Category': 'Testing', 'Item': 'Error handling', 'Status': '○', 'Notes': 'Need try/except blocks'},
    {'Category': 'Testing', 'Item': 'Unit tests', 'Status': '○', 'Notes': 'Need test suite'},
    {'Category': 'Monitoring', 'Item': 'Monitoring plan defined', 'Status': '✓', 'Notes': '8 signals identified'},
    {'Category': 'Monitoring', 'Item': 'Logging infrastructure', 'Status': '○', 'Notes': 'Need to implement'},
    {'Category': 'Documentation', 'Item': 'Model card', 'Status': '○', 'Notes': 'Draft in progress'},
    {'Category': 'Documentation', 'Item': 'API documentation', 'Status': '○', 'Notes': 'Need to create'},
    {'Category': 'Security', 'Item': 'No hardcoded credentials', 'Status': '✓', 'Notes': 'N/A for this example'},
    {'Category': 'Security', 'Item': 'Input sanitization', 'Status': '○', 'Notes': 'Need to add'},
])

print("=== Production Readiness Assessment ===")
print(production_checklist.to_string(index=False))

# Summary
ready_count = (production_checklist['Status'] == '✓').sum()
total_count = len(production_checklist)
print(f"\nReadiness: {ready_count}/{total_count} items complete ({ready_count/total_count*100:.0f}%)")

# Save checklist
production_checklist.to_csv('production_readiness.csv', index=False)
print("✓ Checklist saved to production_readiness.csv")

**Reading the output:**

The production-readiness table shows **12 items** grouped into
Reproducibility, Testing, Monitoring, Documentation, and Security. Items
marked `✓` are already complete; items marked `○` still need work. The
summary line prints the overall readiness percentage. For a course project,
achieving 100 % is not expected -- the exercise is to *identify* the gaps,
not necessarily close all of them.

**Why this matters:** In industry, a checklist like this gates the
transition from 'model works in a notebook' to 'model runs in production'.
Knowing what remains undone is itself a sign of professional maturity.
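
The open "Input validation" item could start as small as the following sketch, which checks one incoming record against the feature names saved with the model. `validate_record` is a hypothetical helper, and treating every feature as numeric is an assumption that holds for the breast-cancer data but not in general.

```python
import math


def validate_record(record, expected_features):
    """Return a list of problems found in one input record (empty list means valid)."""
    problems = []
    missing = [name for name in expected_features if name not in record]
    extra = [name for name in record if name not in expected_features]
    if missing:
        problems.append(f"missing features: {missing}")
    if extra:
        problems.append(f"unexpected features: {extra}")
    for name in expected_features:
        if name not in record:
            continue  # already reported as missing
        value = record[name]
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: expected a number, got {type(value).__name__}")
        elif isinstance(value, float) and math.isnan(value):
            problems.append(f"{name}: value is NaN")
    return problems


# Two of the breast-cancer feature names, used here as a shortened schema
expected = ["mean radius", "mean texture"]
good_record = {"mean radius": 14.2, "mean texture": 19.1}
bad_record = {"mean radius": "14.2", "area": 600.0}

print(validate_record(good_record, expected))
print(validate_record(bad_record, expected))
```

In a deployed service the expected list would come from the `feature_names` entry stored in the metrics artifact, so the schema always matches the trained pipeline.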

---


### 5.5 Final Pre-Submission Checklist

This checklist consolidates every quality gate into one place. Running this
cell before submitting your project gives you a printable list you can tick
off: technical correctness, reproducibility, communication quality,
professionalism, and production readiness.


In [None]:
# Run this cell before submitting
print("=== PRE-SUBMISSION CHECKLIST ===")
print("\n1. TECHNICAL")
print("   [ ] Kernel restarted and Run All completed")
print("   [ ] No errors or warnings")
print("   [ ] All outputs visible")
print("\n2. REPRODUCIBILITY")
print("   [ ] Seeds fixed and documented")
print("   [ ] Model artifacts saved")
print("   [ ] Configuration externalized")
print("\n3. COMMUNICATION")
print("   [ ] Clear narrative structure")
print("   [ ] Key findings highlighted")
print("   [ ] Visualizations labeled")
print("\n4. PROFESSIONALISM")
print("   [ ] No debug code or TODOs")
print("   [ ] Consistent formatting")
print("   [ ] Bibliography included")
print("\n5. PRODUCTION READINESS")
print("   [ ] Monitoring plan defined")
print("   [ ] Risk assessment completed")
print("   [ ] Deployment checklist reviewed")
print("\n⚠️ Review each item before submitting your project!")

**Reading the output:**

The printed checklist organises final review into five categories:
**Technical** (kernel, errors, outputs), **Reproducibility** (seeds,
artifacts, config), **Communication** (narrative, findings, visuals),
**Professionalism** (no debug code, formatting, bibliography), and
**Production Readiness** (monitoring, risk, deployment). Running this cell
right before submission gives you a compact, scannable to-do list.

**Key takeaway:** Treat this checklist as a pre-flight ritual: it only
takes two minutes but catches the most common submission mistakes.

---


## 6. Wrap-Up: Key Takeaways

### What We Learned Today:

1. **Refactoring for Production**: Separate configuration from code, wrap logic in functions
2. **Model Persistence**: Save/load pipelines with joblib, verify reproducibility
3. **Monitoring Strategy**: Track data drift, performance drift, and calibration drift
4. **Production Checklist**: Comprehensive review before deployment
5. **Notebook Hygiene**: Make your work shareable and reproducible

### Next-Day Readiness:

- ✓ You can package a model pipeline for deployment
- ✓ You can save and load model artifacts
- ✓ You can define a monitoring plan
- ✓ You understand production readiness requirements
- ✓ You're ready for the next notebook: Executive Narrative

### Remember:

> **"Deployment is not the end. It's the beginning of model maintenance."**  
> Without monitoring and maintenance, even the best models decay.

---

## 7. Submission Instructions

### To Submit This Notebook:

1. **Run All Cells**: Execute `Runtime → Run all` to ensure everything works
2. **Save a Copy**: `File → Save a copy in Drive`
3. **Get Shareable Link**: Click `Share` and set to "Anyone with the link can view"
4. **Submit Link**: Paste the link in the LMS assignment

### Before Submitting, Check:

- [ ] All cells execute without errors
- [ ] All outputs are visible
- [ ] Exercise responses are complete
- [ ] Monitoring plan is documented
- [ ] Notebook is shared with correct permissions

---

## Bibliography

- Huyen, C. (2022). *Designing Machine Learning Systems*. O'Reilly Media.
- Lakshmanan, V., Robinson, S., & Munn, M. (2020). *Machine Learning Design Patterns*. O'Reilly Media.
- Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., & Lawrence, N. D. (2009). *Dataset Shift in Machine Learning*. MIT Press.
- Rabanser, S., Günnemann, S., & Lipton, Z. (2019). Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. *NeurIPS 2019*.
- scikit-learn User Guide: [Model persistence](https://scikit-learn.org/stable/model_persistence.html)
- scikit-learn User Guide: [Pipelines and composite estimators](https://scikit-learn.org/stable/modules/compose.html)

---



<center>

Thank you!

</center>