# SageMaker ML Pipeline Example

This notebook demonstrates how to use the SageMaker ML pipeline to train, compare, and deploy models.

## Setup

First, install the required dependencies:

In [None]:
!pip install -r requirements.txt -q

## 1. Generate Sample Data

In [None]:
import sys
sys.path.insert(0, 'src')

from pipeline.data_preparation import generate_sample_data, preprocess_for_sagemaker

# Generate sample data
train_df, test_df, feature_names, target_name = generate_sample_data(use_synthetic=True)

print(f"Training data: {train_df.shape}")
print(f"Test data: {test_df.shape}")
print(f"\nFeatures: {feature_names}")
print(f"Target: {target_name}")
train_df.head()

## 2. Preprocess for SageMaker

In [None]:
# Preprocess data (move target to first column, remove headers)
preprocess_for_sagemaker('data/train.csv', 'data/train_processed.csv')
preprocess_for_sagemaker('data/test.csv', 'data/test_processed.csv')
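
SageMaker's built-in XGBoost algorithm, for example, expects CSV training input with the label in the first column and no header row, which is why the step above reorders columns and drops headers. The following is a minimal sketch of that transformation using only the standard library. The function name `move_target_first` and the column names are hypothetical and are not taken from `pipeline.data_preparation`, whose actual implementation may differ.

```python
import csv
import io


def move_target_first(csv_text: str, target: str) -> str:
    """Return CSV text with the target column first and the header row dropped."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    target_idx = header.index(target)  # raises ValueError if the column is missing

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Target value first, then the remaining columns in their original order.
        reordered = [row[target_idx]] + row[:target_idx] + row[target_idx + 1:]
        writer.writerow(reordered)
    return out.getvalue()


# Hypothetical three-column sample; the real data has its own feature names.
sample = "rooms,age,price\n3,10,250000\n4,5,320000\n"
processed = move_target_first(sample, target="price")
print(processed)
```

The header is consumed but never written back, so the output contains data rows only.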

## 3. Set Up Training (Mock Mode)

In [None]:
from pipeline.training import ModelTrainer

# Initialize trainer
trainer = ModelTrainer(bucket='my-sagemaker-bucket', prefix='housing-demo')

# Get data paths
train_path, validation_path = trainer.get_data_paths()
print(f"Training data: {train_path}")
print(f"Validation data: {validation_path}")
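
The returned paths are presumably S3 locations derived from the `bucket` and `prefix` passed to `ModelTrainer`. As a rough sketch of how such URIs could be assembled, the helper below joins the pieces while avoiding doubled slashes. The function `s3_uri` and the channel names `train` and `validation` are assumptions for illustration; the layout that `get_data_paths()` actually uses is not shown in this notebook.

```python
def s3_uri(bucket: str, prefix: str, channel: str) -> str:
    """Join bucket, prefix, and channel into an s3:// URI without doubled slashes."""
    parts = [bucket.strip("/"), prefix.strip("/"), channel.strip("/")]
    # Skip empty parts so an empty prefix does not produce "//".
    return "s3://" + "/".join(part for part in parts if part)


# Hypothetical layout: one folder per data channel under the prefix.
example_train = s3_uri("my-sagemaker-bucket", "housing-demo", "train")
example_validation = s3_uri("my-sagemaker-bucket", "housing-demo/", "validation")
print(example_train)
print(example_validation)
```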

## 4. Compare Models

In [None]:
from pipeline.model_comparison import ModelComparator, create_mock_model_results

# Create mock results for demonstration
mock_results = create_mock_model_results()

# Compare models
comparator = ModelComparator()
best_model, best_result = comparator.compare_models(mock_results)

print(f"\nBest model: {best_model}")
print(f"Metrics: {best_result['metrics']}")
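
To make the comparison step concrete, here is a minimal sketch of selecting the best model when lower metric values are better. It assumes each result is a dict with a `metrics` mapping, as the `best_result['metrics']` lookup above suggests. The function `pick_best_model`, the metric name `rmse`, and the numbers are hypothetical; `ModelComparator` may rank models differently.

```python
def pick_best_model(results: dict, metric: str = "rmse"):
    """Return (name, result) for the model with the lowest value of `metric`."""
    if not results:
        raise ValueError("no model results to compare")
    best_name = min(results, key=lambda name: results[name]["metrics"][metric])
    return best_name, results[best_name]


# Made-up numbers for illustration only; not output from the pipeline.
example_results = {
    "xgboost": {"metrics": {"rmse": 0.42}},
    "knn": {"metrics": {"rmse": 0.57}},
    "sklearn-gbm": {"metrics": {"rmse": 0.45}},
}
name, result = pick_best_model(example_results)
print(name, result["metrics"])
```

For a score where higher is better (for example R²), the same idea applies with `max` in place of `min`.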

## 5. Run Complete Pipeline

In [None]:
from run_pipeline import run_pipeline

# Run the complete pipeline in mock mode
best_model, model_results = run_pipeline(
    bucket='my-sagemaker-bucket',
    prefix='housing-demo',
    models=['xgboost', 'knn', 'sklearn-gbm'],
    mock_training=True,  # Use mock for demonstration
)

## 6. Visualize Results

In [None]:
import json
import matplotlib.pyplot as plt

# Load comparison results
with open('model_comparison.json', 'r') as f:
    comparison = json.load(f)

# Extract metrics for visualization
models = []
metrics = []
for model_name, result in comparison['all_models'].items():
    models.append(model_name)
    model_metrics = result['metrics']
    # Simplification: plot the first metric listed for each model.
    # This assumes every model reports the same error metric first.
    metric_val = next(iter(model_metrics.values()))
    metrics.append(metric_val)

# Plot
plt.figure(figsize=(10, 6))
plt.bar(models, metrics, color=['#1f77b4', '#ff7f0e', '#2ca02c'])
plt.xlabel('Model')
plt.ylabel('Error Metric')
plt.title('Model Performance Comparison')
plt.axhline(y=min(metrics), color='r', linestyle='--', label='Best (lowest error)')
plt.legend()
plt.show()

print(f"\nBest performing model: {comparison['best_model']}")
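
The cell above plots whichever metric happens to be listed first for each model, which depends on key order in the JSON file. A slightly more robust approach is to select a metric by name. The sketch below assumes the same top-level shape the cell reads (`best_model` and `all_models`, each entry holding a `metrics` mapping). The metric names `rmse` and `mae` and all values are made up for illustration.

```python
import json

# Illustrative document in the shape the cell above reads; values are made up.
example_json = """
{
  "best_model": "xgboost",
  "all_models": {
    "xgboost": {"metrics": {"rmse": 0.42, "mae": 0.31}},
    "knn": {"metrics": {"rmse": 0.57, "mae": 0.44}}
  }
}
"""


def metric_by_model(comparison: dict, metric: str) -> dict:
    """Map each model name to the requested metric, skipping models that lack it."""
    return {
        model: result["metrics"][metric]
        for model, result in comparison["all_models"].items()
        if metric in result.get("metrics", {})
    }


rmse_by_model = metric_by_model(json.loads(example_json), "rmse")
print(rmse_by_model)
```

The resulting dict can be passed to `plt.bar` as `list(rmse_by_model)` and `list(rmse_by_model.values())`.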

## Next Steps

To run this with actual SageMaker training:

1. Set up AWS credentials.
2. Deploy the CDK infrastructure: `cd src/cdk && cdk deploy`
3. Run the pipeline with the `--no-mock` flag: `python src/run_pipeline.py --bucket <your-bucket> --no-mock`
4. Add the `--tune` flag to enable hyperparameter tuning.
5. Add the `--deploy` flag to deploy the best model for batch inference.
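
The flags listed above could be wired up with `argparse` roughly as follows. This is a sketch, not the contents of `src/run_pipeline.py`: only the flag names come from this notebook, while the `dest` name `mock_training`, the help strings, and the defaults are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags named in the steps above (hypothetical wiring)."""
    parser = argparse.ArgumentParser(description="Run the ML pipeline (sketch)")
    parser.add_argument("--bucket", required=True,
                        help="S3 bucket for data and artifacts")
    # store_false means mock_training defaults to True and --no-mock turns it off.
    parser.add_argument("--no-mock", dest="mock_training", action="store_false",
                        help="run real SageMaker training jobs")
    parser.add_argument("--tune", action="store_true",
                        help="enable hyperparameter tuning")
    parser.add_argument("--deploy", action="store_true",
                        help="deploy for batch inference")
    return parser


# Parse an explicit argument list so the example also runs inside a notebook.
args = build_parser().parse_args(
    ["--bucket", "my-sagemaker-bucket", "--no-mock", "--tune"]
)
print(args.mock_training, args.tune, args.deploy)
```

Defaulting to mock mode means a run without extra flags never starts billable training jobs, which matches the opt-in `--no-mock` naming.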