# Traditional ML with Amazon SageMaker AI 

## Workshop Overview

Welcome to this hands-on workshop on building traditional machine learning models with Amazon SageMaker! In this workshop, you'll learn how to:

- **Set up** your SageMaker environment with the latest Python SDK v3
- **Configure** MLflow for experiment tracking and model management
- **Prepare** and process data for machine learning
- **Train** an XGBoost model using SageMaker's ModelTrainer
- **Build** and package models with ModelBuilder
- **Deploy** models to real-time endpoints
- **Test** and validate your deployed model
- **Clean up** resources to avoid unnecessary costs

### Business Problem for the workshop

You'll work with a **bank marketing dataset** to predict whether a customer will subscribe to a term deposit based on:
- Demographics (age, job, marital status, education)
- Financial information (credit default, housing loan, personal loan)
- Campaign data (contact type, month, day of week, duration)
- Economic indicators (employment rate, consumer price index, etc.)

This is a **binary classification** problem that financial institutions commonly face, and it serves as the running example for this workshop.

### Dataset Information

- **Source**: UCI Machine Learning Repository - Bank Marketing Dataset
- **Size**: ~41,000 records with 20 features
- **Target**: Binary (yes/no) - Will the customer subscribe?
- **Citation**: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing.

### AWS Services Used

- **Amazon SageMaker AI Training**: Managed training infrastructure
- **Amazon SageMaker Endpoints**: Real-time model hosting
- **Amazon SageMaker AI MLflow**: Experiment tracking and model registry
- **Amazon S3**: Data and model artifact storage
- **AWS IAM**: Security and access management
- **SageMaker Python SDK v3**

### Estimated Time

- **Total**: 45-60 minutes
- **Training**: ~5 minutes
- **Deployment**: ~5-7 minutes

### Prerequisites

- AWS account with SageMaker Studio access

---

## Section 1: Prerequisites & Setup

### Key Concepts

**SageMaker Python SDK v3**: The latest version of the SageMaker SDK provides:
- Simplified APIs for training and deployment
- Better integration with MLflow
- Improved resource management
- Type hints and better IDE support
- See [documentation](https://sagemaker.readthedocs.io/en/stable/) for more


### Instructions

Run the cells below to set up your environment. The first cell may take 1-2 minutes to install packages.

In [None]:
# Install required packages
# This may take 1-2 minutes on first run
# Ignore dependency conflicts warnings and errors.
!pip install --upgrade pip -q
!pip install -Uq "sagemaker==3.3.1" "boto3==1.42.30" "sagemaker-core==2.3.1" \
    "sagemaker-mlops==1.3.1" "sagemaker-serve" "mlflow==3.4.0" \
    "sagemaker-mlflow==0.2.0" "pandas" "scikit-learn" "xgboost" --force-reinstall

In [None]:
# Import required libraries
import boto3
import sagemaker
import mlflow
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import zipfile
import os
from datetime import datetime

# SageMaker v3 imports
from sagemaker.train.model_trainer import ModelTrainer
from sagemaker.core.training.configs import SourceCode, InputData, Compute
from sagemaker.core.helper.session_helper import Session, get_execution_role
from sagemaker.core import image_uris
from sagemaker.serve.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder

print('✓ All libraries imported successfully')
print(f'MLflow version: {mlflow.__version__}')

In [None]:
# Initialize SageMaker session and get AWS configuration
sagemaker_session = Session()
region = sagemaker_session.boto_region_name
role = get_execution_role()
bucket = sagemaker_session.default_bucket()
prefix = 'bank-marketing-lab'

print('AWS Configuration:')
print(f'  Region: {region}')
print(f'  S3 Bucket: {bucket}')
print(f'  IAM Role: {role}')
print(f'  Data Prefix: {prefix}')
print('\n SageMaker session initialized successfully')

---

## Section 2: SageMaker AI MLflow App Configuration

### What You'll Learn

In this section, you'll:
1. Connect to your SageMaker AI MLflow app
2. Create or select an MLflow experiment
3. Understand MLflow's role in experiment tracking

### Key Concepts

**Amazon SageMaker AI MLflow App**: A fully managed SageMaker AI service that provides:
- **Experiment Tracking**: Log parameters, metrics, and artifacts
- **Model Registry**: Version and manage models
- **SageMaker AI integration**: Automatically register models to the SageMaker AI Model Registry
- **Reproducibility**: Track code versions and dependencies
- **Collaboration**: Share experiments across teams
- **Integration**: Works seamlessly with SageMaker training

**MLflow Experiment**: A logical grouping of runs that:
- Organizes related training attempts
- Enables comparison of different approaches
- Tracks the evolution of your model

**MLflow Run**: A single execution that logs:
- Hyperparameters (learning rate, tree depth, etc.)
- Metrics (accuracy, precision, recall, AUC)
- Artifacts (model files, plots, data samples)
- Metadata (start time, duration, user)

### Instructions

1. **Find your MLflow App ARN**:
   - In SageMaker Studio, navigate to the left sidebar
   - Click the "MLflow" panel under "Applications"
   - Find your MLflow app (usually named "DefaultMLFlowApp"). The first launch takes 1-2 minutes
   - Copy the ARN if you need to use a specific app

In [None]:
# Configure MLflow app connection
# Update this if you want to use a specific MLflow app
mlflow_app_name = 'DefaultMLFlowApp'

# Get MLflow app ARN
sm_client = boto3.client('sagemaker', region_name=region)
mlflow_list = sm_client.list_mlflow_apps()

print(f'Found {len(mlflow_list["Summaries"])} MLflow app(s) in your account:')
for app in mlflow_list['Summaries']:
    print(f'  - {app["Name"]}')

# Find the specified MLflow app
mlflow_app_arn = None
for mlflow_app in mlflow_list['Summaries']:
    if mlflow_app['Name'] == mlflow_app_name:
        mlflow_app_arn = mlflow_app['Arn']
        break

if mlflow_app_arn:
    print(f'\n Using MLflow app: {mlflow_app_name}')
    print(f'  ARN: {mlflow_app_arn}')
else:
    raise ValueError(f'MLflow app "{mlflow_app_name}" not found. Please check the name or create one in SageMaker Studio.')

In [None]:
# Set MLflow tracking URI and create/select experiment
mlflow.set_tracking_uri(mlflow_app_arn)
mlflow_experiment_name = 'bank-marketing-prediction'

try:
    # Try to create a new experiment
    experiment_id = mlflow.create_experiment(mlflow_experiment_name)
    print(f' Created new MLflow experiment: {mlflow_experiment_name}')
except mlflow.exceptions.MlflowException:
    # Creation fails when the experiment already exists, so look it up instead
    experiment_id = mlflow.get_experiment_by_name(mlflow_experiment_name).experiment_id
    print(f' Using existing MLflow experiment: {mlflow_experiment_name}')

# Make the experiment active in both cases
mlflow.set_experiment(mlflow_experiment_name)
print(f'  Experiment ID: {experiment_id}')

print('\n You can view your experiments in the SageMaker Studio MLflow App UI')

---

## Section 3: Data Preparation

In this section, you'll:
1. Download the bank marketing dataset
2. Explore and understand the data
3. Preprocess features (encoding categorical variables)
4. Split data into training and test sets
5. Upload data to Amazon S3

### Key Concepts

**Data Preprocessing**: Essential steps to prepare raw data for machine learning:
- **Categorical Encoding**: Convert text categories (job, education) to numbers
- **Train-Test Split**: Separate data to evaluate model performance
- **Stratification**: Maintain class balance in splits (important for imbalanced data)
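
The two ideas above, categorical encoding and stratified splitting, can be sketched in plain Python. This is only an illustration of the concepts; the helper names `label_encode` and `stratified_split` and the toy data are hypothetical, and the notebook itself relies on scikit-learn for both steps.

```python
import random
from collections import defaultdict


def label_encode(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy categorical column (hypothetical values)
jobs = ["admin", "technician", "admin", "services"]
encoded, mapping = label_encode(jobs)

# Toy imbalanced target: 90 negatives and 10 positives
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)
```

Because the split is done per class, the rare positive class keeps the same share in the test set as in the full data, which is the reason the notebook passes `stratify=y`.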

### Dataset Features

- **Client Demographics**: Includes age, job type, marital status, and education level
- **Financial Profile**: Tracks credit default status, housing loans, and personal loans (yes/no indicators)
- **Campaign Details**: Captures contact method, timing (month/day), call duration, number of contacts, and previous campaign outcomes
- **Economic Indicators**: Incorporates macroeconomic context including employment variation rate, consumer price/confidence indices, Euribor rate, and employment numbers
- **Target Variable**: Binary classification goal: predicting whether a client subscribed to a term deposit (yes/no)

### Instructions

Run the cells below to download, explore, and prepare the data.

In [None]:
# Download and extract the dataset
print('Downloading bank marketing dataset...')
!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o bank-additional.zip
print('\n✓ Dataset downloaded and extracted')

In [None]:
# Load data
df = pd.read_csv('bank-additional/bank-additional-full.csv', sep=';')
print(f'Shape: {df.shape}\nTarget distribution:')
print(df['y'].value_counts())
df.head()

In [None]:
# Encode categorical features
cat_cols = [c for c in df.select_dtypes(include=['object']).columns if c != 'y']
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

# Encode target
df['y'] = (df['y'] == 'yes').astype(int)
print(f'✓ Encoded {len(cat_cols)} features')

In [None]:
# Split data
X, y = df.drop('y', axis=1), df['y']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Save locally
os.makedirs('data', exist_ok=True)
pd.concat([y_train, X_train], axis=1).to_csv('data/train.csv', index=False, header=False)
pd.concat([y_test, X_test], axis=1).to_csv('data/test.csv', index=False, header=False)
print(f'Train: {X_train.shape}, Test: {X_test.shape}')
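
The files written above put the label in the first column and omit the header row, which is the layout the training script later reads back with `iloc[:, 0]` and `iloc[:, 1:]`. A small standard-library sketch of that layout follows; the function names and the sample rows are hypothetical and only illustrate the convention.

```python
import csv
import io


def to_label_first_csv(features, labels):
    """Serialize rows as 'label,feature_1,...,feature_n' with no header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row, label in zip(features, labels):
        writer.writerow([label, *row])
    return buffer.getvalue()


def split_label_first_csv(text):
    """Read the layout back: first column is the label, the rest are features."""
    labels, features = [], []
    for row in csv.reader(io.StringIO(text)):
        labels.append(int(row[0]))
        features.append([float(value) for value in row[1:]])
    return features, labels


# Two made-up records with three features each
csv_text = to_label_first_csv([[56, 3, 1.1], [37, 7, -1.8]], [0, 1])
features, labels = split_label_first_csv(csv_text)
```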

In [None]:
# Upload to S3
train_s3 = sagemaker_session.upload_data('data/train.csv', bucket, f'{prefix}/data/train')
test_s3 = sagemaker_session.upload_data('data/test.csv', bucket, f'{prefix}/data/test')
print(f'Train S3: {train_s3}\nTest S3: {test_s3}')

---

## Section 4: Model Training with ModelTrainer

### What You'll Learn

In this section, you'll:
1. Create a training script for XGBoost
2. Configure the ModelTrainer with hyperparameters
3. Launch a SageMaker training job
4. Monitor training progress and view results

### Key Concepts

**ModelTrainer (SageMaker SDK v3)**: Simplified training API that:
- Manages training infrastructure automatically
- Handles script packaging and dependencies
- Integrates with MLflow for tracking
- Provides intelligent defaults
- Supports distributed training

**SageMaker Training Jobs**: Managed training infrastructure that:
- Provisions compute instances automatically
- Downloads data from S3
- Runs your training script
- Uploads model artifacts to S3
- Terminates instances when the job finishes, so you pay only for the training time used


### Steps and Instructions

1. **Create training script**: We'll write a Python script that trains XGBoost
2. **Configure ModelTrainer**: Set up compute resources and hyperparameters
3. **Start training**: Launch the job (takes ~5 minutes)
4. **Monitor progress**: Watch the training job logs in SageMaker AI Studio in real time
5. **View results**: Check MLflow for metrics and artifacts

Run the cells below to train your model!


In [None]:
# Create training script directory
os.makedirs('scripts', exist_ok=True)

training_script = '''import argparse
import os
import json
import logging
import sys
import xgboost as xgb
import pandas as pd
import mlflow
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--max_depth', type=int, default=5)
    parser.add_argument('--eta', type=float, default=0.2)
    parser.add_argument('--gamma', type=int, default=4)
    parser.add_argument('--min_child_weight', type=int, default=6)
    parser.add_argument('--subsample', type=float, default=0.8)
    parser.add_argument('--num_round', type=int, default=100)
    return parser.parse_known_args()

if __name__ == '__main__':
    args, _ = parse_args()
    
    # Load data
    train_data = pd.read_csv('/opt/ml/input/data/train/train.csv', header=None)
    test_data = pd.read_csv('/opt/ml/input/data/test/test.csv', header=None)
    
    X_train, y_train = train_data.iloc[:, 1:], train_data.iloc[:, 0]
    X_test, y_test = test_data.iloc[:, 1:], test_data.iloc[:, 0]
    
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)
    # Set MLFlow specifics
    mlflow_app_arn = os.environ.get('MLFLOW_TRACKING_URI', None)
    mlflow_experiment_name = os.environ.get('MLFLOW_EXP', None)
    # MLflow setup
    mlflow.set_tracking_uri(mlflow_app_arn)
    mlflow.set_experiment(mlflow_experiment_name)
    
    # Enable autologging - captures everything automatically
    # mlflow.xgboost.autolog()
    mlflow.xgboost.autolog(
        log_input_examples=True,
        log_model_signatures=True,
        log_models=True,
        log_datasets=True,
        model_format="json",  # Recommended for portability
        registered_model_name="bank-prediction-XGBoostModel",
        extra_tags={"team": "data-science"},
    )
    
    # MLflow tracking
    with mlflow.start_run():
        params = {
            'max_depth': args.max_depth,
            'eta': args.eta,
            'gamma': args.gamma,
            'min_child_weight': args.min_child_weight,
            'subsample': args.subsample,
            'objective': 'binary:logistic',
            'eval_metric': 'auc'
        }
        
        mlflow.log_params(params)
        
        # Train
        model = xgb.train(params, dtrain, args.num_round, evals=[(dtest, 'test')])
        
        # Evaluate
        y_pred_proba = model.predict(dtest)
        y_pred = (y_pred_proba > 0.5).astype(int)
        
        metrics = {
            'accuracy': accuracy_score(y_test, y_pred),
            'precision': precision_score(y_test, y_pred),
            'recall': recall_score(y_test, y_pred),
            'f1': f1_score(y_test, y_pred),
            'auc': roc_auc_score(y_test, y_pred_proba)
        }
        
        mlflow.log_metrics(metrics)
        print(f'Metrics: {metrics}')
        
        # Save model
        model_path = '/opt/ml/model'
        os.makedirs(model_path, exist_ok=True)
        model.save_model(f'{model_path}/xgboost-model')
        mlflow.xgboost.log_model(
            model, 
            name="bank-prediction-XGBoostModel"
            )
'''

with open('scripts/train.py', 'w') as f:
    f.write(training_script)
    
print('✓ Training script created')
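
The `ModelTrainer` configuration later in this section points `SourceCode` at a `requirements.txt` inside `scripts/`, and the training script imports `mlflow`. A minimal sketch for creating that file is shown here. It assumes the XGBoost training container does not already ship these packages, and the version pins simply mirror the ones installed at the start of this notebook; adjust both if your container differs.

```python
import os

# Assumption: train.py needs MLflow plus the SageMaker MLflow plugin at runtime,
# and neither is preinstalled in the training image.
requirements = "\n".join([
    "mlflow==3.4.0",
    "sagemaker-mlflow==0.2.0",
]) + "\n"

# Write the file next to train.py so it is packaged with the source directory
os.makedirs("scripts", exist_ok=True)
requirements_path = os.path.join("scripts", "requirements.txt")
with open(requirements_path, "w") as f:
    f.write(requirements)

print(f"Wrote {requirements_path}")
```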

In [None]:
# Get XGBoost container image
xgboost_image = image_uris.retrieve(
    framework='xgboost',
    region=region,
    version='1.7-1', #3.0-5
    py_version="py311",
    image_scope='training',
    instance_type="ml.m5.xlarge",
)
print(f'XGBoost image: {xgboost_image}')

In [None]:
# Configure ModelTrainer with v3 API
source_code = SourceCode(
    source_dir='scripts',
    entry_script='train.py',
    requirements="requirements.txt"
)

compute = Compute(
    instance_type='ml.m5.xlarge',
    instance_count=1,
    volume_size_in_gb=30
)

hyperparameters = {
    'max_depth': 5,
    'eta': 0.2,
    'gamma': 4,
    'min_child_weight': 6,
    'subsample': 0.8,
    'num_round': 100
}

model_trainer = ModelTrainer(
    sagemaker_session=sagemaker_session,
    training_image=xgboost_image,
    source_code=source_code,
    compute=compute,
    hyperparameters=hyperparameters,
    base_job_name='bank-marketing-xgboost',
    environment={'MLFLOW_TRACKING_URI': mlflow_app_arn,
                 'MLFLOW_EXP': mlflow_experiment_name
                }
)

print('ModelTrainer configured')
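
The training script reads its hyperparameters with `argparse`, so the values configured above are expected to reach it as `--name value` command-line pairs. The sketch below illustrates that hand-off locally; `to_cli_args` is a hypothetical helper, and the exact way SageMaker forwards the values is an assumption here, not something this notebook verifies.

```python
import argparse


def build_parser():
    """A reduced version of the parser in train.py (three arguments only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--num_round", type=int, default=100)
    return parser


def to_cli_args(hyperparameters):
    """Flatten a dict into ['--key', 'value', ...]; every value becomes a string."""
    args = []
    for key, value in hyperparameters.items():
        args.extend([f"--{key}", str(value)])
    return args


cli_args = to_cli_args({"max_depth": 7, "eta": 0.1, "unknown_flag": "x"})

# parse_known_args tolerates arguments the script does not declare
args, unknown = build_parser().parse_known_args(cli_args)
print(args.max_depth, args.eta, args.num_round, unknown)
```

Using `parse_known_args` rather than `parse_args` means an extra argument does not crash the script, and any hyperparameter that is not supplied falls back to its declared default.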

In [None]:
# Start training
input_data_train = InputData(channel_name='train', data_source=train_s3)
input_data_test = InputData(channel_name='test', data_source=test_s3)

model_trainer.train(
    input_data_config=[input_data_train, input_data_test],
    wait=True
)
# In SageMaker Studio, open the training jobs page to find the job named "bank-marketing-xgboost*"
print(f' Training completed: {model_trainer._latest_training_job.training_job_name}')

### Understanding the Training Output

Now let's explore what was automatically logged to MLflow during training!

**Navigate to MLflow UI**
1. In SageMaker Studio, click on **MLflow** in the left sidebar
2. Find your MLflow app (DefaultMLFlowApp)
3. Click to open the MLflow UI

**View the Experiment**
1. In MLflow UI, click on **Experiments** tab
2. Find the experiment: **bank-marketing-prediction**
3. You'll see your training run listed with:
   - Run name and ID
   - Start time and duration
   - Status (Finished)
   - Metrics preview

**Explore the Run Details**

Click on the run to see comprehensive details:

- **Parameters Tab**: All hyperparameters used
- **Metrics Tab**: Training metrics over time
- **Artifacts Tab**: Model files (xgboost-model)
- **View Registered Model**: The model was automatically registered in the MLflow Model Registry!
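
The accuracy, precision, recall, and F1 values shown in the run come from thresholding the predicted probabilities at 0.5. The following dependency-free sketch shows how those numbers relate to the confusion-matrix counts. The function name and the sample data are hypothetical, and the training script itself uses scikit-learn for these metrics.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, recall, and F1 from thresholded probabilities."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))

    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels and probabilities for illustration
y_true = [1, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.4, 0.3, 0.6, 0.1, 0.8]
metrics = classification_metrics(y_true, y_prob)
print(metrics)
```

On an imbalanced target such as this dataset, accuracy alone can look good while recall on the positive class stays low, which is why the run also logs precision, recall, F1, and AUC.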

**Check SageMaker Model Registry Integration**

MLflow automatically registered the model with SageMaker AI Model Registry:

1. In SageMaker AI Studio, navigate to the **Models** section in the left sidebar
2. Look for the model package group under `My models`: **bank-prediction-XGBoostModel**

### Next Steps

1. **Compare Runs**: Train with different hyperparameters and compare in MLflow
2. **Promote Model**: Approve model in registry for production use
3. **Deploy Model**: Use the registered model for deployment

Now let's prepare the model for deployment using ModelBuilder!


---

## Section 5: Build and Deploy Model with ModelBuilder

### What You'll Learn

In this section, you'll:
1. Retrieve trained model artifacts from S3
2. Configure ModelBuilder for deployment
3. Create input/output schemas for the model
4. Build a deployable model package
5. Deploy the model to a real-time SageMaker endpoint

### Key Concepts

**ModelBuilder**: A SageMaker SDK v3 tool that:
- Packages trained models for deployment
- Handles container image selection
- Creates model schemas automatically
- Simplifies the deployment process
- Supports multiple model formats (XGBoost, PyTorch, TensorFlow, etc.)

**SageMaker Endpoints**: Real-time inference infrastructure that provides:
- **Always-on HTTPS endpoint**: Available 24/7 for predictions
- **Auto-scaling**: Handles variable traffic automatically
- **Load balancing**: Distributes requests across instances
- **Monitoring**: CloudWatch metrics for latency, errors, invocations
- **Security**: VPC support, encryption, IAM authentication


**Important**: Endpoints run continuously and incur charges:
- Always delete endpoints when not in use
- Use auto-scaling for variable workloads

### Instructions

1. **Get model artifacts**: Retrieve the S3 path from training job
2. **Configure ModelBuilder**: Set up inference image and model data
3. **Build model**: Package the model for deployment
4. **Deploy endpoint**: Launch the endpoint (takes 5-7 minutes)
5. **Verify deployment**: Check endpoint status

Run the cells below to build and deploy your model!

In [None]:
# Get model artifacts
training_job_name = model_trainer._latest_training_job.training_job_name
model_data_s3 = model_trainer._latest_training_job.model_artifacts.s3_model_artifacts
print(f'Model artifacts: {model_data_s3}, training job: {training_job_name}')

In [None]:
# Get inference image URI
inference_image = image_uris.retrieve(
    framework='xgboost',
    region=region,
    version='1.7-1',
    image_scope='inference'
)
print(f'Inference image: {inference_image}')

In [None]:
# Create ModelBuilder with trained model artifacts
model_builder = ModelBuilder(
    image_uri=inference_image,
    s3_model_data_url=model_data_s3,
    role_arn=role,
    sagemaker_session=sagemaker_session
)

In [None]:
# Build the model
model_name = f'bank-marketing-model-{datetime.now().strftime("%Y%m%d%H%M%S")}'
built_model = model_builder.build(model_name=model_name)

print(f'Model built: {built_model.model_name}')

In [None]:
# Deploy model to endpoint using ModelBuilder
endpoint_name = f'bank-marketing-{datetime.now().strftime("%Y%m%d%H%M%S")}'

endpoint = model_builder.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type='ml.m5.large',
    wait=True
)

print(f' Endpoint deployed: {endpoint.endpoint_name}')
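
The model and endpoint names above get a timestamp suffix so that re-running the notebook does not collide with existing resources. A small sketch of a reusable helper follows; `timestamped_name` is hypothetical, and the 63-character limit and allowed-character pattern reflect the SageMaker endpoint naming rules as I understand them, so treat them as assumptions to check against the API reference.

```python
import re
from datetime import datetime

# Assumed naming rules: alphanumerics and hyphens, no leading/trailing hyphen, max 63 chars
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_NAME_LENGTH = 63


def timestamped_name(base, now=None):
    """Return '<base>-<YYYYmmddHHMMSS>' or raise ValueError if the name is invalid."""
    now = now or datetime.now()
    name = f"{base}-{now.strftime('%Y%m%d%H%M%S')}"
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"Name is longer than {MAX_NAME_LENGTH} characters: {name}")
    if not NAME_PATTERN.match(name):
        raise ValueError(f"Name contains unsupported characters: {name}")
    return name


print(timestamped_name("bank-marketing"))
```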

> Verify the endpoint creation: in SageMaker AI Studio, go to `Deployments -> Endpoints` and confirm that the new endpoint is listed.

---

## Section 6: Test the SageMaker AI Endpoint with Predictions


In [None]:
# Prepare test data in CSV format (XGBoost expects CSV without headers)
test_sample = X_test.iloc[:5]
test_csv = test_sample.to_csv(header=False, index=False)

print('Test data (first 5 samples):')
print(test_csv[:200] + '...')

In [None]:
# Make predictions using endpoint.invoke()
response = endpoint.invoke(
    body=test_csv,
    content_type='text/csv'
)

# Parse predictions
predictions_raw = response.body.read().decode('utf-8')
predictions = [float(p) for p in predictions_raw.strip().split('\n')]

print('Sample Predictions:')
for i, pred in enumerate(predictions):
    print(f'  Sample {i+1}: {pred:.4f} (Class: {"Yes" if pred > 0.5 else "No"})')

In [None]:
# Compare with actual labels
actual_labels = y_test.iloc[:5].values
print('\nPrediction vs Actual:')
for i, (pred, actual) in enumerate(zip(predictions, actual_labels)):
    pred_class = 1 if pred > 0.5 else 0
    match = '✓' if pred_class == actual else '✗'
    print(f'  {match} Sample {i+1}: Predicted={pred_class}, Actual={actual}, Probability={pred:.4f}')
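
The parsing step above assumes one score per line. Depending on the container version, the response might instead separate scores with commas, so a slightly more tolerant parser can be useful. This is a sketch under that assumption; `parse_predictions` and `to_classes` are hypothetical helpers, not part of the SageMaker SDK.

```python
import re


def parse_predictions(payload):
    """Parse scores separated by commas, newlines, or other whitespace."""
    tokens = [token for token in re.split(r"[,\s]+", payload.strip()) if token]
    return [float(token) for token in tokens]


def to_classes(scores, threshold=0.5):
    """Convert probabilities to 0/1 class labels using a fixed threshold."""
    return [1 if score > threshold else 0 for score in scores]


# Both layouts yield the same list of floats
newline_scores = parse_predictions("0.12\n0.87\n0.45\n")
comma_scores = parse_predictions("0.12,0.87,0.45")
print(newline_scores, to_classes(newline_scores))
```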

## Section 7: Register Model to SageMaker AI Model Registry (Optional)

> **Note**: The trained model is already registered in MLflow. If the auto-registration flag is enabled on the SageMaker AI MLflow App, SageMaker also registers the model from MLflow in the SageMaker AI Model Registry automatically; the default MLflow app has this flag enabled. Alternatively, you can register the model directly in the SageMaker AI Model Registry, as shown below.

**SageMaker AI Model Registry**: A centralized model catalog that:
- **Versions models**: Track model iterations over time
- **Manages lifecycle**: Dev → Staging → Production stages
- **Enables governance**: Approval workflows and audit trails
- **Integrates with CI/CD**: Automated deployment pipelines
- **Provides lineage**: Track training data, code, and metrics
- **Supports multi-account**: Share models across AWS accounts
- **Auto-Registration**: When enabled in the MLflow app, models logged to MLflow are automatically synced to SageMaker
- **Model Package**: A SageMaker resource that contains model artifacts, model metrics, and metadata
- **Model Package Group**: A collection that groups related model versions together

In [None]:
# Register the built model in a new model package group
new_model_name = built_model.model_name + "-manual"
register_response = model_builder.register(
    model_package_group_name=new_model_name,
    # The endpoint in this workshop is invoked with CSV payloads
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"],
    approval_status="Approved"
)
print(f"New model registered: {register_response}")

> Verify the manual model registration. Go to the SageMaker AI Studio `Models -> My Models` and see the new model package

## Section 8: Clean Up (Optional)

Delete the endpoint, because it keeps incurring charges for as long as it is running.

In [None]:
response = endpoint.delete()
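
Deleting the endpoint may leave its endpoint configuration and model resource behind. As an alternative to the call above, the sketch below removes all three through the low-level SageMaker client. The function name is hypothetical, it expects a `boto3.client('sagemaker')` object to be passed in, and it assumes the endpoint still exists when it runs.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint together with its endpoint config and the models it references."""
    # Look up the config and models before deleting anything
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    # Delete in dependency order: endpoint, then config, then models
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)

    return {"endpoint": endpoint_name, "endpoint_config": config_name, "models": model_names}


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# delete_endpoint_resources(boto3.client("sagemaker"), endpoint_name)
```

S3 artifacts under the workshop prefix and the MLflow experiment are not touched by this sketch and would need to be removed separately if you no longer need them.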

## Workshop Summary: Traditional ML with Amazon SageMaker
This workshop provided hands-on experience with the end-to-end machine learning workflow using Amazon SageMaker. Participants learned to:
1. Train models using SageMaker's managed training infrastructure with ModelTrainer, leveraging on-demand compute (XGBoost on ml.m5.xlarge instances) without server management
2. Track experiments with the fully managed Amazon SageMaker MLflow App, logging hyperparameters, metrics (accuracy, AUC, F1), and model artifacts
3. Package models using ModelBuilder to prepare trained models for deployment with defined input/output schemas
4. Deploy endpoints as fully managed, real-time HTTPS endpoints for low-latency predictions
5. Manage model versions through automatic registration from MLflow to the SageMaker AI Model Registry

The workshop used the SageMaker Python SDK v3, highlighting its simplified APIs, reduced boilerplate code, and intelligent defaults.

You've successfully completed the Traditional ML with Amazon SageMaker workshop. You now have hands-on experience with:
- SageMaker's core training and deployment capabilities
- MLflow integration for experiment tracking
- Real-world ML workflow from data to deployment
- AWS best practices for ML operations

### Next Steps with SageMaker

**Explore More SageMaker Features:**
- Create SageMaker Pipelines for automated retraining
- Automate ML workflows with CI/CD
- Implement model approval workflows

Thank You! 🎉