# SageMaker XGBoost Classification Exercise

This notebook walks you through training Amazon SageMaker's **XGBoost** algorithm on synthetic customer churn data.

## What You'll Learn
1. How to prepare data in XGBoost's required CSV format
2. How to configure and train an XGBoost model on SageMaker
3. How to use Batch Transform for predictions
4. How to evaluate classification model performance

## Prerequisites
- SageMaker notebook instance or Studio, or local environment with AWS credentials
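- IAM role with S3 and SageMaker permissions
- Python packages: `sagemaker`, `boto3`, `pandas`, `numpy`, `matplotlib`, `scikit-learn`, `python-dotenv`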

---

## Step 1: Setup and Imports

In [1]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import os
from datetime import datetime
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Configure AWS session from environment variables
aws_profile = os.getenv('AWS_PROFILE')
aws_region = os.getenv('AWS_REGION', 'us-west-2')
sagemaker_role = os.getenv('SAGEMAKER_ROLE_ARN')

if aws_profile:
    boto3.setup_default_session(profile_name=aws_profile, region_name=aws_region)
else:
    boto3.setup_default_session(region_name=aws_region)

# SageMaker session and role
sagemaker_session = sagemaker.Session()

# Use environment variable for role, or fall back to execution role if running in SageMaker
if sagemaker_role:
    role = sagemaker_role
else:
    role = get_execution_role()

region = sagemaker_session.boto_region_name

print(f"AWS Profile: {aws_profile or 'default'}")
print(f"SageMaker Role: {role}")
print(f"Region: {region}")
print(f"SageMaker SDK Version: {sagemaker.__version__}")

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/david.jarmoluk/Library/Application Support/sagemaker/config.yaml
AWS Profile: brightech-secondary
SageMaker Role: arn:aws:iam::096816224238:role/service-role/AmazonSageMaker-ExecutionRole-20251113T125927
Region: us-west-2
SageMaker SDK Version: 2.255.0


In [2]:
# Configuration
BUCKET_NAME = sagemaker_session.default_bucket()
PREFIX = "xgboost"

# Dataset parameters
NUM_SAMPLES = 5000
TEST_RATIO = 0.2
RANDOM_STATE = 42

print(f"S3 Bucket: {BUCKET_NAME}")
print(f"S3 Prefix: {PREFIX}")

S3 Bucket: sagemaker-us-west-2-096816224238
S3 Prefix: xgboost


## Step 2: Generate Synthetic Data

We'll create a realistic customer churn dataset with:
- Customer demographics (age, tenure)
- Service usage patterns (monthly charges, support calls)
- Contract and payment information
- Binary churn target variable

In [None]:
def generate_customer_churn_data(num_samples=5000, seed=42):
    """
    Generate synthetic customer churn data for binary classification.
    
    Features are designed to have realistic relationships with churn:
    - Short tenure increases churn risk
    - Month-to-month contracts have higher churn
    - High support calls indicate dissatisfaction
    - Lack of tech support increases churn
    """
    np.random.seed(seed)
    
    # Customer demographics
    age = np.random.normal(45, 15, num_samples).clip(18, 80)
    tenure_months = np.random.exponential(24, num_samples).clip(1, 72)
    
    # Billing information
    monthly_charges = np.random.normal(70, 30, num_samples).clip(20, 150)
    total_charges = monthly_charges * tenure_months * np.random.uniform(0.9, 1.0, num_samples)
    
    # Service usage
    num_products = np.random.poisson(2.5, num_samples).clip(1, 6)
    support_calls = np.random.poisson(1.5, num_samples)
    
    # Contract type: 0=Month-to-month, 1=One year, 2=Two year
    contract_type = np.random.choice([0, 1, 2], num_samples, p=[0.5, 0.3, 0.2])
    
    # Payment method: 0=Electronic check, 1=Credit card, 2=Bank transfer, 3=Mailed check
    payment_method = np.random.choice([0, 1, 2, 3], num_samples, p=[0.35, 0.25, 0.25, 0.15])
    
    # Binary features
    has_tech_support = np.random.binomial(1, 0.4, num_samples)
    paperless_billing = np.random.binomial(1, 0.6, num_samples)
    auto_payment = np.random.binomial(1, 0.45, num_samples)
    
    # Generate churn based on realistic patterns
    churn_score = (
        -0.03 * tenure_months +           # Longer tenure = less churn
        0.015 * monthly_charges +          # Higher charges = more churn
        0.15 * support_calls +             # More support calls = more churn
        0.8 * (contract_type == 0) +       # Month-to-month = high churn
        0.4 * (payment_method == 0) +      # Electronic check = higher churn
        -0.5 * has_tech_support +          # Tech support = less churn
        -0.1 * num_products +              # More products = less churn
        np.random.normal(0, 0.5, num_samples)  # Random noise
    )
    
    # Convert to probability and generate labels
    churn_prob = 1 / (1 + np.exp(-churn_score))
    churn = (np.random.random(num_samples) < churn_prob).astype(int)
    
    # Combine features into a 2-D array (the label is returned separately
    # and is prepended as the first column when the training CSV is written)
    features = np.column_stack([
        age, tenure_months, monthly_charges, total_charges,
        num_products, support_calls, contract_type, payment_method,
        has_tech_support, paperless_billing, auto_payment
    ])
    
    feature_names = [
        'age', 'tenure_months', 'monthly_charges', 'total_charges',
        'num_products', 'support_calls', 'contract_type', 'payment_method',
        'has_tech_support', 'paperless_billing', 'auto_payment'
    ]
    
    return features.astype(np.float32), churn.astype(np.float32), feature_names

In [None]:
# Generate the dataset
print("Generating synthetic customer churn data...")
X, y, feature_names = generate_customer_churn_data(NUM_SAMPLES, RANDOM_STATE)

print(f"\nDataset shape: {X.shape}")
print(f"Churn rate: {y.mean():.1%}")
print(f"Features: {feature_names}")

In [None]:
def split_data(X, y, test_ratio=0.2, seed=42):
    """Randomly split data into train and test sets.

    The split is not stratified, so the class distribution is only
    approximately preserved in each subset.
    """
    np.random.seed(seed)
    indices = np.random.permutation(len(y))
    test_size = int(len(y) * test_ratio)
    
    test_idx = indices[:test_size]
    train_idx = indices[test_size:]
    
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Split the data
X_train, X_test, y_train, y_test = split_data(X, y, TEST_RATIO, RANDOM_STATE)

print(f"Training set: {len(y_train)} samples (churn rate: {y_train.mean():.1%})")
print(f"Test set: {len(y_test)} samples (churn rate: {y_test.mean():.1%})")
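
The split above shuffles indices without looking at the labels, so the churn rate in each subset can drift slightly from the overall rate. A stratified variant can be sketched as follows. This is a minimal standard-library illustration, and `stratified_split_indices` is a hypothetical helper that is not part of the notebook:

```python
import random
from collections import defaultdict

def stratified_split_indices(labels, test_ratio=0.2, seed=42):
    """Return (train_idx, test_idx) with every class split at the same ratio."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_test = int(len(idx) * test_ratio)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])

    # Shuffle again so rows are not grouped by class
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx

# Toy labels: 30 churners and 70 non-churners
labels = [1] * 30 + [0] * 70
train_idx, test_idx = stratified_split_indices(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx) / len(test_idx))
```

The returned index lists can be used directly with NumPy arrays, for example `X[train_idx]` and `y[test_idx]`. scikit-learn's `train_test_split(..., stratify=y)` offers the same behavior if you prefer a library call.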

## Step 3: Visualize the Data

In [None]:
# Create DataFrame for visualization
df_viz = pd.DataFrame(X_train, columns=feature_names)
df_viz['churn'] = y_train

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Churn distribution
ax = axes[0, 0]
churn_counts = df_viz['churn'].value_counts().sort_index()
ax.bar(['No Churn', 'Churn'], churn_counts.values, color=['steelblue', 'coral'])
ax.set_title('Churn Distribution', fontsize=12, fontweight='bold')
ax.set_ylabel('Count')
for i, v in enumerate(churn_counts.values):
    ax.text(i, v + 50, f'{v}\n({v/len(df_viz):.1%})', ha='center')

# Tenure by churn
ax = axes[0, 1]
df_viz[df_viz['churn']==0]['tenure_months'].hist(bins=30, alpha=0.6, label='No Churn', ax=ax, color='steelblue')
df_viz[df_viz['churn']==1]['tenure_months'].hist(bins=30, alpha=0.6, label='Churn', ax=ax, color='coral')
ax.set_title('Tenure Distribution by Churn', fontsize=12, fontweight='bold')
ax.set_xlabel('Tenure (months)')
ax.legend()

# Monthly charges by churn
ax = axes[0, 2]
df_viz[df_viz['churn']==0]['monthly_charges'].hist(bins=30, alpha=0.6, label='No Churn', ax=ax, color='steelblue')
df_viz[df_viz['churn']==1]['monthly_charges'].hist(bins=30, alpha=0.6, label='Churn', ax=ax, color='coral')
ax.set_title('Monthly Charges by Churn', fontsize=12, fontweight='bold')
ax.set_xlabel('Monthly Charges ($)')
ax.legend()

# Contract type vs churn
ax = axes[1, 0]
contract_churn = df_viz.groupby('contract_type')['churn'].mean()
contract_labels = ['Month-to-month', 'One year', 'Two year']
ax.bar(contract_labels, contract_churn.values, color='steelblue')
ax.set_title('Churn Rate by Contract Type', fontsize=12, fontweight='bold')
ax.set_ylabel('Churn Rate')
for i, v in enumerate(contract_churn.values):
    ax.text(i, v + 0.02, f'{v:.1%}', ha='center')

# Support calls vs churn
ax = axes[1, 1]
support_churn = df_viz.groupby('support_calls')['churn'].mean()
ax.bar(support_churn.index, support_churn.values, color='coral')
ax.set_title('Churn Rate by Support Calls', fontsize=12, fontweight='bold')
ax.set_xlabel('Number of Support Calls')
ax.set_ylabel('Churn Rate')

# Tech support vs churn
ax = axes[1, 2]
tech_churn = df_viz.groupby('has_tech_support')['churn'].mean()
ax.bar(['No Tech Support', 'Has Tech Support'], tech_churn.values, color=['coral', 'steelblue'])
ax.set_title('Churn Rate by Tech Support', fontsize=12, fontweight='bold')
ax.set_ylabel('Churn Rate')
for i, v in enumerate(tech_churn.values):
    ax.text(i, v + 0.02, f'{v:.1%}', ha='center')

plt.tight_layout()
plt.show()

print("\nKey Observations:")
print(f"- Month-to-month contracts have {contract_churn[0]:.1%} churn vs {contract_churn[2]:.1%} for two-year")
print(f"- Customers without tech support churn at {tech_churn[0]:.1%} vs {tech_churn[1]:.1%} with support")

## Step 4: Prepare Data for SageMaker XGBoost

SageMaker's XGBoost algorithm expects data in CSV format:
- **Training**: Label in the first column, followed by features (no header)
- **Inference**: Features only (no label column)

In [None]:
# Create local data directory
os.makedirs('data', exist_ok=True)

def save_csv_for_training(X, y, filepath):
    """Save data in SageMaker XGBoost CSV format (label first, no header)."""
    data = np.column_stack([y.reshape(-1, 1), X])
    np.savetxt(filepath, data, delimiter=',', fmt='%.6f')

def save_csv_for_inference(X, filepath):
    """Save features only for batch transform inference."""
    np.savetxt(filepath, X, delimiter=',', fmt='%.6f')

# Save training data (label + features)
save_csv_for_training(X_train, y_train, 'data/train.csv')

# Save test features only (for batch transform)
save_csv_for_inference(X_test, 'data/test_features.csv')

# Save test labels locally for evaluation
np.savetxt('data/test_labels.csv', y_test, delimiter=',', fmt='%.0f')

print("Data saved locally:")
print(f"  - data/train.csv ({os.path.getsize('data/train.csv') / 1024:.1f} KB)")
print(f"  - data/test_features.csv ({os.path.getsize('data/test_features.csv') / 1024:.1f} KB)")
print(f"  - data/test_labels.csv ({os.path.getsize('data/test_labels.csv') / 1024:.1f} KB)")

In [None]:
# Examine the data format
print("Sample training data (first 3 rows):")
print("Format: label, feature1, feature2, ..., feature11")
print("="*70)
with open('data/train.csv', 'r') as f:
    for i, line in enumerate(f):
        if i >= 3:
            break
        print(line.strip())
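
Beyond eyeballing the first rows, it can help to check the whole file against the format rules before uploading. The sketch below is one possible check, not a SageMaker requirement: `check_training_rows` is a hypothetical helper, and the allowed label set assumes the binary churn target used here.

```python
import csv
import io

def check_training_rows(lines, num_features, allowed_labels=(0.0, 1.0)):
    """Validate label-first, header-free numeric CSV rows; return the row count."""
    rows = 0
    for line_no, row in enumerate(csv.reader(lines), start=1):
        if len(row) != num_features + 1:
            raise ValueError(
                f"line {line_no}: expected {num_features + 1} columns, got {len(row)}"
            )
        try:
            values = [float(v) for v in row]
        except ValueError:
            # A header row such as "label,age,..." ends up here
            raise ValueError(f"line {line_no}: non-numeric value (header row?)") from None
        if values[0] not in allowed_labels:
            raise ValueError(f"line {line_no}: unexpected label {values[0]}")
        rows += 1
    return rows

# Two well-formed rows: label first, then two features
sample = io.StringIO("1.000000,45.000000,12.000000\n0.000000,33.000000,60.000000\n")
print(check_training_rows(sample, num_features=2))
```

For the file written in this step, the call would look like `check_training_rows(open('data/train.csv', newline=''), num_features=len(feature_names))`.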

In [None]:
# Upload to S3
s3_client = boto3.client('s3')

train_s3_path = f"{PREFIX}/train/train.csv"
test_s3_path = f"{PREFIX}/test/test_features.csv"

s3_client.upload_file('data/train.csv', BUCKET_NAME, train_s3_path)
s3_client.upload_file('data/test_features.csv', BUCKET_NAME, test_s3_path)

train_s3_uri = f"s3://{BUCKET_NAME}/{train_s3_path}"
test_s3_uri = f"s3://{BUCKET_NAME}/{test_s3_path}"

print("Data uploaded to S3:")
print(f"  Train: {train_s3_uri}")
print(f"  Test:  {test_s3_uri}")

## Step 5: Configure and Train the XGBoost Model

### Key Hyperparameters

**objective** (Required)
- Specifies the learning task and loss function
- `binary:logistic`: Binary classification, outputs probability
- `multi:softmax`: Multiclass classification, outputs class
- `reg:squarederror`: Regression with squared error loss

**num_round** (Required)
- Number of boosting rounds (trees to build)
- More rounds can improve accuracy but risk overfitting
- Typical range: 50-500; early stopping on a validation set can find a good value

**max_depth**
- Maximum depth of each tree
- Controls model complexity and overfitting
- Deeper trees can capture more complex patterns but overfit more easily
- Typical range: 3-10 (default: 6)

**eta (learning_rate)**
- Step size shrinkage to prevent overfitting
- Lower values require more boosting rounds but generalize better
- Typical range: 0.01-0.3 (default: 0.3)
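
To see why a lower `eta` needs more boosting rounds, consider an idealized case in which every new tree fits the current residual exactly and only a fraction `eta` of it is added to the model. The remaining residual then shrinks by a factor of `(1 - eta)` per round. Real trees do not fit residuals exactly, so treat this as intuition rather than a prediction of actual round counts; `rounds_to_converge` is a hypothetical helper.

```python
def rounds_to_converge(eta, tolerance=0.01, max_rounds=10_000):
    """Count rounds until the remaining residual fraction drops below tolerance."""
    residual = 1.0
    for n in range(1, max_rounds + 1):
        # The tree fits the residual exactly, but only eta of it is applied
        residual -= eta * residual
        if residual < tolerance:
            return n
    return max_rounds

for eta in (0.3, 0.1, 0.01):
    print(f"eta={eta}: {rounds_to_converge(eta)} rounds to remove 99% of the residual")
```

In this toy setting the counts come out at roughly 13, 44, and 459 rounds, which is why `eta` and `num_round` are usually tuned together.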

**subsample**
- Fraction of training samples used per tree
- Adds randomness to prevent overfitting (like bagging)
- Typical range: 0.5-1.0 (default: 1.0)

**colsample_bytree**
- Fraction of features used per tree
- Adds randomness and can improve generalization
- Typical range: 0.5-1.0 (default: 1.0)

**min_child_weight**
- Minimum sum of instance weight needed in a child node
- Higher values prevent learning overly specific patterns
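- For `binary:logistic`, each sample contributes its hessian p(1 - p), so this is a minimum on the summed hessians rather than a sample count (the two coincide only for squared-error regression)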
- Typical range: 1-10 (default: 1)
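
The "instance weight" that XGBoost sums for `min_child_weight` is the second derivative (hessian) of the loss. For logistic loss that is p(1 - p) per sample, so confidently predicted samples contribute very little. A small sketch, with `hessian_sum_logistic` as a hypothetical helper:

```python
def hessian_sum_logistic(probabilities):
    """Sum of p * (1 - p), the per-sample hessian of logistic loss."""
    return sum(p * (1.0 - p) for p in probabilities)

uncertain = [0.5] * 10    # model is unsure about these samples
confident = [0.99] * 10   # model is already confident about these

print(f"10 uncertain samples: {hessian_sum_logistic(uncertain):.3f}")
print(f"10 confident samples: {hessian_sum_logistic(confident):.3f}")
```

With `min_child_weight` set to 3, neither group of ten samples could form a child node on its own: the uncertain group sums to 2.5 and the confident group to about 0.1.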

**gamma (min_split_loss)**
- Minimum loss reduction required to make a split
- Acts as regularization: higher values = more conservative
- Typical range: 0-5 (default: 0)

**alpha (reg_alpha)**
- L1 regularization on leaf weights
- Encourages sparsity (many zero weights)
- Useful for high-dimensional data
- Default: 0

**lambda (reg_lambda)**
- L2 regularization on leaf weights
- Smooths weights, reduces overfitting
- Default: 1
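
The roles of `gamma` and `lambda` are easiest to see in the split gain that XGBoost evaluates for each candidate split, computed from the sums of gradients (G) and hessians (H) on each side. The sketch below follows the formula from the XGBoost paper and leaves out `alpha` for brevity; the function names and the example numbers are illustrative only.

```python
def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score of one leaf: G^2 / (H + lambda)."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node, minus the gamma penalty."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma

# Illustrative gradient and hessian sums for a candidate split
args = dict(g_left=-4.0, h_left=2.0, g_right=3.0, h_right=2.0)

print(f"lambda=1,  gamma=0:   {split_gain(**args):.4f}")
print(f"lambda=1,  gamma=0.1: {split_gain(**args, gamma=0.1):.4f}")
print(f"lambda=10, gamma=0:   {split_gain(**args, reg_lambda=10.0):.4f}")
```

A split is only worth making when the gain is positive, so a larger `gamma` rejects more splits outright, while a larger `lambda` shrinks every leaf score and thereby the gain.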

**scale_pos_weight**
- Balance the positive and negative class weights
- For imbalanced datasets, set to: sum(negative) / sum(positive)
- Helps the model pay more attention to the minority class

**eval_metric**
- Metric used for validation and early stopping
- `auc`: Area under ROC curve (good for imbalanced data)
- `error`: Classification error rate
- `logloss`: Negative log-likelihood

In [None]:
# Get the XGBoost container image
xgboost_image = retrieve(
    framework='xgboost',
    region=region,
    version='1.5-1'  # Use a stable version
)

print(f"XGBoost Image URI: {xgboost_image}")

In [None]:
# Define the estimator
xgboost_estimator = Estimator(
    image_uri=xgboost_image,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/output',
    sagemaker_session=sagemaker_session,
    base_job_name='xgboost-churn'
)

In [None]:
# Calculate scale_pos_weight for imbalanced data
# This helps the model pay more attention to the minority class (churn)
num_negative = (y_train == 0).sum()
num_positive = (y_train == 1).sum()
scale_pos_weight = float(num_negative / num_positive)  # plain Python float for the hyperparameter dict

print(f"Class distribution: {num_negative} negative, {num_positive} positive")
print(f"Calculated scale_pos_weight: {scale_pos_weight:.2f}")

In [None]:
# Set hyperparameters
hyperparameters = {
    # Objective and evaluation
    "objective": "binary:logistic",
    "eval_metric": "auc",
    
    # Number of boosting rounds
    "num_round": 100,
    
    # Tree parameters
    "max_depth": 5,
    "eta": 0.1,                    # Learning rate
    "min_child_weight": 3,
    "gamma": 0.1,                  # Min split loss
    
    # Sampling parameters (regularization)
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    
    # Regularization
    "alpha": 0.1,                  # L1 regularization
    "lambda": 1.0,                 # L2 regularization
    
    # Handle class imbalance
    "scale_pos_weight": scale_pos_weight,
}

xgboost_estimator.set_hyperparameters(**hyperparameters)

print("Hyperparameters configured:")
for k, v in hyperparameters.items():
    print(f"  {k}: {v}")

In [None]:
# Define training input
train_input = TrainingInput(
    s3_data=train_s3_uri,
    content_type='text/csv'
)

print("Starting training job...")
print("This will take approximately 3-5 minutes.\n")

# Start training
xgboost_estimator.fit({'train': train_input}, wait=True, logs=True)

In [None]:
# Get training job info
training_job_name = xgboost_estimator.latest_training_job.name
print(f"Training job completed: {training_job_name}")
print(f"Model artifacts: {xgboost_estimator.model_data}")

## Step 6: Run Batch Transform

Instead of deploying a real-time endpoint, we use Batch Transform for predictions on the test set. This is more cost-effective for batch predictions.

In [None]:
# Create transformer from the trained model
transformer = xgboost_estimator.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/batch-predictions'
)

print("Starting batch transform job...")
print("This will take approximately 3-5 minutes.\n")

# Run batch inference
transformer.transform(
    data=test_s3_uri,
    content_type='text/csv',
    split_type='Line',
    wait=True
)

print(f"\nPredictions written to: {transformer.output_path}")

## Step 7: Download and Parse Predictions

In [None]:
# Download predictions from S3
prediction_key = f"{PREFIX}/batch-predictions/test_features.csv.out"

s3_client.download_file(BUCKET_NAME, prediction_key, 'data/predictions.csv')

# Load predictions (XGBoost outputs probabilities for binary:logistic)
y_pred_proba = np.loadtxt('data/predictions.csv')
y_pred = (y_pred_proba >= 0.5).astype(int)  # same >= rule as the threshold analysis in Step 9

print(f"Loaded predictions for {len(y_pred)} samples")
print(f"Prediction distribution: {np.bincount(y_pred)}")
print(f"Probability range: [{y_pred_proba.min():.4f}, {y_pred_proba.max():.4f}]")

## Step 8: Evaluate Model Performance

### Understanding Classification Metrics

**Accuracy**
- Percentage of correct predictions
- Can be misleading with imbalanced classes
- If 90% of customers don't churn, predicting "no churn" for everyone gives 90% accuracy!

**Precision**
- Of all customers predicted to churn, how many actually churned?
- High precision = few false alarms
- Important when the cost of intervention is high

**Recall (Sensitivity)**
- Of all customers who actually churned, how many did we catch?
- High recall = we catch most churners
- Important when missing a churner is costly

**F1 Score**
- Harmonic mean of precision and recall
- Balances both metrics
- Useful when you need a single metric for imbalanced data

**AUC-ROC**
- Area Under the Receiver Operating Characteristic curve
- Measures discrimination ability across all thresholds
- 0.5 = random guessing, 1.0 = perfect
- Good for imbalanced datasets
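
One way to read AUC-ROC is as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it directly from that definition by comparing every positive/negative pair. It is quadratic in the number of samples and meant for intuition only; `auc_by_pairs` is a hypothetical helper.

```python
def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(y_true, scores))  # 0.75: three of the four pairs are ordered correctly
```

Because only the ordering of scores matters, AUC is unaffected by the choice of classification threshold.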

**Confusion Matrix**
- Shows True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN)
- Helps understand which types of errors the model makes
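
Accuracy, precision, recall, and F1 can all be derived from the four confusion-matrix counts. The sketch below works through the "predict no churn for everyone" case with made-up counts for 100 customers. `classification_metrics` is a hypothetical helper that returns 0 when a denominator is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the headline metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# 100 customers, 10 of whom churn; the model predicts "no churn" for everyone
always_no_churn = classification_metrics(tp=0, fp=0, tn=90, fn=10)

# A model that flags 20 customers and catches 8 of the 10 churners
useful_model = classification_metrics(tp=8, fp=12, tn=78, fn=2)

for name, m in [("always no churn", always_no_churn), ("useful model", useful_model)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The trivial model scores higher on accuracy (0.90 vs 0.86) even though it catches no churners at all, which is why recall and F1 matter for imbalanced data.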

In [None]:
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    roc_curve, precision_recall_curve, average_precision_score
)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_proba)
avg_precision = average_precision_score(y_test, y_pred_proba)

print("="*60)
print("MODEL EVALUATION RESULTS")
print("="*60)
print(f"\nAccuracy:           {accuracy:.4f}")
print(f"Precision:          {precision:.4f}")
print(f"Recall:             {recall:.4f}")
print(f"F1 Score:           {f1:.4f}")
print(f"ROC AUC:            {auc:.4f}")
print(f"Average Precision:  {avg_precision:.4f}")

print("\n" + "="*60)
print("CLASSIFICATION REPORT")
print("="*60)
print(classification_report(y_test, y_pred, target_names=['No Churn', 'Churn']))

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

print("="*60)
print("CONFUSION MATRIX")
print("="*60)
print(f"\nTrue Negatives (correct no-churn):  {tn}")
print(f"False Positives (false alarm):       {fp}")
print(f"False Negatives (missed churn):      {fn}")
print(f"True Positives (caught churn):       {tp}")

print(f"\nSpecificity (TN rate):  {tn / (tn + fp):.4f}")
print(f"False Positive Rate:    {fp / (tn + fp):.4f}")

In [None]:
# Visualize results
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# 1. Confusion Matrix Heatmap
ax = axes[0, 0]
im = ax.imshow(cm, cmap='Blues')
ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xticklabels(['No Churn', 'Churn'])
ax.set_yticklabels(['No Churn', 'Churn'])
ax.set_xlabel('Predicted', fontsize=12)
ax.set_ylabel('Actual', fontsize=12)
ax.set_title('Confusion Matrix', fontsize=14, fontweight='bold')
for i in range(2):
    for j in range(2):
        ax.text(j, i, str(cm[i, j]), ha='center', va='center', fontsize=20, fontweight='bold')
plt.colorbar(im, ax=ax)

# 2. ROC Curve
ax = axes[0, 1]
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
ax.plot(fpr, tpr, 'b-', linewidth=2, label=f'XGBoost (AUC = {auc:.3f})')
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
ax.fill_between(fpr, tpr, alpha=0.2)
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curve', fontsize=14, fontweight='bold')
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)

# 3. Precision-Recall Curve
ax = axes[1, 0]
precision_curve, recall_curve, _ = precision_recall_curve(y_test, y_pred_proba)
ax.plot(recall_curve, precision_curve, 'b-', linewidth=2, label=f'XGBoost (AP = {avg_precision:.3f})')
ax.axhline(y=y_test.mean(), color='k', linestyle='--', linewidth=1, label=f'Baseline ({y_test.mean():.2f})')
ax.fill_between(recall_curve, precision_curve, alpha=0.2)
ax.set_xlabel('Recall', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision-Recall Curve', fontsize=14, fontweight='bold')
ax.legend(loc='lower left')
ax.grid(True, alpha=0.3)

# 4. Prediction Distribution
ax = axes[1, 1]
ax.hist(y_pred_proba[y_test == 0], bins=50, alpha=0.6, label='No Churn (Actual)', color='steelblue', density=True)
ax.hist(y_pred_proba[y_test == 1], bins=50, alpha=0.6, label='Churn (Actual)', color='coral', density=True)
ax.axvline(x=0.5, color='black', linestyle='--', linewidth=2, label='Threshold (0.5)')
ax.set_xlabel('Predicted Churn Probability', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Prediction Distribution', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Step 9: Threshold Analysis

The default threshold of 0.5 may not be optimal for your business case. Let's analyze how different thresholds affect precision and recall.

In [None]:
# Analyze different thresholds
thresholds = np.linspace(0.1, 0.9, 17)
results = []

for thresh in thresholds:
    y_pred_thresh = (y_pred_proba >= thresh).astype(int)
    results.append({
        'threshold': thresh,
        'precision': precision_score(y_test, y_pred_thresh, zero_division=0),
        'recall': recall_score(y_test, y_pred_thresh, zero_division=0),
        'f1': f1_score(y_test, y_pred_thresh, zero_division=0),
        'predicted_positive': y_pred_thresh.sum()
    })

results_df = pd.DataFrame(results)

# Find optimal thresholds
best_f1_idx = results_df['f1'].idxmax()
best_f1_thresh = results_df.loc[best_f1_idx, 'threshold']

print("Threshold Analysis:")
print(results_df.to_string(index=False))
print(f"\nOptimal threshold for F1 score: {best_f1_thresh:.2f}")

In [None]:
# Visualize threshold trade-offs
fig, ax = plt.subplots(figsize=(12, 6))

ax.plot(results_df['threshold'], results_df['precision'], 'b-', linewidth=2, marker='o', label='Precision')
ax.plot(results_df['threshold'], results_df['recall'], 'r-', linewidth=2, marker='s', label='Recall')
ax.plot(results_df['threshold'], results_df['f1'], 'g-', linewidth=2, marker='^', label='F1 Score')
ax.axvline(x=0.5, color='gray', linestyle='--', alpha=0.7, label='Default (0.5)')
ax.axvline(x=best_f1_thresh, color='green', linestyle=':', linewidth=2, label=f'Best F1 ({best_f1_thresh:.2f})')

ax.set_xlabel('Classification Threshold', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Precision-Recall Trade-off by Threshold', fontsize=14, fontweight='bold')
ax.legend(loc='center right')
ax.grid(True, alpha=0.3)
ax.set_xlim(0.05, 0.95)

plt.tight_layout()
plt.show()

print("\nBusiness Considerations:")
print("- Lower threshold: Catch more churners (high recall) but more false alarms (low precision)")
print("- Higher threshold: Fewer false alarms (high precision) but miss more churners (low recall)")
print("- Choose based on: cost of intervention vs cost of losing a customer")
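
If you can put rough numbers on those two costs, the threshold can be chosen by minimizing total cost instead of maximizing F1. The sketch below assumes a fixed cost per false positive (an unnecessary retention offer) and per false negative (a lost customer). The cost values, the toy data, and both helper functions are hypothetical.

```python
def total_cost(y_true, y_proba, threshold, cost_fp, cost_fn):
    """Sum the cost of false positives and false negatives at one threshold."""
    cost = 0.0
    for y, p in zip(y_true, y_proba):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 0:
            cost += cost_fp      # intervention wasted on a loyal customer
        elif predicted == 0 and y == 1:
            cost += cost_fn      # churner we failed to flag
    return cost

def cheapest_threshold(y_true, y_proba, thresholds, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    return min(thresholds,
               key=lambda t: total_cost(y_true, y_proba, t, cost_fp, cost_fn))

# Toy labels and predicted churn probabilities
y_true = [0, 0, 0, 1, 0, 1, 1]
y_proba = [0.05, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]

# Losing a customer is 10x worse than a wasted offer: a low threshold wins
print(cheapest_threshold(y_true, y_proba, thresholds, cost_fp=10, cost_fn=100))
# A wasted offer is 10x worse than losing a customer: a high threshold wins
print(cheapest_threshold(y_true, y_proba, thresholds, cost_fp=100, cost_fn=10))
```

With the notebook's variables, the same functions could be called with `y_test`, `y_pred_proba`, and the `thresholds` array from the analysis above.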

## Step 10: Deploy Model (Optional)

If you need real-time predictions, you can deploy the model to an endpoint.

In [None]:
# Uncomment to deploy the model
# WARNING: This will incur ongoing charges until deleted!

# print("Deploying model to endpoint...")
# print("This will take approximately 5-7 minutes.\n")

# predictor = xgboost_estimator.deploy(
#     initial_instance_count=1,
#     instance_type='ml.m5.large',
#     endpoint_name=f'xgboost-churn-endpoint-{datetime.now().strftime("%Y%m%d%H%M")}'
# )

# print(f"\nEndpoint deployed: {predictor.endpoint_name}")

In [None]:
# Example: Make predictions with deployed endpoint
# Uncomment if you deployed the model above

# from sagemaker.serializers import CSVSerializer
# from sagemaker.deserializers import CSVDeserializer

# predictor.serializer = CSVSerializer()
# predictor.deserializer = CSVDeserializer()

# # Make prediction for a single customer
# sample_customer = X_test[0:1]
# result = predictor.predict(sample_customer)
# print(f"Churn probability: {float(result[0][0]):.4f}")

In [None]:
# Delete endpoint when done
# Uncomment if you deployed the model

# print(f"Deleting endpoint: {predictor.endpoint_name}")
# predictor.delete_endpoint()
# print("Endpoint deleted successfully!")

---

## Summary

In this exercise, you learned:

1. **Data Format**: SageMaker XGBoost expects CSV format with the label in the first column (no header)

2. **Key Hyperparameters**:
   - `objective`: Learning task (binary:logistic for binary classification)
   - `num_round`: Number of boosting rounds
   - `max_depth`: Tree depth (controls complexity)
   - `eta`: Learning rate (lower values shrink each tree's contribution and need more rounds)
   - `scale_pos_weight`: Handles class imbalance

3. **Evaluation Metrics**:
   - Use AUC-ROC for imbalanced datasets (more robust than accuracy)
   - Consider precision vs recall trade-off based on business needs
   - Tune threshold based on cost of false positives vs false negatives

4. **Best Practices**:
   - Use Batch Transform for cost-effective batch predictions
   - Set `scale_pos_weight` for imbalanced classification
   - Use regularization (`alpha`, `lambda`, `subsample`) to prevent overfitting
   - Always delete endpoints when done to avoid charges

## Next Steps

- Experiment with different hyperparameters
- Try SageMaker Hyperparameter Tuning for automatic optimization
- Add validation data for early stopping
- Use SHAP values to explain individual predictions