# SageMaker LightGBM Classification Exercise

This notebook walks you through training Amazon SageMaker's **LightGBM** algorithm on synthetic customer churn data.

## What You'll Learn
1. How to prepare data in LightGBM's required CSV format
2. How to configure and train a LightGBM model on SageMaker
3. How to use Batch Transform for predictions
4. How to evaluate classification model performance

## What is LightGBM?
LightGBM (Light Gradient Boosting Machine) is an efficient, open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. Compared to XGBoost, LightGBM:
- Uses **leaf-wise** tree growth (vs level-wise), which often trains faster and can be more accurate, at the cost of a higher overfitting risk on small datasets
- Uses **histogram-based** splitting for faster training
- Is **memory-efficient**, but training is memory-bound rather than compute-bound, so SageMaker recommends general-purpose instances (M5) over compute-optimized ones (C5)
- Supports **categorical features** natively without one-hot encoding

## Prerequisites
- SageMaker notebook instance or Studio, or local environment with AWS credentials
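- IAM role with S3 and SageMaker permissions
- Python packages used below: `sagemaker`, `boto3`, `pandas`, `numpy`, `matplotlib`, `scikit-learn`, `python-dotenv`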

---

## Step 1: Setup and Imports

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import os
from datetime import datetime
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Configure AWS session from environment variables
aws_profile = os.getenv('AWS_PROFILE')
aws_region = os.getenv('AWS_REGION', 'us-west-2')
sagemaker_role = os.getenv('SAGEMAKER_ROLE_ARN')

if aws_profile:
    boto3.setup_default_session(profile_name=aws_profile, region_name=aws_region)
else:
    boto3.setup_default_session(region_name=aws_region)

# SageMaker session and role
sagemaker_session = sagemaker.Session()

# Use environment variable for role, or fall back to execution role if running in SageMaker
if sagemaker_role:
    role = sagemaker_role
else:
    role = get_execution_role()

region = sagemaker_session.boto_region_name

print(f"AWS Profile: {aws_profile or 'default'}")
print(f"SageMaker Role: {role}")
print(f"Region: {region}")
print(f"SageMaker SDK Version: {sagemaker.__version__}")

In [None]:
# Configuration
BUCKET_NAME = sagemaker_session.default_bucket()
PREFIX = "lightgbm-churn"

# Dataset parameters
NUM_SAMPLES = 5000
TEST_RATIO = 0.2
RANDOM_STATE = 42

print(f"S3 Bucket: {BUCKET_NAME}")
print(f"S3 Prefix: {PREFIX}")

## Step 2: Generate Synthetic Data

We'll create a realistic customer churn dataset with:
- Customer demographics (age, tenure)
- Service usage patterns (monthly charges, support calls)
- Contract and payment information
- Binary churn target variable

In [None]:
def generate_customer_churn_data(num_samples=5000, seed=42):
    """
    Generate synthetic customer churn data for binary classification.
    
    Features are designed to have realistic relationships with churn:
    - Short tenure increases churn risk
    - Month-to-month contracts have higher churn
    - High support calls indicate dissatisfaction
    - Lack of tech support increases churn
    """
    np.random.seed(seed)
    
    # Customer demographics
    age = np.random.normal(45, 15, num_samples).clip(18, 80)
    tenure_months = np.random.exponential(24, num_samples).clip(1, 72)
    
    # Billing information
    monthly_charges = np.random.normal(70, 30, num_samples).clip(20, 150)
    total_charges = monthly_charges * tenure_months * np.random.uniform(0.9, 1.0, num_samples)
    
    # Service usage
    num_products = np.random.poisson(2.5, num_samples).clip(1, 6)
    support_calls = np.random.poisson(1.5, num_samples)
    
    # Contract type: 0=Month-to-month, 1=One year, 2=Two year
    contract_type = np.random.choice([0, 1, 2], num_samples, p=[0.5, 0.3, 0.2])
    
    # Payment method: 0=Electronic check, 1=Credit card, 2=Bank transfer, 3=Mailed check
    payment_method = np.random.choice([0, 1, 2, 3], num_samples, p=[0.35, 0.25, 0.25, 0.15])
    
    # Binary features
    has_tech_support = np.random.binomial(1, 0.4, num_samples)
    paperless_billing = np.random.binomial(1, 0.6, num_samples)
    auto_payment = np.random.binomial(1, 0.45, num_samples)
    
    # Generate churn based on realistic patterns
    churn_score = (
        -0.03 * tenure_months +           # Longer tenure = less churn
        0.015 * monthly_charges +          # Higher charges = more churn
        0.15 * support_calls +             # More support calls = more churn
        0.8 * (contract_type == 0) +       # Month-to-month = high churn
        0.4 * (payment_method == 0) +      # Electronic check = higher churn
        -0.5 * has_tech_support +          # Tech support = less churn
        -0.1 * num_products +              # More products = less churn
        np.random.normal(0, 0.5, num_samples)  # Random noise
    )
    
    # Convert to probability and generate labels
    churn_prob = 1 / (1 + np.exp(-churn_score))
    churn = (np.random.random(num_samples) < churn_prob).astype(int)
    
    # Combine features into a single array (the label is prepended later, when the training CSV is written)
    features = np.column_stack([
        age, tenure_months, monthly_charges, total_charges,
        num_products, support_calls, contract_type, payment_method,
        has_tech_support, paperless_billing, auto_payment
    ])
    
    feature_names = [
        'age', 'tenure_months', 'monthly_charges', 'total_charges',
        'num_products', 'support_calls', 'contract_type', 'payment_method',
        'has_tech_support', 'paperless_billing', 'auto_payment'
    ]
    
    # Indices of categorical features (0-indexed, excluding the label column)
    categorical_indices = [6, 7, 8, 9, 10]  # contract_type, payment_method, has_tech_support, paperless_billing, auto_payment
    
    return features.astype(np.float32), churn.astype(np.float32), feature_names, categorical_indices

In [None]:
# Generate the dataset
print("Generating synthetic customer churn data...")
X, y, feature_names, categorical_indices = generate_customer_churn_data(NUM_SAMPLES, RANDOM_STATE)

print(f"\nDataset shape: {X.shape}")
print(f"Churn rate: {y.mean():.1%}")
print(f"Features: {feature_names}")
print(f"Categorical feature indices: {categorical_indices}")

In [None]:
def split_data(X, y, test_ratio=0.2, seed=42):
    """Randomly split data into train and test sets (shuffled, not stratified)."""
    np.random.seed(seed)
    indices = np.random.permutation(len(y))
    test_size = int(len(y) * test_ratio)
    
    test_idx = indices[:test_size]
    train_idx = indices[test_size:]
    
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Split the data
X_train, X_test, y_train, y_test = split_data(X, y, TEST_RATIO, RANDOM_STATE)

print(f"Training set: {len(y_train)} samples (churn rate: {y_train.mean():.1%})")
print(f"Test set: {len(y_test)} samples (churn rate: {y_test.mean():.1%})")
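
The split above is a plain random permutation, so the churn rate in the two sets matches only approximately. If an exact class balance matters, a stratified variant could look like the following minimal sketch. The function name `stratified_split` and the toy arrays are hypothetical and not part of the original notebook.

```python
import numpy as np

def stratified_split(X, y, test_ratio=0.2, seed=42):
    """Split X, y so each class keeps roughly the same share in both sets."""
    rng = np.random.default_rng(seed)
    test_parts = []
    for label in np.unique(y):
        # Shuffle the row indices of this class, then carve off its test share
        class_idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(len(class_idx) * test_ratio))
        test_parts.append(class_idx[:n_test])
    test_idx = np.concatenate(test_parts)

    # Every row not selected for the test set goes to training
    train_mask = np.ones(len(y), dtype=bool)
    train_mask[test_idx] = False
    train_idx = rng.permutation(np.flatnonzero(train_mask))
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy check: 30 positives and 70 negatives
y_demo = np.array([1] * 30 + [0] * 70, dtype=np.float32)
X_demo = np.arange(100, dtype=np.float32).reshape(-1, 1)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = stratified_split(X_demo, y_demo, 0.2)
print(len(y_te_demo), y_te_demo.mean(), y_tr_demo.mean())
```

With 5,000 samples the random split is usually close enough; stratification matters more for small or heavily imbalanced datasets.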

## Step 3: Visualize the Data

In [None]:
# Create DataFrame for visualization
df_viz = pd.DataFrame(X_train, columns=feature_names)
df_viz['churn'] = y_train

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Churn distribution
ax = axes[0, 0]
churn_counts = df_viz['churn'].value_counts().sort_index()
ax.bar(['No Churn', 'Churn'], churn_counts.values, color=['steelblue', 'coral'])
ax.set_title('Churn Distribution', fontsize=12, fontweight='bold')
ax.set_ylabel('Count')
for i, v in enumerate(churn_counts.values):
    ax.text(i, v + 50, f'{v}\n({v/len(df_viz):.1%})', ha='center')

# Tenure by churn
ax = axes[0, 1]
df_viz[df_viz['churn']==0]['tenure_months'].hist(bins=30, alpha=0.6, label='No Churn', ax=ax, color='steelblue')
df_viz[df_viz['churn']==1]['tenure_months'].hist(bins=30, alpha=0.6, label='Churn', ax=ax, color='coral')
ax.set_title('Tenure Distribution by Churn', fontsize=12, fontweight='bold')
ax.set_xlabel('Tenure (months)')
ax.legend()

# Monthly charges by churn
ax = axes[0, 2]
df_viz[df_viz['churn']==0]['monthly_charges'].hist(bins=30, alpha=0.6, label='No Churn', ax=ax, color='steelblue')
df_viz[df_viz['churn']==1]['monthly_charges'].hist(bins=30, alpha=0.6, label='Churn', ax=ax, color='coral')
ax.set_title('Monthly Charges by Churn', fontsize=12, fontweight='bold')
ax.set_xlabel('Monthly Charges ($)')
ax.legend()

# Contract type vs churn
ax = axes[1, 0]
contract_churn = df_viz.groupby('contract_type')['churn'].mean()
contract_labels = ['Month-to-month', 'One year', 'Two year']
ax.bar(contract_labels, contract_churn.values, color='steelblue')
ax.set_title('Churn Rate by Contract Type', fontsize=12, fontweight='bold')
ax.set_ylabel('Churn Rate')
for i, v in enumerate(contract_churn.values):
    ax.text(i, v + 0.02, f'{v:.1%}', ha='center')

# Support calls vs churn
ax = axes[1, 1]
support_churn = df_viz.groupby('support_calls')['churn'].mean()
ax.bar(support_churn.index, support_churn.values, color='coral')
ax.set_title('Churn Rate by Support Calls', fontsize=12, fontweight='bold')
ax.set_xlabel('Number of Support Calls')
ax.set_ylabel('Churn Rate')

# Tech support vs churn
ax = axes[1, 2]
tech_churn = df_viz.groupby('has_tech_support')['churn'].mean()
ax.bar(['No Tech Support', 'Has Tech Support'], tech_churn.values, color=['coral', 'steelblue'])
ax.set_title('Churn Rate by Tech Support', fontsize=12, fontweight='bold')
ax.set_ylabel('Churn Rate')
for i, v in enumerate(tech_churn.values):
    ax.text(i, v + 0.02, f'{v:.1%}', ha='center')

plt.tight_layout()
plt.show()

print("\nKey Observations:")
print(f"- Month-to-month contracts have {contract_churn[0]:.1%} churn vs {contract_churn[2]:.1%} for two-year")
print(f"- Customers without tech support churn at {tech_churn[0]:.1%} vs {tech_churn[1]:.1%} with support")

## Step 4: Prepare Data for SageMaker LightGBM

SageMaker's LightGBM algorithm expects:
- **Training data**: CSV format with target in the first column (no header)
- **Inference data**: CSV format with features only (no target column)
- **Categorical features** (optional): A JSON file listing column indices of categorical features

In [None]:
# Create local data directory
os.makedirs('data/train', exist_ok=True)
os.makedirs('data/test', exist_ok=True)

def save_csv_for_training(X, y, filepath):
    """Save data in SageMaker LightGBM CSV format (label first, no header)."""
    data = np.column_stack([y.reshape(-1, 1), X])
    np.savetxt(filepath, data, delimiter=',', fmt='%.6f')

def save_csv_for_inference(X, filepath):
    """Save features only for batch transform inference."""
    np.savetxt(filepath, X, delimiter=',', fmt='%.6f')

# Save training data (label + features)
save_csv_for_training(X_train, y_train, 'data/train/train.csv')

# Save test features only (for batch transform)
save_csv_for_inference(X_test, 'data/test/test_features.csv')

# Save test labels locally for evaluation
np.savetxt('data/test/test_labels.csv', y_test, delimiter=',', fmt='%.0f')

# Save categorical feature indices (optional but recommended for LightGBM)
# Note: SageMaker reads these as column indices of the training CSV, where column 0
# is the label, so the 0-based feature indices are shifted by one here
with open('data/train/categorical_index.json', 'w') as f:
    json.dump({"cat_index_list": [i + 1 for i in categorical_indices]}, f)

print("Data saved locally:")
print(f"  - data/train/train.csv ({os.path.getsize('data/train/train.csv') / 1024:.1f} KB)")
print(f"  - data/train/categorical_index.json")
print(f"  - data/test/test_features.csv ({os.path.getsize('data/test/test_features.csv') / 1024:.1f} KB)")
print(f"  - data/test/test_labels.csv")

In [None]:
# Examine the data format
print("Sample training data (first 3 rows):")
print("Format: label, feature1, feature2, ..., feature11")
print("="*70)
with open('data/train/train.csv', 'r') as f:
    for i, line in enumerate(f):
        if i >= 3:
            break
        print(line.strip())

print("\nCategorical index file:")
with open('data/train/categorical_index.json', 'r') as f:
    print(f.read())
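
Before uploading, it can help to check the file against the format described above. The following is a minimal sketch of such a check; the helper name `check_training_csv` and the sample rows are hypothetical, and it only covers the binary-label case used in this notebook.

```python
import io
import numpy as np

def check_training_csv(csv_text, expected_features):
    """Parse label-first, header-free CSV text and raise ValueError if it looks wrong."""
    # np.loadtxt raises ValueError on non-numeric cells, which also catches a header row
    data = np.loadtxt(io.StringIO(csv_text), delimiter=",", ndmin=2)
    if data.shape[1] != expected_features + 1:
        raise ValueError(
            f"expected {expected_features + 1} columns (label + features), got {data.shape[1]}"
        )
    # For binary classification the first column should hold only 0/1 labels
    if not np.isin(data[:, 0], [0, 1]).all():
        raise ValueError("first column must contain only 0/1 labels")
    return data

sample_csv = "1.000000,34.5,12.0\n0.000000,51.2,48.0\n"
parsed_demo = check_training_csv(sample_csv, expected_features=2)
print(parsed_demo.shape)
```

For the real file, the same function could be called with `open('data/train/train.csv').read()` and `expected_features=len(feature_names)`.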

In [None]:
# Upload to S3
s3_client = boto3.client('s3')

# Upload training data and categorical index
train_s3_path = f"{PREFIX}/train/train.csv"
cat_index_s3_path = f"{PREFIX}/train/categorical_index.json"
test_s3_path = f"{PREFIX}/test/test_features.csv"

s3_client.upload_file('data/train/train.csv', BUCKET_NAME, train_s3_path)
s3_client.upload_file('data/train/categorical_index.json', BUCKET_NAME, cat_index_s3_path)
s3_client.upload_file('data/test/test_features.csv', BUCKET_NAME, test_s3_path)

# LightGBM expects the training directory, not the file
train_s3_uri = f"s3://{BUCKET_NAME}/{PREFIX}/train"
test_s3_uri = f"s3://{BUCKET_NAME}/{test_s3_path}"

print("Data uploaded to S3:")
print(f"  Train directory: {train_s3_uri}")
print(f"  Test file: {test_s3_uri}")

## Step 5: Configure and Train the LightGBM Model

### Key Hyperparameters

**num_boost_round** (Required)
- Number of boosting iterations (trees to build)
- More rounds can improve accuracy but risk overfitting
- Typical range: 100-1000

**num_leaves**
- Maximum number of leaves in one tree
- Main parameter controlling model complexity
- LightGBM uses leaf-wise growth, so this is more important than max_depth
- Typical range: 20-100 (default: 31)
- Higher values = more complex model, risk of overfitting

**learning_rate**
- Shrinkage rate to prevent overfitting
- Lower values require more boosting rounds but generalize better
- Typical range: 0.01-0.3 (default: 0.1)

**max_depth**
- Maximum depth of each tree
- Use to limit tree complexity and prevent overfitting
- Set to -1 for no limit (default)
- Typical range: 3-12 when limiting

**feature_fraction** (colsample_bytree)
- Fraction of features used per tree
- Adds randomness to prevent overfitting
- Typical range: 0.5-1.0 (default: 1.0)

**bagging_fraction** (subsample)
- Fraction of training samples used per tree
- Requires bagging_freq > 0 to take effect
- Typical range: 0.5-1.0 (default: 1.0)

**bagging_freq**
- Frequency for bagging (0 = disabled)
- Set to positive integer to enable bagging every k iterations
- Typical value: 1-10

**min_data_in_leaf**
- Minimum number of samples in a leaf
- Prevents learning overly specific patterns
- Typical range: 10-100 (default: 20)

**lambda_l1** (reg_alpha)
- L1 regularization on leaf weights
- Encourages sparsity
- Default: 0

**lambda_l2** (reg_lambda)
- L2 regularization on leaf weights
- Smooths weights, reduces overfitting
- Default: 0

**scale_pos_weight**
- Weight of positive class for imbalanced datasets
- Set to: (number of negative samples) / (number of positive samples)
- Helps model pay attention to minority class
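
Two of the interactions described above are easy to get wrong: a tree of depth `d` can hold at most `2**d` leaves, so a larger `num_leaves` is never reached, and `bagging_fraction` does nothing while `bagging_freq` is 0. The sketch below checks both. The helper name `check_lightgbm_hparams` is hypothetical, and the fallback values are taken from the defaults listed in this section (the `bagging_freq` fallback of 0 is an assumption).

```python
def check_lightgbm_hparams(hparams):
    """Return a list of warnings for a dict of string-valued hyperparameters."""
    warnings = []
    # Fallbacks mirror the defaults listed above; bagging_freq=0 is an assumption
    num_leaves = int(hparams.get("num_leaves", 31))
    max_depth = int(hparams.get("max_depth", -1))
    bagging_fraction = float(hparams.get("bagging_fraction", 1.0))
    bagging_freq = int(hparams.get("bagging_freq", 0))

    # A binary tree of depth d has at most 2**d leaves
    if max_depth > 0 and num_leaves > 2 ** max_depth:
        warnings.append(
            f"num_leaves={num_leaves} exceeds 2**max_depth={2 ** max_depth}; "
            "max_depth is the binding limit"
        )
    # Row subsampling is only applied when bagging is enabled
    if bagging_fraction < 1.0 and bagging_freq <= 0:
        warnings.append("bagging_fraction < 1.0 has no effect while bagging_freq is 0")
    return warnings

print(check_lightgbm_hparams(
    {"num_leaves": "31", "max_depth": "6", "bagging_fraction": "0.8", "bagging_freq": "5"}
))
print(check_lightgbm_hparams(
    {"num_leaves": "31", "max_depth": "4", "bagging_fraction": "0.8", "bagging_freq": "0"}
))
```

The first call mirrors the values used later in this notebook and produces no warnings, since `2**6 = 64` is larger than 31.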

### Evaluation Metrics

SageMaker LightGBM automatically chooses metrics based on problem type:
- **Binary classification**: binary cross entropy (binary_logloss)
- **Multiclass classification**: multi-class cross entropy (multi_logloss)
- **Regression**: root mean squared error (rmse)
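
To make the binary metric concrete, here is a small worked sketch of binary cross entropy in NumPy. The function `binary_logloss` and the toy values are illustrative only; this is the standard formula, not SageMaker's internal code.

```python
import numpy as np

def binary_logloss(y_true, p_pred, eps=1e-15):
    """Mean binary cross entropy; lower is better."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities so log(0) never occurs
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

labels_demo = [1, 0, 1, 0]
confident_demo = [0.9, 0.1, 0.8, 0.2]    # mostly right and fairly sure
uninformed_demo = [0.5, 0.5, 0.5, 0.5]   # always predicts 50%

print(binary_logloss(labels_demo, confident_demo))
print(binary_logloss(labels_demo, uninformed_demo))  # equals ln(2)
```

A model that always outputs 0.5 scores ln(2), roughly 0.693, so a useful classifier should land below that.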

In [None]:
# Get the LightGBM container image using JumpStart
train_model_id = "lightgbm-classification-model"
train_model_version = "*"
training_instance_type = "ml.m5.xlarge"

# Retrieve the Docker image URI
train_image_uri = image_uris.retrieve(
    region=region,
    framework=None,
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope="training",
    instance_type=training_instance_type,
)

# Retrieve the training script URI
train_source_uri = script_uris.retrieve(
    model_id=train_model_id,
    model_version=train_model_version,
    script_scope="training"
)

# Retrieve the pre-trained model URI (used as starting point)
train_model_uri = model_uris.retrieve(
    model_id=train_model_id,
    model_version=train_model_version,
    model_scope="training"
)

print(f"Training Image URI: {train_image_uri}")
print(f"Training Script URI: {train_source_uri}")
print(f"Model URI: {train_model_uri}")

In [None]:
# Get default hyperparameters and customize
default_hparams = hyperparameters.retrieve_default(
    model_id=train_model_id,
    model_version=train_model_version
)

print("Default hyperparameters:")
for k, v in default_hparams.items():
    print(f"  {k}: {v}")

In [None]:
# Calculate scale_pos_weight for imbalanced data
num_negative = (y_train == 0).sum()
num_positive = (y_train == 1).sum()
scale_pos_weight = num_negative / num_positive

print(f"Class distribution: {num_negative} negative, {num_positive} positive")
print(f"Calculated scale_pos_weight: {scale_pos_weight:.2f}")

In [None]:
# Customize hyperparameters
hparams = default_hparams.copy()

# Override with our custom values
hparams["num_boost_round"] = "200"
hparams["num_leaves"] = "31"
hparams["learning_rate"] = "0.05"
hparams["max_depth"] = "6"
hparams["feature_fraction"] = "0.8"
hparams["bagging_fraction"] = "0.8"
hparams["bagging_freq"] = "5"
hparams["min_data_in_leaf"] = "20"
hparams["lambda_l1"] = "0.1"
hparams["lambda_l2"] = "1.0"
hparams["scale_pos_weight"] = str(scale_pos_weight)

print("Custom hyperparameters:")
for k, v in hparams.items():
    print(f"  {k}: {v}")

In [None]:
# Create the estimator
training_job_name = name_from_base("lightgbm-churn")

lightgbm_estimator = Estimator(
    role=role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=3600,
    hyperparameters=hparams,
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/output',
    sagemaker_session=sagemaker_session,
    base_job_name='lightgbm-churn'
)

In [None]:
print("Starting training job...")
print("This will take approximately 3-5 minutes.\n")

# Start training - pass the directory containing train.csv and categorical_index.json
lightgbm_estimator.fit(
    {"training": train_s3_uri},
    logs=True,
    job_name=training_job_name
)

In [None]:
# Get training job info
print(f"Training job completed: {training_job_name}")
print(f"Model artifacts: {lightgbm_estimator.model_data}")

## Step 6: Run Batch Transform

Instead of deploying a real-time endpoint, we use Batch Transform for predictions on the test set.

In [None]:
# Create transformer from the trained model
transformer = lightgbm_estimator.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/batch-predictions'
)

print("Starting batch transform job...")
print("This will take approximately 3-5 minutes.\n")

# Run batch inference
transformer.transform(
    data=test_s3_uri,
    content_type='text/csv',
    split_type='Line',
    wait=True
)

print(f"\nPredictions written to: {transformer.output_path}")

## Step 7: Download and Parse Predictions

In [None]:
# Download predictions from S3
prediction_key = f"{PREFIX}/batch-predictions/test_features.csv.out"

s3_client.download_file(BUCKET_NAME, prediction_key, 'data/predictions.csv')

# Load predictions (LightGBM outputs probabilities for classification)
y_pred_proba = np.loadtxt('data/predictions.csv')
y_pred = (y_pred_proba >= 0.5).astype(int)

print(f"Loaded predictions for {len(y_pred)} samples")
print(f"Prediction distribution: {np.bincount(y_pred)}")
print(f"Probability range: [{y_pred_proba.min():.4f}, {y_pred_proba.max():.4f}]")
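
The cell above assumes the output file holds exactly one probability per line. If the container instead writes several comma-separated values per line (for example one probability per class), `np.loadtxt` without a delimiter would fail. The sketch below is a more tolerant parser under that assumption; the function name `parse_positive_proba` is hypothetical, and treating the last value as the churn probability is an assumption that should be checked against the actual output file.

```python
import numpy as np

def parse_positive_proba(lines):
    """Extract one positive-class probability per non-empty output line."""
    probs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines such as a trailing newline
        values = [float(v) for v in line.split(",")]
        # One value: treat it as P(churn). Several: ASSUME the last one is P(churn).
        probs.append(values[-1])
    return np.array(probs)

single_col_demo = parse_positive_proba(["0.25\n", "0.75\n"])
two_col_demo = parse_positive_proba(["0.8,0.2\n", "0.1,0.9\n", "\n"])
print(single_col_demo, two_col_demo)
```

For the real file, `parse_positive_proba(open('data/predictions.csv'))` would replace the `np.loadtxt` call.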

## Step 8: Evaluate Model Performance

### Understanding Classification Metrics

**Accuracy** - Percentage of correct predictions. Can be misleading with imbalanced classes.

**Precision** - Of all customers predicted to churn, how many actually churned? High precision = few false alarms.

**Recall (Sensitivity)** - Of all customers who actually churned, how many did we catch? High recall = we catch most churners.

**F1 Score** - Harmonic mean of precision and recall. Balances both metrics.

**AUC-ROC** - Area Under the Receiver Operating Characteristic curve. Measures discrimination ability across all thresholds. 0.5 = random, 1.0 = perfect.
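
The first four definitions can be computed by hand from confusion-matrix counts. The sketch below uses hypothetical counts (100 customers, 20 of whom actually churn); the helper `metrics_from_counts` is illustrative and not part of the notebook.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical model: flags 25 customers, 15 of them correctly, and misses 5 churners
demo_metrics = metrics_from_counts(tp=15, fp=10, fn=5, tn=70)
for name, value in demo_metrics.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is 0.85, yet only 60% of the flagged customers actually churn (precision) and a quarter of real churners are missed (recall 0.75), which is why accuracy alone can mislead.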

In [None]:
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    roc_curve, precision_recall_curve, average_precision_score
)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_proba)
avg_precision = average_precision_score(y_test, y_pred_proba)

print("="*60)
print("MODEL EVALUATION RESULTS")
print("="*60)
print(f"\nAccuracy:           {accuracy:.4f}")
print(f"Precision:          {precision:.4f}")
print(f"Recall:             {recall:.4f}")
print(f"F1 Score:           {f1:.4f}")
print(f"ROC AUC:            {auc:.4f}")
print(f"Average Precision:  {avg_precision:.4f}")

print("\n" + "="*60)
print("CLASSIFICATION REPORT")
print("="*60)
print(classification_report(y_test, y_pred, target_names=['No Churn', 'Churn']))

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

print("="*60)
print("CONFUSION MATRIX")
print("="*60)
print(f"\nTrue Negatives (correct no-churn):  {tn}")
print(f"False Positives (false alarm):       {fp}")
print(f"False Negatives (missed churn):      {fn}")
print(f"True Positives (caught churn):       {tp}")

print(f"\nSpecificity (TN rate):  {tn / (tn + fp):.4f}")
print(f"False Positive Rate:    {fp / (tn + fp):.4f}")

In [None]:
# Visualize results
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# 1. Confusion Matrix Heatmap
ax = axes[0, 0]
im = ax.imshow(cm, cmap='Blues')
ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xticklabels(['No Churn', 'Churn'])
ax.set_yticklabels(['No Churn', 'Churn'])
ax.set_xlabel('Predicted', fontsize=12)
ax.set_ylabel('Actual', fontsize=12)
ax.set_title('Confusion Matrix', fontsize=14, fontweight='bold')
for i in range(2):
    for j in range(2):
        ax.text(j, i, str(cm[i, j]), ha='center', va='center', fontsize=20, fontweight='bold')
plt.colorbar(im, ax=ax)

# 2. ROC Curve
ax = axes[0, 1]
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
ax.plot(fpr, tpr, 'b-', linewidth=2, label=f'LightGBM (AUC = {auc:.3f})')
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
ax.fill_between(fpr, tpr, alpha=0.2)
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curve', fontsize=14, fontweight='bold')
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)

# 3. Precision-Recall Curve
ax = axes[1, 0]
precision_curve, recall_curve, _ = precision_recall_curve(y_test, y_pred_proba)
ax.plot(recall_curve, precision_curve, 'b-', linewidth=2, label=f'LightGBM (AP = {avg_precision:.3f})')
ax.axhline(y=y_test.mean(), color='k', linestyle='--', linewidth=1, label=f'Baseline ({y_test.mean():.2f})')
ax.fill_between(recall_curve, precision_curve, alpha=0.2)
ax.set_xlabel('Recall', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision-Recall Curve', fontsize=14, fontweight='bold')
ax.legend(loc='lower left')
ax.grid(True, alpha=0.3)

# 4. Prediction Distribution
ax = axes[1, 1]
ax.hist(y_pred_proba[y_test == 0], bins=50, alpha=0.6, label='No Churn (Actual)', color='steelblue', density=True)
ax.hist(y_pred_proba[y_test == 1], bins=50, alpha=0.6, label='Churn (Actual)', color='coral', density=True)
ax.axvline(x=0.5, color='black', linestyle='--', linewidth=2, label='Threshold (0.5)')
ax.set_xlabel('Predicted Churn Probability', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Prediction Distribution', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Step 9: Threshold Analysis

The default threshold of 0.5 may not be optimal. Let's analyze how different thresholds affect precision and recall.

In [None]:
# Analyze different thresholds
thresholds = np.linspace(0.1, 0.9, 17)
results = []

for thresh in thresholds:
    y_pred_thresh = (y_pred_proba >= thresh).astype(int)
    results.append({
        'threshold': thresh,
        'precision': precision_score(y_test, y_pred_thresh, zero_division=0),
        'recall': recall_score(y_test, y_pred_thresh, zero_division=0),
        'f1': f1_score(y_test, y_pred_thresh, zero_division=0),
        'predicted_positive': y_pred_thresh.sum()
    })

results_df = pd.DataFrame(results)

# Find optimal thresholds
best_f1_idx = results_df['f1'].idxmax()
best_f1_thresh = results_df.loc[best_f1_idx, 'threshold']

print("Threshold Analysis:")
print(results_df.to_string(index=False))
print(f"\nOptimal threshold for F1 score: {best_f1_thresh:.2f}")

In [None]:
# Visualize threshold trade-offs
fig, ax = plt.subplots(figsize=(12, 6))

ax.plot(results_df['threshold'], results_df['precision'], 'b-', linewidth=2, marker='o', label='Precision')
ax.plot(results_df['threshold'], results_df['recall'], 'r-', linewidth=2, marker='s', label='Recall')
ax.plot(results_df['threshold'], results_df['f1'], 'g-', linewidth=2, marker='^', label='F1 Score')
ax.axvline(x=0.5, color='gray', linestyle='--', alpha=0.7, label='Default (0.5)')
ax.axvline(x=best_f1_thresh, color='green', linestyle=':', linewidth=2, label=f'Best F1 ({best_f1_thresh:.2f})')

ax.set_xlabel('Classification Threshold', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Precision-Recall Trade-off by Threshold', fontsize=14, fontweight='bold')
ax.legend(loc='center right')
ax.grid(True, alpha=0.3)
ax.set_xlim(0.05, 0.95)

plt.tight_layout()
plt.show()

print("\nBusiness Considerations:")
print("- Lower threshold: Catch more churners (high recall) but more false alarms (low precision)")
print("- Higher threshold: Fewer false alarms (high precision) but miss more churners (low recall)")
print("- Choose based on: cost of intervention vs cost of losing a customer")
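
The last consideration can be turned into a rule: pick the threshold that minimizes total cost. The sketch below does this on toy data. The function name `min_cost_threshold`, the toy labels and probabilities, and the cost figures (10 per unnecessary intervention, 100 per lost customer) are all hypothetical.

```python
import numpy as np

def min_cost_threshold(y_true, proba, cost_fp, cost_fn, thresholds):
    """Return (threshold, total_cost) for the cheapest threshold in the grid."""
    y_true = np.asarray(y_true).astype(int)
    proba = np.asarray(proba, dtype=float)
    costs = []
    for t in thresholds:
        pred = (proba >= t).astype(int)
        fp = int(((pred == 1) & (y_true == 0)).sum())  # unnecessary interventions
        fn = int(((pred == 0) & (y_true == 1)).sum())  # missed churners
        costs.append(fp * cost_fp + fn * cost_fn)
    best = int(np.argmin(costs))  # first minimum wins on ties
    return float(thresholds[best]), costs[best]

y_cost_demo = [0, 0, 0, 1, 1, 0, 1, 0]
p_cost_demo = [0.12, 0.31, 0.47, 0.42, 0.81, 0.62, 0.57, 0.22]
grid_demo = np.linspace(0.1, 0.9, 17)

# Losing a customer is 10x as costly as an intervention: favors a low threshold
low_t, low_cost = min_cost_threshold(y_cost_demo, p_cost_demo, 10, 100, grid_demo)
# Interventions are 10x as costly as a lost customer: favors a high threshold
high_t, high_cost = min_cost_threshold(y_cost_demo, p_cost_demo, 100, 10, grid_demo)
print(low_t, low_cost, high_t, high_cost)
```

On the real test set, `y_test`, `y_pred_proba`, and `thresholds` from the cells above could be passed in with costs estimated for the business.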

---

## Summary

In this exercise, you learned:

1. **Data Format**: SageMaker LightGBM expects CSV format with the label in the first column (no header). Optionally, provide a `categorical_index.json` file to specify categorical feature indices.

2. **Key Differences from XGBoost**:
   - **Leaf-wise growth**: LightGBM grows trees leaf-wise (best-first) vs XGBoost's level-wise growth
   - **num_leaves**: Main complexity parameter (vs max_depth in XGBoost)
   - **Native categorical support**: No need for one-hot encoding
   - **Memory-bound**: Use general-purpose instances (M5) not compute-optimized (C5)

3. **Key Hyperparameters**:
   - `num_boost_round`: Number of boosting iterations
   - `num_leaves`: Maximum leaves per tree (main complexity control)
   - `learning_rate`: Shrinkage rate
   - `feature_fraction`, `bagging_fraction`: Sampling for regularization
   - `scale_pos_weight`: Handle class imbalance

4. **Evaluation**: On imbalanced datasets, prefer threshold-independent metrics such as AUC-ROC and average precision over raw accuracy. Tune the threshold based on the business costs of false positives vs false negatives.

## LightGBM vs XGBoost: When to Use Which?

| Aspect | LightGBM | XGBoost |
|--------|----------|----------|
| Speed | Often faster (histogram-based) | Often slower, but more mature |
| Memory | Typically more efficient | Typically higher memory usage |
| Large datasets | Tends to scale better | Can struggle |
| Small datasets | Higher risk of overfitting | Often better |
| Categorical features | Native support | Traditionally requires encoding |
| Tree growth | Leaf-wise (faster) | Level-wise (more balanced) |

## Next Steps

- Experiment with different hyperparameters
- Try SageMaker Hyperparameter Tuning for automatic optimization
- Compare performance with XGBoost on your data
- Use early stopping with validation data