# SageMaker Linear Learner Exercise

This notebook walks you through training Amazon SageMaker's **Linear Learner** algorithm for both **classification** and **regression** tasks, using **Batch Transform** for predictions.

## What You'll Learn
1. How to prepare data in Linear Learner's required format
2. How to configure and train a Linear Learner model for classification
3. How to configure and train a Linear Learner model for regression
4. How to use **Batch Transform** for predictions (no endpoint deployment)
5. How to evaluate model performance with comprehensive metrics

## What is Linear Learner?
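Linear Learner is a supervised learning algorithm that can be used for both **classification** and **regression** problems. Under the hood, it trains multiple models with different hyperparameters in parallel and selects the best one, scoring the candidates on a validation set when one is supplied and on the training set otherwise (this notebook supplies only a `train` channel).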

Key features:
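- **Binary/Multiclass Classification**: Linear classifiers trained with a logistic (binary) or softmax (multiclass) loss, among other options
- **Regression**: Linear regression with L1/L2 regularization
- **Built-in model selection**: Trains multiple candidate models in parallel (controlled by `num_models`) and keeps the best one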
- **Built-in data normalization**: Handles feature scaling automatically
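
To make the "train several candidates, keep the best" idea concrete, here is a minimal pure-Python sketch. It is not Linear Learner's implementation: it fits a one-feature ridge regression in closed form for a few L2 strengths and keeps the candidate with the lowest validation error. The data, the candidate values, and the helper names are all made up for illustration.

```python
def fit_ridge_1d(xs, ys, l2):
    """Closed-form slope for y ~ w * x with an L2 penalty (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + l2)


def mse(xs, ys, w):
    """Mean squared error of the predictions w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_best_model(train, valid, l2_candidates):
    """Fit one model per L2 value; return (l2, w, valid_mse) with the lowest validation MSE."""
    results = []
    for l2 in l2_candidates:
        w = fit_ridge_1d(train[0], train[1], l2)
        results.append((l2, w, mse(valid[0], valid[1], w)))
    return min(results, key=lambda r: r[2])


# Hypothetical toy data: y is roughly 2 * x.
train = ([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
valid = ([5.0, 6.0], [10.1, 11.9])

best_l2, best_w, best_mse = select_best_model(train, valid, [0.0, 1.0, 10.0, 100.0])
print(f"best l2={best_l2}, w={best_w:.3f}, validation MSE={best_mse:.4f}")
```

On this toy data the unregularized candidate wins because the relationship is nearly noise-free; with noisier data a nonzero penalty can come out ahead.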

## Why Batch Transform?
- **No endpoint costs**: Only pay for compute during the transform job
- **Better for batch predictions**: Ideal for evaluating models on test sets
- **No endpoint to clean up**: Compute resources terminate automatically after the job completes (model artifacts and outputs remain in S3)

## Prerequisites
- SageMaker notebook instance or Studio
- IAM role with S3 and SageMaker permissions

---

## Step 1: Setup and Imports

In [None]:
# Install required packages (if needed)
# python-dotenv provides `load_dotenv`; scipy is used for the residual plots
!pip install -q matplotlib pandas numpy sagemaker boto3 scikit-learn seaborn scipy python-dotenv

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.transformer import Transformer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    # Classification metrics
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve,
    precision_recall_curve, average_precision_score, log_loss,
    balanced_accuracy_score, matthews_corrcoef,
    # Regression metrics
    mean_squared_error, mean_absolute_error, r2_score,
    mean_absolute_percentage_error, explained_variance_score,
    max_error, median_absolute_error
)
import json
import os
from datetime import datetime
import time
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Configure AWS session from environment variables
aws_profile = os.getenv('AWS_PROFILE')
aws_region = os.getenv('AWS_REGION', 'us-west-2')
sagemaker_role = os.getenv('SAGEMAKER_ROLE_ARN')

if aws_profile:
    boto3.setup_default_session(profile_name=aws_profile, region_name=aws_region)
else:
    boto3.setup_default_session(region_name=aws_region)

# SageMaker session and role
sagemaker_session = sagemaker.Session()

# Use environment variable for role, or fall back to execution role if running in SageMaker
if sagemaker_role:
    role = sagemaker_role
else:
    role = get_execution_role()

region = sagemaker_session.boto_region_name

print(f"AWS Profile: {aws_profile or 'default'}")
print(f"SageMaker Role: {role}")
print(f"Region: {region}")
print(f"SageMaker SDK Version: {sagemaker.__version__}")

In [None]:
# Configuration - MODIFY THESE FOR YOUR ENVIRONMENT
BUCKET_NAME = sagemaker_session.default_bucket()  # Or specify your bucket
PREFIX = "linear-learner-exercise"

# Dataset parameters
NUM_SAMPLES = 5000
TEST_RATIO = 0.2

print(f"S3 Bucket: {BUCKET_NAME}")
print(f"S3 Prefix: {PREFIX}")

## Step 2: Generate Synthetic Data

We'll generate two types of data:
1. **Classification**: Customer churn prediction
2. **Regression**: House price prediction
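
The classification labels come from a logistic model: a linear score is squashed through a sigmoid to get a churn probability, and each label is a Bernoulli draw from that probability. Here is a minimal standard-library sketch of that step, with made-up scores and hypothetical helper names:

```python
import math
import random


def sigmoid(score):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def sample_labels(scores, seed=42):
    """Draw one 0/1 label per score, with P(label = 1) = sigmoid(score)."""
    rng = random.Random(seed)
    return [1 if rng.random() < sigmoid(s) else 0 for s in scores]


# Illustrative scores: strongly negative, neutral, strongly positive.
scores = [-4.0, 0.0, 4.0]
probs = [sigmoid(s) for s in scores]
print([round(p, 3) for p in probs])  # [0.018, 0.5, 0.982]
print(sample_labels(scores))
```

Because the labels are sampled rather than thresholded, even a perfect model cannot reach 100% accuracy on this kind of data.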

In [None]:
def generate_classification_data(num_samples=5000, seed=42):
    """
    Generate synthetic customer churn data for binary classification.
    """
    np.random.seed(seed)
    
    feature_names = [
        'tenure_months', 'monthly_charges', 'total_charges', 'support_tickets',
        'contract_type', 'payment_method', 'online_security', 'tech_support',
        'internet_service', 'num_products'
    ]
    
    # Generate features
    tenure = np.random.exponential(scale=24, size=num_samples).clip(1, 72)
    monthly_charges = np.random.normal(65, 30, num_samples).clip(20, 150)
    total_charges = tenure * monthly_charges * np.random.uniform(0.8, 1.0, num_samples)
    support_tickets = np.random.poisson(2, num_samples)
    contract_type = np.random.choice([0, 1, 2], num_samples, p=[0.5, 0.3, 0.2])
    payment_method = np.random.choice([0, 1, 2, 3], num_samples)
    online_security = np.random.choice([0, 1], num_samples, p=[0.6, 0.4])
    tech_support = np.random.choice([0, 1], num_samples, p=[0.5, 0.5])
    internet_service = np.random.choice([0, 1, 2], num_samples, p=[0.2, 0.4, 0.4])
    num_products = np.random.choice([1, 2, 3, 4, 5], num_samples, p=[0.3, 0.3, 0.2, 0.15, 0.05])
    
    X = np.column_stack([
        tenure, monthly_charges, total_charges, support_tickets,
        contract_type, payment_method, online_security, tech_support,
        internet_service, num_products
    ])
    
    # Generate labels based on realistic churn patterns
    churn_score = (
        -0.05 * tenure + 0.02 * monthly_charges + 0.1 * support_tickets +
        -0.5 * contract_type - 0.3 * online_security - 0.3 * tech_support +
        0.2 * (internet_service == 2) - 0.1 * num_products +
        np.random.normal(0, 0.5, num_samples)
    )
    
    churn_prob = 1 / (1 + np.exp(-churn_score))
    y = (np.random.random(num_samples) < churn_prob).astype(float)
    
    return X.astype(np.float32), y.astype(np.float32), feature_names

In [None]:
def generate_regression_data(num_samples=5000, seed=42):
    """
    Generate synthetic house price data for regression.
    """
    np.random.seed(seed)
    
    feature_names = [
        'sqft', 'bedrooms', 'bathrooms', 'lot_size_acres', 'year_built',
        'garage_capacity', 'has_pool', 'distance_to_city', 'school_rating', 'crime_rate'
    ]
    
    sqft = np.random.normal(2000, 800, num_samples).clip(500, 6000)
    bedrooms = np.random.choice([1, 2, 3, 4, 5, 6], num_samples, p=[0.05, 0.15, 0.35, 0.30, 0.12, 0.03])
    bathrooms = np.minimum(bedrooms, np.random.choice([1, 2, 3, 4], num_samples, p=[0.2, 0.4, 0.3, 0.1]))
    lot_size = np.random.exponential(0.3, num_samples).clip(0.1, 5.0)
    year_built = np.random.normal(1990, 20, num_samples).clip(1920, 2024).astype(int)
    garage_capacity = np.random.choice([0, 1, 2, 3], num_samples, p=[0.1, 0.25, 0.50, 0.15])
    has_pool = np.random.choice([0, 1], num_samples, p=[0.75, 0.25])
    distance_to_city = np.random.exponential(10, num_samples).clip(1, 50)
    school_rating = np.random.uniform(3, 10, num_samples)
    crime_rate = np.random.exponential(5, num_samples).clip(0.5, 30)
    
    X = np.column_stack([
        sqft, bedrooms, bathrooms, lot_size, year_built,
        garage_capacity, has_pool, distance_to_city, school_rating, crime_rate
    ])
    
    # Generate prices
    price = (
        50000 + 150 * sqft + 15000 * bedrooms + 20000 * bathrooms +
        30000 * lot_size + 1000 * (year_built - 1950) + 15000 * garage_capacity +
        40000 * has_pool - 2000 * distance_to_city + 10000 * school_rating -
        3000 * crime_rate + np.random.normal(0, 30000, num_samples)
    )
    
    y = np.maximum(price, 50000).astype(np.float32)
    
    return X.astype(np.float32), y, feature_names

In [None]:
# Generate both datasets
print("Generating Classification Data (Customer Churn)...")
X_clf, y_clf, clf_features = generate_classification_data(NUM_SAMPLES)

print("Generating Regression Data (House Prices)...")
X_reg, y_reg, reg_features = generate_regression_data(NUM_SAMPLES)

print(f"\nClassification Dataset:")
print(f"  Shape: {X_clf.shape}")
print(f"  Churn Rate: {y_clf.mean()*100:.1f}%")
print(f"  Features: {clf_features}")

print(f"\nRegression Dataset:")
print(f"  Shape: {X_reg.shape}")
print(f"  Price Range: ${y_reg.min():,.0f} - ${y_reg.max():,.0f}")
print(f"  Mean Price: ${y_reg.mean():,.0f}")
print(f"  Features: {reg_features}")

In [None]:
def split_data(X, y, test_ratio=0.2, seed=42):
    """Split data into train and test sets."""
    np.random.seed(seed)
    indices = np.random.permutation(len(y))
    test_size = int(len(y) * test_ratio)
    
    test_idx = indices[:test_size]
    train_idx = indices[test_size:]
    
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Split classification data
X_clf_train, X_clf_test, y_clf_train, y_clf_test = split_data(X_clf, y_clf, TEST_RATIO)

# Split regression data
X_reg_train, X_reg_test, y_reg_train, y_reg_test = split_data(X_reg, y_reg, TEST_RATIO)

print(f"Classification - Train: {len(y_clf_train)}, Test: {len(y_clf_test)}")
print(f"Regression - Train: {len(y_reg_train)}, Test: {len(y_reg_test)}")

## Step 3: Visualize the Data

In [None]:
# Visualize classification data
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Create DataFrame for easier plotting
clf_df = pd.DataFrame(X_clf_train, columns=clf_features)
clf_df['churn'] = y_clf_train

features_to_plot = ['tenure_months', 'monthly_charges', 'support_tickets', 
                    'contract_type', 'num_products', 'total_charges']

for idx, feature in enumerate(features_to_plot):
    ax = axes.flatten()[idx]
    
    for label, color, name in [(0, 'blue', 'No Churn'), (1, 'red', 'Churn')]:
        subset = clf_df[clf_df['churn'] == label][feature]
        ax.hist(subset, bins=30, alpha=0.5, color=color, label=name)
    
    ax.set_xlabel(feature)
    ax.set_ylabel('Count')
    ax.legend()
    ax.set_title(f'{feature} by Churn Status')

plt.suptitle('Customer Churn - Feature Distributions', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# Visualize regression data
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

reg_df = pd.DataFrame(X_reg_train, columns=reg_features)
reg_df['price'] = y_reg_train

features_to_plot = ['sqft', 'bedrooms', 'year_built', 
                    'distance_to_city', 'school_rating', 'lot_size_acres']

for idx, feature in enumerate(features_to_plot):
    ax = axes.flatten()[idx]
    ax.scatter(reg_df[feature], reg_df['price'], alpha=0.3, s=5)
    ax.set_xlabel(feature)
    ax.set_ylabel('Price ($)')
    ax.set_title(f'{feature} vs Price')

plt.suptitle('House Price - Feature Relationships', fontsize=14)
plt.tight_layout()
plt.show()

## Step 4: Prepare Data for SageMaker Linear Learner

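Linear Learner accepts training data as RecordIO-wrapped protobuf or CSV. This notebook uses CSV in two layouts:
1. **Training CSV**: No header row; label in the first column, followed by the features
2. **Inference CSV**: Features only, in the same column order (for Batch Transform)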

We'll prepare both formats.
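
As a quick illustration of the two layouts, the sketch below formats one made-up record both ways using only the standard library. The helper names are hypothetical; the notebook itself writes the files with `np.savetxt`.

```python
def training_row(label, features):
    """Label first, then features; no header row."""
    return ",".join(f"{v:.6f}" for v in [label, *features])


def inference_row(features):
    """Features only, in the same column order used for training."""
    return ",".join(f"{v:.6f}" for v in features)


features = [12.0, 70.5, 3.0]  # made-up values for three features
print(training_row(1.0, features))  # 1.000000,12.000000,70.500000,3.000000
print(inference_row(features))      # 12.000000,70.500000,3.000000
```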

In [None]:
# Create local data directory
os.makedirs('data/classification', exist_ok=True)
os.makedirs('data/regression', exist_ok=True)

def save_csv_for_training(X, y, filepath):
    """Save data in SageMaker Linear Learner CSV format (label first) for training."""
    data = np.column_stack([y.reshape(-1, 1), X])
    np.savetxt(filepath, data, delimiter=',', fmt='%.6f')

def save_csv_for_inference(X, filepath):
    """Save features only for batch transform inference."""
    np.savetxt(filepath, X, delimiter=',', fmt='%.6f')

# Save classification data
save_csv_for_training(X_clf_train, y_clf_train, 'data/classification/train.csv')
save_csv_for_inference(X_clf_test, 'data/classification/test_features.csv')
# Also save test labels locally for evaluation
np.savetxt('data/classification/test_labels.csv', y_clf_test, delimiter=',', fmt='%.6f')

# Save regression data
save_csv_for_training(X_reg_train, y_reg_train, 'data/regression/train.csv')
save_csv_for_inference(X_reg_test, 'data/regression/test_features.csv')
# Also save test labels locally for evaluation
np.savetxt('data/regression/test_labels.csv', y_reg_test, delimiter=',', fmt='%.6f')

print("Data saved locally:")
print(f"  Classification train: {os.path.getsize('data/classification/train.csv') / 1024:.1f} KB")
print(f"  Classification test features: {os.path.getsize('data/classification/test_features.csv') / 1024:.1f} KB")
print(f"  Regression train: {os.path.getsize('data/regression/train.csv') / 1024:.1f} KB")
print(f"  Regression test features: {os.path.getsize('data/regression/test_features.csv') / 1024:.1f} KB")

In [None]:
# Upload to S3
s3_client = boto3.client('s3')

# Classification data
clf_train_s3 = f"{PREFIX}/classification/train/train.csv"
clf_test_s3 = f"{PREFIX}/classification/test/test_features.csv"

s3_client.upload_file('data/classification/train.csv', BUCKET_NAME, clf_train_s3)
s3_client.upload_file('data/classification/test_features.csv', BUCKET_NAME, clf_test_s3)

# Regression data
reg_train_s3 = f"{PREFIX}/regression/train/train.csv"
reg_test_s3 = f"{PREFIX}/regression/test/test_features.csv"

s3_client.upload_file('data/regression/train.csv', BUCKET_NAME, reg_train_s3)
s3_client.upload_file('data/regression/test_features.csv', BUCKET_NAME, reg_test_s3)

print("Data uploaded to S3:")
print(f"  Classification train: s3://{BUCKET_NAME}/{clf_train_s3}")
print(f"  Classification test: s3://{BUCKET_NAME}/{clf_test_s3}")
print(f"  Regression train: s3://{BUCKET_NAME}/{reg_train_s3}")
print(f"  Regression test: s3://{BUCKET_NAME}/{reg_test_s3}")

---

# Part A: Binary Classification (Customer Churn)

## Step 5A: Configure and Train the Classification Model

In [None]:
# Get the Linear Learner container image
linear_learner_image = retrieve(
    framework='linear-learner',
    region=region,
    version='1'
)

print(f"Linear Learner Image URI: {linear_learner_image}")

In [None]:
# Create the classification estimator
clf_estimator = Estimator(
    image_uri=linear_learner_image,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/classification/output',
    sagemaker_session=sagemaker_session,
    base_job_name='linear-learner-churn'
)

In [None]:
# Set classification hyperparameters
clf_hyperparameters = {
    # Task type
    "predictor_type": "binary_classifier",
    
    # Number of features
    "feature_dim": X_clf_train.shape[1],
    
    # Training parameters
    "epochs": 15,
    "mini_batch_size": 200,
    
    # Regularization
    "l1": 0.001,
    "wd": 0.0001,  # L2 weight decay
    
    # Optimization
    "learning_rate": 0.1,
    "optimizer": "adam",
    
    # Model selection
    "num_models": "auto",  # Train multiple models and select best
    
    # Normalization (important for linear models!)
    "normalize_data": "true",
    "normalize_label": "false",
    
    # Binary classification specific
    "binary_classifier_model_selection_criteria": "precision_at_target_recall",
    "target_recall": 0.9,  # Optimize precision while maintaining 90% recall
    "positive_example_weight_mult": "balanced",  # Handle class imbalance
}

clf_estimator.set_hyperparameters(**clf_hyperparameters)

print("Classification Hyperparameters:")
for k, v in clf_hyperparameters.items():
    print(f"  {k}: {v}")

In [None]:
# Define training input
clf_train_input = TrainingInput(
    s3_data=f's3://{BUCKET_NAME}/{clf_train_s3}',
    content_type='text/csv'
)

print("Starting classification training job...")
print("This will take approximately 3-5 minutes.\n")

clf_estimator.fit({'train': clf_train_input}, wait=True, logs=True)

In [None]:
# Get training job info
clf_training_job = clf_estimator.latest_training_job.name
print(f"Classification training job completed: {clf_training_job}")
print(f"Model artifacts: {clf_estimator.model_data}")

## Step 6A: Run Batch Transform for Classification Predictions

Instead of deploying an endpoint, we use Batch Transform to get predictions on our test set.
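
Batch Transform writes one output object per input object, named after the input file with `.out` appended. The sketch below shows how the output key can be derived and how JSON Lines predictions can be parsed. It is a sketch under assumptions: the helper names are hypothetical, the sample lines are made up rather than real job output, and it assumes the job returns one JSON object per line with `score` and `predicted_label` fields.

```python
import json
import posixpath


def transform_output_key(output_prefix, input_key):
    """Return the S3 key Batch Transform is expected to write for a given input key."""
    filename = posixpath.basename(input_key)
    return f"{output_prefix.rstrip('/')}/{filename}.out"


def parse_jsonlines_predictions(lines):
    """Parse one JSON object per non-empty line into (labels, scores)."""
    labels, scores = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate a trailing blank line
        record = json.loads(line)
        labels.append(int(record["predicted_label"]))
        scores.append(float(record["score"]))
    return labels, scores


key = transform_output_key(
    "linear-learner-exercise/classification/batch-output",
    "linear-learner-exercise/classification/test/test_features.csv",
)
print(key)

# Made-up sample lines for illustration only.
sample = ['{"score": 0.91, "predicted_label": 1}', '{"score": 0.12, "predicted_label": 0}', ""]
print(parse_jsonlines_predictions(sample))
```

Keeping the parser separate from the download step makes it easy to test against a few hand-written lines before running a real job.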

In [None]:
# Create a transformer for batch predictions
clf_transformer = clf_estimator.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/classification/batch-output',
    accept='application/jsonlines',  # one JSON object per line, matching the parser below
    assemble_with='Line'
)

print("Starting batch transform job for classification...")
print("This will take approximately 3-5 minutes.\n")

# Run batch transform
clf_transformer.transform(
    data=f's3://{BUCKET_NAME}/{clf_test_s3}',
    content_type='text/csv',
    split_type='Line',
    wait=True,
    logs=True
)

print(f"\nBatch transform completed!")
print(f"Output location: {clf_transformer.output_path}")

In [None]:
# Download and parse batch transform results
clf_output_key = f"{PREFIX}/classification/batch-output/test_features.csv.out"

# Download results
s3_client.download_file(BUCKET_NAME, clf_output_key, 'data/classification/predictions.csv')

# Parse predictions (Linear Learner returns JSON per line for classification)
clf_predictions = []
clf_scores = []

with open('data/classification/predictions.csv', 'r') as f:
    for line in f:
        pred = json.loads(line.strip())
        clf_predictions.append(pred['predicted_label'])
        clf_scores.append(pred['score'])

clf_predictions = np.array(clf_predictions)
clf_scores = np.array(clf_scores)

print(f"Loaded {len(clf_predictions)} predictions")
print(f"Prediction distribution: {np.bincount(clf_predictions.astype(int))}")

## Step 7A: Comprehensive Classification Model Evaluation

We'll calculate all standard classification metrics and create visualizations.
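
Most of the label-based metrics reduce to ratios of the four confusion-matrix counts. The following standard-library sketch computes them by hand for a tiny made-up example and shows how the counts shift with the decision threshold. The helper names are hypothetical; the notebook itself relies on scikit-learn for the real evaluation.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Return (tn, fp, fn, tp) after thresholding the scores."""
    tn = fp = fn = tp = 0
    for truth, score in zip(y_true, scores):
        pred = 1 if score >= threshold else 0
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def summarize(tn, fp, fn, tp):
    """Derive the ratio metrics from the four counts."""
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Made-up labels and scores for illustration.
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.3, 0.7, 0.9]

for threshold in (0.25, 0.5, 0.75):
    counts = confusion_counts(y_true, scores, threshold)
    print(threshold, counts, summarize(*counts))
```

Lowering the threshold trades precision for recall, which is why the threshold analysis later in this part sweeps it rather than fixing it at 0.5.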

In [None]:
def evaluate_classification_model(y_true, y_pred, y_scores, class_names=['No Churn', 'Churn']):
    """
    Comprehensive evaluation of a binary classification model.
    
    Parameters:
    -----------
    y_true : array-like
        True labels
    y_pred : array-like
        Predicted labels
    y_scores : array-like
        Prediction probabilities/scores for the positive class
    class_names : list
        Names of the classes
    
    Returns:
    --------
    dict : Dictionary containing all metrics
    """
    metrics = {}
    
    # Basic metrics
    metrics['accuracy'] = accuracy_score(y_true, y_pred)
    metrics['balanced_accuracy'] = balanced_accuracy_score(y_true, y_pred)
    metrics['precision'] = precision_score(y_true, y_pred)
    metrics['recall'] = recall_score(y_true, y_pred)
    metrics['f1_score'] = f1_score(y_true, y_pred)
    metrics['matthews_corrcoef'] = matthews_corrcoef(y_true, y_pred)
    
    # Probability-based metrics
    metrics['roc_auc'] = roc_auc_score(y_true, y_scores)
    metrics['avg_precision'] = average_precision_score(y_true, y_scores)
    metrics['log_loss'] = log_loss(y_true, y_scores)
    
    # Confusion matrix components
    # labels=[0, 1] keeps the matrix 2x2 even if one class is missing from the data
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    metrics['true_negatives'] = tn
    metrics['false_positives'] = fp
    metrics['false_negatives'] = fn
    metrics['true_positives'] = tp
    metrics['specificity'] = tn / (tn + fp) if (tn + fp) > 0 else 0
    metrics['negative_predictive_value'] = tn / (tn + fn) if (tn + fn) > 0 else 0
    
    return metrics


def print_classification_report(metrics, title="Classification Metrics"):
    """Print a formatted classification report."""
    print("=" * 60)
    print(f" {title}")
    print("=" * 60)
    
    print("\n--- Core Metrics ---")
    print(f"  Accuracy:           {metrics['accuracy']:.4f}")
    print(f"  Balanced Accuracy:  {metrics['balanced_accuracy']:.4f}")
    print(f"  Precision:          {metrics['precision']:.4f}")
    print(f"  Recall (Sensitivity): {metrics['recall']:.4f}")
    print(f"  Specificity:        {metrics['specificity']:.4f}")
    print(f"  F1 Score:           {metrics['f1_score']:.4f}")
    
    print("\n--- Probability-Based Metrics ---")
    print(f"  ROC AUC:            {metrics['roc_auc']:.4f}")
    print(f"  Average Precision:  {metrics['avg_precision']:.4f}")
    print(f"  Log Loss:           {metrics['log_loss']:.4f}")
    
    print("\n--- Additional Metrics ---")
    print(f"  Matthews Corr Coef: {metrics['matthews_corrcoef']:.4f}")
    print(f"  Neg Pred Value:     {metrics['negative_predictive_value']:.4f}")
    
    print("\n--- Confusion Matrix ---")
    print(f"  True Negatives:  {metrics['true_negatives']}")
    print(f"  False Positives: {metrics['false_positives']}")
    print(f"  False Negatives: {metrics['false_negatives']}")
    print(f"  True Positives:  {metrics['true_positives']}")
    print("=" * 60)

In [None]:
# Calculate all classification metrics
clf_metrics = evaluate_classification_model(y_clf_test, clf_predictions, clf_scores)
print_classification_report(clf_metrics, "Customer Churn Classification Metrics")

In [None]:
# Print sklearn's classification report for additional detail
print("\nDetailed Classification Report:")
print(classification_report(y_clf_test, clf_predictions, target_names=['No Churn', 'Churn']))

In [None]:
# Visualization: Confusion Matrix
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw counts
cm = confusion_matrix(y_clf_test, clf_predictions)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['No Churn', 'Churn'],
            yticklabels=['No Churn', 'Churn'])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix (Counts)')

# Normalized (percentages)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_normalized, annot=True, fmt='.2%', cmap='Blues', ax=axes[1],
            xticklabels=['No Churn', 'Churn'],
            yticklabels=['No Churn', 'Churn'])
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].set_title('Confusion Matrix (Normalized)')

plt.tight_layout()
plt.show()

In [None]:
# Visualization: ROC Curve and Precision-Recall Curve
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC Curve
fpr, tpr, thresholds_roc = roc_curve(y_clf_test, clf_scores)
axes[0].plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC Curve (AUC = {clf_metrics["roc_auc"]:.3f})')
axes[0].plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
axes[0].fill_between(fpr, tpr, alpha=0.2)
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve')
axes[0].legend(loc='lower right')
axes[0].grid(True, alpha=0.3)

# Precision-Recall Curve
precision_curve, recall_curve, thresholds_pr = precision_recall_curve(y_clf_test, clf_scores)
axes[1].plot(recall_curve, precision_curve, 'b-', linewidth=2, 
             label=f'PR Curve (AP = {clf_metrics["avg_precision"]:.3f})')
axes[1].axhline(y=y_clf_test.mean(), color='k', linestyle='--', linewidth=1, label='Baseline (Positive Rate)')
axes[1].fill_between(recall_curve, precision_curve, alpha=0.2)
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
axes[1].legend(loc='lower left')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Visualization: Score Distribution and Threshold Analysis
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Score distribution by actual class
axes[0].hist(clf_scores[y_clf_test == 0], bins=50, alpha=0.5, label='No Churn (Actual)', color='blue', density=True)
axes[0].hist(clf_scores[y_clf_test == 1], bins=50, alpha=0.5, label='Churn (Actual)', color='red', density=True)
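# 0.5 is only a reference line: with precision_at_target_recall the model tunes its own threshold
axes[0].axvline(x=0.5, color='black', linestyle='--', linewidth=2, label='Reference Threshold (0.5)')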
axes[0].set_xlabel('Prediction Score (Probability of Churn)')
axes[0].set_ylabel('Density')
axes[0].set_title('Score Distribution by Actual Class')
axes[0].legend()

# Threshold analysis
thresholds = np.linspace(0.01, 0.99, 99)
f1_scores = []
precisions = []
recalls = []

for thresh in thresholds:
    y_pred_thresh = (clf_scores >= thresh).astype(int)
    f1_scores.append(f1_score(y_clf_test, y_pred_thresh, zero_division=0))
    precisions.append(precision_score(y_clf_test, y_pred_thresh, zero_division=0))
    recalls.append(recall_score(y_clf_test, y_pred_thresh, zero_division=0))

axes[1].plot(thresholds, f1_scores, 'g-', linewidth=2, label='F1 Score')
axes[1].plot(thresholds, precisions, 'b-', linewidth=2, label='Precision')
axes[1].plot(thresholds, recalls, 'r-', linewidth=2, label='Recall')
axes[1].axvline(x=0.5, color='black', linestyle='--', linewidth=1, alpha=0.5)
best_thresh = thresholds[np.argmax(f1_scores)]
axes[1].axvline(x=best_thresh, color='green', linestyle=':', linewidth=2, label=f'Best F1 Threshold ({best_thresh:.2f})')
axes[1].set_xlabel('Classification Threshold')
axes[1].set_ylabel('Score')
axes[1].set_title('Metrics vs Classification Threshold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nOptimal threshold for F1 Score: {best_thresh:.3f}")
print(f"F1 Score at optimal threshold: {max(f1_scores):.4f}")

---

# Part B: Regression (House Price Prediction)

## Step 5B: Configure and Train the Regression Model

In [None]:
# Create the regression estimator
reg_estimator = Estimator(
    image_uri=linear_learner_image,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/regression/output',
    sagemaker_session=sagemaker_session,
    base_job_name='linear-learner-housing'
)

In [None]:
# Set regression hyperparameters
reg_hyperparameters = {
    # Task type
    "predictor_type": "regressor",
    
    # Number of features
    "feature_dim": X_reg_train.shape[1],
    
    # Training parameters
    "epochs": 15,
    "mini_batch_size": 200,
    
    # Regularization
    "l1": 0.001,
    "wd": 0.0001,  # L2 weight decay
    
    # Optimization
    "learning_rate": 0.1,
    "optimizer": "adam",
    
    # Model selection
    "num_models": "auto",
    
    # Normalization (important for linear models!)
    "normalize_data": "true",
    "normalize_label": "true",  # Normalize labels for regression
    
    # Loss function
    "loss": "squared_loss",  # Standard MSE loss
}

reg_estimator.set_hyperparameters(**reg_hyperparameters)

print("Regression Hyperparameters:")
for k, v in reg_hyperparameters.items():
    print(f"  {k}: {v}")

In [None]:
# Define training input
reg_train_input = TrainingInput(
    s3_data=f's3://{BUCKET_NAME}/{reg_train_s3}',
    content_type='text/csv'
)

print("Starting regression training job...")
print("This will take approximately 3-5 minutes.\n")

reg_estimator.fit({'train': reg_train_input}, wait=True, logs=True)

In [None]:
# Get training job info
reg_training_job = reg_estimator.latest_training_job.name
print(f"Regression training job completed: {reg_training_job}")
print(f"Model artifacts: {reg_estimator.model_data}")

## Step 6B: Run Batch Transform for Regression Predictions

In [None]:
# Create a transformer for batch predictions
reg_transformer = reg_estimator.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/regression/batch-output',
    accept='application/jsonlines',  # one JSON object per line, matching the parser below
    assemble_with='Line'
)

print("Starting batch transform job for regression...")
print("This will take approximately 3-5 minutes.\n")

# Run batch transform
reg_transformer.transform(
    data=f's3://{BUCKET_NAME}/{reg_test_s3}',
    content_type='text/csv',
    split_type='Line',
    wait=True,
    logs=True
)

print(f"\nBatch transform completed!")
print(f"Output location: {reg_transformer.output_path}")

In [None]:
# Download and parse batch transform results
reg_output_key = f"{PREFIX}/regression/batch-output/test_features.csv.out"

# Download results
s3_client.download_file(BUCKET_NAME, reg_output_key, 'data/regression/predictions.csv')

# Parse predictions (Linear Learner returns JSON per line for regression)
reg_predictions = []

with open('data/regression/predictions.csv', 'r') as f:
    for line in f:
        pred = json.loads(line.strip())
        reg_predictions.append(pred['score'])

reg_predictions = np.array(reg_predictions)

print(f"Loaded {len(reg_predictions)} predictions")
print(f"Prediction range: ${reg_predictions.min():,.0f} - ${reg_predictions.max():,.0f}")

## Step 7B: Comprehensive Regression Model Evaluation

We'll calculate all standard regression metrics and create visualizations.
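
For reference, the headline regression metrics can be written out in a few lines. This standard-library sketch uses made-up prices and a hypothetical feature count; adjusted R2 is computed as `1 - (1 - R2) * (n - 1) / (n - p - 1)`, where `n` is the number of samples and `p` the number of features.

```python
import math


def regression_summary(y_true, y_pred, num_features):
    """Return RMSE, MAE, R2, and adjusted R2 for paired true/predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mae = sum(abs(r) for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - num_features - 1)
    return {"rmse": math.sqrt(ss_res / n), "mae": mae, "r2": r2, "adjusted_r2": adjusted_r2}


# Made-up prices in dollars, for illustration only.
y_true = [200_000, 300_000, 400_000, 500_000, 600_000]
y_pred = [210_000, 290_000, 410_000, 490_000, 610_000]

summary = regression_summary(y_true, y_pred, num_features=2)
print(summary)
```

Adjusted R2 is always at most R2; the gap is negligible when the sample count is much larger than the feature count, as it is for the test set in this notebook.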

In [None]:
def evaluate_regression_model(y_true, y_pred):
    """
    Comprehensive evaluation of a regression model.
    
    Parameters:
    -----------
    y_true : array-like
        True values
    y_pred : array-like
        Predicted values
    
    Returns:
    --------
    dict : Dictionary containing all metrics
    """
    metrics = {}
    
    # Error metrics
    metrics['mse'] = mean_squared_error(y_true, y_pred)
    metrics['rmse'] = np.sqrt(metrics['mse'])
    metrics['mae'] = mean_absolute_error(y_true, y_pred)
    metrics['median_ae'] = median_absolute_error(y_true, y_pred)
    metrics['max_error'] = max_error(y_true, y_pred)
    
    # Percentage-based metrics
    metrics['mape'] = mean_absolute_percentage_error(y_true, y_pred) * 100
    
    # Symmetric MAPE (handles zeros better)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    denominator = np.where(denominator == 0, 1, denominator)  # Avoid division by zero
    metrics['smape'] = np.mean(np.abs(y_true - y_pred) / denominator) * 100
    
    # Explained variance metrics
    metrics['r2'] = r2_score(y_true, y_pred)
    metrics['explained_variance'] = explained_variance_score(y_true, y_pred)
    
    # Adjusted R2 penalizes R2 for the number of predictors p
    n = len(y_true)
    p = 10  # number of features; hardcoded to match the 10-column datasets in this notebook
    metrics['adjusted_r2'] = 1 - (1 - metrics['r2']) * (n - 1) / (n - p - 1)
    
    # Residual statistics
    residuals = y_true - y_pred
    metrics['residual_mean'] = np.mean(residuals)
    metrics['residual_std'] = np.std(residuals)
    metrics['residual_median'] = np.median(residuals)
    
    # Within X% accuracy
    pct_errors = np.abs(residuals / y_true) * 100
    metrics['within_5pct'] = np.mean(pct_errors <= 5) * 100
    metrics['within_10pct'] = np.mean(pct_errors <= 10) * 100
    metrics['within_20pct'] = np.mean(pct_errors <= 20) * 100
    
    return metrics


def print_regression_report(metrics, value_prefix='$', title="Regression Metrics"):
    """Print a formatted regression report."""
    print("=" * 60)
    print(f" {title}")
    print("=" * 60)
    
    print("\n--- Error Metrics ---")
    print(f"  RMSE:               {value_prefix}{metrics['rmse']:,.0f}")
    print(f"  MAE:                {value_prefix}{metrics['mae']:,.0f}")
    print(f"  Median AE:          {value_prefix}{metrics['median_ae']:,.0f}")
    print(f"  Max Error:          {value_prefix}{metrics['max_error']:,.0f}")
    print(f"  MSE:                {metrics['mse']:,.0f}")
    
    print("\n--- Percentage Metrics ---")
    print(f"  MAPE:               {metrics['mape']:.2f}%")
    print(f"  SMAPE:              {metrics['smape']:.2f}%")
    
    print("\n--- Explained Variance ---")
    print(f"  R-squared (R2):     {metrics['r2']:.4f}")
    print(f"  Adjusted R2:        {metrics['adjusted_r2']:.4f}")
    print(f"  Explained Variance: {metrics['explained_variance']:.4f}")
    
    print("\n--- Residual Statistics ---")
    print(f"  Mean Residual:      {value_prefix}{metrics['residual_mean']:,.0f}")
    print(f"  Residual Std Dev:   {value_prefix}{metrics['residual_std']:,.0f}")
    print(f"  Median Residual:    {value_prefix}{metrics['residual_median']:,.0f}")
    
    print("\n--- Prediction Accuracy ---")
    print(f"  Within 5%:          {metrics['within_5pct']:.1f}%")
    print(f"  Within 10%:         {metrics['within_10pct']:.1f}%")
    print(f"  Within 20%:         {metrics['within_20pct']:.1f}%")
    print("=" * 60)

In [None]:
# Calculate all regression metrics
reg_metrics = evaluate_regression_model(y_reg_test, reg_predictions)
print_regression_report(reg_metrics, value_prefix='$', title="House Price Regression Metrics")

In [None]:
# Visualization: Actual vs Predicted with multiple views
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

residuals = y_reg_test - reg_predictions

# 1. Actual vs Predicted scatter plot
ax = axes[0, 0]
ax.scatter(y_reg_test, reg_predictions, alpha=0.3, s=10)
min_val = min(y_reg_test.min(), reg_predictions.min())
max_val = max(y_reg_test.max(), reg_predictions.max())
ax.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect Prediction')
ax.set_xlabel('Actual Price ($)')
ax.set_ylabel('Predicted Price ($)')
ax.set_title(f'Actual vs Predicted (R2 = {reg_metrics["r2"]:.4f})')
ax.legend()
ax.grid(True, alpha=0.3)

# 2. Residuals vs Predicted
ax = axes[0, 1]
ax.scatter(reg_predictions, residuals, alpha=0.3, s=10)
ax.axhline(y=0, color='r', linestyle='--', linewidth=2)
ax.axhline(y=residuals.mean(), color='g', linestyle=':', linewidth=2, label=f'Mean: ${residuals.mean():,.0f}')
ax.fill_between([reg_predictions.min(), reg_predictions.max()], 
                [-2*residuals.std(), -2*residuals.std()],
                [2*residuals.std(), 2*residuals.std()], alpha=0.1, color='blue')
ax.set_xlabel('Predicted Price ($)')
ax.set_ylabel('Residual ($)')
ax.set_title('Residual Plot')
ax.legend()
ax.grid(True, alpha=0.3)

# 3. Residual distribution
ax = axes[1, 0]
ax.hist(residuals, bins=50, edgecolor='black', alpha=0.7, density=True)
ax.axvline(x=0, color='red', linestyle='--', linewidth=2, label='Zero Error')
ax.axvline(x=residuals.mean(), color='green', linestyle=':', linewidth=2, 
           label=f'Mean: ${residuals.mean():,.0f}')
# Add normal distribution overlay
x_norm = np.linspace(residuals.min(), residuals.max(), 100)
from scipy import stats
ax.plot(x_norm, stats.norm.pdf(x_norm, residuals.mean(), residuals.std()), 
        'r-', linewidth=2, label='Normal Fit')
ax.set_xlabel('Prediction Error ($)')
ax.set_ylabel('Density')
ax.set_title('Distribution of Residuals')
ax.legend()

# 4. Percentage error distribution
ax = axes[1, 1]
pct_errors = (residuals / y_reg_test) * 100
ax.hist(pct_errors, bins=50, edgecolor='black', alpha=0.7, range=(-50, 50))
ax.axvline(x=0, color='red', linestyle='--', linewidth=2, label='Zero Error')
ax.axvline(x=-10, color='green', linestyle=':', linewidth=1, alpha=0.7)
ax.axvline(x=10, color='green', linestyle=':', linewidth=1, alpha=0.7, label='+/- 10%')
ax.set_xlabel('Percentage Error (%)')
ax.set_ylabel('Count')
ax.set_title(f'Percentage Error Distribution (MAPE = {reg_metrics["mape"]:.2f}%)')
ax.legend()

plt.tight_layout()
plt.show()

In [None]:
# Additional visualization: Error by price range
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

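# Create price bins (np.asarray makes pd.cut return a Categorical, which exposes .categories)
price_bins = pd.cut(np.asarray(y_reg_test), bins=5)
# Escape '$' so matplotlib does not parse the text between two dollar signs as mathtext
bin_labels = [rf'\${int(b.left/1000)}K-\${int(b.right/1000)}K' for b in price_bins.categories]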

# MAE by price range
mae_by_bin = []
mape_by_bin = []
counts_by_bin = []

for interval in price_bins.categories:
    mask = price_bins == interval
    if mask.sum() > 0:
        mae_by_bin.append(mean_absolute_error(y_reg_test[mask], reg_predictions[mask]))
        mape_by_bin.append(mean_absolute_percentage_error(y_reg_test[mask], reg_predictions[mask]) * 100)
        counts_by_bin.append(mask.sum())
    else:
        # Empty bin: draw a zero-height bar; the n=0 label marks it as empty
        mae_by_bin.append(0)
        mape_by_bin.append(0)
        counts_by_bin.append(0)

x = range(len(bin_labels))

ax = axes[0]
bars = ax.bar(x, mae_by_bin, color='steelblue', edgecolor='black')
ax.set_xticks(x)
ax.set_xticklabels(bin_labels, rotation=45, ha='right')
ax.set_xlabel('Price Range')
ax.set_ylabel('Mean Absolute Error ($)')
ax.set_title('MAE by Price Range')
# Add count labels
for bar, count in zip(bars, counts_by_bin):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'n={count}', 
            ha='center', va='bottom', fontsize=9)

ax = axes[1]
bars = ax.bar(x, mape_by_bin, color='coral', edgecolor='black')
ax.set_xticks(x)
ax.set_xticklabels(bin_labels, rotation=45, ha='right')
ax.set_xlabel('Price Range')
ax.set_ylabel('Mean Absolute Percentage Error (%)')
ax.set_title('MAPE by Price Range')
# Add count labels
for bar, count in zip(bars, counts_by_bin):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'n={count}', 
            ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

In [None]:
# Q-Q Plot for residuals (normality check)
fig, ax = plt.subplots(figsize=(8, 6))

from scipy import stats
stats.probplot(residuals, dist="norm", plot=ax)
ax.set_title('Q-Q Plot of Residuals (Normality Check)')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Shapiro-Wilk test for normality (on a sample if dataset is large)
sample_size = min(5000, len(residuals))
sample_residuals = np.random.choice(residuals, size=sample_size, replace=False)
stat, p_value = stats.shapiro(sample_residuals)
print("Shapiro-Wilk Test for Normality:")
print(f"  Statistic: {stat:.4f}")
print(f"  P-value: {p_value:.4f}")
print(f"  Residuals {'appear' if p_value > 0.05 else 'do not appear'} normally distributed (alpha=0.05)")

---

## Step 8: Clean Up S3 Data (Optional)

In [None]:
# Uncomment to delete S3 data

# import boto3
# s3 = boto3.resource('s3')
# bucket = s3.Bucket(BUCKET_NAME)
# bucket.objects.filter(Prefix=PREFIX).delete()
# print(f"Deleted all objects under s3://{BUCKET_NAME}/{PREFIX}")

---

## Summary

In this exercise, you learned:

### Data Format
- Linear Learner accepts CSV format with label in the first column (for training)
- For batch transform inference, provide features only (no label column)
- Normalization is handled automatically by the algorithm
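
The two CSV layouts above can be sketched with the standard library alone. This is an illustration rather than the notebook's own data preparation: the helper names `to_training_csv` and `to_inference_csv` and the toy values are hypothetical. Neither file has a header row.

```python
import csv
import io


def to_training_csv(features, labels):
    """Serialize rows as label-first CSV with no header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row, label in zip(features, labels):
        writer.writerow([label, *row])
    return buf.getvalue()


def to_inference_csv(features):
    """Serialize feature-only rows (no label column, no header)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(features)
    return buf.getvalue()


# Hypothetical toy data: two features per row (square feet, bedrooms)
features = [[1500, 3], [2400, 4]]
labels = [250000, 410000]

train_csv = to_training_csv(features, labels)   # label is the first column
infer_csv = to_inference_csv(features)          # same rows without the label
```

The column order of the inference file must match the feature order used for training, since the algorithm identifies features by position only.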

### Batch Transform
- More cost-effective than endpoints for batch predictions
- No endpoint to clean up: compute resources terminate when the job completes (the output written to S3 persists until you delete it)
- Ideal for model evaluation and offline scoring
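
Once a transform job finishes, its output files have to be parsed before any metric can be computed. The sketch below assumes the job was configured for JSON Lines output, in which case each Linear Learner line looks like `{"score": ...}` for a regressor or `{"score": ..., "predicted_label": ...}` for a binary classifier. The helper name `parse_predictions` and the sample lines are hypothetical; adapt the parsing if your transformer requested a different `accept` type.

```python
import json


def parse_predictions(jsonl_text, key="score"):
    """Extract one value per non-empty line from JSON Lines output."""
    values = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines at the end of the file
        values.append(json.loads(line)[key])
    return values


# Made-up sample of what a binary classifier's output file might contain
sample_output = (
    '{"score": 0.91, "predicted_label": 1}\n'
    '{"score": 0.12, "predicted_label": 0}\n'
)

scores = parse_predictions(sample_output, key="score")
predicted_labels = parse_predictions(sample_output, key="predicted_label")
```

Batch Transform should keep the output rows in the same order as the input rows, so the parsed values can be compared position by position with the held-out labels.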

### Classification Metrics Covered
| Metric | Description |
|--------|-------------|
| Accuracy | Overall correctness |
| Balanced Accuracy | Average of recall for each class |
| Precision | True positives / Predicted positives |
| Recall (Sensitivity) | True positives / Actual positives |
| Specificity | True negatives / Actual negatives |
| F1 Score | Harmonic mean of precision and recall |
| ROC AUC | Area under the ROC curve |
| Average Precision | Area under the precision-recall curve |
| Log Loss | Logarithmic loss (probabilistic) |
| Matthews Correlation | Correlation between predicted and actual |
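
Several of the metrics in this table follow directly from the four confusion-matrix counts. The sketch below is a worked example with made-up counts, not the notebook's evaluation function. It assumes every denominator is non-zero, and the helper name `confusion_metrics` is hypothetical.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Made-up counts for illustration
m = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name:>18}: {value:.4f}")
```

ROC AUC, Average Precision, and Log Loss are not included because they depend on the predicted scores rather than on a single thresholded confusion matrix.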

### Regression Metrics Covered
| Metric | Description |
|--------|-------------|
| RMSE | Root Mean Square Error |
| MAE | Mean Absolute Error |
| Median AE | Median Absolute Error (robust to outliers) |
| Max Error | Worst case error |
| MAPE | Mean Absolute Percentage Error |
| SMAPE | Symmetric MAPE |
| R-squared | Coefficient of determination |
| Adjusted R2 | R2 adjusted for number of predictors |
| Explained Variance | Proportion of variance explained |
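
The percentage and variance-based metrics can be checked by hand on a small example. The sketch below uses one common definition of SMAPE (other variants exist, and the notebook's `evaluate_regression_model` may use a different one). The toy values and the choice of two predictors are assumptions for illustration.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be non-zero)."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def smape(y_true, y_pred):
    """Symmetric MAPE, in percent: 2|p - t| / (|t| + |p|), averaged."""
    terms = [2 * abs(p - t) / (abs(t) + abs(p)) for t, p in zip(y_true, y_pred)]
    return 100 * sum(terms) / len(terms)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R2 for the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Toy values for illustration
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 330, 400]

mape_value = mape(y_true, y_pred)
smape_value = smape(y_true, y_pred)
r2_value = r_squared(y_true, y_pred)
adj_r2_value = adjusted_r_squared(r2_value, n_samples=len(y_true), n_features=2)
```

With few samples relative to the number of predictors, Adjusted R2 drops noticeably below R2, which is why the two are reported side by side.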

### Key Hyperparameters
| Parameter | Description |
|-----------|-------------|
| `predictor_type` | binary_classifier, multiclass_classifier, or regressor |
| `feature_dim` | Number of features (required) |
| `num_models` | Number of parallel models to train |
| `normalize_data` | Normalize features (recommended) |
| `l1` | L1 regularization strength |
| `wd` | L2 regularization (weight decay) |
| `learning_rate` | Learning rate |
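
As an illustration of how these parameters fit together, the sketch below builds a hyperparameter dictionary for a regressor and runs two basic sanity checks. The values are placeholders rather than recommendations, `feature_dim` of 8 is a hypothetical feature count, and `check_hyperparameters` is a hypothetical helper, not part of the SageMaker SDK.

```python
# Illustrative values only; tune them for your dataset.
regression_hyperparameters = {
    "predictor_type": "regressor",
    "feature_dim": 8,            # hypothetical number of feature columns
    "num_models": "auto",        # let the algorithm choose how many models to train
    "normalize_data": "true",
    "l1": 0.0,
    "wd": 0.01,
    "learning_rate": 0.01,
}


def check_hyperparameters(hp):
    """Raise ValueError for two common mistakes; return hp unchanged otherwise."""
    allowed = {"binary_classifier", "multiclass_classifier", "regressor"}
    if hp.get("predictor_type") not in allowed:
        raise ValueError(f"predictor_type must be one of {sorted(allowed)}")
    if hp["predictor_type"] == "multiclass_classifier" and "num_classes" not in hp:
        raise ValueError("num_classes is required for multiclass_classifier")
    return hp


checked = check_hyperparameters(regression_hyperparameters)
```

A dictionary like this could then be applied with `estimator.set_hyperparameters(**checked)` before calling `fit`.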

### Best Practices
1. **Always normalize data** - Set `normalize_data: true`
2. **Use regularization** - Helps prevent overfitting
3. **Train multiple models** - Set `num_models: auto` for automatic tuning
4. **Use Batch Transform** - More cost-effective than endpoints for evaluation
5. **Check multiple metrics** - Don't rely on a single metric for evaluation

## Resources

- [Linear Learner Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html)
- [Linear Learner Hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html)
- [SageMaker Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html)
- [SageMaker Python SDK](https://sagemaker.readthedocs.io/)