# SageMaker Image Classification Exercise

This notebook demonstrates Amazon SageMaker's **Image Classification** algorithm for classifying images into categories.

## What You'll Learn
1. How to prepare image data for classification
2. How to configure and understand image classification hyperparameters
3. How to train an image classifier with transfer learning
4. How to interpret and evaluate classification predictions

## What is Image Classification?

Image Classification assigns one or more labels to an **entire image**. Unlike object detection, it doesn't localize where objects are - it simply answers "what's in this image?"

**SageMaker provides two implementations:**
- **MXNet-based**: CNN with ResNet architecture (covered in this notebook)
- **TensorFlow-based**: Transfer learning from TensorFlow Hub models

## Comparison with Object Detection

| Aspect | Image Classification | Object Detection |
|--------|---------------------|------------------|
| Output | Single/multiple labels for whole image | Labels + bounding boxes |
| Question | "What's in this image?" | "What and where?" |
| Complexity | Simpler | More complex |
| Use case | Product categorization | Inventory counting |

## Use Cases

| Industry | Application |
|----------|-------------|
| E-commerce | Product categorization, visual search |
| Healthcare | Medical image diagnosis, X-ray analysis |
| Manufacturing | Defect detection, quality control |
| Social Media | Content moderation, auto-tagging |
| Agriculture | Plant disease identification, crop classification |
| Security | Scene classification, activity recognition |

---

## ⚠️ Important: Training Cost Warning

<div style="background-color: #060604ff; border: 1px solid #ffc107; border-radius: 5px; padding: 15px; margin: 10px 0;">

### GPU Requirements and Costs

**Image Classification training requires GPU instances.** Like Object Detection, this algorithm uses deep neural networks that are computationally intensive.

| Instance Type | GPU | Memory | On-Demand Price* |
|---------------|-----|--------|------------------|
| ml.p2.xlarge | 1x K80 | 12 GB | ~$1.26/hour |
| ml.p3.2xlarge | 1x V100 | 16 GB | ~$3.83/hour |
| ml.p3.8xlarge | 4x V100 | 64 GB | ~$14.69/hour |
| ml.g4dn.xlarge | 1x T4 | 16 GB | ~$0.74/hour |
| ml.g4dn.2xlarge | 1x T4 | 32 GB | ~$1.05/hour |
| ml.g5.xlarge | 1x A10G | 24 GB | ~$1.41/hour |

*Prices are approximate for us-west-2 and subject to change. Check [AWS SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) for current rates.

### Cost Estimation Example

Training a typical image classification model:
- **30 epochs** with **50,000 images** (CIFAR-10 size): ~1-2 hours on ml.p3.2xlarge
- **Estimated cost**: $3.83 - $7.66 for training
- With transfer learning (pretrained model): Often converges faster, reducing cost

### Cost-Saving Recommendations

1. **Use Spot Instances**: Managed Spot Training can save up to 90% over On-Demand - add `use_spot_instances=True` and a `max_wait` value to the Estimator
2. **Start with ml.g4dn.xlarge**: Lowest hourly rate in the table above (~$0.74/hour)
3. **Use transfer learning**: Set `use_pretrained_model=1` - requires fewer epochs
4. **Start with fewer epochs**: Use 10 epochs to validate setup before full training
5. **Use smaller ResNet**: `num_layers=18` or `50` instead of `152` for faster iteration
6. **Enable early stopping**: Stop when validation accuracy plateaus

</div>
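
The arithmetic behind the cost estimate above can be sketched in a few lines. The helper `estimate_training_cost` and its `spot_discount` parameter are hypothetical names used only for illustration; the hourly rate is the approximate figure from the table, not a quoted price.

```python
def estimate_training_cost(hours, hourly_rate, spot_discount=0.0):
    """Return the estimated cost in USD for a single-instance training job.

    spot_discount is an assumed fractional saving (0.0 means On-Demand).
    """
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    return hours * hourly_rate * (1.0 - spot_discount)


# 1-2 hours on ml.p3.2xlarge at ~$3.83/hour (approximate table value)
low = estimate_training_cost(1, 3.83)
high = estimate_training_cost(2, 3.83)
print(f"On-Demand estimate: ${low:.2f} - ${high:.2f}")
```

Passing a non-zero `spot_discount` gives a rough Spot figure, but actual Spot savings vary by region, instance type, and time.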

## Step 1: Setup and Imports

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator
import numpy as np
import json
import os
from datetime import datetime
from dotenv import load_dotenv
import matplotlib.pyplot as plt
from collections import defaultdict

# Load environment variables from .env file
load_dotenv()

# Configure AWS session from environment variables
aws_profile = os.getenv('AWS_PROFILE')
aws_region = os.getenv('AWS_REGION', 'us-west-2')
sagemaker_role = os.getenv('SAGEMAKER_ROLE_ARN')

if aws_profile:
    boto3.setup_default_session(profile_name=aws_profile, region_name=aws_region)
else:
    boto3.setup_default_session(region_name=aws_region)

# SageMaker session and role
sagemaker_session = sagemaker.Session()

if sagemaker_role:
    role = sagemaker_role
else:
    role = get_execution_role()

region = sagemaker_session.boto_region_name

print(f"AWS Profile: {aws_profile or 'default'}")
print(f"SageMaker Role: {role}")
print(f"Region: {region}")
print(f"SageMaker SDK Version: {sagemaker.__version__}")

In [None]:
# Configuration
BUCKET_NAME = sagemaker_session.default_bucket()
PREFIX = "image-classification"

print(f"S3 Bucket: {BUCKET_NAME}")
print(f"S3 Prefix: {PREFIX}")

## Step 2: Understand Data Formats

SageMaker Image Classification supports multiple data formats. Choosing the right format depends on your dataset size and workflow.

### Format 1: RecordIO (Recommended for Large Datasets)

Binary format that packs images and labels together. Most efficient for training but requires preprocessing.

```bash
# RecordIO files
train.rec
train.idx
validation.rec
validation.idx
```

**Pros**: Fastest training, efficient I/O  
**Cons**: Requires preprocessing step to create

### Format 2: Image Files + LST File (Easiest to Understand)

Plain image files with a list file mapping images to labels.

```
# Directory structure
train/
  cat/
    image001.jpg
    image002.jpg
  dog/
    image001.jpg
    image002.jpg

# train.lst format: index \t label \t path
0\t0\ttrain/cat/image001.jpg
1\t0\ttrain/cat/image002.jpg
2\t1\ttrain/dog/image001.jpg
3\t1\ttrain/dog/image002.jpg
```

**Pros**: Easy to create and inspect  
**Cons**: Slower than RecordIO for large datasets
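
A minimal sketch of how an LST file could be generated from a class-per-folder layout like the one above. The helper `build_lst_lines` is a hypothetical name, and it assumes paths are written relative to the directory you pass in; adjust the prefix to match how the images are uploaded to S3.

```python
import os
import tempfile


def build_lst_lines(root_dir):
    """Return (lines, class_names) for images stored as root_dir/<class>/<file>."""
    class_names = sorted(
        d for d in os.listdir(root_dir)
        if os.path.isdir(os.path.join(root_dir, d))
    )
    lines = []
    index = 0
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(root_dir, class_name)
        for file_name in sorted(os.listdir(class_dir)):
            if file_name.lower().endswith((".jpg", ".jpeg", ".png")):
                # Path is relative to root_dir, with forward slashes for S3.
                lines.append(f"{index}\t{label}\t{class_name}/{file_name}")
                index += 1
    return lines, class_names


# Demonstrate on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for cls in ("cat", "dog"):
        os.makedirs(os.path.join(tmp, cls))
        for name in ("image001.jpg", "image002.jpg"):
            open(os.path.join(tmp, cls, name), "w").close()
    demo_lines, demo_classes = build_lst_lines(tmp)

print("\n".join(demo_lines))
```

Class IDs are assigned in alphabetical folder order here, which is a design choice of this sketch; keep the same mapping for training, validation, and inference.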

### Format 3: Augmented Manifest (For Ground Truth Integration)

JSON Lines format with S3 references - ideal when using SageMaker Ground Truth.

```json
{"source-ref": "s3://bucket/image.jpg", "class": 0}
{"source-ref": "s3://bucket/image2.jpg", "class": 1}
```

**Pros**: Direct Ground Truth integration  
**Cons**: Requires S3 paths for all images and Pipe input mode
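
As a rough illustration, manifest lines in the shape shown above can be produced with the standard `json` module. The helper `build_manifest_lines` is a hypothetical name, and the bucket and keys are the placeholder values from the example, not real objects.

```python
import json


def build_manifest_lines(s3_prefix, labeled_images):
    """Return JSON Lines strings with 'source-ref' and 'class' keys."""
    lines = []
    for relative_key, class_id in labeled_images:
        record = {
            "source-ref": f"{s3_prefix.rstrip('/')}/{relative_key}",
            "class": int(class_id),
        }
        lines.append(json.dumps(record))
    return lines


manifest_lines = build_manifest_lines(
    "s3://bucket",  # placeholder bucket from the example above
    [("image.jpg", 0), ("image2.jpg", 1)],
)
print("\n".join(manifest_lines))
```

Each line is an independent JSON object, so the file can be written or appended one record at a time.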

### LST File Format Deep Dive

The LST (list) file format is tab-separated with three columns:

```
index \t label \t image_path
```

**Key Points:**
- `index`: Unique integer ID for each image (typically 0, 1, 2, ...)
- `label`: Class ID (0-indexed integer)
- `image_path`: Relative path to the image file

**For Multi-label Classification:**
```
# One tab-separated 0/1 column per class (multi-hot); 6 classes shown
0\t1\t0\t1\t0\t0\t1\ttrain/image001.jpg  # Image belongs to classes 0, 2, and 5
1\t0\t1\t0\t1\t0\t0\ttrain/image002.jpg  # Image belongs to classes 1 and 3
```

## Step 3: Synthetic Data - Limitations and Purpose

<div style="background-color: #d1ecf1; border: 1px solid #0c5460; border-radius: 5px; padding: 15px; margin: 10px 0;">

### ⚠️ Important: Why We Can't Truly Simulate Image Classification

Like Object Detection, Image Classification requires **real images** with actual visual features that neural networks can learn from.

**Why synthetic data doesn't work for training:**
1. **CNNs learn hierarchical visual features**: Edges → Textures → Parts → Objects
2. **Random noise or generated patterns** don't contain the visual structure found in real images
3. **Labels are meaningless** without corresponding visual content

**What we CAN demonstrate:**
- ✅ Data format structure (LST files, RecordIO concept)
- ✅ Data preparation pipeline
- ✅ Evaluation metric calculations (accuracy, confusion matrix, etc.)
- ✅ Output parsing and visualization
- ✅ Hyperparameter configuration

**For actual training, you need:**
- Real images organized by class
- Public datasets like CIFAR-10, ImageNet, or domain-specific datasets
- SageMaker Ground Truth for custom labeling

</div>

In [None]:
def generate_sample_lst_file(num_samples=100, num_classes=5, seed=42):
    """
    Generate a sample LST file content to demonstrate the format.
    
    NOTE: This is for FORMAT DEMONSTRATION ONLY.
    Real training requires actual images.
    
    Args:
        num_samples: Number of sample entries
        num_classes: Number of classes
        seed: Random seed for reproducibility
    
    Returns:
        lines: List of LST file lines
        class_names: List of class names
    """
    np.random.seed(seed)
    
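    class_names = ['airplane', 'automobile', 'bird', 'cat', 'dog']
    # Guard against labels that would index past the available class names
    if not 1 <= num_classes <= len(class_names):
        raise ValueError(f"num_classes must be between 1 and {len(class_names)}")
    class_names = class_names[:num_classes]
    lines = []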
    
    for i in range(num_samples):
        label = np.random.randint(0, num_classes)
        class_name = class_names[label]
        image_path = f"train/{class_name}/image_{i:04d}.jpg"
        lines.append(f"{i}\t{label}\t{image_path}")
    
    return lines, class_names

lst_lines, class_names = generate_sample_lst_file()

print("Sample LST file content:")
print("Format: index\tlabel\tpath")
print("=" * 60)
for line in lst_lines[:10]:
    print(line)
print("...")

print(f"\nClasses: {class_names}")
print(f"Total samples: {len(lst_lines)}")

In [None]:
# Analyze class distribution in our sample LST file
def analyze_lst_distribution(lst_lines, class_names):
    """Analyze the class distribution in LST file."""
    class_counts = defaultdict(int)
    
    for line in lst_lines:
        parts = line.split('\t')
        label = int(parts[1])
        class_counts[class_names[label]] += 1
    
    return dict(class_counts)

class_distribution = analyze_lst_distribution(lst_lines, class_names)

# Visualize distribution
fig, ax = plt.subplots(figsize=(10, 5))

colors = plt.cm.tab10(np.linspace(0, 1, len(class_names)))
bars = ax.bar(class_distribution.keys(), class_distribution.values(), color=colors)

ax.set_xlabel('Class')
ax.set_ylabel('Number of Samples')
ax.set_title('Class Distribution in Training Data')

# Add value labels on bars
for bar, count in zip(bars, class_distribution.values()):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
           str(count), ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Check for class imbalance
counts = list(class_distribution.values())
imbalance_ratio = max(counts) / min(counts)
print(f"\nClass imbalance ratio: {imbalance_ratio:.2f}x")
if imbalance_ratio > 2:
    print("⚠️  Warning: Significant class imbalance detected. Consider using weighted loss or oversampling.")
else:
    print("✓ Classes are reasonably balanced.")

---

## Step 4: Training Configuration and Hyperparameters

### Understanding Image Classification Hyperparameters

SageMaker's Image Classification algorithm has many hyperparameters. Understanding each one is crucial for successful training.

### Core Required Parameters

**num_classes** (Required)
- The number of distinct categories to classify into
- Must match the number of unique labels in your training data
- Example: CIFAR-10 has 10 classes → `num_classes=10`

**num_training_samples** (Required)
- Total number of training images
- Used for learning rate scheduling and progress tracking
- Must match your actual training dataset size
- Example: If you have 50,000 training images → `num_training_samples=50000`
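
Both required values can be derived from the training LST file rather than typed by hand. The sketch below assumes a single-label LST with three tab-separated columns; `summarize_lst` is a hypothetical helper name.

```python
def summarize_lst(lines):
    """Return (num_training_samples, num_classes) from single-label LST lines."""
    labels = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        index, label, path = line.rstrip("\n").split("\t")
        labels.append(int(float(label)))  # labels may be written as "1" or "1.0"
    # Class IDs are 0-indexed, so the class count is the highest ID plus one.
    return len(labels), max(labels) + 1


example_lines = [
    "0\t0\ttrain/cat/image001.jpg",
    "1\t0\ttrain/cat/image002.jpg",
    "2\t1\ttrain/dog/image001.jpg",
    "3\t1\ttrain/dog/image002.jpg",
]
num_training_samples, num_classes = summarize_lst(example_lines)
print(num_training_samples, num_classes)
```

Using the highest ID plus one assumes every class ID up to the maximum is intended to exist; a gap in the IDs usually signals a labeling mistake.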

### Network Architecture Parameters

**num_layers**
- Depth of the ResNet architecture
- Options: `18`, `34`, `50`, `101`, `152`, `200`
- Trade-off: Deeper networks can learn more complex patterns but are slower and need more data

| Layers | Parameters | Speed | Accuracy | Best For |
|--------|------------|-------|----------|----------|
| 18 | ~11M | Fastest | Lower | Quick experiments, simple tasks |
| 34 | ~21M | Fast | Good | Medium complexity tasks |
| 50 | ~25M | Medium | Very Good | **Recommended default** |
| 101 | ~44M | Slow | Excellent | Complex tasks, large datasets |
| 152 | ~60M | Very Slow | Best | Maximum accuracy needed |

- Default: `152` (but `50` is often the best balance)
- Recommendation: Start with `50`, increase only if needed

**use_pretrained_model**
- Whether to initialize with ImageNet pretrained weights
- `1`: Yes - **highly recommended**, especially for smaller datasets
- `0`: No - train from scratch (requires much more data: 100K+ images)
- Pretrained models have already learned visual features from 1.2M ImageNet images
- Transfer learning: Fine-tune these features for your specific task
- Default: `0` (train from scratch) - set `1` explicitly to enable transfer learning

**image_shape**
- Input image dimensions in format `"channels,height,width"`
- Common values: `"3,224,224"` (standard), `"3,299,299"` (Inception-style)
- All input images are resized to this shape
- Larger sizes: Better for fine details, but slower and more memory
- Default: `"3,224,224"`
- Important: Use same shape for training and inference

### Training Parameters

**epochs**
- Number of complete passes through the training data
- More epochs = more learning opportunities, but risk of overfitting
- With pretrained models, often converges in 10-30 epochs
- From scratch: May need 100+ epochs
- Monitor validation accuracy to detect overfitting
- Default: `30`

**mini_batch_size**
- Number of images processed before updating weights
- Larger batches: More stable gradients, better GPU utilization
- Smaller batches: More frequent updates, may generalize better
- Limited by GPU memory (reduce if you get OOM errors)
- Typical range: 16-128 depending on image size and GPU
- Default: `32`
- Rule of thumb: If using `image_shape="3,224,224"`, batch 32-64 works on most GPUs

**learning_rate**
- How much to adjust weights on each update
- Too high: Training oscillates or diverges
- Too low: Training is very slow
- **Critical for transfer learning**: Use LOWER learning rate (0.001-0.01) to preserve pretrained features
- From scratch: Can use higher rate (0.1)
- Default: `0.1` (reduce for fine-tuning!)

**lr_scheduler_step**
- Epochs at which to reduce learning rate
- Format: comma-separated epoch numbers (e.g., `"10,20"`)
- Reducing learning rate helps fine-tune as training progresses
- Common pattern: Reduce at 1/3 and 2/3 of total epochs
- Example: For 30 epochs → `"10,20"`

**lr_scheduler_factor**
- Factor to multiply learning rate by at each step
- Value of `0.1` means learning rate becomes 10% of previous value
- Default: `0.1`
- Example: If LR=0.01 and factor=0.1, after step LR=0.001
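
The interaction between `learning_rate`, `lr_scheduler_step`, and `lr_scheduler_factor` can be sketched as a step schedule. This is an illustration of the arithmetic, not the algorithm's internal code: `learning_rate_at_epoch` is a hypothetical helper, and it assumes 0-indexed epochs with the drop taking effect at the start of each listed epoch.

```python
def learning_rate_at_epoch(epoch, base_lr, lr_scheduler_step, lr_scheduler_factor):
    """Return the learning rate in effect at a 0-indexed epoch."""
    steps = [int(s) for s in lr_scheduler_step.split(",") if s.strip()]
    # Assumption: the rate drops at the start of each listed epoch.
    drops = sum(1 for step in steps if epoch >= step)
    return base_lr * (lr_scheduler_factor ** drops)


for epoch in (0, 9, 10, 19, 20, 29):
    lr = learning_rate_at_epoch(epoch, 0.01, "10,20", 0.1)
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

With a base rate of 0.01 and steps at 10 and 20, the rate passes through 0.01, 0.001, and 0.0001 over a 30-epoch run.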

### Optimizer Parameters

**optimizer**
- Algorithm for updating weights based on gradients
- Options: `sgd`, `adam`, `rmsprop`, `nag` (Nesterov Accelerated Gradient)
- `sgd`: Simple, effective with proper momentum and learning rate schedule
- `adam`: Adaptive learning rates, often converges faster, good default
- `nag`: SGD with look-ahead momentum, can escape local minima better
- Default: `sgd`
- Recommendation: Use `sgd` with momentum for best results on image classification

**momentum**
- Used with SGD/NAG optimizer
- Helps accelerate training by maintaining velocity in consistent directions
- Typical value: `0.9`
- Higher momentum (0.9-0.99): Faster convergence, may overshoot
- Default: `0.9`

**weight_decay**
- L2 regularization to prevent overfitting
- Adds penalty for large weights
- Helps model generalize by keeping weights small
- Typical range: `0.0001` to `0.001`
- Default: `0.0001`
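
To make the roles of `momentum` and `weight_decay` concrete, here is one common textbook formulation of an SGD-with-momentum update in NumPy. It is a sketch for intuition only; the exact update used inside the SageMaker container may differ, and `sgd_momentum_step` is a hypothetical name.

```python
import numpy as np


def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9, weight_decay=0.0001):
    """Apply one SGD-with-momentum update and return (weights, velocity)."""
    # L2 regularization adds weight_decay * weights to the gradient.
    grads = grads + weight_decay * weights
    # Velocity keeps a decaying running sum of past update directions.
    velocity = momentum * velocity - lr * grads
    return weights + velocity, velocity


w = np.array([1.0, -2.0])
v = np.zeros_like(w)
g = np.array([0.5, 0.5])  # made-up gradient for illustration
w, v = sgd_momentum_step(w, g, v)
print(w, v)
```

Calling the function repeatedly with gradients that point the same way makes the velocity grow, which is the acceleration effect described above.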

### Data Augmentation Parameters

**augmentation_type**
- Type of data augmentation to apply during training
- Options: `crop`, `crop_color`, `crop_color_transform`

| Type | Augmentations Applied | Speed | When to Use |
|------|----------------------|-------|-------------|
| `crop` | Random crop, horizontal flip | Fast | Large datasets, speed priority |
| `crop_color` | + Brightness, saturation, hue jitter | Medium | Moderate datasets |
| `crop_color_transform` | + Rotation, shear, aspect ratio | Slower | **Small datasets, prevent overfitting** |

- Default: none (no augmentation is applied unless you set this parameter)
- Recommendation: Use `crop_color_transform` unless you have 100K+ images

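**resize**
- Length in pixels of the shortest image side after resizing, applied before cropping
- Must be larger than both the height and width in `image_shape` to allow random cropping
- Default: none (images are used as-is); required for image-file input, optional for RecordIO
- Common choice: `256` for 224x224 images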
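
The size constraint between `resize` and `image_shape` is easy to check before launching a job. `check_resize` below is a hypothetical helper, shown only as a sketch of that check.

```python
def check_resize(image_shape, resize):
    """Return True if `resize` leaves room to crop to `image_shape`."""
    channels, height, width = (int(v) for v in image_shape.split(","))
    return resize > max(height, width)


print(check_resize("3,224,224", 256))  # room to crop
print(check_resize("3,299,299", 256))  # resize is too small for this shape
```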

### Classification-Specific Parameters

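**top_k**
- Report top-k accuracy metrics during training
- Top-1: Correct if best prediction matches label
- Top-5: Correct if label is in top 5 predictions
- Default: none; must be greater than 1, because top-1 accuracy is already reported as the regular training accuracy
- With 5 or fewer classes, choose a value smaller than `num_classes`; top-k accuracy is trivially 100% when k equals the number of classes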

**multi_label**
- Enable multi-label classification mode
- `0`: Single-label (each image has exactly one class)
- `1`: Multi-label (each image can have multiple classes)
- Default: `0`
- Changes loss function from softmax to sigmoid
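
The difference between the two modes shows up in how raw scores become probabilities. The NumPy sketch below uses made-up scores and generic `softmax` and `sigmoid` definitions; it illustrates the concept rather than the container's implementation.

```python
import numpy as np


def softmax(logits):
    """Single-label: scores compete, so the outputs sum to 1."""
    shifted = logits - np.max(logits)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def sigmoid(logits):
    """Multi-label: each class gets an independent probability."""
    return 1.0 / (1.0 + np.exp(-logits))


logits = np.array([2.0, 1.0, -1.0])  # made-up raw scores
print("softmax:", softmax(logits), "sum =", softmax(logits).sum())
print("sigmoid:", sigmoid(logits), "sum =", sigmoid(logits).sum())
```

Because sigmoid outputs are independent, several classes can exceed a decision threshold for the same image, which is what multi-label prediction requires.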

**precision_dtype**
- Numerical precision for training
- `float32`: Standard precision, most compatible
- `float16`: Mixed precision, faster on V100/A100 GPUs, may reduce accuracy slightly
- Default: `float32`

In [None]:
# Get Image Classification container image
image_classification_image = retrieve(
    framework='image-classification',
    region=region,
    version='1'
)

print(f"Image Classification Image URI: {image_classification_image}")

In [None]:
# Complete hyperparameter configuration with explanations
hyperparameters = {
    # === REQUIRED PARAMETERS ===
    "num_classes": 5,                        # Number of classification categories
    "num_training_samples": 10000,           # Total training images
    
    # === NETWORK ARCHITECTURE ===
    "num_layers": 50,                        # ResNet-50 (good balance of speed/accuracy)
    "use_pretrained_model": 1,               # Transfer learning from ImageNet
    "image_shape": "3,224,224",              # Input image shape (channels, height, width)
    
    # === TRAINING PARAMETERS ===
    "epochs": 30,                            # Training epochs
    "mini_batch_size": 32,                   # Batch size
    "learning_rate": 0.001,                  # Lower for fine-tuning pretrained model
    "lr_scheduler_step": "10,20",            # Reduce LR at epochs 10 and 20
    "lr_scheduler_factor": 0.1,              # LR multiplier at each step
    
    # === OPTIMIZER ===
    "optimizer": "sgd",                      # SGD with momentum
    "momentum": 0.9,                         # Momentum value
    "weight_decay": 0.0001,                  # L2 regularization
    
    # === DATA AUGMENTATION ===
    "augmentation_type": "crop_color_transform",  # Full augmentation
    "resize": 256,                           # Resize before cropping
    
    # === CLASSIFICATION SPECIFIC ===
    "top_k": 2,                              # Report top-2 accuracy (top-5 is trivially 100% with 5 classes)
    "multi_label": 0,                        # Single-label classification
    "precision_dtype": "float32",            # Training precision
}

print("Image Classification Hyperparameters:")
print("=" * 55)
for key, value in hyperparameters.items():
    print(f"  {key}: {value}")

In [None]:
# Example Estimator Configuration
# NOTE: Do NOT run training without actual image data!

print("""
═══════════════════════════════════════════════════════════════════════════════
                    EXAMPLE ESTIMATOR CONFIGURATION
═══════════════════════════════════════════════════════════════════════════════

⚠️  WARNING: Running this training job will incur GPU costs!
    Estimated cost: $2-10 depending on epochs and instance type.

# Standard training (On-Demand)
image_classification_estimator = Estimator(
    image_uri=image_classification_image,
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',  # GPU required!
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/output',
    sagemaker_session=sagemaker_session,
    base_job_name='image-classification',
    max_run=3600 * 4,  # 4 hour max runtime
)

# Cost-saving alternative with Spot Instances (up to 90% savings)
image_classification_estimator_spot = Estimator(
    image_uri=image_classification_image,
    role=role,
    instance_count=1,
    instance_type='ml.g4dn.xlarge',  # Most cost-effective GPU
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/output',
    sagemaker_session=sagemaker_session,
    base_job_name='image-classification-spot',
    use_spot_instances=True,         # Enable Spot pricing
    max_wait=3600 * 5,               # Max total time (Spot wait + training); must be >= max_run
    max_run=3600 * 4,                # Max training time
)

# Set hyperparameters
image_classification_estimator.set_hyperparameters(**hyperparameters)

# Data channels configuration (using LST file format)
# train: s3://bucket/prefix/train/  (images)
# validation: s3://bucket/prefix/validation/  (images)
# train_lst: s3://bucket/prefix/train.lst  (list file)
# validation_lst: s3://bucket/prefix/validation.lst  (list file)

""")

---

## Step 5: Understanding Model Output

The model outputs a **probability distribution** over all classes. Each output value represents the model's confidence that the image belongs to that class.

**Output Format:**
- Array of probabilities, one per class
- All values sum to 1.0 (for single-label classification)
- Index corresponds to class ID

**Example:** For 5 classes
```
index:  0     1     2     3     4
probs: [0.05, 0.02, 0.85, 0.05, 0.03]
                    ↑
        class 2 (highest = prediction)
```
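
Assuming the endpoint is asked for a JSON response, the body can be decoded into a probability array as sketched below. The `response_body` bytes are a made-up stand-in for what an endpoint would return; no real endpoint is called here.

```python
import json
import numpy as np

# Stand-in for the bytes an endpoint would return; values are made up.
response_body = b"[0.05, 0.02, 0.85, 0.05, 0.03]"

probabilities = np.array(json.loads(response_body))
predicted_class = int(np.argmax(probabilities))
print(predicted_class, probabilities[predicted_class])
```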

In [None]:
def parse_classification_output(probabilities, class_names, top_k=5):
    """
    Parse classification output and return top-k predictions.
    
    Args:
        probabilities: Array of class probabilities (sums to 1.0)
        class_names: List of class names
        top_k: Number of top predictions to return
    
    Returns:
        List of prediction dictionaries sorted by probability
    """
    # Sort by probability descending
    top_indices = np.argsort(probabilities)[::-1][:top_k]
    
    results = []
    for idx in top_indices:
        results.append({
            'class_name': class_names[idx],
            'class_id': int(idx),
            'probability': float(probabilities[idx]),
            'percentage': f"{probabilities[idx] * 100:.2f}%"
        })
    
    return results

# Simulate model output - a typical confident prediction
np.random.seed(42)

# Create a realistic output where model is confident about one class
sample_probs = np.array([0.02, 0.05, 0.85, 0.05, 0.03])  # Confident about class 2 (bird)

predictions = parse_classification_output(sample_probs, class_names)

print("Sample Classification Output:")
print("=" * 50)
print(f"\nRaw probabilities: {sample_probs}")
print(f"Sum of probabilities: {sum(sample_probs):.4f} (should be 1.0)")
print(f"\nTop-{len(predictions)} predictions:")
for i, pred in enumerate(predictions, 1):
    marker = "← Predicted class" if i == 1 else ""
    print(f"  {i}. {pred['class_name']:12s}: {pred['percentage']:>8s} {marker}")

In [None]:
def visualize_predictions(probabilities, class_names, true_label=None, title="Classification Predictions"):
    """
    Visualize classification probabilities as a horizontal bar chart.
    
    Args:
        probabilities: Array of class probabilities
        class_names: List of class names
        true_label: Optional true class index for comparison
        title: Plot title
    """
    fig, ax = plt.subplots(figsize=(10, 5))
    
    # Color bars based on probability magnitude
    colors = plt.cm.RdYlGn(probabilities)  # Red to Green colormap
    
    # Highlight true label if provided
    if true_label is not None:
        edge_colors = ['green' if i == true_label else 'none' for i in range(len(class_names))]
        linewidths = [3 if i == true_label else 0 for i in range(len(class_names))]
    else:
        edge_colors = ['none'] * len(class_names)
        linewidths = [0] * len(class_names)
    
    bars = ax.barh(class_names, probabilities, color=colors, 
                   edgecolor=edge_colors, linewidth=linewidths)
    
    ax.set_xlabel('Probability')
    ax.set_title(title)
    ax.set_xlim(0, 1)
    
    # Add value labels
    for bar, prob in zip(bars, probabilities):
        ax.text(bar.get_width() + 0.02, bar.get_y() + bar.get_height()/2,
               f'{prob:.4f}', va='center', fontsize=10)
    
    # Add predicted class indicator
    predicted_class = np.argmax(probabilities)
    ax.annotate('Predicted', 
                xy=(probabilities[predicted_class], predicted_class),
                xytext=(probabilities[predicted_class] + 0.15, predicted_class),
                fontsize=10, color='blue',
                arrowprops=dict(arrowstyle='->', color='blue'))
    
    if true_label is not None:
        ax.annotate('True Label', 
                    xy=(probabilities[true_label], true_label),
                    xytext=(0.7, true_label + 0.5),
                    fontsize=10, color='green',
                    arrowprops=dict(arrowstyle='->', color='green'))
    
    plt.tight_layout()
    plt.show()

# Visualize with true label
visualize_predictions(sample_probs, class_names, true_label=2, 
                     title="Sample Prediction: Correctly Classified 'bird'")

In [None]:
# Example: Uncertain prediction (low confidence)
uncertain_probs = np.array([0.25, 0.22, 0.20, 0.18, 0.15])

print("Example: Uncertain Prediction")
print("When probabilities are similar, the model is uncertain.")
print(f"Max probability: {max(uncertain_probs):.2%}")
print(f"Entropy: {-sum(p * np.log(p) for p in uncertain_probs if p > 0):.4f}")

visualize_predictions(uncertain_probs, class_names, 
                     title="Uncertain Prediction (Low Confidence)")

---

## Step 6: Evaluation Metrics Deep Dive

Image classification uses several metrics depending on whether it's single-label or multi-label.

### Single-Label Metrics

| Metric | Description | Formula |
|--------|-------------|--------|
| **Top-1 Accuracy** | Correct if top prediction = true label | Correct / Total |
| **Top-5 Accuracy** | Correct if true label in top 5 predictions | Correct / Total |
| **Cross-Entropy Loss** | Measures prediction confidence | -log(P(true class)) |

### Multi-Label Metrics

| Metric | Description |
|--------|-------------|
| **Precision** | Of predicted labels, how many are correct? |
| **Recall** | Of true labels, how many were predicted? |
| **F1 Score** | Harmonic mean of precision and recall |
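
For multi-label outputs, these three metrics are typically computed after thresholding each class probability. The sketch below uses micro-averaging and a 0.5 threshold, both of which are assumptions made for illustration; `multilabel_prf` is a hypothetical helper and the scores are made up.

```python
import numpy as np


def multilabel_prf(probabilities, true_labels, threshold=0.5):
    """Return micro-averaged (precision, recall, f1) for multi-hot labels."""
    predicted = (np.asarray(probabilities) >= threshold).astype(int)
    truth = np.asarray(true_labels)
    tp = int(np.sum((predicted == 1) & (truth == 1)))
    fp = int(np.sum((predicted == 1) & (truth == 0)))
    fn = int(np.sum((predicted == 0) & (truth == 1)))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1


# Two images, three classes; scores and labels are made up.
probs = [[0.9, 0.2, 0.7], [0.1, 0.8, 0.6]]
truth = [[1, 0, 1], [0, 1, 0]]
precision, recall, f1 = multilabel_prf(probs, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Lowering the threshold generally raises recall at the expense of precision, so it is worth tuning on validation data.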

In [None]:
def calculate_topk_accuracy(predictions_list, true_labels, k=5):
    """
    Calculate top-k accuracy for classification.
    
    Args:
        predictions_list: List of probability arrays (one per image)
        true_labels: List of true class indices
        k: Top-k parameter
    
    Returns:
        Accuracy value between 0 and 1
    """
    correct = 0
    for probs, true_label in zip(predictions_list, true_labels):
        top_k_preds = np.argsort(probs)[::-1][:k]
        if true_label in top_k_preds:
            correct += 1
    
    return correct / len(predictions_list)


def calculate_cross_entropy_loss(predictions_list, true_labels):
    """
    Calculate average cross-entropy loss.
    
    Cross-entropy measures how well the probability distribution
    matches the true labels. Lower is better.
    """
    epsilon = 1e-15  # Prevent log(0)
    total_loss = 0
    
    for probs, true_label in zip(predictions_list, true_labels):
        # Clip probability to prevent log(0)
        prob = np.clip(probs[true_label], epsilon, 1 - epsilon)
        total_loss += -np.log(prob)
    
    return total_loss / len(predictions_list)


# Simulate a realistic evaluation scenario
np.random.seed(42)
num_test_samples = 500

# Generate test labels
test_labels = np.random.randint(0, len(class_names), num_test_samples)

# Generate simulated predictions (model is usually right but not perfect)
test_predictions = []
for true_label in test_labels:
    # ~80% of the time, boost the true class so it is usually (not always) the top prediction
    if np.random.random() < 0.80:
        probs = np.random.dirichlet(np.ones(len(class_names)) * 0.5)
        probs[true_label] += 0.5  # Boost true class
        probs = probs / probs.sum()  # Renormalize
    else:
        # Otherwise draw unrelated probabilities; the true class wins only by chance
        probs = np.random.dirichlet(np.ones(len(class_names)))
    test_predictions.append(probs)

# Calculate metrics
top1_acc = calculate_topk_accuracy(test_predictions, test_labels, k=1)
top2_acc = calculate_topk_accuracy(test_predictions, test_labels, k=2)
top3_acc = calculate_topk_accuracy(test_predictions, test_labels, k=3)
top5_acc = calculate_topk_accuracy(test_predictions, test_labels, k=5)
ce_loss = calculate_cross_entropy_loss(test_predictions, test_labels)

print("Evaluation Metrics (Simulated 500 test samples):")
print("=" * 50)
print(f"  Top-1 Accuracy: {top1_acc:.4f} ({top1_acc*100:.1f}%)")
print(f"  Top-2 Accuracy: {top2_acc:.4f} ({top2_acc*100:.1f}%)")
print(f"  Top-3 Accuracy: {top3_acc:.4f} ({top3_acc*100:.1f}%)")
print(f"  Top-5 Accuracy: {top5_acc:.4f} ({top5_acc*100:.1f}%)")
print(f"  Cross-Entropy Loss: {ce_loss:.4f}")
print("\nNote: With 5 classes, random guessing gives ~20% top-1 accuracy, and top-5 accuracy is always 100%.")

### Confusion Matrix

A confusion matrix shows how predictions match (or don't match) true labels. It's essential for understanding:
- Which classes the model confuses
- Per-class accuracy
- Systematic biases

In [None]:
def compute_confusion_matrix(predictions_list, true_labels, num_classes):
    """
    Compute confusion matrix for classification.
    
    Args:
        predictions_list: List of probability arrays
        true_labels: List of true class indices
        num_classes: Number of classes
    
    Returns:
        Confusion matrix (num_classes x num_classes)
        Row = true label, Column = predicted label
    """
    cm = np.zeros((num_classes, num_classes), dtype=int)
    
    for probs, true_label in zip(predictions_list, true_labels):
        predicted_label = np.argmax(probs)
        cm[true_label, predicted_label] += 1
    
    return cm


def plot_confusion_matrix(cm, class_names, normalize=False, title="Confusion Matrix"):
    """
    Plot confusion matrix as a heatmap.
    
    Args:
        cm: Confusion matrix array
        class_names: List of class names
        normalize: Whether to show percentages
        title: Plot title
    """
    if normalize:
        cm_display = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        fmt = '.2f'
    else:
        cm_display = cm
        fmt = 'd'
    
    fig, ax = plt.subplots(figsize=(10, 8))
    
    # Create heatmap
    im = ax.imshow(cm_display, interpolation='nearest', cmap='Blues')
    ax.figure.colorbar(im, ax=ax)
    
    # Labels
    ax.set(xticks=np.arange(len(class_names)),
           yticks=np.arange(len(class_names)),
           xticklabels=class_names,
           yticklabels=class_names,
           xlabel='Predicted Label',
           ylabel='True Label',
           title=title)
    
    # Rotate x labels
    plt.setp(ax.get_xticklabels(), rotation=45, ha='right', rotation_mode='anchor')
    
    # Add text annotations
    thresh = cm_display.max() / 2
    for i in range(len(class_names)):
        for j in range(len(class_names)):
            value = cm_display[i, j]
            if normalize:
                text = f'{value:.2f}'
            else:
                text = f'{value}'
            ax.text(j, i, text,
                   ha='center', va='center',
                   color='white' if cm_display[i, j] > thresh else 'black',
                   fontsize=12)
    
    plt.tight_layout()
    plt.show()

# Compute and plot confusion matrix
cm = compute_confusion_matrix(test_predictions, test_labels, len(class_names))

print("Confusion Matrix (counts):")
plot_confusion_matrix(cm, class_names, normalize=False, title="Confusion Matrix (Counts)")

In [None]:
# Normalized confusion matrix (percentages)
print("Normalized Confusion Matrix (per-class recall):")
plot_confusion_matrix(cm, class_names, normalize=True, 
                     title="Confusion Matrix (Normalized by True Label)")

In [None]:
def compute_classification_report(cm, class_names):
    """
    Compute precision, recall, and F1 for each class.
    
    Returns:
        Dictionary with per-class and average metrics
    """
    num_classes = len(class_names)
    report = {}
    
    # Per-class metrics
    precisions = []
    recalls = []
    f1s = []
    supports = []
    
    for i in range(num_classes):
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp  # Column sum minus diagonal
        fn = cm[i, :].sum() - tp  # Row sum minus diagonal
        support = cm[i, :].sum()  # Row sum = number of true samples of this class
        
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
        
        report[class_names[i]] = {
            'precision': precision,
            'recall': recall,
            'f1-score': f1,
            'support': support
        }
        
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        supports.append(support)
    
    # Macro average (unweighted mean)
    report['macro avg'] = {
        'precision': np.mean(precisions),
        'recall': np.mean(recalls),
        'f1-score': np.mean(f1s),
        'support': sum(supports)
    }
    
    # Weighted average
    total = sum(supports)
    report['weighted avg'] = {
        'precision': sum(p * s for p, s in zip(precisions, supports)) / total,
        'recall': sum(r * s for r, s in zip(recalls, supports)) / total,
        'f1-score': sum(f * s for f, s in zip(f1s, supports)) / total,
        'support': total
    }
    
    return report

# Compute and display classification report
report = compute_classification_report(cm, class_names)

print("Classification Report:")
print("=" * 65)
print(f"{'Class':>15s} {'Precision':>12s} {'Recall':>12s} {'F1-Score':>12s} {'Support':>10s}")
print("-" * 65)

for class_name in class_names:
    metrics = report[class_name]
    print(f"{class_name:>15s} {metrics['precision']:>12.4f} {metrics['recall']:>12.4f} "
          f"{metrics['f1-score']:>12.4f} {metrics['support']:>10d}")

print("-" * 65)
for avg_type in ['macro avg', 'weighted avg']:
    metrics = report[avg_type]
    print(f"{avg_type:>15s} {metrics['precision']:>12.4f} {metrics['recall']:>12.4f} "
          f"{metrics['f1-score']:>12.4f} {metrics['support']:>10d}")

In [None]:
# Visualize per-class metrics
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

metrics_names = ['precision', 'recall', 'f1-score']
colors = plt.cm.tab10(np.linspace(0, 1, len(class_names)))

for ax, metric_name in zip(axes, metrics_names):
    values = [report[cn][metric_name] for cn in class_names]
    bars = ax.bar(class_names, values, color=colors)
    
    ax.set_ylabel(metric_name.capitalize())
    ax.set_title(f'Per-Class {metric_name.capitalize()}')
    ax.set_ylim(0, 1.1)  # Headroom so value labels above tall bars stay inside the axes
    ax.axhline(y=report['macro avg'][metric_name], color='red', linestyle='--', 
               label=f'Macro Avg: {report["macro avg"][metric_name]:.3f}')
    ax.legend()
    
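    # Rotate x labels (setp keeps the existing tick locations and avoids
    # the FixedFormatter warning that set_xticklabels can raise)
    plt.setp(ax.get_xticklabels(), rotation=45, ha='right', rotation_mode='anchor')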
    
    # Add value labels
    for bar, val in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
               f'{val:.3f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

---

## Step 7: Multi-Label Classification

In multi-label classification, each image can belong to **multiple classes simultaneously**.

**Example Use Cases:**
- Image tagging (sunset, beach, people)
- Medical imaging (multiple conditions)
- Content moderation (violence, nudity, hate speech)

**Key Differences from Single-Label:**

| Aspect | Single-Label | Multi-Label |
|--------|-------------|-------------|
| Output activation | Softmax (sums to 1) | Sigmoid (independent) |
| Loss function | Cross-entropy | Binary cross-entropy |
| Decision | argmax | Threshold (e.g., > 0.5) |
| Metrics | Accuracy | Precision, Recall, F1 |
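
The activation and decision rows of the table can be illustrated with a small NumPy sketch. The logits below are made-up values for five classes, and the helper names `softmax` and `sigmoid` are defined here for illustration only; they are not part of the SageMaker API.

```python
import numpy as np

def softmax(logits):
    """Single-label activation: classes compete and the outputs sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def sigmoid(logits):
    """Multi-label activation: each class is scored independently."""
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical raw scores (logits) for five classes
logits = np.array([2.0, -1.5, 1.8, -2.5, -2.0])

single_label_probs = softmax(logits)
multi_label_probs = sigmoid(logits)

# Single-label decision: pick the one most likely class
single_label_decision = int(np.argmax(single_label_probs))
# Multi-label decision: keep every class whose probability clears the threshold
multi_label_decision = np.flatnonzero(multi_label_probs > 0.5).tolist()

print(f"Softmax sum: {single_label_probs.sum():.2f}, argmax class: {single_label_decision}")
print(f"Sigmoid sum: {multi_label_probs.sum():.2f}, classes above 0.5: {multi_label_decision}")
```

With the same logits, softmax forces a single winner while sigmoid lets two classes pass the threshold at once, which is the behavior multi-label tasks need.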

In [None]:
def parse_multilabel_output(probabilities, class_names, threshold=0.5):
    """
    Parse multi-label classification output.
    
    In multi-label, we use a threshold to decide which classes are present.
    
    Args:
        probabilities: Array of independent class probabilities (0-1 each)
        class_names: List of class names
        threshold: Probability threshold for positive prediction
    
    Returns:
        List of predicted classes and all probabilities
    """
    predicted_classes = []
    all_predictions = []
    
    for i, (prob, name) in enumerate(zip(probabilities, class_names)):
        all_predictions.append({
            'class_name': name,
            'class_id': i,
            'probability': prob,
            'predicted': prob >= threshold
        })
        
        if prob >= threshold:
            predicted_classes.append(name)
    
    return predicted_classes, all_predictions


# Example: Multi-label scenario (image has both 'bird' and 'airplane')
multilabel_probs = np.array([0.82, 0.15, 0.78, 0.08, 0.12])  # airplane and bird both high

predicted, all_preds = parse_multilabel_output(multilabel_probs, class_names, threshold=0.5)

print("Multi-Label Classification Example:")
print("=" * 50)
print(f"\nProbabilities: {multilabel_probs}")
print("Note: Probabilities don't need to sum to 1")
print(f"Sum: {multilabel_probs.sum():.2f}")
print(f"\nUsing threshold = 0.5:")
print(f"Predicted classes: {predicted}")
print(f"\nAll predictions:")
for pred in all_preds:
    status = "✓ POSITIVE" if pred['predicted'] else "✗ negative"
    print(f"  {pred['class_name']:12s}: {pred['probability']:.4f} {status}")

In [None]:
def calculate_multilabel_metrics(predictions_list, true_labels_list, threshold=0.5):
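    """
    Calculate micro-averaged precision, recall, and F1 for multi-label classification.
    
    True positives, false positives, and false negatives are pooled across
    all samples and classes before the ratios are computed.
    
    Args:
        predictions_list: List of NumPy probability arrays (one per sample)
        true_labels_list: List of binary NumPy label arrays (1 if class present)
        threshold: Probability threshold for positive prediction
    
    Returns:
        Dictionary with 'precision', 'recall', and 'f1'
    """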
    total_tp = 0
    total_fp = 0
    total_fn = 0
    
    for probs, true_labels in zip(predictions_list, true_labels_list):
        predicted = (probs >= threshold).astype(int)
        
        tp = np.sum((predicted == 1) & (true_labels == 1))
        fp = np.sum((predicted == 1) & (true_labels == 0))
        fn = np.sum((predicted == 0) & (true_labels == 1))
        
        total_tp += tp
        total_fp += fp
        total_fn += fn
    
    precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0
    recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    
    return {'precision': precision, 'recall': recall, 'f1': f1}

# Simulate multi-label evaluation
np.random.seed(42)
num_samples = 200

# Generate random multi-label data
ml_true_labels = []
ml_predictions = []

for _ in range(num_samples):
    # Random true labels (each class has 30% chance of being present)
    true = (np.random.random(len(class_names)) < 0.3).astype(int)
    
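    # Simulate predictions (correlated with true labels + noise).
    # Positives score in [0.4, 1.0) and negatives in [0.0, 0.6), so the two
    # groups overlap around the 0.5 threshold and the metrics are not trivially 1.0.
    pred = true * 0.4 + np.random.random(len(class_names)) * 0.6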
    
    ml_true_labels.append(true)
    ml_predictions.append(pred)

ml_metrics = calculate_multilabel_metrics(ml_predictions, ml_true_labels, threshold=0.5)

print("Multi-Label Evaluation Metrics:")
print("=" * 40)
print(f"  Precision: {ml_metrics['precision']:.4f}")
print(f"  Recall:    {ml_metrics['recall']:.4f}")
print(f"  F1 Score:  {ml_metrics['f1']:.4f}")

---

## Step 8: CloudWatch Training Metrics

During training, SageMaker Image Classification emits these metrics to CloudWatch:

| Metric | Description | Desired Trend |
|--------|-------------|-------------|
| `train:accuracy` | Training set accuracy | Higher is better |
| `validation:accuracy` | Validation set accuracy | Higher is better |
| `train:cross_entropy` | Training loss | Lower is better |
| `validation:cross_entropy` | Validation loss | Lower is better |
| `train:top_k_accuracy_5` | Training top-5 accuracy | Higher is better |
| `validation:top_k_accuracy_5` | Validation top-5 accuracy | Higher is better |
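
Once a job has finished, the last reported value of each metric can also be read from the `DescribeTrainingJob` response, which carries a `FinalMetricDataList` field. The sketch below assumes AWS credentials are configured and that a training job name is available; the helper names and the sample values are hypothetical and only illustrate the shape of the response.

```python
def final_metrics_from_description(description):
    """Map metric name -> final value from a DescribeTrainingJob response dict."""
    return {
        m["MetricName"]: m["Value"]
        for m in description.get("FinalMetricDataList", [])
    }

def fetch_final_metrics(training_job_name):
    """Call the SageMaker API; needs AWS credentials and a real job name."""
    import boto3  # imported lazily so the parsing helper also works offline
    client = boto3.client("sagemaker")
    description = client.describe_training_job(TrainingJobName=training_job_name)
    return final_metrics_from_description(description)

# Offline illustration with a hand-written response fragment (values are made up)
sample_description = {
    "FinalMetricDataList": [
        {"MetricName": "train:accuracy", "Value": 0.95},
        {"MetricName": "validation:accuracy", "Value": 0.90},
    ]
}

metrics = final_metrics_from_description(sample_description)
gap = metrics["train:accuracy"] - metrics["validation:accuracy"]
print(f"Final metrics: {metrics}")
print(f"Train/validation accuracy gap: {gap:.2f}")
```

Separating the parsing step from the API call keeps the example runnable without an AWS account; `fetch_final_metrics` is only invoked when a real job exists.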

### What to Watch For

**Healthy Training:**
- Both training and validation accuracy increasing
- Loss decreasing over epochs
- Small gap between training and validation metrics

**Overfitting Signs:**
- Training accuracy keeps improving but validation accuracy plateaus/decreases
- Large gap between training and validation metrics
- Validation loss starts increasing

**Underfitting Signs:**
- Both training and validation metrics are poor
- Metrics improve very slowly

**Learning Rate Issues:**
- Loss oscillates wildly → Learning rate too high
- Metrics change very slowly → Learning rate too low
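
The overfitting and underfitting signs above can be turned into simple automated checks on per-epoch metric lists. This is a rough sketch: the function name `diagnose_training` and all thresholds (`gap_threshold`, `poor_accuracy`, `window`) are illustrative assumptions, not SageMaker defaults, and should be tuned for the dataset at hand.

```python
import numpy as np

def diagnose_training(train_acc, val_acc, val_loss,
                      gap_threshold=0.10, poor_accuracy=0.60, window=3):
    """Return a list of warnings derived from simple heuristics.

    All thresholds are illustrative assumptions.
    """
    train_acc, val_acc, val_loss = (np.asarray(x, dtype=float)
                                    for x in (train_acc, val_acc, val_loss))
    findings = []

    # Overfitting sign: large gap between final training and validation accuracy
    if train_acc[-1] - val_acc[-1] > gap_threshold:
        findings.append("overfitting: large train/validation accuracy gap")

    # Overfitting sign: validation loss rose in each of the last `window` epochs
    recent = val_loss[-(window + 1):]
    if len(recent) == window + 1 and np.all(np.diff(recent) > 0):
        findings.append("overfitting: validation loss is rising")

    # Underfitting sign: both final accuracies are still poor
    if train_acc[-1] < poor_accuracy and val_acc[-1] < poor_accuracy:
        findings.append("underfitting: both accuracies are low")

    return findings

# Made-up curves: one healthy run and one that overfits
healthy = diagnose_training([0.6, 0.7, 0.8, 0.9], [0.58, 0.68, 0.77, 0.86],
                            [1.0, 0.8, 0.6, 0.5])
overfit = diagnose_training([0.6, 0.7, 0.8, 0.9, 0.97], [0.58, 0.65, 0.70, 0.70, 0.69],
                            [1.0, 0.7, 0.75, 0.8, 0.9])
print("Healthy run:", healthy or "no warnings")
print("Overfit run:", overfit)
```

Checks like these are only a first filter; plotting the curves remains the most reliable way to judge a run.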

In [None]:
# Simulate training metrics over epochs
np.random.seed(42)
epochs = 30

# Simulate healthy training curves with transfer learning
# Transfer learning typically starts with higher accuracy and converges faster

# Accuracy curves
base_train_acc = 0.5
train_accuracy = [min(0.98, base_train_acc + 0.015 * e + np.random.normal(0, 0.01)) for e in range(epochs)]
val_accuracy = [min(0.92, base_train_acc - 0.03 + 0.014 * e + np.random.normal(0, 0.015)) for e in range(epochs)]

# Loss curves
base_loss = 2.0
train_loss = [max(0.05, base_loss * np.exp(-0.12 * e) + np.random.normal(0, 0.03)) for e in range(epochs)]
val_loss = [max(0.15, base_loss * np.exp(-0.10 * e) + 0.1 + np.random.normal(0, 0.04)) for e in range(epochs)]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy plot
axes[0].plot(range(1, epochs + 1), train_accuracy, 'b-', label='Training Accuracy', linewidth=2)
axes[0].plot(range(1, epochs + 1), val_accuracy, 'r--', label='Validation Accuracy', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Training Progress: Accuracy')
axes[0].set_ylim(0, 1)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Mark LR reduction points
for lr_step in [10, 20]:
    axes[0].axvline(x=lr_step, color='green', linestyle=':', alpha=0.7)
axes[0].text(10, 0.55, 'LR÷10', fontsize=9, color='green')
axes[0].text(20, 0.55, 'LR÷10', fontsize=9, color='green')

# Loss plot
axes[1].plot(range(1, epochs + 1), train_loss, 'b-', label='Training Loss', linewidth=2)
axes[1].plot(range(1, epochs + 1), val_loss, 'r--', label='Validation Loss', linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Cross-Entropy Loss')
axes[1].set_title('Training Progress: Loss')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Mark LR reduction points
for lr_step in [10, 20]:
    axes[1].axvline(x=lr_step, color='green', linestyle=':', alpha=0.7)

plt.tight_layout()
plt.show()

print(f"Final Training Accuracy: {train_accuracy[-1]:.4f}")
print(f"Final Validation Accuracy: {val_accuracy[-1]:.4f}")
print(f"Final Training Loss: {train_loss[-1]:.4f}")
print(f"Final Validation Loss: {val_loss[-1]:.4f}")
print(f"\nGap (train-val accuracy): {train_accuracy[-1] - val_accuracy[-1]:.4f}")

---

## Summary

In this exercise, you learned:

### 1. Data Formats
- **RecordIO**: Binary format, most efficient for large datasets
- **Image + LST**: Separate images and list file mapping labels
- **Augmented Manifest**: JSON Lines with S3 references
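
As a reminder of what the two text-based formats look like, the sketch below builds a few rows of each. An `.lst` row is tab-separated (index, class label, relative image path), and an augmented manifest holds one JSON object per line with a `source-ref` key. The file names, the bucket, and the label attribute name `class` are hypothetical; the attribute name is chosen by the user and must match the attribute names configured on the training channel.

```python
import json

def make_lst_lines(samples):
    """Build .lst rows: index, class label, relative image path (tab-separated)."""
    return [f"{idx}\t{label}\t{path}" for idx, (path, label) in enumerate(samples)]

def make_manifest_lines(samples, s3_prefix, label_attribute="class"):
    """Build augmented-manifest rows: one JSON object per line."""
    return [
        json.dumps({"source-ref": f"{s3_prefix}/{path}", label_attribute: label})
        for path, label in samples
    ]

# Hypothetical images with 0-based class indices
samples = [("airplane/img_001.jpg", 0), ("bird/img_002.jpg", 2)]

lst_lines = make_lst_lines(samples)
manifest_lines = make_manifest_lines(samples, "s3://example-bucket/train")

print("\n".join(lst_lines))
print("\n".join(manifest_lines))
```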

### 2. Key Hyperparameters

| Category | Parameters |
|----------|------------|
| Architecture | `num_layers`, `use_pretrained_model`, `image_shape` |
| Training | `epochs`, `mini_batch_size`, `learning_rate` |
| Optimizer | `optimizer`, `momentum`, `weight_decay` |
| Augmentation | `augmentation_type`, `resize` |
| Classification | `top_k`, `multi_label` |

### 3. Model Output
- **Single-label**: Probability distribution (softmax, sums to 1)
- **Multi-label**: Independent probabilities (sigmoid, 0-1 each)

### 4. Evaluation Metrics

| Task | Metrics |
|------|--------|
| Single-label | Top-1/5 Accuracy, Confusion Matrix, Per-class Precision/Recall |
| Multi-label | Precision, Recall, F1 Score |

### 5. Transfer Learning
- Use `use_pretrained_model=1` for smaller datasets
- Use a lower learning rate (for example, 0.001 instead of 0.1) to preserve pretrained features
- Often converges in fewer epochs than training from scratch
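
The guidance above can be captured in a small helper that switches the learning rate with the training mode. This is a sketch under assumptions: `build_hyperparameters` is a hypothetical helper, and the values for `num_layers`, `image_shape`, `mini_batch_size`, and `epochs` are illustrative starting points rather than recommendations from the algorithm documentation.

```python
def build_hyperparameters(num_classes, num_training_samples, pretrained=True):
    """Return an illustrative hyperparameter dict for the built-in algorithm."""
    return {
        "num_layers": 50,
        "use_pretrained_model": 1 if pretrained else 0,
        "image_shape": "3,224,224",
        "num_classes": num_classes,
        "num_training_samples": num_training_samples,
        "mini_batch_size": 32,
        # Lower rate when fine-tuning so pretrained features are preserved
        "learning_rate": 0.001 if pretrained else 0.1,
        # Fine-tuning often converges in fewer epochs (assumed values)
        "epochs": 10 if pretrained else 30,
    }

transfer_hp = build_hyperparameters(num_classes=5, num_training_samples=2500)
scratch_hp = build_hyperparameters(num_classes=5, num_training_samples=2500,
                                   pretrained=False)

print("Transfer learning:", transfer_hp["learning_rate"], "for", transfer_hp["epochs"], "epochs")
print("From scratch:     ", scratch_hp["learning_rate"], "for", scratch_hp["epochs"], "epochs")
```

A dict like this could be passed to an estimator's `set_hyperparameters(**transfer_hp)` call, assuming the estimator was created for the Image Classification algorithm.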

### Instance Requirements

| Task | Instance Types | Notes |
|------|----------------|-------|
| Training | ml.g4dn.xlarge, ml.p3.2xlarge | **GPU required** |
| Inference | ml.m5.large (CPU), ml.c5.large (CPU) | GPU optional for real-time |

### Cost Considerations
- Training costs: $2-15 depending on dataset size and epochs
- Use Spot Instances for up to 70% savings
- Start with ml.g4dn.xlarge (~$0.74/hour) for cost efficiency
- Use ResNet-50 instead of ResNet-152 for faster iteration
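
A back-of-the-envelope estimate makes these numbers concrete. The sketch below assumes a hypothetical 4-hour job on ml.g4dn.xlarge at the approximate rate quoted above and applies the best-case Spot discount; actual prices vary by region and over time, so treat the output as an illustration only.

```python
def estimate_training_cost(hours, hourly_rate, spot_discount=0.0):
    """Estimate cost in USD; spot_discount is a fraction in [0, 1)."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    return hours * hourly_rate * (1.0 - spot_discount)

# Hypothetical 4-hour job at roughly $0.74/hour
on_demand = estimate_training_cost(hours=4, hourly_rate=0.74)
with_spot = estimate_training_cost(hours=4, hourly_rate=0.74, spot_discount=0.70)

print(f"On-demand estimate: ${on_demand:.2f}")
print(f"Spot estimate:      ${with_spot:.2f}")
```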

### Next Steps
1. Obtain real labeled image data (CIFAR-10, ImageNet subset, or custom)
2. Use SageMaker Ground Truth for custom dataset labeling
3. Experiment with different `num_layers` and `augmentation_type` settings
4. Monitor CloudWatch metrics during training
5. Try multi-label classification for images with multiple attributes

## Resources

- [SageMaker Image Classification Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html)
- [Image Classification Hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html)
- [CIFAR-10 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html) - Good for testing
- [ImageNet](https://www.image-net.org/) - Large-scale image dataset
- [SageMaker Ground Truth](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html) - For custom labeling
- [AWS Pricing Calculator](https://calculator.aws/) - Estimate training costs