# Module 3: Data Pipelines (Dataset, DataLoader & Transforms)

---
## Why Data Pipelines Matter

### 🧠 Brain Analogy
Before your brain can learn, information has to arrive in the right format and at the right pace. A PyTorch data pipeline is your study assistant: **Dataset** = the whole flashcard collection (891 passengers), **DataLoader** = hands you 32 shuffled cards at a time, **Feature Engineering** = translating raw text ("male") into numbers.

### ⚙️ Engineer Analogy
Data wrangling takes up most of the effort in a typical ML project (80% is the commonly quoted rule of thumb). PyTorch separates data logic (`Dataset`) from model logic (`nn.Module`), which keeps pipelines reproducible, testable and reusable.

**Level:** Beginner → Intermediate  
**Duration:** ~3 hours  
**Dataset:** Titanic Survival ([Kaggle Competition](https://www.kaggle.com/competitions/titanic/data))  
**Real-World Use Case:** Tabular binary classification with feature engineering

## What You'll Learn
- Custom `torch.utils.data.Dataset` class
- `DataLoader`: batching, shuffling, parallel loading
- Feature engineering & preprocessing pipeline
- Handling class imbalance (weighted sampling)
- Validation split strategies
- Experiment tracking with a training history dict

## Why This Matters
In real projects, data wrangling is most of the work. A clean PyTorch data pipeline separates your **data logic** from your **model logic**, which makes the code maintainable and reusable.

In [None]:
# 🧠 Gathering all tools: data loading, preprocessing, evaluation instruments
# ⚙️ WeightedRandomSampler = counter class imbalance | roc_auc_score = more informative than accuracy on skewed data
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix

torch.manual_seed(42)
np.random.seed(42)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

## 3.1  Download & Load Titanic Data

### 🧠 Brain Analogy
891 passengers, each described by age, sex, class, family size and fare. The brain must learn who survived, like a detective inferring outcomes from evidence: "women in 1st class had priority access to lifeboats."

### ⚙️ Engineer Analogy
Binary classification: `y ∈ {0,1}`. Only about 38% survived, so the classes are imbalanced. Missing values need imputation. AUC-ROC is more informative than accuracy on imbalanced data, because a model that always predicts "died" already reaches about 62% accuracy.
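
To make the imbalance argument concrete, here is a minimal sketch that scores a "model" which always predicts the majority class. The label list is a stand-in built from the known class counts (549 died, 342 survived), not the loaded DataFrame, and the variable names are illustrative.

```python
# Stand-in label vector with the Titanic class balance:
# 549 passengers died (0) and 342 survived (1), 891 in total.
labels = [0] * 549 + [1] * 342

# A "model" that ignores its input and always predicts the majority class.
always_died = [0] * len(labels)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(p == t for p, t in zip(always_died, labels)) / len(labels)

# Recall on survivors: fraction of true survivors the model identifies.
recall_survived = (
    sum(p == 1 and t == 1 for p, t in zip(always_died, labels)) / sum(labels)
)

print(f"Survival rate:           {sum(labels) / len(labels):.3f}")
print(f"Majority-class accuracy: {accuracy:.3f}")
print(f"Recall on survivors:     {recall_survived:.3f}")
```

The accuracy looks respectable even though the model never finds a single survivor, which is why the rest of this module also tracks a ranking metric.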

```bash
# Option A: Kaggle API (requires a Kaggle account and API token)
kaggle competitions download -c titanic

# Option B: no download or authentication needed.
# The notebook below uses seaborn's built-in copy of the dataset.
```

In [None]:
# 🧠 Opening the historical record of 891 passengers, of whom about 38% survived
# ⚙️ 38% survival = class imbalance; naive "always guess died" gets ~62% accuracy for free!
# Using seaborn's built-in Titanic (same data, no Kaggle account needed)
df = sns.load_dataset('titanic')
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nMissing values:\n", df.isnull().sum()[df.isnull().sum() > 0])
print("\nSurvival rate:", df['survived'].mean().round(3))

In [None]:
# 🧠 Detectives review evidence before forming a theory: look for survival patterns
# ⚙️ EDA: spot class imbalance and which features (sex, class, age) predict survival
# ── Exploratory plots ──────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

sns.barplot(data=df, x='sex',    y='survived', ax=axes[0]).set_title('Survival by Sex')
sns.barplot(data=df, x='pclass', y='survived', ax=axes[1]).set_title('Survival by Class')
df[df['age'].notna()].groupby('survived')['age'].plot.hist(
    ax=axes[2], alpha=0.6, bins=25, legend=True
)
axes[2].set_title('Age Distribution by Survival')
axes[2].legend(['Died (0)', 'Survived (1)'])

plt.tight_layout()
plt.show()

## 3.2  Feature Engineering

### 🧠 Brain Analogy
Raw data is in a foreign language, so translate it: "male"→1, and create new clues ("travelling alone?", "fare per person"). Domain knowledge matters: families tried to stay together → create `family_size`.

### ⚙️ Engineer Analogy
Feature engineering: `x_raw ∈ mixed_types` → `x_clean ∈ ℝ¹¹`. Impute NaN with median/mode. Encode categoricals. Create interaction terms (`age × pclass`). Derived features inject domain knowledge.
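
As a worked example of the mapping from a mixed-type record to 11 numbers, the sketch below processes one made-up passenger by hand. The record, the `AGE_MEDIAN` constant and the `to_features` helper are illustrative assumptions; in the notebook the imputation value comes from the data itself.

```python
# Made-up raw record using the seaborn Titanic column names.
raw = {"pclass": 3, "sex": "male", "age": None, "sibsp": 1, "parch": 0,
       "fare": 14.5, "embarked": "S"}

AGE_MEDIAN = 28.0                            # assumed imputation value
EMBARKED_CODES = {"C": 0, "Q": 1, "S": 2}    # alphabetical order, as LabelEncoder assigns


def to_features(rec):
    """Turn one raw passenger record into an 11-element feature list."""
    age = rec["age"] if rec["age"] is not None else AGE_MEDIAN   # impute
    family_size = rec["sibsp"] + rec["parch"] + 1                # passenger included
    return [
        rec["pclass"],
        1 if rec["sex"] == "male" else 0,        # sex_enc
        age,
        rec["sibsp"],
        rec["parch"],
        rec["fare"],
        EMBARKED_CODES[rec["embarked"]],         # emb_enc
        family_size,
        1 if family_size == 1 else 0,            # is_alone
        rec["fare"] / family_size,               # fare_per_person
        age * rec["pclass"],                     # age_class interaction term
    ]


features = to_features(raw)
print(len(features), features)
```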



In [None]:
# 🧠 Translate foreign-language data into 11 numbers the brain can process per passenger
# ⚙️ Systematic pipeline: impute → encode → engineer → select feature columns
def feature_engineer(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    """Return a clean feature matrix and the survival labels from raw Titanic data."""
    d = df.copy()

    # ── Fill missing values ────────────────────────────────
    d['age']      = d['age'].fillna(d['age'].median())
    d['embarked'] = d['embarked'].fillna(d['embarked'].mode()[0])
    d['fare']     = d['fare'].fillna(d['fare'].median())

    # ── Encode categoricals ────────────────────────────────
    d['sex_enc']  = (d['sex'] == 'male').astype(int)   # 0=female, 1=male
    d['emb_enc']  = LabelEncoder().fit_transform(d['embarked'])

    # ── Engineered features ────────────────────────────────
    d['family_size'] = d['sibsp'] + d['parch'] + 1
    d['is_alone']    = (d['family_size'] == 1).astype(int)
    d['fare_per_person'] = d['fare'] / d['family_size']
    d['age_class']   = d['age'] * d['pclass']          # interaction term

    feature_cols = [
        'pclass', 'sex_enc', 'age', 'sibsp', 'parch', 'fare',
        'emb_enc', 'family_size', 'is_alone', 'fare_per_person', 'age_class'
    ]
    return d[feature_cols], d['survived']


X, y = feature_engineer(df)
print("Feature matrix shape:", X.shape)
X.describe().round(2)

## 3.3  Custom PyTorch Dataset Class

### 🧠 Brain Analogy
Dataset = a library of memories. `__init__` organises the shelf. `__len__` says how many "books" exist. `__getitem__(i)` hands you book number i. DataLoader is the librarian: it picks books and delivers them in batches.

### ⚙️ Engineer Analogy
`Dataset` protocol: `__len__` enables `len()`, `__getitem__` enables indexing. Convert to tensors once in `__init__` rather than per sample, which is faster. `y.reshape(-1,1)` gives the targets shape (N,1), matching the model's (N,1) logits as `BCEWithLogitsLoss` requires.

The three methods to implement: `__init__`, `__len__`, `__getitem__`. A map-style dataset strictly needs only `__getitem__` and `__len__`; `__init__` is where the data gets stored.
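
Before wrapping the real data, it can help to see the protocol on a toy example. The sketch below uses a hypothetical `ToyTabularDataset` filled with random numbers (not Titanic data) to show what `len()`, indexing and a `DataLoader` batch return.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyTabularDataset(Dataset):
    """Hypothetical map-style dataset: 10 rows, 11 random features, binary label."""

    def __init__(self, n_rows=10, n_features=11):
        g = torch.Generator().manual_seed(0)          # fixed seed for repeatability
        self.X = torch.randn(n_rows, n_features, generator=g)
        self.y = (torch.arange(n_rows) % 2).float().reshape(-1, 1)   # shape (N, 1)

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


toy_ds = ToyTabularDataset()
x0, y0 = toy_ds[0]                                    # calls __getitem__(0)
batch_x, batch_y = next(iter(DataLoader(toy_ds, batch_size=4, shuffle=False)))

print(len(toy_ds), tuple(x0.shape), tuple(y0.shape))
print(tuple(batch_x.shape), tuple(batch_y.shape))
```

The `DataLoader` stacks the individual `(features, label)` pairs along a new first dimension, so the label batch comes out as `(batch_size, 1)`.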

In [None]:
# 🧠 Library of passenger flashcards: 11 features + survived/died label per card
# ⚙️ Convert to tensors ONCE in __init__ (amortises conversion cost across all samples)
class TitanicDataset(Dataset):
    """PyTorch Dataset wrapping the Titanic tabular data."""

    def __init__(self, X: np.ndarray, y: np.ndarray):
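        # Store as float32 tensors (as_tensor avoids the legacy FloatTensor constructor)
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.float32).reshape(-1, 1)   # shape (N,1) to match the logits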

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


# ── Split data ─────────────────────────────────────────────
X_np = X.values.astype(np.float32)
y_np = y.values.astype(np.float32)

X_train, X_val, y_train, y_val = train_test_split(
    X_np, y_np, test_size=0.2, random_state=42, stratify=y_np
)

# ── Normalize (fit on train only, then apply to val) ───────
scaler  = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val   = scaler.transform(X_val)

train_ds = TitanicDataset(X_train, y_train)
val_ds   = TitanicDataset(X_val,   y_val)

print(f"Train samples: {len(train_ds)}, Val samples: {len(val_ds)}")
x_sample, y_sample = train_ds[0]
print(f"Sample features shape: {x_sample.shape}, label: {y_sample.item()}")

## 3.4  DataLoader: Batching, Shuffling, Weighted Sampling

### 🧠 Brain Analogy
DataLoader = study assistant: shuffles cards, hands you 32 at a time. Class imbalance fix: with 549 "died" vs 342 "survived" in the full dataset, the brain can learn "just guess died" (lazy!). WeightedRandomSampler gives each survivor about 1.6× the draw probability → balanced study sessions.

### ⚙️ Engineer Analogy
`WeightedRandomSampler`: per-sample weight = `1/class_count`, so each sample's draw probability is proportional to it. This corrects the gradient bias toward the majority class. `pin_memory=True`: page-locks host memory → faster DMA transfers to the GPU. `sampler` and `shuffle=True` are mutually exclusive; passing both raises an error.

The Titanic dataset is **imbalanced** (~38% survived). We handle this with weighted sampling.
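
The arithmetic behind inverse-frequency weights can be checked without PyTorch. The sketch below is a standard-library simulation using `random.choices` as a stand-in for the sampler, with a label list built from the full-dataset class counts; it is not how `WeightedRandomSampler` is implemented.

```python
import random

labels = [0] * 549 + [1] * 342                 # stand-in labels with the Titanic balance
class_counts = [labels.count(0), labels.count(1)]
sample_weights = [1.0 / class_counts[t] for t in labels]

# Summed weight per class: both come out equal, so each class is expected
# to supply half of the draws.
mass = [sum(w for w, t in zip(sample_weights, labels) if t == c) for c in (0, 1)]

rng = random.Random(0)
draws = rng.choices(labels, weights=sample_weights, k=10_000)   # with replacement

print(f"Weight mass per class: {mass[0]:.3f} vs {mass[1]:.3f}")
print(f"Survivor fraction in 10,000 weighted draws: {sum(draws) / len(draws):.3f}")
print(f"Per-sample weight ratio (survivor / died): {class_counts[0] / class_counts[1]:.2f}")
```

Because sampling is with replacement, some minority-class rows are seen several times per epoch while some majority-class rows are skipped.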

In [None]:
# 🧠 Balance the study deck: give rare survivors the same frequency as common deaths
# ⚙️ sample_weight[i] = 1/class_count[class_i] → equalises the expected class distribution per batch
# ── Handle class imbalance with WeightedRandomSampler ──────────────────
class_counts = np.bincount(y_train.astype(int))   # [n_died, n_survived]
class_weights = 1.0 / class_counts
sample_weights = class_weights[y_train.astype(int)]   # weight per sample

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(train_ds),
    replacement=True
)

train_loader = DataLoader(
    train_ds,
    batch_size=32,
    sampler=sampler,      # replaces shuffle=True
    num_workers=0,        # increase for large datasets
    pin_memory=(device.type == 'cuda')
)

val_loader = DataLoader(val_ds, batch_size=64, shuffle=False)

print(f"Class counts: died={class_counts[0]}, survived={class_counts[1]}")
print(f"Class weights: died={class_weights[0]:.3f}, survived={class_weights[1]:.3f}")

# Inspect a batch
xb, yb = next(iter(train_loader))
print(f"\nBatch shapes → X: {xb.shape}, y: {yb.shape}")
print(f"Batch survival rate: {yb.mean().item():.2%}  (≈50% in expectation; a single batch of 32 will vary)")

## 3.5  Model: Deep Tabular Network

### 🧠 Brain Analogy
BatchNorm normalises internal brain signals between layers, keeping activations in a comfortable range. Dropout of 0.3 adds fairly strong regularisation for the small, noisy Titanic data. BCEWithLogitsLoss is for binary decisions (survived/died) with one output neuron.

### ⚙️ Engineer Analogy
`BatchNorm1d`: normalises mini-batch activations, which stabilises training and usually tolerates a higher learning rate (originally motivated as reducing internal covariate shift). `BCEWithLogitsLoss` = sigmoid + BCE in one numerically stable step. `AdamW` = Adam with decoupled weight decay. `CosineAnnealingLR` = smooth cosine decay of the learning rate over `T_max` steps.
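
To illustrate what "numerically stable" means for the sigmoid + BCE combination, here is a scalar sketch in plain Python. The rearranged formula is the standard log-sum-exp form; the function names are illustrative and this is not PyTorch's actual implementation.

```python
import math


def bce_with_logits(x, y):
    """Stable binary cross-entropy for one logit x and one target y in {0, 1}."""
    # Algebraically equal to -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(x),
    # rearranged so that exp() only ever receives a non-positive argument.
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))


def bce_naive(x, y):
    """Textbook form: apply the sigmoid first, then take logs."""
    s = 1.0 / (1.0 + math.exp(-x))
    return -(y * math.log(s) + (1 - y) * math.log(1 - s))


# The two forms agree for moderate logits.
print(bce_with_logits(2.0, 1.0), bce_naive(2.0, 1.0))

# For a confident wrong prediction the sigmoid saturates to exactly 1.0 in
# float64, so the naive form ends up taking log(0).
print(bce_with_logits(40.0, 0.0))
try:
    bce_naive(40.0, 0.0)
except ValueError as err:
    print("naive form fails:", err)
```

This is the practical reason for keeping the sigmoid out of the model and letting the loss work on raw logits.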



In [None]:
# 🧠 4-layer brain: detect patterns → combine → output one number (a survival logit)
# ⚙️ Output = single logit; BCEWithLogitsLoss applies sigmoid internally, so never add a sigmoid layer to the model
class TitanicNet(nn.Module):
    """Deep network for tabular binary classification."""

    def __init__(self, n_features=11):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(0.3),

            nn.Linear(64, 32),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.Dropout(0.2),

            nn.Linear(32, 16),
            nn.ReLU(),

            nn.Linear(16, 1)   # output: single logit for binary classification
        )

    def forward(self, x):
        return self.net(x)


model = TitanicNet(n_features=X_train.shape[1]).to(device)
criterion = nn.BCEWithLogitsLoss()   # Binary cross-entropy (includes sigmoid)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {total_params:,}")

## 3.6  Training with DataLoader

### 🧠 Brain Analogy
Same 5-step loop with DataLoader delivering batches automatically. AUC is a better score than accuracy for imbalanced data: "do survivors score higher than non-survivors?" Save the best checkpoint, not just the last epoch.

### ⚙️ Engineer Analogy
AUC-ROC: threshold-independent ranking metric. AUC=1 perfect, AUC=0.5 random. Move batches to the device just-in-time to minimise peak GPU memory usage.



In [None]:
# 🧠 Auto-delivered batches, balanced sampling, save best checkpoint (not just final epoch)
# ⚙️ AUC summarises ranking quality across all thresholds, unlike accuracy at a fixed 0.5 threshold
def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss, correct, total = 0.0, 0, 0

    for X_b, y_b in loader:
        X_b, y_b = X_b.to(device), y_b.to(device)

        optimizer.zero_grad()
        logits = model(X_b)
        loss   = criterion(logits, y_b)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * len(X_b)
        preds   = (torch.sigmoid(logits) >= 0.5).long()
        correct += (preds == y_b.long()).sum().item()
        total   += len(X_b)

    return running_loss / total, correct / total


@torch.no_grad()
def evaluate(model, loader, criterion, device):
    model.eval()
    running_loss, correct, total = 0.0, 0, 0
    all_probs, all_labels = [], []

    for X_b, y_b in loader:
        X_b, y_b = X_b.to(device), y_b.to(device)
        logits = model(X_b)
        loss   = criterion(logits, y_b)
        probs  = torch.sigmoid(logits)

        running_loss += loss.item() * len(X_b)
        preds   = (probs >= 0.5).long()
        correct += (preds == y_b.long()).sum().item()
        total   += len(X_b)

        all_probs.extend(probs.cpu().numpy().flatten())
        all_labels.extend(y_b.cpu().numpy().flatten())

    auc = roc_auc_score(all_labels, all_probs)
    return running_loss / total, correct / total, auc


# ── Run training ───────────────────────────────────────────
EPOCHS = 100
history = {k: [] for k in ['tr_loss','tr_acc','vl_loss','vl_acc','vl_auc']}
best_val_auc = 0.0

for epoch in range(1, EPOCHS + 1):
    tr_l, tr_a          = train_one_epoch(model, train_loader, criterion, optimizer, device)
    vl_l, vl_a, vl_auc = evaluate(model, val_loader, criterion, device)
    scheduler.step()

    history['tr_loss'].append(tr_l);  history['tr_acc'].append(tr_a)
    history['vl_loss'].append(vl_l);  history['vl_acc'].append(vl_a)
    history['vl_auc'].append(vl_auc)

    if vl_auc > best_val_auc:
        best_val_auc = vl_auc
        torch.save(model.state_dict(), 'titanic_best.pth')   # save best

    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d} | tr_loss {tr_l:.4f} tr_acc {tr_a:.3f} | "
              f"vl_loss {vl_l:.4f} vl_acc {vl_a:.3f} vl_AUC {vl_auc:.4f}")

print(f"\nBest Validation AUC: {best_val_auc:.4f}")

## 3.7  Visualise & Evaluate

### 🧠 Brain Analogy
ROC curve: at each threshold, how many real survivors do we catch vs how many false alarms do we raise? AUC summarises this as one number. Perfect = top-left corner. Random = diagonal. Confusion matrix: where exactly did the model get confused?

### ⚙️ Engineer Analogy
ROC: TPR (recall for positive) vs FPR (1-specificity) at varying thresholds. AUC = probability that a random positive is ranked higher than a random negative.
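
The ranking interpretation of AUC can be computed directly by comparing every positive with every negative. The sketch below does this on a small made-up set of labels and scores; `pairwise_auc` is an illustrative helper, and for real evaluation `roc_auc_score` remains the practical choice.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: three survivors (1) and three non-survivors (0).
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

print(f"Pairwise AUC:      {pairwise_auc(labels, scores):.4f}")
print(f"Perfect ranking:   {pairwise_auc([0, 1], [0.2, 0.7]):.1f}")
print(f"Constant scores:   {pairwise_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]):.1f}")
```

Only the ordering of the scores matters here, which is why AUC does not change when the decision threshold moves. The double loop costs one comparison per positive-negative pair, so it suits small illustrations rather than large validation sets.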



In [None]:
# 🧠 Learning diary: fewer mistakes? More correct? AUC rising toward 1.0?
# ⚙️ Three subplots: loss, accuracy, AUC-ROC over training epochs
# ── Training curves ────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].plot(history['tr_loss'], label='Train')
axes[0].plot(history['vl_loss'], label='Val')
axes[0].set_title('Loss');  axes[0].legend()

axes[1].plot(history['tr_acc'], label='Train')
axes[1].plot(history['vl_acc'], label='Val')
axes[1].set_title('Accuracy'); axes[1].legend()

axes[2].plot(history['vl_auc'], color='purple')
axes[2].set_title('Validation AUC-ROC')

plt.suptitle('Titanic Survival Model: Training History')
plt.tight_layout()
plt.show()

In [None]:
# 🧠 Load best state, plot decision quality: ROC curve + confusion matrix
# ⚙️ ROC = tradeoff at all thresholds | confusion matrix = specific errors at threshold=0.5
# ── Load best model & ROC curve ────────────────────────────
model.load_state_dict(torch.load('titanic_best.pth', map_location=device))
_, val_acc, val_auc = evaluate(model, val_loader, criterion, device)

# Collect predictions for plots
model.eval()
all_probs, all_labels, all_preds = [], [], []
with torch.no_grad():
    for X_b, y_b in val_loader:
        X_b = X_b.to(device)
        probs = torch.sigmoid(model(X_b)).cpu().numpy().flatten()
        all_probs.extend(probs)
        all_labels.extend(y_b.numpy().flatten())
        all_preds.extend((probs >= 0.5).astype(int))

fpr, tpr, _ = roc_curve(all_labels, all_probs)
cm = confusion_matrix(all_labels, all_preds)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.plot(fpr, tpr, lw=2, label=f'AUC = {val_auc:.4f}')
ax1.plot([0,1],[0,1],'--', color='gray')
ax1.set_xlabel('False Positive Rate'); ax1.set_ylabel('True Positive Rate')
ax1.set_title('ROC Curve: Titanic Survival'); ax1.legend()

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Died','Survived'], yticklabels=['Died','Survived'], ax=ax2)
ax2.set_title(f'Confusion Matrix (acc={val_acc:.1%})')
ax2.set_ylabel('True'); ax2.set_xlabel('Predicted')

plt.tight_layout()
plt.show()

## 3.8  DataLoader Tips for Production

### 🧠 Brain Analogy
`num_workers=4` = 4 assistants preparing the next batches in parallel. `pin_memory=True` = pre-position cards in hand for instant GPU transfer. `prefetch_factor=2` = each assistant keeps two batches ready ahead of time.

### ⚙️ Engineer Analogy
Production DataLoader: `num_workers` (parallel worker processes), `pin_memory` (page-locked memory → faster DMA), seeded workers (reproducible augmentation), `prefetch_factor` (batches pre-loaded per worker).

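```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader

# `dataset` stands for any Dataset instance, e.g. train_ds from above.

# Large datasets: parallel loading
DataLoader(dataset, batch_size=256, num_workers=4, pin_memory=True)

# Reproducibility: derive each worker's NumPy/random seed from the PyTorch seed
def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)
DataLoader(dataset, num_workers=4, worker_init_fn=seed_worker, generator=g)

# Prefetching: batches loaded in advance per worker (2 is the default; needs num_workers > 0)
DataLoader(dataset, num_workers=4, prefetch_factor=2)
```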

## Exercises

1. Download the real Titanic CSV from Kaggle and build a `TitanicDataset` that reads directly from file.
2. Add a **threshold tuning** step: instead of 0.5, find the threshold that maximises F1 score.
3. Implement **k-fold cross-validation** using `sklearn.model_selection.StratifiedKFold`.
4. Use `torch.utils.tensorboard.SummaryWriter` to log losses and view in TensorBoard.

---
**Next →** [Module 04: Convolutional Neural Networks (CIFAR-10)](./Module_04_CNNs_CIFAR10.ipynb)