<a href="https://colab.research.google.com/github/enisba/inzva_DLSG_notebook/blob/main/Assignment1_1_debugging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1: The Broken Experiment

## Debugging a Failing Deep Learning Pipeline

---

### Background

Your colleague has been trying to train a Multi-Layer Perceptron (MLP) to predict **California housing prices** using the classic California Housing dataset. However, their training pipeline is producing terrible results: the loss explodes, the predictions are nonsensical, and they can't figure out why.

They've asked you for help. Your job is to **find and fix all the bugs** in this notebook.

### Instructions

1. **Read through the entire notebook** before making changes. Understand the intent of each cell.
2. **Identify at least 6 bugs** across data preprocessing, model architecture, loss function, optimization, and evaluation.
3. For **each bug you find**, add a markdown cell explaining:
   - What the bug is
   - Why it causes training to fail (cite the relevant theory)
   - How you fixed it

### Hints

The bugs fall into these categories:
- **Data** (2 bugs)
- **Model Architecture** (1 bug)
- **Loss / Optimization** (2 bugs)
- **Evaluation / Methodology** (1 bug)

Good luck!

## Setup & Imports

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

print("PyTorch version:", torch.__version__)
print("Device:", "cuda" if torch.cuda.is_available() else "cpu")

## Load and Prepare the Data

We load the California Housing dataset. It contains 8 features (median income, house age, average rooms, etc.) and the target is the **median house value** (in units of $100,000).
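
Because the target is expressed in units of 100,000 dollars, any error reported later in the notebook is in those same units. As a minimal sketch (the helper name `to_dollars` is hypothetical and not part of the original notebook), a value in target units can be converted to dollars like this:

```python
# Hypothetical helper (not part of the original notebook): converts a value
# expressed in the dataset's target units (100,000 dollars) into plain dollars.
TARGET_UNIT_DOLLARS = 100_000


def to_dollars(value_in_target_units: float) -> float:
    """Convert a target-scale value (or error) to dollars."""
    return value_in_target_units * TARGET_UNIT_DOLLARS


# A target of 2.5 corresponds to a median house value of 250,000 dollars;
# an absolute error of 0.5 means a prediction is off by 50,000 dollars.
print(f"${to_dollars(2.5):,.0f}")
print(f"${to_dollars(0.5):,.0f}")
```

Keeping this conversion in mind helps when judging whether a reported error is plausible for house prices.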

In [None]:
# Load dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

print("Feature columns:", list(X.columns))
print("Dataset shape:", X.shape)
print("Target range: [{:.2f}, {:.2f}]".format(y.min(), y.max()))

In [None]:
# Add target to the dataframe for "convenience"
X['MedHouseVal'] = y

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X.values, y, test_size=0.2, random_state=42
)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)
print("Number of input features:", X_train.shape[1])

In [None]:
# Convert to PyTorch tensors — feeding raw features directly
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)

X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)

# Create DataLoaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

print("Feature sample (first row):", X_train_tensor[0])
print("Feature means:", X_train_tensor.mean(dim=0))
print("Feature stds:", X_train_tensor.std(dim=0))

## Define the Model

We build a simple MLP with 3 hidden layers for regression.
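
As a cross-check on the parameter count printed by the next cell, the number of trainable parameters in a stack of fully connected layers can be computed by hand: a linear layer mapping `n_in` inputs to `n_out` outputs has `n_in * n_out` weights plus `n_out` biases, and activation functions such as ReLU add none. The sketch below assumes the hidden widths 128, 64, and 32 used in this notebook and, purely for illustration, the 8 features listed in the dataset description; `mlp_param_count` is a hypothetical helper name.

```python
# Hypothetical helper: counts the weights and biases of an MLP built from
# fully connected layers, given the sequence of layer widths.
def mlp_param_count(layer_sizes):
    total = 0
    # Pair each width with the next one: one (n_in, n_out) pair per linear layer.
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total


# Assumed widths: 8 input features -> 128 -> 64 -> 32 -> 1 output.
print(mlp_param_count([8, 128, 64, 32, 1]))  # 11521
```

Comparing a hand count like this with the printed figure is a quick way to confirm that the model was built with the input width you intended.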

In [None]:
class HousingMLP(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Softmax(dim=1)  # Output activation for regression
        )

    def forward(self, x):
        return self.network(x)


input_dim = X_train_tensor.shape[1]
model = HousingMLP(input_dim)
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

## Define Loss and Optimizer

In [None]:
# Loss function for our regression task
criterion = nn.CrossEntropyLoss()

# Optimizer
optimizer = optim.Adam(model.parameters(), lr=10.0)

## Training Loop

In [None]:
def train_model(model, train_loader, criterion, optimizer, epochs=50):
    """
    Standard training loop.
    Returns a list of average training loss per epoch.
    """
    model.train()
    train_losses = []

    for epoch in range(epochs):
        epoch_loss = 0.0
        num_batches = 0

        for X_batch, y_batch in train_loader:
            # Forward pass
            predictions = model(X_batch)
            loss = criterion(predictions, y_batch)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            num_batches += 1

        avg_loss = epoch_loss / num_batches
        train_losses.append(avg_loss)

        if (epoch + 1) % 10 == 0:
            print(f"Epoch [{epoch+1}/{epochs}] — Train Loss: {avg_loss:.4f}")

    return train_losses

In [None]:
# Run training
NUM_EPOCHS = 50
losses = train_model(model, train_loader, criterion, optimizer, epochs=NUM_EPOCHS)

## Evaluate the Model
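
For reference, `evaluate_model` reports two metrics, the mean squared error (MSE) and the mean absolute error (MAE). Writing $\hat{y}_i$ for the prediction and $y_i$ for the target of example $i$ out of $N$ (notation introduced here, not in the original notebook), they are:

$$
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2,
\qquad
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left\lvert \hat{y}_i - y_i \right\rvert
$$

MAE is in the same units as the target, while MSE is in those units squared, so MAE is usually the easier of the two to interpret directly.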

In [None]:
def evaluate_model(model, data_loader):
    """
    Evaluate model and return MSE and MAE.
    """
    model.eval()
    all_preds = []
    all_targets = []

    for X_batch, y_batch in data_loader:
        with torch.no_grad():
            preds = model(X_batch)
        all_preds.append(preds)
        all_targets.append(y_batch)

    all_preds = torch.cat(all_preds)
    all_targets = torch.cat(all_targets)

    mse = torch.mean((all_preds - all_targets) ** 2).item()
    mae = torch.mean(torch.abs(all_preds - all_targets)).item()
    return mse, mae


train_mse, train_mae = evaluate_model(model, train_loader)

print("=" * 50)
print("          MODEL EVALUATION RESULTS")
print("=" * 50)
print(f"  MSE:  {train_mse:.4f}")
print(f"  MAE:  {train_mae:.4f}")
print("=" * 50)
print("\nThe model is performing great!")

## Visualize Training
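
Alongside the plot, a short textual summary of the per-epoch losses returned by `train_model` can make problems easier to spot. The following is a minimal sketch and not part of the original notebook; `summarize_losses` is a hypothetical helper that only assumes its input is a list of floats.

```python
import math


def summarize_losses(losses):
    """Return basic health indicators for a list of per-epoch losses."""
    finite = [v for v in losses if math.isfinite(v)]
    return {
        "epochs": len(losses),
        # NaN or inf entries usually indicate numerical trouble during training.
        "non_finite": len(losses) - len(finite),
        "first": losses[0] if losses else None,
        "last": losses[-1] if losses else None,
        # True only when every value is finite and the loss went down overall.
        "decreased": bool(losses)
        and len(finite) == len(losses)
        and losses[-1] < losses[0],
    }


# Toy inputs for illustration only; these are not results from this notebook.
print(summarize_losses([2.0, 1.2, 0.9]))
print(summarize_losses([2.0, float("nan"), float("inf")]))
```

In this notebook it could be called as `summarize_losses(losses)` once training has finished.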

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(range(1, NUM_EPOCHS + 1), losses, label='Training Loss', color='blue')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss Over Epochs')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()