
# MLF Week 2: Logistic Regression, Model Evaluation, and Autograd

This notebook accompanies the **MLF Week 2 Slides** and focuses on **binary classification with logistic regression**, **model evaluation**, and **using PyTorch `nn` and `optim`**.

**You will:**
- Understand sigmoid and decision boundaries
- Implement Binary Cross-Entropy (BCE) and compare with PyTorch
- Split data into **train**, **validation**, and **test** sets
- Build a logistic regression classifier using `torch.nn` and `torch.optim`
- Plot a decision boundary for a 2D dataset
- Diagnose **overfitting vs underfitting** and try **L2 regularization**
- Complete graded mini-exercises with tests


In [None]:
%pip install torch==2.8.0
%pip install matplotlib==3.10.6
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt

print("PyTorch version:", torch.__version__)


# 1. From Linear to Logistic

**Classification** predicts a **class label**. We start with a linear score, then squash it to a probability using the **sigmoid** function:

\begin{equation}
z = w^\top x + b,\qquad \sigma(z) = \frac{1}{1 + e^{-z}}.
\end{equation}

The model predicts class 1 when $\sigma(z) \ge 0.5$ and class 0 otherwise. Since $\sigma(z)=0.5$ exactly when $z=0$, the **decision boundary** is the set of points where $w^\top x + b = 0$ (a straight line for 2D inputs).
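Because $\sigma(0)=0.5$ and $\sigma$ is increasing, thresholding the probability at 0.5 should give the same labels as thresholding the logit at 0. The sketch below checks this on a few hand-picked logits; the values are made up for illustration only.

```python
import torch

# Hypothetical logits chosen for illustration.
z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 3.0])
probs = torch.sigmoid(z)

# Two equivalent ways to turn scores into class labels.
pred_from_probs = (probs >= 0.5).long()
pred_from_logits = (z >= 0).long()

print(probs)
print(pred_from_probs, pred_from_logits)
```

Both prediction tensors agree, so in practice you can often skip the sigmoid when you only need the predicted class.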



## 1.1 Exercise: Implement `sigmoid`

Fill in the TODO to implement a **numerically stable** sigmoid. Avoid overflow for large negative inputs.

**Hints**
- For $z \ge 0$: $\sigma(z) = 1 / (1 + e^{-z})$
- For $z < 0$: $\sigma(z) = e^{z} / (1 + e^{z})$
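To see why the two branches matter, here is a small sketch of what can go wrong in `float32` when a formula is applied on the "wrong" side. It only illustrates the failure mode and is not a solution to the exercise; the input value 100 is an arbitrary choice.

```python
import torch

z_big = torch.tensor(100.0)  # float32 by default

# exp(100) exceeds the float32 range, so it overflows to inf.
overflowed = torch.exp(z_big)

# Using e^z / (1 + e^z) on a large positive input gives inf / inf = nan.
bad = torch.exp(z_big) / (1.0 + torch.exp(z_big))

# 1 / (1 + e^{-z}) only exponentiates a negative number here, so it stays finite.
good = 1.0 / (1.0 + torch.exp(-z_big))

print(overflowed, bad, good)
```

The idea behind the hints is to pick, for each input, the form whose exponent is never positive.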

We will test your function below.


In [None]:
def student_sigmoid(z: torch.Tensor) -> torch.Tensor:
    """
    Numerically stable sigmoid.
    Args:
        z: tensor of logits
    Returns:
        probabilities in (0, 1)
    """
    # TODO: implement a numerically stable sigmoid (no torch.sigmoid call)
    # Replace the next line with your solution.
    pos_mask = (z >= 0)
    neg_mask = ~pos_mask
    out = torch.empty_like(z)

    # Positive branch: 1 / (1 + exp(-z))
    out[pos_mask] = 1.0 / (1.0 + torch.exp(-z[pos_mask]))
    # Negative branch: exp(z) / (1 + exp(z))
    ez = torch.exp(z[neg_mask])
    out[neg_mask] = ez / (1.0 + ez)
    return out


### Tests: `student_sigmoid`
These should pass if your implementation is correct.


In [None]:
from utils import grade_sigmoid, show_result

res = grade_sigmoid(student_sigmoid)
show_result("student_sigmoid", res)


# 2. Binary Cross-Entropy (BCE) Loss

For binary labels $y \in \{0,1\}$ and predicted probability $\hat{p} = \sigma(z)$, the **BCE** loss is

\begin{equation}
\mathcal{L}_{\text{BCE}}(y,\hat{p}) = -\Big[y\log(\hat{p}) + (1-y)\log(1-\hat{p})\Big].
\end{equation}

In practice we often pass logits $z$ directly to a numerically stable function like `nn.BCEWithLogitsLoss`, which internally applies sigmoid and BCE in one go.
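As a quick sanity check, the sketch below compares the formula above (sigmoid followed by BCE, averaged over the batch) with `nn.BCEWithLogitsLoss` on a few moderate logits. The logits and labels are made-up values; for such inputs the two results should agree up to floating-point error.

```python
import torch
import torch.nn as nn

z = torch.tensor([[-1.5], [0.3], [2.0]])  # hypothetical logits
y = torch.tensor([[0.0], [1.0], [1.0]])   # float labels, same shape as z

# Two-step version: sigmoid, then the BCE formula, then the mean.
p = torch.sigmoid(z)
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()

# Fused version that works directly on logits.
fused = nn.BCEWithLogitsLoss()(z, y)

print(manual.item(), fused.item())
```

The fused version is the safer choice for very large or very small logits, where the two-step version can hit $\log(0)$.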



## 2.1 Exercise: Implement `binary_cross_entropy`

Implement BCE given **probabilities** $\hat{p}$ and labels $y$. Use a small `eps` to avoid $\log(0)$. We will compare against PyTorch.
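The following sketch shows why the `eps` clamp is needed. With saturated probabilities (exactly 0 or 1), the unclamped formula evaluates `0 * log(0)`, which is `0 * (-inf) = nan`, even though both predictions are correct. The helper `bce` is a throwaway name used only for this illustration.

```python
import torch

probs = torch.tensor([0.0, 1.0])    # saturated predictions
targets = torch.tensor([0.0, 1.0])  # both predictions are correct

def bce(p, t):
    # Plain BCE formula with no protection against log(0).
    return -(t * torch.log(p) + (1 - t) * torch.log(1 - p)).mean()

naive = bce(probs, targets)                     # contains 0 * (-inf) = nan
eps = 1e-7
safe = bce(probs.clamp(eps, 1 - eps), targets)  # finite and close to 0

print(naive.item(), safe.item())
```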


In [None]:
def binary_cross_entropy(probs: torch.Tensor, targets: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """
    BCE computed from probabilities.
    Args:
        probs: predicted probabilities in (0, 1), shape [N] or [N,1]
        targets: binary labels 0/1, same shape as probs
    Returns:
        mean BCE loss (scalar tensor)
    """
    # TODO: implement BCE safely using eps
    probs = probs.clamp(eps, 1.0 - eps)
    loss = -(targets * torch.log(probs) + (1.0 - targets) * torch.log(1.0 - probs))
    return loss.mean()


### Tests: `binary_cross_entropy`


In [None]:
from utils import grade_bce, show_result

res = grade_bce(binary_cross_entropy)
show_result("binary_cross_entropy", res)


# 3. Train, Validation, Test Split

We split the dataset to estimate generalization and to tune hyperparameters without contaminating the test set.

- **Train**: optimize parameters
- **Validation**: choose hyperparameters and detect overfitting
- **Test**: final unbiased estimate

The test set is never touched until the very end; otherwise you leak information and inflate the final estimate.



## 3.1 Generate a 2D synthetic dataset

We create two Gaussian blobs that are (roughly) linearly separable.


In [None]:
def make_blobs(n_per_class=200, centers=((0.0, 0.0), (2.5, 2.5)), std=0.8, seed=42):
    gen = torch.Generator().manual_seed(seed)
    c0 = torch.randn(n_per_class, 2, generator=gen) * std + torch.tensor(centers[0])
    c1 = torch.randn(n_per_class, 2, generator=gen) * std + torch.tensor(centers[1])
    X = torch.cat([c0, c1], dim=0)
    y = torch.cat([torch.zeros(n_per_class, 1), torch.ones(n_per_class, 1)], dim=0)
    idx = torch.randperm(len(X), generator=gen)
    return X[idx], y[idx]

X, y = make_blobs()
print("X shape:", X.shape, "y mean:", y.float().mean().item())


## 3.2 Exercise: Implement `train_val_test_split`

Write a function that splits tensors `X, y` into `(Xtr, ytr), (Xva, yva), (Xte, yte)` given ratios.

**Requirements**
- Use a fixed seed for reproducibility
- No overlap between splits
- Return contiguous tensors
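One way to convince yourself that a split meets these requirements is sketched below on a toy index range. The sizes (6/2/2 out of 10) and the seed are arbitrary choices for illustration, and this is a check you could adapt, not the grader's actual test.

```python
import torch

n = 10
g = torch.Generator().manual_seed(0)  # fixed seed -> same permutation every run
idx = torch.randperm(n, generator=g)

n_tr, n_va = 6, 2                     # hypothetical sizes for a 60/20/20 split
idx_tr, idx_va, idx_te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

# Disjoint slices of one permutation cannot share an index.
s_tr, s_va, s_te = set(idx_tr.tolist()), set(idx_va.tolist()), set(idx_te.tolist())
no_overlap = not (s_tr & s_va or s_tr & s_te or s_va & s_te)
covers_all = (s_tr | s_va | s_te) == set(range(n))

# Re-creating the generator with the same seed reproduces the permutation.
idx_again = torch.randperm(n, generator=torch.Generator().manual_seed(0))
reproducible = torch.equal(idx, idx_again)

print(no_overlap, covers_all, reproducible)
```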


In [None]:
def train_val_test_split(X: torch.Tensor, y: torch.Tensor, ratios=(0.7, 0.15, 0.15), seed=123):
    """
    Split X, y into train/val/test.
    Returns: (Xtr, ytr), (Xva, yva), (Xte, yte)
    """
    # TODO: implement a reproducible random split
    n = len(X)
    r_tr, r_va, r_te = ratios
    assert abs(r_tr + r_va + r_te - 1.0) < 1e-6, "Ratios must sum to 1."

    g = torch.Generator().manual_seed(seed)
    idx = torch.randperm(n, generator=g)

    n_tr = int(n * r_tr)
    n_va = int(n * r_va)
    n_te = n - n_tr - n_va

    idx_tr = idx[:n_tr]
    idx_va = idx[n_tr:n_tr + n_va]
    idx_te = idx[n_tr + n_va:]

    Xtr, ytr = X[idx_tr].contiguous(), y[idx_tr].contiguous()
    Xva, yva = X[idx_va].contiguous(), y[idx_va].contiguous()
    Xte, yte = X[idx_te].contiguous(), y[idx_te].contiguous()
    return (Xtr, ytr), (Xva, yva), (Xte, yte)


### Tests: `train_val_test_split`


In [None]:
from utils import grade_split, show_result

res = grade_split(train_val_test_split, X, y, ratios=(0.6, 0.2, 0.2), seed=7)
show_result("train_val_test_split", res)


# 4. Logistic Regression in PyTorch (`nn` + `optim`)

We use a single linear layer to produce logits $z = w^\top x + b$ and train it with the numerically stable `BCEWithLogitsLoss`. Accuracy is computed by thresholding the predicted probability at 0.5.
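Before writing the loop, it can help to look at the shapes involved. In the sketch below, `nn.Linear(2, 1)` maps a batch of shape `[N, 2]` to logits of shape `[N, 1]`, and `BCEWithLogitsLoss` is given float targets of that same shape (presumably why `make_blobs` returns `y` as an `[N, 1]` float tensor). The data here are random placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X_demo = torch.randn(5, 2)  # 5 hypothetical samples, 2 features
y_demo = torch.tensor([[0.0], [1.0], [1.0], [0.0], [1.0]])  # float labels, shape [5, 1]

layer = nn.Linear(2, 1)
logits = layer(X_demo)                         # shape [5, 1]
loss = nn.BCEWithLogitsLoss()(logits, y_demo)  # scalar (mean over the batch)

print(logits.shape, y_demo.shape, loss.item())
```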



## 4.1 Exercise: Complete the training loop

Fill in the TODOs in `train_logreg` to get a working training loop.


In [None]:
class LogisticRegression(nn.Module):
    def __init__(self, in_dim=2):
        super().__init__()
        self.linear = nn.Linear(in_dim, 1)

    def forward(self, x):
        return self.linear(x)  # logits


def accuracy_from_logits(logits: torch.Tensor, targets: torch.Tensor) -> float:
    probs = torch.sigmoid(logits)
    preds = (probs >= 0.5).float()
    return preds.eq(targets).float().mean().item()


def train_logreg(Xtr, ytr, Xva, yva, lr=0.1, weight_decay=0.0, epochs=300):
    model = LogisticRegression(in_dim=Xtr.shape[1])
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)

    tr_losses, va_losses, va_accs = [], [], []

    for epoch in range(epochs):
        model.train()
        # TODO: forward on train
        logits = model(Xtr)
        loss = criterion(logits, ytr)

        # TODO: backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        tr_losses.append(loss.item())

        model.eval()
        with torch.no_grad():
            va_logits = model(Xva)
            va_loss = criterion(va_logits, yva).item()
            va_acc = accuracy_from_logits(va_logits, yva)
            va_losses.append(va_loss)
            va_accs.append(va_acc)

        if epoch % 50 == 0:
            print(f"epoch {epoch:03d}  train_loss={loss.item():.4f}  val_loss={va_loss:.4f}  val_acc={va_acc:.3f}")

    return model, tr_losses, va_losses, va_accs


# Prepare splits
(Xtr, ytr), (Xva, yva), (Xte, yte) = train_val_test_split(X, y, ratios=(0.7, 0.15, 0.15), seed=42)

model, tr_losses, va_losses, va_accs = train_logreg(Xtr, ytr, Xva, yva, lr=0.2, weight_decay=0.0, epochs=300)

print(f"Validation accuracy (last): {va_accs[-1]:.3f}")


### Tests: Training progress and validation accuracy

Checks that the training loss decreases overall and that validation accuracy clears the 0.85 threshold.


In [None]:
from utils import grade_training_progress, grade_validation_accuracy, show_result

show_result("training_progress", grade_training_progress(tr_losses))
show_result("validation_accuracy", grade_validation_accuracy(va_accs, threshold=0.85))


# 5. Visualizing the Decision Boundary

We draw the contour where the model's predicted probability equals 0.5, on top of a filled contour plot of the predicted probabilities.
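For a linear model in 2D, the 0.5 contour is the straight line $w_1 x_1 + w_2 x_2 + b = 0$, so it can also be computed directly from the weights. The sketch below uses a hypothetical helper `boundary_x2` and hand-set weights to illustrate this; it assumes $w_2 \neq 0$.

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 1)
with torch.no_grad():  # hypothetical weights for illustration
    layer.weight.copy_(torch.tensor([[1.0, 2.0]]))
    layer.bias.copy_(torch.tensor([-3.0]))

def boundary_x2(linear: nn.Linear, x1: torch.Tensor) -> torch.Tensor:
    """Solve w1*x1 + w2*x2 + b = 0 for x2 (assumes w2 != 0)."""
    w1, w2 = linear.weight[0].tolist()
    b = linear.bias.item()
    return -(w1 * x1 + b) / w2

x1 = torch.tensor([-1.0, 0.0, 1.0])
x2 = boundary_x2(layer, x1)

# Points on the boundary should have logits (numerically) equal to 0.
with torch.no_grad():
    on_line = layer(torch.stack([x1, x2], dim=1))

print(x2, on_line.squeeze(1))
```

The grid-based contour plot is more general, since it would also work for models whose boundary is not a straight line.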


In [None]:
def plot_decision_boundary(model, X, y, padding=1.0, steps=200):
    model.eval()
    with torch.no_grad():
        x_min, x_max = X[:,0].min().item() - padding, X[:,0].max().item() + padding
        y_min, y_max = X[:,1].min().item() - padding, X[:,1].max().item() + padding
        xs = torch.linspace(x_min, x_max, steps=steps)
        ys = torch.linspace(y_min, y_max, steps=steps)
        grid_x, grid_y = torch.meshgrid(xs, ys, indexing="xy")
        grid = torch.stack([grid_x.reshape(-1), grid_y.reshape(-1)], dim=1)
        logits = model(grid)
        probs = torch.sigmoid(logits).reshape(steps, steps)

        plt.figure(figsize=(6, 5))
        plt.contourf(grid_x.numpy(), grid_y.numpy(), probs.numpy(), levels=20, alpha=0.6)
        # Draw the p = 0.5 contour directly on the probabilities for a smooth boundary
        plt.contour(grid_x.numpy(), grid_y.numpy(), probs.numpy(), levels=[0.5])
        plt.scatter(X[y.squeeze()==0,0].numpy(), X[y.squeeze()==0,1].numpy(), s=12, label="class 0")
        plt.scatter(X[y.squeeze()==1,0].numpy(), X[y.squeeze()==1,1].numpy(), s=12, label="class 1")
        plt.legend()
        plt.title("Decision boundary and data")
        plt.xlabel("x1")
        plt.ylabel("x2")
        plt.show()

plot_decision_boundary(model, Xtr, ytr)


# 6. Overfitting vs Underfitting, and L2 Regularization

- **Underfitting**: the model is too simple and fails to capture structure.
- **Overfitting**: the model memorizes noise and performs poorly on new data.
- **Regularization (L2/weight decay)** shrinks parameters and can reduce overfitting.

We compare training with and without L2 on a noisy dataset.
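To make the link between `weight_decay` and L2 concrete, the sketch below takes one plain SGD step in two ways: (a) with `weight_decay` set on the optimizer, and (b) with an explicit penalty $\tfrac{\lambda}{2}\lVert\theta\rVert^2$ added to the loss. For plain SGD the two updates should match. Note that passing `model.parameters()` applies the decay to the bias as well as the weights. The toy data and hyperparameters are arbitrary.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
X_demo = torch.randn(8, 2)  # hypothetical toy data
y_demo = (X_demo.sum(dim=1, keepdim=True) > 0).float()
wd, lr = 1e-2, 0.2

model_a = nn.Linear(2, 1)
model_b = copy.deepcopy(model_a)  # identical starting parameters
criterion = nn.BCEWithLogitsLoss()

# (a) weight decay handled by the optimizer
opt_a = optim.SGD(model_a.parameters(), lr=lr, weight_decay=wd)
opt_a.zero_grad()
criterion(model_a(X_demo), y_demo).backward()
opt_a.step()

# (b) explicit penalty (wd / 2) * sum of squared parameters, no weight decay
opt_b = optim.SGD(model_b.parameters(), lr=lr)
opt_b.zero_grad()
penalty = 0.5 * wd * sum(p.pow(2).sum() for p in model_b.parameters())
(criterion(model_b(X_demo), y_demo) + penalty).backward()
opt_b.step()

same = all(torch.allclose(pa, pb, atol=1e-6)
           for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print("same update:", same)
```

This equivalence is specific to plain SGD; adaptive optimizers can treat weight decay differently.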


In [None]:
# Make a noisier dataset
X2, y2 = make_blobs(n_per_class=150, centers=((0.0, 0.0), (2.2, 2.2)), std=1.2, seed=7)
(Xtr2, ytr2), (Xva2, yva2), (Xte2, yte2) = train_val_test_split(X2, y2, seed=3)

# No regularization
m0, tr0, va0, acc0 = train_logreg(Xtr2, ytr2, Xva2, yva2, lr=0.2, weight_decay=0.0, epochs=250)
print(f"No-reg val acc: {acc0[-1]:.3f}")
plot_decision_boundary(m0, Xtr2, ytr2)

# L2 regularization
m1, tr1, va1, acc1 = train_logreg(Xtr2, ytr2, Xva2, yva2, lr=0.2, weight_decay=1e-2, epochs=250)
print(f"L2 val acc: {acc1[-1]:.3f}")
plot_decision_boundary(m1, Xtr2, ytr2)

# Compare weight norms
with torch.no_grad():
    w0 = m0.linear.weight.norm().item()
    w1 = m1.linear.weight.norm().item()
print(f"Weight norm no-reg: {w0:.3f}  vs  with L2: {w1:.3f}")


# 7. Final Evaluation on the Test Set

Evaluate your final model on the held-out test data.


In [None]:
def evaluate_on_test(model, Xte, yte):
    model.eval()
    with torch.no_grad():
        logits = model(Xte)
        loss = nn.BCEWithLogitsLoss()(logits, yte).item()
        acc = accuracy_from_logits(logits, yte)
    print(f"Test loss: {loss:.4f}, Test accuracy: {acc:.3f}")
    return loss, acc

_ = evaluate_on_test(model, Xte, yte)


# 8. Conclusion
You now have:
- A numerically stable sigmoid and BCE implementation
- A clean train / val / test workflow
- A PyTorch logistic regression model + monitoring
- Experience diagnosing underfitting vs overfitting
- Hands-on exposure to L2 regularization effects

**Great job on completing Week 2!**