# nn.Module: Building Neural Networks
## Module 2.1, Lesson 3

In this notebook, you'll build neural networks from scratch using PyTorch's `nn.Module` system.

**What you'll do:**
1. Create and inspect `nn.Linear` layers — Guided
2. Verify that `nn.Linear` is just `w @ x + b` — Guided
3. Build a 2-layer network by subclassing `nn.Module` — Supported
4. Rewrite it using `nn.Sequential` — Supported
5. Show why stacking linear layers without activations collapses — Guided
6. (Stretch) Build a skip-connection module — Independent

**For each exercise, PREDICT the output before running the cell.** Wrong predictions are more valuable than correct ones — they reveal gaps in your mental model.

---

**Prerequisites:** Tensors & Autograd lesson, basic Python classes

## 0. Setup

In [None]:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# For nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Reproducibility
torch.manual_seed(42)

print(f'PyTorch version: {torch.__version__}')

## Exercise 1: Create and Inspect nn.Linear Layers (Guided)

An `nn.Linear(in_features, out_features)` layer performs the operation `y = xW^T + b`.

**Before running, predict:**
- What shape will the weight matrix be for `nn.Linear(10, 5)`?
- What shape will the bias be?
- How many total parameters is that?

In [None]:
# Create a linear layer: 10 inputs, 5 outputs
layer = nn.Linear(10, 5)

print('=== nn.Linear(10, 5) ===')
print(f'Weight shape: {layer.weight.shape}')  # Predict this first!
print(f'Bias shape:   {layer.bias.shape}')     # Predict this first!
print(f'Weight numel: {layer.weight.numel()}')
print(f'Bias numel:   {layer.bias.numel()}')
print(f'Total params: {layer.weight.numel() + layer.bias.numel()}')
print()

# Verify: weight is (out_features, in_features) = (5, 10)
# Why transposed? Because PyTorch computes x @ W^T + b, so W is stored as (out, in)
assert layer.weight.shape == (5, 10), f'Expected (5, 10), got {layer.weight.shape}'
assert layer.bias.shape == (5,), f'Expected (5,), got {layer.bias.shape}'
print('Assertions passed!')

In [None]:
# Now try different sizes. Predict the shapes before running!
sizes = [(3, 1), (784, 128), (128, 10), (1, 100)]

print(f'{"Layer":<20s} {"Weight Shape":<18s} {"Bias Shape":<14s} {"Total Params":>12s}')
print('-' * 67)

for in_f, out_f in sizes:
    lin = nn.Linear(in_f, out_f)
    total = sum(p.numel() for p in lin.parameters())
    label = f'Linear({in_f}, {out_f})'  # pad the whole label so rows line up with the header
    print(f'{label:<20s} {str(tuple(lin.weight.shape)):<18s} '
          f'{str(tuple(lin.bias.shape)):<14s} {total:>12,}')

**Key insight:** The parameter count for `nn.Linear(in, out)` is always `in * out + out` (weights + biases). For `nn.Linear(784, 128)` that is `784 * 128 + 128 = 100,480` parameters, of which `100,352` are weights. This is why dense layers on images get expensive fast.
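
As a quick arithmetic check of the `in * out + out` rule, the sketch below counts parameters for a small stack of dense layers in plain Python. The helper name `linear_param_count` and the `784 -> 128 -> 10` layer sizes are illustrative assumptions, not part of PyTorch.

```python
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameters in a dense layer: one weight per (input, output) pair, plus one bias per output."""
    weights = in_features * out_features
    biases = out_features if bias else 0
    return weights + biases

# Hypothetical MLP for 28x28 images: 784 -> 128 -> 10
layer_sizes = [(784, 128), (128, 10)]
counts = [linear_param_count(i, o) for i, o in layer_sizes]
total = sum(counts)

for (i, o), c in zip(layer_sizes, counts):
    print(f'Linear({i}, {o}): {c:,} parameters')
print(f'Total: {total:,}')
```

Almost all of the parameters sit in the first layer, because its input is the widest.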

## Exercise 2: Verify nn.Linear IS w*x + b (Guided)

`nn.Linear` isn't magic. It computes exactly `x @ weight.T + bias`. Let's prove it.

**Before running, predict:** Will `torch.allclose()` return True or False when comparing `layer(x)` to `x @ layer.weight.T + layer.bias`?

In [None]:
# Create a layer and a random input
layer = nn.Linear(10, 5)
x = torch.randn(3, 10)  # batch of 3 samples, 10 features each

# Method 1: Use the layer directly
out_layer = layer(x)

# Method 2: Manual computation
out_manual = x @ layer.weight.T + layer.bias

print(f'Input shape:         {x.shape}')
print(f'Layer output shape:  {out_layer.shape}')
print(f'Manual output shape: {out_manual.shape}')
print()
print(f'Layer output:\n{out_layer}')
print(f'\nManual output:\n{out_manual}')
print()

# Are they the same? (a small atol absorbs float32 rounding differences)
match = torch.allclose(out_layer, out_manual, atol=1e-6)
print(f'Outputs match: {match}')
assert match, 'Outputs should match — nn.Linear is literally x @ W^T + b'
print('\nConfirmed: nn.Linear(x) == x @ weight.T + bias')

**Why this matters:** There's no hidden complexity in `nn.Linear`. When you call `layer(x)`, PyTorch computes `x @ weight.T + bias` and tracks gradients. That's it. The `nn.Module` system gives you parameter management, not computation magic.
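
The same computation is also exposed as the function `torch.nn.functional.linear`, and it acts on the last dimension of the input, so any number of leading batch dimensions passes through unchanged. A minimal sketch, where the `(2, 3, 10)` input shape is an arbitrary choice for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layer = nn.Linear(10, 5)

# Two leading dimensions (e.g. batch and sequence); only the last must equal in_features
x = torch.randn(2, 3, 10)

out_module = layer(x)
out_functional = F.linear(x, layer.weight, layer.bias)
out_manual = x @ layer.weight.T + layer.bias

print(out_module.shape)  # torch.Size([2, 3, 5])
print(torch.allclose(out_module, out_functional, atol=1e-6))
print(torch.allclose(out_module, out_manual, atol=1e-6))
```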

## Exercise 3: Build a 2-Layer nn.Module Subclass (Supported)

Now build an actual neural network by subclassing `nn.Module`. The architecture:

```
Input (10) -> Linear(10, 32) -> ReLU -> Linear(32, 1) -> Output (1)
```

**Task:**
1. Define layers in `__init__`
2. Wire them together in `forward`

Fill in the TODO sections below.

In [None]:
class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        # TODO: Define the layers
        # self.fc1 = nn.Linear(10, 32)
        # self.relu = nn.ReLU()
        # self.fc2 = nn.Linear(32, 1)
        pass  # Remove this line when you add your code

    def forward(self, x):
        # TODO: Wire the layers together
        # x = self.fc1(x)
        # x = self.relu(x)
        # x = self.fc2(x)
        # return x
        return x  # Replace this line


# Test it
model = TwoLayerNet()
x = torch.randn(5, 10)  # batch of 5 samples
out = model(x)

print(f'Input shape:  {x.shape}')    # [5, 10]
print(f'Output shape: {out.shape}')   # Should be [5, 1]
print()

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print('Parameter breakdown:')
for name, param in model.named_parameters():
    print(f'  {name:<12s} shape={str(tuple(param.shape)):<12s} params={param.numel()}')
print(f'  {"TOTAL":<12s} {"":12s} params={total_params}')
print()

# Verify
expected_params = 10 * 32 + 32 + 32 * 1 + 1  # fc1 weights + bias + fc2 weights + bias
print(f'Expected params: {expected_params}')
print(f'Actual params:   {total_params}')
assert total_params == expected_params, f'Expected {expected_params}, got {total_params}'
assert out.shape == (5, 1), f'Expected output shape (5, 1), got {out.shape}'
print('\nAll checks passed!')

<details>
<summary>Solution</summary>

The key pattern is: define layers as `self.xxx` in `__init__` (so PyTorch can find them), then call them in sequence in `forward`.

```python
class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x
```

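**Why `super().__init__()`?** It sets up the internal registries (`_parameters`, `_modules`, and so on) that `nn.Module` uses to track layers. Without it, assigning a layer to `self` raises an `AttributeError` ("cannot assign module before Module.__init__() call").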

**Why assign layers to `self`?** PyTorch uses `__setattr__` magic to detect `nn.Module` attributes. If you create a layer as a local variable, it won't be registered and won't appear in `.parameters()`.

</details>
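
To see the registration behaviour concretely, the sketch below compares three ways of storing a layer. The class names `AttrNet`, `ListNet`, and `ModuleListNet` are hypothetical and exist only for this comparison. A layer held in a plain Python list is invisible to `.parameters()`, while `nn.ModuleList` registers it.

```python
import torch.nn as nn

class AttrNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)                        # registered via __setattr__

class ListNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(4, 2)]                  # plain list: NOT registered

class ModuleListNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 2)])   # registered

def count_params(module):
    return sum(p.numel() for p in module.parameters())

attr_count = count_params(AttrNet())
list_count = count_params(ListNet())
modulelist_count = count_params(ModuleListNet())
print(attr_count, list_count, modulelist_count)
```

Unregistered layers are also skipped by `.to(device)` and `state_dict()`, so they would be neither moved to the GPU nor saved.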

**Key pattern:** Every `nn.Module` subclass follows the same structure:
- `__init__`: register layers as attributes (so PyTorch can find their parameters)
- `forward`: define the computation graph
- Never call `forward()` directly. Call the module like a function, `model(x)`, so that `__call__` can run any registered hooks around `forward`
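
One way to see the difference between `model(x)` and `model.forward(x)` is with a forward hook. In this sketch (the `calls` list and the `record` function are illustrative names), the hook fires when the module is called but not when `forward` is invoked directly:

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)
calls = []

def record(module, inputs, output):
    # Forward hooks receive the module, a tuple of inputs, and the output
    calls.append(tuple(output.shape))

handle = layer.register_forward_hook(record)
x = torch.randn(4, 3)

layer(x)                      # goes through __call__, so the hook runs
calls_after_call = len(calls)

layer.forward(x)              # bypasses __call__, so the hook is skipped
calls_after_forward = len(calls)

handle.remove()
print(calls_after_call, calls_after_forward)
```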

## Exercise 4: Convert to nn.Sequential (Supported)

For simple feed-forward networks where data flows straight through each layer, `nn.Sequential` saves you from writing a `forward` method.

**Task:** Rewrite `TwoLayerNet` as an `nn.Sequential` model. Then verify it produces the same output when given the same weights.

In [None]:
# TODO: Build the same architecture using nn.Sequential
# sequential_model = nn.Sequential(
#     nn.Linear(10, 32),
#     nn.ReLU(),
#     nn.Linear(32, 1),
# )

sequential_model = nn.Sequential()  # Replace this line with the full Sequential

print('Sequential model:')
print(sequential_model)
print()

# Verify same parameter count
seq_params = sum(p.numel() for p in sequential_model.parameters())
print(f'Sequential params: {seq_params}')
print(f'TwoLayerNet params: {total_params}')
assert seq_params == total_params, 'Parameter counts should match!'
print('Parameter counts match!')

In [None]:
# Now prove they compute the same thing with the same weights.
# Copy weights from TwoLayerNet into the Sequential model.

# The Sequential layers are indexed: 0 = first Linear, 1 = ReLU, 2 = second Linear
with torch.no_grad():
    sequential_model[0].weight.copy_(model.fc1.weight)
    sequential_model[0].bias.copy_(model.fc1.bias)
    sequential_model[2].weight.copy_(model.fc2.weight)
    sequential_model[2].bias.copy_(model.fc2.bias)

# Same input
x = torch.randn(5, 10)
out_module = model(x)
out_sequential = sequential_model(x)

print(f'Module output:     {out_module.flatten()[:5]}')
print(f'Sequential output: {out_sequential.flatten()[:5]}')
print()

match = torch.allclose(out_module, out_sequential)
print(f'Outputs match: {match}')
assert match, 'Same weights + same input should give same output'
print('\nConfirmed: nn.Module subclass and nn.Sequential are interchangeable.')
print('Sequential is just syntactic sugar for simple feed-forward architectures.')

<details>
<summary>Solution</summary>

`nn.Sequential` takes layers in order and wires them together automatically — no `forward` method needed.

```python
sequential_model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
```

The layers are accessed by index: `sequential_model[0]` is the first Linear, `sequential_model[1]` is ReLU, `sequential_model[2]` is the second Linear.

</details>
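
Conceptually, `nn.Sequential` stores its layers in order and feeds each output into the next layer. The `MiniSequential` class below is a simplified, hypothetical re-implementation for illustration only; the real class has more features, such as indexing and named layers.

```python
import torch
import torch.nn as nn

class MiniSequential(nn.Module):
    """Illustrative stand-in for nn.Sequential (not PyTorch's actual source)."""
    def __init__(self, *layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)   # ModuleList registers each layer

    def forward(self, x):
        for layer in self.layers:             # output of one layer is input to the next
            x = layer(x)
        return x

torch.manual_seed(0)
fc1, act, fc2 = nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1)

# Both containers hold the very same layer objects, so the weights are shared
mini = MiniSequential(fc1, act, fc2)
real = nn.Sequential(fc1, act, fc2)

x = torch.randn(5, 10)
same = torch.equal(mini(x), real(x))
print(same)
```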

**When to use which:**
- `nn.Sequential`: Simple pipelines where data flows straight through
- `nn.Module` subclass: Anything with branching, skip connections, multiple inputs/outputs, or conditional logic
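
When the two styles need to share weights, copying tensors one by one is not the only option. If the `nn.Sequential` layers are given the same names as the subclass attributes (by passing an `OrderedDict`), the `state_dict` keys line up and `load_state_dict` can transfer everything at once. A sketch, assuming the `TwoLayerNet` definition from the Exercise 3 solution:

```python
from collections import OrderedDict
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 1)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

torch.manual_seed(0)
model = TwoLayerNet()

# Layer names chosen to mirror the attribute names in TwoLayerNet
named_seq = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(10, 32)),
    ('relu', nn.ReLU()),
    ('fc2', nn.Linear(32, 1)),
]))

print(list(model.state_dict().keys()))
# The keys match, so the weights load without any manual indexing
named_seq.load_state_dict(model.state_dict())

x = torch.randn(5, 10)
match = torch.allclose(model(x), named_seq(x))
print(match)
```

Named layers can also be reached as attributes, for example `named_seq.fc1`, in addition to `named_seq[0]`.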

## Exercise 5: Linear Collapse Experiment (Guided)

**Claim:** Stacking linear layers without activations is pointless — the entire stack collapses to a single linear transformation.

Mathematically: if `f(x) = W2 @ (W1 @ x + b1) + b2`, that simplifies to `f(x) = (W2 @ W1) @ x + (W2 @ b1 + b2)`. One matrix multiplication, one bias. The depth adds no expressive power.
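
The same substitution extends to three layers, which is the case tested in the cells that follow. Expanding from the inside out, with $W_i$ and $b_i$ denoting the weight and bias of layer $i$:

$$
\begin{aligned}
f(x) &= W_3\bigl(W_2(W_1 x + b_1) + b_2\bigr) + b_3 \\
     &= W_3 W_2 W_1 x + W_3 W_2 b_1 + W_3 b_2 + b_3 \\
     &= \underbrace{(W_3 W_2 W_1)}_{W_{\text{combined}}}\, x + \underbrace{(W_3 W_2 b_1 + W_3 b_2 + b_3)}_{b_{\text{combined}}}
\end{aligned}
$$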

**Before running, predict:** If we collapse a 3-layer network (no activations) into a single Linear layer, will the outputs be identical?

In [None]:
# Network WITHOUT activations (should collapse)
no_activation = nn.Sequential(
    nn.Linear(10, 64),
    nn.Linear(64, 64),
    nn.Linear(64, 5),
)

# Network WITH activations (should NOT collapse)
with_activation = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 5),
)

print(f'No-activation params:   {sum(p.numel() for p in no_activation.parameters()):,}')
print(f'With-activation params: {sum(p.numel() for p in with_activation.parameters()):,}')
print('(Same parameter count — activations have no parameters)')

In [None]:
# Prove the no-activation network collapses to a single linear layer.
# Compute the collapsed weight and bias manually.

with torch.no_grad():
    W1 = no_activation[0].weight  # (64, 10)
    b1 = no_activation[0].bias    # (64,)
    W2 = no_activation[1].weight  # (64, 64)
    b2 = no_activation[1].bias    # (64,)
    W3 = no_activation[2].weight  # (5, 64)
    b3 = no_activation[2].bias    # (5,)

    # Collapse: W_combined = W3 @ W2 @ W1
    #           b_combined = W3 @ W2 @ b1 + W3 @ b2 + b3
    W_combined = W3 @ W2 @ W1
    b_combined = W3 @ W2 @ b1 + W3 @ b2 + b3

    print(f'Collapsed weight shape: {W_combined.shape}')  # (5, 10)
    print(f'Collapsed bias shape:   {b_combined.shape}')  # (5,)
    print()

    # Build a single-layer equivalent
    collapsed = nn.Linear(10, 5)
    collapsed.weight.copy_(W_combined)
    collapsed.bias.copy_(b_combined)

# Test: do the 3-layer network and the collapsed single layer give the same output?
x = torch.randn(20, 10)

with torch.no_grad():
    out_3layer = no_activation(x)
    out_collapsed = collapsed(x)

match = torch.allclose(out_3layer, out_collapsed, atol=1e-5)
print(f'3-layer output matches single-layer: {match}')
assert match, 'They should match — stacked linears collapse!'
print()
print('Confirmed: 3 linear layers with no activations = 1 linear layer.')
print('All that extra computation and parameters add zero expressive power.')

In [None]:
# Now show that this does NOT work for the network with activations.
# ReLU is nonlinear, so the composition cannot be reduced.

# Try to approximate the activated network with a single linear layer.
# We'll use least squares to find the best single-layer approximation.

torch.manual_seed(42)
x_test = torch.randn(1000, 10)

with torch.no_grad():
    y_activated = with_activation(x_test)

# Best linear fit: solve for W, b such that x @ W^T + b ≈ y
# torch.linalg.lstsq solves this least-squares problem directly
x_augmented = torch.cat([x_test, torch.ones(1000, 1)], dim=1)  # add bias column
solution = torch.linalg.lstsq(x_augmented, y_activated).solution  # (11, 5)

y_linear_approx = x_augmented @ solution

# How close is the approximation?
residual = (y_activated - y_linear_approx).pow(2).mean().sqrt()
signal = y_activated.pow(2).mean().sqrt()

print(f'RMS of activated outputs:        {signal:.4f}')
print(f'RMS of linear approximation err: {residual:.4f}')
print(f'Relative error:                  {residual / signal:.1%}')
print()

if residual / signal > 0.01:
    print('The activated network CANNOT be collapsed to a single linear layer.')
    print('This is the whole point of activation functions: they add nonlinearity.')
    print('Without them, depth is an illusion.')

**Key takeaway:** Linear layers stacked without activations collapse to a single linear transformation. Activation functions (ReLU, etc.) are what give deep networks their expressive power: without a nonlinearity between layers, extra depth cannot represent any function that one linear layer cannot.
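
A small check makes the nonlinearity concrete. A linear map without bias satisfies `f(a + b) = f(a) + f(b)`, and ReLU does not. The input values below are hand-picked for illustration so that `a + b` is zero:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
a = torch.tensor([[1.0, -2.0]])
b = torch.tensor([[-1.0, 2.0]])   # chosen so that a + b == 0

# Linear map (no bias): additivity holds
linear = nn.Linear(2, 2, bias=False)
with torch.no_grad():
    lin_additive = torch.allclose(linear(a + b), linear(a) + linear(b), atol=1e-6)

# ReLU: relu(a + b) is all zeros, but relu(a) + relu(b) is [1, 2]
relu = nn.ReLU()
relu_additive = torch.allclose(relu(a + b), relu(a) + relu(b))

print(f'Linear is additive: {lin_additive}')
print(f'ReLU is additive:   {relu_additive}')
```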

## Exercise 6: Build a Skip-Connection Module (Independent)

A **residual block** computes `output = x + f(x)` where `f` is some transformation. The skip connection (`+ x`) lets gradients flow directly through the network, which helps with training deep networks.

**Your task:** Implement `ResidualBlock` where:
- `f(x) = Linear -> ReLU -> Linear`
- The forward pass returns `x + f(x)`

**Important:** For the skip connection to work, `f(x)` must output the same shape as `x`. In this exercise both Linear layers map `dim -> dim`, which is the simplest way to guarantee that.

In [None]:
class ResidualBlock(nn.Module):
    """A simple residual block: output = x + f(x)
    
    f(x) = Linear(dim, dim) -> ReLU -> Linear(dim, dim)
    
    The skip connection adds the input directly to the output.
    """
    def __init__(self, dim):
        super().__init__()
        # TODO: Define the transformation f(x)
        # self.fc1 = nn.Linear(dim, dim)
        # self.relu = nn.ReLU()
        # self.fc2 = nn.Linear(dim, dim)
        pass  # Remove this line when you add your code

    def forward(self, x):
        # TODO: Compute f(x) and add the skip connection
        # f_x = self.fc2(self.relu(self.fc1(x)))
        # return x + f_x
        return x  # Replace this line

In [None]:
# Test the ResidualBlock
block = ResidualBlock(dim=16)
x = torch.randn(4, 16)
out = block(x)

print(f'Input shape:  {x.shape}')   # [4, 16]
print(f'Output shape: {out.shape}')  # Should be [4, 16]
assert out.shape == x.shape, f'Output shape should match input: {x.shape}'
print()

# Verify the skip connection is present.
# If we zero out all the weights in the transformation,
# the output should equal the input (identity).
with torch.no_grad():
    for p in block.parameters():
        p.zero_()

out_zeroed = block(x)
is_identity = torch.allclose(out_zeroed, x)
print(f'With zeroed weights, output == input: {is_identity}')
assert is_identity, 'When f(x)=0, the block should be identity (x + 0 = x)'
print()
print('Skip connection verified!')
print('When the transformation learns nothing, the block defaults to passing input through.')
print('This is why residual connections help training: the "easy" path is identity, not zero.')

<details>
<summary>Solution</summary>

The key insight is that the skip connection (`x + f(x)`) means the block defaults to identity when `f` learns nothing. This is why residual networks are easier to train than plain deep networks.

```python
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        f_x = self.fc2(self.relu(self.fc1(x)))
        return x + f_x
```

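**Why `dim -> dim`?** `f(x)` must have the same shape as `x` so that `x + f(x)` works. Using `dim -> dim` for both layers is the simplest way to guarantee that; a different hidden width is also fine as long as the last layer maps back to `dim`. If `f(x)` had a different shape than `x`, the addition would raise a shape error or, worse, silently broadcast.

**Why this matters:** In a plain deep network, a layer whose weights are near zero outputs something near zero and discards its input. With a skip connection, the same near-zero weights give a block that is near identity, which is a much easier starting point for a deep stack.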

</details>
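
The gradient side of this argument can be checked directly. In the sketch below, `f` has all parameters set to zero (an extreme case chosen for illustration, and `make_zero_f` is a hypothetical helper). Without the skip connection no gradient reaches the input, while with it the gradient passes through unchanged:

```python
import torch
import torch.nn as nn

def make_zero_f(dim):
    """Linear -> ReLU -> Linear with every parameter zeroed (illustration only)."""
    f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    with torch.no_grad():
        for p in f.parameters():
            p.zero_()
    return f

dim = 4
f = make_zero_f(dim)

# Plain block: out = f(x)
x_plain = torch.randn(3, dim, requires_grad=True)
f(x_plain).sum().backward()

# Residual block: out = x + f(x)
x_skip = torch.randn(3, dim, requires_grad=True)
(x_skip + f(x_skip)).sum().backward()

print(f'Plain block, max |grad| at input:    {x_plain.grad.abs().max().item()}')
print(f'Residual block, max |grad| at input: {x_skip.grad.abs().max().item()}')
```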

In [None]:
# Bonus: Stack multiple residual blocks into a deeper network
deep_residual = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    ResidualBlock(32),
    ResidualBlock(32),
    ResidualBlock(32),
    nn.Linear(32, 1),
)

x = torch.randn(8, 10)
out = deep_residual(x)

print('Deep residual network:')
print(f'  Input:  {x.shape}')
print(f'  Output: {out.shape}')
print(f'  Params: {sum(p.numel() for p in deep_residual.parameters()):,}')
print()
print('Each ResidualBlock adds depth while keeping an identity path for activations and gradients.')
print('This pattern is the foundation of ResNets, Transformers, and most modern architectures.')

---

## Key Takeaways

1. **`nn.Linear` is just `x @ W^T + b`** — no magic, just matrix math with tracked gradients
2. **`nn.Module` subclassing** is the standard pattern: define layers in `__init__`, wire them in `forward`
3. **`nn.Sequential`** is syntactic sugar for simple pipelines — use `nn.Module` when you need branching or skip connections
4. **Stacked linear layers without activations collapse** to a single linear transformation — activations are what give depth its power
5. **Skip connections** let a block default to identity, making deep networks trainable

**Everything in PyTorch builds on these primitives.** CNNs, RNNs, Transformers — they all use `nn.Module`, define layers in `__init__`, and wire them in `forward`. The only thing that changes is *which* layers and *how* they connect.