# 02c: Visual Backpropagation

In this notebook we build a toy example that traces how automatic differentiation computes gradients in a tiny feed-forward neural network. We'll keep the numbers small and the shapes simple so that every step is easy to follow.

**What to expect**

- Start with a single-variable PyTorch example that shows how autograd keeps track of derivatives.
- Hand-craft a two-layer network with deterministic weights and a single data point.
- Visualize the forward pass and the reverse-mode gradient flow with an animation that highlights which nodes and edges are active at each step.

In [1]:
%matplotlib inline

import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from matplotlib import animation
from IPython.display import HTML

torch.manual_seed(1)
plt.rcParams["figure.dpi"] = 120
torch.set_printoptions(precision=4)

## Automatic differentiation on a single value

Here is the smallest possible example of reverse-mode automatic differentiation: a scalar input, a scalar output, and a backward call that computes \(\frac{\partial y}{\partial x}\).

In [2]:
x = torch.tensor(2.0, requires_grad=True)
y = x**3 + 2 * x

y.backward()

print(f"x = {x.item():.1f}")
print(f"y = x^3 + 2x = {y.item():.3f}")
print(f"autograd dy/dx = {x.grad.item():.3f}")
print(f"analytic dy/dx = {3 * (x.detach().item() ** 2) + 2:.3f}")

# reset in case you reuse x later
x.grad.zero_()

x = 2.0
y = x^3 + 2x = 12.000
autograd dy/dx = 14.000
analytic dy/dx = 14.000


tensor(0.)
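
As an independent check on the autograd result, the same derivative can be approximated numerically. The sketch below is plain Python with no PyTorch; the helper names `f` and `central_difference` and the step size `h` are illustrative choices, not part of the notebook.

```python
def f(x):
    """Same function as the autograd example: y = x^3 + 2x."""
    return x**3 + 2 * x


def central_difference(func, x, h=1e-5):
    """Approximate d(func)/dx at x with a symmetric finite difference."""
    return (func(x + h) - func(x - h)) / (2 * h)


x0 = 2.0
numeric = central_difference(f, x0)   # numerical estimate of dy/dx at x = 2
analytic = 3 * x0**2 + 2              # closed-form derivative 3x^2 + 2

print(f"numeric  dy/dx = {numeric:.6f}")
print(f"analytic dy/dx = {analytic:.6f}")
```

The two values should agree to several decimal places. Finite differences need extra function evaluations for every input, whereas reverse-mode autograd gets all input gradients from a single backward pass, which is why it scales to networks with many parameters.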

## A tiny network to study backpropagation

Next we define a two-layer perceptron with two inputs, a ReLU hidden layer with two neurons, and a single scalar output. The weights are hand-picked for transparency.

In [3]:
class TinyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 2)
        self.layer2 = nn.Linear(2, 1)

    def forward(self, x):
        z = self.layer1(x)
        a = torch.relu(z)
        y = self.layer2(a)
        return z, a, y
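
The computation in `TinyNetwork.forward`, together with the gradients a reverse pass produces for it, can be written out explicitly. This is a sketch under assumed notation that does not appear in the notebook: \(W^{(1)}, b^{(1)}\) and \(W^{(2)}, b^{(2)}\) stand for the weights and biases of `layer1` and `layer2`, \(t\) for the target, and \(L\) for the squared error on the single scalar output (which is what the mean-squared-error loss used in the next cell reduces to for one data point).

\[
\begin{aligned}
z &= W^{(1)} x + b^{(1)}, &
a &= \max(0, z), &
y &= W^{(2)} a + b^{(2)}, &
L &= (y - t)^2, \\
\frac{\partial L}{\partial y} &= 2\,(y - t), &
\frac{\partial L}{\partial a} &= W^{(2)\top} \frac{\partial L}{\partial y}, &
\frac{\partial L}{\partial z} &= \frac{\partial L}{\partial a} \odot \mathbb{1}[z > 0], &
\frac{\partial L}{\partial x} &= W^{(1)\top} \frac{\partial L}{\partial z}.
\end{aligned}
\]

Each gradient on the second row reuses the one to its left, which is the right-to-left order in which backpropagation visits the graph.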

In [4]:
net = TinyNetwork()

with torch.no_grad():
    net.layer1.weight.copy_(torch.tensor([[0.6, -0.4], [0.1, 0.8]]))
    net.layer1.bias.copy_(torch.tensor([0.0, -0.1]))
    net.layer2.weight.copy_(torch.tensor([[0.7, -0.2]]))
    net.layer2.bias.copy_(torch.tensor([0.05]))

x = torch.tensor([[1.0, -1.0]], requires_grad=True)
target = torch.tensor([[0.25]])

z, a, y = net(x)
z.retain_grad()
a.retain_grad()
y.retain_grad()

loss = F.mse_loss(y, target)
loss.retain_grad()

loss.backward()

loss_grad = loss.grad.detach().item() if loss.grad is not None else 1.0

node_values = {
    "x1": {"value": x.detach()[0, 0].item(), "grad": x.grad[0, 0].item()},
    "x2": {"value": x.detach()[0, 1].item(), "grad": x.grad[0, 1].item()},
    "z1": {"value": z.detach()[0, 0].item(), "grad": z.grad[0, 0].item()},
    "z2": {"value": z.detach()[0, 1].item(), "grad": z.grad[0, 1].item()},
    "a1": {"value": a.detach()[0, 0].item(), "grad": a.grad[0, 0].item()},
    "a2": {"value": a.detach()[0, 1].item(), "grad": a.grad[0, 1].item()},
    "y": {"value": y.detach()[0, 0].item(), "grad": y.grad[0, 0].item()},
    "L": {"value": loss.detach().item(), "grad": loss_grad},
}

param_grads = {name: param.grad.detach().clone() for name, param in net.named_parameters()}

node_values

{'x1': {'value': 1.0, 'grad': 0.42000001668930054},
 'x2': {'value': -1.0, 'grad': -0.2800000011920929},
 'z1': {'value': 1.0, 'grad': 0.699999988079071},
 'z2': {'value': -0.800000011920929, 'grad': 0.0},
 'a1': {'value': 1.0, 'grad': 0.699999988079071},
 'a2': {'value': 0.0, 'grad': -0.20000000298023224},
 'y': {'value': 0.75, 'grad': 1.0},
 'L': {'value': 0.25, 'grad': 1.0}}
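
To see where these numbers come from, the same forward and backward pass can be worked by hand with the chain rule. The sketch below uses plain Python lists instead of tensors; the parameters and data are copied from the cell above, while the variable names (`W1`, `dL_dz`, and so on) are illustrative and not part of the notebook.

```python
# Hand-picked parameters and data, copied from the cell above.
W1 = [[0.6, -0.4], [0.1, 0.8]]
b1 = [0.0, -0.1]
W2 = [0.7, -0.2]
b2 = 0.05
x = [1.0, -1.0]
target = 0.25

# Forward pass: linear layer, ReLU, linear layer, squared error.
z = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
a = [max(0.0, zi) for zi in z]
y = sum(w * ai for w, ai in zip(W2, a)) + b2
loss = (y - target) ** 2  # MSE over a single element

# Backward pass: apply the chain rule from the loss back to the inputs.
dL_dy = 2 * (y - target)
dL_da = [dL_dy * w for w in W2]
# ReLU passes the gradient only where the pre-activation was positive.
dL_dz = [g if zi > 0 else 0.0 for g, zi in zip(dL_da, z)]
# Each input collects gradient from every hidden unit it feeds (W1 transposed).
dL_dx = [sum(W1[j][i] * dL_dz[j] for j in range(2)) for i in range(2)]

for name, value in [("z", z), ("a", a), ("dL/da", dL_da),
                    ("dL/dz", dL_dz), ("dL/dx", dL_dx)]:
    print(f"{name:6s}", [round(v, 4) for v in value])
print(f"y = {y:.4f}, loss = {loss:.4f}, dL/dy = {dL_dy:.4f}")
```

Up to floating-point rounding, the printed values should match the `value` and `grad` entries in `node_values`. Note that `a2` still receives a gradient of -0.2 from the output layer, but the ReLU blocks it, so `z2` gets 0.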

## Visualizing the flow of values and gradients

The diagram below cycles through each segment of the computation graph. Forward steps (drawn in orange) highlight how activations move left-to-right, and backward steps (drawn in blue) show how gradient signals travel right-to-left.

In [5]:
node_positions = {
    "x1": (0.0, 1.2),
    "x2": (0.0, 0.5),
    "z1": (1.4, 1.2),
    "z2": (1.4, 0.5),
    "a1": (2.8, 1.2),
    "a2": (2.8, 0.5),
    "y": (4.2, 0.85),
    "L": (5.4, 0.85),
}

edges = [
    ("x1", "z1"), ("x1", "z2"),
    ("x2", "z1"), ("x2", "z2"),
    ("z1", "a1"), ("z2", "a2"),
    ("a1", "y"), ("a2", "y"),
    ("y", "L"),
]

frame_steps = [
    {
        "stage": "forward",
        "description": "Start with the two inputs x₁ and x₂.",
        "highlight_nodes": ["x1", "x2"],
    },
    {
        "stage": "forward",
        "description": "Compute the linear combination z = Wx + b.",
        "highlight_nodes": ["z1", "z2"],
        "highlight_edges": [("x1", "z1"), ("x1", "z2"), ("x2", "z1"), ("x2", "z2")],
    },
    {
        "stage": "forward",
        "description": "Apply ReLU to get hidden activations a.",
        "highlight_nodes": ["a1", "a2"],
        "highlight_edges": [("z1", "a1"), ("z2", "a2")],
    },
    {
        "stage": "forward",
        "description": "Combine hidden units to produce the output y.",
        "highlight_nodes": ["y"],
        "highlight_edges": [("a1", "y"), ("a2", "y")],
    },
    {
        "stage": "forward",
        "description": "Measure mean-squared-error loss L.",
        "highlight_nodes": ["L"],
        "highlight_edges": [("y", "L")],
    },
    {
        "stage": "backward",
        "description": "Seed gradients: ∂L/∂L = 1 flows back from the loss node.",
        "highlight_nodes": ["L"],
    },
    {
        "stage": "backward",
        "description": "Backpropagate ∂L/∂y through the final linear layer.",
        "highlight_nodes": ["y"],
        "highlight_edges": [("a1", "y"), ("a2", "y")],
    },
    {
        "stage": "backward",
        "description": "Gradients hit the ReLU: only active neurons pass signal.",
        "highlight_nodes": ["a1", "a2"],
        "highlight_edges": [("z1", "a1"), ("z2", "a2")],
    },
    {
        "stage": "backward",
        "description": "Gradients flow to the pre-activation values z.",
        "highlight_nodes": ["z1", "z2"],
        "highlight_edges": [("x1", "z1"), ("x1", "z2"), ("x2", "z1"), ("x2", "z2")],
    },
    {
        "stage": "backward",
        "description": "The input receives ∂L/∂x, completing the reverse pass.",
        "highlight_nodes": ["x1", "x2"],
    },
]

frame_steps


In [7]:
def build_animation(node_values, frame_steps):
    fig, ax = plt.subplots(figsize=(9, 4))
    ax.axis('off')
    ax.set_xlim(-0.6, 6.0)
    ax.set_ylim(0.2, 1.9)

    edge_artists = []
    for src, dst in edges:
        xs = [node_positions[src][0], node_positions[dst][0]]
        ys = [node_positions[src][1], node_positions[dst][1]]
        line, = ax.plot(xs, ys, color='lightgray', linewidth=2, alpha=0.45, zorder=1)
        edge_artists.append((src, dst, line))

    node_artists = {}
    text_artists = {}
    for name, (xpos, ypos) in node_positions.items():
        circle = plt.Circle((xpos, ypos), radius=0.14, facecolor='#e0e0e0', edgecolor='gray', linewidth=1.0, zorder=2)
        ax.add_patch(circle)
        node_artists[name] = circle

        metrics = node_values[name]
        text_artists[name] = ax.text(
            xpos,
            ypos - 0.24,
            f"""{name}
val: {metrics['value']:.3f}
dL: {metrics['grad']:.3f}""",
            ha='center',
            va='top',
            fontsize=9,
        )

    title_text = ax.text(0.02, 0.95, "", transform=ax.transAxes, fontsize=12, weight='bold', ha='left')
    stage_text = ax.text(0.02, 0.88, "", transform=ax.transAxes, fontsize=10, ha='left', color='#424242')

    def update(frame_index):
        info = frame_steps[frame_index]
        title_text.set_text(info['description'])
        stage_text.set_text(f"Stage: {info['stage'].capitalize()}")

        highlight_nodes = set(info.get('highlight_nodes', []))
        highlight_edges = set(info.get('highlight_edges', []))

        for name, circle in node_artists.items():
            if name in highlight_nodes:
                if info['stage'] == 'forward':
                    circle.set_facecolor('#ffb74d')
                else:
                    circle.set_facecolor('#4fc3f7')
                circle.set_edgecolor('#424242')
                circle.set_linewidth(1.6)
            else:
                circle.set_facecolor('#e0e0e0')
                circle.set_edgecolor('gray')
                circle.set_linewidth(1.0)

        for src, dst, line in edge_artists:
            directed_pair = (src, dst)
            reverse_pair = (dst, src)
            if directed_pair in highlight_edges or reverse_pair in highlight_edges:
                line.set_linewidth(3.0)
                line.set_alpha(0.9)
                line.set_color('#fb8c00' if info['stage'] == 'forward' else '#0288d1')
            else:
                line.set_linewidth(2.0)
                line.set_alpha(0.45)
                line.set_color('lightgray')

        return [title_text, stage_text] + [circle for circle in node_artists.values()] + [line for _, _, line in edge_artists]

    anim = animation.FuncAnimation(
        fig,
        update,
        frames=len(frame_steps),
        interval=1400,
        blit=False,
        repeat=True,
    )
    return anim

In [8]:
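anim = build_animation(node_values, frame_steps)
anim_html = HTML(anim.to_jshtml())  # render the animation as an interactive HTML player
plt.close(anim._fig)  # close the static figure so only the player is shown (_fig is a private attribute)
anim_html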

## Parameter gradients

You can also inspect the parameter gradients that `loss.backward()` accumulated in each parameter's `.grad`. Try tweaking the weights or inputs above and re-running the notebook to see how both the numbers and the animation change.

In [9]:
for name, grad in param_grads.items():
    print(f"""{name} gradient:
{grad}
""")

layer1.weight gradient:
tensor([[ 0.7000, -0.7000],
        [ 0.0000,  0.0000]])

layer1.bias gradient:
tensor([0.7000, 0.0000])

layer2.weight gradient:
tensor([[1., 0.]])

layer2.bias gradient:
tensor([1.])
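
These parameter gradients follow directly from the node gradients: for a linear layer, the weight gradient is the outer product of the gradient arriving at the layer's output with the layer's input, and the bias gradient is the arriving gradient itself. The sketch below reproduces the printout in plain Python, using values read off the `node_values` output earlier in the notebook; the variable names are illustrative.

```python
# Values read off the node_values output for x = [1, -1].
x = [1.0, -1.0]        # input to layer1
a = [1.0, 0.0]         # input to layer2 (hidden activations)
dL_dz = [0.7, 0.0]     # gradient arriving at layer1's output
dL_dy = 1.0            # gradient arriving at layer2's output

# Weight gradient = outer product of (output gradient, layer input).
dL_dW1 = [[dz * xi for xi in x] for dz in dL_dz]
dL_dW2 = [[dL_dy * ai for ai in a]]

# Bias gradient = output gradient, since each bias is added with coefficient 1.
dL_db1 = list(dL_dz)
dL_db2 = [dL_dy]

print("layer1.weight:", dL_dW1)
print("layer1.bias:  ", dL_db1)
print("layer2.weight:", dL_dW2)
print("layer2.bias:  ", dL_db2)
```

The second row of the `layer1.weight` gradient is zero because the second hidden unit was switched off by the ReLU (one entry may print as `-0.0`, which compares equal to `0.0`). Likewise, the second entry of the `layer2.weight` gradient is zero because `a2` itself is 0.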

