This notebook demonstrates:

- The use of PyTorch for automatic differentiation.
- How to compute gradients of functions using `torch.autograd`.
- Verification of manually derived gradients against those computed via autograd.
- Working with multi-dimensional tensors in gradient computations.


In [1]:
import torch

torch.set_default_dtype(torch.float64)

### Scalar function example

Use `torch.autograd.grad` to compute the gradient of a scalar function for a whole batch of inputs at once.

Setting the `grad_outputs` argument to `torch.ones_like(y)` computes the
vector-Jacobian product of the seed vector $( 1, \ldots, 1 )^\top$
with the Jacobian of the batched output $( f(x_1), \ldots, f(x_N) )^\top$ with respect to the inputs.

Since each output $f(x_i)$ depends only on its own sample $x_i$, the Jacobian is block-diagonal, and the $i$-th row of the vector-Jacobian product is
$\dfrac{\partial f}{\partial x_i}(x_i)$, the gradient of $f$ at $x_i$.


In [2]:
def f(x: torch.Tensor) -> torch.Tensor:
    return x[:,0]*x[:,1] + torch.sin(x[:,0]*x[:,1]**2)

def df_dx(x: torch.Tensor) -> torch.Tensor:
    return torch.stack(
        (
            x[:,1] + x[:,1]**2*torch.cos(x[:,0]*x[:,1]**2), 
            x[:,0] + 2*x[:,0]*x[:,1]*torch.cos(x[:,0]*x[:,1]**2)
        ),
        dim=1
    )

N = 1000

x = 2 * torch.rand(N, 2) - 1
x.requires_grad = True

y = f(x)

# returns the same values that y.backward(torch.ones_like(y)) would accumulate into x.grad
grad_x = torch.autograd.grad(
    outputs=y,
    inputs=x,
    grad_outputs=torch.ones_like(y)
)[0]

dfdx = df_dx(x)

print("Are gradients correct?", torch.allclose(dfdx, grad_x))


Are gradients correct? True
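
The block-diagonal argument above can be checked directly. The following is a minimal sketch, assuming a small batch; the names `g`, `x_demo`, `grad_seed`, `grad_sum`, and `row0` are illustrative and not part of the cell above. It shows that seeding with ones gives the same result as differentiating `y.sum()`, and that a single sample only contributes to its own row.

```python
import torch

def g(x: torch.Tensor) -> torch.Tensor:
    # Same formula as f above, repeated so the sketch is self-contained
    return x[:, 0] * x[:, 1] + torch.sin(x[:, 0] * x[:, 1] ** 2)

x_demo = (2 * torch.rand(5, 2, dtype=torch.float64) - 1).requires_grad_()
y_demo = g(x_demo)

# Vector-Jacobian product with a seed vector of ones
grad_seed = torch.autograd.grad(
    y_demo, x_demo, grad_outputs=torch.ones_like(y_demo), retain_graph=True
)[0]

# Differentiating the sum of the outputs gives the same tensor
grad_sum = torch.autograd.grad(y_demo.sum(), x_demo)[0]
seed_matches_sum = torch.allclose(grad_seed, grad_sum)

# Differentiating only the first sample's output touches only row 0
row0 = torch.autograd.grad(g(x_demo[:1]).sum(), x_demo)[0]

print(seed_matches_sum, torch.allclose(row0[0], grad_seed[0]))
```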


### Vector function example

Use `torch.autograd.grad` to compute the partial derivatives of a batched function $f(x,y)$ with respect to two separate input tensors.

Here $x$ and $y$ are separate leaf tensors, and both are passed in `inputs`.

`torch.autograd.grad` then returns one gradient per input, in the same order:
the derivative of $f$ with respect to $x$ and the derivative of $f$ with respect to $y$.

In [3]:
def f(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x*y + torch.sin(x*y**2)

def df_dx(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return torch.stack(
        (
            y + y**2 * torch.cos(x*y**2),
            x + 2*x*y * torch.cos(x*y**2)
        ),
        dim=1
    )

N = 1000

x = (2*torch.rand(N) - 1).requires_grad_()
y = (2*torch.rand(N) - 1).requires_grad_()

out = f(x, y)


grad_x, grad_y = torch.autograd.grad(
    outputs=out,
    inputs=(x, y),
    grad_outputs=torch.ones_like(out)
)

J = df_dx(x, y)

print("dx correct?", torch.allclose(J[:,0], grad_x))
print("dy correct?", torch.allclose(J[:,1], grad_y))


dx correct? True
dy correct? True


### Jacobian example

Use `torch.autograd.grad` to compute the full Jacobian of a vector-valued function for a batch of inputs.

In order to compute the Jacobian, we loop over the output dimensions and:

1. set the corresponding column of the `grad_outputs` tensor to one and the rest to zero,
2. set `retain_graph=True` so that the graph can be reused in the next iteration;
otherwise its intermediate buffers are freed after the first call and the second call fails.

In [4]:
def f(x: torch.Tensor) -> torch.Tensor:
    x1 = x[:, 0]
    x2 = x[:, 1]

    y1 = x1 * x2
    y2 = torch.sin(x1 + x2**2)
    y3 = x1**2 - 3 * x2

    return torch.stack((y1, y2, y3), dim=1)


def df_dx(x: torch.Tensor) -> torch.Tensor:
    x1 = x[:, 0]
    x2 = x[:, 1]

    dy1_dx1 = x2
    dy1_dx2 = x1

    dy2_dx1 = torch.cos(x1 + x2**2)
    dy2_dx2 = 2 * x2 * torch.cos(x1 + x2**2)

    dy3_dx1 = 2 * x1
    dy3_dx2 = -3 * torch.ones_like(x2)

    J = torch.stack(
        (
            torch.stack((dy1_dx1, dy1_dx2), dim=1),
            torch.stack((dy2_dx1, dy2_dx2), dim=1),
            torch.stack((dy3_dx1, dy3_dx2), dim=1),
        ),
        dim=1
    )

    return J


N = 1000

x = (2 * torch.rand(N, 2) - 1).requires_grad_()

y = f(x)

# ---- Jacobian using autograd.grad instead of backward ----
Js = torch.zeros(N, 3, 2)

for k in range(3):     # 3 outputs
    g = torch.zeros_like(y) 
    g[:, k] = 1.0
    grad_x = torch.autograd.grad(
        outputs=y,
        inputs=x,
        grad_outputs=g,
        retain_graph=True
    )[0]
    Js[:, k, :] = grad_x


dfdx = df_dx(x)

print("Shapes:", dfdx.shape, Js.shape)
print("Are Jacobians correct?", torch.allclose(dfdx, Js))


Shapes: torch.Size([1000, 3, 2]) torch.Size([1000, 3, 2])
Are Jacobians correct? True
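
As a cross-check on the loop above, the same batched Jacobian can be sketched with `torch.func` by composing `jacrev` with `vmap`. This assumes PyTorch 2.0 or later, where `torch.func` is available. The per-sample function `f_single` and the names `x_small`, `J_func`, and `J_ref` are illustrative and do not appear in the original cell.

```python
import torch
from torch.func import jacrev, vmap

def f_single(x: torch.Tensor) -> torch.Tensor:
    # Per-sample version of f: input of shape (2,), output of shape (3,)
    x1, x2 = x[0], x[1]
    return torch.stack((x1 * x2, torch.sin(x1 + x2**2), x1**2 - 3 * x2))

x_small = 2 * torch.rand(8, 2, dtype=torch.float64) - 1

# jacrev differentiates one sample; vmap maps it over the batch dimension
J_func = vmap(jacrev(f_single))(x_small)  # shape (8, 3, 2)

# Analytical Jacobian, using the same formulas as df_dx above
x1, x2 = x_small[:, 0], x_small[:, 1]
c = torch.cos(x1 + x2**2)
J_ref = torch.stack(
    (
        torch.stack((x2, x1), dim=1),
        torch.stack((c, 2 * x2 * c), dim=1),
        torch.stack((2 * x1, -3 * torch.ones_like(x2)), dim=1),
    ),
    dim=1,
)

print(J_func.shape, torch.allclose(J_func, J_ref))
```

This avoids the explicit loop and `retain_graph=True`, at the cost of writing the function for a single sample.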


In [9]:
def f(x, y):
    y1 = x * y
    y2 = torch.sin(x + y**2)
    y3 = x**2 - 3*y
    return torch.stack((y1, y2, y3), dim=1)


def df_dxdy(x, y):
    dy1_dx = y
    dy1_dy = x

    dy2_dx = torch.cos(x + y**2)
    dy2_dy = 2*y * torch.cos(x + y**2)

    dy3_dx = 2*x
    dy3_dy = -3 * torch.ones_like(y)

    J = torch.stack(
        (
            torch.stack((dy1_dx, dy1_dy), dim=1),
            torch.stack((dy2_dx, dy2_dy), dim=1),
            torch.stack((dy3_dx, dy3_dy), dim=1),
        ),
        dim=1
    )
    return J


N = 1000
x = (2*torch.rand(N) - 1).requires_grad_()
y = (2*torch.rand(N) - 1).requires_grad_()

out = f(x, y)      # shape [N, 3]

Js = torch.zeros(N, 3, 2)

for k in range(3):
    g = torch.zeros_like(out)
    g[:, k] = 1.0
    gx, gy = torch.autograd.grad(out, (x, y), g, retain_graph=True)
    Js[:, k, 0] = gx
    Js[:, k, 1] = gy

J_true = df_dxdy(x, y)

print("Correct?", torch.allclose(Js, J_true))


Correct? True


# Practical differences between `torch.autograd.grad` and `.backward()`

## ✅ Benefits of `torch.autograd.grad` over `.backward()`

### 1. `autograd.grad` RETURNS the gradient — it does NOT write into `.grad`

```python
gx = torch.autograd.grad(y, x)[0]   # y must be a scalar; otherwise pass grad_outputs
```

vs.

```python
y.backward()
print(x.grad)   # must read from .grad
```

With `.backward()`, you must manage:

- `x.grad` accumulation  
- resetting: `x.grad = None` or `x.grad.zero_()`

With `autograd.grad`, no accumulation ever happens.  
The gradient is clean, isolated, and directly usable.

➡️ Cleaner and safer.
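
The difference can be seen in a minimal sketch, assuming a small leaf tensor `x_acc` (an illustrative name): two `.backward()` calls without a reset add up in `.grad`, while `autograd.grad` returns the plain gradient and leaves `.grad` untouched.

```python
import torch

x_acc = torch.tensor([1.0, 2.0], requires_grad=True)

# Two backward passes without a reset: the second adds to the first
(x_acc**2).sum().backward()
(x_acc**2).sum().backward()
accumulated = x_acc.grad.clone()  # 2x + 2x = 4x

# autograd.grad returns the gradient and does not modify x_acc.grad
fresh = torch.autograd.grad((x_acc**2).sum(), x_acc)[0]  # 2x

print(accumulated, fresh, x_acc.grad)
```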

---

### 2. `autograd.grad` works naturally with multiple inputs

```python
gx, gy = torch.autograd.grad(out, (x, y))   # out must be a scalar; otherwise pass grad_outputs
```

`.backward()` cannot return multiple gradients.  
It only fills `.grad` fields.

You must manually read `x.grad`, `y.grad`, reset them, etc.

➡️ Better when you have multiple inputs.

---

### 3. `autograd.grad` gives you Jacobian rows without side effects

To compute a Jacobian row cleanly (for a batched `y` of shape `[N, m]`, summing output `k` over the batch yields a scalar):

```python
gx = torch.autograd.grad(y[:, k].sum(), x, retain_graph=True)[0]
```

No need to clear `.grad`.

Using `.backward()`:

```python
y[:, k].sum().backward(retain_graph=True)
J[:, k, :] = x.grad
x.grad = None
```

You must reset every time.

➡️ `autograd.grad` is cleaner: no state, no mutation.

---

### 4. `autograd.grad` is functional; `.backward()` is stateful

Functional = predictable, easy to reason about  
Stateful = `.grad` persists or accumulates

➡️ In analytical gradient checks and research code, functional is preferred.

---

### 5. `autograd.grad` is safer inside loops

Inside loops:

- `.backward()` accumulates — you must reset manually  
- `autograd.grad` returns fresh gradients — no accumulation ever

➡️ Fewer bugs.

---

### 6. `autograd.grad` is better for custom loss constructions

You can build arbitrary VJPs:

```python
vjp = torch.autograd.grad(y, x, grad_outputs=v)[0]
```

`.backward()` forces writing into `.grad`, which can conflict with training loops.

➡️ Essential for advanced differentiable programming.
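
For a linear map the VJP has a closed form, which makes a convenient sanity check. The sketch below assumes an illustrative matrix `A` and seed vector `v` (neither is from the notebook): for `y = A @ x`, the VJP with `v` is `A.T @ v`.

```python
import torch

A = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
x_lin = torch.tensor([1.0, -1.0], requires_grad=True)
v = torch.tensor([1.0, 0.0, 2.0])

y_lin = A @ x_lin  # Jacobian of y_lin with respect to x_lin is A

# v^T J, returned directly; x_lin.grad stays None
vjp = torch.autograd.grad(y_lin, x_lin, grad_outputs=v)[0]

print(vjp, A.T @ v)
```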

---

## ❌ When `.backward()` is better

`.backward()` is the right choice when you are **training a model with an optimizer**.

Why?

### 1. Optimizers expect gradients to be in `param.grad`
PyTorch optimizers (SGD, Adam, RMSProp, etc.) read gradients **directly** from  
each parameter’s `.grad` field:

```python
loss.backward()    # fills p.grad for every parameter
optimizer.step()   # optimizer uses p.grad internally
```

`torch.autograd.grad` does NOT fill `param.grad`.  
You would have to assign every gradient manually.  
This is error-prone and completely unnecessary.

---

### 2. `.backward()` handles ALL parameters automatically
A neural network can have thousands/millions of parameters.  
`.backward()` computes and stores gradients for **all of them at once**.

With `torch.autograd.grad`, you would need:

```python
grads = torch.autograd.grad(loss, model.parameters())
for p, g in zip(model.parameters(), grads):
    p.grad = g     # manual wiring
```

`.backward()` does this wiring for you.

---

### 3. `.backward()` supports gradient accumulation
Training often accumulates gradients over several mini-batches before each optimizer step:

```python
loss_1.backward()   # adds to existing p.grad
loss_2.backward()   # adds again (loss_2 comes from a separate forward pass)
optimizer.step()
optimizer.zero_grad()
```

`autograd.grad` NEVER accumulates.  
It always returns a fresh gradient, which is lost unless you store it yourself.

Gradient accumulation into `.grad` is used by, or commonly combined with:

- multi-GPU training  
- large batch emulation  
- gradient checkpointing  
- truncated BPTT  
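
Large batch emulation can be sketched as follows. This is a minimal example under stated assumptions: a small `torch.nn.Linear` model, random data, and a sum-reduced loss so that the micro-batch losses add up to the full-batch loss. The names `model`, `X`, `t`, `full`, and `acc` are illustrative.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1).double()
X = torch.randn(8, 3, dtype=torch.float64)
t = torch.randn(8, 1, dtype=torch.float64)
loss_fn = torch.nn.MSELoss(reduction="sum")

# Gradient from one pass over the full batch
model.zero_grad()
loss_fn(model(X), t).backward()
full = model.weight.grad.clone()

# Gradient accumulated over two micro-batches, without resetting in between
model.zero_grad()
for Xb, tb in zip(X.chunk(2), t.chunk(2)):
    loss_fn(model(Xb), tb).backward()  # adds to model.weight.grad
acc = model.weight.grad.clone()

print(torch.allclose(full, acc))
```

With a mean-reduced loss, each micro-batch loss would need to be scaled by its share of the full batch to get the same result.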

---

### 4. `.backward()` integrates with the entire PyTorch training ecosystem
Most PyTorch training tools are built around gradients that `.backward()` has stored in `param.grad`:

- optimizers  
- schedulers  
- AMP (automatic mixed precision)  
- DDP (DistributedDataParallel)  
- hooks on `.grad`  
- gradient clipping  

Using `autograd.grad` bypasses most of this infrastructure.

---

### 5. `.backward()` is designed for the standard training loop
It matches the classic workflow:

```python
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

This pattern is universal, stable, and deeply integrated in PyTorch’s design.
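
A minimal end-to-end sketch of this loop, assuming a toy linear regression problem (the data `X`, `t`, the weights `w_true`, and the learning rate are illustrative choices, not taken from the notebook):

```python
import torch

torch.manual_seed(0)
X = torch.randn(64, 3, dtype=torch.float64)
w_true = torch.tensor([[1.0], [-2.0], [0.5]], dtype=torch.float64)
t = X @ w_true  # noiseless linear targets

model = torch.nn.Linear(3, 1).double()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

losses = []
for step in range(200):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(X), t)  # forward pass
    loss.backward()              # fill p.grad for every parameter
    optimizer.step()             # update parameters from p.grad
    losses.append(loss.item())

print(losses[0], losses[-1])
```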

---

### Summary

Use `.backward()` for **model training**, because `.backward()`:

- fills `param.grad` automatically  
- accumulates gradients correctly  
- integrates with optimizers  
- works with AMP, DDP, schedulers, clipping, hooks  
- handles all parameters of a large model in a single call  

`torch.autograd.grad` is powerful and precise, but it is **not** a replacement  
for `.backward()` in training loops.  


In [6]:
grad_x, grad_y = torch.autograd.grad( outputs=out, inputs=(x, y), grad_outputs=torch.ones_like(out) )
grad_xx, = torch.autograd.grad( outputs=grad_x, inputs=(x, ), grad_outputs=torch.ones_like(grad_x) )

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
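
The error above arises because `grad_x` was computed without `create_graph=True`, so it carries no graph of its own and cannot be differentiated again. A minimal sketch of the second derivative, assuming the two-input function $f(x,y) = xy + \sin(xy^2)$ from the earlier cell and fresh illustrative tensors `x2nd` and `y2nd`:

```python
import torch

x2nd = (2 * torch.rand(10, dtype=torch.float64) - 1).requires_grad_()
y2nd = (2 * torch.rand(10, dtype=torch.float64) - 1).requires_grad_()
out2 = x2nd * y2nd + torch.sin(x2nd * y2nd**2)

# create_graph=True records the backward pass so the result is differentiable
gx, gy = torch.autograd.grad(
    outputs=out2,
    inputs=(x2nd, y2nd),
    grad_outputs=torch.ones_like(out2),
    create_graph=True,
)

# Second derivative of f with respect to x
gxx, = torch.autograd.grad(
    outputs=gx, inputs=(x2nd,), grad_outputs=torch.ones_like(gx)
)

# Analytical value: d/dx [y + y^2 cos(x y^2)] = -y^4 sin(x y^2)
ref = (-y2nd**4 * torch.sin(x2nd * y2nd**2)).detach()

print(torch.allclose(gxx, ref))
```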