# PyTorch Expert: The Missing Manual

This is not a tutorial. This is a **Lab Bench**.
We are going to break things to understand how PyTorch *really* works.

1.  **Autograd Mechanics:** In-place operations, Leaf nodes, and Gradient Accumulation.
2.  **Module Internals:** Buffers (`register_buffer`), Hooks, and State Dicts.
3.  **Optimization:** `inference_mode` vs `no_grad`, and `torch.compile`.
4.  **Data Pipeline:** `collate_fn`, `pin_memory`, and `num_workers`.

In [28]:
import torch
import torch.nn as nn
import copy

## Section 1: The Automagic Gradient (Autograd)

### Experiment 1.1: The "In-Place" Trap
Why does `x += 1` sometimes crash your training?
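- `y = x + 1`: Creates a **NEW** tensor. Safe.
- `x += 1`: Overwrites the existing tensor's data **IN PLACE**.

Autograd saves some tensors during the forward pass so it can compute gradients later. If you overwrite one of them before calling `backward()`... **BOOM**. PyTorch also refuses in-place ops on leaf tensors that require grad outright.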

In [29]:
# SETUP: A simple operation y = x * w
x = torch.tensor([5.0], requires_grad=True)
w = torch.tensor([2.0], requires_grad=True)

# Forward pass
y = x * w

# IN-PLACE OPERATION (The Sabotage)
# We modify 'w' in place.
# 'y' saved 'w' to calculate the gradient for 'x' (dy/dx = w), so overwriting it would corrupt backward().
# Because 'w' is a leaf that requires grad, PyTorch rejects the in-place op itself:
# the error fires at 'w += 1' and y.backward() is never reached.
try:
    w += 1
    y.backward()
    print("Success? No.")
except RuntimeError as e:
    print(f"💥 CAUGHT ERROR: {e}")

# Takeaway: Avoid in-place ops (+=, *=) on tensors that require grad, unless you know exactly what you are doing.

💥 CAUGHT ERROR: a leaf Variable that requires grad is being used in an in-place operation.
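The error above comes from the leaf check, which fires at `w += 1` before `backward()` ever runs. The "overwrote a value autograd still needs" failure shows up on non-leaf tensors instead. A minimal sketch, assuming a hypothetical intermediate tensor `h` that is not part of the cell above:

```python
import torch

x = torch.tensor([5.0], requires_grad=True)
w = torch.tensor([2.0], requires_grad=True)

# 'h' is a non-leaf tensor (hypothetical name, introduced for illustration)
h = w * 1.0
y = x * h          # autograd saves 'h' here, because dy/dx = h

h += 1             # allowed on a non-leaf, but it bumps h's version counter

caught = None
try:
    y.backward()   # the saved 'h' no longer matches the version recorded in forward
except RuntimeError as e:
    caught = str(e)
    print(f"CAUGHT ERROR: {caught}")

# Out-of-place alternative: the saved tensor is never touched
h2 = w * 1.0
y2 = x * h2
h3 = h2 + 1        # creates a new tensor, h2 keeps its original value
y2.backward()
print(f"x.grad = {x.grad}")  # dy2/dx = h2, which equals w
```

The in-place version fails only at `backward()`, which is why this bug tends to surface far from the line that caused it.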


### Experiment 1.2: Gradient Accumulation (Simulating Big Batches)
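GPU out of memory? Batch size too small?
**Solution:** Don't call `optimizer.zero_grad()` every step. Run `backward()` on several small batches, then call `optimizer.step()` once.
Gradients **ACCUMULATE** (add up) in `.grad` by default. Because they are summed, divide each loss by the number of accumulation steps if you want the result to match the mean over one big batch.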

In [30]:
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Fake data
inputs = torch.randn(4, 10)
targets = torch.randn(4, 1)

# Accumulation loop: we feed one sample at a time (micro-batch size = 1)
print(f"Initial Weight Grad: {model.weight.grad}")

# Step 1
output = model(inputs[0])
loss = (output - targets[0])**2
loss.backward()
print(f"Grad after Step 1: {model.weight.grad[0][0]:.4f}")

# Step 2 (NO ZERO_GRAD)
output = model(inputs[1])
loss = (output - targets[1])**2
loss.backward()
print(f"Grad after Step 2 (Accumulated): {model.weight.grad[0][0]:.4f}")

# NOW we step
optimizer.step()
optimizer.zero_grad()

Initial Weight Grad: None
Grad after Step 1: 1.6357
Grad after Step 2 (Accumulated): -0.4097
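The two steps above sum raw per-sample gradients. To check that accumulation really reproduces a bigger batch, the sketch below compares the accumulated gradient against a single full-batch backward pass. The names `accum_steps`, `full_batch_grad`, and `accumulated_grad` are illustrative assumptions, and `nn.MSELoss` stands in for the hand-written squared error used above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()  # mean reduction over the batch

inputs = torch.randn(4, 10)
targets = torch.randn(4, 1)

# Reference: gradient of one full batch of 4
loss_fn(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()
optimizer.zero_grad()

# Accumulation: 2 micro-batches of 2 samples each
accum_steps = 2
micro_batches = zip(inputs.chunk(accum_steps), targets.chunk(accum_steps))
for step, (xb, yb) in enumerate(micro_batches, start=1):
    # Scale the loss so the summed gradients equal the full-batch mean
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()
    if step % accum_steps == 0:
        accumulated_grad = model.weight.grad.clone()
        optimizer.step()
        optimizer.zero_grad()

print(torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6))
```

Without the division by `accum_steps`, the accumulated gradient would be `accum_steps` times larger than the full-batch one, which acts like a silently inflated learning rate.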


## Section 2: Module Internals

### Experiment 2.1: Buffers (`register_buffer`)
Not all numbers in a model are Weights.
Example: `BatchNorm` tracks a running mean (`running_mean`). It needs to be saved (`state_dict`) and moved along with the model (`.to(device)`), but it **does not** need Gradient Descent.
We use `register_buffer` for this.

In [31]:
class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # Parameter: Learned by Optimizer
        self.weight = nn.Parameter(torch.randn(1))
        
        # Buffer: Saved, but NOT learned
        self.register_buffer('running_count', torch.zeros(1))
        
    def forward(self, x):
        self.running_count += 1 # We manually update this
        return x * self.weight

m = MyModule()
print(f"Parameters: {list(m.parameters())}") # Only weight shows up

# But it is in the state_dict!
print(f"State Dict Keys: {m.state_dict().keys()}")

Parameters: [Parameter containing:
tensor([1.5884], requires_grad=True)]
State Dict Keys: odict_keys(['weight', 'running_count'])
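Two more buffer behaviours are worth poking at: `persistent=False` keeps a buffer out of the `state_dict`, and buffers follow module-wide casts such as `.double()` while plain tensor attributes do not. A small sketch, using a hypothetical `Counter` module and attribute names chosen for illustration:

```python
import torch
import torch.nn as nn

class Counter(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(1))
        # Saved in state_dict, not trained
        self.register_buffer("running_count", torch.zeros(1))
        # Tracked by the module, but NOT saved in state_dict
        self.register_buffer("scratch", torch.zeros(1), persistent=False)
        # Plain attribute: invisible to state_dict, .to(), .double(), ...
        self.plain = torch.zeros(1)

m = Counter()
buffer_names = [name for name, _ in m.named_buffers()]
state_keys = list(m.state_dict().keys())
print(f"Buffers: {buffer_names}")
print(f"State Dict Keys: {state_keys}")

m.double()  # casts floating-point parameters and buffers
print(m.running_count.dtype, m.scratch.dtype, m.plain.dtype)
```

The plain attribute is the classic source of "expected all tensors to be on the same device" errors after moving a model to the GPU.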


### Experiment 2.2: Forward Hooks (The Debugger)
Need to see the shape of a tensor inside a ResNet layer? Don't edit the model's `forward()` to sprinkle in `print()` calls. Register a HOOK from the outside instead.

In [32]:
model = nn.Linear(10, 5)

def print_shape_hook(module, input, output):
    # 'input' is a tuple of the positional arguments passed to forward()
    print(f"HOOK -> Input Shape: {input[0].shape}")
    print(f"HOOK -> Output Shape: {output.shape}")

# Register hook
handle = model.register_forward_hook(print_shape_hook)

# Run model
x = torch.randn(2, 10)
y = model(x)

# Cleanup (Important!): otherwise the hook keeps firing on every forward pass
handle.remove()

HOOK -> Input Shape: torch.Size([2, 10])
HOOK -> Output Shape: torch.Size([2, 5])
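Hooks are not limited to printing. A common pattern is to capture intermediate activations into a dictionary for later inspection. A minimal sketch, where `save_activation`, `activations`, and the small `nn.Sequential` network are assumptions made for this example:

```python
import torch
import torch.nn as nn

activations = {}

def save_activation(name):
    # Factory: returns a hook that stores this layer's output under 'name'
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

net = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))

# Register one hook per child layer; nn.Sequential names them "0", "1", "2"
handles = [
    layer.register_forward_hook(save_activation(name))
    for name, layer in net.named_children()
]

net(torch.randn(2, 10))

# Remove all hooks once the activations are captured
for handle in handles:
    handle.remove()

for name, act in activations.items():
    print(f"Layer {name}: {tuple(act.shape)}")
```

Calling `detach()` inside the hook keeps the stored tensors from holding on to the autograd graph.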


## Section 3: Optimization & Speed

### Experiment 3.1: `no_grad` vs `inference_mode`
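- `torch.no_grad()`: Disables gradient tracking. Standard for validation.
- `torch.inference_mode()`: Disables gradient tracking AND view tracking / version counters. Less bookkeeping, so it **can be faster**, but tensors created inside it cannot be used in autograd later. Use this for production/inference. In the micro-benchmark below the two modes are within noise of each other; the gain depends on the workload.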

In [33]:
x = torch.randn(32, 32, requires_grad=True)
w = torch.randn(32, 32, requires_grad=True)

%timeit x @ w
with torch.no_grad(): 
    %timeit x @ w
with torch.inference_mode(): 
    %timeit x @ w

14.1 µs ± 2.88 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
7.57 µs ± 221 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
8.75 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
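Timing aside, the two modes differ in what you may do with their outputs afterwards. The sketch below creates one tensor under each mode and then tries to use both in a normal autograd computation. The variable names are illustrative assumptions.

```python
import torch

w = torch.ones(3, requires_grad=True)

with torch.no_grad():
    a = torch.ones(3) * 2      # ordinary tensor, just without a grad history
with torch.inference_mode():
    b = torch.ones(3) * 2      # "inference tensor"

print(a.is_inference(), b.is_inference())

# A no_grad tensor can take part in autograd later
(w * a).sum().backward()
print(f"w.grad = {w.grad}")

# An inference tensor cannot be saved for backward
error = None
try:
    (w * b).sum().backward()
except RuntimeError as e:
    error = str(e)
    print(f"CAUGHT ERROR: {error}")

# Workaround: clone it outside inference mode to get a normal tensor
c = b.clone()
print(c.is_inference())
```

This restriction is the price of the reduced bookkeeping, so `inference_mode` fits pure inference code paths, while `no_grad` is the safer choice when outputs flow back into training code.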


### Experiment 3.2: `torch.compile` (PyTorch 2.0)
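`torch.compile` captures your Python function as a graph and hands it to a compiler backend. The default backend (Inductor) generates Triton kernels on GPU and C++ on CPU.
The first call pays the compilation cost; later calls with the same input shapes reuse the compiled code. How much faster it gets depends on the model and the hardware.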

In [34]:
def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

# The new magic line
opt_foo = torch.compile(foo)

x = torch.randn(1000, 1000)
y = torch.randn(1000, 1000)

# First run (Compiles - might be slow)
print("Compiling...")
opt_foo(x, y)
print("Done.")

# Subsequent runs reuse the compiled code
import time
start = time.perf_counter()  # monotonic clock, better suited to benchmarking than time.time()
for _ in range(100):
    opt_foo(x, y)
print(f"Optimized Run Time: {time.perf_counter() - start:.4f}s")

Compiling...
Done.
Optimized Run Time: 0.5288s
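The number above means little without an eager baseline. The sketch below times both versions the same way and checks that they agree numerically. The `bench` helper is a hypothetical utility written for this example, and the `try`/`except` fallback is an assumption to keep the cell runnable on machines where compilation is not available (for example, no working C++ toolchain).

```python
import time
import torch

def foo(x, y):
    return torch.sin(x) + torch.cos(y)

def bench(fn, *args, iters=20):
    """Average seconds per call, after one untimed warm-up call."""
    fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

x = torch.randn(1000, 1000)
y = torch.randn(1000, 1000)

try:
    opt_foo = torch.compile(foo)
    compiled_out = opt_foo(x, y)   # first call triggers compilation
except Exception as e:
    # Assumption: fall back to eager so the comparison still runs
    print(f"torch.compile unavailable here, falling back to eager: {e}")
    opt_foo = foo
    compiled_out = foo(x, y)

eager_out = foo(x, y)
print(f"Outputs match: {torch.allclose(eager_out, compiled_out, atol=1e-5)}")
print(f"Eager:    {bench(foo, x, y) * 1e3:.3f} ms/call")
print(f"Compiled: {bench(opt_foo, x, y) * 1e3:.3f} ms/call")
```

These tensors live on the CPU. When timing GPU code, call `torch.cuda.synchronize()` before reading the clock, because CUDA kernels launch asynchronously.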


## Section 4: The Data Pipeline (The Bottleneck)

### Experiment 4.1: `collate_fn` (Handling Messy Data)
The default `collate_fn` builds a batch by calling `torch.stack` on the samples, so every tensor must have the same shape.
What if you have audio/text of different lengths?
- **FAIL:** Default collate raises a `RuntimeError`.
- **FIX:** Custom `collate_fn` to pad them to a common length.

In [35]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# Fake Dataset returning variable length tensors
dataset = [
    torch.tensor([1, 2, 3]),
    torch.tensor([4, 5]),
    torch.tensor([6, 7, 8, 9])
]

# Try default loader
try:
    loader = DataLoader(dataset, batch_size=2)
    for batch in loader:
        print(batch)
except Exception as e:
    print(f"💥 CRASH: {e}")

# DEFINE CUSTOM COLLATE
def padding_collate(batch):
    # batch is a list of tensors: [tensor([1,2,3]), tensor([4,5])]
    # We pad them to match the longest one
    return pad_sequence(batch, batch_first=True, padding_value=0)

print("\nWith Custom Collate:")
loader = DataLoader(dataset, batch_size=2, collate_fn=padding_collate)
for batch in loader:
    print(batch)
    # Now they are stacked nicely!

💥 CRASH: stack expects each tensor to be equal size, but got [3] at entry 0 and [2] at entry 1

With Custom Collate:
tensor([[1, 2, 3],
        [4, 5, 0]])
tensor([[6, 7, 8, 9]])
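Once sequences are padded, the model can no longer tell real tokens from padding. A common extension is to return the original lengths and a boolean mask alongside the padded batch. A minimal sketch, where `collate_with_lengths` and its return layout are assumptions for this example:

```python
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

dataset = [
    torch.tensor([1, 2, 3]),
    torch.tensor([4, 5]),
    torch.tensor([6, 7, 8, 9]),
]

def collate_with_lengths(batch):
    # Record the true length of each sequence before padding
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    # mask[i, j] is True where position j of sequence i holds real data
    mask = torch.arange(padded.size(1))[None, :] < lengths[:, None]
    return padded, lengths, mask

loader = DataLoader(dataset, batch_size=2, collate_fn=collate_with_lengths)
batches = list(loader)

for padded, lengths, mask in batches:
    print(padded)
    print(lengths)
    print(mask)
```

The mask matters when 0 is also a legitimate value in the data, since padding and content are then indistinguishable by value alone.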


### Experiment 4.2: Speed (`pin_memory` & `num_workers`)
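- `num_workers > 0`: Loads and preprocesses batches in worker subprocesses, so CPU-side data loading overlaps with GPU training. Usually worth it when loading is the bottleneck; tune the count for your machine.
- `pin_memory=True`: Returns batches in page-locked (pinned) host memory, which makes the copy to the GPU faster and lets it run asynchronously with `.to(device, non_blocking=True)`.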

**Rule of Thumb:**
```python
DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)
```
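Pinned memory only pays off if the transfer itself is issued asynchronously. The sketch below pairs the loader settings with `non_blocking=True` copies. The `make_loader` helper and the random `TensorDataset` are assumptions for illustration, and the demo call uses `num_workers=0` only so that it runs anywhere without a `__main__` guard.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def make_loader(dataset, batch_size=32, num_workers=4):
    # Hypothetical helper: pin memory only when a GPU is actually present
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        pin_memory=(device.type == "cuda"),
    )

# Fake data standing in for a real dataset
dataset = TensorDataset(torch.randn(128, 10), torch.randn(128, 1))

# num_workers=0 keeps this sketch portable; raise it for real workloads
loader = make_loader(dataset, num_workers=0)

n_seen = 0
for xb, yb in loader:
    # non_blocking=True lets the copy overlap with compute when the source is pinned
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    n_seen += xb.size(0)

print(f"Processed {n_seen} samples on {device}")
```

On platforms that start workers with `spawn` (Windows, macOS), code that uses `num_workers > 0` in a script needs to sit under `if __name__ == "__main__":`.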