# 1. Why do we need gradient accumulation?

If your GPU cannot fit a large batch in memory (for example batch size 128), you can simulate this large batch by splitting it into several smaller batches (micro-batches) and *accumulating* their gradients before updating the weights.

Instead of:

* Load batch of size $B$
* Forward
* Backward
* Optimizer step (weights update)

We do:

* Load micro-batch of size $b$
* Forward
* Backward (accumulate gradients)
* Repeat for $k$ micro-batches
* Optimizer step once

Where
$$B = k \cdot b.$$

---

# 2. Intuition

Imagine the “true” gradient for batch size 128 is:

$$
g = \frac{1}{128}\sum_{i=1}^{128} \nabla_\theta \ell(x_i)
$$

But you can only process 32 samples at once.
So you compute:

* Micro-batch 1 (32 samples) → gradient $g_1$
* Micro-batch 2 (32 samples) → gradient $g_2$
* Micro-batch 3 (32 samples) → gradient $g_3$
* Micro-batch 4 (32 samples) → gradient $g_4$

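PyTorch *adds* each new gradient into `param.grad` when you call `.backward()` repeatedly, so after four calls it holds the sum $g_1 + g_2 + g_3 + g_4$, not the average.

What you want is the average: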

$$
g = \frac{1}{4}(g_1 + g_2 + g_3 + g_4)
$$

To achieve this, **divide each micro-batch loss by `accum_steps`** (4 here) before calling `.backward()`.
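
The identity behind this step can be written out explicitly. A short derivation, assuming mean-reduced losses and four equal micro-batches $M_1, \dots, M_4$ of 32 samples each (the index sets $M_j$ are notation introduced here for illustration):

$$
g = \frac{1}{128}\sum_{i=1}^{128} \nabla_\theta \ell(x_i)
  = \frac{1}{4}\sum_{j=1}^{4} \underbrace{\frac{1}{32}\sum_{i \in M_j} \nabla_\theta \ell(x_i)}_{g_j}
  = \frac{1}{4}(g_1 + g_2 + g_3 + g_4)
$$

Each `.backward()` call on `loss / 4` therefore contributes $g_j / 4$, and the four contributions sum to $g$.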

---

# 3. ASCII Diagram (Clear Visualization)

```
   Big Batch (B = 128)
   -----------------------------------------
   | 32 | 32 | 32 | 32 |    micro-batches   |
   -----------------------------------------

   step 1: forward(b1) → backward(loss/4) → grads += g1/4
   step 2: forward(b2) → backward(loss/4) → grads += g2/4
   step 3: forward(b3) → backward(loss/4) → grads += g3/4
   step 4: forward(b4) → backward(loss/4) → grads += g4/4
   step 5: optimizer.step()  → uses g = (g1+g2+g3+g4)/4
   step 6: optimizer.zero_grad()
```

---

# 4. Correct PyTorch implementation

### Core rule

❗ Always scale the loss:

$$
\mathrm{loss\_scaled} = \frac{\mathrm{loss}}{\mathrm{accum\_steps}}
$$


so the final gradient matches the large-batch gradient.

### Full minimal example

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

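model = MyModel()  # placeholder: any nn.Module classifier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()  # default reduction="mean" averages over the micro-batch

# `dataset` is a placeholder for any Dataset yielding (image, label) pairs
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)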

accum_steps = 4   # Simulated batch size = 32 * 4 = 128
model.train()

optimizer.zero_grad()

for step, (images, labels) in enumerate(train_loader):
    outputs = model(images)
    loss = criterion(outputs, labels)
    loss = loss / accum_steps   # scale the loss

    loss.backward()             # gradients accumulate

    if (step + 1) % accum_steps == 0:
        optimizer.step()        # update weights
        optimizer.zero_grad()   # reset for next accumulation
```
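
The loop above steps only when `(step + 1) % accum_steps == 0`, so if the number of micro-batches is not a multiple of `accum_steps`, the last partial group never triggers an update. One possible way to handle this is sketched below; `should_step` is a hypothetical helper, not part of PyTorch.

```python
def should_step(step: int, num_batches: int, accum_steps: int) -> bool:
    """Return True when the optimizer should step after micro-batch `step` (0-based)."""
    is_full_group = (step + 1) % accum_steps == 0
    is_last_batch = (step + 1) == num_batches
    return is_full_group or is_last_batch


# Example: 10 micro-batches with accum_steps = 4
steps_with_update = [s for s in range(10) if should_step(s, 10, 4)]
print(steps_with_update)  # [3, 7, 9]
```

In the training loop this would replace the condition with `if should_step(step, len(train_loader), accum_steps):`. Note that the final group here contains only 2 micro-batches, so with `loss / 4` its gradient is half the usual scale; whether to rescale it or simply drop the tail is a design choice.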

---

# 5. How PyTorch accumulates gradients internally

Each `.backward()` adds to `param.grad`:

```
param.grad = param.grad + grad_from_this_backward_call
```

Gradients are **not cleared** automatically.
They persist until you reset them, typically with `optimizer.zero_grad()` (or `model.zero_grad()`).
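
This behaviour is easy to observe on a single scalar parameter. A minimal sketch (the tensor `w` and the function `3 * w` are arbitrary choices for illustration):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3.0 * w).backward()          # d(3w)/dw = 3
first = w.grad.item()         # 3.0

(3.0 * w).backward()          # a second call adds another 3
second = w.grad.item()        # 6.0

w.grad = None                 # manual reset, similar in effect to zero_grad(set_to_none=True)
print(first, second, w.grad)  # 3.0 6.0 None
```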

---

# 6. Common mistakes (and how to avoid them)

### Mistake 1

Not scaling the loss.

❌ Wrong

```python
loss.backward()
```

The accumulated gradient is the sum over micro-batches, which is `accum_steps` times larger than the large-batch gradient.

✅ Correct

```python
(loss / accum_steps).backward()
```

---

### Mistake 2

Calling `optimizer.step()` after every micro-batch.
That is ordinary small-batch training, not accumulation.

---

### Mistake 3

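Forgetting to reset gradients at the right moment.
Call `optimizer.zero_grad()` right after each `optimizer.step()`. Calling it after every micro-batch discards the accumulated gradients, and never calling it mixes gradients from earlier updates into later ones.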

---

# 7. With mixed precision / GradScaler

```python
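# torch.amp replaces the torch.cuda.amp namespace, which is deprecated in recent PyTorch releases
from torch.amp import autocast, GradScaler

scaler = GradScaler("cuda")
accum_steps = 4

optimizer.zero_grad()

for step, (x, y) in enumerate(train_loader):
    with autocast("cuda"):
        loss = criterion(model(x), y)
        loss = loss / accum_steps   # accumulation scaling, separate from GradScaler's loss scaling

    scaler.scale(loss).backward()   # scaled gradients accumulate in param.grad

    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)      # unscales gradients; skips the step if inf/NaN is found
        scaler.update()
        optimizer.zero_grad()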
```

---

# 8. How to choose accumulation steps

Use:

$$
\mathrm{accum\_steps} = \frac{\text{target batch size}}{\text{micro-batch size that fits in memory}}
$$

Example:

* model fits only batch size 16
* want effective batch size 128

`accum_steps = 128 / 16 = 8`
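
When the target is not an exact multiple of the micro-batch size, the division has to be rounded. A minimal sketch, assuming you round up and accept a slightly larger effective batch (`plan_accumulation` is a hypothetical helper name):

```python
import math


def plan_accumulation(target_batch: int, micro_batch: int) -> tuple[int, int]:
    """Return (accum_steps, effective_batch) for a desired batch size."""
    accum_steps = math.ceil(target_batch / micro_batch)
    return accum_steps, accum_steps * micro_batch


print(plan_accumulation(128, 16))  # (8, 128)
print(plan_accumulation(100, 16))  # (7, 112)
```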

---

# 9. Verifying gradient equality (mathematically)

Let $g_i$ be the mean gradient of micro-batch $i$.

In accumulation:

$$
\nabla_\theta L = \sum_{i=1}^{k} \frac{1}{k} g_i = \frac{1}{k} \sum_{i=1}^k g_i.
$$

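This equals the gradient of a single batch of size $B = k b$, provided the loss uses mean reduction and all micro-batches have the same size $b$.

If you do not divide the loss by $k$, you get:

$$
\nabla_\theta L = \sum_{i=1}^{k} g_i
$$

which is $k$ times too large.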
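
The same equality can be checked numerically. The sketch below is an illustration on random data with a small linear model and `MSELoss` (both chosen only for the example); it compares the full-batch gradient against scaled and unscaled accumulation over $k = 4$ micro-batches.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(128, 10, dtype=torch.float64)
y = torch.randn(128, 1, dtype=torch.float64)
model = nn.Linear(10, 1).double()
criterion = nn.MSELoss()  # mean reduction by default
k = 4                     # number of micro-batches (b = 32)


def accumulated_grad(scale: bool) -> torch.Tensor:
    """Accumulate gradients over k micro-batches and return the weight gradient."""
    model.zero_grad()
    for xb, yb in zip(x.chunk(k), y.chunk(k)):
        loss = criterion(model(xb), yb)
        (loss / k if scale else loss).backward()
    return model.weight.grad.clone()


# Reference: one backward pass over the full batch of 128
model.zero_grad()
criterion(model(x), y).backward()
full = model.weight.grad.clone()

scaled = accumulated_grad(scale=True)
unscaled = accumulated_grad(scale=False)

print(torch.allclose(scaled, full))        # True
print(torch.allclose(unscaled, k * full))  # True
```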

---

# 10. Summary

### Gradient accumulation simulates large batch training:

* Split batch into micro-batches.
* Scale loss by $1 / \mathrm{accum\_steps}$.
* Call `backward()` on each micro-batch.
* Step optimizer once per “full accumulated batch”.

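It’s simple and works with most models, datasets, and optimizers. One caveat: layers that depend on batch statistics, such as BatchNorm, still see only the micro-batch, so the result is not exactly identical to true large-batch training.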

---

# Gradient Accumulation for Large Batch Training

## Overview

This project demonstrates how to simulate training with large batch sizes when GPU memory is limited using **gradient accumulation**. We achieve the training dynamics of `batch_size=80` while only fitting `batch_size=10` in memory at a time.

## Problem Statement

**Goal:** Train a model with effective `batch_size=80` for better gradient estimates and training stability.

**Constraint:** GPU can only fit `batch_size=20` under normal training conditions.

**Solution:** Use gradient accumulation to simulate larger batches by accumulating gradients over multiple smaller batches before updating weights.

## How Gradient Accumulation Works

Instead of updating weights after every batch, we:

1. Perform multiple forward and backward passes
2. Accumulate gradients across these passes
3. Update weights only after N accumulation steps

**Mathematical representation:**

For `accum_steps = N`:

$$
\nabla_{\text{effective}} = \frac{1}{N} \sum_{i=1}^{N} \nabla_i
$$

This simulates the gradient of a batch that is N times larger.

## Implementation Details

### Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| **Normal batch size** | 20 | What fits without gradient accumulation |
| **Actual batch size** | 10 | Reduced to accommodate accumulation overhead |
| **Accumulation steps** | 8 | Number of batches to accumulate |
| **Effective batch size** | 80 | `10 × 8 = 80` |

### Memory Overhead Analysis

**Why reduce batch size from 20 to 10?**

Gradient accumulation keeps `param.grad` allocated across iterations, so gradient buffers occupy memory during every forward pass. Of the items below, only the first is specific to accumulation; optimizer state and activations are needed in ordinary training too:

- **Accumulated gradients:** Stored until the optimizer step
- **Optimizer state:** Maintained throughout accumulation
- **Model activations:** Kept for the backward pass of each micro-batch

**Memory trade-off:**
- ✅ Gained: 4× larger effective batch size (20 → 80)
- ❌ Cost: 50% batch size reduction (20 → 10) in this setup to fit accumulated gradients
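
The size of the accumulated-gradient term can be estimated up front: there is one gradient element per trainable parameter, stored in the parameter's dtype (FP32 under autocast). A back-of-the-envelope sketch, where `grad_memory_mib` is a hypothetical helper and the 25M parameter count is an arbitrary example, not this project's model:

```python
def grad_memory_mib(num_params: int, bytes_per_element: int = 4) -> float:
    """Approximate size of the gradient buffers in MiB (one value per parameter)."""
    return num_params * bytes_per_element / 2**20


# Hypothetical example: a 25M-parameter model with FP32 gradients
print(round(grad_memory_mib(25_000_000), 1))  # 95.4
```

Comparing this number with the activation memory of one micro-batch gives a rough idea of how much the per-step batch size may need to shrink.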

### Key Implementation Steps

```python
accum_steps = 8
optimizer.zero_grad()

for step, (images, labels) in enumerate(data_loader):
    # 1. Forward pass with mixed precision
    #    (`dtype` is torch.bfloat16 or torch.float16, selected before the loop)
    with torch.amp.autocast('cuda', dtype=dtype):
        outputs = model(images)
        loss = criterion(outputs, labels)
    
    # 2. Scale loss by accumulation steps
    loss = loss / accum_steps
    
    # 3. Backward pass (gradients accumulate in param.grad).
    #    With FP16, use scaler.scale(loss).backward() and scaler.step(optimizer) instead.
    loss.backward()
    
    # 4. Cleanup to save memory
    del outputs, loss, images, labels
    
    # 5. Update weights every N steps
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        torch.cuda.empty_cache()
```

## Memory Optimization Techniques

### 1. Mixed Precision Training
- **BF16/FP16:** Can roughly halve activation memory; under autocast, weights and gradients stay in FP32
- **Auto-detected:** Uses BF16 if supported, otherwise FP16
- **GradScaler:** Applied only for FP16 to prevent underflow
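
The auto-detection described above could look roughly like the following. This is a sketch, not the project's actual code: `pick_amp_dtype` is a hypothetical helper, and the CPU fallback branch is an assumption added so the snippet runs without a GPU.

```python
import torch


def pick_amp_dtype() -> tuple[torch.dtype, bool]:
    """Return (autocast dtype, whether a GradScaler is needed)."""
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16, False   # BF16 shares FP32's exponent range; no loss scaling
    if torch.cuda.is_available():
        return torch.float16, True     # FP16 gradients can underflow without scaling
    return torch.bfloat16, False       # assumption: CPU fallback uses BF16 autocast


dtype, use_scaler = pick_amp_dtype()
print(dtype, use_scaler)
```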

### 2. Aggressive Memory Cleanup
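- `del` intermediate tensors after use so their memory can be reused
- `torch.cuda.empty_cache()` after optimizer steps (releases cached, unused blocks; it does not free live tensors)
- `optimizer.zero_grad(set_to_none=True)` to free gradient memory (the default since PyTorch 2.0)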

### 3. DataLoader Optimization
- `pin_memory=True`: Faster CPU-to-GPU transfer
- `num_workers=4`: Parallel data loading
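
A minimal sketch of a loader configured this way. The random `TensorDataset` is a stand-in for the real dataset, and pinning is enabled only when a GPU is present, which is a choice made for this example:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset: 100 random "images" with integer labels
dataset = TensorDataset(torch.randn(100, 3, 8, 8), torch.randint(0, 10, (100,)))

loader = DataLoader(
    dataset,
    batch_size=10,                         # micro-batch size
    shuffle=True,
    num_workers=4,                         # worker processes start when iteration begins
    pin_memory=torch.cuda.is_available(),  # pinning only helps when copying to a GPU
)
print(len(loader))  # 10 micro-batches per epoch
```

With pinned memory, transfers such as `images.to(device, non_blocking=True)` can overlap with computation.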

## Results

### Batch Size Comparison

| Method | Batch Size | Updates/Epoch | Effective Batch | Memory Usage |
|--------|-----------|---------------|-----------------|--------------|
| **Baseline** | 20 | 5 | 20 | ~2.0 GiB |
| **Gradient Accumulation** | 10 | 1.25 | 80 | ~2.0 GiB |

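*(100 samples total: 100/20 = 5 updates vs. 100/10/8 = 1.25, i.e. one full update per epoch plus 2 leftover micro-batches that do not trigger an update on their own)*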
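
The arithmetic behind the table can be reproduced with a few lines. A small sketch, assuming the last partial batch is kept (`drop_last=False`); `updates_per_epoch` is a hypothetical helper:

```python
def updates_per_epoch(num_samples: int, batch_size: int, accum_steps: int = 1) -> tuple[int, int]:
    """Return (full optimizer steps, leftover micro-batches) for one epoch."""
    num_batches = -(-num_samples // batch_size)  # ceiling division
    return divmod(num_batches, accum_steps)


print(updates_per_epoch(100, 20))      # (5, 0)  baseline
print(updates_per_epoch(100, 10, 8))   # (1, 2)  gradient accumulation
```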

### Benefits Achieved

✅ **4× larger effective batch size** (20 → 80)
✅ **Better gradient estimates** from larger batches
✅ **More stable training** with reduced gradient noise
✅ **Same memory footprint** as baseline training

### Trade-offs

❌ **Slower training:** Fewer weight updates per epoch
❌ **Increased complexity:** More hyperparameters to tune
❌ **Batch size reduction:** 50% smaller per-step batches

## When to Use Gradient Accumulation

### Good Use Cases ✅
- Training with very large batch sizes (>128)
- Limited GPU memory but need stable gradients
- Reproducing results from papers with large batches
- Distributed training simulation on single GPU

### Not Recommended ❌
- Already fitting desired batch size
- Training with small models
- When training speed is critical
- Inference workloads (no gradients are computed, so accumulation does not apply)

## Running the Code

```bash
# Activate environment
conda activate PyTorchTutorial

# Run the script
python src/gpu_optimization_and_performance/gradient_accumulation/scripts/main.py
```

## Key Takeaways

1. **Gradient accumulation simulates larger batches** without requiring more memory for activations
2. **The per-step batch size had to be reduced** by ~50% in this setup to accommodate gradient accumulation overhead; the exact cost depends on the model
3. **Loss must be scaled** by `1/accum_steps` to maintain correct gradient magnitudes
4. **Mixed precision helps** maximize memory efficiency
5. **The effective batch size** is `batch_size × accum_steps`

## References

- [PyTorch Automatic Mixed Precision](https://pytorch.org/docs/stable/amp.html)
- [Don't Decay the Learning Rate, Increase the Batch Size](https://arxiv.org/abs/1711.00489)
- [Memory-Efficient Training Techniques](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html)

