# 1. Deleting Large Tensors

### Why it matters

PyTorch tensors are managed by Python's **reference counting**. A tensor stays in GPU memory as long as **at least one Python reference** points to it. If a variable, list, or second name still refers to a large tensor you no longer need, its VRAM remains allocated.
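
A minimal sketch of this rule, using a CPU tensor so it runs without a GPU (the counting rule is the same for CUDA tensors). The helper name `trace_lifetime` is made up for illustration; `weakref.ref` is used only to observe the tensor without keeping it alive.

```python
import weakref

import torch


def trace_lifetime():
    """Report whether a tensor is still alive after each `del`."""
    x = torch.zeros(1024, 1024)   # CPU stand-in for a large GPU tensor
    probe = weakref.ref(x)        # weak reference: does not keep the tensor alive

    y = x                         # second strong reference to the same tensor
    del x
    alive_after_del_x = probe() is not None   # y still holds the tensor

    del y
    alive_after_del_y = probe() is not None   # no strong references remain

    return alive_after_del_x, alive_after_del_y


print(trace_lifetime())
```

The tensor survives `del x` because `y` still refers to it, and is freed only after `del y`.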

### Example

In [8]:
import time
import subprocess
import torch

import timm
from torch.utils.data import TensorDataset, DataLoader
from torch.amp import autocast, GradScaler


def get_gpu_memory_from_nvidia_smi():
    """
    Returns (used, total) VRAM in MB as reported by nvidia-smi.
    Only the first GPU is read; nvidia-smi prints one line per GPU.
    """
    result = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used,memory.total',
         '--format=csv,nounits,noheader']
    )
    first_gpu = result.decode().strip().splitlines()[0]
    used, total = map(int, first_gpu.split(','))
    return used, total


def monitor(step=""):
    used, total = get_gpu_memory_from_nvidia_smi()

    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    max_alloc = torch.cuda.max_memory_allocated() / 1024**2

    print(f"\n=== {step} ===")
    print(f"nvidia-smi used      : {used:.1f} MB / {total:.1f} MB")
    print(f"PyTorch allocated    : {allocated:.1f} MB")
    print(f"PyTorch reserved     : {reserved:.1f} MB")
    print(f"PyTorch max allocated: {max_alloc:.1f} MB")


device = "cuda" if torch.cuda.is_available() else "cpu"


monitor(step="start")

x = torch.randn(20000, 20000, device='cuda')  # ~1.6 GB VRAM
y = x  # another reference


del x
monitor(step="del x")

time.sleep(5)


del y  # last reference gone → memory returns to PyTorch's caching allocator

monitor(step="del y")
torch.cuda.empty_cache()
monitor(step="calling empty_cache()")

time.sleep(5)  # sleep for 5 seconds


=== start ===
nvidia-smi used      : 872.0 MB / 4096.0 MB
PyTorch allocated    : 369.8 MB
PyTorch reserved     : 722.0 MB
PyTorch max allocated: 3428.3 MB

=== del x ===
nvidia-smi used      : 2398.0 MB / 4096.0 MB
PyTorch allocated    : 1895.8 MB
PyTorch reserved     : 2248.0 MB
PyTorch max allocated: 3428.3 MB

=== del y ===
nvidia-smi used      : 2398.0 MB / 4096.0 MB
PyTorch allocated    : 369.8 MB
PyTorch reserved     : 2248.0 MB
PyTorch max allocated: 3428.3 MB

=== calling empty_cache() ===
nvidia-smi used      : 872.0 MB / 4096.0 MB
PyTorch allocated    : 369.8 MB
PyTorch reserved     : 722.0 MB
PyTorch max allocated: 3428.3 MB
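
The jump of roughly 1526 MB between the `start` and `del x` readings can be checked by hand, since a float32 element takes 4 bytes. A small sketch of that arithmetic (the helper `tensor_size_mib` is hypothetical, not a PyTorch function; like `monitor`, it divides by `1024**2`):

```python
def tensor_size_mib(shape, bytes_per_element=4):
    """Size of a dense tensor in MiB; float32 uses 4 bytes per element."""
    numel = 1
    for dim in shape:
        numel *= dim
    return numel * bytes_per_element / 1024**2


# Same shape as `x` in the cell above.
print(f"{tensor_size_mib((20000, 20000)):.1f} MiB")  # 1525.9 MiB
```

That is 1.6 × 10⁹ bytes, which matches both the `~1.6 GB` comment and the measured increase.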




### What each line does

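* `del x` / `del y`:
  Each removes one Python reference. Once the last reference is gone (`del y` here), the tensor's memory goes back to PyTorch's caching allocator: "allocated" drops, but "reserved" and the `nvidia-smi` figure stay the same.

* `torch.cuda.empty_cache()`:
  Returns unused cached memory **back to CUDA**, so "reserved" and `nvidia-smi` usage drop. This makes the memory available to other processes; it does not give PyTorch itself more room, though it can reduce fragmentation.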

### Important

`empty_cache()` **does not free the memory used by live tensors**.
It only frees memory that PyTorch cached *for reuse*.

---

# 2. Using `torch.no_grad()` for Inference

### Why it matters

During inference, you do not need gradients.
If gradients are tracked, PyTorch records a **computation graph** during the forward pass and keeps alive:

* intermediate activations saved for the backward pass
* backward links (`grad_fn` nodes)
* gradient buffers, if `backward()` is ever called

This can easily double or triple memory usage.

### Wrong way (gradients tracked)

```python
pred = model(test_data)
```

This records a full autograd graph, so every saved activation stays in memory for as long as `pred` is alive.

### Correct way

```python
with torch.no_grad():
    predictions = model(test_data)
```

Inside this context:

* No computation graph is built.
* Intermediate activations are freed as soon as the next layer has consumed them.
* Memory use can drop by roughly **40%–70%**, depending on the model and batch size.
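
A quick way to confirm that no graph is recorded is to inspect the output's `requires_grad` and `grad_fn`. This is a minimal CPU sketch; the `nn.Linear` layer is only a stand-in for a real model.

```python
import torch
from torch import nn

model = nn.Linear(8, 2)      # hypothetical stand-in for a real model
batch = torch.randn(4, 8)

tracked = model(batch)       # autograd records the operation
with torch.no_grad():
    untracked = model(batch) # nothing is recorded

print(tracked.requires_grad, tracked.grad_fn is not None)   # True True
print(untracked.requires_grad, untracked.grad_fn is None)   # False True
```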

---

# 3. Clearing Gradients with `zero_grad(set_to_none=True)`

### PyTorch gradient clearing options

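There are two ways:

## Option A: Standard gradient zeroing

```python
model.zero_grad(set_to_none=False)
```

This:

* fills the existing gradient tensors with zeros
* keeps the memory allocated (VRAM remains occupied)

So each parameter still owns a `grad` tensor of the same size as the weight. This was the default behaviour of `zero_grad()` before PyTorch 2.0.

## Option B: Recommended: freeing gradients

```python
model.zero_grad(set_to_none=True)
```

This:

* sets each parameter's `.grad` to `None`
* **frees the GPU memory** held by the gradient tensors (it returns to PyTorch's cache)
* lets PyTorch reallocate them **only when needed** in the next backward pass

Since PyTorch 2.0, `set_to_none=True` is the default, so a bare `zero_grad()` already behaves this way.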
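
The difference between the two options can be seen on a tiny CPU model. This is a sketch; `compare_zero_grad` is a hypothetical helper name and `nn.Linear` stands in for a real network.

```python
import torch
from torch import nn


def compare_zero_grad():
    """Return (grads zero-filled after option A, grads released after option B)."""
    model = nn.Linear(4, 2)
    model(torch.randn(3, 4)).sum().backward()   # populate .grad

    model.zero_grad(set_to_none=False)          # option A: keep tensors, fill with zeros
    grad = model.weight.grad
    zero_filled = grad is not None and bool((grad == 0).all())

    model.zero_grad(set_to_none=True)           # option B: drop the tensors
    released = model.weight.grad is None

    return zero_filled, released


print(compare_zero_grad())
```

Passing `set_to_none` explicitly keeps the behaviour the same across PyTorch versions.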

### Why this matters

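Models like ConvNeXt have tens of millions of parameters, and every trainable parameter gets a gradient of the same shape and dtype, so the gradient buffers take as much VRAM as the weights themselves.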
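
The size of those gradient buffers can be estimated from the parameter list. A rough sketch, assuming one gradient per trainable parameter with the same dtype as the weight (`grad_buffer_mib` is a hypothetical helper, and the toy layer stands in for a real model):

```python
import torch
from torch import nn


def grad_buffer_mib(model: nn.Module) -> float:
    """MiB needed to hold one gradient for every trainable parameter."""
    total_bytes = sum(
        p.numel() * p.element_size()
        for p in model.parameters()
        if p.requires_grad
    )
    return total_bytes / 1024**2


toy = nn.Linear(1024, 1024)   # 1024*1024 weights + 1024 biases, float32
print(f"{grad_buffer_mib(toy):.2f} MiB")  # 4.00 MiB
```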

Comparison:

| Operation                      | Gradients exist in VRAM? | Memory usage |
| ------------------------------ | ------------------------ | ------------ |
| `zero_grad(set_to_none=False)` | Yes, as zeros            | High         |
| `zero_grad(set_to_none=True)`  | No                       | Lower        |

Especially useful when:

* running gradient accumulation (clear once per optimizer step, not between micro-batches)
* switching between training and inference
* performing validation inside the training loop

---

# 4. Bonus: How PyTorch GPU Memory Actually Works

Understanding the internal mechanism helps avoid surprises.

### PyTorch holds two kinds of GPU memory:

### 1. **Allocated memory**

Memory used by live tensors, reported by `torch.cuda.memory_allocated()`.
A tensor's share is freed only when its last reference is deleted.

### 2. **Cached memory** (PyTorch memory pool)

For performance, PyTorch **does not return memory immediately to CUDA**.
Instead, it keeps freed VRAM in a reusable pool. `torch.cuda.memory_reserved()` reports allocated plus cached memory.

This is why `nvidia-smi` often shows high usage even after deleting variables.

### To truly return it to CUDA:

```python
torch.cuda.empty_cache()
```
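
The two counters can be watched side by side. The sketch below assumes a CUDA device and returns `None` otherwise; `cache_snapshot` is a hypothetical helper and the 64 MiB tensor size is arbitrary.

```python
import torch


def cache_snapshot():
    """Return [(allocated, reserved), ...] in MiB at three points, or None without CUDA."""
    if not torch.cuda.is_available():
        return None

    mib = 1024**2

    def reading():
        return (torch.cuda.memory_allocated() / mib,
                torch.cuda.memory_reserved() / mib)

    readings = []
    x = torch.empty(16, 1024, 1024, device="cuda")  # 64 MiB of float32
    readings.append(reading())    # tensor is live: counted as allocated

    del x
    readings.append(reading())    # allocated drops, reserved keeps the cached block

    torch.cuda.empty_cache()
    readings.append(reading())    # cached block handed back to CUDA

    return readings


print(cache_snapshot())
```

On a GPU, the second reading should show lower allocated memory with reserved unchanged, and the third should show reserved dropping as well.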

---

# 5. Putting It All Together (Best Practices Template)

### Training loop best practice:

```python
for batch in train_loader:
    optimizer.zero_grad(set_to_none=True)

    outputs = model(batch["input"])
    loss = criterion(outputs, batch["target"])
    loss.backward()
    optimizer.step()

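    del outputs, loss
    # Optional: after empty_cache() the allocator must request memory from CUDA
    # again on the next step, so calling it every iteration slows training.
    torch.cuda.empty_cache()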
```

### Validation inside training:

```python
model.eval()
with torch.no_grad():
    for batch in val_loader:
        preds = model(batch["input"])
        # compute metrics
model.train()
```
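
A fuller version of the validation loop, filling in the `# compute metrics` step with top-1 accuracy. This is a sketch under assumptions: `evaluate` is a hypothetical helper, batches are `(inputs, targets)` tuples rather than the dict batches used above, and the toy model and data exist only to make it runnable on CPU.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, device="cpu"):
    """Top-1 accuracy over `loader`, computed without building autograd graphs."""
    was_training = model.training
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            preds = model(inputs).argmax(dim=1)
            correct += (preds == targets).sum().item()   # .item() keeps only a Python int
            total += targets.numel()
    model.train(was_training)   # restore the mode the caller was in
    return correct / total


model = nn.Linear(8, 3)   # toy classifier with 3 classes
loader = DataLoader(
    TensorDataset(torch.randn(32, 8), torch.randint(0, 3, (32,))),
    batch_size=8,
)
accuracy = evaluate(model, loader)
print(f"accuracy: {accuracy:.2f}")
```

Accumulating Python numbers via `.item()` instead of tensors avoids holding per-batch tensors on the GPU for the length of the loop.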

### After a very large temporary tensor:

```python
tmp = some_heavy_operation()
...
del tmp
torch.cuda.empty_cache()
```

---


# Illustration of Memory Effects at Each Step

Below is a conceptual diagram (VRAM allocation over time).
It uses **blocks** to illustrate memory held.

---

## Training Step Timeline

### 1. `optimizer.zero_grad(set_to_none=True)`

Before clearing:

```
[ weights | gradients ]
```

After clearing:

```
[ weights ]   gradients removed → VRAM freed
```

---

### 2. Forward Pass

```
[ weights | activations | temporary buffers ]
```

---

### 3. Backward Pass

Allocates gradient buffers while the saved activations are consumed and freed layer by layer; the diagram shows the peak:

```
[ weights | activations | grad buffers ]
```

---

### 4. After deleting outputs & loss

```
del outputs, loss
```

```
[ weights | grad buffers ]
```

Activations saved during the forward pass are freed as the backward pass consumes them; `del` only drops the remaining references to `outputs` and `loss`.

---

### 5. `torch.cuda.empty_cache()`

```
[ weights | grad buffers ]
unused cached blocks → returned to CUDA
```

---

## Validation Step Timeline

### 6. `with torch.no_grad():`

No graph → no activations saved for the backward pass.

```
[ weights | small forward buffers ]
```

Memory footprint drops by roughly **40–70%**, depending on the model and batch size.

---

### 7. After deleting `preds`

```
[ weights ]
```

Cleaner than training, because no gradients or activations exist.

---

# Key Takeaways

| Step                          | What It Fixes                               | Why It Matters                                             |
| ----------------------------- | ------------------------------------------- | ---------------------------------------------------------- |
| `zero_grad(set_to_none=True)` | frees gradient tensors                      | reduces persistent VRAM load                               |
| `del outputs, loss`           | drops the last references to the graph      | avoids unnecessary VRAM accumulation                       |
| `empty_cache()`               | returns cached VRAM to CUDA                 | frees memory for other processes; may reduce fragmentation |
| `no_grad()`                   | prevents graph creation                     | cuts memory usage during validation                        |
| AMP                           | smaller FP16/BF16 activations               | improves speed + memory                                    |
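
The AMP row comes down to bytes per element. A sketch of the arithmetic, assuming a hypothetical activation shape of 22 images at 3×224×224 (`activation_mib` is an illustrative helper, not a PyTorch API):

```python
import torch


def activation_mib(shape, dtype):
    """MiB needed to store one dense activation tensor of this shape and dtype."""
    numel = 1
    for dim in shape:
        numel *= dim
    bytes_per_element = torch.empty((), dtype=dtype).element_size()
    return numel * bytes_per_element / 1024**2


shape = (22, 3, 224, 224)  # hypothetical activation shape
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    print(dtype, f"{activation_mib(shape, dtype):.1f} MiB")
```

FP16 and BF16 both take half the space of FP32. Real savings are smaller than a clean halving, because autocast runs only selected operations in lower precision and the weights stay in FP32.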

---




In [7]:


model_name = "tf_efficientnetv2_s"
model = timm.create_model(model_name=model_name, pretrained=True).to(device)

# Get model configuration
cfg = model.default_cfg


C, H, W = cfg['input_size']


num_class = cfg['num_classes']
num_samples = 100
X = torch.randn(num_samples, C, H, W)
Y = torch.randint(0, num_class, (num_samples,))

dataset = TensorDataset(X, Y)
batch_size = 22

data_loader = DataLoader(batch_size=batch_size, dataset=dataset,
                         pin_memory=True, num_workers=4)


criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)


# Detect optimal dtype
if torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
    print("dtype is bf16")
else:
    dtype = torch.float16
    print("dtype is f16")


# Use GradScaler only for FP16
use_scaler = (dtype == torch.float16)
scaler = GradScaler() if use_scaler else None

epochs = 1


for epoch in range(epochs):
    model.train()

    for batch in data_loader:

        monitor("Before zero_grad")
        optimizer.zero_grad(set_to_none=True)
        monitor("After zero_grad")

        images, labels = batch
        images = images.to(device, non_blocking=True)  # pinned memory allows async copies
        labels = labels.to(device, non_blocking=True)
        monitor("After moving batch to GPU")

        # Use autocast with detected dtype (bf16 or fp16)
        with torch.amp.autocast('cuda', dtype=dtype):
            outputs = model(images)
            loss = criterion(outputs, labels)
        monitor("After forward pass")

        # Use GradScaler for FP16, regular backward for BF16
        if use_scaler:
            scaler.scale(loss).backward()
            monitor("After backward pass")
            scaler.step(optimizer)
            scaler.update()
        else:
            loss.backward()
            monitor("After backward pass")
            optimizer.step()
        monitor("After optimizer.step")

        del outputs, loss
        torch.cuda.empty_cache()
        monitor("After del + empty_cache")
torch.cuda.empty_cache()


dtype is bf16

=== Before zero_grad ===
nvidia-smi used      : 928.0 MB / 4096.0 MB
PyTorch allocated    : 111.8 MB
PyTorch reserved     : 778.0 MB
PyTorch max allocated: 3427.1 MB

=== After zero_grad ===
nvidia-smi used      : 928.0 MB / 4096.0 MB
PyTorch allocated    : 111.8 MB
PyTorch reserved     : 778.0 MB
PyTorch max allocated: 3427.1 MB

=== After moving batch to GPU ===
nvidia-smi used      : 928.0 MB / 4096.0 MB
PyTorch allocated    : 122.1 MB
PyTorch reserved     : 778.0 MB
PyTorch max allocated: 3427.1 MB

=== After forward pass ===
nvidia-smi used      : 3606.0 MB / 4096.0 MB
PyTorch allocated    : 3245.6 MB
PyTorch reserved     : 3456.0 MB
PyTorch max allocated: 3427.1 MB

=== After backward pass ===
nvidia-smi used      : 3678.0 MB / 4096.0 MB
PyTorch allocated    : 212.8 MB
PyTorch reserved     : 3528.0 MB
PyTorch max allocated: 3427.1 MB

=== After optimizer.step ===
nvidia-smi used      : 3678.0 MB / 4096.0 MB
PyTorch allocated    : 379.7 MB
PyTorch reserved     : 352