# Automatic Mixed Precision (AMP)


## 1. What is `torch.amp.autocast`

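`torch.amp.autocast` (also available as `torch.autocast`) is a context manager that **runs operations in mixed precision**: a lower-precision dtype (`float16` or `bfloat16`) for some operations and `float32` for others. This usually improves performance and reduces GPU memory usage while keeping training numerically stable.

It runs an operation in lower precision **when it is safe** (e.g., matrix multiplications and convolutions) and in full precision **when it is necessary** (e.g., reductions, softmax, and loss functions).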
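
The effect on dtypes can be checked directly. The following is a minimal sketch, assuming CPU autocast with `bfloat16` so that it runs without a GPU; the layer sizes are arbitrary.

```python
import torch
from torch import nn

model = nn.Linear(8, 4)      # parameters are stored in float32
x = torch.randn(2, 8)        # float32 input

# Assumption: CPU + bfloat16 so the sketch runs without a GPU.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y_amp = model(x)         # linear layers are autocast to bfloat16

y_full = model(x)            # outside the context: ordinary float32

print(model.weight.dtype)          # torch.float32 (parameters are not converted)
print(y_amp.dtype, y_full.dtype)   # torch.bfloat16 torch.float32
```

Autocast changes the dtype in which an operation executes, not the dtype in which the model's parameters are stored.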

---

## 2. Basic Syntax

In PyTorch 1.10 and later, autocast takes the device type as an argument:

```python
import torch
from torch import autocast

with autocast(device_type='cuda', dtype=torch.float16):
    output = model(input)
    loss = criterion(output, target)
```

or, importing from `torch.amp` (on CUDA the dtype defaults to `float16`):

```python
from torch.amp import autocast

with autocast('cuda'):
    output = model(input)
    loss = criterion(output, target)
```

---

## 3. When to Use

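**Wrap only the forward pass**, i.e., the `model(input)` call and the loss computation.
Call `backward()` outside the autocast block. When training with `float16`, pair it with `torch.amp.GradScaler` (`torch.cuda.amp.GradScaler` in older releases) so that small gradients do not underflow.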

Typical usage is inside your training loop.

---

## 4. Complete Example (Training Loop)

```python
import torch
from torch import nn, optim
from torch.amp import autocast, GradScaler

model = nn.Linear(512, 10).cuda()
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

scaler = GradScaler()  # scales loss to avoid underflow in float16

for epoch in range(10):
    # `dataloader` is assumed to yield (input, target) batches
    for input, target in dataloader:
        input, target = input.cuda(), target.cuda()

        optimizer.zero_grad()

        # Mixed precision forward + loss computation
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = criterion(output, target)

        # Scaled backward
        scaler.scale(loss).backward()

        # Step optimizer and update scaler
        scaler.step(optimizer)
        scaler.update()
```

---

## 5. Explanation of Each Step

| Step  | Code                            | Description                                     |
| ----- | ------------------------------- | ----------------------------------------------- |
| **1** | `with autocast(...)`            | Forward pass in mixed precision.                |
| **2** | `loss = criterion(...)`         | Cross-entropy runs in `float32` under autocast. |
| **3** | `scaler.scale(loss).backward()` | Scales loss so `float16` grads don't underflow. |
| **4** | `scaler.step(optimizer)`        | Unscales grads; skips the step on `inf`/`NaN`.  |
| **5** | `scaler.update()`               | Adjusts scaling dynamically for next iteration. |
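
To see why step 3 matters, consider what happens to a very small gradient value stored in `float16`. This is a minimal sketch using plain tensors rather than a real backward pass; the value `1e-8` is chosen for illustration and the scale `2**16` is assumed to match `GradScaler`'s default initial scale.

```python
import torch

true_grad = torch.tensor(1e-8)        # a small gradient value, held in float32
scale = 2.0 ** 16                     # assumed scale factor (65536)

# Without scaling: the value is below float16's smallest subnormal (~6e-8)
lost = true_grad.to(torch.float16)

# With scaling: scale first, store in float16, then unscale in float32
kept = (true_grad * scale).to(torch.float16)
recovered = kept.float() / scale

print(lost.item())        # 0.0
print(recovered.item())   # close to 1e-8
```

Scaling the loss multiplies every gradient by the same factor, which shifts small values into the range `float16` can represent.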

---

## 6. How `GradScaler` Works in Code

```python
import torch
from torch.amp import autocast

scaler = torch.amp.GradScaler()

with autocast('cuda'):
    output = model(input)
    loss = criterion(output, target)

# Step 1: Scale the loss, then backpropagate
scaler.scale(loss).backward()  # multiplies the loss (and so every gradient) by the scale factor

# Step 2: Unscale and step
scaler.step(optimizer)         # divides grads by the scale; skips the step if any are inf/NaN

# Step 3: Adjust the scale factor for next iteration
scaler.update()                # lowers the scale after a skipped step, raises it after a run of clean steps
```
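
The adjustment in step 3 can be pictured with a small pure-Python model. This is a simplified sketch of the rule, not PyTorch's actual implementation; the function `update_scale` is hypothetical, and its default factors are assumed to mirror `GradScaler`'s documented defaults (`growth_factor=2.0`, `backoff_factor=0.5`, `growth_interval=2000`).

```python
def update_scale(scale, good_steps, found_inf,
                 growth_factor=2.0, backoff_factor=0.5, growth_interval=2000):
    """Simplified model of the dynamic loss-scale rule (not PyTorch's code)."""
    if found_inf:
        # The optimizer step was skipped: shrink the scale, restart the count
        return scale * backoff_factor, 0
    good_steps += 1
    if good_steps == growth_interval:
        # A full run of clean steps: try a larger scale again
        return scale * growth_factor, 0
    return scale, good_steps


scale, good_steps = 65536.0, 0
# Hypothetical history: one overflow, then three clean steps (interval shortened to 3)
for found_inf in [True, False, False, False]:
    scale, good_steps = update_scale(scale, good_steps, found_inf, growth_interval=3)
    print(found_inf, scale)
```

In this run the scale halves to 32768 after the overflow and returns to 65536 after three clean steps.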



---

## 7. Notes and Best Practices

1. **Do not** wrap the backward pass in autocast.
   Only wrap the forward pass; backward operations run in the same dtype autocast chose for the corresponding forward operations.

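2. **Use `GradScaler` when training with `float16`**, whose narrow exponent range makes small gradients underflow.
   (It is usually unnecessary with `bfloat16`, which has the same exponent range as `float32` and is supported on newer GPUs such as A100 and H100.)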

3. If your model uses operations not compatible with FP16 (e.g., some custom CUDA ops), disable autocast around them and cast their inputs to `float32`:

   ```python
   with autocast('cuda', enabled=False):
       x = custom_op(x.float())
   ```

4. You can switch autocast on or off at runtime with the `enabled` argument:

   ```python
   with autocast('cuda', enabled=False):  # pass a bool flag here to toggle AMP
       output = model(input)
   ```
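
The local opt-out described in note 3 can be tried without a GPU. In the sketch below, `custom_op` is a hypothetical stand-in for an operation that must run in `float32`, and CPU autocast with `bfloat16` is assumed so the code runs anywhere.

```python
import torch
from torch import nn

def custom_op(t):
    # Hypothetical stand-in for an op that must run in float32
    return t @ t.transpose(-1, -2)

layer = nn.Linear(8, 8)
x = torch.randn(4, 8)

# Assumption: CPU + bfloat16 so the sketch runs without a GPU
with torch.autocast("cpu", dtype=torch.bfloat16):
    h = layer(x)                                  # bfloat16 under autocast
    with torch.autocast("cpu", enabled=False):
        out = custom_op(h.float())                # cast up, then run in float32

print(h.dtype, out.dtype)   # torch.bfloat16 torch.float32
```

The explicit `.float()` matters: disabling autocast stops further casting, but tensors produced earlier in the region keep their lower-precision dtype.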

---

## 8. For Inference

Inference has no backward pass, so no `GradScaler` is needed:

```python
model.eval()
with torch.no_grad(), autocast('cuda'):
    output = model(input)
```

---

## 9. Autocast on CPU (optional)

You can also use autocast for CPU with `dtype=torch.bfloat16`:

```python
with autocast(device_type='cpu', dtype=torch.bfloat16):
    output = model(input)
```
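
As a rough check of what lower precision costs, the sketch below runs the same inference twice on CPU, once in `float32` and once under `bfloat16` autocast, and compares the results. The layer sizes and the seed are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 4).eval()
x = torch.randn(2, 8)

with torch.no_grad():
    ref = model(x)                                   # float32 reference
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        out = model(x)                               # bfloat16 result

# Compare in float32 so the subtraction itself adds no rounding error
max_err = (out.float() - ref).abs().max().item()
print(out.dtype, f"max abs difference vs float32: {max_err:.4f}")
```

The difference is small but not zero, so outputs that feed precision-sensitive post-processing may need a cast back to `float32`.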

---