Here's a **comprehensive checklist** of techniques to maximize **GPU utilization and training throughput in PyTorch**, categorized by priority and **estimated impact**.

---

#  **Maximizing GPU Utilization in PyTorch**

|  # | Technique                                  | Description                                               | Expected Impact                           | Action                                         |
| -: | ------------------------------------------ | --------------------------------------------------------- | ----------------------------------------- | ---------------------------------------------- |
|  1 | `pin_memory=True` in DataLoader            | Enables faster CPU→GPU transfers via page-locked memory   | **Moderate** (\~10–20%)                   | `DataLoader(..., pin_memory=True)`             |
|  2 | `num_workers>0` in DataLoader              | Enables multi-process CPU data loading (worker subprocesses) | **High** (\~20–50%) if CPU bottlenecked   | `DataLoader(..., num_workers=4)` (or `8`)      |
|  3 | Larger `batch_size`                        | Reduces number of GPU kernel launches per epoch           | **High**, depends on GPU memory           | Try `256`, `512`, `1024` if possible           |
|  4 | `torch.backends.cudnn.benchmark=True`      | Optimizes cuDNN kernel selection when input size is fixed | **Moderate** (5–15%)                      | Add to script startup                          |
|  5 | Move `loss` and criterion to GPU           | Prevents CPU→GPU transfer during training                 | **Small** (\~1–3%)                        | `criterion = nn.CrossEntropyLoss().to(device)` |
|  6 | Use `torch.no_grad()` in eval loop         | Avoids unnecessary graph computation                      | **Small**, but essential for memory       | `with torch.no_grad():` around the eval loop   |
|  7 | Profile with `torch.profiler` or Nsight Systems (`nsys`) | Finds performance bottlenecks per op        | **High** insight, indirect gain           | Use when tuning performance (`nvprof` does not support Ampere or newer GPUs) |
|  8 | Prefetching with `prefetch_factor`         | Loads batches in background while GPU trains              | **Moderate** (\~10–20%)                   | `DataLoader(..., prefetch_factor=2)`           |
|  9 | Use `torch.compile(model)` (PyTorch 2.x)   | Traces and compiles model for faster inference/training   | **High** (10–40%) in PyTorch 2.x+         | Requires PyTorch 2.0+                          |
| 10 | Use AMP (mixed-precision) training         | Reduces memory and speeds up math ops on Tensor Cores     | **Very High** (up to \~2× on Tensor Core GPUs such as RTX/A100) | Use `torch.amp` (`autocast` + `GradScaler`) or `Lightning` |
| 11 | Use `torch.jit.script` / `torch.jit.trace` | Optimizes graph and inlines ops                           | **Moderate** (\~5–15%)                    | For static models                              |
| 12 | Use `nvidia-smi`, `nvtop`, Nsight Systems  | Monitor live GPU usage & bottlenecks                      | **Essential diagnostics**                 | CLI tools                                      |
| 13 | Reduce CPU bottlenecks                     | e.g., avoid slow disk reads, keep data in RAM             | **High** in I/O-heavy workloads           | Load data in RAM, SSD preferred                |
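
Row 6 has no one-line action because `torch.no_grad()` wraps the whole evaluation loop. The sketch below shows one way to do that; the `evaluate` helper, the toy `nn.Linear` model, and the random dataset are all illustrative stand-ins, not part of the checklist.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, device):
    """Return accuracy without building an autograd graph (hypothetical helper)."""
    model.eval()                      # disable dropout, use BatchNorm running stats
    correct = total = 0
    with torch.no_grad():             # no graph is recorded -> lower memory use
        for x, y in loader:
            x = x.to(device, non_blocking=True)
            y = y.to(device, non_blocking=True)
            logits = model(x)
            correct += (logits.argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 3).to(device)    # toy stand-in for a real network
data = TensorDataset(torch.randn(64, 8), torch.randint(0, 3, (64,)))
loader = DataLoader(data, batch_size=16, pin_memory=(device.type == "cuda"))

accuracy = evaluate(model, loader, device)
print(f"accuracy: {accuracy:.3f}")
```

`pin_memory` is only switched on when a GPU is present, since pinned host memory only helps CPU→GPU copies.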

---




# **3. batch_size**

The choice of **`batch_size`** has a significant impact on **training performance, accuracy, generalization, and hardware utilization**.

---

#### **3.1. Training Speed / Performance**

* **✅ Pros:**

  * Better GPU utilization due to parallelism.
  * Fewer parameter updates per epoch → less overhead from optimizer and backpropagation calls.
  * More stable and accurate gradient estimation per batch.
* **❌ Cons:**

  * Consumes more GPU memory → might not fit in memory, leading to OOM (Out of Memory).
  * Diminishing returns beyond a certain size.

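> **Rule of thumb (for throughput):** Increase `batch_size` until GPU memory is nearly full or throughput stops improving, then check that generalization has not suffered (see 3.2).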

---

####  **3.2. Model Accuracy / Generalization**

* **Small `batch_size` (e.g., 32 or 64):**

  * Noisy gradients → acts as a regularizer → better generalization.
* **❌ Large `batch_size` (e.g., 1024+):**

  * Smooth gradients → can lead to faster convergence but **poorer generalization**.
  * Might converge to **sharp minima** → worse performance on test data.

> **Empirical studies** (e.g., Keskar et al., 2016) showed that very large batch sizes can hurt generalization.

---

####  **3.3. Loss Surface & Convergence**

* **Large batches** tend to:

  * Follow **smoother optimization paths** during training (less stochastic noise).
  * Require **more careful learning rate scheduling** (e.g., linear warmup + decay).

> When increasing `batch_size`, consider **increasing learning rate proportionally** (see **linear scaling rule**).

---


####  **3.4. Practical Advice**

* **Start small (32–128)** for better generalization.
* If using **batch norm**, large batch size can help stabilize the estimates.
* For very large `batch_size`, use:

  * Learning rate scaling: `new_lr = base_lr * (new_batch / base_batch)`
  * **Gradient accumulation** if GPU can't fit a large batch.
  * Mixed-precision training (AMP) to reduce memory footprint.
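
As a rough illustration of the learning rate scaling formula and the linear warmup mentioned in 3.3, the sketch below combines them with `torch.optim.lr_scheduler.LambdaLR`. The helper name `scaled_lr`, the base values, and the five-step warmup are assumptions chosen for readability, not recommended settings.

```python
import torch

def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: learning rate grows in proportion to batch size."""
    return base_lr * new_batch / base_batch

base_lr, base_batch, new_batch = 0.1, 256, 1024   # illustrative values
warmup_steps = 5                                   # illustrative; real runs use more

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(
    model.parameters(), lr=scaled_lr(base_lr, base_batch, new_batch)
)

# Linear warmup: ramp the LR multiplier from 1/warmup_steps up to 1.0, then hold.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

lrs = []
for _ in range(8):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()      # gradients omitted; this only exercises the schedule
    scheduler.step()
print([round(lr, 3) for lr in lrs])
```

The learning rate climbs linearly to the scaled target and then stays there; in a real run a decay schedule would usually follow the warmup.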

---

**Code Snippet for Gradient Accumulation**

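```python
accum_iter = 4  # simulate batch_size * 4
optimizer.zero_grad()  # start from clean gradients
for i, (x, y) in enumerate(loader):
    output = model(x)
    loss = criterion(output, y) / accum_iter  # average over the accumulated batches
    loss.backward()                           # gradients add up in .grad

    # Step every accum_iter batches, and once more for a trailing partial group
    # (len(loader) assumes the loader has a known length)
    if (i + 1) % accum_iter == 0 or (i + 1) == len(loader):
        optimizer.step()
        optimizer.zero_grad()
```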
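
To see why the loss is divided by `accum_iter`, the following sketch compares the gradient from one full batch with the gradient accumulated over four micro-batches. The toy linear model, the random data, and the use of `float64` (to keep rounding differences negligible) are assumptions made for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(6, 1).double()          # float64 keeps rounding error tiny
criterion = nn.MSELoss()                  # mean reduction over the batch
x = torch.randn(32, 6, dtype=torch.float64)
y = torch.randn(32, 1, dtype=torch.float64)

# Reference: one backward pass over the full batch of 32
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulated: four micro-batches of 8, each loss divided by accum_iter
accum_iter = 4
model.zero_grad()
for xb, yb in zip(x.chunk(accum_iter), y.chunk(accum_iter)):
    (criterion(model(xb), yb) / accum_iter).backward()   # grads add up in .grad
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad))
```

The match holds here because the micro-batches have equal size and the loss uses mean reduction. Layers that compute per-batch statistics, such as batch norm, still see the smaller micro-batch, so accumulation is not a perfect substitute for a genuinely larger batch.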

---



# **4. torch.backends.cudnn.benchmark**

When you enable:

```python
torch.backends.cudnn.benchmark = True
```

you tell cuDNN to **profile multiple convolution algorithms** at runtime.

For every convolution layer (e.g., `Conv2d`), cuDNN has several possible algorithms (`GEMM`, `FFT`, `Winograd`, etc.) to execute it. Each has different performance depending on:

* Input size
* Kernel size
* Stride
* Padding
* GPU architecture

cuDNN **benchmarks all available algorithms** for your given layer configuration (on the first forward pass), **times them**, and then **caches the fastest one** for reuse.

---

**How it speeds up training**

If input sizes are **constant**, then cuDNN can:

* **Choose the best kernel** once
* **Avoid slower default algorithms**
* **Reuse the fast kernel efficiently** across all batches

This can lead to **noticeable performance gains**: typically around 5–15%, and 10%–50% in some CNN-heavy models.

---

#### **4.1. Why is it not enabled by default?**

**Benchmarking costs time:**

* On the first forward pass with a new input shape, **the candidate algorithms are timed**.
* This causes a **noticeable delay** whenever input shapes keep changing (e.g., variable image sizes).
* Benchmarking can also allocate **extra workspace memory** while the candidates are being tried.

---

#### **4.2. When it's harmful:**

| Scenario                                                    | Problem                                                     |
| ----------------------------------------------------------- | ----------------------------------------------------------- |
| Input sizes vary (e.g. augmentation that changes image size, NLP with dynamic padding) | cuDNN must **re-benchmark for every new input shape**, which adds overhead |
| You need **deterministic behavior**                         | Some fast algorithms are **non-deterministic**              |
| You're running on constrained GPU memory                    | Benchmarking might cause **out-of-memory errors**           |

---


**Optional Best Practice for Training Script**

```python
import torch

torch.backends.cudnn.benchmark = True       # Enable fastest algorithm selection
torch.backends.cudnn.deterministic = False  # (optional) For speed over reproducibility
```

And if you care about **reproducibility**, flip them:

```python
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```
---

# **5. torch.backends.cudnn.deterministic = True**

It forces **cuDNN to use only deterministic algorithms**, so cuDNN-backed ops such as convolutions return **bit-for-bit identical results** when run repeatedly with the same inputs, weights, and hardware. It does not cover non-cuDNN ops; for those, PyTorch provides `torch.use_deterministic_algorithms(True)`.

In practice, **deep learning on GPUs involves non-determinism** due to **parallel computation, low-level optimizations, and randomness**. There are several sources of non-determinism in training and inference:

---

#### **5.1. cuDNN Algorithm Choice**

cuDNN (and other GPU libraries) offer **multiple ways to compute convolutions**, pooling, etc. Some are non-deterministic:

* They use **atomic operations** (whose order is not guaranteed)
* The order of operations in **parallel threads** may vary → rounding errors differ → result differs slightly


A **cuDNN kernel** refers to a **highly optimized GPU function** provided by NVIDIA’s cuDNN (CUDA Deep Neural Network) library, which accelerates **deep learning operations** on NVIDIA GPUs, such as:

* Convolutions (forward & backward)
* Pooling
* Activation functions (ReLU, tanh, etc.)
* Normalization (batch norm, LRN)
* RNNs, LSTMs, GRUs
* Tensor transformations

---

For example:

* A 2D convolution used in a CNN might be executed using a cuDNN kernel that selects the best algorithm (e.g., Winograd, FFT, direct GEMM) based on input shapes and GPU architecture.
* cuDNN will dynamically choose and launch the most efficient kernel for that specific workload.

---

**Example in Practice (PyTorch):**

When you run a convolution layer in PyTorch like:

```python
nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
```

under the hood, PyTorch (if CUDA is available) will use cuDNN to choose and run a convolution kernel optimized for your hardware.

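By default, PyTorch does not print which algorithm cuDNN selected. To see the kernels that actually ran, profile the forward pass with `torch.profiler` and look at the CUDA kernel names in the trace.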

---

#### **5.2. Random Number Generators (RNGs)**

* Weight initialization
* Data augmentation (flips, rotations)
* Dropout
* Shuffling of training data

Unless you seed *all* RNGs (Python, NumPy, PyTorch), you'll get different results each time.
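
A minimal sketch of seeding all three libraries in one place follows. The helper name `seed_everything` and the `sample` function are hypothetical; only `random.seed`, `np.random.seed`, and `torch.manual_seed` are actual library calls.

```python
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs (hypothetical helper name)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)   # seeds the CPU generator and all CUDA devices

def sample():
    """Draw one number from each library's global RNG."""
    return random.random(), float(np.random.rand()), torch.rand(1).item()

seed_everything(42)
first = sample()
seed_everything(42)
second = sample()
print(first == second)        # identical draws after re-seeding
```

This covers the global generators only. `DataLoader` worker processes have their own RNG state, so shuffling and augmentation may additionally need the loader's `generator` and `worker_init_fn` arguments.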

#### **5.3. Multi-threading / CUDA kernel scheduling**

* CUDA kernel execution order can vary slightly depending on GPU load or thread scheduling.
* Even slight differences can **accumulate** during training (especially with float32).

---


The deterministic settings can be checked on a single convolution (requires a CUDA GPU):

```python
import torch
torch.manual_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False  # don't search for fastest algo

x = torch.randn(1, 3, 32, 32, device='cuda')
conv = torch.nn.Conv2d(3, 16, 3).cuda()

# Same input and same weights: both calls now produce the same output
y1 = conv(x)
y2 = conv(x)

print(torch.allclose(y1, y2))
```

```
True
```


# **8. Prefetching with `prefetch_factor`**


In PyTorch, **prefetching** is a technique that allows the `DataLoader` to **prepare data batches ahead of time**, so your model doesn’t have to wait for data. This is especially helpful when data loading (e.g., image decoding, transforms) is a bottleneck.

**How `prefetch_factor` works:**

* `prefetch_factor` is a parameter of `torch.utils.data.DataLoader`.
* It determines how **many batches per worker** are preloaded **in advance**.
* Only valid when `num_workers > 0`; recent PyTorch versions raise a `ValueError` if you set it with `num_workers=0`.

---


If you set:

```python
DataLoader(..., num_workers=4, prefetch_factor=2)
```

Then:

* Each of the 4 workers will prefetch **2 batches**, so **8 batches total** are being prepared while the model trains on current data.

---

**When to use it:**

* Use `prefetch_factor > 2` **if your GPU is under-utilized** and **data loading is slow**.
* Tune it along with `num_workers` to find the optimal setup.
* Avoid setting it too high — it increases memory usage.

---

### 💡 Example

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,          # Use multiple workers
    prefetch_factor=4,      # Default is 2
    pin_memory=True         # Speeds up transfer to GPU
)

# training loop
for images, labels in loader:
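    # non_blocking=True lets the copy from pinned memory overlap with GPU compute
    images, labels = images.cuda(non_blocking=True), labels.cuda(non_blocking=True)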
    ...
```

---

### 📈 Tips for performance:

| Parameter                 | Description                                               |
| ------------------------- | --------------------------------------------------------- |
| `num_workers`             | More workers = more parallel data loading                 |
| `prefetch_factor`         | Higher = more preloaded batches (good for I/O-heavy data) |
| `pin_memory=True`         | Use when transferring to CUDA                             |
| `persistent_workers=True` | Keeps workers alive across epochs (PyTorch 1.7+)          |

---



To see the effect, use a timing tool like:

```python
import time
start = time.time()
for batch in loader:
    pass
print("Time:", time.time() - start)
```




# **9. torch.compile(model)**

`torch.compile(model)` is used to optimize and speed up the execution of your model by compiling it into a more efficient backend representation using **TorchDynamo**, **AOTAutograd**, and **Inductor** (by default).

---

####  **9.1. Where to use `torch.compile(model)`**

You typically apply it **after creating your model but before training or inference**:

```python
import torch
import torchvision.models as models

model = models.resnet18()
model = model.to('cuda')  # or 'cpu'

compiled_model = torch.compile(model)

# Then use compiled_model for training or inference
```

> Use `torch.compile()` once, ideally after model instantiation and before the training loop.

---

#### **9.2. What happens when you do `torch.compile(model)`?**

PyTorch 2.0 introduced a compiler stack that includes:

1. **TorchDynamo**: Hooks into Python bytecode execution and captures your model's tensor operations as a graph.
2. **AOTAutograd**: Traces the forward and backward passes ahead of time.
3. **Inductor**: Lowers the captured graph to generated kernels (Triton on GPU, C++/OpenMP on CPU).

This process removes Python overhead and fuses operations, leading to much faster execution, especially for large models on GPUs.

---

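####  **9.3. Where does the compiled code go?**

* **In memory**: The compiled callables stay attached to the running process and are guarded by checks on the inputs; when a guard fails (e.g., a new input shape), the function is recompiled.
* **On disk**: Inductor also writes its generated source and binaries to a cache directory (by default under the system temp directory, e.g. `/tmp/torchinductor_<username>`; configurable with `TORCHINDUCTOR_CACHE_DIR`).
* **Debugging**: You can inspect what the compiler does with the `TORCH_LOGS` environment variable, e.g. `dynamo` for tracing logs or `output_code` for the code Inductor generates:

  ```bash
  TORCH_LOGS="dynamo" python your_script.py
  ```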

If you want to **export and save** compiled models, look into `torch.export`.

---

#### **9.4. Things to be careful about**

* Set **training or eval mode** correctly before compiling (`model.eval()` or `model.train()`).
* Some models or operations might not be fully supported (especially dynamic control flow).
* You can get back to the eager module through the wrapper's `_orig_mod` attribute (private, so it may change between releases):

  ```python
  model = compiled_model._orig_mod
  ```

---

#### **9.5. When NOT to use it?**

* Very small models with negligible Python overhead.
* Highly dynamic models with control flows that resist optimization.
* If you already use other tracing tools like TorchScript and want full control over the tracing.

---

# **10. Use AMP (mixed-precision) training**

In PyTorch, the **default data type** for floating-point tensors is **`torch.float32` (i.e., `float`)**. You can check this by creating a tensor:

```python
import torch
x = torch.tensor([1.0, 2.0, 3.0])
print(x.dtype)
```

```
torch.float32
```

**AMP** stands for **Automatic Mixed Precision**. It's a technique that allows your model to use both **float32 (FP32)** and **float16 (FP16)** during training to **speed up computation and reduce memory usage** — **without significantly affecting model accuracy**.


Normally, training uses 32-bit floating-point (float32) numbers. Mixed precision uses:

* **float16 (half precision, normal range: ~6e-5 to 6.5e+4)** for most operations including `Conv`, `Linear`, `ReLU`, `matmul`, `activations` (faster, less memory)
* **float32 (single precision, normal range: ~1e-38 to 3e+38)** for critical operations including `Loss`, `normalization`, `softmax`, `batchnorm` (to maintain accuracy)
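
The narrow float16 range is the reason loss scaling exists. The sketch below runs on CPU and shows a value overflowing, a small value underflowing to zero, and the same small value surviving once it is scaled up before the cast. The specific numbers (`70000.0`, `1e-8`) are illustrative; the scale factor `2**16` matches `GradScaler`'s default initial scale.

```python
import torch

fp16 = torch.finfo(torch.float16)
print(fp16.max, fp16.tiny)           # largest value, smallest normal value

big = torch.tensor(70000.0).half()   # above fp16.max -> overflows to inf
small = torch.tensor(1e-8).half()    # below the smallest subnormal -> becomes 0
print(big.item(), small.item())

# Loss scaling in miniature: multiply before casting, divide afterwards in float32.
scale = 2.0 ** 16
rescued = (torch.tensor(1e-8) * scale).half().float() / scale
print(rescued.item())                # close to 1e-8 again
```

`GradScaler` applies the same idea to gradients: it multiplies the loss so small gradients stay representable in float16, then divides them back before the optimizer step.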

---

####  **10.1. Training loop using `autocast` and `GradScaler`**

```python
import torch
from torch.amp import autocast, GradScaler  # device-agnostic API; first argument is the device type

scaler = GradScaler('cuda')

for inputs, targets in dataloader:
    optimizer.zero_grad()
    
    with autocast('cuda'):  # enables mixed precision
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    
    scaler.scale(loss).backward()          # scale the loss
    scaler.step(optimizer)                 # unscale + step
    scaler.update()                        # update scale
```

---

**What Each Component Does**

| Component      | Role                                                   |
| -------------- | ------------------------------------------------------ |
| `autocast()`   | Runs ops in FP16 when safe, otherwise in FP32          |
| `GradScaler()` | Prevents underflow when backpropagating FP16 gradients |
| `.scale(loss)` | Scales the loss for FP16 safe backward                 |
| `.step()`      | Unscales gradients, then steps the optimizer (skipped if they contain inf/NaN) |
| `.update()`    | Adjusts scaling factor dynamically                     |

---

#### **10.2. Benefits / Cons of Using AMP**

Use AMP **whenever you're training on a GPU that supports it**, especially:

* On **NVIDIA Volta, Turing, Ampere, or newer** GPUs (e.g., V100, RTX 20xx/30xx, A100)
* With **large models** or **high-resolution inputs**
* For **faster training + lower memory footprint**

---

**✅ Benefits**

*  **Faster training** on GPUs that support Tensor Cores (e.g., NVIDIA RTX/Volta/Ampere)
*  **Reduced memory usage**, allowing larger batch sizes or models


**❌ Cons**

* Slight chance of numerical instability in rare cases
* Not all ops are safe in FP16 — PyTorch handles most of this automatically

---

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.amp import autocast, GradScaler
import time

# Set device (automatically checks for CUDA)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Simple model
model = nn.Sequential(
    nn.Linear(1024, 2048),
    nn.ReLU(),
    nn.Linear(2048, 1024)
).to(device)

optimizer = optim.Adam(model.parameters())
criterion = nn.MSELoss()
scaler = GradScaler(device.type)  # 'cuda' or 'cpu'

x = torch.randn(512, 1024, device=device)
y = torch.randn(512, 1024, device=device)

epochs = 1000

# Normal FP32
start = time.time()
for _ in range(epochs):
    optimizer.zero_grad()
    out = model(x)
    loss = criterion(out, y)
    loss.backward()
    optimizer.step()
print("FP32 time:", time.time() - start)

# Mixed Precision (AMP) - this comparison only runs on CUDA
if device.type == 'cuda':
    start = time.time()
    for _ in range(epochs):
        optimizer.zero_grad()
        with autocast(device.type):  # float16 autocast on CUDA
            out = model(x)
            loss = criterion(out, y)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    print("AMP time:", time.time() - start)
else:
    print("No CUDA device found, skipping the AMP comparison.")
```

```
Using device: cuda
FP32 time: 4.290900707244873
AMP time: 3.6157121658325195
```
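
CUDA kernels are launched asynchronously, so reading a wall-clock timer without `torch.cuda.synchronize()` can stop the clock before the GPU has finished, and the first iterations include one-time setup costs. The helper below is a sketch of a more careful measurement; the name `timed`, the warm-up count, and the matmul workload are assumptions for illustration.

```python
import time
import torch

def timed(fn, iters=100, warmup=10):
    """Average seconds per call, synchronizing so queued GPU work is counted."""
    def sync():
        if torch.cuda.is_available():
            torch.cuda.synchronize()

    for _ in range(warmup):           # warm-up: allocator and library setup
        fn()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()                            # wait for kernels launched inside the loop
    return (time.perf_counter() - start) / iters

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.randn(256, 256, device=device)
per_call = timed(lambda: a @ a)
print(f"{per_call * 1e6:.1f} us per matmul on {device}")
```

Wrapping one FP32 training step and one AMP training step in `timed` gives a fairer comparison than timing two back-to-back loops.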


# **11. `torch.jit.script` and `torch.jit.trace`**

In PyTorch, `torch.jit.script` and `torch.jit.trace` are part of **TorchScript**, a way to convert PyTorch models into a **serializable and optimizable intermediate representation**. This can improve inference speed, allow deployment in C++ environments, and enable graph-level optimizations.

---

**Two Ways to Create TorchScript Models**

| Method             | Use When...                                                              |
| ------------------ | ------------------------------------------------------------------------ |
| `torch.jit.trace`  | The model is **static**, i.e., no control flow depending on input values |
| `torch.jit.script` | The model has **conditionals, loops, or input-dependent control flow**   |

---

####  **11.1. `torch.jit.trace` Example (Static Model)**

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)

    def forward(self, x):
        return self.linear(x)

model = MyModel()
example_input = torch.randn(1, 10)

# Traced model
traced_model = torch.jit.trace(model, example_input)
traced_model.save("traced_model.pt")  # Save
```

> ⚠️ Use this **only** if the model’s computation graph does **not depend on input data** (e.g., no `if`, `for`).

---

####  **11.2. `torch.jit.script` Example (Dynamic Control Flow)**

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        if x.sum() > 0:  # Dynamic condition
            return x * 2
        else:
            return x - 2

model = MyModel()

# Scripted model
scripted_model = torch.jit.script(model)
scripted_model.save("scripted_model.pt")  # Save
```
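
To make the difference concrete, the sketch below traces and scripts the same branching module and feeds both an input that should take the `else` branch. The class name `Branchy` and the two test inputs are illustrative; the `forward` logic mirrors `MyModel` above.

```python
import warnings

import torch
import torch.nn as nn

class Branchy(nn.Module):             # same logic as MyModel above
    def forward(self, x):
        if x.sum() > 0:
            return x * 2
        else:
            return x - 2

model = Branchy()
pos = torch.ones(3)                   # sum > 0 -> takes the `x * 2` branch
neg = -torch.ones(3)                  # sum < 0 -> should take `x - 2`

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # tracing warns about the tensor-to-bool `if`
    traced = torch.jit.trace(model, pos)   # records only the branch taken for `pos`
scripted = torch.jit.script(model)         # compiles both branches

print("eager   :", model(neg))
print("scripted:", scripted(neg))
print("traced  :", traced(neg))
```

The scripted module agrees with eager execution, while the traced module keeps applying `x * 2` because that was the only path it saw during tracing.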

---

**Benefits of `torch.jit.script` and `torch.jit.trace`**

1. **Speed**: TorchScript compiles models to an optimized graph, often faster for inference (especially on CPU).
2. **Deployment**: Models can be loaded in C++ via LibTorch.
3. **Serialization**: You can save and load complete models easily (`.pt` format).
4. **Cross-platform**: Useful for mobile (PyTorch Mobile).

---

## 📈 Benchmark Example (Speed Comparison)

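```python
import time
import torch
import torch.nn as nn

# Rebuild a static model like the one in 11.1 so the snippet runs on its own
model = nn.Linear(10, 5).eval()
x = torch.randn(1000, 10)
traced_model = torch.jit.trace(model, x)

def bench(fn, iters=1000):
    with torch.no_grad():
        for _ in range(10):          # warm-up calls before timing
            fn(x)
        start = time.perf_counter()
        for _ in range(iters):
            fn(x)
    return time.perf_counter() - start

print("Eager time:", bench(model))
print("TorchScript time:", bench(traced_model))
```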

