Here's a **comprehensive checklist** of techniques to maximize **GPU utilization and training throughput in PyTorch**, categorized by priority and **estimated impact**.

---

#  **Maximizing GPU Utilization in PyTorch**

|  # | Technique                                  | Description                                               | Expected Impact                           | Action                                         |
| -: | ------------------------------------------ | --------------------------------------------------------- | ----------------------------------------- | ---------------------------------------------- |
|  1 | `pin_memory=True` in DataLoader            | Enables faster CPU→GPU transfers via page-locked memory   | **Moderate** (\~10–20%)                   | `DataLoader(..., pin_memory=True)`             |
|  2 | `num_workers>0` in DataLoader              | Loads data in parallel worker processes                   | **High** (\~20–50%) if CPU bottlenecked   | `DataLoader(..., num_workers=4)` (try 4 or 8)  |
|  3 | Larger `batch_size`                        | Reduces number of GPU kernel launches per epoch           | **High**, depends on GPU memory           | Try `256`, `512`, `1024` if possible           |
|  4 | `torch.backends.cudnn.benchmark=True`      | Optimizes cuDNN kernel selection when input size is fixed | **Moderate** (5–15%)                      | Add to script startup                          |
|  5 | Move `loss` and criterion to GPU           | Prevents CPU→GPU transfer during training                 | **Small** (\~1–3%)                        | `criterion = nn.CrossEntropyLoss().to(device)` |
|  6 | Use `torch.no_grad()` in eval loop         | Avoids unnecessary graph computation                      | **Small**, but essential for memory       | `with torch.no_grad():` around eval            |
|  7 | Profile with `torch.profiler` or Nsight Systems | Finds performance bottlenecks per op                 | **High** insight, indirect gain           | Use when tuning performance                    |
|  8 | Prefetching with `prefetch_factor`         | Loads batches in background while GPU trains              | **Moderate** (\~10–20%)                   | `DataLoader(..., prefetch_factor=2)`           |
|  9 | Use `torch.compile(model)` (PyTorch 2.x)   | Traces and compiles model for faster inference/training   | **High** (10–40%) in PyTorch 2.x+         | Requires PyTorch 2.0+                          |
| 10 | Use AMP (mixed-precision) training         | Reduces memory and speeds up math ops on Tensor Cores     | **Very High** (up to \~2× on Tensor Core GPUs) | Use `torch.amp` or `Lightning`            |
| 11 | Use `torch.jit.script` / `torch.jit.trace` | Optimizes graph and inlines ops                           | **Moderate** (\~5–15%)                    | For static models                              |
| 12 | Use `nvidia-smi`, `nvtop`, or GPUtil       | Monitor live GPU usage & bottlenecks                      | **Essential diagnostics**                 | CLI tools / Python library                     |
| 13 | Reduce CPU bottlenecks                     | e.g., avoid slow disk reads, keep data in RAM             | **High** in I/O-heavy workloads           | Load data in RAM, SSD preferred                |

---




# **3. batch_size**

In general, a **larger `batch_size`** has a significant impact on **training performance, accuracy, generalization, and hardware utilization**.

---

#### **3.1. Training Speed / Performance**

* **✅ Pros:**

  * Better GPU utilization due to parallelism.
  * Fewer parameter updates per epoch → less overhead from optimizer and backpropagation calls.
  * More stable and accurate gradient estimation per batch.
* **❌ Cons:**

  * Consumes more GPU memory → might not fit in memory, leading to OOM (Out of Memory).
  * Diminishing returns beyond a certain size.

> **Rule of thumb:** Increase `batch_size` until you hit GPU memory limits.

---

####  **3.2. Model Accuracy / Generalization**

* **Small `batch_size` (e.g., 32 or 64):**

  * Noisy gradients → acts as a regularizer → better generalization.
* **❌ Large `batch_size` (e.g., 1024+):**

  * Smooth gradients → can lead to faster convergence but **poorer generalization**.
  * Might converge to **sharp minima** → worse performance on test data.

> **Empirical studies** (e.g., Keskar et al., 2016) showed that very large batch sizes can hurt generalization.

---

####  **3.3. Loss Surface & Convergence**

* **Large batches** tend to:

  * Follow **smoother optimization paths** during training (less stochastic noise).
  * Require **more careful learning rate scheduling** (e.g., linear warmup + decay).

> When increasing `batch_size`, consider **increasing learning rate proportionally** (see **linear scaling rule**).

---


####  **3.4. Practical Advice**

* **Start small (32–128)** for better generalization.
* If using **batch norm**, large batch size can help stabilize the estimates.
* For very large `batch_size`, use:

  * Learning rate scaling: `new_lr = base_lr * (new_batch / base_batch)`
  * **Gradient accumulation** if GPU can't fit a large batch.
  * Mixed-precision training (AMP) to reduce memory footprint.
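
The linear scaling rule and the warmup mentioned above can be sketched in a few lines of plain Python. The function names `scaled_lr` and `warmup_lr` and the example values (base LR 0.1 at batch size 256, 5 warmup steps) are illustrative assumptions, not settings taken from this guide.

```python
def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: grow the learning rate in proportion to batch size."""
    return base_lr * (new_batch / base_batch)


def warmup_lr(target_lr: float, step: int, warmup_steps: int) -> float:
    """Linearly ramp from target_lr / warmup_steps up to target_lr, then hold."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * (step + 1) / warmup_steps


# Hypothetical example: a recipe tuned at batch size 256 is moved to 1024.
target = scaled_lr(base_lr=0.1, base_batch=256, new_batch=1024)
schedule = [warmup_lr(target, step, warmup_steps=5) for step in range(7)]
print(target)
print(schedule)
```

In a real training loop the returned value would be written into `optimizer.param_groups[i]['lr']`, or the same ramp could be expressed with a scheduler such as `torch.optim.lr_scheduler.LambdaLR`.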

---

**Code Snippet for Gradient Accumulation**

```python
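accum_iter = 4  # effective batch size = loader batch_size * accum_iter
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    output = model(x)
    loss = criterion(output, y) / accum_iter  # average over the accumulated micro-batches
    loss.backward()                           # gradients add up in .grad across iterations

    # Step every accum_iter micro-batches, and once more for a trailing partial group
    if (i + 1) % accum_iter == 0 or (i + 1) == len(loader):
        optimizer.step()
        optimizer.zero_grad()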
```
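
To see why the loss is divided by `accum_iter`, the following CPU-only sketch compares the gradient of one full batch with the gradient accumulated over four micro-batches. The toy `nn.Linear` model, the random data, and the sizes are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)
criterion = nn.MSELoss()          # mean reduction over the batch
x, y = torch.randn(16, 8), torch.randn(16, 1)

# Reference: gradient of one full batch of 16 samples.
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulated: 4 micro-batches of 4 samples, each loss divided by accum_iter.
accum_iter = 4
model.zero_grad()
for xb, yb in zip(x.chunk(accum_iter), y.chunk(accum_iter)):
    (criterion(model(xb), yb) / accum_iter).backward()  # .grad sums across calls
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

The two gradients agree up to floating-point rounding because the loss uses mean reduction and the micro-batches are equal in size. Layers that compute per-batch statistics, such as batch norm, still see only the micro-batch, so accumulation is not a perfect substitute for a genuinely larger batch there.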

---



# **4. torch.backends.cudnn.benchmark**

When you enable:

```python
torch.backends.cudnn.benchmark = True
```

you tell cuDNN to **profile multiple convolution algorithms** at runtime.

For every convolution layer (e.g., `Conv2d`), cuDNN has several possible algorithms (`GEMM`, `FFT`, `Winograd`, etc.) to execute it. Each has different performance depending on:

* Input size
* Kernel size
* Stride
* Padding
* GPU architecture

cuDNN **benchmarks all available algorithms** for your given layer configuration (on the first forward pass), **times them**, and then **caches the fastest one** for reuse.

---

**How it speeds up training**

If input sizes are **constant**, then cuDNN can:

* **Choose the best kernel** once
* **Avoid slower default algorithms**
* **Reuse the fast kernel efficiently** across all batches

This can lead to **significant performance gains** (10%–50% in some CNN-heavy models).

---

#### **4.1. Why is it not enabled by default?**



**Benchmarking costs time**

* On the first forward pass with a new input size, **all candidate algorithms are tested**.
* This can cause a **noticeable delay** if input shapes keep changing (e.g., variable image sizes).
* Benchmarking might also allocate **more memory** while the candidates are being timed.

---

#### **4.2. When it's harmful:**

| Scenario                                                    | Problem                                                     |
| ----------------------------------------------------------- | ----------------------------------------------------------- |
| Input sizes vary (e.g. data augmentation, NLP with padding) | cuDNN must **re-benchmark for every new shape**, which adds overhead |
| You need **deterministic behavior**                         | Some fast algorithms are **non-deterministic**              |
| You're running on constrained GPU memory                    | Benchmarking might cause **out-of-memory errors**           |

---


**Optional Best Practice for Training Script**

```python
import torch

torch.backends.cudnn.benchmark = True       # Enable fastest algorithm selection
torch.backends.cudnn.deterministic = False  # (optional) For speed over reproducibility
```

And if you care about **reproducibility**, flip them:

```python
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```

---

# **5. torch.backends.cudnn.deterministic = True**

It forces **cuDNN to use only deterministic algorithms**, so that running the same cuDNN operation multiple times with the same input, on the same hardware and software stack, produces the exact same output, **bit for bit**. The flag covers cuDNN only: other CUDA operations can still be non-deterministic unless you also call `torch.use_deterministic_algorithms(True)`.

In practice, **deep learning on GPUs involves non-determinism** due to **parallel computation, low-level optimizations, and randomness**. There are several sources of non-determinism in training and inference:

---

#### **5.1. cuDNN Algorithm Choice**

cuDNN (and other GPU libraries) offer **multiple ways to compute convolutions**, pooling, etc. Some are non-deterministic:

* They use **atomic operations** (whose order is not guaranteed)
* The order of operations in **parallel threads** may vary → rounding errors differ → result differs slightly


A **cuDNN kernel** refers to a **highly optimized GPU function** provided by NVIDIA’s cuDNN (CUDA Deep Neural Network) library, which accelerates **deep learning operations** on NVIDIA GPUs, such as:

* Convolutions (forward & backward)
* Pooling
* Activation functions (ReLU, tanh, etc.)
* Normalization (batch norm, LRN)
* RNNs, LSTMs, GRUs
* Tensor transformations

---

For example:

* A 2D convolution used in a CNN might be executed using a cuDNN kernel that selects the best algorithm (e.g., Winograd, FFT, direct GEMM) based on input shapes and GPU architecture.
* cuDNN will dynamically choose and launch the most efficient kernel for that specific workload.

---

**Example in Practice (PyTorch):**

When you run a convolution layer in PyTorch like:

```python
nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
```

under the hood, PyTorch (if CUDA is available) will use cuDNN to choose and run a convolution kernel optimized for your hardware.

PyTorch does not print which algorithm was selected. To see the cuDNN kernels that actually ran, inspect a `torch.profiler` trace, where they appear by kernel name.

---

#### **5.2. Random Number Generators (RNGs)**

Randomness enters training through:

* Weight initialization
* Data augmentation (flips, rotations)
* Dropout
* Shuffling of training data

Unless you seed *all* RNGs (Python, NumPy, PyTorch), you'll get different results each time.
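
A minimal sketch of seeding all three generators in one place is shown below. The helper name `seed_everything` is hypothetical (it is not a PyTorch function), and the sketch assumes NumPy is installed.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs (hypothetical helper name)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds the PyTorch generators on all devices


seed_everything(42)
first = (random.random(), np.random.rand(), torch.rand(1).item())
seed_everything(42)
second = (random.random(), np.random.rand(), torch.rand(1).item())
print(first == second)
```

`DataLoader` shuffling and worker processes draw their randomness from the loader's `generator` and `worker_init_fn` arguments, which this sketch does not cover.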

#### **5.3. Multi-threading / CUDA kernel scheduling**

* CUDA kernel execution order can vary slightly depending on GPU load or thread scheduling.
* Even slight differences can **accumulate** during training (especially with float32).

---


In [22]:
import torch
torch.manual_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False  # don't search for fastest algo

x = torch.randn(1, 3, 32, 32, device='cuda')
conv = torch.nn.Conv2d(3, 16, 3).cuda()

# This will now always produce the same output
y1 = conv(x)
y2 = conv(x)

print(torch.allclose(y1, y2)) # Will be True if weights are fixed

True


# **8. Prefetching with `prefetch_factor`**


In PyTorch, **prefetching** is a technique that allows the `DataLoader` to **prepare data batches ahead of time**, so your model doesn’t have to wait for data. This is especially helpful when data loading (e.g., image decoding, transforms) is a bottleneck.

**How `prefetch_factor` works:**

* `prefetch_factor` is a parameter of `torch.utils.data.DataLoader`.
* It determines how **many batches per worker** are preloaded **in advance**.
* Only used when `num_workers > 0`.

---


If you set:

```python
DataLoader(..., num_workers=4, prefetch_factor=2)
```

Then:

* Each of the 4 workers will prefetch **2 batches**, so **8 batches total** are being prepared while the model trains on current data.

---

**When to use it:**

* Use `prefetch_factor > 2` **if your GPU is under-utilized** and **data loading is slow**.
* Tune it along with `num_workers` to find the optimal setup.
* Avoid setting it too high — it increases memory usage.
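
To get a feel for the memory cost, the sketch below estimates how many batches are in flight and roughly how much host memory their input tensors occupy. The helper `prefetch_footprint` and the example sizes are assumptions for illustration; the estimate ignores labels, pinned-memory copies, and Python overhead.

```python
from math import prod


def prefetch_footprint(num_workers, prefetch_factor, batch_size,
                       sample_shape, bytes_per_element=4):
    """Return (batches in flight, approximate bytes held by prefetched inputs)."""
    batches_in_flight = num_workers * prefetch_factor
    bytes_per_batch = batch_size * prod(sample_shape) * bytes_per_element
    return batches_in_flight, batches_in_flight * bytes_per_batch


# Hypothetical setup: 4 workers, prefetch_factor=4, 64 float32 images of 3x224x224.
batches, total_bytes = prefetch_footprint(4, 4, 64, (3, 224, 224))
print(batches, f"{total_bytes / 1024**2:.0f} MiB")
```

Doubling either `num_workers` or `prefetch_factor` doubles this figure, which is why the two are best tuned together.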

---

### 💡 Example

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,          # Use multiple workers
    prefetch_factor=4,      # Default is 2
    pin_memory=True         # Speeds up transfer to GPU
)

# training loop
for images, labels in loader:
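    # non_blocking=True lets the copy from pinned memory overlap with GPU work
    images, labels = images.cuda(non_blocking=True), labels.cuda(non_blocking=True)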
    ...
```

---

### 📈 Tips for performance:

| Parameter                 | Description                                               |
| ------------------------- | --------------------------------------------------------- |
| `num_workers`             | More workers = more parallel data loading                 |
| `prefetch_factor`         | Higher = more preloaded batches (good for I/O-heavy data) |
| `pin_memory=True`         | Use when transferring to CUDA                             |
| `persistent_workers=True` | Keeps workers alive across epochs (PyTorch 1.7+)          |

---



To see the effect, use a timing tool like:

```python
import time
start = time.time()
for batch in loader:
    pass
print("Time:", time.time() - start)
```
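
A slightly more reusable variant of the loop above is sketched here. `batches_per_second` is a hypothetical helper that accepts any iterable of batches, so the stand-in generator below can be replaced by a real `DataLoader`.

```python
import time
from itertools import islice


def batches_per_second(iterable, max_batches=100):
    """Iterate up to max_batches items and return the observed rate."""
    start = time.perf_counter()
    count = sum(1 for _ in islice(iterable, max_batches))
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


# Stand-in for a DataLoader: any iterable of batches works.
fake_loader = ([0] * 64 for _ in range(1000))
rate = batches_per_second(fake_loader, max_batches=200)
print(f"{rate:.0f} batches/s")
```

Calling it once per candidate `(num_workers, prefetch_factor)` pair gives a quick comparison. With a real loader the first few batches include worker start-up time, so a larger `max_batches` gives a fairer number.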




# **9. torch.compile(model)**

`torch.compile(model)` is used to optimize and speed up the execution of your model by compiling it into a more efficient backend representation using **TorchDynamo**, **AOTAutograd**, and **Inductor** (by default).

---

####  **9.1. Where to use `torch.compile(model)`**

You typically apply it **after creating your model but before training or inference**:

```python
import torch
import torchvision.models as models

model = models.resnet18()
model = model.to('cuda')  # or 'cpu'

compiled_model = torch.compile(model)

# Then use compiled_model for training or inference
```

> Use `torch.compile()` once, ideally after model instantiation and before the training loop.

---

#### **9.2. What happens when you do `torch.compile(model)`?**

PyTorch 2.0 introduces a compiler stack that includes:

1. **TorchDynamo**: Captures Python bytecode of your model, intercepts tensor operations.
2. **AOTAutograd**: Ahead-of-Time Autograd tracing for forward and backward passes.
3. **Inductor**: Converts the traced graph into efficient generated kernels (Triton on GPU, C++/OpenMP on CPU).

This process removes Python overhead and fuses operations, leading to much faster execution, especially for large models on GPUs.

---

####  **9.3. Where does the compiled code go?**

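* **In-memory**: The compiled graph and its guards are **kept in memory** for the running process. Inductor additionally caches generated kernel code on disk (location configurable via `TORCHINDUCTOR_CACHE_DIR`), but that cache is not a saved model.
* **Caching**: TorchDynamo keeps the compiled variants in RAM and recompiles when a guard (e.g., an input shape) no longer holds.
* **Debugging**: You can inspect what the compiler is doing with environment variables (use `TORCH_LOGS=output_code` instead to print Inductor's generated code):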

  ```bash
  TORCH_LOGS="dynamo" python your_script.py
  ```

If you want to **export and save** compiled models, look into `torch.export`.

---

#### **9.4. Things to be careful about**

* Use it with **training or eval mode** set correctly before compiling (`model.eval()` or `model.train()`).
* Some models or operations might not be fully supported (especially dynamic control flow).
* You can get back the original eager module through the wrapper's `_orig_mod` attribute (a private attribute, so it may change between releases):

  ```python
  model = compiled_model._orig_mod
  ```

---

#### **9.5. When NOT to use it?**

* Very small models with negligible Python overhead.
* Highly dynamic models with control flows that resist optimization.
* If you already use other tracing tools like TorchScript and want full control over the tracing.

---

# **10. Use AMP (mixed-precision) training**

In PyTorch, the **default data type** for floating-point tensors is **`torch.float32` (i.e., `float`)**, which you can confirm by creating a tensor:


In [2]:
import torch
x = torch.tensor([1.0, 2.0, 3.0])
print(x.dtype)

torch.float32


**AMP** stands for **Automatic Mixed Precision**. It's a technique that allows your model to use both **float32 (FP32)** and **float16 (FP16)** during training to **speed up computation and reduce memory usage** — **without significantly affecting model accuracy**.


Normally, training uses 32-bit floating-point (float32) numbers. Mixed precision uses:

* **float16 (half precision, dynamic range: ~6e-5 to 6.5e+4 for normal values)** for most operations including `Conv`, `Linear`, `matmul`, `ReLU` and other activations (faster, less memory)
* **float32 (single precision, dynamic range: ~1e-38 to 1e+38)** for critical operations including `Loss`, `normalization`, `softmax`, `batchnorm` (to maintain accuracy)
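
The ranges quoted above can be checked directly with `torch.finfo`, which runs on CPU. `bfloat16` is included only for comparison and is an addition here; it is not discussed in the text above.

```python
import torch

for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    # tiny = smallest positive normal value, max = largest finite value,
    # eps = gap between 1.0 and the next representable number
    print(f"{str(dtype):15s} tiny={info.tiny:.3e}  max={info.max:.3e}  eps={info.eps:.3e}")

fp16 = torch.finfo(torch.float16)
fp32 = torch.finfo(torch.float32)
```

The narrow float16 range is the reason gradient scaling is needed: values below `fp16.tiny` lose precision and eventually flush to zero.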

---

####  **10.1. Training loop using `autocast` and `GradScaler`**

```python
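import torch
# Device-agnostic AMP API (recent PyTorch 2.x); it accepts the 'cuda' argument used below,
# which the older torch.cuda.amp classes do not.
from torch.amp import autocast, GradScaler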

scaler = GradScaler('cuda')

for inputs, targets in dataloader:
    optimizer.zero_grad()
    
    with autocast('cuda'):  # enables mixed precision
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    
    scaler.scale(loss).backward()          # scale the loss
    scaler.step(optimizer)                 # unscale + step
    scaler.update()                        # update scale
```

---

**What Each Component Does**

| Component      | Role                                                   |
| -------------- | ------------------------------------------------------ |
| `autocast()`   | Runs ops in FP16 when safe, otherwise in FP32          |
| `GradScaler()` | Prevents underflow when backpropagating FP16 gradients |
| `.scale(loss)` | Scales the loss for FP16 safe backward                 |
| `.step()`      | Unscales gradients before optimizer step               |
| `.update()`    | Adjusts scaling factor dynamically                     |
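
The underflow that `GradScaler` guards against can be reproduced on CPU with a single number. This is only an illustration of the idea, not the scaler's actual implementation; the gradient value `1e-8` is an arbitrary example, and the scale `2**16` matches `GradScaler`'s default initial scale.

```python
import torch

grad = torch.tensor(1e-8, dtype=torch.float32)   # a small but meaningful gradient
scale = 65536.0                                   # illustrative scale factor (2**16)

unscaled_fp16 = grad.to(torch.float16)            # too small for FP16: flushes to zero
scaled_fp16 = (grad * scale).to(torch.float16)    # scaled value survives the cast
recovered = scaled_fp16.float() / scale           # unscale in FP32 before the optimizer step

print(unscaled_fp16.item(), scaled_fp16.item(), recovered.item())
```

Without scaling the gradient is lost entirely; with scaling it is recovered to within FP16 rounding error.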

---

#### **10.2. Benefits / Cons of Using AMP**

Use AMP **whenever you're training on a GPU that supports it**, especially:

* On **NVIDIA Volta, Turing, or Ampere** (e.g., RTX 30xx, A100)
* With **large models** or **high-resolution inputs**
* For **faster training + lower memory footprint**

---

**✅ Benefits**

*  **Faster training** on GPUs that support Tensor Cores (e.g., NVIDIA RTX/Volta/Ampere)
*  **Reduced memory usage**, allowing larger batch sizes or models


**❌ Cons**

* Slight chance of numerical instability in rare cases
* Not all ops are safe in FP16 — PyTorch handles most of this automatically

---

In [11]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.amp import autocast, GradScaler
import time

# Set device (automatically checks for CUDA)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Simple model
model = nn.Sequential(
    nn.Linear(1024, 2048),
    nn.ReLU(),
    nn.Linear(2048, 1024)
).to(device)

optimizer = optim.Adam(model.parameters())
criterion = nn.MSELoss()
scaler = GradScaler(device.type)  # 'cuda' or 'cpu'

x = torch.randn(512, 1024, device=device)
y = torch.randn(512, 1024, device=device)

epochs=1000

# Normal FP32
start = time.time()
for _ in range(epochs):
    optimizer.zero_grad()
    out = model(x)
    loss = criterion(out, y)
    loss.backward()
    optimizer.step()
print("FP32 time:", time.time() - start)

# Mixed Precision (AMP) - Only works on CUDA
if device.type == 'cuda':
    start = time.time()
    for _ in range(epochs):
        optimizer.zero_grad()
        with autocast(device.type):  # 'cuda' only
            out = model(x)
            loss = criterion(out, y)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    print("AMP time:", time.time() - start)
else:
    print("AMP not supported on CPU, skipping.")

Using device: cuda
FP32 time: 4.290900707244873
AMP time: 3.6157121658325195
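
CUDA kernels are launched asynchronously, so wall-clock timings like the ones above can be skewed by work still queued on the GPU and by one-time warm-up costs. A hedged sketch of a more careful timer follows; `time_steps` is a hypothetical helper, not part of PyTorch, and the matmul is only a stand-in workload.

```python
import time

import torch


def time_steps(step_fn, iters=100, warmup=10):
    """Average seconds per call of step_fn, synchronizing CUDA around the timed region."""
    use_cuda = torch.cuda.is_available()
    for _ in range(warmup):          # warm up: kernel autotuning, allocator, lazy init
        step_fn()
    if use_cuda:
        torch.cuda.synchronize()     # wait for queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    if use_cuda:
        torch.cuda.synchronize()     # wait for the last kernels before stopping the clock
    return (time.perf_counter() - start) / iters


# Hypothetical usage with a tiny stand-in workload.
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(256, 256, device=device)
per_call = time_steps(lambda: a @ a, iters=50, warmup=5)
print(f"{per_call * 1e6:.1f} µs per matmul")
```

Wrapping one FP32 training step and one AMP training step in functions and passing each to the timer gives a comparison that is less sensitive to launch latency and warm-up.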


# **12. GPU Monitoring with GPUtil**

**GPUtil** is a Python library that provides an easy interface to **monitor GPU usage in real-time** during training. This is essential for:

- **Identifying bottlenecks** (is your GPU actually being used?)
- **Optimizing batch sizes** (maximize GPU memory utilization)
- **Monitoring temperature** and preventing overheating
- **Tracking training efficiency** over time

---

#### **12.1. Installation and Basic Usage**

```bash
pip install GPUtil psutil
```

**Basic GPU monitoring:**

```python
import GPUtil
import psutil

def get_gpu_info():
    """Get current GPU utilization"""
    gpus = GPUtil.getGPUs()
    
    if gpus:
        gpu = gpus[0]  # First GPU
        return {
            'name': gpu.name,
            'utilization': gpu.load * 100,      # GPU utilization %
            'memory_used': gpu.memoryUsed,      # Used memory in MB
            'memory_total': gpu.memoryTotal,    # Total memory in MB
            'memory_percent': gpu.memoryUtil * 100,  # Memory usage %
            'temperature': gpu.temperature       # Temperature in °C
        }
    return None

# Usage
gpu_info = get_gpu_info()
if gpu_info:
    print(f"GPU: {gpu_info['name']}")
    print(f"Utilization: {gpu_info['utilization']:.1f}%")
    print(f"Memory: {gpu_info['memory_used']}/{gpu_info['memory_total']}MB ({gpu_info['memory_percent']:.1f}%)")
    print(f"Temperature: {gpu_info['temperature']}°C")
```

---

#### **12.2. Integration with Training Loop**

Here's how to integrate GPU monitoring into your training:

```python
import torch
import torch.nn as nn
import GPUtil
import wandb  # For logging; assumes wandb.init(...) was called before training

def train_with_monitoring(model, dataloader, optimizer, criterion, epochs=5):
    model.train()
    
    for epoch in range(epochs):
        epoch_loss = 0
        
        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.cuda(), target.cuda()
            
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
            
            # Monitor GPU every 10 batches
            if batch_idx % 10 == 0:
                gpu_info = get_gpu_info()
                if gpu_info:
                    print(f"Batch {batch_idx} | "
                          f"Loss: {loss.item():.4f} | "
                          f"GPU: {gpu_info['utilization']:.1f}% | "
                          f"VRAM: {gpu_info['memory_percent']:.1f}%")
                    
                    # Log to wandb
                    wandb.log({
                        'batch_loss': loss.item(),
                        'gpu_utilization': gpu_info['utilization'],
                        'gpu_memory_percent': gpu_info['memory_percent'],
                        'gpu_temperature': gpu_info['temperature']
                    })
        
        print(f"Epoch {epoch+1} completed | Avg Loss: {epoch_loss/len(dataloader):.4f}")
```

---


In [None]:
# Practical GPUtil Demo - Monitor your GPU right now!
import GPUtil
import psutil
import torch

def print_system_info():
    """Print comprehensive system information"""
    print("🖥️  SYSTEM HARDWARE INFORMATION")
    print("=" * 50)
    
    # CPU Information
    cpu_count = psutil.cpu_count(logical=True)
    cpu_percent = psutil.cpu_percent(interval=1)
    print(f"💻 CPU: {cpu_count} cores @ {cpu_percent:.1f}% usage")
    
    # Memory Information
    memory = psutil.virtual_memory()
    memory_gb = memory.total / (1024**3)
    print(f"🧠 RAM: {memory_gb:.1f}GB total, {memory.percent:.1f}% used")
    
    # GPU Information
    print(f"\n🚀 GPU INFORMATION:")
    print("-" * 30)
    
    if torch.cuda.is_available():
        print(f"✅ CUDA Available: {torch.version.cuda}")
        print(f"🎯 PyTorch CUDA Device: {torch.cuda.get_device_name()}")
        
        try:
            gpus = GPUtil.getGPUs()
            
            if gpus:
                for i, gpu in enumerate(gpus):
                    print(f"\nGPU {i}: {gpu.name}")
                    print(f"  📊 Utilization: {gpu.load * 100:.1f}%")
                    print(f"  🧠 Memory: {gpu.memoryUsed:.0f}/{gpu.memoryTotal:.0f}MB ({gpu.memoryUtil * 100:.1f}%)")
                    print(f"  💾 Free Memory: {gpu.memoryTotal - gpu.memoryUsed:.0f}MB")
                    print(f"  🌡️  Temperature: {gpu.temperature}°C")
                    print(f"  🆔 UUID: {gpu.uuid}")
                    
                    # Memory recommendations
                    free_memory = gpu.memoryTotal - gpu.memoryUsed
                    if free_memory > 3000:
                        print(f"  💡 Recommendation: You can use large batch sizes (128+)")
                    elif free_memory > 1500:
                        print(f"  💡 Recommendation: Use medium batch sizes (64-128)")
                    else:
                        print(f"  ⚠️  Warning: Limited memory, use small batch sizes (16-32)")
            else:
                print("❌ No GPUs detected by GPUtil")
                
        except Exception as e:
            print(f"❌ Error accessing GPU info: {e}")
    else:
        print("❌ CUDA not available - using CPU only")

# Run the system check
print_system_info()


#### **12.3. Real-time GPU Monitoring During Training**

Here's a practical example that monitors GPU usage while training a model:

```python
import time
import matplotlib.pyplot as plt
from collections import deque

def monitor_training_with_gpu(model, dataloader, optimizer, criterion, epochs=3):
    """Train model while monitoring GPU utilization"""
    
    # Storage for monitoring data
    gpu_utilization = deque(maxlen=100)
    gpu_memory = deque(maxlen=100)
    gpu_temp = deque(maxlen=100)
    timestamps = deque(maxlen=100)
    
    model.train()
    start_time = time.time()
    
    print("🔥 Training with GPU Monitoring")
    print("=" * 60)
    print("Epoch | Batch | Loss   | GPU%  | VRAM% | Temp°C | Time")
    print("-" * 60)
    
    for epoch in range(epochs):
        for batch_idx, (data, target) in enumerate(dataloader):
            # Training step
            data, target = data.cuda(), target.cuda()
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            
            # Monitor GPU every batch (GPUtil queries nvidia-smi in a subprocess,
            # so sample less often in real training runs)
            gpu_info = get_gpu_info()
            current_time = time.time() - start_time
            
            if gpu_info:
                gpu_utilization.append(gpu_info['utilization'])
                gpu_memory.append(gpu_info['memory_percent'])
                gpu_temp.append(gpu_info['temperature'])
                timestamps.append(current_time)
                
                # Print every 5 batches
                if batch_idx % 5 == 0:
                    print(f"{epoch+1:5d} | {batch_idx:5d} | {loss.item():.3f} | "
                          f"{gpu_info['utilization']:4.1f} | {gpu_info['memory_percent']:5.1f} | "
                          f"{gpu_info['temperature']:6.1f} | {current_time:6.1f}s")
    
    return list(gpu_utilization), list(gpu_memory), list(gpu_temp), list(timestamps)

# Example usage (uncomment to run with your data)
# gpu_util, gpu_mem, gpu_temp, times = monitor_training_with_gpu(model, train_loader, optimizer, criterion)
```

---

#### **12.4. GPU Memory Optimization Helper**

```python
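def optimize_batch_size(model, sample_input, max_memory_percent=90):
    """
    Estimate the largest batch size that fits on your GPU.

    Note: this probes a forward pass under no_grad, so it underestimates
    training memory (activations kept for backward, gradients, optimizer state).
    """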
    if not torch.cuda.is_available():
        return 32  # Default for CPU
    
    model = model.cuda()
    model.train()
    
    # Start with small batch size
    batch_size = 16
    optimal_batch_size = 16
    
    print("🔍 Finding optimal batch size...")
    print("Batch Size | GPU Memory % | Status")
    print("-" * 35)
    
    while batch_size <= 512:  # Max reasonable batch size
        try:
            # Clear cache
            torch.cuda.empty_cache()
            
            # Create batch
            if len(sample_input.shape) == 4:  # Image data (B, C, H, W)
                batch = sample_input[:1].repeat(batch_size, 1, 1, 1).cuda()
            else:  # Other data
                batch = sample_input[:1].repeat(batch_size, 1).cuda()
            
            # Forward pass
            with torch.no_grad():
                _ = model(batch)
            
            # Check GPU memory
            gpu_info = get_gpu_info()
            if gpu_info:
                memory_percent = gpu_info['memory_percent']
                print(f"{batch_size:10d} | {memory_percent:11.1f}% | ", end="")
                
                if memory_percent < max_memory_percent:
                    optimal_batch_size = batch_size
                    print("✅ Good")
                    batch_size *= 2  # Try larger
                else:
                    print("❌ Too high")
                    break
            else:
                break
                
        except RuntimeError as e:
            if "out of memory" in str(e):
                print(f"{batch_size:10d} | {'OOM':>11s} | ❌ Out of Memory")
                break
            else:
                raise e
    
    torch.cuda.empty_cache()
    print(f"\n💡 Recommended batch size: {optimal_batch_size}")
    return optimal_batch_size
```

---




# **13. Complete GPU Optimization Pipeline**

Based on the Brain Cancer MRI project implementation, here's a **complete optimization pipeline** that maximizes GPU utilization:

---

#### **13.1. Model-Specific Batch Size Optimization**

Different models have different memory requirements. Here's how to configure optimal batch sizes:

```python
# Model-specific configurations for RTX 3050 (4GB VRAM)
MODEL_CONFIGS = {
    'resnet18': {'batch_size': 128, 'lr': 0.001},      # Lightweight CNN
    'resnet50': {'batch_size': 64, 'lr': 0.001},       # Deeper CNN  
    'efficientnet_b0': {'batch_size': 128, 'lr': 0.001}, # Efficient CNN
    'swin_t': {'batch_size': 32, 'lr': 0.0001},        # Transformer
    'vit_b_16': {'batch_size': 16, 'lr': 0.0001},      # Large transformer
}

def get_optimal_config(model_name, gpu_memory_gb):
    """Get optimal configuration based on model and GPU memory"""
    base_config = MODEL_CONFIGS.get(model_name, {'batch_size': 32, 'lr': 0.001})
    
    # Scale batch size based on available GPU memory
    memory_scale = gpu_memory_gb / 4.0  # Baseline: 4GB
    optimal_batch = max(1, int(base_config['batch_size'] * memory_scale))  # never return 0
    
    return {
        'batch_size': optimal_batch,
        'lr': base_config['lr']
    }
```

---

#### **13.2. Complete Training Setup with All Optimizations**

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.amp import autocast, GradScaler
import GPUtil

def setup_optimized_training(model, train_dataset, val_dataset, config):
    """Setup training with maximum GPU utilization"""
    
    # Device setup
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    
    # Performance optimizations
    if torch.cuda.is_available():
        torch.backends.cudnn.benchmark = True  # Optimize for fixed input sizes
        
        # Memory format optimization
        model = model.to(memory_format=torch.channels_last)
    
    # Model compilation (PyTorch 2.0+)
    try:
        model = torch.compile(model)
        print("✅ Model compiled for faster training")
    except Exception:  # e.g. PyTorch < 2.0 or an unsupported platform
        print("⚠️  Model compilation not available")
    
    # Optimized data loading
    dataloader_kwargs = {
        'batch_size': config['batch_size'],
        'num_workers': 8,  # Tune to your CPU core count
        'pin_memory': True,
        'prefetch_factor': 4,
        'persistent_workers': True,
    }
    
    # drop_last only for training: keeps batch shapes constant without discarding validation samples
    train_loader = DataLoader(train_dataset, shuffle=True, drop_last=True, **dataloader_kwargs)
    val_loader = DataLoader(val_dataset, shuffle=False, drop_last=False, **dataloader_kwargs)
    
    # Mixed precision setup
    scaler = GradScaler() if torch.cuda.is_available() else None
    
    # Optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=config['lr'], weight_decay=0.01)
    criterion = nn.CrossEntropyLoss()
    
    return model, train_loader, val_loader, optimizer, criterion, scaler, device

# Usage example
# model, train_loader, val_loader, optimizer, criterion, scaler, device = setup_optimized_training(
#     model, train_dataset, val_dataset, config
# )
```
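
The objects returned by the setup function still need a training loop that uses them consistently. The following is a minimal sketch of one epoch, assuming a classification-style `(data, target)` loader; `train_one_epoch` and the `demo_*` names are hypothetical and not part of the original setup code. AMP is applied only when a `GradScaler` is passed, so the same function also runs on CPU.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, optimizer, criterion, device, scaler=None):
    """Run one epoch; uses AMP only when a GradScaler is supplied (hypothetical helper)."""
    use_amp = scaler is not None
    model.train()
    total_loss, num_batches = 0.0, 0
    for data, target in loader:
        # non_blocking=True lets the copy overlap with compute when pin_memory=True
        data = data.to(device, non_blocking=True)
        target = target.to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type=device.type, enabled=use_amp):
            loss = criterion(model(data), target)

        if use_amp:
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            loss.backward()
            optimizer.step()

        total_loss += loss.item()
        num_batches += 1
    return total_loss / max(num_batches, 1)

# Tiny CPU smoke run with synthetic data (illustrative only)
torch.manual_seed(0)
demo_device = torch.device('cpu')
demo_model = nn.Linear(8, 3).to(demo_device)
demo_data = TensorDataset(torch.randn(32, 8), torch.randint(0, 3, (32,)))
demo_loader = DataLoader(demo_data, batch_size=8)
demo_opt = torch.optim.AdamW(demo_model.parameters(), lr=1e-2)
demo_loss = train_one_epoch(demo_model, demo_loader, demo_opt,
                            nn.CrossEntropyLoss(), demo_device)
print(f"mean loss: {demo_loss:.4f}")
```

With the real setup, the call would pass the returned `scaler` and `device` instead of the CPU stand-ins.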

---


#### **13.3. Advanced GPU Utilization Techniques**

Here are additional techniques for maximizing GPU performance:

---

**🔥 Gradient Accumulation for Larger Effective Batch Sizes**

When GPU memory limits your batch size, use gradient accumulation:

```python
from torch.amp import autocast, GradScaler

def train_with_gradient_accumulation(model, dataloader, optimizer, criterion, 
                                   accumulation_steps=4, use_amp=True):
    """Training with gradient accumulation to simulate larger batch sizes"""
    
    # With enabled=False, both the scaler and autocast become no-ops
    scaler = GradScaler('cuda', enabled=use_amp)
    model.train()
    optimizer.zero_grad()  # Start from clean gradients
    
    for batch_idx, (data, target) in enumerate(dataloader):
        data, target = data.cuda(non_blocking=True), target.cuda(non_blocking=True)
        
        # Normalize loss by accumulation steps so the summed gradients
        # match the average over the larger effective batch
        with autocast('cuda', enabled=use_amp):
            output = model(data)
            loss = criterion(output, target) / accumulation_steps
        
        scaler.scale(loss).backward()
        
        # Update weights every accumulation_steps, and on the final batch
        # so leftover gradients are not discarded
        is_last_batch = (batch_idx + 1) == len(dataloader)
        if (batch_idx + 1) % accumulation_steps == 0 or is_last_batch:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            
            # Monitor GPU (get_gpu_info is the helper used throughout this guide)
            gpu_info = get_gpu_info()
            if gpu_info:
                print(f"Step {batch_idx//accumulation_steps} | "
                      f"Loss: {loss.item()*accumulation_steps:.4f} | "
                      f"GPU: {gpu_info['utilization']:.1f}%")
```
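
To see why the loss is divided by `accumulation_steps`, the sketch below compares the gradient from one full batch with the gradient accumulated over four equal micro-batches. It assumes a mean-reduced loss and equally sized micro-batches; the model, data, and variable names are illustrative only, and it runs on CPU in float64 to keep the comparison tight.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
accumulation_steps = 4
inputs = torch.randn(16, 5, dtype=torch.float64)
targets = torch.randn(16, 1, dtype=torch.float64)
criterion = nn.MSELoss()  # mean reduction over the batch

def make_model():
    torch.manual_seed(1)  # identical initial weights for both runs
    return nn.Linear(5, 1).double()

# Reference: one backward pass over the full batch of 16
full_model = make_model()
criterion(full_model(inputs), targets).backward()

# Accumulated: 4 micro-batches of 4, each loss divided by accumulation_steps
acc_model = make_model()
for x_chunk, y_chunk in zip(inputs.chunk(accumulation_steps),
                            targets.chunk(accumulation_steps)):
    loss = criterion(acc_model(x_chunk), y_chunk) / accumulation_steps
    loss.backward()  # gradients add up in .grad across calls

grads_match = torch.allclose(full_model.weight.grad, acc_model.weight.grad)
print("gradients match:", grads_match)
```

Without the division, the accumulated gradient would be `accumulation_steps` times larger, which is equivalent to silently scaling the learning rate.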

---

**⚡ Dynamic Batch Size Scaling**

Automatically adjust batch size based on available GPU memory:

```python
def dynamic_batch_scaling(model, dataset, target_memory_percent=85):
    """Dynamically scale batch size based on available GPU memory"""
    
    gpu_info = get_gpu_info()
    if not gpu_info:
        return 32
    
    # Memory budget: stay under the target share of total memory,
    # minus what is already in use
    target_memory = gpu_info['memory_total'] * (target_memory_percent / 100)
    available_memory = max(target_memory - gpu_info['memory_used'], 0)
    
    # Estimate memory per sample (rough approximation)
    sample_memory_mb = 10  # Adjust based on your data
    max_batch_size = max(int(available_memory / sample_memory_mb), 1)
    
    # Find optimal batch size through binary search
    optimal_batch = optimize_batch_size(model, dataset[0][0].unsqueeze(0), target_memory_percent)
    
    return min(optimal_batch, max_batch_size)
```
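
The search step itself can be sketched independently of any GPU code. The function below is a hypothetical helper, not the guide's `optimize_batch_size`: it assumes a caller-supplied `fits(batch_size)` predicate that is monotonic (if a size fits, every smaller size fits). On a real GPU, `fits` would typically run one forward and backward pass and return `False` on an out-of-memory error; here a fake memory model stands in for that.

```python
def find_max_batch_size(fits, start=1, limit=4096):
    """Largest batch size in [start, limit] for which fits() is True, or 0 if none.

    Assumes fits is monotonic: if a size fits, all smaller sizes fit too.
    """
    if not fits(start):
        return 0
    low = start
    # Phase 1: double until a size fails or the limit is reached
    while low * 2 <= limit and fits(low * 2):
        low *= 2
    # First size known (or assumed) not to fit
    high = min(low * 2, limit + 1)
    # Phase 2: binary search in the open interval (low, high)
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low

# Fake memory model: 10 MB per sample with a 3000 MB budget (illustrative numbers)
def fake_fits(batch_size):
    return batch_size * 10 <= 3000

best = find_max_batch_size(fake_fits)
print("largest batch that fits:", best)
```

Doubling first keeps the number of trial runs logarithmic in the final batch size, which matters when each trial is a real forward and backward pass.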

---

**🌡️ Thermal Throttling Prevention**

Monitor and prevent GPU overheating:

```python
import time

def check_thermal_throttling():
    """Check if GPU is thermal throttling"""
    gpu_info = get_gpu_info()
    
    if gpu_info:
        temp = gpu_info['temperature']
        
        if temp > 83:  # Threshold chosen for an RTX 3050; check the limit for your GPU
            print(f"🔥 WARNING: GPU temperature high ({temp}°C)")
            print("💡 Consider: reducing batch size, improving cooling, or taking breaks")
            return True
        elif temp > 75:
            print(f"⚠️  GPU temperature elevated ({temp}°C)")
            return False
        else:
            print(f"✅ GPU temperature normal ({temp}°C)")
            return False
    
    return False

# Usage in training loop
if check_thermal_throttling():
    time.sleep(30)  # Cool down period
```
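
A fixed 30-second sleep may be longer or shorter than needed. One possible refinement is to poll until the temperature drops below a lower resume threshold. The sketch below is a hypothetical helper: `read_temp` is any callable returning the temperature in °C (or `None`), and the thresholds simply reuse the 83 and 75 values from above. The `sleep` argument is injectable so the logic can be exercised without waiting.

```python
import time

def wait_for_cooldown(read_temp, hot=83.0, resume=75.0, poll_seconds=5.0,
                      max_wait_seconds=300.0, sleep=time.sleep):
    """Pause while the GPU is hot; return the number of seconds spent waiting."""
    temp = read_temp()
    if temp is None or temp <= hot:
        return 0.0
    waited = 0.0
    # Hysteresis: once hot, wait until the temperature falls below `resume`
    while waited < max_wait_seconds:
        sleep(poll_seconds)
        waited += poll_seconds
        temp = read_temp()
        if temp is None or temp < resume:
            break
    return waited

# Simulated readings instead of a real sensor (illustrative only)
fake_readings = iter([85, 82, 78, 74])
waited_seconds = wait_for_cooldown(lambda: next(fake_readings),
                                   sleep=lambda seconds: None)
print("waited:", waited_seconds, "seconds")
```

In a training loop, `read_temp` could be something like `lambda: (get_gpu_info() or {}).get('temperature')`, assuming `get_gpu_info` returns a dict with a `temperature` key as in the code above.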

---


# **14. Comprehensive GPU Monitoring Dashboard**

Create a real-time monitoring dashboard for your training:

---

#### **14.1. Real-time GPU Monitoring Class**

```python
import threading
import time

import matplotlib.pyplot as plt
import psutil  # CPU and RAM statistics

class GPUMonitor:
    """Real-time GPU monitoring during training"""
    
    def __init__(self, update_interval=2):
        self.update_interval = update_interval
        self.monitoring = False
        self.data = {
            'timestamps': [],
            'gpu_utilization': [],
            'gpu_memory': [],
            'gpu_temperature': [],
            'cpu_percent': [],
            'ram_percent': []
        }
    
    def start_monitoring(self):
        """Start background monitoring thread"""
        self.monitoring = True
        self.monitor_thread = threading.Thread(target=self._monitor_loop)
        self.monitor_thread.daemon = True
        self.monitor_thread.start()
        print("🔍 GPU monitoring started...")
    
    def stop_monitoring(self):
        """Stop monitoring and show final plot"""
        self.monitoring = False
        if hasattr(self, 'monitor_thread'):
            self.monitor_thread.join()
        print("⏹️  GPU monitoring stopped")
        self.plot_results()
    
    def _monitor_loop(self):
        """Background monitoring loop"""
        start_time = time.time()
        
        while self.monitoring:
            current_time = time.time() - start_time
            
            # Get GPU info
            gpu_info = get_gpu_info()
            
            # Get CPU/RAM info
            cpu_percent = psutil.cpu_percent(interval=None)
            ram_percent = psutil.virtual_memory().percent
            
            # Store data
            self.data['timestamps'].append(current_time)
            self.data['cpu_percent'].append(cpu_percent)
            self.data['ram_percent'].append(ram_percent)
            
            if gpu_info:
                self.data['gpu_utilization'].append(gpu_info['utilization'])
                self.data['gpu_memory'].append(gpu_info['memory_percent'])
                self.data['gpu_temperature'].append(gpu_info['temperature'])
            else:
                self.data['gpu_utilization'].append(0)
                self.data['gpu_memory'].append(0)
                self.data['gpu_temperature'].append(0)
            
            time.sleep(self.update_interval)
    
    def plot_results(self):
        """Plot monitoring results"""
        if not self.data['timestamps']:
            print("No data to plot")
            return
        
        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
        
        times = self.data['timestamps']
        
        # GPU Utilization
        ax1.plot(times, self.data['gpu_utilization'], 'g-', linewidth=2)
        ax1.set_title('🚀 GPU Utilization (%)')
        ax1.set_ylabel('Utilization %')
        ax1.grid(True)
        ax1.set_ylim(0, 100)
        
        # GPU Memory
        ax2.plot(times, self.data['gpu_memory'], 'b-', linewidth=2)
        ax2.set_title('🧠 GPU Memory Usage (%)')
        ax2.set_ylabel('Memory %')
        ax2.grid(True)
        ax2.set_ylim(0, 100)
        
        # GPU Temperature
        ax3.plot(times, self.data['gpu_temperature'], 'r-', linewidth=2)
        ax3.set_title('🌡️ GPU Temperature (°C)')
        ax3.set_ylabel('Temperature °C')
        ax3.set_xlabel('Time (seconds)')
        ax3.grid(True)
        
        # CPU and RAM
        ax4.plot(times, self.data['cpu_percent'], 'orange', label='CPU %', linewidth=2)
        ax4.plot(times, self.data['ram_percent'], 'purple', label='RAM %', linewidth=2)
        ax4.set_title('💻 System Resources')
        ax4.set_ylabel('Usage %')
        ax4.set_xlabel('Time (seconds)')
        ax4.legend()
        ax4.grid(True)
        ax4.set_ylim(0, 100)
        
        plt.tight_layout()
        plt.show()
        
        # Print summary statistics
        if self.data['gpu_utilization']:
            avg_gpu = sum(self.data['gpu_utilization']) / len(self.data['gpu_utilization'])
            max_gpu = max(self.data['gpu_utilization'])
            avg_temp = sum(self.data['gpu_temperature']) / len(self.data['gpu_temperature'])
            max_temp = max(self.data['gpu_temperature'])
            
            print(f"\n📊 MONITORING SUMMARY:")
            print(f"🚀 Average GPU Utilization: {avg_gpu:.1f}%")
            print(f"🔥 Peak GPU Utilization: {max_gpu:.1f}%")
            print(f"🌡️  Average Temperature: {avg_temp:.1f}°C")
            print(f"🔥 Peak Temperature: {max_temp:.1f}°C")

# Usage example:
# monitor = GPUMonitor()
# monitor.start_monitoring()
# 
# # Your training code here
# train_model()
# 
# monitor.stop_monitoring()
```
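
If training raises an exception between `start_monitoring()` and `stop_monitoring()`, the final plot and summary are never produced. A small context manager avoids that. This is a sketch with a hypothetical `monitored` helper; it only assumes an object with the two methods shown above, and uses a `DummyMonitor` stand-in so it runs without a GPU.

```python
from contextlib import contextmanager

@contextmanager
def monitored(monitor):
    """Start `monitor` on entry and always stop it on exit."""
    monitor.start_monitoring()
    try:
        yield monitor
    finally:
        monitor.stop_monitoring()

class DummyMonitor:
    """Stand-in with the same two methods as GPUMonitor (for illustration)."""
    def __init__(self):
        self.events = []

    def start_monitoring(self):
        self.events.append('start')

    def stop_monitoring(self):
        self.events.append('stop')

dummy = DummyMonitor()
try:
    with monitored(dummy):
        raise RuntimeError("simulated training failure")
except RuntimeError:
    pass

print(dummy.events)
```

With the real class, usage would be `with monitored(GPUMonitor()): train_model()`.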

---


# **15. Updated Optimization Checklist with GPUtil**

Here's an **enhanced checklist** with GPUtil integration and the latest optimization techniques:

---

| # | Technique | Expected Impact | Implementation | GPUtil Usage |
|---|-----------|----------------|----------------|--------------|
| **1** | **Model-specific batch sizes** | **Very High** (up to 2-4x throughput when the GPU was underfilled) | Use optimized batch sizes per model | Monitor memory usage |
| **2** | **Mixed Precision (AMP)** | **Very High** (up to ~2x speed and ~50% less activation memory on Tensor Core GPUs) | `autocast('cuda')` + `GradScaler()` | Track memory savings |
| **3** | **PyTorch 2.0 Compilation** | **High** (often 20-30% speedup, model-dependent) | `torch.compile(model)` | Monitor utilization increase |
| **4** | **Optimized DataLoader** | **High** (20-50% if I/O bound) | 8 workers + prefetching | Track CPU usage |
| **5** | **GPU Memory Optimization** | **High** | `channels_last` + `pin_memory` | Monitor memory efficiency |
| **6** | **CUDNN Benchmarking** | **Moderate** (5-15%) | `cudnn.benchmark=True` | Verify consistent utilization |
| **7** | **Gradient Accumulation** | **Indirect** (larger effective batches in the same memory; no raw speedup) | Accumulate over multiple steps | Monitor during accumulation |
| **8** | **Thermal Management** | **Critical** (prevents throttling) | Monitor temperature | Real-time temp tracking |
| **9** | **Hardware Monitoring** | **Essential** (bottleneck identification) | GPUtil + psutil | Continuous monitoring |
| **10** | **Dynamic Scaling** | **Moderate** (adaptive optimization) | Auto-adjust based on resources | Memory-based scaling |

---

#### **15.1. Quick GPU Health Check**

Before starting any training, run this health check:

```python
def gpu_health_check():
    """Comprehensive GPU health and optimization check"""
    print("🏥 GPU HEALTH CHECK")
    print("=" * 50)
    
    # Basic availability
    if not torch.cuda.is_available():
        print("❌ CUDA not available")
        return False
    
    # GPU information
    gpu_info = get_gpu_info()
    if not gpu_info:
        print("❌ No GPU detected by GPUtil")
        return False
    
    print(f"✅ GPU: {gpu_info['name']}")
    
    # Temperature check
    temp = gpu_info['temperature']
    if temp > 80:
        print(f"🔥 WARNING: High temperature ({temp}°C)")
    else:
        print(f"✅ Temperature: {temp}°C")
    
    # Memory check
    memory_percent = gpu_info['memory_percent']
    free_memory = gpu_info['memory_total'] - gpu_info['memory_used']
    
    print(f"💾 Memory: {gpu_info['memory_used']:.0f}/{gpu_info['memory_total']:.0f}MB ({memory_percent:.1f}%)")
    print(f"💾 Free Memory: {free_memory:.0f}MB")
    
    # Batch size recommendations (rough heuristics; tune per model)
    if free_memory > 3000:
        print("💡 Recommendation: Large batch sizes (128+)")
    elif free_memory > 1500:
        print("💡 Recommendation: Medium batch sizes (64-128)")
    else:
        print("⚠️  Recommendation: Small batch sizes (16-32)")
    
    # Utilization check (low values are expected while the GPU is idle)
    utilization = gpu_info['utilization']
    if utilization < 10:
        print(f"📊 GPU utilization: {utilization:.1f}% (expected when idle; "
              "if it stays low during training, check batch size and data loading)")
    elif utilization > 90:
        print("✅ Excellent GPU utilization!")
    else:
        print(f"📊 GPU utilization: {utilization:.1f}%")
    
    return True

# Run the health check
gpu_health_check()
```
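
The memory part of this check can also be done with PyTorch alone, which is useful when GPUtil is not installed. The sketch below uses `torch.cuda.mem_get_info`, which reports free and total device memory in bytes; `cuda_memory_summary` is a hypothetical helper name, and the returned keys are chosen to resemble the `gpu_info` dict used above.

```python
import torch

def cuda_memory_summary(device_index=0):
    """Return memory figures in MB for a CUDA device, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    free_bytes, total_bytes = torch.cuda.mem_get_info(device_index)
    mb = 1024 ** 2
    return {
        'name': torch.cuda.get_device_name(device_index),
        'memory_total': total_bytes / mb,
        'memory_free': free_bytes / mb,
        'memory_used': (total_bytes - free_bytes) / mb,
        # Memory currently held by this process's PyTorch caching allocator
        'torch_reserved': torch.cuda.memory_reserved(device_index) / mb,
    }

summary = cuda_memory_summary()
print(summary if summary else "CUDA not available")
```

Note that `memory_used` here covers all processes on the device, while `torch_reserved` covers only the current process.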

---

#### **15.2. Training Optimization Workflow**

Here's the complete workflow for maximizing GPU utilization:

```python
def optimize_training_pipeline(model_name, dataset):
    """Complete optimization pipeline"""
    
    print("🔧 TRAINING OPTIMIZATION PIPELINE")
    print("=" * 50)
    
    # Step 1: Check GPU health
    if not gpu_health_check():
        return None
    
    # Step 2: Get optimal configuration
    gpu_info = get_gpu_info()
    gpu_memory_gb = gpu_info['memory_total'] / 1024
    config = get_optimal_config(model_name, gpu_memory_gb)
    
    print(f"\n⚙️  Optimal Configuration:")
    print(f"   📦 Batch Size: {config['batch_size']}")
    print(f"   📈 Learning Rate: {config['lr']}")
    
    # Step 3: Setup monitoring
    monitor = GPUMonitor()
    monitor.start_monitoring()
    
    try:
        # Step 4: Setup optimized training
        # (Your training setup code here)
        
        # Step 5: Training with monitoring
        print("\n🚀 Starting optimized training...")
        
        # Your training loop here
        # train_model_with_optimizations()
    finally:
        # Step 6: Stop monitoring and analyze, even if training raises
        monitor.stop_monitoring()
    
    return config

# Example usage:
# config = optimize_training_pipeline('resnet18', your_dataset)
```

---


# **11. `torch.jit.script` and `torch.jit.trace`**

In PyTorch, `torch.jit.script` and `torch.jit.trace` are part of **TorchScript**, a way to convert PyTorch models into a **serializable and optimizable intermediate representation**. This can improve inference speed, allow deployment in C++ environments, and enable graph-level optimizations.

---

**Two Ways to Create TorchScript Models**

| Method             | Use When...                                                              |
| ------------------ | ------------------------------------------------------------------------ |
| `torch.jit.trace`  | The model is **static**, i.e., no control flow depending on input values |
| `torch.jit.script` | The model has **conditionals, loops, or input-dependent control flow**   |

---

####  **11.1. `torch.jit.trace` Example (Static Model)**

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)

    def forward(self, x):
        return self.linear(x)

model = MyModel()
example_input = torch.randn(1, 10)

# Traced model
traced_model = torch.jit.trace(model, example_input)
traced_model.save("traced_model.pt")  # Save
```

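> ⚠️ Use tracing **only** if the model’s control flow does **not depend on input data**. `torch.jit.trace` records the operations executed for the example input, so any `if` or loop that depends on tensor values is frozen to the path taken during tracing.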

---

####  **11.2. `torch.jit.script` Example (Dynamic Control Flow)**

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        if x.sum() > 0:  # Dynamic condition
            return x * 2
        else:
            return x - 2

model = MyModel()

# Scripted model
scripted_model = torch.jit.script(model)
scripted_model.save("scripted_model.pt")  # Save
```
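
The difference between the two methods shows up as soon as an input takes the other branch. The sketch below traces and scripts the same data-dependent module (named `Branchy` here for illustration) and evaluates both on an input whose sum is negative. The traced version was recorded with a positive input, so it keeps applying `x * 2`.

```python
import warnings

import torch
import torch.nn as nn

class Branchy(nn.Module):
    def forward(self, x):
        if x.sum() > 0:
            return x * 2
        else:
            return x - 2

module = Branchy()
positive = torch.ones(3)
negative = -torch.ones(3)

with warnings.catch_warnings():
    # Tracing warns that converting a tensor to a Python bool is unsafe
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(module, positive)  # records only the `x * 2` path

scripted = torch.jit.script(module)  # compiles both branches

print("eager:   ", module(negative))
print("traced:  ", traced(negative))
print("scripted:", scripted(negative))
```

The scripted module agrees with eager execution, while the traced module silently returns a different result; the warning emitted during tracing is the only hint.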

---

**Benefits of `torch.jit.script` and `torch.jit.trace`**

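1. **Speed**: TorchScript compiles models to an optimized graph, which can speed up inference (especially on CPU); gains vary by model, so measure before relying on them.
2. **Deployment**: Models can be loaded in C++ via LibTorch, without a Python runtime.
3. **Serialization**: A saved `.pt` file contains both code and weights, so it loads with `torch.jit.load` without the original Python class definition.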
4. **Cross-platform**: Useful for mobile (PyTorch Mobile).
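
The serialization point can be checked with a quick round trip. This is a minimal sketch using a small illustrative `nn.Sequential` model and a temporary directory; the file name is arbitrary.

```python
import os
import tempfile

import torch
import torch.nn as nn

source_model = nn.Sequential(nn.Linear(10, 5), nn.ReLU()).eval()
scripted = torch.jit.script(source_model)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pt")
    scripted.save(path)
    # map_location lets a model saved on GPU load on a CPU-only machine
    restored = torch.jit.load(path, map_location="cpu")

sample = torch.randn(4, 10)
with torch.no_grad():
    same_output = torch.allclose(source_model(sample), restored(sample))
print("outputs match after reload:", same_output)
```

The loading side needs only `torch.jit.load`; the Python code that defined the model does not have to be importable.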

---

#### **11.3. Benchmark Example (Speed Comparison)**

```python
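import time

import torch
import torch.nn as nn

# Compare the same model in eager and traced form (CPU timing)
model = nn.Linear(10, 5).eval()
traced_model = torch.jit.trace(model, torch.randn(1, 10))
x = torch.randn(1000, 10)

def benchmark(fn, iters=1000, warmup=10):
    """Time `iters` forward passes after a few warm-up calls."""
    with torch.no_grad():
        for _ in range(warmup):  # TorchScript optimizes during early calls
            fn(x)
        start = time.perf_counter()
        for _ in range(iters):
            fn(x)
    return time.perf_counter() - start

# On GPU, call torch.cuda.synchronize() before reading the clock
print("Eager time:", benchmark(model))
print("TorchScript time:", benchmark(traced_model))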
```
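
Manual timing loops are easy to get subtly wrong. As an alternative sketch, PyTorch ships `torch.utils.benchmark.Timer`, which performs warm-up runs and synchronizes CUDA when needed. The model and variable names below are illustrative, and the measured numbers will differ from machine to machine.

```python
import torch
import torch.nn as nn
from torch.utils import benchmark

bench_model = nn.Linear(10, 5).eval()
bench_traced = torch.jit.trace(bench_model, torch.randn(1, 10))
bench_input = torch.randn(1000, 10)

results = {}
with torch.no_grad():
    for label, fn in [("eager", bench_model), ("torchscript", bench_traced)]:
        timer = benchmark.Timer(
            stmt="fn(x)",
            globals={"fn": fn, "x": bench_input},
        )
        # timeit returns a Measurement; .median is in seconds per call
        results[label] = timer.timeit(200).median
        print(f"{label}: {results[label] * 1e6:.1f} µs per call")
```

For a model this small the two timings are often close, so treat any difference as specific to the model and hardware being measured.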

