
# Utilizing PyTorch's GradScaler for Efficient Mixed Precision Training

## Introduction

In recent years, mixed precision training has emerged as a key technique to accelerate deep learning model training without significantly impacting the model's accuracy. By utilizing both 16-bit (float16) and 32-bit (float32) floating-point arithmetic, mixed precision training reduces memory usage and speeds up computations on modern GPUs. PyTorch's `torch.cuda.amp` (Automatic Mixed Precision) package, particularly the `GradScaler` class, plays a pivotal role in facilitating this process.

## What is Mixed Precision Training?

Mixed precision training leverages the strengths of both float16 and float32 data types.

Float16 operations are faster and require less memory, enabling the training of larger models or increasing batch sizes. However, float16 can lead to issues like underflow and overflow in gradients, compromising training stability. Here's where float32 comes in, maintaining precision where necessary, especially during the calculation of loss and its subsequent gradients.
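As a quick illustration of the trade-off described above, the following sketch compares the storage size and numeric limits of the two formats using `torch.finfo`. The variable names are chosen for this example only.

```python
import torch

# Compare per-element storage and numeric limits of float16 and float32.
for dtype in (torch.float16, torch.float32):
    info = torch.finfo(dtype)
    size = torch.empty(1, dtype=dtype).element_size()
    print(f"{dtype}: {size} bytes/element, max={info.max}, "
          f"smallest normal={info.tiny}, eps={info.eps}")

half_bytes = torch.empty(1, dtype=torch.float16).element_size()
single_bytes = torch.empty(1, dtype=torch.float32).element_size()
```

Halving the bytes per element is where the memory savings come from; the much smaller `max` and larger `eps` of float16 are the source of the overflow and underflow issues discussed next.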

## Understanding Overflow and Underflow Concepts

Understanding underflow and overflow in the context of floating-point arithmetic is crucial, especially when dealing with mixed precision training involving float16 (half precision) and float32 (single precision) formats. These issues are fundamental to why mixed precision training needs careful management, such as what `torch.cuda.amp.GradScaler` provides in PyTorch.

### Overflow

Overflow occurs when a number is too large to be represented in the given floating-point format. Each floating-point format has a maximum limit it can represent. When calculations exceed this limit, the result is typically set to an infinity value (`inf`), which can lead to incorrect calculations or model instability.

#### Example of Overflow

In float16, the largest finite value that can be represented is $65504$. If you multiply two float16 numbers whose product is larger than that, say $32000 \times 3$, the expected result of $96000$ exceeds the maximum representable value in the float16 format.

This results in an overflow, and the operation yields infinity (`inf`) instead of the actual number.

In [4]:
# Multiply two float16 values whose product exceeds the float16 maximum (65504)
import torch

a = torch.tensor(32000, dtype=torch.float16)
b = torch.tensor(3, dtype=torch.float16)
product = a * b  # 96000 is not representable in float16, so the result is inf
product

tensor(inf, dtype=torch.float16)

### Underflow

Underflow occurs when a number is too close to zero to be represented accurately in the given floating-point format. The value first loses precision and, if it is small enough, is rounded to zero. Underflow can significantly impact training by causing gradients to vanish, effectively stopping the model from learning.



#### Example of Underflow

In float16, the smallest positive *normal* number is approximately $6.1 \times 10^{-5}$ ($2^{-14}$). Smaller magnitudes are stored as subnormal numbers, which trade precision for range and reach down to about $6.0 \times 10^{-8}$ ($2^{-24}$). Anything below roughly half of that value is rounded to $0$.

A gradient of $1.2 \times 10^{-5}$ computed during backpropagation therefore lands in the subnormal range: float16 keeps it, but only approximately. A gradient of $10^{-8}$ would be rounded to $0$, and when this happens to many gradients it leads to a vanishing gradient problem.

In [5]:
# A value below the smallest normal float16 number is stored as a subnormal
small_gradient = torch.tensor(1.2e-5, dtype=torch.float16)
# The stored value is the nearest representable subnormal, not exactly 1.2e-5
small_gradient

tensor(1.1981e-05, dtype=torch.float16)

The result is not zero, but it is not exactly $1.2 \times 10^{-5}$ either. Two effects explain this:

- Approximation: subnormal float16 values are spaced $2^{-24}$ apart, so the input is rounded to the nearest multiple, $201 \times 2^{-24} \approx 1.1981 \times 10^{-5}$.

- Gradual underflow: rather than jumping straight to zero below $6.1 \times 10^{-5}$, float16 loses precision step by step until a value is too small even for a subnormal and is rounded to zero.
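To see where float16 values stop being representable, the sketch below casts a few illustrative magnitudes (picked for this example, not taken from a real training run) to float16 and prints what is actually stored.

```python
import torch

# Illustrative magnitudes: normal, subnormal, near the subnormal limit, too small
values = [1e-4, 1.2e-5, 1e-7, 1e-8]
as_half = torch.tensor(values, dtype=torch.float16)

for original, stored in zip(values, as_half.tolist()):
    print(f"{original:.1e} is stored as {stored:.4e}")

# 1e-8 is below half of the smallest float16 subnormal (2**-24), so it becomes 0
underflowed_to_zero = as_half[3].item() == 0.0
```

The first three values survive with decreasing precision, while the last one is lost entirely. This is the case gradient scaling is meant to prevent.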

## The Role of GradScaler

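`torch.cuda.amp.GradScaler` manages the loss scale factor automatically so that float16 gradients stay within a representable range. It multiplies the loss by the current scale before the backward pass, which keeps small gradients from underflowing. If the scaled gradients contain `inf` or `NaN` values, a sign that the scale has become too large, it skips that optimizer step and reduces the scale; after a run of consecutive steps without such values, it increases the scale again.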
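The scale adjustment can be pictured with a small pure-Python model. This is a simplified, hypothetical sketch rather than PyTorch's implementation: the class name `ToyLossScaler` and its `update` method are invented for illustration, while the default values mirror the documented defaults of `GradScaler` (`init_scale=65536.0`, `growth_factor=2.0`, `backoff_factor=0.5`, `growth_interval=2000`).

```python
import math


class ToyLossScaler:
    """Hypothetical, simplified model of dynamic loss scaling (not PyTorch's code)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, scaled_grads):
        """Adjust the scale; return True if the optimizer step should run."""
        if any(not math.isfinite(g) for g in scaled_grads):
            # Overflow detected: skip the step and shrink the scale
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps == self.growth_interval:
            # Enough clean steps in a row: try a larger scale
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return True


scaler_sketch = ToyLossScaler(growth_interval=2)
stepped = [
    scaler_sketch.update([0.1, float("inf")]),  # overflow: step skipped, scale halved
    scaler_sketch.update([0.1, 0.2]),           # clean step
    scaler_sketch.update([0.3, 0.1]),           # second clean step: scale doubled
]
```

With `growth_interval=2`, the scale drops to 32768 after the overflow and returns to 65536 after two clean steps.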

## Mathematical Foundations



The core idea behind `GradScaler` is gradient scaling. Before backward propagation, loss values are scaled up by a factor `S` to prevent underflow. The gradients calculated during backward propagation are thus scaled up, and before the optimizer step, they are scaled back down by the same factor `S`.


This process can be summarized as:

1. Scale up: $\text{Loss}_{\text{scaled}} = \text{Loss} \times S$
2. Backward propagation: compute gradients $\nabla \text{Loss}_{\text{scaled}}$
3. Scale down: $\nabla \text{Loss} = \nabla \text{Loss}_{\text{scaled}} / S$
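The three steps can be checked by hand on a toy problem. The sketch below is a manual illustration in float32, not how `GradScaler` is used in practice; the tensors, the loss, and the scale factor `S = 1024` are assumptions chosen for the example. The last lines show why scaling helps in float16.

```python
import torch

S = 1024.0  # illustrative scale factor (a power of two, so scaling is exact here)

# Toy problem with hypothetical values: loss = (w . x)^2
w = torch.tensor([0.5, -2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])
loss = (w * x).sum() ** 2

# Reference gradient of the unscaled loss
(reference_grad,) = torch.autograd.grad(loss, w, retain_graph=True)

# Steps 1 and 2: scale the loss up, then backpropagate the scaled loss
(loss * S).backward()
scaled_grad = w.grad.clone()

# Step 3: scale the gradients back down before an optimizer would use them
unscaled_grad = scaled_grad / S

# Why it matters in float16: 1e-8 alone rounds to zero, but survives once scaled
tiny = 1e-8
lost = torch.tensor(tiny, dtype=torch.float16)
kept = torch.tensor(tiny * 65536.0, dtype=torch.float16)
recovered = kept.float() / 65536.0  # unscale in float32
```

Because differentiation is linear, scaling the loss by `S` scales every gradient by `S`, so dividing by `S` afterwards recovers the original gradients.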

## Code Implementation

Let's dive into how you can implement mixed precision training with `GradScaler` in PyTorch.

### Defining a Simple Model

First, we define a simple convolutional neural network model suitable for the MNIST dataset:


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from torch.cuda.amp import GradScaler, autocast

class SimpleCNN(nn.Module):
    """Two convolutional layers followed by two fully connected layers."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)
        self.fc1 = nn.Linear(1024, 128)  # 64 channels * 4 * 4 after the second pooling
        self.fc2 = nn.Linear(128, 10)    # one output per digit class

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))  # 28x28 -> 24x24 -> 12x12
        x = F.relu(F.max_pool2d(self.conv2(x), 2))  # 12x12 -> 8x8 -> 4x4
        x = torch.flatten(x, 1)                     # (batch, 1024)
        x = F.relu(self.fc1(x))
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally
        return self.fc2(x)
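The input size of 1024 for the first fully connected layer follows from the layer shapes. As a sanity check, this sketch pushes a dummy MNIST-sized tensor through stand-alone copies of the two convolution and pooling stages; the names `conv1_check` and `conv2_check` are used only here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-alone copies of the two convolutional layers, for shape checking only
conv1_check = nn.Conv2d(1, 32, kernel_size=5)
conv2_check = nn.Conv2d(32, 64, kernel_size=5)

dummy = torch.zeros(1, 1, 28, 28)                   # one MNIST-sized grayscale image
features = F.max_pool2d(conv1_check(dummy), 2)      # 28 -> 24 (conv) -> 12 (pool)
features = F.max_pool2d(conv2_check(features), 2)   # 12 -> 8 (conv) -> 4 (pool)

flat_features = features.flatten(1).shape[1]        # 64 channels * 4 * 4
print(tuple(features.shape), flat_features)
```

If the input resolution or kernel sizes change, the same check gives the new value to use in place of 1024.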

### Preparing the Dataset


Next, let's load the MNIST dataset and prepare data loaders:

In [2]:
# MNIST Dataset
train_dataset = datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True)
test_dataset = datasets.MNIST(root='./data', train=False, transform=transforms.ToTensor())

# Data loader
train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=64, shuffle=False)




### Training the Model with Mixed Precision

Finally, we incorporate the mixed precision training into the training loop:

In [None]:
# Model, optimizer, and loss function setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # float16 autocast and gradient scaling need a CUDA GPU
model = SimpleCNN().to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

# Initialize GradScaler (it does nothing when enabled=False)
scaler = GradScaler(enabled=use_amp)

# Training loop
for epoch in range(1):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()

        # Run the forward pass and loss computation in mixed precision
        with autocast(enabled=use_amp):
            output = model(data)
            loss = loss_fn(output, target)

        # Scale the loss and call backward() on it to create scaled gradients
        scaler.scale(loss).backward()

        # Unscale the gradients and step the optimizer; the step is skipped
        # if the gradients contain inf or NaN values
        scaler.step(optimizer)

        # Update the scale factor for the next iteration
        scaler.update()

        if batch_idx % 100 == 0:
            print(f'Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)} ({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}')

In this code snippet, `autocast` automatically runs operations in float16 where it is safe and beneficial and keeps the rest in float32, while `GradScaler` manages the scaling of the loss and gradients to prevent underflow.

The speed and memory gains from mixed precision training vary with the model and the hardware, so always monitor the model's performance and adjust the training setup as needed.
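One quantity that may be worth watching during training is the current loss scale, which `GradScaler.get_scale()` returns. The sketch below is a self-contained toy loop, assuming a tiny linear model and random data invented for illustration. It falls back to disabled scaling on machines without CUDA, where the reported scale is simply 1.0. On recent PyTorch releases the `torch.cuda.amp` imports may emit a deprecation warning.

```python
import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler, autocast

use_amp = torch.cuda.is_available()
device = torch.device("cuda" if use_amp else "cpu")

# Hypothetical toy model and random data, for illustration only
toy_model = nn.Linear(4, 2).to(device)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
toy_scaler = GradScaler(enabled=use_amp)

inputs = torch.randn(8, 4, device=device)
labels = torch.randint(0, 2, (8,), device=device)

scale_history = []
for step in range(3):
    toy_optimizer.zero_grad()
    with autocast(enabled=use_amp):
        toy_loss = nn.functional.cross_entropy(toy_model(inputs), labels)
    toy_scaler.scale(toy_loss).backward()
    toy_scaler.step(toy_optimizer)
    toy_scaler.update()
    # Record the scale after each update to spot repeated backoffs
    scale_history.append(toy_scaler.get_scale())

print(scale_history)
```

A scale that keeps shrinking suggests that gradients overflow repeatedly, which is a hint to inspect the model or the learning rate.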

## Conclusion


Mixed precision training is a powerful technique for making the training of deep learning models faster and more memory-efficient. With `torch.cuda.amp.GradScaler`, PyTorch provides an accessible way to use it, so that models can often be trained faster without a significant loss of accuracy. Because the benefit depends on the model and the hardware, experimentation is key.