# Automatic Mixed Precision

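Mixed-precision computation means that, during neural-network training and inference, some operators run in `torch.float32` while others run in a lower-precision dtype such as `torch.float16` or `torch.bfloat16`. Some operators, such as `linear` and `conv`, are much faster in `float16` and `bfloat16`. Others, such as reduction operators, often need a larger dynamic range and therefore stay in `float32`.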
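To make the point about reductions concrete, here is a minimal sketch that emulates half-precision accumulation with the standard library's `struct` module (format `"e"` is IEEE 754 half precision). The helper names `to_fp16` and `sum_in_fp16` are illustrative and not part of PyTorch, and real GPU kernels may accumulate internally in a wider type; the sketch only shows what happens if the running sum itself is kept in `float16`.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_in_fp16(values):
    """Accumulate in half precision: the running sum is rounded after every add."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total


ones = [1.0] * 4096
fp16_sum = sum_in_fp16(ones)  # stalls once neighbouring fp16 values are more than 1 apart
wide_sum = sum(ones)          # Python floats (float64) stand in for a wider accumulator
print(fp16_sum, wide_sum)     # 2048.0 4096.0
```

Above 2048, adjacent `float16` values are 2 apart, so adding 1 no longer changes the running sum. This is one reason reduction-style ops are kept in `float32`.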

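Automatic mixed precision (AMP) picks a suitable numerical precision for each op automatically, which can reduce GPU memory use at run time and improve throughput.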

In AMP training, `torch.autocast` and `torch.cuda.amp.GradScaler` are usually used together.

Note: mixed precision is most effective on GPUs with Tensor Cores (Volta, Turing, Ampere). On older architectures the speedup is modest.
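A quick way to check whether the current GPU falls into that group is to look at its CUDA compute capability: Volta is 7.0, Turing 7.5, and Ampere 8.x. The sketch below assumes that "compute capability 7.0 or newer" is a good enough proxy for "has Tensor Cores". The helper names `has_tensor_cores` and `describe_current_gpu` are hypothetical.

```python
def has_tensor_cores(capability):
    """Return True if a CUDA compute capability (major, minor) is Volta (7.0) or newer."""
    major, _minor = capability
    return major >= 7


def describe_current_gpu():
    """Report whether the current GPU is likely to benefit from AMP."""
    import torch  # imported lazily so has_tensor_cores() works without PyTorch

    if not torch.cuda.is_available():
        return "No CUDA device available."
    capability = torch.cuda.get_device_capability()
    name = torch.cuda.get_device_name()
    verdict = "has" if has_tensor_cores(capability) else "does not have"
    return f"{name} (compute capability {capability[0]}.{capability[1]}) {verdict} Tensor Cores."
```

Calling `print(describe_current_gpu())` before a benchmark gives a hint about how much speedup to expect.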

In [2]:
import torch, time, gc

start_time = None


def start_timer():
    global start_time
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()  # replaces the deprecated reset_max_memory_allocated()
    torch.cuda.synchronize()
    start_time = time.time()


def end_timer_and_print(local_msg):
    torch.cuda.synchronize()
    end_time = time.time()
    print("\n" + local_msg)
    print(f"Total execution time = {(end_time - start_time):.3f}")
    print(f"Max memory used by tensors = {torch.cuda.max_memory_allocated()} bytes")

In [3]:
def make_model(in_size, out_size, num_layers):
    layers = []
    for _ in range(num_layers - 1):
        layers.append(torch.nn.Linear(in_size, in_size))
        layers.append(torch.nn.ReLU())
    layers.append(torch.nn.Linear(in_size, out_size))
    return torch.nn.Sequential(*tuple(layers)).cuda()

In [4]:
batch_size = 512  # Try, for example, 128, 256, 513.
in_size = 4096
out_size = 4096
num_layers = 3
num_batches = 50
epochs = 3

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.set_default_device(device)

# Creates data in default precision.
# The same data is used for both default and mixed precision trials below.
# You don't need to manually change inputs' ``dtype`` when enabling mixed precision.
data = [torch.randn(batch_size, in_size) for _ in range(num_batches)]
targets = [torch.randn(batch_size, out_size) for _ in range(num_batches)]

loss_fn = torch.nn.MSELoss().cuda()

# Default Precision

In [5]:
net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        output = net(input)
        loss = loss_fn(output, target)
        loss.backward()
        opt.step()
        opt.zero_grad()  # set_to_none=True here can modestly improve performance
end_timer_and_print("Default precision:")




Default precision:
Total execution time = 9.405
Max memory used by tensors = 1283817984 bytes


# Adding `torch.autocast`

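Note that in the code below, `backward()` is called outside the `autocast` region. Running the backward pass under `autocast` is not recommended: the backward ops run in the same dtype that `autocast` chose for the corresponding forward ops.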

In [6]:
start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.autocast(device_type=device, dtype=torch.float16):
            output = net(input)
            assert output.dtype is torch.float16

            loss = loss_fn(output, target)
            assert loss.dtype is torch.float32
        loss.backward()
        opt.step()
        opt.zero_grad()  # set_to_none=True here can modestly improve performance
end_timer_and_print("With autocast:")


With autocast:
Total execution time = 2.712
Max memory used by tensors = 1304781312 bytes
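
The dtype assertions in the loop above need a CUDA device. The same per-op behaviour can be observed on CPU with a minimal sketch, assuming a PyTorch build whose CPU autocast supports `torch.bfloat16`. The layer sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(8, 4)
x = torch.randn(2, 8)

# Inside the autocast region, linear runs in the lower-precision dtype.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = layer(x)

# Inputs and parameters keep their original dtype; only the op's output changes.
print(x.dtype, layer.weight.dtype, out.dtype)
```

The parameters stay in `float32`. `autocast` casts per op at call time, which is why the inputs' `dtype` never has to be changed by hand.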


# Adding `GradScaler`

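Gradient scaling helps prevent small-magnitude gradients from flushing to zero (underflowing) during mixed-precision training. With `GradScaler`, the loss is multiplied by a scale factor after it is computed and before `backward()`, so the gradients flowing backward are scaled by the same factor and stay in a representable range.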
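
The effect can be illustrated without a GPU by emulating `float16` rounding with the standard library's `struct` module. This is a minimal sketch: `to_fp16` is an illustrative helper, the gradient value is hypothetical, and the scale `2**16` is the documented default `init_scale` of `GradScaler`.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


SCALE = 2.0 ** 16   # GradScaler's default init_scale
true_grad = 1e-8    # hypothetical gradient, below fp16's smallest subnormal (about 6e-8)

unscaled = to_fp16(true_grad)        # flushes to zero: the update would be lost
scaled = to_fp16(true_grad * SCALE)  # scaling the loss scales the gradient by the same factor
recovered = scaled / SCALE           # unscaling happens in higher precision

print(unscaled, scaled, recovered)
```

Without scaling the gradient becomes exactly `0.0`. With scaling it survives the `float16` round trip and is recovered to within `float16` rounding error.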

In [7]:
# Constructs a ``scaler`` once, at the beginning of the convergence run, using default arguments.
# If your network fails to converge with default ``GradScaler`` arguments, please file an issue.
# The same ``GradScaler`` instance should be used for the entire convergence run.
# If you perform multiple convergence runs in the same script, each run should use
# a dedicated fresh ``GradScaler`` instance. ``GradScaler`` instances are lightweight.
scaler = torch.cuda.amp.GradScaler()

for epoch in range(0):  # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        with torch.autocast(device_type=device, dtype=torch.float16):
            output = net(input)
            loss = loss_fn(output, target)

        # Scales loss. Calls ``backward()`` on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()

        # ``scaler.step()`` first unscales the gradients of the optimizer's assigned parameters.
        # If these gradients do not contain ``inf``s or ``NaN``s, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(opt)

        # Updates the scale for next iteration.
        scaler.update()

        opt.zero_grad()  # set_to_none=True here can modestly improve performance
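
The skip-and-update behaviour described in the comments above can be sketched as a small pure-Python class. This is a simplified model, not PyTorch's implementation: `ToyLossScale` is a hypothetical name, and only the default values (`init_scale=2**16`, `growth_factor=2.0`, `backoff_factor=0.5`, `growth_interval=2000`) are taken from the `GradScaler` documentation.

```python
class ToyLossScale:
    """Simplified model of dynamic loss scaling (not PyTorch's implementation)."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0  # consecutive steps without inf/NaN gradients

    def update(self, found_inf: bool) -> bool:
        """Adjust the scale; return True if the optimizer step should run."""
        if found_inf:
            # Overflow: skip the step and shrink the scale.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps == self.growth_interval:
            # A long run of clean steps: try a larger scale.
            self.scale *= self.growth_factor
            self.good_steps = 0
        return True


toy = ToyLossScale(growth_interval=3)  # short interval so the growth is visible
history = []
for found_inf in [False, False, False, True, False]:
    stepped = toy.update(found_inf)
    history.append((stepped, toy.scale))
print(history)
```

In this run the scale doubles after three clean steps, then halves again when the fourth step reports an overflow and is skipped.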

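In the code above, after `scaler.scale(loss).backward()` all parameter gradients are scaled. If you need to modify or inspect the gradients between `backward()` and `scaler.step(optimizer)`, unscale them first with `scaler.unscale_(optimizer)`.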

In [9]:
for epoch in range(0):  # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        with torch.autocast(device_type=device, dtype=torch.float16):
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned parameters in-place
        scaler.unscale_(opt)

        # Since the gradients of optimizer's assigned parameters are now unscaled, clips as usual.
        # You may use the same value for max_norm here as you would without gradient scaling.
        torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=0.1)

        scaler.step(opt)
        scaler.update()
        opt.zero_grad()
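
To see why the order matters, the sketch below compares clipping after unscaling with clipping the still-scaled gradients. It is pure Python with hypothetical gradient values. `clip_by_norm` is an illustrative helper that mirrors the usual "multiply by `max_norm / (norm + eps)`, capped at 1" rule, not the PyTorch function itself.

```python
import math


def clip_by_norm(grads, max_norm):
    """Scale grads so their L2 norm is at most max_norm (no-op if already smaller)."""
    norm = math.sqrt(sum(g * g for g in grads))
    factor = min(1.0, max_norm / (norm + 1e-6))
    return [g * factor for g in grads]


SCALE = 2.0 ** 16
max_norm = 0.1
true_grads = [0.03, 0.04]                       # hypothetical gradients, L2 norm 0.05
scaled_grads = [g * SCALE for g in true_grads]  # what backward() on the scaled loss produces

# Correct order: unscale, then clip. The norm is under the threshold, so nothing changes.
right = clip_by_norm([g / SCALE for g in scaled_grads], max_norm)

# Wrong order: clip the scaled gradients, then unscale.
# The threshold effectively becomes max_norm / SCALE.
wrong = [g / SCALE for g in clip_by_norm(scaled_grads, max_norm)]

print(right, wrong)
```

With the wrong order the gradients are shrunk by several orders of magnitude even though their true norm was already below `max_norm`.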

# Use `enabled`

The `enabled` argument of `torch.autocast` and `GradScaler` lets you switch between default precision and mixed precision without changing the rest of the training loop.

In [8]:
use_amp = True

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.autocast(device_type=device, dtype=torch.float16, enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad()  # set_to_none=True here can modestly improve performance
end_timer_and_print("Mixed precision:")


Mixed precision:
Total execution time = 3.113
Max memory used by tensors = 1409672192 bytes


# Saving and restoring `GradScaler` state

To save and resume an AMP-enabled training run, the `GradScaler` state must be checkpointed together with the model and optimizer state.

In [12]:
checkpoint = {
    "model": net.state_dict(),
    "optimizer": opt.state_dict(),
    "scaler": scaler.state_dict(),
}
torch.save(checkpoint, "filename")

In [13]:
dev = torch.cuda.current_device()
checkpoint = torch.load("filename", map_location=lambda storage, loc: storage.cuda(dev))

net.load_state_dict(checkpoint["model"])
opt.load_state_dict(checkpoint["optimizer"])
scaler.load_state_dict(checkpoint["scaler"])

# Advanced topics

[Automatic Mixed Precision Examples](https://pytorch.org/docs/stable/notes/amp_examples.html)

* Gradient accumulation
* Gradient penalty/double backward
* Networks with multiple models, optimizers, or losses
* Multiple GPUs (torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel)
* Custom autograd functions (subclasses of torch.autograd.Function)