# Chapter 13: Computational Performance

In deep learning, datasets and models are usually large, so training involves heavy computation and computational performance matters a lot. This chapter focuses on the major factors that affect computational performance: imperative programming, symbolic programming, asynchronous computing, automatic parallelism, and multi-GPU computation.

---
## 13.1 Compilers and Interpreters

So far, this book has focused on imperative programming, which makes use of statements such as `print`, `+`, and `if` to change a program's state.

In [None]:
def add(a, b):
    return a + b

def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g

print(fancy_func(1, 2, 3, 4))

### Symbolic Programming

Consider the alternative, *symbolic programming*, where computation is usually performed only once the process has been fully defined.

> 🔑 **KEY INSIGHT**: Symbolic programming allows for significant optimization - the compiler can optimize and rewrite code like `print((1 + 2) + (3 + 4))` into `print(10)` since it sees the full code before execution.

In [None]:
def add_():
    return '''
def add(a, b):
    return a + b
'''

def fancy_func_():
    return '''
def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g
'''

def evoke_():
    return add_() + fancy_func_() + 'print(fancy_func(1, 2, 3, 4))'

prog = evoke_()
print(prog)
y = compile(prog, '', 'exec')
exec(y)
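
The rewrite of `print((1 + 2) + (3 + 4))` into `print(10)` mentioned above can be sketched with Python's standard `ast` module. This is only a minimal illustration of constant folding, assuming Python 3.9 or later (for `ast.unparse`); the names `ConstantFolder`, `FOLDABLE_OPS`, and `fold_constants` are hypothetical helpers, not part of any framework discussed in this chapter.

```python
import ast
import operator

# Arithmetic operators this sketch knows how to evaluate ahead of time.
FOLDABLE_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
                ast.Mult: operator.mul}

class ConstantFolder(ast.NodeTransformer):
    """Replace `constant <op> constant` with the computed constant."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # Fold the operands first (bottom-up)
        op = FOLDABLE_OPS.get(type(node.op))
        if (op is not None
                and isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and isinstance(node.left.value, (int, float))
                and isinstance(node.right.value, (int, float))):
            folded = ast.Constant(value=op(node.left.value, node.right.value))
            return ast.copy_location(folded, node)
        return node

def fold_constants(source):
    """Parse `source`, fold constant arithmetic, and return new source."""
    tree = ConstantFolder().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

folded_source = fold_constants('print((1 + 2) + (3 + 4))')
print(folded_source)                      # The rewritten program text
exec(compile(folded_source, '', 'exec'))  # Runs the rewritten program
```

The folding is possible only because the whole expression is available as a data structure before anything runs, which is the property that symbolic programming relies on.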

### Hybrid Programming

PyTorch is based on imperative programming and uses dynamic computation graphs. TorchScript lets users develop and debug using pure imperative programming, while having the ability to convert most programs into symbolic programs for product-level computing performance and deployment.

### Hybridizing the Sequential Class

In [None]:
from d2l import torch as d2l
import torch
from torch import nn

# Factory for networks
def get_net():
    net = nn.Sequential(nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 2))
    return net

x = torch.randn(size=(1, 512))
net = get_net()
net(x)

By converting the model with the `torch.jit.script` function, we are able to compile and optimize the computation in the MLP.

In [None]:
net = torch.jit.script(net)
net(x)

### Acceleration by Hybridization

> 🔑 **KEY INSIGHT**: The `Benchmark` context manager is a reusable pattern for measuring execution time throughout deep learning experiments.

In [None]:
#@save
class Benchmark:
    """For measuring running time."""
    def __init__(self, description='Done'):
        self.description = description

    def __enter__(self):
        self.timer = d2l.Timer()
        return self

    def __exit__(self, *args):
        print(f'{self.description}: {self.timer.stop():.4f} sec')
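
If `d2l` is not installed, the same pattern can be approximated with the standard library alone. The sketch below is an assumption-laden stand-in, not the book's implementation: `SimpleBenchmark` is a hypothetical name, and it uses `time.perf_counter` in place of `d2l.Timer`.

```python
import time

class SimpleBenchmark:
    """Dependency-free timing context manager (hypothetical helper)."""
    def __init__(self, description='Done'):
        self.description = description
        self.elapsed = None  # Filled in when the `with` block exits

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        self.elapsed = time.perf_counter() - self.start
        print(f'{self.description}: {self.elapsed:.4f} sec')

# Usage: time an arbitrary block of code
with SimpleBenchmark('Sum of squares') as bench:
    total = sum(i * i for i in range(100_000))
```

Keeping the measured duration in `elapsed` makes it possible to compare runs programmatically instead of only reading the printed line.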

Now we can invoke the network twice, once without and once with TorchScript.

In [None]:
net = get_net()
with Benchmark('Without torchscript'):
    for i in range(1000): net(x)

net = torch.jit.script(net)
with Benchmark('With torchscript'):
    for i in range(1000): net(x)

### Serialization

One of the benefits of compiling the models is that we can serialize (save) the model and its parameters to disk.

In [None]:
net.save('my_mlp')
!ls -lh my_mlp*

---
## 13.2 Asynchronous Computation

Today's computers are highly parallel systems, consisting of multiple CPU cores, multiple processing elements per GPU, and often multiple GPUs per device.

> 🔑 **KEY INSIGHT**: By default, GPU operations are asynchronous in PyTorch. When you call a function that uses the GPU, the operations are enqueued but not necessarily executed until later. This allows executing more computations in parallel.

In [None]:
from d2l import torch as d2l
import numpy, os, subprocess
import torch
from torch import nn

### Asynchrony via Backend

As a warm-up, consider the following toy problem: we want to generate a random matrix and multiply it by itself.

In [None]:
# Warmup for GPU computation
device = d2l.try_gpu()
a = torch.randn(size=(1000, 1000), device=device)
b = torch.mm(a, a)

with d2l.Benchmark('numpy'):
    for _ in range(10):
        a = numpy.random.normal(size=(1000, 1000))
        b = numpy.dot(a, a)

with d2l.Benchmark('torch'):
    for _ in range(10):
        a = torch.randn(size=(1000, 1000), device=device)
        b = torch.mm(a, a)

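The PyTorch benchmark appears to be orders of magnitude faster. Part of the gap is an artifact of asynchrony: when `device` is a GPU, each call only enqueues work and returns before the result is ready, so the timer stops too early. Forcing PyTorch to finish all computation before returning shows what actually happened.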

In [None]:
with d2l.Benchmark():
    for _ in range(10):
        a = torch.randn(size=(1000, 1000), device=device)
        b = torch.mm(a, a)
    torch.cuda.synchronize(device)

Let's look at another toy example to understand the dependency graph a bit better.

In [None]:
x = torch.ones((1, 2), device=device)
y = torch.ones((1, 2), device=device)
z = x * y + 2
z
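
The enqueue-then-wait behavior described in this section can be imitated on the CPU without a GPU. The following sketch is an analogy under stated assumptions, not how PyTorch is implemented: a single-worker `ThreadPoolExecutor` stands in for the device's work queue, the hypothetical function `slow_square` stands in for a kernel, and waiting on the futures plays the role of `torch.cuda.synchronize`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_square(value):
    """Stand-in for a device kernel that takes a while to finish."""
    time.sleep(0.05)
    return value * value

# A single worker plays the role of the device's in-order work queue.
with ThreadPoolExecutor(max_workers=1) as backend:
    start = time.perf_counter()
    futures = [backend.submit(slow_square, v) for v in range(4)]
    enqueue_time = time.perf_counter() - start  # Submitting returns quickly

    results = [f.result() for f in futures]     # Blocks until work is done
    total_time = time.perf_counter() - start

print(f'enqueue: {enqueue_time:.4f} sec, until results: {total_time:.4f} sec')
print(results)
```

Timing only the submission loop would report a misleadingly small number, which is the same effect as benchmarking GPU code without synchronizing first.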

---
## 13.3 Automatic Parallelism

Deep learning frameworks automatically construct computational graphs at the backend. Using a computational graph, the system is aware of all the dependencies, and can selectively execute multiple non-interdependent tasks in parallel to improve speed.

> 🔑 **KEY INSIGHT**: Automatic parallelism works because the framework tracks dependencies in the computational graph - operations without dependencies on each other can run simultaneously on different devices.

In [None]:
from d2l import torch as d2l
import torch
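
To make the dependency tracking described above concrete, here is a small sketch that groups operations into "waves" that could run at the same time. It uses `graphlib.TopologicalSorter` from the standard library (Python 3.9 or later) as a stand-in for a framework scheduler; the operation labels are illustrative assumptions, and real backends schedule work quite differently in detail.

```python
from graphlib import TopologicalSorter

# Each key lists the operations that must finish before it can start.
# The labels are illustrative only, not framework identifiers.
dependencies = {
    'load A': set(),
    'load B': set(),
    'A @ A': {'load A'},
    'B @ B': {'load B'},
    'combine': {'A @ A', 'B @ B'},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()
waves = []
while sorter.is_active():
    # Everything returned together has no unmet dependencies, so the
    # operations within one wave are independent of each other.
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

for i, wave in enumerate(waves):
    print(f'wave {i}: {wave}')
```

The two matrix products land in the same wave because neither depends on the other, which is exactly the situation in which work on two devices can overlap.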

### Parallel Computation on GPUs

In [None]:
devices = d2l.try_all_gpus()
def run(x):
    return [x.mm(x) for _ in range(50)]

x_gpu1 = torch.rand(size=(4000, 4000), device=devices[0])
x_gpu2 = torch.rand(size=(4000, 4000), device=devices[1])

In [None]:
run(x_gpu1)
run(x_gpu2)  # Warm-up all devices
torch.cuda.synchronize(devices[0])
torch.cuda.synchronize(devices[1])

with d2l.Benchmark('GPU1 time'):
    run(x_gpu1)
    torch.cuda.synchronize(devices[0])

with d2l.Benchmark('GPU2 time'):
    run(x_gpu2)
    torch.cuda.synchronize(devices[1])

If we remove the `synchronize` statement between the two tasks, the system is free to parallelize computation on both devices automatically.

In [None]:
with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)
    torch.cuda.synchronize()

### Parallel Computation and Communication

In many cases we need to move data between different devices, say between the CPU and GPU, or between different GPUs.

In [None]:
def copy_to_cpu(x, non_blocking=False):
    return [y.to('cpu', non_blocking=non_blocking) for y in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    torch.cuda.synchronize()

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    torch.cuda.synchronize()

Setting `non_blocking=True` allows us to overlap computation and communication.

In [None]:
with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y, True)
    torch.cuda.synchronize()

---
## 13.4 Hardware

Building systems with great performance requires a good understanding of the algorithms and models to capture the statistical aspects of the problem. At the same time it is also indispensable to have at least a modicum of knowledge of the underlying hardware.

> 🔑 **KEY INSIGHT**: A good hardware-aware design can easily make a difference of an order of magnitude - this can mean the difference between training a network in a week versus 3 months.

### Key Latency Numbers

| Action | Time |
|--------|------|
| L1 cache reference | 1.5 ns |
| L2 cache reference | 5 ns |
| L3 cache hit | 16-40 ns |
| Main memory reference | 46-120 ns |
| NVMe SSD random read | 120 μs |
| GPU Global Memory access | 200 ns |
| Transfer 1MB via NVLink | 30 μs |
| Transfer 1MB via PCIe | 80 μs |

> 🔑 **KEY INSIGHT**: Memory hierarchy matters enormously - by the numbers in the table above, an L1 cache reference is roughly 30x to 80x faster than a main memory reference. Design algorithms to maximize cache utilization through sequential access patterns.
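
A few lines of arithmetic help put the table in perspective. The sketch below only reuses numbers from the table; the helper `implied_gb_per_s` is a hypothetical name, and it assumes 1 MB means 10^6 bytes and ignores any fixed per-transfer latency.

```python
# Latencies taken from the table above.
L1_NS = 1.5
MAIN_MEMORY_NS = (46, 120)  # (low, high) end of the listed range
NVLINK_1MB_US = 30
PCIE_1MB_US = 80

# How many L1 references fit into one main memory reference?
ratio_low = MAIN_MEMORY_NS[0] / L1_NS
ratio_high = MAIN_MEMORY_NS[1] / L1_NS
print(f'main memory vs. L1: {ratio_low:.0f}x to {ratio_high:.0f}x slower')

def implied_gb_per_s(transfer_us, megabytes=1):
    """Effective bandwidth implied by moving `megabytes` in `transfer_us`."""
    bytes_moved = megabytes * 1e6  # Assumes 1 MB = 10^6 bytes
    return bytes_moved / (transfer_us * 1e-6) / 1e9

nvlink_bw = implied_gb_per_s(NVLINK_1MB_US)
pcie_bw = implied_gb_per_s(PCIE_1MB_US)
print(f'NVLink: {nvlink_bw:.1f} GB/s, PCIe: {pcie_bw:.1f} GB/s')
```

The implied bandwidths are effective values for a 1 MB transfer, so they are expected to sit below the peak figures that vendors quote for each interconnect.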

---
## 13.5 Training on Multiple GPUs

We discuss how to actually parallelize deep learning training across multiple GPUs.

> 🔑 **KEY INSIGHT**: There are three main parallelization strategies: (1) network partitioning across GPUs, (2) layer-wise partitioning, and (3) data parallelism. Data parallelism is the most practical - each GPU processes different data with the same model, then gradients are aggregated.

In [None]:
%matplotlib inline
from d2l import torch as d2l
import torch
from torch import nn
from torch.nn import functional as F

### A Toy Network

We use LeNet (with slight modifications) defined from scratch to illustrate parameter exchange and synchronization in detail.

In [None]:
# Initialize model parameters
scale = 0.01
W1 = torch.randn(size=(20, 1, 3, 3)) * scale
b1 = torch.zeros(20)
W2 = torch.randn(size=(50, 20, 5, 5)) * scale
b2 = torch.zeros(50)
W3 = torch.randn(size=(800, 128)) * scale
b3 = torch.zeros(128)
W4 = torch.randn(size=(128, 10)) * scale
b4 = torch.zeros(10)
params = [W1, b1, W2, b2, W3, b3, W4, b4]

# Define the model
def lenet(X, params):
    h1_conv = F.conv2d(input=X, weight=params[0], bias=params[1])
    h1_activation = F.relu(h1_conv)
    h1 = F.avg_pool2d(input=h1_activation, kernel_size=(2, 2), stride=(2, 2))
    h2_conv = F.conv2d(input=h1, weight=params[2], bias=params[3])
    h2_activation = F.relu(h2_conv)
    h2 = F.avg_pool2d(input=h2_activation, kernel_size=(2, 2), stride=(2, 2))
    h2 = h2.reshape(h2.shape[0], -1)
    h3_linear = torch.mm(h2, params[4]) + params[5]
    h3 = F.relu(h3_linear)
    y_hat = torch.mm(h3, params[6]) + params[7]
    return y_hat

# Cross-entropy loss function
loss = nn.CrossEntropyLoss(reduction='none')
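
The first dimension of `W3`, namely 800, follows from the shapes of the preceding layers. The sketch below traces the spatial size through the network, assuming 28x28 single-channel inputs (as in Fashion-MNIST, which the training code later in this section loads) and no padding; `conv_out` and `pool_out` are hypothetical helper names.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Output width/height of a pooling layer along one spatial dimension."""
    return (size - kernel) // stride + 1

size = 28                 # Assumption: 28x28 input images
size = conv_out(size, 3)  # 3x3 convolution (W1): 28 -> 26
size = pool_out(size)     # 2x2 average pooling:  26 -> 13
size = conv_out(size, 5)  # 5x5 convolution (W2): 13 -> 9
size = pool_out(size)     # 2x2 average pooling:  9 -> 4

out_channels = 50         # Number of output channels of W2
flattened = out_channels * size * size
print(flattened)
```

With a different input resolution the flattened size changes, so `W3` would have to be resized accordingly.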

### Data Synchronization

For efficient multi-GPU training we need two basic operations: distributing a list of parameters to multiple devices, and summing values (here, gradients) across multiple devices, an operation known as allreduce.

In [None]:
def get_params(params, device):
    new_params = [p.to(device) for p in params]
    for p in new_params:
        p.requires_grad_()
    return new_params

In [None]:
new_params = get_params(params, d2l.try_gpu(0))
print('b1 weight:', new_params[1])
print('b1 grad:', new_params[1].grad)

The `allreduce` function adds up all vectors and broadcasts the result back to all GPUs.

In [None]:
def allreduce(data):
    for i in range(1, len(data)):
        data[0][:] += data[i].to(data[0].device)
    for i in range(1, len(data)):
        data[i][:] = data[0].to(data[i].device)

In [None]:
data = [torch.ones((1, 2), device=d2l.try_gpu(i)) * (i + 1) for i in range(2)]
print('before allreduce:\n', data[0], '\n', data[1])
allreduce(data)
print('after allreduce:\n', data[0], '\n', data[1])
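
The semantics of `allreduce` can be checked without any GPU by applying the same two loops to plain Python lists. This is a minimal sketch for intuition only; `allreduce_lists` is a hypothetical helper and no device transfers are involved.

```python
def allreduce_lists(data):
    """In-place allreduce over a list of equal-length Python lists."""
    root = data[0]
    # Step 1: accumulate every other list into the first one.
    for other in data[1:]:
        for j, value in enumerate(other):
            root[j] += value
    # Step 2: copy the accumulated values back into every other list.
    for other in data[1:]:
        other[:] = root

data_cpu = [[1.0, 1.0], [2.0, 2.0]]
print('before allreduce:', data_cpu)
allreduce_lists(data_cpu)
print('after allreduce:', data_cpu)
```

After the call every list holds the elementwise sum while remaining a separate object, mirroring how each GPU keeps its own copy of the reduced tensor.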

### Distributing Data

In [None]:
data = torch.arange(20).reshape(4, 5)
devices = [torch.device('cuda:0'), torch.device('cuda:1')]
split = nn.parallel.scatter(data, devices)
print('input :', data)
print('load into', devices)
print('output:', split)

In [None]:
#@save
def split_batch(X, y, devices):
    """Split `X` and `y` into multiple devices."""
    assert X.shape[0] == y.shape[0]
    return (nn.parallel.scatter(X, devices),
            nn.parallel.scatter(y, devices))
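
For intuition about what splitting a minibatch along its first dimension looks like, the following sketch does the same thing with nested Python lists. It is an assumption that chunks are sized by rounding up, in the spirit of `torch.chunk`; `split_evenly` is a hypothetical helper and not the implementation behind `nn.parallel.scatter`.

```python
import math

def split_evenly(rows, num_devices):
    """Split `rows` into at most `num_devices` contiguous chunks."""
    if not rows:
        return []
    chunk = math.ceil(len(rows) / num_devices)  # Rows per device, rounded up
    return [rows[i:i + chunk] for i in range(0, len(rows), chunk)]

# Same layout as torch.arange(20).reshape(4, 5), but as nested lists
rows = [list(range(i * 5, i * 5 + 5)) for i in range(4)]
shards = split_evenly(rows, 2)
for device_index, shard in enumerate(shards):
    print(f'device {device_index}:', shard)
```

When the batch size is not divisible by the number of devices, the last shard is smaller, which is one reason to choose batch sizes that are multiples of the GPU count.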

### Training

> 🔑 **KEY INSIGHT**: In data parallel training, the computational graph has no dependencies across devices within a minibatch, so it executes in parallel automatically. Synchronization only happens when aggregating gradients.

In [None]:
def train_batch(X, y, device_params, devices, lr):
    X_shards, y_shards = split_batch(X, y, devices)
    # Loss is calculated separately on each GPU
    ls = [loss(lenet(X_shard, device_W), y_shard).sum()
          for X_shard, y_shard, device_W in zip(
              X_shards, y_shards, device_params)]
    for l in ls:  # Backpropagation is performed separately on each GPU
        l.backward()
    # Sum all gradients from each GPU and broadcast them to all GPUs
    with torch.no_grad():
        for i in range(len(device_params[0])):
            allreduce([device_params[c][i].grad for c in range(len(devices))])
    # The model parameters are updated separately on each GPU
    for param in device_params:
        d2l.sgd(param, lr, X.shape[0]) # Here, we use a full-size batch
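
Why is the update scaled by the full batch size `X.shape[0]` rather than by the shard size? Because each shard's loss is summed, the allreduced gradient equals the gradient of the summed loss over the whole minibatch. The sketch below checks this for a one-parameter least-squares model in pure Python; the model, the data, and the helper `grad_sum` are illustrative assumptions.

```python
import math

def grad_sum(w, xs, ys):
    """Gradient of sum_i (w * x_i - y_i)^2 with respect to the scalar w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

w, lr = 0.5, 0.1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Single device: gradient over the full minibatch
full_grad = grad_sum(w, xs, ys)

# Two devices: each computes the gradient on its own shard, then allreduce
shard_grads = [grad_sum(w, xs[:2], ys[:2]), grad_sum(w, xs[2:], ys[2:])]
reduced_grad = sum(shard_grads)

# Both variants divide by the full batch size, as in the update above
w_single = w - lr * full_grad / len(xs)
w_multi = w - lr * reduced_grad / len(xs)
print(full_grad, reduced_grad, w_single, w_multi)
```

Since every device applies the same reduced gradient to identical parameters, the parameter copies stay in sync without being exchanged themselves.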

In [None]:
def train(num_gpus, batch_size, lr):
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    devices = [d2l.try_gpu(i) for i in range(num_gpus)]
    # Copy model parameters to `num_gpus` GPUs
    device_params = [get_params(params, d) for d in devices]
    num_epochs = 10
    animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
    timer = d2l.Timer()
    for epoch in range(num_epochs):
        timer.start()
        for X, y in train_iter:
            # Perform multi-GPU training for a single minibatch
            train_batch(X, y, device_params, devices, lr)
            torch.cuda.synchronize()
        timer.stop()
        # Evaluate the model on GPU 0
        animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(
            lambda x: lenet(x, device_params[0]), test_iter, devices[0]),))
    print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
          f'on {str(devices)}')

In [None]:
train(num_gpus=1, batch_size=256, lr=0.2)

In [None]:
train(num_gpus=2, batch_size=256, lr=0.2)

---
## 13.6 Concise Implementation for Multiple GPUs

Implementing parallelism from scratch for every new model is no fun. Here we show how to use high-level APIs.

In [None]:
from d2l import torch as d2l
import torch
from torch import nn

### A Toy Network

We pick a ResNet-18 variant modified for small images.

In [None]:
#@save
def resnet18(num_classes, in_channels=1):
    """A slightly modified ResNet-18 model."""
    def resnet_block(in_channels, out_channels, num_residuals,
                     first_block=False):
        blk = []
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.append(d2l.Residual(out_channels, use_1x1conv=True, 
                                        strides=2))
            else:
                blk.append(d2l.Residual(out_channels))
        return nn.Sequential(*blk)

    # This model uses a smaller convolution kernel, stride, and padding and
    # removes the max-pooling layer
    net = nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(64),
        nn.ReLU())
    net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
    net.add_module("resnet_block2", resnet_block(64, 128, 2))
    net.add_module("resnet_block3", resnet_block(128, 256, 2))
    net.add_module("resnet_block4", resnet_block(256, 512, 2))
    net.add_module("global_avg_pool", nn.AdaptiveAvgPool2d((1,1)))
    net.add_module("fc", nn.Sequential(nn.Flatten(),
                                       nn.Linear(512, num_classes)))
    return net

### Network Initialization

In [None]:
net = resnet18(10)
# Get a list of GPUs
devices = d2l.try_all_gpus()
# We will initialize the network inside the training loop

### Training

> 🔑 **KEY INSIGHT**: PyTorch's `nn.DataParallel` handles all the complexity of splitting data, computing on multiple GPUs, and aggregating gradients automatically.

In [None]:
def train(net, num_gpus, batch_size, lr):
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    devices = [d2l.try_gpu(i) for i in range(num_gpus)]
    def init_weights(module):
        if type(module) in [nn.Linear, nn.Conv2d]:
            nn.init.normal_(module.weight, std=0.01)
    net.apply(init_weights)
    # Set the model on multiple GPUs
    net = nn.DataParallel(net, device_ids=devices)
    trainer = torch.optim.SGD(net.parameters(), lr)
    loss = nn.CrossEntropyLoss()
    timer, num_epochs = d2l.Timer(), 10
    animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
    for epoch in range(num_epochs):
        net.train()
        timer.start()
        for X, y in train_iter:
            trainer.zero_grad()
            X, y = X.to(devices[0]), y.to(devices[0])
            l = loss(net(X), y)
            l.backward()
            trainer.step()
        timer.stop()
        animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(net, test_iter),))
    print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
          f'on {str(devices)}')

In [None]:
train(net, num_gpus=1, batch_size=256, lr=0.1)

In [None]:
train(net, num_gpus=2, batch_size=512, lr=0.2)

---
## 13.7 Parameter Servers

As we move from a single GPU to multiple GPUs and then to multiple servers containing multiple GPUs, our algorithms for distributed and parallel training need to become much more sophisticated.

> 🔑 **KEY INSIGHT**: Different interconnects have vastly different bandwidth - NVLink offers up to 100 GB/s, PCIe 4.0 offers 32 GB/s, while 100GbE Ethernet only amounts to 10 GB/s. The synchronization strategy must be adapted to the available hardware topology.
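
To get a feel for what these bandwidths mean in practice, the sketch below estimates how long a single gradient transfer would take on each interconnect. The gradient size of 100 MB is a hypothetical value chosen for illustration, `transfer_ms` is a hypothetical helper, and the estimate ignores latency and protocol overhead.

```python
# Peak bandwidths quoted above, in GB/s.
BANDWIDTH_GB_PER_S = {'NVLink': 100, 'PCIe 4.0': 32, '100GbE': 10}
GRADIENT_MB = 100  # Hypothetical gradient size, for illustration only

def transfer_ms(size_mb, gb_per_s):
    """Idealized transfer time in milliseconds (no latency, no overhead)."""
    seconds = (size_mb / 1000) / gb_per_s  # Assumes 1 GB = 1000 MB
    return seconds * 1000

times = {name: transfer_ms(GRADIENT_MB, bw)
         for name, bw in BANDWIDTH_GB_PER_S.items()}
for name, ms in times.items():
    print(f'{name}: {ms:.2f} ms')
```

Even in this idealized model the slowest link takes ten times longer than the fastest, so where the aggregation happens matters as much as how it is implemented.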

### Data-Parallel Training

The key steps in data parallel training:
1. Compute loss and gradient on each GPU
2. Aggregate all gradients on one GPU
3. Parameter update happens and parameters are re-distributed to all GPUs

### Ring Synchronization

> 🔑 **KEY INSIGHT**: Ring synchronization is well suited to modern GPU servers. By decomposing gradients into n chunks and synchronizing chunk i starting at node i, the time to aggregate gradients does not grow as we increase the ring size; it stays approximately constant.
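
The chunked ring scheme can be simulated in a few lines of pure Python. The sketch below is a simplified model rather than a production algorithm: each of the n nodes holds n chunks, every chunk is a single number for brevity, and all sends within a step are treated as simultaneous. The function name `ring_allreduce` is a hypothetical helper.

```python
def ring_allreduce(node_chunks):
    """Simulate ring allreduce; node_chunks[i][c] is node i's chunk c.

    Returns a new list in which every node holds the chunk-wise sums.
    """
    n = len(node_chunks)
    assert all(len(chunks) == n for chunks in node_chunks)
    data = [list(chunks) for chunks in node_chunks]

    # Phase 1 (reduce-scatter): at each step, node i forwards one partial
    # sum to its right-hand neighbor, starting with chunk i.
    for step in range(n - 1):
        messages = [((i - step) % n, data[i][(i - step) % n])
                    for i in range(n)]
        for i, (chunk, value) in enumerate(messages):
            data[(i + 1) % n][chunk] += value

    # Phase 2 (all-gather): node i now owns the finished chunk (i + 1) % n,
    # and the finished chunks are passed once around the ring.
    for step in range(n - 1):
        messages = [((i + 1 - step) % n, data[i][(i + 1 - step) % n])
                    for i in range(n)]
        for i, (chunk, value) in enumerate(messages):
            data[(i + 1) % n][chunk] = value
    return data

grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
reduced = ring_allreduce(grads)
print(reduced)
```

In this model each node sends 2(n - 1) chunks, each 1/n of the gradient, so the traffic per node is 2(n - 1)/n times the gradient size. That stays below twice the gradient size for any n, which is consistent with the claim that aggregation time does not grow with the ring.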

### Key--Value Stores

The key--value store abstraction simplifies distributed training:
- **push(key, value)**: sends a gradient from a worker to common storage where it is aggregated
- **pull(key, value)**: retrieves an aggregate value from common storage

This decouples statistical modeling concerns from distributed systems engineering complexity.
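
The push and pull operations can be illustrated with a toy, single-process store. This is a sketch under simplifying assumptions, not the interface of any real parameter server: `ToyKVStore` is a hypothetical class, values are plain floats, aggregation is a running sum, and `pull` returns the aggregate instead of writing it into a `value` argument.

```python
from collections import defaultdict

class ToyKVStore:
    """In-memory stand-in for a key-value store that aggregates by summing."""
    def __init__(self):
        self._store = defaultdict(float)

    def push(self, key, value):
        """Send a value from a worker; it is added to the stored aggregate."""
        self._store[key] += value

    def pull(self, key):
        """Retrieve the current aggregate for `key`."""
        return self._store[key]

kv = ToyKVStore()
# Two workers push their local gradients for the same parameter
kv.push('layer1.weight', 0.25)
kv.push('layer1.weight', 0.5)
aggregate = kv.pull('layer1.weight')
print(aggregate)
```

Workers only need to agree on keys; how and where the aggregation happens stays hidden behind the two calls.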

---
## Summary

### Key Takeaways from Chapter 13:

1. **Compilers vs Interpreters**: Symbolic programming (TorchScript) can significantly improve performance by allowing compiler optimizations.

2. **Asynchronous Computation**: GPU operations are asynchronous by default - use `torch.cuda.synchronize()` when you need to wait for results.

3. **Automatic Parallelism**: The framework automatically parallelizes operations without dependencies across different devices.

4. **Hardware Matters**: Understanding memory hierarchy (caches, RAM, GPU memory) and interconnect bandwidth is crucial for performance.

5. **Multi-GPU Training**: Data parallelism is the most practical approach - split data, compute gradients independently, then aggregate.

6. **High-Level APIs**: Use `nn.DataParallel` for simple multi-GPU training instead of implementing from scratch.

7. **Parameter Servers**: For distributed training across machines, ring synchronization and key-value store abstractions help manage complexity.