In [1]:
from IPython.display import Image

In [14]:
# https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc
Image(url='https://developer-blogs.nvidia.com/wp-content/uploads/2012/12/pinned-1024x541.jpg', 
      width=400)

## basics

- Host (CPU)
    - Pinned memory is allocated on the host (CPU).
- HtoD: host to device
- DtoH: device to host

As the figure shows, pinned memory is used as a staging area for transfers between the host and the device: data in pageable host memory is first copied into a temporary pinned buffer and only then transferred. We can avoid the cost of the extra copy between pageable and pinned host arrays by allocating our host arrays directly in pinned memory.

In [1]:
import torch
import time

### host to device

In [2]:
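# Create a large tensor so that the timing difference is clearly visible
size = (20000, 20000)

# Tensor in ordinary (pageable) host memory
normal_tensor = torch.FloatTensor(*size)
# Copy the pageable tensor to the GPU and time it.
# Note: this is the first CUDA call in the notebook, so the measured time
# may also include one-time CUDA initialization.
t0 = time.time()
normal_tensor_gpu = normal_tensor.to("cuda")
time.time() - t0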

0.3839099407196045

In [3]:
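# Tensor in pinned (page-locked) host memory
pinned_tensor = torch.FloatTensor(*size).pin_memory()
# Copy the pinned tensor to the GPU and time it.
# Note: with non_blocking=True the call returns as soon as the copy is enqueued.
# Without a torch.cuda.synchronize() before reading the clock, this measures
# the enqueue time only, not the transfer itself.
t0 = time.time()
pinned_tensor_gpu = pinned_tensor.to("cuda", non_blocking=True)
time.time() - t0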

0.00041222572326660156

### device to host

In [4]:
size = (20000, 20000)
gpu_tensor = torch.randn(*size, device="cuda")

# Copy to ordinary (pageable) host memory and time it
t0 = time.time()
normal_tensor_cpu = gpu_tensor.to("cpu")
time.time() - t0

0.7305982112884521

In [5]:
# To use pinned memory, first create a pinned destination tensor on the CPU
pinned_tensor_cpu = torch.randn(*size).pin_memory()

# Make sure all pending GPU work has finished
torch.cuda.synchronize()

# Copy into the pinned buffer in non-blocking mode and time it
t0 = time.time()
pinned_tensor_cpu.copy_(gpu_tensor, non_blocking=True)
torch.cuda.synchronize()  # wait for the data transfer to finish
time.time() - t0

0.06072068214416504
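
As a sanity check on the four timings recorded above, the bandwidth each one implies can be worked out from the tensor size. This is a minimal sketch: the helper name `implied_bandwidth_gbs` and the dictionary labels are made up for illustration, and 1 GB is taken as 1e9 bytes.

```python
# Each tensor holds 20000 x 20000 float32 values (4 bytes each).
num_bytes = 20000 * 20000 * 4  # 1.6e9 bytes

# Wall-clock timings in seconds, copied from the cell outputs above.
timings = {
    "HtoD pageable": 0.3839099407196045,
    "HtoD pinned (no synchronize)": 0.00041222572326660156,
    "DtoH pageable": 0.7305982112884521,
    "DtoH pinned (synchronized)": 0.06072068214416504,
}

def implied_bandwidth_gbs(num_bytes, seconds):
    """Bandwidth in GB/s (1 GB = 1e9 bytes) implied by a transfer time."""
    return num_bytes / seconds / 1e9

bandwidths = {name: implied_bandwidth_gbs(num_bytes, t) for name, t in timings.items()}
for name, bw in bandwidths.items():
    print(f"{name:<30} {bw:10.1f} GB/s")
```

The pinned HtoD figure comes out in the thousands of GB/s, far more than a PCIe link can deliver, which suggests that cell timed only the enqueueing of the asynchronous copy. The other three figures are physically plausible.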

### non_blocking

- Use `tensor.to(non_blocking=True)` when it’s applicable to **overlap data transfers**
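    - With `non_blocking=True` the data is moved to the GPU asynchronously, without blocking the CPU. The copy is only truly asynchronous when the source tensor lives in pinned memory.

The CUDA C++ sequence below shows which calls block the host: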
```
cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
increment<<<1,N>>>(d_a);
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);
```

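- `d_a`: pointer to the array in device memory
- The first line copies the data from the host (CPU memory) to the device (GPU memory). The call is issued on the host and is blocking: the CPU does not move on to the second line until the copy has finished.
    - `cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind)`
- The second line launches a kernel that executes on the device. Launch and execution are two separate steps: once the kernel has been launched, control returns to the host immediately and the CPU proceeds to the third line without waiting for the kernel to finish.
- The third line copies the data from the device back to the host, but this transfer cannot start until the kernel from the second line has finished executing on the device.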
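
The same three-step pattern can be sketched in PyTorch. This is only an analogue under stated assumptions: the function name `increment_on_gpu` is made up for illustration, `d_a += 1` stands in for the `increment` kernel, and the `try`/`except` guard is there only so that the sketch still runs on a machine without PyTorch or without a GPU.

```python
try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:  # PyTorch not installed: the GPU part is skipped
    HAS_CUDA = False

def increment_on_gpu(a):
    """Copy a host tensor to the GPU, add 1 there, and copy the result back."""
    d_a = a.to("cuda")    # HtoD copy, like the first cudaMemcpy
    d_a += 1              # kernel launch: control returns to Python before the GPU finishes
    return d_a.to("cpu")  # DtoH copy: runs after the kernel and blocks until the data has arrived

if HAS_CUDA:
    result = increment_on_gpu(torch.zeros(1024))
    print(result[:4])
else:
    result = None
    print("CUDA is not available; skipping the GPU round trip.")
```

The default (blocking) `.to("cpu")` plays the role of the final `cudaMemcpy`: it is queued behind the kernel and returns only after the result is on the host.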

```
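import torch
from torch.cuda.amp import GradScaler

# model, optimizer, criterion and dataloader are assumed to be defined elsewhere
model.train()
# Reset the gradients to None
optimizer.zero_grad(set_to_none=True)
scaler = GradScaler()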

for i, (features, target) in enumerate(dataloader):
    # these two calls are nonblocking and overlapping
    features = features.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)

    # Forward pass with mixed precision
    with torch.cuda.amp.autocast(): # autocast as a context manager
        output = model(features)
        loss = criterion(output, target)
```

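- With `non_blocking=True`, the data transfer (the CPU-to-GPU copy) is asynchronous, so it does not block program execution. While `features` and `target` are being copied to the GPU, the CPU can keep executing the code that follows, until the values are actually needed for computation.
- With an asynchronous transfer, if `features` and `target` have not finished copying when `model(features)` is reached, the GPU waits for the copy to finish before it starts computing. This wait is handled automatically, because the copy and the kernels are queued on the same CUDA stream and a stream executes its operations in order. If the copy completes before the model starts computing, there is no wait at all.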
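
The difference between enqueueing a copy and completing it can be made visible by reading the clock twice, once right after the call and once after `torch.cuda.synchronize()`. This is a minimal sketch: the function name `time_pinned_htod` and the tensor shape are arbitrary choices for illustration, and the `try`/`except` guard is there only so that the sketch still runs without PyTorch or a GPU.

```python
import time

try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:  # PyTorch not installed: the measurement is skipped
    HAS_CUDA = False

def time_pinned_htod(shape=(4096, 4096)):
    """Return (enqueue_seconds, total_seconds) for a non-blocking pinned HtoD copy."""
    src = torch.empty(shape, pin_memory=True)  # page-locked host buffer
    torch.cuda.synchronize()                   # start from an idle GPU
    t0 = time.perf_counter()
    dst = src.to("cuda", non_blocking=True)    # only enqueues the copy
    enqueue = time.perf_counter() - t0
    torch.cuda.synchronize()                   # wait until the copy has really finished
    total = time.perf_counter() - t0
    return enqueue, total

if HAS_CUDA:
    time_pinned_htod()                         # warm-up run (CUDA initialization, allocator)
    enqueue, total = time_pinned_htod()
    print(f"enqueue: {enqueue:.6f} s, completed: {total:.6f} s")
else:
    enqueue = total = None
    print("CUDA is not available; skipping.")
```

The gap between the two numbers is the time the CPU is free to do other work, such as preparing the next batch, while the transfer is still in flight.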

## cuda programming

- https://github.com/NVIDIA-developer-blog/code-samples.git
    - code-samples/series/cuda-cpp/optimize-data-transfers/bandwidthtest.cu

```
$ nvcc bandwidthtest.cu -o a.out
$ ./a.out
```

```
Device: NVIDIA GeForce RTX 4090
Transfer size (MB): 16

Pageable transfers
  Host to Device bandwidth (GB/s): 5.959241
  Device to Host bandwidth (GB/s): 5.124604

Pinned transfers
  Host to Device bandwidth (GB/s): 13.453977
  Device to Host bandwidth (GB/s): 13.369578
```
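
To put the `bandwidthtest` numbers side by side, the ratio of pinned to pageable bandwidth can be computed per direction. This is a small sketch that only reuses the figures printed above; the variable names are made up for illustration.

```python
# Bandwidths in GB/s, copied from the bandwidthtest output above.
pageable = {"HtoD": 5.959241, "DtoH": 5.124604}
pinned = {"HtoD": 13.453977, "DtoH": 13.369578}

# Ratio of pinned to pageable bandwidth for each transfer direction.
speedup = {direction: pinned[direction] / pageable[direction] for direction in pageable}
for direction, ratio in speedup.items():
    print(f"{direction}: pinned is {ratio:.2f}x faster than pageable")
```

On this run, pinned transfers come out roughly 2.3x faster for host to device and 2.6x faster for device to host.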