- references
    - https://tigress-web.princeton.edu/~jdh4/PyTorchPerformanceTuningGuide_GTC2021.pdf
    - https://towardsdatascience.com/optimize-pytorch-performance-for-speed-and-memory-efficiency-2022-84f453916ea6

In [1]:
import torch
import torch.jit
import timeit

## data loading

- Move the active data to the SSD
- `DataLoader(dataset, num_workers=4*num_GPU)`
- `DataLoader(dataset, pin_memory=True)`, then copy each batch with `tensor.to(device, non_blocking=True)`
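The loader settings above can be sketched as follows. This is a minimal illustration, not a tuned configuration: the helper names `build_loader` and `to_device`, the batch size, and the random `TensorDataset` are assumptions made for the example. Note that `non_blocking` is an argument of `Tensor.to()`, not of `DataLoader`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(dataset, num_gpus, batch_size=32):
    # Hypothetical helper. Rule of thumb from the notes: 4 workers per GPU.
    # With num_gpus == 0 the data is loaded in the main process.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=4 * num_gpus,
        # Pinned (page-locked) host memory only matters for copies to a GPU.
        pin_memory=torch.cuda.is_available(),
    )


def to_device(batch, device):
    # Hypothetical helper. non_blocking=True allows the host-to-GPU copy to
    # overlap with computation when the source tensor is in pinned memory.
    return [t.to(device, non_blocking=True) for t in batch]


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Random stand-in data, used only to make the sketch runnable.
    dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))
    loader = build_loader(dataset, num_gpus=torch.cuda.device_count())
    for batch in loader:
        features, labels = to_device(batch, device)
    print(features.shape, labels.shape, features.device)
```

On a machine without a GPU the sketch falls back to zero workers and unpinned memory, so it still runs, but the two settings only pay off when batches are actually copied to a CUDA device.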

## data operations

### JIT (Just-In-Time compilation)

- JIT compiles the model into an intermediate representation (IR), which is then further converted into machine code
- Use PyTorch JIT to fuse pointwise (elementwise) operations into a single kernel
    - the benchmark below applies this to the pointwise operations of GELU
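To make the intermediate representation concrete, the sketch below scripts a small function and prints its IR. The name `scripted_gelu` is chosen for this illustration; `graph` and `code` are attributes of the scripted function object. The exact text printed depends on the PyTorch version.

```python
import torch


@torch.jit.script
def scripted_gelu(x):
    # erf-based GELU, written as a chain of pointwise operations.
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))


# TorchScript IR: a graph of aten:: operators (mul, div, erf, add).
ir_text = str(scripted_gelu.graph)
print(ir_text)

# Python-like rendering of the same compiled function.
print(scripted_gelu.code)
```

Every node in the printed graph is a pointwise operator, which is what makes this function a candidate for fusion into a single kernel.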

In [13]:
# Create a large random tensor as input data
x = torch.randn(15000, 15000)

# Function compiled with JIT
@torch.jit.script
def fused_gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))

# The same function without JIT compilation
def gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))

# Use timeit to measure the total execution time of each function over 100 calls
jit_time = timeit.timeit('fused_gelu(x)', globals=globals(), number=100)
nonjit_time = timeit.timeit('gelu(x)', globals=globals(), number=100)

print(jit_time, nonjit_time)

20.05574530499871 31.39065190600013
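The first calls to a scripted function can include one-time profiling and compilation work, so a warm-up before timing tends to give a steadier comparison. The following is a sketch under assumptions: the `benchmark` helper, its `warmup` and `number` defaults, and the smaller tensor size are chosen for illustration and are not part of the measurement above.

```python
import timeit

import torch


@torch.jit.script
def fused_gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))


def gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))


def benchmark(fn, x, warmup=3, number=20):
    # Hypothetical helper: returns the mean seconds per call.
    # Warm-up calls absorb one-time costs before the timed loop starts.
    for _ in range(warmup):
        fn(x)
    # GPU kernels launch asynchronously, so wait before and after timing.
    if x.is_cuda:
        torch.cuda.synchronize()
    start = timeit.default_timer()
    for _ in range(number):
        fn(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (timeit.default_timer() - start) / number


# Assumed smaller input so the sketch finishes quickly.
x = torch.randn(1000, 1000)

# Both versions should agree, up to small floating-point differences.
assert torch.allclose(fused_gelu(x), gelu(x), atol=1e-5)

jit_time = benchmark(fused_gelu, x)
nonjit_time = benchmark(gelu, x)
print(f"jit: {jit_time:.6f} s/call, eager: {nonjit_time:.6f} s/call")
```

The measured ratio depends on the device, tensor size, and PyTorch version, so it is worth rerunning on the target hardware rather than relying on a single set of numbers.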
