- https://arxiv.org/pdf/1710.03740.pdf
    - MIXED PRECISION TRAINING
- https://developer.nvidia.com/automatic-mixed-precision
    - https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/
    - https://on-demand.gputechconf.com/gtc-taiwan/2018/pdf/5-1_Internal%20Speaker_Michael%20Carilli_PDF%20For%20Sharing.pdf
- automatic mixed precision
    - single precision: fp32 (float32)
    - half precision: fp16 (float16)
    - enables larger batch sizes and models, and speeds up training;
    - model performance does not degrade significantly;
- training steps
    - Porting the model to use FP16 data type where appropriate
    - Adding loss scaling to preserve small gradient values;

In [2]:
import torch
from IPython.display import Image
from transformers import AutoModelForCausalLM



In [18]:
model = AutoModelForCausalLM.from_pretrained('gpt2')
# all float32
print(model.get_memory_footprint() / (1024**2))
model = AutoModelForCausalLM.from_pretrained('gpt2', torch_dtype=torch.float16)
# all float16
print(model.get_memory_footprint() / (1024**2))
model = AutoModelForCausalLM.from_pretrained('gpt2', torch_dtype=torch.float16, load_in_8bit=True)
# float16, torch.int8
print(model.get_memory_footprint() / (1024**2))

486.7002410888672
249.3501205444336
168.3501205444336
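
The three numbers above can be reproduced by hand, which shows where the savings come from. The sketch below is a back-of-the-envelope check added here, not part of the original notebook. It assumes the standard GPT-2 small shapes (12 blocks, hidden size 768, vocabulary 50257, 1024 positions), that `get_memory_footprint()` also counts buffers, and that each block registers a 1024x1024 boolean causal mask plus one scalar floating-point buffer. Buffer details differ across transformers versions, so treat the last assumption as a fit to the printed output.

```python
# Back-of-the-envelope footprint of GPT-2 small; shapes and buffers are assumptions.
d, n_layer, vocab, n_pos = 768, 12, 50257, 1024

# Per block: c_attn, attn.c_proj, mlp.c_fc, mlp.c_proj weight matrices.
proj_weights = d * 3 * d + d * d + d * 4 * d + 4 * d * d
# Per block: the four projection biases and two LayerNorms (weight + bias each).
biases_and_norms = (3 * d + d + 4 * d + d) + 2 * 2 * d

n_proj = n_layer * proj_weights                   # stored as int8 with load_in_8bit
n_other = (vocab * d + n_pos * d                  # wte, wpe
           + n_layer * biases_and_norms + 2 * d)  # per-block extras, ln_f

# Assumed buffers: one bool causal mask (1 byte per entry) per block.
mask_bytes = n_layer * n_pos * n_pos


def footprint_mib(proj_bytes, other_bytes):
    """Total size in MiB given the bytes per element of each parameter group."""
    total = n_proj * proj_bytes + n_other * other_bytes
    total += mask_bytes + n_layer * other_bytes   # masks + one scalar buffer per block
    return total / 1024**2


print(n_proj + n_other)      # parameter count
print(footprint_mib(4, 4))   # everything in float32
print(footprint_mib(2, 2))   # everything in float16
print(footprint_mib(1, 2))   # int8 projection weights, float16 elsewhere
```

Under these assumptions the three calls reproduce the printed footprints, and the 81 MiB gap between the float16 and 8-bit loads is one byte saved per projection weight.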


In [11]:
for name, para in model.named_parameters():
    print(para.dtype, name, para.device)


- transformer.wte.weight, transformer.wpe.weight: torch.float16
- h.0 - h.11
    - ln_1.weight, ln_1.bias, ln_2.weight, ln_2.bias: torch.float16
    - attn
        - c_attn.weight: torch.int8
            - bias: torch.float16
        - c_proj.weight: torch.int8
            - bias: torch.float16
    - mlp
        - c_fc.weight: torch.int8
            - bias: torch.float16
- ln_f.weight, ln_f.bias: torch.float16

## demo

In [3]:
torch.cuda.amp.autocast??

In [13]:
# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

# with torch.autocast(device_type="cuda"):
with torch.cuda.amp.autocast():
    # torch.mm is on autocast's list of ops that should run in float16.
    # Inputs are float32, but the op runs in float16 and produces float16 output.
    # No manual casts are required.
    e_float16 = torch.mm(a_float32, b_float32)
    print('in autocast', e_float16.dtype, e_float16.device)
    # Also handles mixed input types
    f_float16 = torch.mm(d_float32, e_float16)
    print('in autocast', f_float16.dtype, f_float16.device)

# After exiting autocast, calls f_float16.float() to use with d_float32
g_float32 = torch.mm(d_float32, f_float16.float())
print('out autocast', g_float32.dtype, g_float32.device)

in autocast torch.float16 cuda:0
in autocast torch.float16 cuda:0
out autocast torch.float32 cuda:0


## basics

In [16]:
Image(url='../imgs/fp32-fp16.png', width=600)

- fp32 (single precision) vs. fp16 (half precision)
    - fp16's dynamic range is sufficient, but gradients (weight updates) must be scaled to avoid fp16 underflow;
- fp16 is fast and memory-efficient;
    - higher compute throughput (8x)
    - higher memory throughput (2x)
    - smaller GPU memory footprint (1/2x)
- fp32 offers precision and range benefits.
- hence the two need to be mixed;
    - cases that need fp32:
        - reductions, exponentiation;
        - large + small: weight updates, reductions again;
            - 1 + 0.0001
            - update/para < 2^{-11} (0.00049), no effect
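
The `2^{-11}` threshold in the last bullet can be checked without a GPU. The sketch below is an illustration added here, not from the original notebook. It rounds a Python float to fp16 through the standard library's `struct` format `'e'` (IEEE 754 binary16); `to_fp16` is a hypothetical helper name.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest fp16 value (ties to even)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # struct raises on overflow instead of returning inf
        return math.copysign(math.inf, x)


# Near 1.0 the spacing between fp16 values is 2**-10, so an update of
# half that spacing (2**-11) or less is rounded away.
print(to_fp16(1.0 + 2**-10))  # 1.0009765625
print(to_fp16(1.0 + 2**-11))  # 1.0 (exact tie, rounds to even)
print(to_fp16(1.0 + 0.0001))  # 1.0

# The threshold is relative: it grows with the magnitude of the parameter.
print(to_fp16(8.0 + 8 * 2**-11))  # 8.0
print(to_fp16(8.0 + 8 * 2**-10))  # 8.0078125
```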

In [5]:
Image(url='https://docscontent.nvidia.com/dims4/default/3252e0a/2147483647/strip/true/crop/944x532+0+0/resize/1888x1064!/format/webp/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fdita%2F00000189-949d-d46e-abe9-bcdf9f8c0000%2Fdeeplearning%2Fperformance%2Fmixed-precision-training%2Fgraphics%2Fgradients2.png', 
      width=600)

In [14]:
torch.cuda.HalfTensor([2**-24])  + 1.

tensor([1.], device='cuda:0', dtype=torch.float16)

In [12]:
4096 * 16

65536

In [21]:
# torch.float16
a = torch.cuda.HalfTensor(4096)
# 4096 * 16
a.fill_(16)
a.sum()

tensor(inf, device='cuda:0', dtype=torch.float16)

In [25]:
# torch.float32
b = torch.cuda.FloatTensor(4096)
# 4096 * 16
b.fill_(16)
b.sum()

tensor(65536., device='cuda:0')
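
Overflow is only one way an fp16 reduction can go wrong. The sketch below is an added illustration using the standard library's `struct` format `'e'`. It simulates a strictly sequential running sum that is rounded to fp16 after every addition. That accumulation order is an assumption made for illustration, not a description of how the `sum()` calls above are computed, and `to_fp16` is a hypothetical helper.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest fp16 value (ties to even)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # struct raises on overflow instead of returning inf
        return math.copysign(math.inf, x)


values = [16.0] * 4096

acc16 = 0.0
for v in values:
    acc16 = to_fp16(acc16 + v)  # running sum rounded to fp16 at every step

acc_wide = sum(values)          # Python floats stand in for a wider accumulator

print(acc16)              # 32768.0: spacing there is 32, so +16 is rounded away
print(acc_wide)           # 65536.0
print(to_fp16(acc_wide))  # inf: 65536 exceeds the fp16 maximum of 65504
```

Either way the fp16 result is wrong, which is why reductions are listed among the operations that need fp32.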

In [26]:
para = torch.cuda.HalfTensor([1.])
update = torch.cuda.HalfTensor([.0001])
para + update

tensor([1.], device='cuda:0', dtype=torch.float16)

In [27]:
para = torch.cuda.FloatTensor([1.])
update = torch.cuda.FloatTensor([.0001])
para + update

tensor([1.0001], device='cuda:0')

In [4]:
# GEMM: General Matrix Multiply
Image(url='../imgs/amp_32_16.png', width=600)

## amp

- https://arxiv.org/pdf/1710.03740.pdf

In [5]:
# model: half, inputs: half, 
# targets: float32, 
# optimizer: float32
Image(url='https://blog.paperspace.com/content/images/2022/05/image-16.png', width=400)

In [10]:
# the gradient update happens in float32
Image(url='../imgs/master-weights.png', width=400)

In [30]:
Image(url='https://pic1.zhimg.com/v2-0e8ef3ea96a60a2dfa45c8e4cb658a5c_r.jpg', width=400)

- forward: weights, activations
- backward: activation grad, weight grad
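- updates (weight gradients multiplied by the learning rate) can be very small; in FP16, values smaller than about 2^(-24), the smallest positive fp16 value, are rounded to 0.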
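
The underflow in the last bullet can be reproduced on the CPU. A minimal sketch, rounding through the standard library's `struct` format `'e'` (IEEE 754 binary16); `to_fp16` is a hypothetical helper and the learning rate and gradient are made-up values:

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest fp16 value (ties to even)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # struct raises on overflow instead of returning inf
        return math.copysign(math.inf, x)


print(to_fp16(2**-24) == 2**-24)  # True: smallest positive fp16 value
print(to_fp16(2**-25))            # 0.0: the tie rounds to zero

lr, grad = 1e-3, 1e-5             # illustrative values
print(to_fp16(grad) > 0)          # True: the gradient itself survives
print(to_fp16(lr * grad))         # 0.0: the update does not
```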

$$
\frac{\partial L}{\partial w}=\frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial w}
$$

- 2-layer NN

$$
\begin{split}
\frac{\partial L}{\partial W^{[2]}}=\frac{\partial L}{\partial a^{[2]}}\frac{\partial a^{[2]}}{\partial W^{[2]}}\\
\frac{\partial L}{\partial a^{[1]}}=\frac{\partial L}{\partial a^{[2]}}\frac{\partial a^{[2]}}{\partial a^{[1]}}\\
\frac{\partial L}{\partial W^{[1]}}=\frac{\partial L}{\partial a^{[1]}}\frac{\partial a^{[1]}}{\partial W^{[1]}}
\end{split}
$$
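
- multiplying the loss by a constant $S$ (a symbol introduced here for the scale factor) multiplies $\partial L/\partial a^{[2]}$ by $S$, and the local derivatives do not depend on $S$, so every gradient in the chain picks up the same factor:

$$
\begin{split}
\frac{\partial (S L)}{\partial W^{[2]}}=S\,\frac{\partial L}{\partial a^{[2]}}\frac{\partial a^{[2]}}{\partial W^{[2]}}\\
\frac{\partial (S L)}{\partial a^{[1]}}=S\,\frac{\partial L}{\partial a^{[2]}}\frac{\partial a^{[2]}}{\partial a^{[1]}}\\
\frac{\partial (S L)}{\partial W^{[1]}}=S\,\frac{\partial L}{\partial a^{[1]}}\frac{\partial a^{[1]}}{\partial W^{[1]}}
\end{split}
$$

- dividing the weight gradients by $S$ afterwards recovers the original values.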

## loss scaling

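```
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

# clear gradients left over from the previous iteration
optimizer.zero_grad()

# forward: ops on autocast's list run in fp16
with autocast():
    output = model(input)
    loss = loss_fn(output, target)

# backward: scale the loss so small gradients stay representable in fp16
scaler.scale(loss).backward()
# unscale the gradients; the step is skipped if any of them is inf/nan
scaler.step(optimizer)
# adjust the scale factor for the next iteration
scaler.update()
```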
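
`scaler.step` and `scaler.update` hide a small control loop. The sketch below is a simplified pure-Python model of that loop, not PyTorch's implementation. `ToyGradScaler` is a hypothetical class; its defaults mirror the documented `GradScaler` defaults (`init_scale=2**16`, `growth_factor=2.0`, `backoff_factor=0.5`, `growth_interval=2000`).

```python
import math


class ToyGradScaler:
    """Simplified model of dynamic loss scaling (hypothetical, not torch's code)."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def unscale(self, scaled_grads):
        """Return unscaled gradients, or None if any is inf/nan (skip the step)."""
        if any(not math.isfinite(g) for g in scaled_grads):
            return None
        return [g / self.scale for g in scaled_grads]

    def update(self, step_was_skipped):
        """Shrink the scale after an overflow, grow it after a run of good steps."""
        if step_was_skipped:
            self.scale *= self.backoff_factor
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self.good_steps = 0


# Short growth interval so the behaviour is visible in four steps.
scaler = ToyGradScaler(growth_interval=2)
history = []
for scaled_grads in ([0.5, 1.0], [math.inf, 2.0], [0.25, 0.5], [0.5, 0.5]):
    grads = scaler.unscale(scaled_grads)
    scaler.update(step_was_skipped=grads is None)
    history.append(scaler.scale)

print(history)  # [65536.0, 32768.0, 32768.0, 65536.0]
```

The scale halves on the overflowing step and doubles again after two clean steps, so it hovers just below the largest value that does not overflow.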

- what gets scaled is the loss (loss scaling)
    - small gradients may underflow in FP16 regions of the network
    - scaling the loss brings gradients into the fp16 dynamic range
    - unscale gradients in FP32 for `optimizer.step()`
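
The three bullets above can be traced on a single number. A minimal sketch, assuming a made-up gradient of `1e-8` and a fixed scale of `2**16`; fp16 rounding is simulated with the standard library's `struct` format `'e'`, Python floats stand in for fp32, and `to_fp16` is a hypothetical helper.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest fp16 value (ties to even)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # struct raises on overflow instead of returning inf
        return math.copysign(math.inf, x)


grad = 1e-8      # illustrative "true" gradient, too small for fp16
scale = 2.0**16  # illustrative loss scale

unscaled_fp16 = to_fp16(grad)        # what fp16 would store without scaling
scaled_fp16 = to_fp16(grad * scale)  # scaling the loss scales the gradient
recovered = scaled_fp16 / scale      # unscale in higher precision

print(unscaled_fp16)                        # 0.0
print(scaled_fp16 > 0)                      # True
print(abs(recovered - grad) / grad < 1e-3)  # True
```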


In [29]:
Image(url='../imgs/loss-scaling.png', width=500)

In [9]:
Image(url='../imgs/master-weights-scale.png', width=500)

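```
# Scale the loss before backward so small fp16 gradients do not underflow
(loss * scale_factor).backward()

# Copy the gradients from the float16 model to the float32 model
for param, param_float32 in zip(model.parameters(), model_float32.parameters()):
    if param.grad is not None:
        # Unscale in float32: divide by the factor the loss was multiplied by
        param_float32.grad = param.grad.float() / scale_factor

# Update the master weights (float32 model)
optimizer.step()
```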
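
Why the update has to land on a float32 copy can be seen on a single weight. A minimal sketch with made-up numbers: fp16 rounding is simulated with the standard library's `struct` format `'e'`, a Python float stands in for the float32 master weight, and `to_fp16` is a hypothetical helper.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest fp16 value (ties to even)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # struct raises on overflow instead of returning inf
        return math.copysign(math.inf, x)


update = 1e-4  # illustrative per-step update, below half the fp16 spacing near 1.0
steps = 100

w16 = to_fp16(1.0)  # weight stored and updated only in fp16
master = 1.0        # higher-precision master copy

for _ in range(steps):
    w16 = to_fp16(w16 + update)  # rounded back to 1.0 every step
    master = master + update     # small updates accumulate

print(w16)              # 1.0
print(to_fp16(master))  # 1.009765625: fp16 copy used for the next forward pass
```

After 100 steps the fp16-only weight has not moved, while the fp16 copy derived from the master weight reflects the accumulated change.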