https://huggingface.co/blog/zh/hf-bitsandbytes-integration

In [4]:
from IPython.display import Image
import torch

### precision matters

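- INT8 is an 8-bit integer data type that can store $2^8 = 256$ distinct values: [-128, 127] for signed integers and [0, 255] for unsigned integers.
- Ideally both training and inference would run in FP32, but FP32 is roughly twice as slow as FP16/BF16, so in practice a mixed-precision approach is common:
    - FP32 weights are kept as the precise "master weights";
    - FP16/BF16 weights are used for the forward and backward passes to speed up training, and in the gradient update step the FP16/BF16 gradients are used to update the FP32 master weights;
    - this works because precise FP32 weights are needed only when the gradients are applied to the model.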
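
The role of the FP32 master copy can be illustrated with a minimal sketch that uses only the standard library, where `struct`'s `e` format rounds a value to IEEE 754 half precision. The helper names (`to_fp16`, `to_fp32`, `apply_updates`) and the toy values (weight 1.0, update 1e-4, 100 steps) are illustrative assumptions, and this is not a full mixed-precision training loop.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def apply_updates(weight: float, update: float, steps: int, cast) -> float:
    """Add `update` to `weight` `steps` times, rounding with `cast` each step."""
    for _ in range(steps):
        weight = cast(weight + cast(update))
    return weight

# Hypothetical toy values: near 1.0 the FP16 spacing is 2**-10 (about 0.00098),
# so an update of 1e-4 is rounded away on every single step.
fp16_only = apply_updates(1.0, 1e-4, 100, to_fp16)
fp32_master = apply_updates(1.0, 1e-4, 100, to_fp32)

print(f"weight kept in FP16: {fp16_only}")
print(f"weight kept in FP32: {fp32_master}")
```

The weight stored in FP16 never moves, while the FP32 copy accumulates the updates to about 1.01, which is why the update is applied to the FP32 master weights.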

In [1]:
# Create the same value as an FP32 tensor and as an FP16 tensor
fp32_tensor = torch.tensor([10000.0], dtype=torch.float32)
fp16_tensor = torch.tensor([10000.0], dtype=torch.float16)

# Multiply each tensor by itself (10000 * 10000 = 1e8)
fp32_result = fp32_tensor * fp32_tensor
fp16_result = fp16_tensor * fp16_tensor

print(f"FP32 result: {fp32_result.item()}")
print(f"FP16 result: {fp16_result.item()}")

# Check the largest finite value FP16 can represent
fp16_max = torch.finfo(torch.float16).max
print(f"FP16 max value: {fp16_max}")

# Try to represent 100M (1e8) directly in FP16
fp16_overflow = torch.tensor([100000000.0], dtype=torch.float16)
print(f"Attempting to represent 100M in FP16: {fp16_overflow.item()}")

FP32 result: 100000000.0
FP16 result: inf
FP16 max value: 65504.0
Attempting to represent 100M in FP16: inf
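
The 65504.0 limit printed above follows from the IEEE 754 half-precision layout (10 mantissa bits, largest unbiased exponent 15). The following sketch recomputes it with plain Python; the constant and variable names are illustrative choices, not part of the original notebook.

```python
import struct

MANTISSA_BITS = 10   # explicit fraction bits in IEEE 754 half precision
MAX_EXPONENT = 15    # largest unbiased exponent for finite values

# Largest finite FP16 value: mantissa of all ones times 2**15
fp16_max = (2 - 2 ** -MANTISSA_BITS) * 2 ** MAX_EXPONENT
print(f"FP16 max value: {fp16_max}")

# The product from the cell above is far outside that range
product = 10000.0 * 10000.0
overflows = product > fp16_max
print(f"10000 * 10000 exceeds the FP16 range: {overflows}")

# Unlike torch, which saturates to inf, struct refuses to pack the value
try:
    struct.pack("<e", product)
    packed_ok = True
except OverflowError:
    packed_ok = False
print(f"struct could pack 1e8 as FP16: {packed_ok}")
```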


### LLM.int8()

In [3]:
Image(url='https://huggingface.co/blog/assets/96_hf_bitsandbytes_integration/quant-freeze.png', width=500)

In [5]:
def quantize_and_dequantize(fp16_vector):
    # Get max(abs)
    max_abs = torch.max(torch.abs(fp16_vector))
    
    # Calculate quantization factor α
    alpha = max_abs / 127.0
    
    # Quantize to int8
    quantized = torch.round(fp16_vector / alpha).clamp(-127, 127).to(torch.int8)
    
    # Dequantize back to fp16
    dequantized = quantized.to(torch.float16) * alpha
    
    return quantized, dequantized, alpha

# Example usage
fp16_vector = torch.tensor([1.2, -0.5, -4.3, 1.2, -3.1, 0.8, 2.4, 5.4], dtype=torch.float16)

quantized, dequantized, alpha = quantize_and_dequantize(fp16_vector)

print("Original fp16 vector:", fp16_vector)
print("Quantized int8 vector:", quantized)
print("Dequantized fp16 vector:", dequantized)
print("Quantization factor α:", alpha)

Original fp16 vector: tensor([ 1.2002, -0.5000, -4.3008,  1.2002, -3.0996,  0.7998,  2.4004,  5.3984],
       dtype=torch.float16)
Quantized int8 vector: tensor([  28,  -12, -101,   28,  -73,   19,   56,  127], dtype=torch.int8)
Dequantized fp16 vector: tensor([ 1.1904, -0.5103, -4.2930,  1.1904, -3.1035,  0.8076,  2.3809,  5.3984],
       dtype=torch.float16)
Quantization factor α: tensor(0.0425, dtype=torch.float16)
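
As a cross-check of the output above, the same absmax scheme can be sketched in plain Python: $\alpha = \max_i |x_i| / 127$, $q_i = \mathrm{round}(x_i / \alpha)$, $\hat{x}_i = \alpha \, q_i$. The function name `absmax_quantize` is a hypothetical stand-in, and the arithmetic runs in float64 rather than FP16, so the dequantized values differ slightly from the tensor output while the integer codes match.

```python
def absmax_quantize(values):
    """Absmax quantization to the symmetric int8 range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    alpha = max_abs / 127.0  # scale factor: the largest magnitude maps to 127
    quantized = [max(-127, min(127, round(v / alpha))) for v in values]
    dequantized = [q * alpha for q in quantized]
    return quantized, dequantized, alpha

values = [1.2, -0.5, -4.3, 1.2, -3.1, 0.8, 2.4, 5.4]
quantized, dequantized, alpha = absmax_quantize(values)

# Rounding to the nearest integer code bounds the round-trip error by alpha / 2
max_error = max(abs(v - d) for v, d in zip(values, dequantized))

print("Quantized:", quantized)  # [28, -12, -101, 28, -73, 19, 56, 127]
print(f"alpha = {alpha:.4f}, max round-trip error = {max_error:.4f}")
```

The largest-magnitude element (5.4) maps exactly to 127, and every other element is recovered to within half a quantization step.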
