# Quantization

## Pre-reading

Watch the video **DeepLearningAI-Quantization_Fundamentals-Handling_Big_Models** posted in Teams.

### Objectives

1. Describe different data types supported by ARM, PyTorch, and TensorFlow Lite.
2. Quantize data into different data types.
3. Assess the impact on memory usage of quantization.

In [1]:
%pip install -q torch


In [2]:
import torch

## Data Types and Sizes

*Some of this content is from [DeepLearning.AI "Quantization Fundamentals"](https://learn.deeplearning.ai/courses/quantization-fundamentals/lesson/dig9h/data-types-and-sizes).*

PyTorch and TensorFlow both support various data types. Many - but not all - of these are familiar to you from your experience with C programming.

This course has thus far focused on TensorFlow, but PyTorch has better support for quantization work, particularly as HuggingFace and the rest of the community continue to favor PyTorch over TensorFlow.

### Integer Types

Unsigned $N$-bit integers cover the range $[0, 2^N - 1]$.

Signed $N$-bit integers use two's complement and cover the range $[-2^{N-1}, 2^{N-1} - 1]$.
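As a quick check on the two formulas, here is a minimal sketch that computes both ranges from the bit width. The helper name `int_range` is hypothetical and not part of PyTorch; its results should agree with what `torch.iinfo` reports for the matching type.

```python
def int_range(n_bits: int, signed: bool) -> tuple[int, int]:
    """Return (min, max) for an integer type that is n_bits wide."""
    if signed:
        # Two's complement has one more negative value than positive values.
        return -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    return 0, 2 ** n_bits - 1


for n_bits in (4, 8):
    print(f"{n_bits}-bit unsigned: {int_range(n_bits, signed=False)}")
    print(f"{n_bits}-bit signed:   {int_range(n_bits, signed=True)}")
```

Each extra bit doubles the number of representable values, which is why dropping from 8 bits to 4 bits leaves only 16 distinct levels.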

In [4]:
# Information of `8-bit unsigned integer`
torch.iinfo(torch.uint8)

iinfo(min=0, max=255, dtype=uint8)

In [None]:
# Information of `8-bit (signed) integer`
torch.iinfo(torch.int8)

iinfo(min=-128, max=127, dtype=int8)

In [None]:
# TODO: Information of `16-bit (signed) integer`

In [None]:
# TODO: Information of `32-bit (signed) integer`

In [6]:
# TODO: Information of `64-bit (signed) integer`

iinfo(min=-9.22337e+18, max=9.22337e+18, dtype=int64)

### Floating Point Types

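The radix point "floats": a number is stored as a sign, a fraction (significand), and an exponent that scales the fraction by a power of the base (base 2 for the formats here).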

IEEE 754 single-precision **FP32** has:

- 1 sign bit
- 8 exponent bits
- 23 fraction bits
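The three fields can be inspected directly. The sketch below uses Python's standard `struct` module to reinterpret an FP32 bit pattern as an integer and then rebuilds the value from its fields. The helper names `decode_fp32` and `value_of_normal` are hypothetical, and the rebuild only covers normal numbers (it ignores zeros, subnormals, infinities, and NaN).

```python
import struct


def decode_fp32(x: float) -> tuple[int, int, int]:
    """Split x (rounded to FP32) into its sign, exponent, and fraction fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction


def value_of_normal(sign: int, exponent: int, fraction: int) -> float:
    """Rebuild the value of a normal FP32 number from its three fields."""
    # Normal numbers carry an implicit leading 1 in front of the fraction.
    return (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)


sign, exponent, fraction = decode_fp32(-6.5)
print(sign, exponent, fraction)
print(value_of_normal(sign, exponent, fraction))
```

The implicit leading 1 is why 23 stored fraction bits give 24 bits of precision.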

But there are other formats!

![floating point formats](https://frankdenneman.nl/wp-content/uploads/2022/07/FP16-FP32-BFfloat16-50dpi.png)

Python defaults to FP64 for float data.
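This can be confirmed from the standard library without PyTorch. The sketch below assumes a typical platform where the Python `float` is an IEEE 754 double; on such a platform the printed values should line up with the `float64` limits that `torch.finfo` reports.

```python
import struct
import sys

print(struct.calcsize("d"))      # bytes used by a C double
print(sys.float_info.mant_dig)   # bits of precision, counting the implicit leading 1
print(sys.float_info.max)        # largest finite float
print(sys.float_info.epsilon)    # gap between 1.0 and the next larger float
```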

In [None]:
# Information of `64-bit floating point`
torch.finfo(torch.float64)

finfo(resolution=1e-15, min=-1.79769e+308, max=1.79769e+308, eps=2.22045e-16, smallest_normal=2.22507e-308, tiny=2.22507e-308, dtype=float64)

In [12]:
# Information of `32-bit floating point`
torch.finfo(torch.float32)

finfo(resolution=1e-06, min=-3.40282e+38, max=3.40282e+38, eps=1.19209e-07, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=float32)

In [None]:
# TODO: Information of `16-bit floating point`

In [14]:
# by default, python stores float data in fp64
value = 1/3

In [7]:
format(value, '.60f')

'0.333333333333333314829616256247390992939472198486328125000000'

In [None]:
tensor_fp64 = torch.tensor(value, dtype = torch.float64)
tensor_fp32 = torch.tensor(value, dtype = torch.float32)
tensor_fp16 = torch.tensor(value, dtype = torch.float16)
tensor_bf16 = torch.tensor(value, dtype = torch.bfloat16)

print(f"fp64 tensor: {format(tensor_fp64.item(), '.60f')}")
print(f"fp32 tensor: {format(tensor_fp32.item(), '.60f')}")
print(f"fp16 tensor: {format(tensor_fp16.item(), '.60f')}")
print(f"bf16 tensor: {format(tensor_bf16.item(), '.60f')}") # More on this below

fp64 tensor: 0.333333333333333314829616256247390992939472198486328125000000
fp32 tensor: 0.333333343267440795898437500000000000000000000000000000000000
fp16 tensor: 0.333251953125000000000000000000000000000000000000000000000000
bf16 tensor: 0.333984375000000000000000000000000000000000000000000000000000
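The FP32 and FP16 results can be cross-checked without PyTorch. This is a minimal sketch that uses the standard `struct` module, whose `d`, `f`, and `e` format codes correspond to FP64, FP32, and FP16; the helper name `round_trip` is hypothetical. `struct` has no bfloat16 format, so that type is left out here.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Store value in the given struct format and read it back as a Python float."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


value = 1 / 3
for name, fmt in (("fp64", "<d"), ("fp32", "<f"), ("fp16", "<e")):
    stored = round_trip(value, fmt)
    # The error is measured against the FP64 value, not the exact fraction 1/3.
    print(f"{name}: {stored:.20f}  abs error = {abs(stored - value):.3e}")
```

The absolute error grows by several orders of magnitude at each step down in width, which is the cost that quantization trades against memory.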


#### bfloat16


Developed by Google Brain, **bfloat16** has approximately the same dynamic range as 32-bit float, but only 8 bits of precision instead of float32's 24 bits.

Most machine learning applications do not require single precision, but simply casting to FP16 sacrifices dynamic range because FP16 has only 5 exponent bits. bfloat16 keeps FP32's 8 exponent bits and gives up fraction bits instead.
The smaller size of bfloat16 numbers allows for more efficient memory usage and calculation speed compared to float32.
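To put numbers on the memory savings, here is a back-of-the-envelope sketch. The helper `footprint_bytes` and the 1-billion-parameter model are hypothetical, and the estimate counts weights only (no activations, optimizer state, or framework overhead).

```python
# Bits per element for the data types discussed in this notebook.
BITS_PER_ELEMENT = {"fp64": 64, "fp32": 32, "fp16": 16, "bf16": 16, "int8": 8}


def footprint_bytes(num_params: int, dtype: str) -> int:
    """Bytes needed to store num_params values of the given type (weights only)."""
    return num_params * BITS_PER_ELEMENT[dtype] // 8


num_params = 1_000_000_000  # hypothetical 1-billion-parameter model
for dtype in BITS_PER_ELEMENT:
    print(f"{dtype}: {footprint_bytes(num_params, dtype) / 1e9:.1f} GB")
```

For an actual PyTorch tensor, `tensor.element_size() * tensor.nelement()` gives the same kind of figure.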

See [bfloat16 Wikipedia](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) for more!

In [None]:
# Information of `16-bit brain floating point (bfloat16)`
torch.finfo(torch.bfloat16)

finfo(resolution=0.01, min=-3.38953e+38, max=3.38953e+38, eps=0.0078125, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=bfloat16)
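The `eps=0.0078125` reported here is $2^{-7}$, which reflects bfloat16's 7 stored fraction bits. Because bfloat16 has the same sign and exponent layout as FP32, it can be emulated by rounding an FP32 bit pattern to its top 16 bits. The sketch below does that with the standard `struct` module; the helper name `to_bf16` is hypothetical, it assumes round-to-nearest-even, and it does not handle NaN or values outside the FP32 range.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to bfloat16 precision and return it as a Python float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    # Round to nearest, ties to even, on the 16 bits that will be dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    # Keep the top 16 bits (sign, 8 exponent bits, 7 fraction bits).
    bits = ((bits >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(to_bf16(1 / 3))
print(to_bf16(1.0 + 2 ** -7) - 1.0)  # smallest representable step above 1.0
print(to_bf16(1.0 + 2 ** -8))        # halfway case, rounds back to 1.0
```

Under these assumptions the first line reproduces the bf16 value printed for 1/3 earlier in this notebook.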

##### bfloat16 on ARM processors

> Recent Arm processors support the BFloat16 (BF16) number format in PyTorch. BFloat16 provides improved performance and smaller memory footprint with the same dynamic range. You might experience a drop in model inference accuracy with BFloat16, but the impact is acceptable for the majority of applications. ~ [ARM Learn: PyTorch](https://learn.arm.com/install-guides/pytorch/)

To check whether your processor supports BFloat16, use the `lscpu` command:

In [None]:
# Will print flags if your processor supports BFloat16
# If result is blank you do not have a processor with BFloat16.
!lscpu | grep bf16
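The same check can be sketched in Python for use inside a script. This assumes a Linux system where the CPU feature flags also appear in `/proc/cpuinfo`; the helper name `bf16_flags` is hypothetical, and like the `grep` above it simply looks for any flag containing `bf16`.

```python
from pathlib import Path


def bf16_flags(cpuinfo_text: str) -> set[str]:
    """Return every whitespace-separated token that contains 'bf16'."""
    return {token for token in cpuinfo_text.split() if "bf16" in token}


cpuinfo = Path("/proc/cpuinfo")  # assumed location, Linux only
if cpuinfo.exists():
    found = bf16_flags(cpuinfo.read_text())
    print(sorted(found) if found else "No BFloat16 flag reported.")
else:
    print("/proc/cpuinfo is not available on this system.")
```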

## Quantization