# LLM Quantization

**Quantization** of LLMs (Large Language Models) is a technique used to **reduce the memory footprint** and improve inference speed by **representing model weights and activations with lower-precision numbers instead of the standard 32-bit floating-point numbers**. It’s widely used when deploying large models on limited hardware like GPUs with less memory, edge devices, or even for cost-efficient cloud inference.

## Why Quantization?

LLMs, like GPT or LLaMA, are huge (billions of parameters). Storing them in full precision (FP32) requires a lot of memory and slows down inference. Quantization reduces both memory usage and compute cost.

- FP32 (32-bit float): Standard precision; high memory usage.

- FP16 / BF16 (16-bit float): Half precision; reduces memory by 2×.

- INT8 (8-bit integer): Quantization to 8-bit; memory reduced by 4×, faster inference.

- INT4 / INT2 (4-bit / 2-bit integer): Aggressive quantization; memory reduced further but can hurt model quality if not done carefully.

![quantization](https://miro.medium.com/v2/resize:fit:1200/1*_ggBJzuSBWRImqhaJjjATQ.png)

## Notes

| Format   | Bits | Type  | Range             | Precision                 |
| -------- | ---- | ----- | ----------------- | ------------------------- |
| **FP32** | 32   | Float | ±3.4e38           | ~7 digits                 |
| **FP16** | 16   | Float | ±65504            | ~3 digits                 |
| **BF16** | 16   | Float | ±3.4e38           | ~2.5 digits               |
| **INT8** | 8    | Int   | –128 to 127       | exact ints                |
| **INT4** | 4    | Int   | –8 to 7           | exact ints                |
| **NF4**  | 4    | Quant | 16 fixed levels (normal-distribution quantiles) | non-uniform; usually more accurate than INT4 for normally distributed weights |
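
The integer ranges in the table follow from two's-complement arithmetic, and the limited precision of FP16 can be observed with Python's standard `struct` module. The following is a minimal sketch; the helper names `signed_int_range` and `to_fp16` are illustrative and do not come from any library.

```python
import struct

def signed_int_range(bits: int) -> tuple[int, int]:
    """Return (min, max) of a two's-complement signed integer with `bits` bits."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(signed_int_range(8))  # INT8 range
print(signed_int_range(4))  # INT4 range
print(to_fp16(0.1))         # close to 0.1, but not exactly 0.1
print(to_fp16(65504.0))     # the largest finite FP16 value survives the round trip
```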

An LLM with 7 billion parameters would require the following amount of memory for its weights (1 GB = 10^9 bytes).

**In FP32**
7,000,000,000 × 4 = 28,000,000,000 bytes = 28.0 GB

**In FP16**
7,000,000,000 × 2 = 14,000,000,000 bytes = 14.0 GB → 50% memory reduction.

**In INT8**
7,000,000,000 × 1 = 7,000,000,000 bytes = 7.0 GB → 75% memory reduction.

**In INT4/NF4**
7,000,000,000 × 0.5 = 3,500,000,000 bytes = 3.5 GB → 87.5% memory reduction.
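
The same arithmetic can be wrapped in a small helper. This is only a rough sketch: the names `BYTES_PER_PARAM`, `weight_memory_gb`, and `reduction_vs_fp32` are made up for illustration, and the estimate covers the weights only (it ignores activations, the KV cache, and the small overhead of quantization constants).

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {
    "fp32": 4.0, "fp16": 2.0, "bf16": 2.0,
    "int8": 1.0, "int4": 0.5, "nf4": 0.5,
}

def weight_memory_gb(num_params: int, fmt: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

def reduction_vs_fp32(fmt: str) -> float:
    """Percentage of memory saved relative to FP32."""
    return 100.0 * (1 - BYTES_PER_PARAM[fmt] / BYTES_PER_PARAM["fp32"])

for fmt in ("fp32", "fp16", "int8", "nf4"):
    size = weight_memory_gb(7_000_000_000, fmt)
    saved = reduction_vs_fp32(fmt)
    print(f"{fmt}: {size:.1f} GB ({saved:.1f}% smaller than FP32)")
```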

## How Does Quantization Work?
Quantization involves mapping high-precision weights/activations to lower-precision representations. There are a few approaches:

**a) Post-Training Quantization (PTQ)**
  - Done after the model is trained.
  - Fast and simple.
  - Can convert weights from FP32 → INT8 or FP16.
  - Minimal or moderate accuracy loss for many LLMs.
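
One simple way to do post-training quantization is symmetric "absmax" scaling: divide every weight by a single scale so the largest magnitude maps to 127, then round. The sketch below assumes per-tensor scaling and plain Python lists; real libraries work on tensors and usually use per-channel or per-block scales. The function names are illustrative.

```python
def quantize_absmax_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate float values."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.003, 0.9, -0.55]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

print("int8 codes:", q)
print("restored  :", restored)
print("max error :", max(abs(w - r) for w, r in zip(weights, restored)))
```

The rounding error per weight is at most half the scale, which is why a tensor with one large outlier (and therefore a large scale) quantizes poorly.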

**b) Quantization-Aware Training (QAT)**
  - Done during training/fine-tuning.
  - Model learns to adjust to low-precision arithmetic.
  - Maintains higher accuracy than PTQ, especially for aggressive quantization (like INT4).
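
QAT is commonly implemented with "fake quantization": the forward pass uses a weight that has been quantized and immediately dequantized, while the optimizer keeps updating a full-precision copy. The toy sketch below fits a single weight this way, using the straight-through estimator (the rounding step is treated as having gradient 1). Everything here, including the function names, the fixed `scale`, and the data, is an assumption chosen for illustration.

```python
def fake_quantize(w: float, scale: float, bits: int = 4) -> float:
    """Quantize then dequantize, so the forward pass sees the rounding error."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale

def train_qat(xs, ys, scale=0.25, lr=0.05, steps=200):
    """Fit y ≈ w * x where the forward pass uses the fake-quantized weight."""
    w = 0.0  # full-precision "latent" weight updated by gradient descent
    for _ in range(steps):
        w_q = fake_quantize(w, scale)
        # Gradient of the mean squared error with respect to w_q.
        grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Straight-through estimator: pretend d(w_q)/d(w) == 1.
        w -= lr * grad
    return w, fake_quantize(w, scale)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.8 * x for x in xs]  # the ideal weight is 0.8, which is not on the 0.25 grid
latent_w, deployed_w = train_qat(xs, ys)
print("latent weight  :", latent_w)
print("deployed weight:", deployed_w)
```

Because 0.8 is not representable on the grid, the latent weight hovers near a rounding boundary and the deployed weight lands on one of the two neighbouring grid values.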

## Popular Quantization Tools

**Hugging Face** transformers + bitsandbytes
  - Supports 8-bit and 4-bit quantization (bnb library).
  - Example: pass `BitsAndBytesConfig(load_in_8bit=True)` as `quantization_config` to `AutoModelForCausalLM.from_pretrained`.

**Intel Neural Compressor**
  - Optimized for INT8 deployment.

**GPTQ / AWQ**
  - Aggressive 4-bit quantization for LLaMA-like models with minimal accuracy loss.

## Diagram
```
 ┌───────────────────────────────────────────────────────────────────┐
 │                         ORIGINAL MODEL (FP32)                      │
 │                32-bit floating point weights & activations        │
 │                 Memory: ★★★★★   Speed: ★     Accuracy: ★★★★★       │
 └───────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
 ┌───────────────────────────────────────────────────────────────────┐
 │                           FP16 / BF16                              │
 │    16-bit floats → half the size but almost no accuracy loss       │
 │           Memory: ★★★★     Speed: ★★     Accuracy: ★★★★★          │
 └───────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
 ┌───────────────────────────────────────────────────────────────────┐
 │                               INT8                                │
 │    8-bit integers for weights + optional activation quantization   │
 │             Memory: ★★★       Speed: ★★★     Accuracy: ★★★★       │
 │      Common Tools: bitsandbytes (8-bit), Intel Neural Compressor   │
 └───────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
 ┌───────────────────────────────────────────────────────────────────┐
 │                               INT4                                │
 │         4-bit extremely compressed model (e.g., AWQ, GPTQ)         │
 │             Memory: ★★        Speed: ★★★★     Accuracy: ★★★        │
 │   Popular for LLaMA-style models; best tradeoff for local inference │
 └───────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
 ┌───────────────────────────────────────────────────────────────────┐
 │                                INT2                               │
 │       2-bit weights (experimental), biggest speed/memory savings   │
 │            Memory: ★          Speed: ★★★★★    Accuracy: ★★         │
 │           Used only in research or highly constrained devices      │
 └───────────────────────────────────────────────────────────────────┘
 ```


## Trade-offs

| Precision | Memory Reduction | Speed    | Accuracy Impact |
| --------- | ---------------- | -------- | --------------- |
| FP32      | 1×               | baseline | None            |
| FP16/BF16 | 2×               | +        | None/minimal    |
| INT8      | 4×               | ++       | Small           |
| INT4      | 8×               | +++      | Moderate        |

- Lower precision → smaller model and often faster inference (when the hardware and kernels support the format), at the cost of some accuracy loss.

- Some LLMs are surprisingly robust even to INT8/INT4.
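
The accuracy side of the trade-off can be illustrated by measuring round-trip error at different bit widths. The sketch below uses synthetic Gaussian "weights" and simple symmetric absmax quantization, both of which are assumptions for illustration; it says nothing about the end-task accuracy of a real model.

```python
import random

def quantize_dequantize(weights, bits):
    """Symmetric absmax round trip at the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def mean_abs_error(weights, bits):
    """Average absolute difference between original and round-tripped weights."""
    restored = quantize_dequantize(weights, bits)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]
for bits in (8, 4, 2):
    print(f"INT{bits}: mean abs error = {mean_abs_error(weights, bits):.4f}")
```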


In [None]:
!pip install -U bitsandbytes
!pip install -U transformers accelerate



In [None]:
import os
from google.colab import userdata
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

In [None]:
# Example using Hugging Face Transformers + bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantization configuration
bnb_config = BitsAndBytesConfig(
    # Load the model weights in 4-bit format
    # (about 75% less weight memory than FP16).
    load_in_4bit=True,
    # Double quantization: also quantize the quantization constants,
    # saving a little extra memory with negligible accuracy impact.
    bnb_4bit_use_double_quant=True,
    # 4-bit data type: "nf4" (NormalFloat 4-bit, recommended for normally
    # distributed weights) or "fp4" (plain 4-bit float).
    bnb_4bit_quant_type="nf4",
    # Weights are stored in 4-bit but dequantized to bfloat16 on the fly
    # for the computation in the forward pass.
    bnb_4bit_compute_dtype="bfloat16",
)

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Downloads the base model (or reads it from the local cache) and
# quantizes its weights to 4-bit while loading.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
)


## Can You Dequantize?
- YES, if you fine-tuned the model using LoRA / QLoRA (PEFT) on top of a 4-bit base model.

- NO, if the base model itself is GPTQ / AWQ quantized (because that quantization is not reversible).

**Note**: Dequantization never restores the accuracy that was lost. It is like downscaling a high-resolution image: scaling it back up to the original resolution does not bring back the original quality.
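
A tiny numeric illustration of why the loss is permanent: two different weights can round to the same integer code, and after dequantization they are indistinguishable. The `scale` value and function names below are assumptions picked for the example.

```python
def quantize_int4(w: float, scale: float) -> int:
    """Round to the nearest 4-bit signed integer code in [-8, 7]."""
    return max(-8, min(7, round(w / scale)))

def dequantize_int4(q: int, scale: float) -> float:
    """Map a 4-bit code back to a float value."""
    return q * scale

scale = 0.1
a, b = 0.30, 0.34  # two distinct original weights

qa, qb = quantize_int4(a, scale), quantize_int4(b, scale)
print("codes   :", qa, qb)
print("restored:", dequantize_int4(qa, scale), dequantize_int4(qb, scale))
```

Both weights share one code, so no dequantization step can tell which original value produced it.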

Even though dequantization doesn’t restore original precision, we still need it in multiple real-world scenarios.

- To merge LoRA weights into the base model

- Many hardware backends cannot run 4-bit models

- Exporting to other formats requires FP16/FP32

- Running inference on large CPU clusters


In [None]:
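# Code to dequantize after training.
# Assumes `model` is a PEFT (LoRA / QLoRA) model wrapping the 4-bit base model,
# and a recent transformers release that provides `dequantize()`.
model = model.merge_and_unload()  # fold the LoRA adapter weights into the base model
# Casting a bitsandbytes model with .to(dtype=...) is rejected by transformers,
# so unpack the 4-bit weights back to 16-bit with dequantize() instead.
model.dequantize()
model.save_pretrained("my-16bit-model")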
