# 🔢 Quantization

## Outline

- [Quantization Techniques](#quantization-techniques)
- [Linear Quantization](#linear)
- [GPTQ Quantization](#gptq)
- [Other Methods](#other-methods)
- [Summary](#summary)
- [Reference](#reference)

<a id="quantization-techniques"></a>
## Quantization Techniques

- Post training quantization (PTQ):
    - Post training dynamic quantization: the range for each activation is computed on the fly at runtime.
    - Post training static quantization: the range for each activation is computed in advance at quantization-time, typically by passing representative data thro
        - Observers are put on activations to record their values.
        - A number of forward passes are run on a calibration dataset (around 200 examples is typically enough).
        - The ranges for each computation are computed according to some calibration technique.
- Quantization aware training (QAT): the range for each activation is computed at training-time. They simulate the error induced by quantization to let the model

Reference: https://huggingface.co/docs/optimum/concept_guides/quantization#calibration
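
The difference between dynamic and static range computation can be sketched in plain NumPy. This is only an illustration under simple assumptions: the `AbsMaxObserver` class and the random "activations" are hypothetical stand-ins, and a symmetric int8 range is assumed rather than any particular library's calibration technique.

```python
import numpy as np

QMAX = 127  # largest magnitude representable in symmetric int8


def absmax_scale(x):
    """Scale that maps the largest absolute value in x to QMAX."""
    return float(np.max(np.abs(x))) / QMAX


class AbsMaxObserver:
    """Hypothetical observer: remembers the largest activation magnitude seen."""

    def __init__(self):
        self.absmax = 0.0

    def observe(self, x):
        self.absmax = max(self.absmax, float(np.max(np.abs(x))))

    def scale(self):
        return self.absmax / QMAX


rng = np.random.default_rng(0)
# Stand-in for activations recorded during ~200 calibration forward passes.
calibration_batches = [rng.normal(size=(4, 8)) for _ in range(200)]

# Static PTQ: the range is fixed once, before deployment.
observer = AbsMaxObserver()
for batch in calibration_batches:
    observer.observe(batch)
static_scale = observer.scale()

# Dynamic PTQ: the range is recomputed from each input at runtime.
runtime_activation = rng.normal(size=(4, 8))
dynamic_scale = absmax_scale(runtime_activation)

print(f"static scale:  {static_scale:.5f}")
print(f"dynamic scale: {dynamic_scale:.5f}")
```

The static scale costs nothing at inference time but depends on how representative the calibration data is; the dynamic scale adapts to each input at the price of an extra reduction per forward pass.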


<a id="linear"></a>
## Linear quantization

![](/LLM-workshop/day1/figures/quantization_symmetry.webp)

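- $x = S \cdot (x_q - Z)$, where $x$ is the real value, $x_q$ the quantized integer, $S$ the scale, and $Z$ the zero point
- When $Z = 0$: symmetric quantization
- It can be applied per tensor or per channel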
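
The mapping above can be tried out with a small NumPy sketch. The helper names (`affine_params`, `quantize`, `dequantize`) and the toy matrix are illustrative assumptions, not part of Quanto; the code only mirrors the formula $x = S \cdot (x_q - Z)$ for int8.

```python
import numpy as np


def affine_params(x, qmin=-128, qmax=127):
    """Scale S and zero point Z that map [x.min(), x.max()] onto [qmin, qmax]."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """x_q = clip(round(x / S) + Z, qmin, qmax)."""
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)


def dequantize(x_q, scale, zero_point):
    """x = S * (x_q - Z)."""
    return scale * (x_q.astype(np.float64) - zero_point)


# Affine (asymmetric) quantization of a non-negative tensor.
x = np.array([0.0, 0.5, 1.0, 2.0])
S, Z = affine_params(x)
x_hat = dequantize(quantize(x, S, Z), S, Z)

# Symmetric quantization (Z = 0), per tensor vs. per channel (one scale per row).
W = np.array([[0.1, -0.2, 0.3],
              [10.0, -20.0, 30.0]])
s_tensor = np.abs(W).max() / 127
s_channel = np.abs(W).max(axis=1, keepdims=True) / 127

err_tensor = np.abs(W - dequantize(quantize(W, s_tensor, 0), s_tensor, 0)).mean()
err_channel = np.abs(W - dequantize(quantize(W, s_channel, 0), s_channel, 0)).mean()

print("affine round trip:", x_hat)
print("mean abs error, per tensor: ", err_tensor)
print("mean abs error, per channel:", err_channel)
```

With a single per-tensor scale, the small first row shares the grid of the large second row and loses most of its resolution; a per-channel scale gives each row its own grid.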


### Affine quantization in Quanto (Int8)

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import QuantizedModelForCausalLM, qint8

# https://www.geeksforgeeks.org/nlp/perplexity-for-llm-evaluation/
def compute_perplexity_for_batch(model, tokenizer, input_texts):
    inputs = tokenizer(
        input_texts, return_tensors="pt", padding=True, truncation=True
    )

    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits

    shift_logits = logits[:, :-1, :] 
    shift_labels = input_ids[:, 1:] 

    log_probs = torch.nn.functional.log_softmax(shift_logits, dim=-1)
    target_log_probs = log_probs.gather(dim=-1, index=shift_labels.unsqueeze(-1)).squeeze(-1)
    target_log_probs = target_log_probs * attention_mask[:, 1:].to(log_probs.dtype)
    negative_log_likelihood = -target_log_probs.sum(dim=-1) / attention_mask[:, 1:].sum(dim=-1)
    perplexities = torch.exp(negative_log_likelihood)
    mean_perplexity_score = torch.mean(perplexities)

    return {
        "perplexities": perplexities.tolist(),
        "mean_perplexity": mean_perplexity_score.item()
    }

In [2]:
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"

model = AutoModelForCausalLM.from_pretrained(model_name)
print(model)
print(model.model.layers[0].self_attn.q_proj.weight)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb):

In [3]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

example_texts = [
    "Once upon a time, there was a brave knight.",
    "In a galaxy far, far away, a new adventure began."
]

# Compute perplexity scores for the batch of input texts
results = compute_perplexity_for_batch(model, tokenizer, example_texts)
print(f"Perplexity scores for each text: {results['perplexities']}")

Perplexity scores for each text: [42.679718017578125, 16.073394775390625]


In [4]:
qmodel = QuantizedModelForCausalLM.quantize(model, weights=qint8, exclude='lm_head')
print(qmodel)
print(qmodel.model.layers[0].self_attn.q_proj.weight)

qmodel.save_pretrained('output/official/QLlama-3.2-1B')

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): QLinear(in_features=2048, out_features=2048, bias=False)
          (k_proj): QLinear(in_features=2048, out_features=512, bias=False)
          (v_proj): QLinear(in_features=2048, out_features=512, bias=False)
          (o_proj): QLinear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): QLinear(in_features=2048, out_features=8192, bias=False)
          (up_proj): QLinear(in_features=2048, out_features=8192, bias=False)
          (down_proj): QLinear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotar

In [5]:
results = compute_perplexity_for_batch(qmodel, tokenizer, example_texts)
print(f"Perplexity scores for each text: {results['perplexities']}")

Perplexity scores for each text: [42.3848876953125, 16.23394012451172]


### Quanto integration in Transformers

In [6]:
from transformers import AutoModelForCausalLM, QuantoConfig

quantization_config = QuantoConfig(weights="int8", activations=None)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config)
print(model)
print(model.model.layers[0].self_attn.q_proj.weight)

# A Quanto-quantized model loaded through transformers cannot be serialized, so it cannot be saved
# model.save_pretrained("output/transformers/QLlama-3.2-1B")

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): QLinear(in_features=2048, out_features=2048, bias=False)
          (k_proj): QLinear(in_features=2048, out_features=512, bias=False)
          (v_proj): QLinear(in_features=2048, out_features=512, bias=False)
          (o_proj): QLinear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): QLinear(in_features=2048, out_features=8192, bias=False)
          (up_proj): QLinear(in_features=2048, out_features=8192, bias=False)
          (down_proj): QLinear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotar

### Activation quantization / Calibration in Quanto

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import quantize, freeze, qint8, Calibration, quantization_map
from safetensors.torch import save_file
from datasets import load_dataset

model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_name)

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).select(range(1024))

quantize(model, weights=qint8, activations=qint8)
with torch.no_grad(), Calibration(momentum=0.9):
    model.eval()
    for batch in calibration_dataset.iter(batch_size=2):
        inputs = tokenizer(batch["text"], return_tensors="pt", padding=True)
        input_ids = inputs.input_ids.to(model.device)
        attention_mask = inputs.attention_mask.to(model.device)
        output = model(input_ids, attention_mask=attention_mask)

        # Good habit: release the batch tensors and cached GPU memory between iterations
        del input_ids, attention_mask
        torch.cuda.empty_cache()

print(model)

Generating train split: 0 examples [00:00, ? examples/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): QLinear(in_features=2048, out_features=2048, bias=False)
          (k_proj): QLinear(in_features=2048, out_features=512, bias=False)
          (v_proj): QLinear(in_features=2048, out_features=512, bias=False)
          (o_proj): QLinear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): QLinear(in_features=2048, out_features=8192, bias=False)
          (up_proj): QLinear(in_features=2048, out_features=8192, bias=False)
          (down_proj): QLinear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotar

In [2]:
import os
import json

os.makedirs("output/calibration/QLlama-3.2-1B", exist_ok=True)

# Freeze integer weights
freeze(model)

# Serialize quantized model
save_file(model.state_dict(), 'output/calibration/QLlama-3.2-1B/model.safetensors')
# Store the quantized model's quantization map
with open('output/calibration/QLlama-3.2-1B/quantization_map.json', 'w') as f:
    json.dump(quantization_map(model), f)

In [3]:
import json

import torch
from safetensors.torch import load_file
from optimum.quanto import requantize
from transformers import AutoModelForCausalLM, AutoConfig

state_dict = load_file('output/calibration/QLlama-3.2-1B/model.safetensors')
# Use a distinct name so the imported `quantization_map` function is not shadowed
with open('output/calibration/QLlama-3.2-1B/quantization_map.json', 'r') as f:
    qmap = json.load(f)

# Create an empty model from your modeling code and requantize it
config = AutoConfig.from_pretrained("/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/config.json")
model = AutoModelForCausalLM.from_config(config)
requantize(model, state_dict, qmap, device=torch.device('cuda'))


In [4]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): QLinear(in_features=2048, out_features=2048, bias=False)
          (k_proj): QLinear(in_features=2048, out_features=512, bias=False)
          (v_proj): QLinear(in_features=2048, out_features=512, bias=False)
          (o_proj): QLinear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): QLinear(in_features=2048, out_features=8192, bias=False)
          (up_proj): QLinear(in_features=2048, out_features=8192, bias=False)
          (down_proj): QLinear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotar

### Outlier problem

- Quanto simply uses `absmax()` to calculate the scale
- A single outlier inflates the scale, so most of the remaining values are rounded to 0
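
The effect is easy to reproduce with a few lines of NumPy. The sketch below assumes a plain absmax int8 quantizer (the `absmax_quantize` helper is illustrative, not Quanto's code) and synthetic weights with one injected extreme value.

```python
import numpy as np


def absmax_quantize(x, qmax=127):
    """Symmetric int8 quantization with the scale taken from the absolute maximum."""
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).astype(np.int8), scale


rng = np.random.default_rng(0)
x = rng.normal(scale=0.02, size=1024)  # typical small values
x_outlier = x.copy()
x_outlier[0] = 20.0                    # one extreme value

q, _ = absmax_quantize(x)
q_out, _ = absmax_quantize(x_outlier)

zero_fraction = np.mean(q == 0)
zero_fraction_outlier = np.mean(q_out == 0)

print("fraction of zeros without outlier:", zero_fraction)
print("fraction of zeros with outlier:   ", zero_fraction_outlier)
```

Because the outlier alone sets the scale, the step between two integer levels becomes larger than almost every other value in the tensor.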

### LLM.int8() in Bitsandbytes

- Outliers are saved in a separate tensor to keep their information
- The model is quantized on the fly, without loading the model in full precision

![](/LLM-workshop/day2/figures/bitsandbytes_int8.png)
(Credit: [Dettmers+2022](https://arxiv.org/abs/2208.07339))
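
The decomposition in the figure can be approximated with the NumPy sketch below. It is a simplified illustration rather than the bitsandbytes kernel: the function names are hypothetical, the outlier part is kept in float64 instead of fp16, and the threshold of 6.0 is the value reported as the default for LLM.int8().

```python
import numpy as np


def absmax_scale(a, axis):
    """Vector-wise int8 scale along `axis`, guarded against empty or all-zero slices."""
    amax = np.max(np.abs(a), axis=axis, keepdims=True, initial=0.0)
    return np.maximum(amax, 1e-12) / 127.0


def int8_matmul_with_outliers(X, W, threshold=6.0):
    """X: (tokens, in_features), W: (in_features, out_features)."""
    # Feature dimensions of X that contain at least one outlier.
    outlier_cols = np.any(np.abs(X) >= threshold, axis=0)
    regular_cols = ~outlier_cols

    # Outlier dimensions are multiplied in higher precision.
    out_outlier = X[:, outlier_cols] @ W[outlier_cols, :]

    # Remaining dimensions: row-wise scales for X, column-wise scales for W.
    Xr, Wr = X[:, regular_cols], W[regular_cols, :]
    sx, sw = absmax_scale(Xr, axis=1), absmax_scale(Wr, axis=0)
    Xq = np.round(Xr / sx).astype(np.int8)
    Wq = np.round(Wr / sw).astype(np.int8)
    acc = Xq.astype(np.int32) @ Wq.astype(np.int32)  # integer accumulation
    out_regular = acc * sx * sw                       # dequantize the result

    return out_outlier + out_regular


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64))
X[:, 3] = 60.0  # one outlier feature dimension
W = rng.normal(size=(64, 16))

exact = X @ W
mixed = int8_matmul_with_outliers(X, W)
naive = int8_matmul_with_outliers(X, W, threshold=np.inf)  # everything in int8

err_mixed = np.abs(mixed - exact).mean()
err_naive = np.abs(naive - exact).mean()
print("mean abs error, outliers kept separately:", err_mixed)
print("mean abs error, everything in int8:      ", err_naive)
```

Setting `threshold=np.inf` disables the outlier path, which makes it easy to compare against quantizing the whole product.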

In [1]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map="auto",
    quantization_config=quantization_config, 
    torch_dtype="auto"
)
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear8bitLt(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear8bitLt(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear8bitLt(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear8bitLt(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear8bitLt(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMS

In [2]:
for name, param in model.named_parameters():  
    if hasattr(param, "SCB"):
        print(name)
        print(param)
        print(param.SCB)
        break
print(model.get_memory_footprint() / 1e9)

model.layers.0.self_attn.q_proj.weight
Parameter containing:
Parameter(Int8Params([[-44,  16,  61,  ..., -22, -29,  50],
            [ 14,  70,  65,  ..., -39, -18,  13],
            [ 16,  14,  30,  ..., -34, -34, -24],
            ...,
            [ 17,  20,  40,  ..., -40, -15, -16],
            [ 32, -35,  50,  ..., -17, -41, -21],
            [-14, -30,  -7,  ...,  30,   5,  -2]], device='cuda:0',
           dtype=torch.int8))
tensor([0.0515, 0.1079, 0.1436,  ..., 0.2256, 0.0898, 0.2305], device='cuda:0')
1.4985504
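
To relate the printed `Int8Params` values to real-valued weights, the sketch below assumes that `SCB` stores one absolute-maximum scale per output row, so that a weight is recovered approximately as `int8_value * SCB[row] / 127`. This convention is an assumption about bitsandbytes internals, not something the printout itself confirms; the numbers are copied from the top-left corner of the output above.

```python
import numpy as np

# First three entries of the first three rows printed above.
int8_rows = np.array([[-44, 16, 61],
                      [14, 70, 65],
                      [16, 14, 30]], dtype=np.int8)
# First three entries of the printed SCB tensor (assumed: per-row absmax scale).
scb = np.array([0.0515, 0.1079, 0.1436])

# Assumed dequantization rule: W ~ int8 * SCB / 127, with one scale per row.
w_approx = int8_rows.astype(np.float64) * scb[:, None] / 127

print(w_approx)
```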


<a id="gptq"></a>
## GPTQ Quantization

- Error compensation: after quantizing a specific weight (or group of weights), GPTQ calculates the error introduced by this quantization step and immediately updates the remaining, not-yet-quantized weights in the layer to counteract that error.
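
The error-compensation step can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: weights are quantized one input column at a time, the inverse Hessian is updated explicitly after each column (an OBQ-style update rather than the Cholesky-based, blocked formulation used in practice), the grid is uniform with no clipping, and the calibration inputs are synthetic. The function names are hypothetical.

```python
import numpy as np


def round_to_grid(w, scale):
    """Round to the nearest multiple of `scale` (no clipping, for brevity)."""
    return scale * np.round(w / scale)


def gptq_sketch(W, X, scale, damp=0.01):
    """W: (out_features, in_features); X: (n_samples, in_features)."""
    W = W.astype(np.float64).copy()
    n_in = W.shape[1]
    H = X.T @ X                                      # proxy Hessian of the layer output error
    H += damp * np.mean(np.diag(H)) * np.eye(n_in)   # damping for numerical stability
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for i in range(n_in):
        # 1) Quantize column i of every output row.
        Q[:, i] = round_to_grid(W[:, i], scale)
        # 2) Spread the resulting error over the not-yet-quantized columns.
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        W[:, i:] -= np.outer(err, Hinv[i, i:])
        # 3) Remove column i from the inverse Hessian.
        Hinv -= np.outer(Hinv[:, i], Hinv[i, :]) / Hinv[i, i]
    return Q


rng = np.random.default_rng(0)
# Correlated calibration inputs: 16 features driven by 4 latent factors plus noise.
X = rng.normal(size=(256, 4)) @ rng.normal(size=(4, 16)) + 0.1 * rng.normal(size=(256, 16))
W = rng.normal(size=(8, 16))
scale = np.abs(W).max() / 7  # coarse grid, roughly 4-bit resolution

Q_rtn = round_to_grid(W, scale)   # plain round-to-nearest baseline
Q_gptq = gptq_sketch(W, X, scale)


def output_error(Q):
    """Squared error of the layer output on the calibration inputs."""
    return np.linalg.norm(X @ W.T - X @ Q.T) ** 2


rtn_err, gptq_err = output_error(Q_rtn), output_error(Q_gptq)
print("round-to-nearest output error:", rtn_err)
print("compensated output error:     ", gptq_err)
```

The comparison is made on the layer output rather than on the weights themselves: compensated weights can sit further from the originals, yet reproduce the outputs more closely when input features are correlated.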

### GPTQ in [GPTQModel](https://github.com/ModelCloud/GPTQModel)

In [1]:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoModelForCausalLM

model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4)

model = GPTQModel.load(model_name, quant_config)

# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=2)

model.save("output/official/QLlama-3.2-1B")


[32mINFO[0m  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
[32mINFO[0m  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.          


[32mINFO[0m  Estimated Quantization BPW (bits per weight): 4.2875 bpw, based on [bits: 4, group_size: 128]
[32mINFO[0m  Loader: Auto dtype (native bfloat16): `torch.bfloat16`                   


INFO:tokenicer.tokenicer:Tokenicer: Auto fixed pad_token_id=128004 (token='<|finetune_right_pad_id|>').


[32mINFO[0m  Model: Loaded `generation_config`: GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "temperature": 0.6,
  "top_p": 0.9
}

[32mINFO[0m  Kernel: loaded -> `[]`                                                   
[32mINFO[0m  Packing Kernel: Auto-selection: adding candidate `TritonV2QuantLinear`   
[32mINFO[0m  Process: progress logs for `gptq` will be streamed to file: `gptq_log_mythopoetry_time_11_03_2025_11h_03m_44s.log`
[32mINFO[0m  --------------------------------------------------------------------------------------------------------------------------
[32mINFO[0m  | process     | layer     | module               | loss           | samples     | damp        | time      | fwd_time     |
[32mINFO[0m  --------------------------------------------------------------------------------------------------------------------------
[32mINFO[0m  | gptq        | 0         | self_attn.k_proj   

### GPTQ integration in Transformers

In [1]:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"

tokenizer = AutoTokenizer.from_pretrained(model_name)
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",  # optimum will download 'en/c4-train.00000-of-01024.json.gz'
    tokenizer=tokenizer,
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=gptq_config
)
print(quantized_model)
print(quantized_model.get_memory_footprint())

quantized_model.save_pretrained("output/transformers/QLlama-3.2-1B")
tokenizer.save_pretrained("output/transformers/QLlama-3.2-1B")


[32mINFO[0m  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
[32mINFO[0m  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.          


en/c4-train.00000-of-01024.json.gz:   0%|          | 0.00/319M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Quantizing model.layers blocks :   0%|          | 0/16 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Start quantizing block model.layers 1/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 1/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 1/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 1/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 1/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 1/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 1/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 1/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 2/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 2/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 2/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 2/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 2/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 2/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 2/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 2/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 3/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 3/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 3/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 3/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 3/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 3/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 3/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 3/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 4/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 4/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 4/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 4/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 4/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 4/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 4/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 4/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 5/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 5/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 5/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 5/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 5/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 5/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 5/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 5/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 6/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 6/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 6/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 6/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 6/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 6/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 6/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 6/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 7/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 7/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 7/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 7/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 7/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 7/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 7/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 7/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 8/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 8/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 8/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 8/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 8/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 8/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 8/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 8/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 9/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 9/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 9/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 9/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 9/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 9/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 9/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 9/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 10/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 10/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 10/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 10/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 10/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 10/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 10/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 10/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 11/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 11/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 11/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 11/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 11/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 11/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 11/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 11/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 12/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 12/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 12/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 12/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 12/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 12/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 12/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 12/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 13/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 13/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 13/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 13/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 13/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 13/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 13/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 13/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 14/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 14/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 14/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 14/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 14/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 14/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 14/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 14/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 15/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 15/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 15/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 15/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 15/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 15/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 15/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 15/16...
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 16/16
INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]


Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 16/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 16/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 16/16...
INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 16/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 16/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 16/16...
INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 16/16...
INFO:optimum.gptq.quantizer:Packing model...


[32mINFO[0m  Packing Kernel: Auto-selection: adding candidate `TritonV2QuantLinear`   


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.
INFO:optimum.gptq.quantizer:model.layers.0.self_attn.k_proj
INFO:optimum.gptq.quantizer:model.layers.0.self_attn.o_proj
INFO:optimum.gptq.quantizer:model.layers.0.self_attn.q_proj
INFO:optimum.gptq.quantizer:model.layers.0.self_attn.v_proj
INFO:optimum.gptq.quantizer:model.layers.0.mlp.down_proj
INFO:optimum.gptq.quantizer:model.layers.0.mlp.gate_proj
INFO:optimum.gptq.quantizer:model.layers.0.mlp.up_proj
INFO:optimum.gptq.quantizer:model.layers.1.self_attn.k_proj
INFO:optimum.gptq.quantizer:model.layers.1.self_attn.o_proj
INFO:optimum.gptq.quantizer:model.layers.1.self_attn.q_proj
INFO:optimum.gptq.quantizer:model.layers.1.self_attn.v_proj
INFO:optimum.gptq.quantizer:model.layers.1.mlp.down_proj
INFO:optimum.gptq.quantizer:model.layers.1.mlp.gate_proj
INFO:optimum.gptq.quantizer:model.layers.1.mlp.up_proj
INFO:optimum.gptq.quantizer:model.layers.2.self_attn.k_proj
INFO:optimum.gptq

INFO  Optimize: `TritonV2QuantLinear` compilation triggered.
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (k_proj): TritonV2QuantLinear()
          (o_proj): TritonV2QuantLinear()
          (q_proj): TritonV2QuantLinear()
          (v_proj): TritonV2QuantLinear()
        )
        (mlp): LlamaMLP(
          (act_fn): SiLU()
          (down_proj): TritonV2QuantLinear()
          (gate_proj): TritonV2QuantLinear()
          (up_proj): TritonV2QuantLinear()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
1032327296
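
The integer printed after the module tree is consistent with a memory footprint in bytes. The sketch below is a back-of-the-envelope check of that reading, not the library's own accounting. It assumes the public Llama-3.2-1B shapes (hidden size 2048, MLP size 8192, key/value projection width 512), 4-bit weights with a group size of 128, fp16 scales, packed 4-bit zero points, an int32 group index per input row, and an fp16 embedding shared with `lm_head`. None of these values are stated in the output above, and `quant_linear_bytes` is a hypothetical helper.

```python
# Back-of-the-envelope memory estimate for a 4-bit GPTQ Llama-3.2-1B.
# Every shape and storage choice here is an assumption, not read from the checkpoint.
HIDDEN, INTERMEDIATE, KV_DIM = 2048, 8192, 512  # KV_DIM = 8 KV heads x 64 dims
VOCAB, N_LAYERS = 128256, 16
BITS, GROUP_SIZE = 4, 128

# (in_features, out_features) of the seven quantized projections in one block
LINEARS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def quant_linear_bytes(in_features, out_features):
    """Estimated storage of one quantized linear layer, in bytes."""
    groups = in_features // GROUP_SIZE
    qweight = in_features * out_features * BITS // 8  # packed 4-bit weights
    scales = groups * out_features * 2                # one fp16 scale per group and output column
    qzeros = groups * out_features * BITS // 8        # one packed 4-bit zero point per scale
    g_idx = in_features * 4                           # int32 group index per input row
    return qweight + scales + qzeros + g_idx


block_bytes = sum(quant_linear_bytes(i, o) for i, o in LINEARS.values())
embed_bytes = VOCAB * HIDDEN * 2              # fp16 embedding, assumed shared with lm_head
norm_bytes = (2 * N_LAYERS + 1) * HIDDEN * 2  # fp16 RMSNorm weights
rotary_bytes = 32 * 4                         # float32 inv_freq buffer (head_dim / 2 entries)

total_bytes = N_LAYERS * block_bytes + embed_bytes + norm_bytes + rotary_bytes
print(total_bytes)  # 1032327296
```

Under these assumptions the estimate equals the printed value, roughly 0.96 GiB. About half of it (525,336,576 bytes) is the unquantized embedding table, which is why a 4-bit 1B model does not shrink to a quarter of its fp16 size.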


('output/transformers/QLlama-3.2-1B/tokenizer_config.json',
 'output/transformers/QLlama-3.2-1B/special_tokens_map.json',
 'output/transformers/QLlama-3.2-1B/chat_template.jinja',
 'output/transformers/QLlama-3.2-1B/tokenizer.json')
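
The `Packing model...` line in the log marks the step where the 4-bit integer codes are stored compactly instead of one value per byte or word. As a minimal sketch of the idea, the snippet below packs eight 4-bit codes into one 32-bit word and unpacks them again. The bit order and word layout are illustrative assumptions, not the layout used by `TritonV2QuantLinear`, and `pack_int4` / `unpack_int4` are hypothetical helpers.

```python
def pack_int4(values):
    """Pack unsigned 4-bit codes (0..15) into 32-bit words, eight codes per word."""
    if len(values) % 8 != 0:
        raise ValueError("number of values must be a multiple of 8")
    words = []
    for start in range(0, len(values), 8):
        word = 0
        for slot, v in enumerate(values[start:start + 8]):
            if not 0 <= v <= 15:
                raise ValueError(f"value {v} does not fit in 4 bits")
            word |= v << (4 * slot)  # code `slot` occupies bits 4*slot .. 4*slot+3
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover the eight 4-bit codes stored in each word."""
    return [(word >> (4 * slot)) & 0xF for word in words for slot in range(8)]


codes = [3, 15, 0, 7, 8, 1, 12, 5, 9, 9, 2, 14, 6, 0, 11, 4]
packed = pack_int4(codes)
assert unpack_int4(packed) == codes  # packing is lossless
print(len(codes), "codes ->", len(packed), "words")  # 16 codes -> 2 words
```

Packing changes only the storage format. The quantization error comes from rounding weights to 4-bit codes earlier, so unpacking recovers those codes exactly.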

<a id="other-methods"></a>
## Other quantization methods supported in Transformers

- https://huggingface.co/docs/transformers/quantization/overview

| Quantization Method                        | On the fly quantization | CPU | CUDA GPU | ROCm GPU | Metal (Apple Silicon) | Intel GPU | Torch compile() | Bits         | PEFT Fine Tuning | Serializable with 🤗Transformers | 🤗Transformers Support | Link to library                                         |
|--------------------------------------------|-------------------------|-----|----------|----------|-----------------------|-----------|-----------------|--------------|------------------|----------------------------------|------------------------|---------------------------------------------------------|
| [AQLM](./aqlm)                             | 🔴                      | 🟢  | 🟢       | 🔴       | 🔴                    | 🟢        | 🟢              | 1/2          | 🟢               | 🟢                               | 🟢                     | https://github.com/Vahe1994/AQLM                        |
| [AutoRound](./auto_round)                  | 🔴                      | 🟢  | 🟢       | 🔴       | 🔴                    | 🟢        | 🔴              | 2/3/4/8      | 🔴               | 🟢                               | 🟢                     | https://github.com/intel/auto-round                     |
| [AWQ](./awq)                               | 🔴                      | 🟢  | 🟢       | 🟢       | 🔴                    | 🟢        | ?               | 4            | 🟢               | 🟢                               | 🟢                     | https://github.com/casper-hansen/AutoAWQ                |
| [bitsandbytes](./bitsandbytes)             | 🟢                      | 🟢  | 🟢       | 🟡       | 🟡                    | 🟢        | 🟢              | 4/8          | 🟢               | 🟢                               | 🟢                     | https://github.com/bitsandbytes-foundation/bitsandbytes |
| [compressed-tensors](./compressed_tensors) | 🔴                      | 🟢  | 🟢       | 🟢       | 🔴                    | 🔴        | 🔴              | 1/8          | 🟢               | 🟢                               | 🟢                     | https://github.com/neuralmagic/compressed-tensors       |
| [EETQ](./eetq)                             | 🟢                      | 🔴  | 🟢       | 🔴       | 🔴                    | 🔴        | ?               | 8            | 🟢               | 🟢                               | 🟢                     | https://github.com/NetEase-FuXi/EETQ                    |
| [FP-Quant](./fp_quant)                     | 🟢                      | 🔴  | 🟢       | 🔴       | 🔴                    | 🔴        | 🟢              | 4            | 🔴               | 🟢                               | 🟢                     | https://github.com/IST-DASLab/FP-Quant                  |
| [GGUF / GGML (llama.cpp)](../gguf)         | 🟢                      | 🟢  | 🟢       | 🔴       | 🟢                    | 🟢        | 🔴              | 1/8          | 🔴               | [See Notes](../gguf)             | [See Notes](../gguf)   | https://github.com/ggerganov/llama.cpp                  |
| [GPTQModel](./gptq)                        | 🔴                      | 🟢  | 🟢       | 🟢       | 🟢                    | 🟢        | 🔴              | 2/3/4/8      | 🟢               | 🟢                               | 🟢                     | https://github.com/ModelCloud/GPTQModel                 |
| [AutoGPTQ](./gptq)                         | 🔴                      | 🔴  | 🟢       | 🟢       | 🔴                    | 🔴        | 🔴              | 2/3/4/8      | 🟢               | 🟢                               | 🟢                     | https://github.com/AutoGPTQ/AutoGPTQ                    |
| [HIGGS](./higgs)                           | 🟢                      | 🔴  | 🟢       | 🔴       | 🔴                    | 🔴        | 🟢              | 2/4          | 🔴               | 🟢                               | 🟢                     | https://github.com/HanGuo97/flute                       |
| [HQQ](./hqq)                               | 🟢                      | 🟢  | 🟢       | 🔴       | 🔴                    | 🟢        | 🟢              | 1/8          | 🟢               | 🔴                               | 🟢                     | https://github.com/mobiusml/hqq/                        |
| [optimum-quanto](./quanto)                 | 🟢                      | 🟢  | 🟢       | 🔴       | 🟢                    | 🟢        | 🟢              | 2/4/8        | 🔴               | 🔴                               | 🟢                     | https://github.com/huggingface/optimum-quanto           |
| [FBGEMM_FP8](./fbgemm_fp8)                 | 🟢                      | 🔴  | 🟢       | 🔴       | 🔴                    | 🔴        | 🔴              | 8            | 🔴               | 🟢                               | 🟢                     | https://github.com/pytorch/FBGEMM                       |
| [torchao](./torchao)                       | 🟢                      | 🟢  | 🟢       | 🔴       | 🟡                    | 🟢        |                 | 4/8          |                  | 🟢🔴                             | 🟢                     | https://github.com/pytorch/ao                           |
| [VPTQ](./vptq)                             | 🔴                      | 🔴  | 🟢       | 🟡       | 🔴                    | 🔴        | 🟢              | 1/8          | 🔴               | 🟢                               | 🟢                     | https://github.com/microsoft/VPTQ                       |
| [FINEGRAINED_FP8](./finegrained_fp8)       | 🟢                      | 🔴  | 🟢       | 🔴       | 🔴                    | 🟢        | 🔴              | 8            | 🔴               | 🟢                               | 🟢                     |                                                         |
| [SpQR](./spqr)                             | 🔴                      | 🔴  | 🟢       | 🔴       | 🔴                    | 🔴        | 🟢              | 3            | 🔴               | 🟢                               | 🟢                     | https://github.com/Vahe1994/SpQR/                       |
| [Quark](./quark)                           | 🔴                      | 🟢  | 🟢       | 🟢       | 🟢                    | 🟢        | ?               | 2/4/6/8/9/16 | 🔴               | 🔴                               | 🟢                     | https://quark.docs.amd.com/latest/                      |


<a id="summary"></a>
## Summary

We have covered:

1. Using `optimum-quanto` to linearly quantize Llama-3.2-1B to 8-bit
2. Using `bitsandbytes` to linearly quantize Llama-3.2-1B to 8-bit while handling outliers
3. Using `GPTQModel` to quantize Llama-3.2-1B to 4-bit with the GPTQ method
4. Saving quantized models and reloading them
5. Calibration

<a id="reference"></a>
## Reference

- https://www.kaggle.com/code/aisuko/introduction-to-weight-quantization/notebook
- https://www.kaggle.com/code/aisuko/quantization-methods
- https://www.kaggle.com/code/aisuko/quantization-with-gptq
- https://apxml.com/courses/practical-llm-quantization
- https://github.com/huggingface/optimum-quanto
- https://github.com/bitsandbytes-foundation/bitsandbytes
- https://github.com/ModelCloud/GPTQModel
- https://huggingface.co/docs/transformers/quantization