## Quantization in Large Language Models

### **Introduction to Quantization**

Quantization is a process used to reduce the memory requirements and computational complexity of large machine learning models. By representing model parameters with lower-precision values, quantization makes it possible to run models more efficiently on devices with limited memory and computational resources.

For large language models (LLMs), quantization can:
- **Reduce Memory Usage:** Lower-precision data types (such as int8) use less memory than higher-precision types (like float32), allowing models to fit into memory-constrained environments.
- **Improve Inference Speed:** By using simpler operations on smaller data types, quantization can reduce the time it takes for a model to process inputs and generate outputs.
- **Preserve Accuracy:** Quantization is carefully designed to minimize the impact on model accuracy, though a trade-off often exists between precision and efficiency.

We will focus on Post-Training Quantization (PTQ), a quantization technique that applies quantization to a pre-trained model. PTQ is a popular method for quantizing large language models because it can be applied to a wide range of models that we may want to use in inference mode. By contrast, quantization-aware training (QAT) requires retraining the model with quantization in mind, which can be more complex and time-consuming.

We will first get a general understanding of quantization by manually implementing two commonly adopted approaches: absmax and minmax (or zero-point). 

Next, we will explore two different ways (one using PyTorch, the other using HuggingFace) to run a **dynamic quantization** (i.e., PTQ where only the weights are quantized, and not the activations). 

## 1. Absmax and Minmax Quantization

The goal of quantization is, remember, mapping continuos values (e.g., float32) into a discrete set of values (e.g., int8). 

So let's create a matrix $W$ and a vector $x$ to be quantized. Let's initialize them randomly (but, just to see what happens, let's set W[0,0] and x[0] = 0).

In [3]:
import torch

torch.random.manual_seed(0)

n_rows = 3
n_cols = 5
W = torch.randn(n_rows, n_cols)
x = torch.randn(n_cols)

W[0,0] = 0
x[0] = 0

Let's first compute the matrix multiplication to observe the result. This is the operation that we typically want to execute, and that we want to quantize.

We will quantize $W$ and $x$ separately, and then multiply them together. Finally, we will need to dequantize the result to compare it with the original result.

In [4]:
out = W @ x
print(out)

tensor([0.4829, 1.3345, 0.2121])


## 1.1 Absmax quantization

In absmax quantization, we use a symmetric range around 0. This means that we need to identify the maximum absolute value in the matrix $W$ and the vector $x$.

We define a function `absmax_quantize` that takes as input any tensor and produces a version of the same tensor, but quantized. 

In [5]:
def absmax_quantize(W):
    # NOTE: we assume that we always map to 8-bit integers
    max_value = W.abs().max()
    scale = 127 / max_value # how "long" the step between any two int8 values is
    W_q = (W * scale).round().to(torch.int8)
    return W_q, scale

Notice, we return both the quantized tensor and the scale factor. The scale factor is used to dequantize the tensor. So we might as well define a dequantize function:

In [6]:
def absmax_dequantize(W_q, scale):
    return W_q.float() / scale

Let's get the quantized version of W, and of x. Then, we can check how much we are losing by quantizing the values.

In [7]:
W_q, scale_W = absmax_quantize(W)
x_q, scale_x = absmax_quantize(x)

In [8]:
print(W_q)
print(W)
W_deq = absmax_dequantize(W_q, scale_W)
print(W_deq)
print((W - W_deq).abs().mean())

tensor([[   0,  -17, -127,   33,  -63],
        [ -82,   24,   49,  -42,  -24],
        [ -35,   11,  -50,   64,  -62]], dtype=torch.int8)
tensor([[ 0.0000, -0.2934, -2.1788,  0.5684, -1.0845],
        [-1.3986,  0.4033,  0.8380, -0.7193, -0.4033],
        [-0.5966,  0.1820, -0.8567,  1.1006, -1.0712]])
tensor([[ 0.0000, -0.2916, -2.1788,  0.5661, -1.0808],
        [-1.4068,  0.4117,  0.8406, -0.7205, -0.4117],
        [-0.6005,  0.1887, -0.8578,  1.0980, -1.0637]])
tensor(0.0039)


In [9]:
print(x_q)
print(x)
x_deq = absmax_dequantize(x_q, scale_x)
print(x_deq)
print((x - x_deq).abs().mean())

tensor([   0,  -48,   31,  -75, -127], dtype=torch.int8)
tensor([ 0.0000, -0.5663,  0.3731, -0.8920, -1.5091])
tensor([ 0.0000, -0.5704,  0.3684, -0.8912, -1.5091])
tensor(0.0019)


Notice that, in both cases, absmax maps the value 0 to 0. This is a good property, as it allows us to represent the zero value without losing any information. This property stems from the symmetry around 0 we imposed.

However, do note that we are also "wasting" some bits of the range! Can you spot where?

Let's now compute the matrix multiplication between W_q and x_q. 

In [10]:
W_q @ x_q

tensor([101, -91, -28], dtype=torch.int8)

Can you see that there's something wrong? Let's see what one of the rows of W_q and x_q contain:

In [11]:
W_q[0], x_q

(tensor([   0,  -17, -127,   33,  -63], dtype=torch.int8),
 tensor([   0,  -48,   31,  -75, -127], dtype=torch.int8))

The dot product of these two vectors definitely isn't what we get as the first number of the matrix multiplication -- i.e. (W_q @ x_q)[0]. Indeed, we can run as int16, and see that the result is quite different:

In [12]:
W_q.to(torch.int16) @ x_q.to(torch.int16)

tensor([2405, 6565,  996], dtype=torch.int16)

The result of the dot product overflows the int8 range. This is a well-known problem. Indeed, the accumulation of results, in quantization, is typically done with higher precision than the single values. This is tricky to do in pure Python/PyTorch, but can be done efficiently in other ways.

Let's stick to the simple approach for now. 

In [13]:
out_q = W_q.to(torch.int16) @ x_q.to(torch.int16)

To get the correct result, we need to dequantize the result. This is done by multiplying the result by the scale factor of the two operands.

In [14]:
out_deq = absmax_dequantize(absmax_dequantize(out_q, scale_W), scale_x)
# Or, alternatively:
# out_deq = out_q / scale_W / scale_x
# out_deq = absmax_dequantize(out_q, scale_W * scale_x)
print(out_deq)
print(out)

tensor([0.4903, 1.3383, 0.2030])
tensor([0.4829, 1.3345, 0.2121])


Remember, our goal was `out`. How much did we lose by quantizing and dequantizing?

In [15]:
(out_deq - out).abs().mean()

tensor(0.0068)

## 2. Minmax Quantization

In minmax quantization, we use the minimum and maximum values in the matrix $W$ and the vector $x$ to define the range. In this way, we get a range that is as tight as possible around the values we are quantizing. This will, however, change the zero value, which will not be mapped to 0 anymore.



In [16]:
def minmax_quantize(W):
    # the following notations come from:
    # (1) scaling W to [0,1] ==> W' = (W - min(W)) / (max(W) - min(W)),
    # (2) scaling W' to [-128, 127] ==> W_q = W' * 255 - 128
    # by combining the two, we get that:
    # W_q = W * scale + offset
    # we will call the offset "zero_point", as it represent the value that maps to 0
    delta = W.max() - W.min()
    scale = 255 / delta
    zero_point = -(128*W.max() + 127*W.min()) / delta
    W_q = (W * scale + zero_point).round().to(torch.int8)
    return W_q, scale, zero_point

def minmax_dequantize(W_q, scale, zero_point):
    return (W_q.float() - zero_point) / scale

In [17]:
W_q, scale_W, zero_point_W = minmax_quantize(W)
x_q, scale_x, zero_point_x = minmax_quantize(x)

Let's see the results for $W$ (same considerations will apply for $x$).

In [18]:
print(W_q)
print(W)
W_deq = minmax_dequantize(W_q, scale_W, zero_point_W)
print(W_deq)
print((W - W_deq).abs().mean())
print("zero point", zero_point_W)

tensor([[  41,   19, -128,   86,  -43],
        [ -67,   73,  107,  -15,   10],
        [  -5,   56,  -25,  127,  -42]], dtype=torch.int8)
tensor([[ 0.0000, -0.2934, -2.1788,  0.5684, -1.0845],
        [-1.3986,  0.4033,  0.8380, -0.7193, -0.4033],
        [-0.5966,  0.1820, -0.8567,  1.1006, -1.0712]])
tensor([[-0.0054, -0.2883, -2.1788,  0.5733, -1.0857],
        [-1.3943,  0.4061,  0.8434, -0.7256, -0.4041],
        [-0.5970,  0.1875, -0.8542,  1.1006, -1.0728]])
tensor(0.0031)
zero point tensor(41.4189)


First, notice that 0 no longer maps to 0! Indeed, it maps to zero_point_W (after rounding). This implies that the dequantization of 0 will no longer be 0. This may be a problem!

But, notice that we are using the full range of the int8 values. This means that we are not wasting any bits of the range! (the minimum value is -128, the maximum value is 127). This can also be seen in the average absolute error, which is lower than what we had with absmax.

Similarly to what we did before, let's compute the output of the operation, and then dequantize it!

In [19]:
out_q = W_q.to(torch.int16) @ x_q.to(torch.int16)

The dequantification is a bit trickier, in this case. Can you figure out why we need the following operations?

Hint: consider the transformation we are applying to each value (value * scale + zero_point). What happens when we compute the dot product?

In [20]:
out_deq = (out_q - W.shape[1] * zero_point_W * zero_point_x - W.sum(axis=1) * scale_W * zero_point_x - x.sum() * scale_x * zero_point_W) / (scale_W * scale_x)

In [21]:
print(out_deq)
print(out)
print((out_deq - out).abs().mean())

tensor([0.4812, 1.3489, 0.2222])
tensor([0.4829, 1.3345, 0.2121])
tensor(0.0087)


<span style="color:red">Extra stuff!</span>

We could have computed scales and zero points at different granularities (e.g., for each row, or column of $W$). How would that have changed the results? What changes would we have to do to the code?

# Dynamic quantization

In this second part, we will apply dynamic quantization by using PyTorch or HuggingFace (with BitsAndBytes). We will quantize both to 8 and to 4 bits, and we will see how that affects LLMs (in terms of memory and speed). 

In [22]:
import torch
import os
import time 
from transformers import AutoTokenizer, AutoModelForCausalLM

  from .autonotebook import tqdm as notebook_tqdm


In [23]:
from huggingface_hub import login

# Login to the Hugging Face model hub to be able to upload models
with open("../hf_token.txt", "r") as f:
    token = f.read()
    f.close()

login(token=token)

First, let's load our model (Llama 3.2 1B) and let's see some base statistics (memory usage, inference time).

In [24]:
model_id = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_id)  
tokenizer = AutoTokenizer.from_pretrained(model_id) 

tokenizer.pad_token = tokenizer.eos_token

In [25]:
def get_model_size(model):
    """Get the size of the model in MB"""
    torch.save(model.state_dict(), "temp.pth")
    size = os.path.getsize("temp.pth") / 1e6  # size in "MB" (technically, it should be 1024**2, but we approximate to 1e6 to get an easier conversion #params <=> MB)
    os.remove("temp.pth")
    return size

print(f"Model size before quantization {(get_model_size(model)):.2f} MB")

Model size before quantization 4943.31 MB


Wait, wasn't Llama 1B supposed to be 4GB (4 bytes * 1B parameters)? Why do we get ~ 5 GB (i.e., 1.25B parameters)?

We are not considering the parameters used in the embedding layer (you can count how many parameters you have in the embedding layer and see that it matches the difference). 

Additionally, the count does not include the `lm_head`, i.e. the layer used to go from the hidden states to the logits. This is because in Llama (and other models) the `lm_head` is shared with the embedding layer, so it is not counted twice.

In [26]:
text = "The secret of life is"
# Notice we use a batch of 20 sentences -- we will get better results
# on quantized models when processing a batch of inputs
inputs = tokenizer([text]*20, return_tensors="pt")

tic = time.time()

with torch.no_grad():
    baseline_output = model.generate(inputs['input_ids'], max_length=100)

elapsed_time = time.time() - tic

baseline_decoded = tokenizer.decode(baseline_output[0], skip_special_tokens=True)

print("\nBaseline model output:", baseline_decoded)
print("\nTime taken for baseline model:", elapsed_time)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.



Baseline model output: The secret of life is to live it.
The secret of life is to live it.
I have always been fascinated by the idea of reincarnation, and I have always wondered why people believe in it. I have read many books on the subject, and I have also interviewed several people who have experienced it. The answer I have found is that people believe in reincarnation because they believe that it is the only way to explain the mysteries of life. The idea that we are all one and that

Time taken for baseline model: 32.85346055030823


Dynamic quantization applies lower precision to model weights and activations at runtime. This method doesn’t require modifications to the model architecture or retraining, which makes it relatively easy to apply.

- **Advantages:** 
  - Quick to implement with minimal changes. No calibration step is needed.

- **Limitations:** 
  - Activations are not pre-quantized, meaning some precision is maintained but at the cost of slightly higher resource use at inference time.

We can use the `quantize_dynamic()` function, available in PyTorch, to apply dynamic quantization to a model.

We can specify a set of layer types to be quantize. Let's stick with Linear layers. We specify the desired type (represented by torch.qint8) , and off we go!

In [27]:

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
).to('cpu')

# Model size after quantization
print(f"Model size after quantization {(get_model_size(quantized_model)):.2f} MB")

For migrations of users: 
1. Eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic), please migrate to use torchao eager mode quantize_ API instead 
2. FX graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx,torch.ao.quantization.quantize_fx.convert_fx, please migrate to use torchao pt2e quantization API instead (prepare_pt2e, convert_pt2e) 
3. pt2e quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e) 
see https://github.com/pytorch/ao/issues/2259 for more details
  quantized_model = torch.quantization.quantize_dynamic(


Model size after quantization 2286.83 MB


Okay -- 2.3GB? Why not 5GB / 4 = 1.25GB? After all, we are going from float32 to int8. 

That's correct -- technically. Except, we are only encoding linear layers, and not the embedding layer. That means that, of the original 1.25B parameters, we are only quantizing 1B. The rest, in the embedding layer, is kept as float32.

If you run the numbers, though, you should still find a problem: 1B * 1 byte + 0.25B * 4 bytes = 2GB. What about the rest? There's one more thing: remember, the `lm_head` was shared with the Embedding layer. However, since it is "copied" into a linear layer in Llama, the quantization process will quantize it as well. So that's an extra 0.25B parameters encoded as int8 -- hence 2.3GB.

Finally, we could technically also quantize the embeddings (it has been introduced in later versions of PyTorch), but for simplicity we will not do it here (it would require some additional steps).

In [28]:
quantized_model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): DynamicQuantizedLinear(in_features=2048, out_features=2048, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (k_proj): DynamicQuantizedLinear(in_features=2048, out_features=512, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (v_proj): DynamicQuantizedLinear(in_features=2048, out_features=512, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (o_proj): DynamicQuantizedLinear(in_features=2048, out_features=2048, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
        )
        (mlp): LlamaMLP(
          (gate_proj): DynamicQuantizedLinear(in_features=2048, out_features=8192, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (up_proj): DynamicQuantizedLinear(in_features=2048, out_features=8192, dtype=torch.qint8, qscheme=torch.per_

In [29]:
tic = time.time()

with torch.no_grad():
    output = quantized_model.generate(inputs['input_ids'], max_length=100)

elapsed_time = time.time() - tic

output_decoded = tokenizer.decode(baseline_output[0], skip_special_tokens=True)

print("\nQuantized model output:", output_decoded)
print("\nTime taken for baseline model:", elapsed_time)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Quantized model output: The secret of life is to live it.
The secret of life is to live it.
I have always been fascinated by the idea of reincarnation, and I have always wondered why people believe in it. I have read many books on the subject, and I have also interviewed several people who have experienced it. The answer I have found is that people believe in reincarnation because they believe that it is the only way to explain the mysteries of life. The idea that we are all one and that

Time taken for baseline model: 12.075728178024292


Hugging Face provides several built-in quantization options, each suited to different model and deployment needs:
https://huggingface.co/docs/transformers/v4.46.0/quantization/overview

For this lab, we will use `Quanto`.

In [34]:
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig
import torch
    
model_id = "meta-llama/Llama-3.2-1B"

# Quantize to 4-bit weights
quant = QuantoConfig(weights="int4")

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant,
    device_map="auto",           # moves to your GPU
    dtype=torch.bfloat16,  # or torch.float16 if your GPU prefers it
)

print(f"Model size after quantization: {get_model_size(quantized_model)} MB")

Model size after quantization: 2286.825316 MB


In [35]:
tic = time.time()

with torch.no_grad():
    output = quantized_model.generate(inputs['input_ids'], max_length=100)

elapsed_time = time.time() - tic

output_decoded = tokenizer.decode(baseline_output[0], skip_special_tokens=True)

print("\nquantized model output:", output_decoded)
print("\nTime taken for baseline model:", elapsed_time)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



quantized model output: The secret of life is to live it.
The secret of life is to live it.
I have always been fascinated by the idea of reincarnation, and I have always wondered why people believe in it. I have read many books on the subject, and I have also interviewed several people who have experienced it. The answer I have found is that people believe in reincarnation because they believe that it is the only way to explain the mysteries of life. The idea that we are all one and that

Time taken for baseline model: 11.96183180809021
