# Quantization

Quantization represents model weights with fewer bits. This reduces memory usage and can speed up LLM inference, at the cost of some loss in accuracy.

By default, weights are stored as 32-bit floating point numbers (FP32). Quantization approximates these values using lower-precision formats such as 16-bit floats (FP16), 8-bit integers (INT8), or 4-bit integers (INT4). Fewer bits mean less precise weight values, which introduces approximation errors and can slightly degrade model performance.
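
To make the approximation error concrete, here is a minimal sketch that rounds a few values to FP16 and prints the difference. It uses only Python's standard `struct` module; the helper name `round_to_fp16` and the sample values are illustrative assumptions, not part of this notebook's models.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The "e" format packs a number as a 16-bit float. Unpacking returns it as
    # a regular Python float, which makes the rounding error visible.
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Illustrative values, not taken from a real model.
weights = [0.1, -0.7312, 0.00042, 1.5]

for w in weights:
    w16 = round_to_fp16(w)
    print(f"{w!r:>10} -> {w16!r:<24} abs error = {abs(w - w16):.2e}")
```

For example, `0.1` is stored as `0.0999755859375` in FP16, while `1.5` is represented exactly. Values far outside the FP16 range would overflow, which this sketch does not handle.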


In [8]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


#### Original weight dtype:

In [34]:
model_name = "distilgpt2"

model1 = AutoModelForCausalLM.from_pretrained(model_name)

for name, param in model1.named_parameters():
    print(param.dtype)
    break

print(model1.get_memory_footprint())

torch.float32
333941784


#### FP16:

You can load the weights in a different **float** type by passing `dtype`, as follows:

In [29]:
model2 = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.float16  # load the weights in half precision (FP16)
)

for name, param in model2.named_parameters():
    print(param.dtype)
    break

print(model2.get_memory_footprint())

torch.float16
170116620


#### 8-bit int:

We can't simply change the dtype as above, because casting real-valued weights directly to integers would discard their fractional part and round most of them to zero. Instead, we pass a `quantization_config` built with BitsAndBytes, which stores weights as INT8 along with scale factors that map them back to floating-point values during computation.
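
The idea of integer values plus a scale factor can be sketched in a few lines of plain Python. This is a simplified absmax scheme with one scale for the whole list, shown only to illustrate the principle; it is not the algorithm bitsandbytes actually implements, and the function names and sample weights are assumptions made for this example.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    # Assumes at least one non-zero weight, otherwise the scale would be 0.
    scale = max(abs(w) for w in weights) / 127
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Illustrative values, not taken from a real model.
weights = [0.12, -0.5, 0.031, 0.27]

quantized, scale = quantize_absmax(weights)
restored = dequantize(quantized, scale)

for w, q, r in zip(weights, quantized, restored):
    print(f"{w:+.4f} -> {q:+4d} -> {r:+.4f} (error {abs(w - r):.5f})")
```

Each restored value differs from the original by at most half of the scale factor, which is the rounding error this kind of quantization introduces.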


In [47]:
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model3 = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config # add quantization config here
)

print(model3.get_memory_footprint())

127649292


The memory footprint shrinks from about 334 MB in FP32 to about 170 MB in FP16 and about 128 MB in INT8.
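
As a quick check, the relative savings can be computed from the three `get_memory_footprint()` values printed above. The byte counts are copied from this notebook's output; the dictionary and variable names are only for illustration.

```python
# Byte counts reported by get_memory_footprint() in the cells above.
footprints = {
    "FP32": 333_941_784,
    "FP16": 170_116_620,
    "INT8": 127_649_292,
}

baseline = footprints["FP32"]

for label, nbytes in footprints.items():
    ratio = nbytes / baseline
    print(f"{label}: {nbytes / 1e6:6.1f} MB ({ratio:.0%} of FP32)")
```

FP16 comes out at roughly 51% of the FP32 footprint and INT8 at roughly 38%. The INT8 figure is higher than the 25% that the bit width alone would suggest. A plausible reason is that only some layers are stored in 8 bits while others, such as the embeddings, stay in a higher precision, but that is an assumption that was not verified in this notebook.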

#### Small example and comparison:

Let's see how the FP32 and INT8 models behave with an example prompt:

In [45]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to(model1.device)

out = model1.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time of war, the United States was the only country in the world to have a military presence. The United States was the only country in the world to


In [46]:
inputs = tokenizer(prompt, return_tensors="pt").to(model3.device)

out = model3.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time of war, the United States was the only country in the world to have a military presence. The United States was the only country in the world to


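The responses produced by the FP32 and INT8 models are identical for this example prompt. Generation here is greedy, meaning the most likely token is picked at each step, so the small numerical differences introduced by INT8 quantization change the output only when they change which token ranks first. For short prompts, that often does not happen.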

In addition, the model used here (DistilGPT-2) is very small by modern standards. In larger models, or with longer and more complex prompts, small differences caused by quantization are more likely to become noticeable.
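
The following toy sketch illustrates why small numerical differences often leave the generated text unchanged, and when they do not. The logits and the noise values are made up for illustration and do not come from either model; the noise stands in for quantization error.

```python
def argmax(values):
    """Return the index of the largest value (the token greedy decoding picks)."""
    return max(range(len(values)), key=lambda i: values[i])


def perturb(logits, noise):
    """Add a small per-token error to the logits."""
    return [l + n for l, n in zip(logits, noise)]


# Case 1: one token clearly leads, so small errors do not change the choice.
clear_logits = [2.10, 0.35, -1.20, 1.95]
clear_noisy = perturb(clear_logits, [0.03, -0.02, 0.01, -0.04])
print(argmax(clear_logits), argmax(clear_noisy))  # same index in both cases

# Case 2: two tokens are nearly tied, so the same size of error flips the choice.
close_logits = [2.10, 2.08, -1.20, 0.35]
close_noisy = perturb(close_logits, [-0.03, 0.02, 0.0, 0.0])
print(argmax(close_logits), argmax(close_noisy))  # different indices
```

Once a single token differs, every later step is conditioned on a different prefix, which is one reason longer generations are more likely to diverge.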