# PyTorch quantization

PyTorch quantization optimizes deep learning models by reducing their numerical precision, typically from 32-bit floating point to 8-bit integers, which lowers memory usage and can improve inference speed. PyTorch supports several methods, including **static and dynamic quantization**.

Static quantization quantizes both weights and activations before deployment and usually requires calibration with representative data to fix the activation ranges. Dynamic quantization, in contrast, quantizes the model's weights ahead of time and quantizes activations on the fly at inference time, based on the range observed in each input; between operations the activations stay in floating-point format. Developers can choose the strategy that fits their use case, which is particularly useful in resource-constrained environments such as mobile devices and edge computing platforms.
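Both methods rely on the same basic mapping between floating-point values and 8-bit integers. The sketch below illustrates that mapping in plain Python, assuming a simple affine (scale and zero-point) scheme. The helper names `choose_qparams`, `quantize`, and `dequantize` are made up for this illustration, and PyTorch's actual kernels differ in detail.

```python
# Illustrative affine 8-bit quantization of a list of floats.
# A sketch of the general idea only, not PyTorch's internal implementation.

def choose_qparams(values, qmin=-128, qmax=127):
    """Pick a scale and zero point that map [min, max] onto [qmin, qmax]."""
    lo = min(min(values), 0.0)  # include 0 so that 0.0 is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers, clamping to the representable range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(qvalues, scale, zero_point):
    """Map integers back to (approximate) floats."""
    return [(q - zero_point) * scale for q in qvalues]

weights = [-1.0, -0.25, 0.0, 0.5, 1.5]
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("restored: ", restored)
print("max round-trip error:", max_error)
```

The round-trip error is bounded by about half the scale, which is where the accuracy loss of quantization comes from. In static quantization the activation range is fixed during calibration; in dynamic quantization it is recomputed from each input at run time.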

## PyTorch dynamic quantization

* Some models show a marked reduction in accuracy after quantization
* INT4 is not supported by this method
* The model is downloaded to the local cache on first use (this may take a few minutes)

### 1. Load the model and generate text

In [1]:
import os  
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a pre-trained model and tokenizer.
# OPT-125M is used here; GPT-2 is left commented out as an alternative.
# model_name = "openai-community/gpt2"
model_name = "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=False)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Test the model before quantization
text = "Once upon a time,"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)

# Print model output
print("Original Model Output:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


Original Model Output:
Once upon a time, I was a student at the University of California,


### 2. Apply dynamic quantization

In [2]:
# Apply dynamic quantization to the model
quantized_model = torch.quantization.quantize_dynamic(
    model,  # the model to quantize
    {torch.nn.Linear},  # layers to quantize (focusing on Linear layers)
    dtype=torch.qint8
)

# Test the quantized model's output
outputs_quantized = quantized_model.generate(**inputs, max_new_tokens=10)

# Print the output
print("\nQuantized Model Output:")
print(tokenizer.decode(outputs_quantized[0], skip_special_tokens=True))


Quantized Model Output:
Once upon a time, I was a student at the University of California,
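Identical generated text does not mean the two models compute identical numbers. One way to see the difference is to compare raw outputs directly. The sketch below does this on a small stand-in network rather than on the language model, so it runs without a download; the layer sizes and the random input are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a real network: two Linear layers with a ReLU.
float_model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
).eval()

# Dynamic quantization of the Linear layers; returns a converted copy by default.
int8_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(8, 64)
with torch.no_grad():
    ref = float_model(x)
    approx = int8_model(x)

# Largest element-wise deviation, and how often the top prediction agrees.
max_abs_diff = (ref - approx).abs().max().item()
same_argmax = (ref.argmax(dim=-1) == approx.argmax(dim=-1)).float().mean().item()

print(f"max abs difference: {max_abs_diff:.4f}")
print(f"fraction of rows with the same argmax: {same_argmax:.2f}")
```

For a language model, the same comparison could be applied to the logits of a forward pass. Small deviations can leave short greedy generations unchanged while still affecting longer outputs or aggregate accuracy metrics.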


### 3. Compare the sizes of the original and quantized models

In [3]:
# Print the serialized size of a model.
# The model's state_dict is saved to the file system and the file size is read back.
# Note that this only gives an idea of relative sizes,
# not the exact memory footprints.
def print_size_of_model(model, model_name="model"):
    path = f"{model_name}.pt"
    torch.save(model.state_dict(), path)
    size_mb = os.path.getsize(path) / 1e6
    print(f"\nModel size of {model_name}: {size_mb:.2f} MB")

# Compare sizes of original and quantized models
print_size_of_model(model, "Original_Model")
print_size_of_model(quantized_model, "Quantized_Model")

# Clean up saved files after checking size
os.remove("Original_Model.pt")
os.remove("Quantized_Model.pt")


Model size of Original_Model: 501.03 MB

Model size of Quantized_Model: 284.89 MB
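The quantized file is smaller, but by less than the factor of four that moving from 32-bit to 8-bit values might suggest. A likely reason is that only `torch.nn.Linear` modules were quantized, so other parameters such as embedding tables remain in float32. The sketch below shows the effect on a toy model; `TinyLM` and `serialized_mb` are hypothetical names introduced for this illustration, and the size is measured in an in-memory buffer instead of a file.

```python
import io

import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """Hypothetical toy model: an embedding table followed by one Linear layer."""

    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        return self.proj(self.embed(token_ids))


def serialized_mb(model):
    """Size of the saved state_dict in MB, measured in an in-memory buffer."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6


float_model = TinyLM().eval()
int8_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

# Only the Linear layer is swapped for a quantized module;
# the Embedding keeps its float32 weights.
print("embed:", type(int8_model.embed).__name__, int8_model.embed.weight.dtype)
print("proj: ", type(int8_model.proj).__module__)

float_mb = serialized_mb(float_model)
int8_mb = serialized_mb(int8_model)
print(f"float: {float_mb:.3f} MB, int8: {int8_mb:.3f} MB")
```

Because the embedding table dominates this toy model, the overall saving is small even though the Linear layer itself shrinks substantially.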
