# 🧪 LLM.int8(): Outlier-aware Weight Quantization (2022)

[!["Open In Colab"](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adiel2012/model-size-reduction/blob/main/chronology/llm_int8_demo.ipynb)

LLM.int8() was one of the first methods to quantize models with up to 175B parameters to 8-bit without measurable loss in perplexity. It achieves this by identifying "outlier" feature dimensions in the activations (a small number of dimensions with unusually large magnitudes) and multiplying them in FP16, while using INT8 matrix multiplication for everything else.
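
The split between FP16 outliers and INT8 "everything else" can be sketched in a few lines of NumPy. This is a minimal CPU simulation under stated assumptions, not the actual LLM.int8() kernels: the helper names `absmax_quantize` and `mixed_precision_matmul` are hypothetical, the outlier column is injected by hand, and the threshold of 6.0 follows the value reported in the paper.

```python
import numpy as np

def absmax_quantize(a, axis):
    """Symmetric INT8 quantization with one scale per vector along `axis`."""
    scale = np.max(np.abs(a), axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(a / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(x, w):
    """Quantize rows of x and columns of w to INT8, multiply, then rescale."""
    xq, sx = absmax_quantize(x, axis=1)  # one scale per row of x
    wq, sw = absmax_quantize(w, axis=0)  # one scale per column of w
    return (xq.astype(np.int32) @ wq.astype(np.int32)) * (sx * sw)

def mixed_precision_matmul(x, w, threshold=6.0):
    """Approximate x @ w, keeping outlier feature columns of x in FP16."""
    # A feature dimension is an outlier if any value in it reaches the threshold.
    outlier = np.any(np.abs(x) >= threshold, axis=0)
    # Outlier part: round inputs to FP16 (accumulation simulated in FP32).
    x_hi = x[:, outlier].astype(np.float16).astype(np.float32)
    w_hi = w[outlier, :].astype(np.float16).astype(np.float32)
    # Regular part: plain INT8 path on the remaining dimensions.
    return x_hi @ w_hi + int8_matmul(x[:, ~outlier], w[~outlier, :]), outlier

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16)).astype(np.float32)
w = rng.normal(size=(16, 8)).astype(np.float32)
x[:, 3] = [25.0, -30.0, 40.0, -18.0]  # hand-injected outlier feature dimension

y_ref = x @ w
y_mixed, outlier = mixed_precision_matmul(x, w)
err_mixed = np.abs(y_mixed - y_ref).max()
err_int8 = np.abs(int8_matmul(x, w) - y_ref).max()
print(f"outlier dims: {np.flatnonzero(outlier)}")
print(f"max error, INT8 only:       {err_int8:.4f}")
print(f"max error, mixed precision: {err_mixed:.4f}")
```

With the outlier column left in the INT8 path, each row's scale is set by the large value, so the ordinary values in that row lose most of their resolution. Routing that one column through FP16 lets the remaining dimensions use a much finer scale.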

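In this notebook, we use PyTorch's `quantize_dynamic` as a simple CPU stand-in to demonstrate the impact of 8-bit quantization on GPT-2's size and inference speed. Note that `quantize_dynamic` applies plain dynamic INT8 quantization and does not perform the outlier decomposition described above.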

In [None]:
import os
import torch
import time
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def print_size_of_model(model):
    """Serialize the state_dict to a temporary file and print its size."""
    torch.save(model.state_dict(), "temp.p")
    size = os.path.getsize("temp.p") / (1024 * 1024)
    print(f"Size (MB): {size:.2f}")
    os.remove("temp.p")

# 1. Load Model & Tokenizer
model_id = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model_fp32 = GPT2LMHeadModel.from_pretrained(model_id)
model_fp32.to('cpu').eval()

## 🛠️ Performance Comparison (FP32 vs INT8)
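We use dynamic quantization to reduce the memory footprint: the weights of the selected layer types are stored as INT8 ahead of time, and activations are quantized on the fly at inference time.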
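
To make the storage saving concrete, here is a minimal NumPy sketch of the idea for a single weight matrix. It assumes a per-tensor symmetric (absmax) scheme for both weights and activations purely for illustration; PyTorch's actual kernels may differ in detail, and the matrix shape and random values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
w_fp32 = rng.normal(scale=0.02, size=(768, 3072)).astype(np.float32)

# Ahead of time: store the weights as INT8 plus a single FP scale.
w_scale = float(np.abs(w_fp32).max()) / 127.0
w_int8 = np.clip(np.round(w_fp32 / w_scale), -127, 127).astype(np.int8)

mib = 1024 * 1024
print(f"FP32: {w_fp32.nbytes / mib:.2f} MiB, INT8: {w_int8.nbytes / mib:.2f} MiB")
# FP32: 9.00 MiB, INT8: 2.25 MiB

# Round-trip error is bounded by half a quantization step.
roundtrip_err = float(np.abs(w_int8.astype(np.float32) * w_scale - w_fp32).max())

# At inference time: the activation scale is computed from the actual input
# ("dynamic"), then the product is rescaled back to floating point.
x = rng.normal(size=(2, 768)).astype(np.float32)
x_scale = float(np.abs(x).max()) / 127.0
x_int8 = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)
y = (x_int8.astype(np.int32) @ w_int8.astype(np.int32)) * (x_scale * w_scale)
y_ref = x @ w_fp32
print(f"max output error: {np.abs(y - y_ref).max():.4f}")
```

Each weight drops from 4 bytes to 1, so a converted layer shrinks by roughly 4x. Layers that are not converted (for example embeddings) keep their FP32 size, which is why the whole-model ratio is usually smaller.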

In [None]:
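import copy
from transformers.pytorch_utils import Conv1D  # path may differ in older transformers

def conv1d_to_linear(module):
    # GPT-2's attention/MLP projections are transformers Conv1D layers, which
    # quantize_dynamic skips; swap them for equivalent nn.Linear layers.
    for name, child in module.named_children():
        if isinstance(child, Conv1D):
            linear = torch.nn.Linear(*child.weight.shape)  # Conv1D weight is (in, out)
            linear.weight.data = child.weight.data.t().contiguous()
            linear.bias.data = child.bias.data.clone()
            setattr(module, name, linear)
        else:
            conv1d_to_linear(child)

model_linear = copy.deepcopy(model_fp32)  # leave the FP32 baseline untouched
conv1d_to_linear(model_linear)
model_int8 = torch.quantization.quantize_dynamic(
    model_linear, {torch.nn.Linear}, dtype=torch.qint8
)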

print("--- Model Sizes ---")
print("Original FP32:", end=" ")
print_size_of_model(model_fp32)
print("Quantized INT8:", end=" ")
print_size_of_model(model_int8)

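def benchmark_inference(model, input_text="Model quantization is", n_runs=3):
    inputs = tokenizer(input_text, return_tensors="pt")
    gen_kwargs = dict(max_length=20, do_sample=False, pad_token_id=tokenizer.eos_token_id)
    with torch.no_grad():
        model.generate(**inputs, **gen_kwargs)  # warm-up, excluded from timing
        start = time.perf_counter()
        for _ in range(n_runs):
            model.generate(**inputs, **gen_kwargs)
    return (time.perf_counter() - start) / n_runs  # mean seconds per call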

lat_32 = benchmark_inference(model_fp32)
lat_8 = benchmark_inference(model_int8)
print(f"\nFP32 Latency: {lat_32:.4f}s")
print(f"INT8 Latency: {lat_8:.4f}s")
print(f"Speedup: {lat_32/lat_8:.2f}x")