# 🤖 GPT-2 Quantization Demo: FP32 vs. INT8

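In this notebook, we demonstrate **Dynamic 8-bit Quantization** on the GPT-2 model using PyTorch. Dynamic quantization is among the simplest ways to reduce model size and speed up CPU inference, usually with little impact on accuracy. It needs no retraining or calibration data: weights are converted to INT8 ahead of time, and activations are quantized on the fly at inference.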

In [None]:
import os
import torch
import time
from transformers import GPT2LMHeadModel, GPT2Tokenizer

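def print_size_of_model(model):
    """Save the model's state_dict to a temporary file and print its size in MB."""
    tmp_path = "temp.p"
    torch.save(model.state_dict(), tmp_path)
    size = os.path.getsize(tmp_path) / (1024 * 1024)
    print(f"Size (MB): {size:.2f}")
    os.remove(tmp_path)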

# 1. Load Model & Tokenizer
model_id = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model_fp32 = GPT2LMHeadModel.from_pretrained(model_id)

# Prepare for CPU inference
model_fp32.to('cpu')
model_fp32.eval()

## 🛠️ Step 1: Baseline (FP32)
Let's check the size and performance of the original full-precision model.

In [None]:
print_size_of_model(model_fp32)
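
As a rough cross-check on the printed baseline, the FP32 size can be estimated from the parameter count alone. The sketch below uses the published dimensions of the `gpt2` (small) checkpoint; the helper names `gpt2_param_count` and `size_mib` are made up for this illustration. The file written by `torch.save` will differ somewhat because of serialization overhead and any saved buffers.

```python
# Published dimensions of the "gpt2" (small) checkpoint.
VOCAB_SIZE = 50257
N_POSITIONS = 1024
N_EMBD = 768
N_LAYER = 12


def gpt2_param_count(vocab=VOCAB_SIZE, n_pos=N_POSITIONS, d=N_EMBD, n_layer=N_LAYER):
    """Count unique parameters (the output head shares the token embedding)."""
    embeddings = vocab * d + n_pos * d               # token + position embeddings
    layer_norm = 2 * d                               # weight + bias
    attention = (d * 3 * d + 3 * d) + (d * d + d)    # c_attn + c_proj
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)      # c_fc + c_proj
    block = 2 * layer_norm + attention + mlp
    return embeddings + n_layer * block + layer_norm  # plus the final LayerNorm


def size_mib(n_params, bytes_per_param):
    """Size in MiB of n_params values stored at bytes_per_param each."""
    return n_params * bytes_per_param / (1024 * 1024)


n_params = gpt2_param_count()
print(f"Parameters: {n_params:,}")
print(f"FP32 estimate (MiB): {size_mib(n_params, 4):.1f}")
```

Calling `size_mib(n_params, 1)` gives the lower bound that would apply if every weight were stored as a single byte, which is useful context when reading the quantized size later.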

## ⚡ Step 2: Dynamic Quantization
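We use PyTorch's `quantize_dynamic` to convert the weights of `nn.Linear` layers to `INT8`. Only modules that are instances of `nn.Linear` are converted. Hugging Face's GPT-2 implements its attention and MLP projections with a custom `Conv1D` module, so with this setting those layers stay in FP32.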
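
To make "convert to INT8" concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is a simplified illustration, not a reproduction of what `quantize_dynamic` does internally: PyTorch uses its own observers and packed weight formats, and it also quantizes activations at run time. The function names are hypothetical.

```python
def quantize_symmetric_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)

# Each restored value is within half a quantization step of the original.
for w, v, r in zip(weights, q, restored):
    print(f"{w:+.4f} -> {v:+4d} -> {r:+.4f}")
```

Each weight now fits in one byte plus a single shared `scale`, at the cost of a rounding error of at most `scale / 2` per value.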

In [None]:
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,  # original model
    {torch.nn.Linear},  # layers to quantize
    dtype=torch.qint8  # target precision
)

print_size_of_model(model_int8)
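
Because `quantize_dynamic` only touches the module types it is given, it helps to check which layer types a model actually contains. The sketch below counts modules by class name on a tiny, randomly initialised GPT-2 built from `GPT2Config`, so it runs without downloading weights. The helper `count_module_types` and the tiny configuration values are assumptions for illustration only.

```python
from collections import Counter

from transformers import GPT2Config, GPT2LMHeadModel


def count_module_types(model):
    """Count the model's modules by class name."""
    return Counter(type(m).__name__ for m in model.modules())


# Tiny, randomly initialised GPT-2 so the check runs without a download.
# These sizes are arbitrary; only the layer types matter here.
tiny_config = GPT2Config(n_layer=2, n_embd=64, n_head=2, vocab_size=100)
tiny_model = GPT2LMHeadModel(tiny_config)

counts = count_module_types(tiny_model)
print("nn.Linear modules:", counts["Linear"])
print("Conv1D modules:", counts["Conv1D"])
```

Running `count_module_types(model_fp32)` in this notebook reports the same breakdown for the real checkpoint, which shows how many layers the `{torch.nn.Linear}` filter can match.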

## 🏎️ Step 3: Performance Benchmark
We'll generate a short sequence and measure the time taken.

In [None]:
input_text = "Quantization is an essential technique for"
inputs = tokenizer(input_text, return_tensors="pt")

def benchmark_inference(model, inputs, name):
    start_time = time.perf_counter()
    # Generate 30 new tokens; greedy decoding keeps both runs deterministic and comparable
    output = model.generate(
        **inputs,
        max_new_tokens=30,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
    )
    end_time = time.perf_counter()

    duration = end_time - start_time
    text = tokenizer.decode(output[0], skip_special_tokens=True)

    print(f"[{name}] Duration: {duration:.4f}s")
    print(f"[{name}] Output: {text}\n")
    return duration

latency_fp32 = benchmark_inference(model_fp32, inputs, "FP32")
latency_int8 = benchmark_inference(model_int8, inputs, "INT8")

speedup = latency_fp32 / latency_int8
print(f"Speedup: {speedup:.2f}x")
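
A single timed call is sensitive to warm-up effects and background load, so the speedup above can vary from run to run. One way to get a steadier number, sketched here with a hypothetical helper `time_callable`, is to discard a warm-up run and report the median of several repeats. The stand-in workload only exists so the sketch runs on its own.

```python
import statistics
import time


def time_callable(fn, warmup=1, repeats=5):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload so the sketch runs on its own.
median_s = time_callable(lambda: sum(range(100_000)))
print(f"Median over 5 runs: {median_s:.6f}s")
```

In this notebook the same helper could wrap generation, for example `time_callable(lambda: model_fp32.generate(**inputs, max_new_tokens=30, do_sample=False))`, and likewise for `model_int8`.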

## 📊 Conclusion
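What to read from the results above:
1. **Model size:** compare the two `Size (MB)` lines. The FP32 baseline is about 500 MB. Each converted layer stores its weights in 1 byte instead of 4, so the saving scales with the share of weights that sit in converted layers.
2. **Latency:** inference on CPU is typically faster with the quantized model. The `Speedup` value quantifies it for this run, but a single timed call is noisy, so it is worth repeating the measurement.
3. **Output quality:** the generated text remains coherent, which suggests the precision loss is manageable for this prompt.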