# 🧪 Unified Model Compression Experimentation Framework

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adiel2012/model-quantization/blob/main/experiment_framework.ipynb)

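This notebook provides a modular framework for comparing model compression techniques (quantization, distillation, pruning) using **GPT-2** as the base model. The cells below implement dynamic quantization and unstructured pruning; distillation is not implemented here and can be added the same way.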

### Why use this framework?
1. **Modular**: Easily add new algorithms by wrapping them in a standard class.
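2. **Controlled**: Benchmarks every model with the same prompt and the same number of generated tokens. Models share one device, except the dynamically quantized model, which runs on CPU.
3. **Comprehensive**: Reports size, latency, and throughput together, and keeps a sample generation for a qualitative check.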

In [None]:
!pip install transformers datasets torch bitsandbytes accelerate -q

In [None]:
import os
import torch
import time
import pandas as pd
import matplotlib.pyplot as plt
from transformers import GPT2LMHeadModel, GPT2Tokenizer, BitsAndBytesConfig
import torch.nn.utils.prune as prune

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

## 🛠️ Utility Functions
Standardized ways to measure performance.
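
A single timed call can be noisy, especially on shared Colab hardware. The sketch below shows one way to repeat a measurement and report the median. `time_repeated` and its parameters are hypothetical names that are not part of this notebook, and the workload is a stand-in for a `generate` call.

```python
import statistics
import time

def time_repeated(fn, warmup=1, repeats=5):
    """Call fn() `warmup` times untimed, then `repeats` times timed.

    Returns the median and the range of the timed runs, in seconds.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
        "runs": len(samples),
    }

# Stand-in workload; in the notebook this would wrap a model.generate(...) call.
stats = time_repeated(lambda: sum(i * i for i in range(50_000)), warmup=1, repeats=5)
print(stats)
```

On a GPU, the timed function should also wait for outstanding kernels (for example with `torch.cuda.synchronize()`), otherwise the timer can stop before the work has finished.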

In [None]:
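def get_model_size(model):
    """Return the in-memory size of the model's parameters and buffers in MiB.

    Caveats: packed weights of dynamically quantized modules are not exposed
    through parameters(), so they are not counted here; modules pruned with
    torch.nn.utils.prune carry an extra `weight_mask` buffer, which is counted.
    """
    param_size = sum(p.nelement() * p.element_size() for p in model.parameters())
    buffer_size = sum(b.nelement() * b.element_size() for b in model.buffers())
    return (param_size + buffer_size) / 1024**2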

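def benchmark_model(model, tokenizer, prompt="The future of artificial intelligence is", max_new_tokens=30):
    # Reads the global `device`, so callers can switch it before benchmarking.
    model.to(device)
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # GPT-2 has no pad token; reuse EOS so generate() does not warn about it.
    gen_kwargs = {"pad_token_id": tokenizer.eos_token_id}

    with torch.no_grad():
        # Warm-up (not timed)
        _ = model.generate(**inputs, max_new_tokens=5, **gen_kwargs)

        # CUDA kernels run asynchronously; synchronize so the timer sees finished work.
        if device == "cuda":
            torch.cuda.synchronize()
        start_time = time.perf_counter()
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, min_new_tokens=max_new_tokens, **gen_kwargs)
        if device == "cuda":
            torch.cuda.synchronize()
        end_time = time.perf_counter()

    latency = end_time - start_time
    # Count the tokens actually produced instead of assuming max_new_tokens.
    tokens_generated = output.shape[1] - inputs["input_ids"].shape[1]
    throughput = tokens_generated / latency

    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    return {
        "latency": latency,          # seconds for the timed generate() call
        "throughput": throughput,    # generated tokens per second
        "size_mb": get_model_size(model),
        "sample_output": decoded
    }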

## 🏗️ The Framework
Each method is loaded, benchmarked with the same helper, and appended to a shared `results` list.
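
One of the methods run here is L1 unstructured pruning, which zeroes the smallest-magnitude weights regardless of where they sit in the tensor. As a toy illustration in plain Python (the helper `l1_prune` is hypothetical and works on a flat list rather than a tensor, and its tie-breaking among equal magnitudes may differ from PyTorch's):

```python
def l1_prune(weights, amount):
    """Zero the `amount` fraction of entries with the smallest absolute value."""
    n_prune = round(amount * len(weights))
    # Indices sorted by magnitude, smallest first; the first n_prune get zeroed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.3, 0.08, 0.6, -0.15]
pruned = l1_prune(weights, amount=0.3)
sparsity = pruned.count(0.0) / len(pruned)
print(pruned)
print(f"sparsity: {sparsity:.0%}")
```

The pruned list has the same length as the input: the zeros still occupy storage, which is why unstructured pruning on its own does not reduce a dense model's size.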

In [None]:
results = []
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# 1. Baseline (FP32)
print("Benchmarking Base Model...")
model_base = GPT2LMHeadModel.from_pretrained("gpt2")
res_base = benchmark_model(model_base, tokenizer)
res_base["method"] = "Baseline (FP32)"
results.append(res_base)
del model_base

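# 2. Dynamic Quantization (INT8)
print("Benchmarking Dynamic Quantization...")
model_quant = GPT2LMHeadModel.from_pretrained("gpt2").to("cpu")  # Dynamic quantization runs on CPU
# Note: quantize_dynamic only swaps the module types listed in the set. Hugging Face's
# GPT-2 uses a custom Conv1D layer for its attention and MLP projections, so with
# {torch.nn.Linear} only the output head (lm_head) is quantized.
model_quant = torch.quantization.quantize_dynamic(model_quant, {torch.nn.Linear}, dtype=torch.qint8)
# benchmark_model reads the global `device`, so point it at the CPU for this run
orig_device = device
device = "cpu"
res_quant = benchmark_model(model_quant, tokenizer)
res_quant["method"] = "Quantization (INT8-CPU)"
results.append(res_quant)
device = orig_device  # restore the device for the remaining methods
del model_quant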

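# 3. Pruning (30% Unstructured)
print("Benchmarking Pruning...")
from transformers.pytorch_utils import Conv1D  # GPT-2's attention/MLP projections are Conv1D, not nn.Linear
model_prune = GPT2LMHeadModel.from_pretrained("gpt2")
for name, module in model_prune.named_modules():
    if isinstance(module, (torch.nn.Linear, Conv1D)):
        prune.l1_unstructured(module, name='weight', amount=0.3)
        # Bake the mask into the weight so the extra weight_orig/weight_mask
        # tensors do not inflate the measured size.
        prune.remove(module, 'weight')
# The zeros are still stored densely, so size and speed are not expected to
# improve without sparse storage or sparse kernels.
res_prune = benchmark_model(model_prune, tokenizer)
res_prune["method"] = "Pruning (30% L1)"
results.append(res_prune)
del model_prune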

## 📊 Comparison Dashboard
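
A quick sanity check for the size column is to estimate parameter storage from the architecture. The sketch below assumes the standard `gpt2` (small) configuration: 12 layers, hidden size 768, vocabulary 50257, 1024 positions, and an output head tied to the token embedding. The helper names are hypothetical, and buffers are ignored.

```python
def gpt2_param_count(n_layer=12, d_model=768, vocab=50257, n_pos=1024):
    """Count parameters of a GPT-2 style decoder with a tied output head."""
    embeddings = vocab * d_model + n_pos * d_model
    layer_norm = 2 * d_model                                  # scale + shift
    attention = (d_model * 3 * d_model + 3 * d_model) \
        + (d_model * d_model + d_model)                       # qkv projection + output projection
    mlp = (d_model * 4 * d_model + 4 * d_model) \
        + (4 * d_model * d_model + d_model)                   # expand + contract
    per_layer = 2 * layer_norm + attention + mlp
    return embeddings + n_layer * per_layer + layer_norm      # last term: final layer norm

def size_mib(n_params, bits_per_param):
    """Storage for n_params at the given bit width, in MiB."""
    return n_params * bits_per_param / 8 / 1024**2

n = gpt2_param_count()
for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8)]:
    print(f"{label}: {size_mib(n, bits):.1f} MiB for {n:,} parameters")
```

The FP32 baseline's reported `size_mb` should land close to the FP32 estimate plus whatever buffers the model registers; a large gap in either direction usually means the measurement is counting something unexpected.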

In [None]:
df = pd.DataFrame(results)
display(df[["method", "size_mb", "latency", "throughput"]])

# Visualizations
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

df.plot(x="method", y="size_mb", kind="bar", ax=ax[0], title="Model Size (MB) - Lower is Better", color="skyblue")
df.plot(x="method", y="throughput", kind="bar", ax=ax[1], title="Throughput (Tokens/sec) - Higher is Better", color="salmon")

plt.show()

## 🚀 Add Your Own Algorithm
To test a new method, load your compressed model, call `benchmark_model(my_model, tokenizer)`, set the result's `"method"` label, and append it to `results` before re-running the dashboard cell.
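
One possible way to standardize this step is a small wrapper that pairs a label with a model-building function. `CompressionMethod`, `build`, and `run` below are hypothetical names, and the benchmark function is passed in so that the sketch runs without loading GPT-2; in the notebook it would be `benchmark_model`. The sizes are placeholder values, not measurements.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class CompressionMethod:
    """Pairs a display name with a function that returns a ready-to-benchmark model."""
    name: str
    build: Callable[[], Any]

    def run(self, benchmark: Callable[..., Dict[str, Any]], tokenizer: Any) -> Dict[str, Any]:
        model = self.build()
        result = dict(benchmark(model, tokenizer))  # copy so the benchmark's dict is untouched
        result["method"] = self.name
        return result

# Stand-ins so the sketch is self-contained; swap in real loaders and benchmark_model.
def fake_benchmark(model, tokenizer):
    return {"latency": 1.0, "throughput": 30.0, "size_mb": float(model["mb"]), "sample_output": ""}

methods = [
    CompressionMethod("Baseline (placeholder)", lambda: {"mb": 100}),
    CompressionMethod("My method (placeholder)", lambda: {"mb": 50}),
]
results_demo: List[Dict[str, Any]] = [m.run(fake_benchmark, tokenizer=None) for m in methods]
for row in results_demo:
    print(row["method"], row["size_mb"])
```

Because each row carries the same keys as the dictionaries returned by `benchmark_model`, rows built this way can be appended to `results` and plotted by the dashboard cell unchanged.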