# 🧪 Experimentation Framework From Scratch

[!["Open In Colab"](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adiel2012/model-size-reduction/blob/main/experiment_framework.ipynb)

## 📖 The Theory: Evaluation Metrics for Compression

When compressing models, we track the **Pareto frontier** between quality and efficiency: the set of configurations where neither can be improved without making the other worse.
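To make the Pareto idea concrete, here is a minimal sketch that filters a list of candidates down to the non-dominated ones. The function name `pareto_frontier`, the tuple layout, and all numbers are hypothetical and chosen for illustration only; lower is treated as better for both size and perplexity.

```python
def pareto_frontier(points):
    """Return the names of points not dominated by any other point.

    Each point is (name, size_mb, perplexity); lower is better for both.
    A point is dominated if another point is at least as good on both
    axes and strictly better on at least one.
    """
    frontier = []
    for name, size, ppl in points:
        dominated = any(
            (s <= size and p <= ppl) and (s < size or p < ppl)
            for _, s, p in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical measurements, not real benchmark results
candidates = [
    ("fp32", 500.0, 25.0),
    ("int8", 125.0, 26.0),
    ("int4", 65.0, 30.0),
    ("bad-int8", 130.0, 29.0),  # larger AND worse than "int8"
]
print(pareto_frontier(candidates))
```

Here `bad-int8` drops out because `int8` is both smaller and more accurate; the other three each represent a different trade-off.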

### 1. Throughput (Tokens/sec)
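Measures how many tokens the model generates per second, where $N_{\text{tokens}}$ is the number of generated tokens and $\Delta t$ is the wall-clock generation time. For real-time LLM apps, it is one of the main drivers of perceived responsiveness.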
$$\text{Throughput} = \frac{N_{\text{tokens}}}{\Delta t}$$
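As a quick worked example of the formula, the sketch below computes throughput and its inverse, the average time per token. The helper name `throughput` and the input numbers are assumptions for illustration.

```python
def throughput(n_tokens, seconds):
    """Tokens generated per second of wall-clock time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return n_tokens / seconds


# Illustrative numbers: 30 new tokens generated in 0.5 seconds
tok_per_s = throughput(30, 0.5)
ms_per_token = 1000.0 / tok_per_s

print(f"{tok_per_s:.1f} tok/s, {ms_per_token:.2f} ms/token")
```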

### 2. Compression Ratio
The ratio of the original (FP32) model size to the compressed model size. A higher ratio means a smaller compressed model.
$$\text{Ratio} = \frac{\text{Size}_{FP32}}{\text{Size}_{compressed}}$$
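The ratio can be estimated before running anything by counting bits per parameter. This is a rough sketch: the helper names are hypothetical, the parameter count is an approximate figure for GPT-2 small, and it ignores the per-group scales and zero-points that real quantized checkpoints also store, so measured ratios are usually a little lower.

```python
def size_bytes(n_params, bits_per_param):
    """Storage needed for the weights alone, in bytes."""
    return n_params * bits_per_param / 8


def compression_ratio(n_params, compressed_bits, baseline_bits=32):
    """Size_FP32 / Size_compressed for a uniform bit width."""
    return size_bytes(n_params, baseline_bits) / size_bytes(n_params, compressed_bits)


n_params = 124_000_000  # approximate parameter count, for illustration
for bits in (16, 8, 4):
    ratio = compression_ratio(n_params, bits)
    mb = size_bytes(n_params, bits) / (1024 ** 2)
    print(f"{bits:>2}-bit: {mb:8.1f} MB, ratio {ratio:.1f}x")
```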

### 3. Degradation (Perplexity Delta)
Measures how much accuracy or coherence is lost relative to the uncompressed baseline. For LLMs, we often use **perplexity** on a held-out corpus or **zero-shot accuracy** on benchmarks like MMLU.
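Perplexity is the exponential of the average negative log-likelihood per token, and the delta is the difference between the compressed model and the baseline:

$$\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right), \qquad \Delta\text{PPL} = \text{PPL}_{compressed} - \text{PPL}_{baseline}$$

A minimal sketch of that calculation, assuming the per-token log-probabilities have already been extracted from each model. The function name and the toy probabilities are made up for illustration.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy log-probs: the baseline assigns p=0.5 to every token,
# the compressed model assigns p=0.25 (hypothetical values)
base_lp = [math.log(0.5)] * 4
comp_lp = [math.log(0.25)] * 4

ppl_base = perplexity(base_lp)
ppl_comp = perplexity(comp_lp)
delta = ppl_comp - ppl_base
print(f"baseline {ppl_base:.2f}, compressed {ppl_comp:.2f}, delta {delta:.2f}")
```

A positive delta means the compressed model is more "surprised" by the text than the baseline.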

---

import torch
import time
import pandas as pd
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class CompressionBenchmark:
    """
    A unified class to measure and compare different models.
    """
    def __init__(self, model_name="gpt2"):
        self.tokenizer = GPT2Tokenizer.from_pretrained(model_name)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def run(self, model, name):
        model.to(self.device).eval()
        
        # 1. Size: bytes occupied by the parameters (element count x bytes per element).
        # Registered buffers and any runtime overhead are not counted.
        size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / (1024**2)
        
        # 2. Latency / Throughput
        prompt = "The key to model optimization is"
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        
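        # GPT-2 has no pad token; reuse EOS so generate() does not warn
        gen_kwargs = {"pad_token_id": self.tokenizer.eos_token_id}
        n_tokens = 30

        with torch.no_grad():
            # Warmup (excluded from timing)
            _ = model.generate(**inputs, max_new_tokens=5, **gen_kwargs)

            # CUDA kernels run asynchronously, so synchronize around the timed region
            if self.device == "cuda":
                torch.cuda.synchronize()
            start = time.perf_counter()
            # Generate a fixed number of tokens
            _ = model.generate(**inputs, max_new_tokens=n_tokens, min_new_tokens=n_tokens, **gen_kwargs)
            if self.device == "cuda":
                torch.cuda.synchronize()
            duration = time.perf_counter() - start

        throughput = n_tokens / duration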
        
        return {
            "Method": name,
            "Size (MB)": f"{size_mb:.2f}",
            "Throughput (tok/s)": f"{throughput:.2f}",
            "Latency (s)": f"{duration:.4f}"
        }

benchmark = CompressionBenchmark()
base_model = GPT2LMHeadModel.from_pretrained("gpt2")
results = benchmark.run(base_model, "Baseline (FP32)")
print(results)
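
The benchmark above times a single `generate` call, and one measurement can be noisy. One possible refinement is to repeat the call and report the median. The helper below is a hypothetical sketch using only the standard library, with a cheap stand-in workload instead of a model so it runs anywhere.

```python
import statistics
import time


def time_repeated(fn, repeats=5, warmup=1):
    """Call fn several times and return (median_seconds, all_samples)."""
    for _ in range(warmup):
        fn()  # warmup runs are not recorded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), samples


# Stand-in workload; in the notebook this would wrap model.generate(...)
median_s, samples = time_repeated(lambda: sum(range(100_000)), repeats=5)
print(f"median {median_s * 1000:.3f} ms over {len(samples)} runs")
```

When timing on a GPU, the wrapped function would also need to call `torch.cuda.synchronize()` before returning, since CUDA work is launched asynchronously.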