# The OptiPFair Series #1: Forging the Future with Small Models

## Introduction: The Quest for Efficiency
> *"We live in the age of giants, and perhaps we're witnessing their fall?"* - Principia Agentica

We've entered the age of **efficiency**. The rise of *Small Language Models* (SLMs) is a necessary market correction. But how do we take these models and make them even faster, lighter, and fairer without destroying their intelligence?

This notebook explores **OptiPFair**, a library created by Pere Martra designed exactly for this purpose. We will test two main pruning strategies:
1. **Width Pruning (MLP_GLU)**: Removing individual neurons from the MLP (GLU) layers of each block.
2. **Depth Pruning**: Eliminating entire transformer blocks.

Our goal: Find the "Sweet Spot" for edge deployment.
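To make the two strategies concrete, a back-of-the-envelope parameter count can help. The sketch below assumes the dimensions published in the Llama-3.2-1B config (hidden size 2048, MLP intermediate size 8192, 16 blocks, 32 query heads and 8 key/value heads of dimension 64). None of these numbers are stated in this notebook, so check them against `model.config` before relying on them. It also assumes width pruning removes `int(0.20 * 8192)` neurons per block, which may differ from how OptiPFair rounds.

```python
# Rough parameter arithmetic for the two pruning strategies.
# All dimensions are assumptions taken from the public Llama-3.2-1B config.
HIDDEN = 2048
INTERMEDIATE = 8192
NUM_LAYERS = 16
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 64


def mlp_params(hidden: int, intermediate: int) -> int:
    """A GLU MLP has three matrices: gate_proj, up_proj and down_proj."""
    return 3 * hidden * intermediate


def attention_params(hidden: int, heads: int, kv_heads: int, head_dim: int) -> int:
    """q_proj and o_proj use all heads; k_proj and v_proj use the KV heads."""
    q_and_o = 2 * hidden * heads * head_dim
    k_and_v = 2 * hidden * kv_heads * head_dim
    return q_and_o + k_and_v


def block_params(intermediate: int = INTERMEDIATE) -> int:
    """One transformer block: attention + MLP + two RMSNorm weight vectors."""
    attn = attention_params(HIDDEN, NUM_HEADS, NUM_KV_HEADS, HEAD_DIM)
    return attn + mlp_params(HIDDEN, intermediate) + 2 * HIDDEN


# Width pruning: drop 20% of the MLP neurons in every block (rounding is an assumption).
neurons_removed = int(0.20 * INTERMEDIATE)
width_removed = NUM_LAYERS * (block_params() - block_params(INTERMEDIATE - neurons_removed))

# Depth pruning: drop three whole blocks.
depth_removed = 3 * block_params()

print(f"Params per block:         {block_params():,}")
print(f"Removed by 20% width:     {width_removed:,}")
print(f"Removed by 3-layer depth: {depth_removed:,}")
```

Under these assumptions the two presets tested in this notebook remove a broadly similar number of parameters, which makes their speed comparison a reasonably fair one.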


## 1) Setup

In [None]:
# Check GPU and install deps
!nvidia-smi || echo "No GPU detected"
!pip -q install --upgrade pip
!pip -q install transformers accelerate torch datasets "optipfair[viz]" numpy pandas scikit-learn matplotlib seaborn nbconvert

BASE_MODEL = "meta-llama/Llama-3.2-1B"
OUT_DIR = "/content/models"
RESULTS_DIR = "/content/results"
!mkdir -p $OUT_DIR $RESULTS_DIR /content/bias_analysis /content/activation_analysis

## 2) Mount Drive

In [None]:
from google.colab import drive
DRIVE_DIR = "/content/drive/MyDrive/optipfair_experiments"
drive.mount("/content/drive")
!mkdir -p $DRIVE_DIR

## 3) Baseline benchmark

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import time, torch, json, os

def time_inference(model_name: str, prompt: str, max_new_tokens: int = 64, runs: int = 5):
    # The tokenizer loads the same way from a Hub ID or a local directory
    tok = AutoTokenizer.from_pretrained(model_name)

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        dtype=torch.float16 if torch.cuda.is_available() else None,
        device_map="auto" if torch.cuda.is_available() else None
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
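    times, new_tokens = [], []
    for _ in range(runs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times.append(time.perf_counter() - t0)
        # generate() may stop early at EOS, so count the tokens actually produced
        new_tokens.append(out.shape[-1] - inputs["input_ids"].shape[-1])
    avg = sum(times) / len(times)
    tps = sum(new_tokens) / sum(times)
    return {"avg_time": avg, "tokens_per_second": tps}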

baseline = time_inference(BASE_MODEL, "Paris is the capital of", 64, 5)
with open(f"{RESULTS_DIR}/baseline.json", "w") as f:
    json.dump(baseline, f, indent=2)

baseline

### Width Pruning: The Precision Scalpel
Width pruning methods (like MLP_GLU with MAW, the neuron selection method used below) reduce the parameter count by removing the least important neurons. In theory that also cuts compute. However, as architectural analyses have noted, they often **fail to improve actual inference speed** in small-batch scenarios (like local devices), because the resulting layer widths break the memory alignment that GPUs love.

Let's verify this hypothesis.
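Before running the real experiment, a toy example may help show what removing a neuron from a GLU MLP means mechanically. The sketch below uses NumPy and random weights. The function names and the scoring rule (largest absolute weight feeding each neuron) are simplifications chosen for illustration; they are not OptiPFair's implementation of MAW.

```python
import numpy as np


def silu(x: np.ndarray) -> np.ndarray:
    """SiLU activation, the gate nonlinearity used in Llama-style GLU MLPs."""
    return x / (1.0 + np.exp(-x))


def glu_forward(x, gate_w, up_w, down_w):
    """y = down( silu(gate(x)) * up(x) ), with weights stored as (out, in)."""
    return (silu(x @ gate_w.T) * (x @ up_w.T)) @ down_w.T


def prune_glu(gate_w, up_w, down_w, fraction):
    """Remove the lowest-scoring neurons from all three matrices at once.

    The score is a simplified stand-in chosen for this sketch,
    not OptiPFair's MAW formula.
    """
    scores = np.maximum(np.abs(gate_w).max(axis=1), np.abs(up_w).max(axis=1))
    n_keep = gate_w.shape[0] - int(fraction * gate_w.shape[0])
    keep = np.sort(np.argsort(scores)[-n_keep:])  # kept neuron indices, in order
    # Neuron i is row i of gate/up and column i of down.
    return gate_w[keep], up_w[keep], down_w[:, keep], keep


rng = np.random.default_rng(0)
hidden, intermediate = 8, 20
gate_w = rng.normal(size=(intermediate, hidden))
up_w = rng.normal(size=(intermediate, hidden))
down_w = rng.normal(size=(hidden, intermediate))

g, u, d, keep = prune_glu(gate_w, up_w, down_w, fraction=0.2)
print(g.shape, u.shape, d.shape)
```

The pruned intermediate width is simply whatever the percentage produces. On a real model, an awkward width of that kind is one plausible reason the smaller matrices do not run proportionally faster.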


## 4) GLU 20% pruning + benchmark

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from optipfair import prune_model
import json

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
tok = AutoTokenizer.from_pretrained(BASE_MODEL) # Get tokenizer here
pruned_glu, s1 = prune_model(
    model,
    pruning_type="MLP_GLU",
    neuron_selection_method="MAW",
    pruning_percentage=20,
    return_stats=True
)
pruned_glu.save_pretrained(f"{OUT_DIR}/pruned_glu_20")
tok.save_pretrained(f"{OUT_DIR}/pruned_glu_20") # Explicitly save tokenizer

glu20 = time_inference(f"{OUT_DIR}/pruned_glu_20", "Paris is the capital of", 64, 5)
with open(f"{RESULTS_DIR}/glu20.json", "w") as f:
    json.dump(glu20, f, indent=2)
{"stats": s1, "bench": glu20}

### Depth Pruning: The Sledgehammer (with Finesse)
> *"OptiPFair's 'sweet spot' is sub-13B models... By removing complete transformer blocks (depth pruning), we achieve hardware-agnostic acceleration."* - Pere Martra

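Depth pruning is more aggressive. We aren't just trimming fat; we are removing organs. But surprisingly, this often yields the best TPS (Tokens Per Second) return. The recommended practice is to protect the first and last layers (the "Input Processing" and "Output Consolidation" phases) and target the middle blocks. The cell below deliberately uses the simpler `last` selection as a quick first test, so it does not follow that recommendation; the optional importance-informed cell later in this notebook picks layers by measured importance instead.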
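The "protect the ends, target the middle" rule can be sketched as a small index helper. The function name and the default of two protected layers on each side are hypothetical choices for this example, and the 16-block count is an assumption about Llama-3.2-1B.

```python
def middle_layers_to_remove(num_layers: int, n_remove: int,
                            protect_first: int = 2, protect_last: int = 2) -> list[int]:
    """Pick n_remove contiguous block indices centred in the unprotected range."""
    start, stop = protect_first, num_layers - protect_last  # candidates: [start, stop)
    if n_remove > stop - start:
        raise ValueError("Not enough unprotected layers to remove")
    offset = start + (stop - start - n_remove) // 2
    return list(range(offset, offset + n_remove))


# 16 blocks is an assumption; read model.config.num_hidden_layers to confirm.
print(middle_layers_to_remove(num_layers=16, n_remove=3))  # prints [6, 7, 8]
```

A list like this could be passed as `layer_indices` to `prune_model(..., pruning_type="DEPTH")`, the same argument this notebook uses in its importance-informed cell.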


## 5) Depth last 3 pruning + benchmark

In [None]:
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
tok_2 = AutoTokenizer.from_pretrained(BASE_MODEL) # Get tokenizer here
pruned_depth, s2 = prune_model(
    model,
    pruning_type="DEPTH",
    num_layers_to_remove=3,
    layer_selection_method="last",
    return_stats=True
)
pruned_depth.save_pretrained(f"{OUT_DIR}/pruned_depth_last3")
tok_2.save_pretrained(f"{OUT_DIR}/pruned_depth_last3") # Explicitly save tokenizer

depth3 = time_inference(f"{OUT_DIR}/pruned_depth_last3", "Paris is the capital of", 64, 5)
with open(f"{RESULTS_DIR}/depth_last3.json", "w") as f:
    json.dump(depth3, f, indent=2)
{"stats": s2, "bench": depth3}

## 6) Log results to CSV

In [None]:
import csv, os, datetime, json

csv_path = f"{RESULTS_DIR}/runs.csv"
fieldnames = ["run_id","preset","params","model_path","tokens_per_second","avg_time_s","peak_mem_gb","quality_metric","value","notes"]
now = datetime.datetime.now().strftime("%Y-%m-%d-%H%M%S")

rows = []
for name in ["baseline","glu20","depth_last3"]:
    with open(f"{RESULTS_DIR}/{name}.json") as f:
        d = json.load(f)
    rows.append({
        "run_id": f"{now}-{name}",
        "preset": name,
        "params": {"baseline":"-","glu20":"MLP_GLU:20% MAW","depth_last3":"DEPTH:last:3"}[name],
        "model_path": BASE_MODEL if name=="baseline" else (f"{OUT_DIR}/pruned_glu_20" if name=="glu20" else f"{OUT_DIR}/pruned_depth_last3"),
        "tokens_per_second": d.get("tokens_per_second"),
        "avg_time_s": d.get("avg_time"),
        "peak_mem_gb": "",
        "quality_metric": "",
        "value": "",
        "notes": ""
    })

exists = os.path.exists(csv_path)
with open(csv_path, "a", newline="") as f:
    w = csv.DictWriter(f, fieldnames=fieldnames)
    if not exists:
        w.writeheader()
    w.writerows(rows)

csv_path

### 6.1) Analysis & Visualization


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Path handling for Colab vs Local
csv_path = f"{RESULTS_DIR}/runs.csv"

if os.path.exists(csv_path):
    try:
        df = pd.read_csv(csv_path)
        # Clean col names
        df.columns = df.columns.str.strip()

        # Setup plot
        plt.figure(figsize=(10, 6))
        sns.set_style("whitegrid")

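        # The CSV is appended on every run, so average repeated presets into one bar each
        summary = df.groupby('preset', sort=False)['tokens_per_second'].mean().reset_index()

        # Define colors: highlight 'depth' strategies in green
        colors = ['#28a745' if 'depth' in str(x).lower() else '#6c757d' for x in summary['preset']]

        ax = sns.barplot(x='preset', y='tokens_per_second', data=summary, palette=colors)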

        plt.title("The Laboratory Verdict: Depth vs Width Pruning Speed", fontsize=16, fontweight='bold', pad=20)
        plt.ylabel("Tokens Per Second (TPS)", fontsize=12)
        plt.xlabel("Pruning Strategy", fontsize=12)

        # Add labels
        for p in ax.patches:
            height = p.get_height()
            if height > 0:
                ax.text(p.get_x() + p.get_width()/2., height + 0.05,
                        '{:1.2f}'.format(height),
                        ha="center", fontsize=11, fontweight='bold')

        plt.tight_layout()
        plt.show()

        # Print speedup, averaged over logged runs (names avoid shadowing the baseline dict)
        base_tps = df[df['preset']=='baseline']['tokens_per_second'].values
        depth_tps = df[df['preset']=='depth_last3']['tokens_per_second'].values

        if len(base_tps) > 0 and len(depth_tps) > 0:
            speedup = ((depth_tps.mean() - base_tps.mean()) / base_tps.mean()) * 100
            # {speedup:+.1f} prints the correct sign if pruning turns out slower
            print(f"\n🚀 Depth Pruning Speedup: {speedup:+.1f}% over Baseline")

    except Exception as e:
        print(f"Visualization error: {e}")
else:
    print("No results csv found to visualize.")


### The Laboratory Verdict
As visualized above, **Depth Pruning** typically delivers the most tangible jump in Tokens Per Second (TPS).

While width pruning maintains the global structure better, depth pruning offers raw speed. For an architect looking to deploy on edge devices where latency is king, this is the metric that matters.
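Because latency is the quantity that matters on the edge, it can help to translate TPS into milliseconds per token alongside the relative speedup. The helper below is a minimal sketch; its name and the example numbers are placeholders for illustration, not measurements from this notebook.

```python
def speed_report(baseline_tps: float, pruned_tps: float) -> dict:
    """Relative throughput change and per-token latency for two TPS readings."""
    if baseline_tps <= 0 or pruned_tps <= 0:
        raise ValueError("TPS values must be positive")
    return {
        "speedup_pct": (pruned_tps - baseline_tps) / baseline_tps * 100,
        "baseline_ms_per_token": 1000 / baseline_tps,
        "pruned_ms_per_token": 1000 / pruned_tps,
    }


# Placeholder numbers for illustration only.
report = speed_report(baseline_tps=40.0, pruned_tps=50.0)
print(report)
```

In practice the two arguments would come from the `tokens_per_second` values saved in the results JSON files.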

> *"Efficiency isn't just a technical metric. It's a commitment to a sustainable future for AI."*


## 7) Save artifacts to Drive

In [None]:
import os, shutil

# Expected dirs from previous cells
RESULTS_DIR = globals().get("RESULTS_DIR", "/content/results")
OUT_DIR = globals().get("OUT_DIR", "/content/models")
DRIVE_DIR = globals().get("DRIVE_DIR", "/content/drive/MyDrive/optipfair_experiments")

assert os.path.isdir(RESULTS_DIR), f"Results dir not found: {RESULTS_DIR}"
assert os.path.isdir(OUT_DIR), f"Models dir not found: {OUT_DIR}"

if os.path.ismount("/content/drive") and os.path.isdir(DRIVE_DIR):
    os.makedirs(DRIVE_DIR, exist_ok=True)
    dest_results = os.path.join(DRIVE_DIR, "results")
    dest_models = os.path.join(DRIVE_DIR, "models")
    shutil.copytree(RESULTS_DIR, dest_results, dirs_exist_ok=True)
    shutil.copytree(OUT_DIR, dest_models, dirs_exist_ok=True)
    print("Artifacts copied:")
    print("  Results →", dest_results)
    print("  Models  →", dest_models)
else:
    print("Google Drive not mounted or target folder missing.")
    print("Run the Drive mount cell first and ensure DRIVE_DIR exists:", DRIVE_DIR)

### 7.1) Verify copied artifacts (sizes and listing)

In [None]:
import os
from pathlib import Path

def sizeof_gb(p: Path) -> float:
    if p.is_file():
        return p.stat().st_size / (1024**3)
    total = 0
    for root, _, files in os.walk(p):
        for f in files:
            try:
                total += (Path(root)/f).stat().st_size
            except Exception:
                pass
    return total / (1024**3)

DRIVE_DIR = globals().get("DRIVE_DIR", "/content/drive/MyDrive/optipfair_experiments")
results_dir = Path(DRIVE_DIR)/"results"
models_dir = Path(DRIVE_DIR)/"models"

print("Drive base:", DRIVE_DIR)
if results_dir.exists():
    print("Results files:")
    for p in sorted(results_dir.glob("**/*")):
        if p.is_file():
            print(f"  {p.relative_to(DRIVE_DIR)} - {p.stat().st_size/1024:.1f} KB")
    print(f"Total results size: {sizeof_gb(results_dir):.3f} GB")
else:
    print("Results folder not found.")

if models_dir.exists():
    print("Models files:")
    for p in sorted(models_dir.glob("**/*")):
        if p.is_file():
            size_gb = p.stat().st_size / (1024**3)
            print(f"  {p.relative_to(DRIVE_DIR)} - {size_gb:.3f} GB")
    print(f"Total models size: {sizeof_gb(models_dir):.3f} GB")
else:
    print("Models folder not found.")

## 8) Optional: DEPTH informed (analyze_layer_importance)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.utils.data import DataLoader
from datasets import load_dataset
from optipfair import analyze_layer_importance, prune_model
import numpy as np, json

# Tiny dataset for importance (keep it small on Colab)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
tok_3 = AutoTokenizer.from_pretrained(BASE_MODEL) # Get tokenizer here

def prepare_dataloader():
    ds = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train[:200]')
    # WikiText contains blank lines; drop them so no sample is pure padding
    ds = ds.filter(lambda ex: len(ex['text'].strip()) > 0)
    tok = AutoTokenizer.from_pretrained(BASE_MODEL)
    tok.pad_token = tok.eos_token  # padding='max_length' needs a pad token, so reuse EOS
    def tok_fn(ex):
        # map() stores plain lists; set_format below converts them to tensors
        return tok(ex['text'], truncation=True, padding='max_length', max_length=256)
    ds = ds.map(tok_fn, batched=True)
    ds.set_format(type='torch', columns=['input_ids','attention_mask'])
    return DataLoader(ds, batch_size=4)

loader = prepare_dataloader()
imp = analyze_layer_importance(model, loader)
# Pick least important 3 layers
layers_to_remove = [k for k,_ in sorted(imp.items(), key=lambda x: x[1])[:3]]
print('Least important layers:', layers_to_remove)

# Prune using informed indices
pruned_inf, s_inf = prune_model(
    model,
    pruning_type='DEPTH',
    layer_indices=layers_to_remove,
    return_stats=True
)
pruned_inf.save_pretrained(f"{OUT_DIR}/pruned_depth_informed")
tok_3.save_pretrained(f"{OUT_DIR}/pruned_depth_informed") # Explicitly save tokenizer

# Bench
inf = time_inference(f"{OUT_DIR}/pruned_depth_informed", "Paris is the capital of", 64, 5)
with open(f"{RESULTS_DIR}/depth_informed.json", "w") as f:
    json.dump(inf, f, indent=2)
{"layers_removed": layers_to_remove, "stats": s_inf, "bench": inf}

## Visualizing the Invisible: Ethics & Bias
It's not enough to make the model fast. We need to know if pruning it amplifies biases.

Pere Martra warns: *"An 'efficient' model that amplifies prejudices isn't production-ready; it's a liability risk."*

OptiPFair includes a bias visualization module. Let's inspect if our model treats demographic groups differently.
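As a conceptual aid, the idea behind comparing a prompt pair can be sketched as a per-layer mean absolute difference between the activations each prompt produces. This is an assumption about the general technique, not OptiPFair's implementation of `mean_diff`; the function name, layer names, and synthetic activations below are all hypothetical.

```python
import numpy as np


def mean_abs_diff_per_layer(acts_a: dict, acts_b: dict) -> dict:
    """Mean absolute activation difference for each layer present in both runs.

    acts_a / acts_b map a layer name to an array of shape (tokens, hidden)
    captured for the two prompts of a pair; shapes must match per layer.
    """
    return {
        layer: float(np.mean(np.abs(acts_a[layer] - acts_b[layer])))
        for layer in acts_a
        if layer in acts_b
    }


# Synthetic activations stand in for real forward hooks.
rng = np.random.default_rng(0)
base = {f"layer_{i}": rng.normal(size=(10, 16)) for i in (0, 8, 15)}
# Shift deeper layers more, to mimic a difference that grows with depth.
other = {name: act + 0.01 * (i + 1) for i, (name, act) in enumerate(base.items())}

diffs = mean_abs_diff_per_layer(base, other)
for layer, diff in diffs.items():
    print(f"{layer}: {diff:.3f}")
```

A larger value at a given layer means the two prompts are represented more differently there. Comparing these values before and after pruning is one way to check whether pruning widened the gap.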


## 9) Optional: Bias visualization quick check

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from optipfair.bias import visualize_bias
import json

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
tok = AutoTokenizer.from_pretrained(BASE_MODEL)

prompt_pairs = [
    ("The white doctor examined the patient. The nurse thought",
     "The Black doctor examined the patient. The nurse thought")
]
_, metrics = visualize_bias(
    model,
    tok,
    prompt_pairs=prompt_pairs,
    visualization_types=["mean_diff"],
    layers="first_middle_last",
    output_dir="/content/bias_analysis",
)
print(json.dumps(metrics, indent=2))

In [None]:
import shutil, os
if os.path.exists(DRIVE_DIR):
    shutil.copytree(RESULTS_DIR, f"{DRIVE_DIR}/results", dirs_exist_ok=True)
    shutil.copytree(OUT_DIR, f"{DRIVE_DIR}/models", dirs_exist_ok=True)
    print("Artifacts copied to", DRIVE_DIR)
else:
    print("Drive not mounted; skipped copy")