# Lab 3.2: Wanda Pruning - Inference Comparison

**Goal:** Compare dense vs sparse model inference quality and performance.

**You will learn to:**
- Load both dense and pruned sparse models
- Compare text generation quality side-by-side
- Measure latency and throughput differences
- Evaluate generation quality across diverse tasks
- Understand sparsity-quality tradeoffs

---

## Why Compare Dense vs Sparse?

**Key questions to answer**:
1. **Quality**: How much accuracy is lost at 50% sparsity?
2. **Performance**: Is sparse inference faster? (It depends on hardware and kernel support.)
3. **Usability**: Can the sparse model handle diverse tasks?

**Expected outcomes** (50% Wanda pruning):
- **Perplexity**: +7.7% (WikiText-2: 5.68 → 6.12)
- **Quality**: Minor degradation, mostly coherent
- **Speed**: Similar (without sparse acceleration)
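
The percentage quoted for perplexity is a relative change between the two figures above. A minimal sketch of that arithmetic, where the helper name `relative_increase` is an illustrative choice and not part of the lab code:

```python
def relative_increase(baseline: float, new: float) -> float:
    """Return the relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0

# Perplexity figures quoted above (WikiText-2, dense vs 50% pruned).
dense_ppl, sparse_ppl = 5.68, 6.12
increase_pct = relative_increase(dense_ppl, sparse_ppl)
print(f"Perplexity increase: {increase_pct:+.1f}%")  # prints +7.7%
```

The same helper can be reused later to compare measured perplexities against this expectation.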

---

## Prerequisites

Make sure you have completed:
- **01-Setup.ipynb**: Environment setup
- **02-Prune.ipynb**: Applied Wanda pruning and saved model

---
## Step 1: Load Pruned Sparse Model

Load the pruned model from the previous notebook.
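
Once the model is loaded, the sparsity recorded in `pruning_config.json` can be cross-checked against the weights themselves. The sketch below is one possible check, not the lab's own method: the helper `linear_sparsity` is a hypothetical name, and it is demonstrated on a toy network so that it runs without the 7B checkpoint.

```python
import torch
import torch.nn as nn

def linear_sparsity(model: nn.Module) -> float:
    """Fraction of exactly-zero weights across all nn.Linear layers."""
    zeros, total = 0, 0
    for module in model.modules():
        if isinstance(module, nn.Linear):
            weight = module.weight.data
            zeros += int((weight == 0).sum().item())
            total += weight.numel()
    return zeros / total if total else 0.0

# Toy demonstration: zero out half of the first layer's weights.
toy = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))
with torch.no_grad():
    toy[0].weight[:, :4] = 0.0  # 32 of the 64 weights in the first layer
print(f"Sparsity over Linear layers: {linear_sparsity(toy):.2%}")
```

Calling `linear_sparsity(sparse_model)` after the cell below would give a comparable figure. If some linear layers were left unpruned (for example an output head), the result would sit below the target sparsity.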

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import json

# Paths
PRUNED_MODEL_DIR = "./pruned_model"

print("=" * 60)
print("Loading Pruned Sparse Model")
print("=" * 60)

# Check if pruned model exists
if not os.path.exists(PRUNED_MODEL_DIR):
    print(f"❌ Pruned model not found at {PRUNED_MODEL_DIR}")
    print("   Please run 02-Prune.ipynb first!")
else:
    print(f"✅ Found pruned model directory\n")
    
    # Load pruning configuration
    config_path = os.path.join(PRUNED_MODEL_DIR, "pruning_config.json")
    if os.path.exists(config_path):
        with open(config_path, 'r') as f:
            pruning_config = json.load(f)
        
        print("📊 Pruning Configuration:")
        print(f"   Method: {pruning_config['method']}")
        print(f"   Target sparsity: {pruning_config['target_sparsity']:.1%}")
        print(f"   Achieved sparsity: {pruning_config['achieved_sparsity']:.2%}")
        print(f"   Pruned layers: {pruning_config['pruned_layers']}")
        print()
    
    # Load tokenizer
    print("⏳ Loading tokenizer...")
    sparse_tokenizer = AutoTokenizer.from_pretrained(PRUNED_MODEL_DIR)
    sparse_tokenizer.pad_token = sparse_tokenizer.eos_token
    print("✅ Tokenizer loaded")
    
    # Load model
    print("⏳ Loading sparse model (may take 1-2 minutes)...")
    sparse_model = AutoModelForCausalLM.from_pretrained(
        PRUNED_MODEL_DIR,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    print("✅ Sparse model loaded")
    
    # Memory usage
    if torch.cuda.is_available():
        memory_allocated = torch.cuda.memory_allocated() / 1e9
        print(f"\n🖥️  GPU Memory: {memory_allocated:.2f} GB")

print("=" * 60)

---
## Step 2: Load Dense Baseline Model (for comparison)

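Load the original dense model to compare against. `MODEL_NAME` should be the same checkpoint that was pruned in 02-Prune.ipynb; otherwise the comparison mixes pruning effects with model differences.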
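
Both models stay in GPU memory at the same time, so it helps to estimate the footprint before loading. This is a rough sketch that counts weights only (no activations or KV cache); the 7-billion parameter count is a rounded assumption, and `weight_memory_gb` is an illustrative helper.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Assumption: roughly 7 billion parameters, stored as FP16 (2 bytes each).
one_model = weight_memory_gb(7e9)
print(f"One model:  ~{one_model:.0f} GB")
print(f"Two models: ~{2 * one_model:.0f} GB")
```

Because the pruned checkpoint is stored in dense format, it occupies about as much memory as the dense one. On smaller GPUs the TinyLlama alternative may be the more practical choice.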

In [None]:
MODEL_NAME = "meta-llama/Llama-2-7b-hf"
# Alternative: "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print("=" * 60)
print("Loading Dense Baseline Model")
print("=" * 60)
print(f"Model: {MODEL_NAME}\n")

# Load tokenizer
print("⏳ Loading tokenizer...")
dense_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
dense_tokenizer.pad_token = dense_tokenizer.eos_token
print("✅ Tokenizer loaded")

# Load model
print("⏳ Loading dense model (may take 1-2 minutes)...")
dense_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto"
)
print("✅ Dense model loaded")

# Memory usage (cumulative: the sparse model from Step 1 is still loaded)
if torch.cuda.is_available():
    memory_allocated = torch.cuda.memory_allocated() / 1e9
    print(f"\n🖥️  GPU Memory (sparse + dense): {memory_allocated:.2f} GB")

print("=" * 60)

---
## Step 3: Define Test Prompts

Create diverse test prompts to evaluate different capabilities.

In [None]:
# Test prompts covering different tasks
test_prompts = [
    {
        "name": "Creative Writing",
        "prompt": "Write a short story about a robot learning to paint:",
        "max_tokens": 100
    },
    {
        "name": "Factual Knowledge",
        "prompt": "Explain the process of photosynthesis in plants:",
        "max_tokens": 80
    },
    {
        "name": "Reasoning",
        "prompt": "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly?",
        "max_tokens": 60
    },
    {
        "name": "Code Generation",
        "prompt": "Write a Python function to calculate the Fibonacci sequence:",
        "max_tokens": 80
    },
    {
        "name": "Conversation",
        "prompt": "What are the benefits of regular exercise?",
        "max_tokens": 80
    }
]

print("=" * 60)
print("Test Prompts Defined")
print("=" * 60)
print(f"\nTotal test cases: {len(test_prompts)}\n")
for i, test in enumerate(test_prompts, 1):
    print(f"{i}. {test['name']}")
    print(f"   Prompt: {test['prompt'][:60]}...")
    print(f"   Max tokens: {test['max_tokens']}\n")

print("=" * 60)

---
## Step 4: Generate Outputs (Dense vs Sparse)

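Run both models on all test prompts and collect outputs. Sampling is enabled, so outputs and token counts vary between runs and between models; when token counts differ, throughput (tok/s) is a fairer comparison than raw latency.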
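
Single-shot timings are noisy, and the first GPU call often includes one-off initialization cost. A minimal timing sketch with a warm-up run and a median over repeats follows; `time_call` and `fake_generate` are hypothetical names, and the stand-in workload replaces a real `generate` call so the block runs anywhere.

```python
import statistics
import time
from typing import Callable, List

def time_call(fn: Callable[[], int], warmup: int = 1, repeats: int = 3) -> dict:
    """Time fn, which must return the number of newly generated tokens.

    Warm-up runs are discarded; medians over the remaining runs are reported.
    """
    for _ in range(warmup):
        fn()
    latencies: List[float] = []
    token_counts: List[int] = []
    for _ in range(repeats):
        start = time.perf_counter()
        new_tokens = fn()
        latencies.append(time.perf_counter() - start)
        token_counts.append(new_tokens)
    median_latency = statistics.median(latencies)
    median_tokens = statistics.median(token_counts)
    return {
        "latency_s": median_latency,
        "new_tokens": median_tokens,
        "tokens_per_s": median_tokens / median_latency,
    }

# Stand-in workload (hypothetical): pretends to generate 20 tokens.
def fake_generate() -> int:
    time.sleep(0.01)
    return 20

stats = time_call(fake_generate)
print(stats)
```

For real GPU measurements, calling `torch.cuda.synchronize()` before reading the clock helps ensure that queued kernels have finished.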

In [None]:
import time

# Generation configuration
generation_config = {
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.9,
    "repetition_penalty": 1.1
}

print("=" * 60)
print("Running Inference Comparison")
print("=" * 60)
print("\n⏳ Generating outputs...\n")

# Store results
results = []

for i, test in enumerate(test_prompts, 1):
    print(f"\n{'='*60}")
    print(f"Test Case {i}: {test['name']}")
    print(f"{'='*60}")
    print(f"Prompt: {test['prompt']}\n")
    
    result = {
        "name": test['name'],
        "prompt": test['prompt']
    }
    
    # --- Dense Model ---
    print("[Dense Model]")
    inputs = dense_tokenizer(test['prompt'], return_tensors="pt").to(dense_model.device)
    
    start_time = time.time()
    with torch.no_grad():
        outputs = dense_model.generate(
            **inputs,
            max_new_tokens=test['max_tokens'],
            pad_token_id=dense_tokenizer.eos_token_id,
            **generation_config
        )
    dense_latency = time.time() - start_time
    
    dense_output = dense_tokenizer.decode(outputs[0], skip_special_tokens=True)
    dense_tokens = outputs[0].shape[0] - inputs["input_ids"].shape[1]  # newly generated tokens only
    
    print(f"Output: {dense_output}")
    print(f"⏱️  Latency: {dense_latency:.2f}s | New tokens: {dense_tokens} | Speed: {dense_tokens/dense_latency:.2f} tok/s\n")
    
    result["dense_output"] = dense_output
    result["dense_latency"] = dense_latency
    result["dense_tokens"] = dense_tokens
    
    # --- Sparse Model ---
    print("[Sparse Model (50% pruned)]")
    inputs = sparse_tokenizer(test['prompt'], return_tensors="pt").to(sparse_model.device)
    
    start_time = time.time()
    with torch.no_grad():
        outputs = sparse_model.generate(
            **inputs,
            max_new_tokens=test['max_tokens'],
            pad_token_id=sparse_tokenizer.eos_token_id,
            **generation_config
        )
    sparse_latency = time.time() - start_time
    
    sparse_output = sparse_tokenizer.decode(outputs[0], skip_special_tokens=True)
    sparse_tokens = outputs[0].shape[0] - inputs["input_ids"].shape[1]  # newly generated tokens only
    
    print(f"Output: {sparse_output}")
    print(f"⏱️  Latency: {sparse_latency:.2f}s | Tokens: {sparse_tokens} | Speed: {sparse_tokens/sparse_latency:.2f} tok/s\n")
    
    result["sparse_output"] = sparse_output
    result["sparse_latency"] = sparse_latency
    result["sparse_tokens"] = sparse_tokens
    
    # Performance comparison
    speedup = dense_latency / sparse_latency
    result["speedup"] = speedup
    
    print(f"📊 Comparison:")
    print(f"   Speedup: {speedup:.2f}x {'(sparse faster)' if speedup > 1 else '(dense faster)'}")
    print(f"   Latency difference: {abs(dense_latency - sparse_latency):.2f}s")
    
    results.append(result)

print("\n" + "=" * 60)
print("✅ All test cases completed!")
print("=" * 60)

---
## Step 5: Side-by-Side Output Comparison

Display outputs in a readable comparison format.
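
Alongside reading the outputs, a rough lexical overlap score can flag test cases where the two models diverge strongly. The sketch below uses the standard library's `difflib`; `word_overlap` is a hypothetical helper and the two sentences are made-up examples. With sampling enabled, low overlap is expected even between two runs of the same model, so the score is only a coarse signal.

```python
from difflib import SequenceMatcher

def word_overlap(text_a: str, text_b: str) -> float:
    """Word-level similarity in [0, 1] based on matching subsequences."""
    words_a, words_b = text_a.split(), text_b.split()
    return SequenceMatcher(None, words_a, words_b).ratio()

# Made-up example outputs, for illustration only.
dense_text = "Plants convert sunlight into chemical energy"
sparse_text = "Plants convert sunlight into stored energy"
overlap = word_overlap(dense_text, sparse_text)
print(f"Word overlap: {overlap:.2f}")  # prints 0.83
```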

In [None]:
print("=" * 80)
print("SIDE-BY-SIDE OUTPUT COMPARISON")
print("=" * 80)

for i, result in enumerate(results, 1):
    print(f"\n{'='*80}")
    print(f"Test Case {i}: {result['name']}")
    print(f"{'='*80}")
    print(f"\n📝 Prompt:\n{result['prompt']}\n")
    
    print(f"{'─'*80}")
    print("🟢 DENSE MODEL OUTPUT:")
    print(f"{'─'*80}")
    print(result['dense_output'])
    print(f"\n⏱️  {result['dense_latency']:.2f}s | {result['dense_tokens']} tokens | {result['dense_tokens']/result['dense_latency']:.2f} tok/s")
    
    print(f"\n{'─'*80}")
    print("🔵 SPARSE MODEL OUTPUT (50% pruned):")
    print(f"{'─'*80}")
    print(result['sparse_output'])
    print(f"\n⏱️  {result['sparse_latency']:.2f}s | {result['sparse_tokens']} tokens | {result['sparse_tokens']/result['sparse_latency']:.2f} tok/s")
    
    print(f"\n{'─'*80}")
    print("📊 COMPARISON:")
    print(f"{'─'*80}")
    print(f"Speedup: {result['speedup']:.2f}x {'✅ (sparse faster)' if result['speedup'] > 1 else '⚠️  (dense faster)'}")
    print(f"Latency diff: {abs(result['dense_latency'] - result['sparse_latency']):.2f}s")

print("\n" + "=" * 80)

---
## Step 6: Performance Summary

Aggregate statistics across all test cases.
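
Speedups are ratios, and the arithmetic mean of ratios can overstate the overall effect. The geometric mean is a common alternative; the sketch below contrasts the two on hypothetical values, not on measured results.

```python
import statistics

# Hypothetical per-prompt speedups (dense latency / sparse latency).
example_speedups = [2.0, 0.5]

arith = statistics.mean(example_speedups)
geo = statistics.geometric_mean(example_speedups)
print(f"Arithmetic mean: {arith:.2f}x")  # prints 1.25x
print(f"Geometric mean:  {geo:.2f}x")   # prints 1.00x
```

One prompt twice as fast and one twice as slow cancel out under the geometric mean, while the arithmetic mean suggests a 25% gain.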

In [None]:
import numpy as np

# Calculate aggregate statistics
dense_latencies = [r['dense_latency'] for r in results]
sparse_latencies = [r['sparse_latency'] for r in results]
speedups = [r['speedup'] for r in results]

dense_tokens = [r['dense_tokens'] for r in results]
sparse_tokens = [r['sparse_tokens'] for r in results]

dense_throughputs = [t/l for t, l in zip(dense_tokens, dense_latencies)]
sparse_throughputs = [t/l for t, l in zip(sparse_tokens, sparse_latencies)]

print("=" * 60)
print("PERFORMANCE SUMMARY")
print("=" * 60)

print("\n📊 Latency Statistics:")
print(f"   Dense Model:")
print(f"      Mean: {np.mean(dense_latencies):.2f}s")
print(f"      Std:  {np.std(dense_latencies):.2f}s")
print(f"      Min:  {np.min(dense_latencies):.2f}s")
print(f"      Max:  {np.max(dense_latencies):.2f}s")

print(f"\n   Sparse Model (50% pruned):")
print(f"      Mean: {np.mean(sparse_latencies):.2f}s")
print(f"      Std:  {np.std(sparse_latencies):.2f}s")
print(f"      Min:  {np.min(sparse_latencies):.2f}s")
print(f"      Max:  {np.max(sparse_latencies):.2f}s")

print(f"\n📊 Throughput Statistics:")
print(f"   Dense Model:")
print(f"      Mean: {np.mean(dense_throughputs):.2f} tok/s")
print(f"      Std:  {np.std(dense_throughputs):.2f} tok/s")

print(f"\n   Sparse Model (50% pruned):")
print(f"      Mean: {np.mean(sparse_throughputs):.2f} tok/s")
print(f"      Std:  {np.std(sparse_throughputs):.2f} tok/s")

print(f"\n📊 Speedup Statistics:")
print(f"   Mean speedup: {np.mean(speedups):.2f}x")
print(f"   Std:          {np.std(speedups):.2f}x")
print(f"   Min speedup:  {np.min(speedups):.2f}x")
print(f"   Max speedup:  {np.max(speedups):.2f}x")

avg_speedup = np.mean(speedups)
if avg_speedup > 1.1:
    print(f"\n✅ Sparse model is {avg_speedup:.2f}x faster on average!")
elif avg_speedup < 0.9:
    print(f"\n⚠️  Dense model is {1/avg_speedup:.2f}x faster on average")
    print("   (Expected: Sparse needs hardware acceleration for speedup)")
else:
    print(f"\n⚖️  Performance is similar (speedup: {avg_speedup:.2f}x)")
    print("   (Expected: Without sparse acceleration, performance is similar)")

print("=" * 60)

---
## Step 7: Quality Assessment

Subjective quality evaluation across test cases.
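
Human judgment remains the main tool here, but a simple automatic proxy can help spot one typical failure of heavily pruned models: repetitive loops. The sketch below computes a distinct-n ratio (unique n-grams divided by total n-grams). It is an added illustration rather than part of the lab's evaluation, and the example strings are made up.

```python
def distinct_n(text: str, n: int = 2) -> float:
    """Ratio of unique word n-grams to total word n-grams (0.0 if none)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# Made-up examples: one varied sentence, one degenerate loop.
varied = "the robot picked up a brush and began to paint"
looping = "the robot the robot the robot the robot"
print(f"varied:  {distinct_n(varied):.2f}")   # prints 1.00
print(f"looping: {distinct_n(looping):.2f}")  # prints 0.29
```

A markedly lower score for the sparse output than for the dense output on the same prompt is a hint to read that pair more closely.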

In [None]:
print("=" * 60)
print("QUALITY ASSESSMENT")
print("=" * 60)

print("\n📋 Evaluation Criteria:")
print("   ✅ Excellent: No noticeable degradation")
print("   🟡 Good:      Minor quality loss, acceptable")
print("   ⚠️  Fair:      Noticeable degradation, needs improvement")
print("   ❌ Poor:      Significant quality loss\n")

print("="*60)
print("Manual Quality Comparison:")
print("="*60)

for i, result in enumerate(results, 1):
    print(f"\n{i}. {result['name']}:")
    print(f"   Prompt length: {len(result['prompt'])} chars")
    print(f"   Dense output: {len(result['dense_output'])} chars")
    print(f"   Sparse output: {len(result['sparse_output'])} chars")
    print(f"   Output length ratio: {len(result['sparse_output'])/len(result['dense_output']):.2f}")
    print(f"   Quality: [To be assessed by human evaluation]")

print("\n" + "="*60)
print("📝 Expected Quality (50% Wanda Pruning):")
print("="*60)
print("According to the Wanda paper:")
print("   Perplexity increase: +7.7% (5.68 → 6.12 on WikiText-2)")
print("   Generation quality: Minor degradation")
print("   Coherence: Mostly maintained")
print("   Task performance: ~90-95% of dense model")
print("\nRecommendation:")
print("   50% sparsity offers good balance between compression and quality")
print("   For production: Consider 40% sparsity for better quality")
print("   For extreme compression: 60% sparsity (with higher quality loss)")
print("="*60)

---
## Step 8: Comparison Visualization

Visualize performance and quality metrics.

In [None]:
import matplotlib.pyplot as plt

print("=" * 60)
print("Performance Visualization")
print("=" * 60)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

test_names = [r['name'] for r in results]
x_pos = np.arange(len(test_names))

# Plot 1: Latency Comparison
ax = axes[0, 0]
width = 0.35
ax.bar(x_pos - width/2, dense_latencies, width, label='Dense', color='green', alpha=0.7)
ax.bar(x_pos + width/2, sparse_latencies, width, label='Sparse (50%)', color='blue', alpha=0.7)
ax.set_xlabel('Test Case', fontsize=10)
ax.set_ylabel('Latency (seconds)', fontsize=10)
ax.set_title('Latency Comparison', fontsize=12, fontweight='bold')
ax.set_xticks(x_pos)
ax.set_xticklabels(test_names, rotation=45, ha='right', fontsize=8)
ax.legend()
ax.grid(axis='y', alpha=0.3)

# Plot 2: Throughput Comparison
ax = axes[0, 1]
ax.bar(x_pos - width/2, dense_throughputs, width, label='Dense', color='green', alpha=0.7)
ax.bar(x_pos + width/2, sparse_throughputs, width, label='Sparse (50%)', color='blue', alpha=0.7)
ax.set_xlabel('Test Case', fontsize=10)
ax.set_ylabel('Throughput (tokens/sec)', fontsize=10)
ax.set_title('Throughput Comparison', fontsize=12, fontweight='bold')
ax.set_xticks(x_pos)
ax.set_xticklabels(test_names, rotation=45, ha='right', fontsize=8)
ax.legend()
ax.grid(axis='y', alpha=0.3)

# Plot 3: Speedup by Test Case
ax = axes[1, 0]
colors = ['green' if s > 1 else 'orange' for s in speedups]
ax.bar(x_pos, speedups, color=colors, alpha=0.7)
ax.axhline(y=1.0, color='red', linestyle='--', label='Baseline (1.0x)', linewidth=2)
ax.set_xlabel('Test Case', fontsize=10)
ax.set_ylabel('Speedup (dense latency / sparse latency)', fontsize=10)
ax.set_title('Speedup Analysis (>1 = sparse faster)', fontsize=12, fontweight='bold')
ax.set_xticks(x_pos)
ax.set_xticklabels(test_names, rotation=45, ha='right', fontsize=8)
ax.legend()
ax.grid(axis='y', alpha=0.3)

# Plot 4: Output Length Comparison
ax = axes[1, 1]
output_length_ratios = [len(r['sparse_output'])/len(r['dense_output']) for r in results]
ax.bar(x_pos, output_length_ratios, color='steelblue', alpha=0.7)
ax.axhline(y=1.0, color='red', linestyle='--', label='Same length', linewidth=2)
ax.set_xlabel('Test Case', fontsize=10)
ax.set_ylabel('Output Length Ratio', fontsize=10)
ax.set_title('Output Length (sparse/dense)', fontsize=12, fontweight='bold')
ax.set_xticks(x_pos)
ax.set_xticklabels(test_names, rotation=45, ha='right', fontsize=8)
ax.legend()
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(PRUNED_MODEL_DIR, 'inference_comparison.png'), dpi=150, bbox_inches='tight')
plt.show()

print(f"\n✅ Visualization saved to {PRUNED_MODEL_DIR}/inference_comparison.png")
print("=" * 60)

---
## Step 9: Save Comparison Results

Export results for further analysis.

In [None]:
import pandas as pd

print("=" * 60)
print("Saving Comparison Results")
print("=" * 60)

# Create DataFrame
comparison_df = pd.DataFrame([
    {
        'Test Case': r['name'],
        'Dense Latency (s)': round(r['dense_latency'], 2),
        'Sparse Latency (s)': round(r['sparse_latency'], 2),
        'Dense Throughput (tok/s)': round(r['dense_tokens'] / r['dense_latency'], 2),
        'Sparse Throughput (tok/s)': round(r['sparse_tokens'] / r['sparse_latency'], 2),
        'Speedup (x)': round(r['speedup'], 2),  # kept numeric so the CSV can be sorted and plotted
        'Dense Tokens': r['dense_tokens'],
        'Sparse Tokens': r['sparse_tokens']
    }
    for r in results
])

# Save to CSV
csv_path = os.path.join(PRUNED_MODEL_DIR, 'inference_comparison.csv')
comparison_df.to_csv(csv_path, index=False)
print(f"✅ CSV saved to {csv_path}")

# Display table
print("\n📊 Comparison Table:")
print(comparison_df.to_string(index=False))

# Save detailed results (JSON)
json_path = os.path.join(PRUNED_MODEL_DIR, 'inference_results.json')
with open(json_path, 'w') as f:
    json.dump(results, f, indent=2)
print(f"\n✅ Detailed results saved to {json_path}")

print("=" * 60)

---
## ✅ Inference Comparison Complete!

**Summary**:
- ✅ Loaded dense and sparse (50% pruned) models
- ✅ Ran 5 diverse test cases
- ✅ Compared outputs side-by-side
- ✅ Analyzed performance metrics (latency, throughput, speedup)
- ✅ Assessed generation quality
- ✅ Visualized comparison results
- ✅ Saved results to CSV and JSON

**Key Findings**:
- **Sparsity**: 50% of weights pruned
- **Quality**: Minor degradation (expected +7.7% perplexity)
- **Performance**: Similar latency (without sparse hardware acceleration)
- **Speedup**: ~1.0x (real gains need a 2:4 structured sparsity pattern plus hardware that accelerates it, e.g. NVIDIA A100; 2x is the theoretical peak)
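
The 2:4 pattern mentioned above means that every aligned group of four consecutive weights holds at most two non-zero values. A short sketch of that check on plain Python lists follows; `is_two_four_sparse` is a hypothetical helper and the rows are made-up data.

```python
from typing import Sequence

def is_two_four_sparse(row: Sequence[float]) -> bool:
    """True if every aligned group of 4 values contains at least 2 zeros."""
    if len(row) % 4 != 0:
        return False
    return all(
        sum(1 for value in row[i:i + 4] if value == 0) >= 2
        for i in range(0, len(row), 4)
    )

# Both rows are 50% zeros, but only the first follows the 2:4 pattern.
structured = [0.0, 1.5, 0.0, -0.3, 0.7, 0.0, 0.0, 2.1]
unstructured = [0.0, 0.0, 0.0, 0.0, 0.7, 1.2, 0.4, 2.1]
print(is_two_four_sparse(structured))    # prints True
print(is_two_four_sparse(unstructured))  # prints False
```

The second row shows why 50% overall sparsity does not by itself qualify for 2:4 acceleration.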

**Important Notes**:
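1. **Hardware Acceleration**: Sparse models need hardware and kernels that exploit sparsity for actual speedup. GPUs such as the NVIDIA A100 accelerate the 2:4 structured pattern, so the weights must follow that pattern; unstructured sparsity does not benefit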
2. **Storage Format**: Current implementation stores sparse model in dense format (no size reduction)
3. **Quality-Sparsity Tradeoff**: 50% sparsity offers good balance; adjust based on requirements
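
Whether a compressed sparse format actually shrinks the model depends on index overhead. The estimate below compares dense FP16 storage with a CSR layout; the 4096 x 4096 matrix size and the 4-byte index width are illustrative assumptions, and both helpers are hypothetical.

```python
def dense_bytes(rows: int, cols: int, value_bytes: int = 2) -> int:
    """Bytes for a dense matrix with fixed-width values."""
    return rows * cols * value_bytes

def csr_bytes(rows: int, cols: int, sparsity: float,
              value_bytes: int = 2, index_bytes: int = 4) -> int:
    """Bytes for CSR: values + column indices + row pointers."""
    nnz = round(rows * cols * (1.0 - sparsity))
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes

rows = cols = 4096  # illustrative layer shape
dense = dense_bytes(rows, cols)
csr = csr_bytes(rows, cols, sparsity=0.5)
print(f"Dense FP16: {dense:,} bytes")
print(f"CSR @ 50%:  {csr:,} bytes ({csr / dense:.2f}x dense)")
```

Under these assumptions CSR at 50% sparsity is about 1.5x larger than the dense matrix, and it only breaks even once sparsity exceeds roughly two thirds.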

**Next Steps**:
1. Proceed to **04-Benchmark.ipynb** for comprehensive performance analysis
2. Run perplexity evaluation on WikiText-2
3. Profile memory usage and latency distribution
4. Experiment with compressed sparse storage (CSR/COO), checking whether index overhead cancels the size savings at 50% sparsity
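
For the perplexity evaluation in step 2, the metric itself is the exponential of the mean per-token negative log-likelihood. A minimal sketch of that formula, using hypothetical loss values in place of real model outputs:

```python
import math
from typing import Sequence

def perplexity(token_nlls: Sequence[float]) -> float:
    """exp(mean negative log-likelihood), with natural-log losses."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical case: the model assigns probability 1/4 to every token,
# so each token's loss is ln(4) and the perplexity is about 4.
nlls = [math.log(4.0)] * 3
ppl = perplexity(nlls)
print(f"Perplexity: {ppl:.2f}")  # prints 4.00
```

In practice the per-token losses would come from running each model over the WikiText-2 test split.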

**Files Created**:
- `pruned_model/inference_comparison.png`: Performance visualizations
- `pruned_model/inference_comparison.csv`: Summary table
- `pruned_model/inference_results.json`: Detailed results

---

**⏭️ Continue to**: [04-Benchmark.ipynb](./04-Benchmark.ipynb)