# Phase 1B: Benchmark Trained Model vs GPT-4

**✅ Run this on H100 after Phase 1A training completes**

**What this does:**
1. Loads your trained model from the checkpoint directory (for example `/workspace/data/Cogumi-LLM/checkpoints/final/`; Step 1 checks several candidate paths)
2. Tests on 50 diverse examples per category (6 categories = 300 total)
3. Compares against GPT-4 baseline
4. Generates performance report

**Requirements:**
- ✅ Completed Phase 1A training
- 🔑 OpenAI API key
- 💰 ~$5-10 in OpenAI credits
- ⏱️ ~30-60 minutes runtime

**Expected Results:**
- **Target:** 75-82% of GPT-4 performance
- **Good:** ≥80%
- **Excellent:** ≥85%

---

## Step 1: Verify Environment

Check that your model exists and environment is ready.

In [None]:
import os
from pathlib import Path

# Check model exists - try multiple possible locations
possible_paths = [
    "/workspace/data/Cogumi-LLM/checkpoints/final",           # Your actual location!
    "/workspace/data/Cogumi-LLM/checkpoints/checkpoint-240240",
    "/workspace/Cogumi-LLM/checkpoints/final",                # Alternative paths
    "/data/Cogumi-LLM/checkpoints/final",
    "/workspace/checkpoints/final",
    "/data/Cogumi-LLM/checkpoints/checkpoint-240240",
]

model_path = None
for path in possible_paths:
    if os.path.exists(path):
        model_path = path
        print(f"✅ Model found at: {model_path}")
        break

if model_path:
    print(f"\n📁 Model contents:")
    !ls -lh {model_path}
    
    # Show disk info
    print(f"\n📊 Disk usage:")
    !df -h {model_path}
    
    # Check model size
    print(f"\n📦 Total model size:")
    !du -sh {model_path}
else:
    print("❌ ERROR: Model not found in any expected location!")
    print("\n🔍 Checked these locations:")
    for path in possible_paths:
        print(f"   ❌ {path}")
    print("\n💡 Let's find where your model actually is:")
    print("   Run this in a new cell: !find /workspace -name 'adapter_model.safetensors' 2>/dev/null")
    raise FileNotFoundError("Model not found at any expected path")
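
Before loading, it can help to confirm that the directory actually contains adapter weights, not just that it exists. The following is a minimal sketch, assuming the checkpoint is a PEFT-style LoRA adapter (the `find` hint above looks for `adapter_model.safetensors`). The `adapter_config.json` file name and the helper `missing_adapter_files` are assumptions for illustration, not part of the project's scripts:

```python
from pathlib import Path

# Assumed file names for a PEFT-style LoRA adapter checkpoint; adjust if
# your training run saved a different layout.
EXPECTED_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(checkpoint_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_adapter_files(model_path)` should return an empty list when both files are present; anything else suggests an incomplete or differently structured checkpoint.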


## Step 2: Verify Dependencies

Check that PyTorch, Unsloth, and other packages are installed correctly.

In [None]:
# Verify all required packages are installed
import sys

print("🔍 Checking installed packages...")
print("="*70)

try:
    import torch
    print(f"✅ PyTorch: {torch.__version__}")
    print(f"   CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"   GPU: {torch.cuda.get_device_name(0)}")
        print(f"   CUDA version: {torch.version.cuda}")
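except ImportError as e:
    print(f"❌ PyTorch not installed: {e}")
    # Raise instead of sys.exit(): in a notebook this stops the cell with a clear error
    raise RuntimeError("PyTorch is required for this benchmark") from e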

try:
    import unsloth
    print(f"✅ Unsloth: Installed")
except ImportError:
    print("❌ Unsloth not installed - installing now...")
    !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

try:
    import transformers
    print(f"✅ Transformers: {transformers.__version__}")
except ImportError:
    print("⚠️  Transformers not installed - installing...")
    !pip install transformers

try:
    import openai
    print(f"✅ OpenAI SDK: {openai.__version__}")
except ImportError:
    print("⚠️  OpenAI SDK not installed - installing...")
    !pip install openai

try:
    import datasets
    print(f"✅ Datasets: {datasets.__version__}")
except ImportError:
    print("⚠️  Datasets not installed - installing...")
    !pip install datasets

try:
    import pandas
    print(f"✅ Pandas: {pandas.__version__}")
except ImportError:
    print("⚠️  Pandas not installed - installing...")
    !pip install pandas

try:
    import rich
    print(f"✅ Rich: Installed")
except ImportError:
    print("⚠️  Rich not installed - installing...")
    !pip install rich

print("="*70)
print("✅ All core packages verified!")
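
The same check can be expressed as a loop that reports every missing package at once, without importing the heavy libraries. This is a sketch only: `find_missing` is a hypothetical helper, and the list below simply repeats the top-level import names used in the cell above:

```python
import importlib.util


def find_missing(modules):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


# Top-level import names checked by the cell above.
REQUIRED = ["torch", "unsloth", "transformers", "openai", "datasets", "pandas", "rich"]
```

`find_missing(REQUIRED)` returns an empty list when everything is installed. It only looks the modules up, so it does not confirm that CUDA is available or that versions are compatible.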

## Step 3: Configure OpenAI API Key

**Get your API key from:** https://platform.openai.com/api-keys

**Estimated cost:** $5-10 for 300 GPT-4 comparisons

In [None]:
from getpass import getpass
import os

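# Enter your OpenAI API key (paste when prompted; input is hidden)
OPENAI_API_KEY = getpass("Enter OpenAI API key: ").strip()
if not OPENAI_API_KEY:
    raise ValueError("No API key entered - re-run this cell and paste your key")
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY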

print("✅ API key configured")

## Step 4: Initialize Benchmark Suite

Load your model and prepare for benchmarking.

In [None]:
import sys
import os
from pathlib import Path

# Add project directory to path - check multiple locations
project_paths = [
    "/workspace/data/Cogumi-LLM",       # Your actual location!
    "/workspace/Cogumi-LLM",
    "/data/Cogumi-LLM",
    "/workspace",
]

project_dir = None
for proj_path in project_paths:
    scripts_path = os.path.join(proj_path, "scripts", "automated_gpt4_benchmark.py")
    if os.path.exists(scripts_path):
        sys.path.append(proj_path)
        project_dir = proj_path
        print(f"✅ Project found at: {proj_path}")
        print(f"✅ Benchmark script exists: {scripts_path}")
        break

if not project_dir:
    print("⚠️ Warning: Could not find automated_gpt4_benchmark.py")
    print("Attempting to add common paths...")
    sys.path.append("/workspace/data/Cogumi-LLM")

from scripts.automated_gpt4_benchmark import BenchmarkSuite

# Create output directory two levels above the model, alongside the checkpoints/ directory
output_dir = str(Path(model_path).parent.parent / "benchmark_results")
os.makedirs(output_dir, exist_ok=True)

print("\n🔄 Initializing benchmark suite...")
print(f"   Model: {model_path}")
print(f"   Output: {output_dir}")
print("")

# Initialize (this loads your model on GPU)
suite = BenchmarkSuite(
    model_path=model_path,
    openai_key=OPENAI_API_KEY,
    output_dir=output_dir,
    device="auto"  # Uses GPU automatically
)

print("\n✅ Benchmark suite ready!")
print(f"\n📊 Will test on {len(suite.categories)} categories:")
for cat, desc in suite.categories.items():
    print(f"   - {cat}: {desc}")
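
To see where results will land, the output path derivation can be reproduced on its own. A small sketch using `PurePosixPath` so that it touches no files; the helper name `benchmark_output_dir` is hypothetical:

```python
from pathlib import PurePosixPath


def benchmark_output_dir(model_path):
    """Place benchmark_results two levels above the model directory."""
    return str(PurePosixPath(model_path).parent.parent / "benchmark_results")


print(benchmark_output_dir("/workspace/data/Cogumi-LLM/checkpoints/final"))
# /workspace/data/Cogumi-LLM/benchmark_results
```

Any checkpoint inside the same `checkpoints/` directory, such as `checkpoint-240240`, maps to the same output directory.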


## Step 5: Run Benchmark

**This will take ~30-60 minutes**

Tests your model against GPT-4 on:
- 50 samples per category
- 6 categories = 300 total comparisons
- Side-by-side GPT-4 judging

**Progress will be shown below.**

In [None]:
from datetime import datetime

# Safety check - ensure suite is initialized
if 'suite' not in globals():
    print("❌ ERROR: BenchmarkSuite not initialized!")
    print("\n🔧 Please run Step 4 first to initialize the benchmark suite.")
    print("\n📝 Make sure you've:")
    print("   1. Uploaded the fixed automated_gpt4_benchmark.py script")
    print("   2. Successfully run Step 4 (Initialize Benchmark Suite)")
    raise NameError("suite not defined - run Step 4 first")

print("🚀 Starting benchmark...")
print(f"   Start time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"   Categories: {list(suite.categories.keys())}")
print(f"   Samples per category: 50")
print(f"   Total comparisons: 300")
print("")
print("⏳ This will take ~30-60 minutes. Progress shown below:")
print("="*70)

# Run full benchmark
results = suite.run_full_benchmark(
    categories=list(suite.categories.keys()),
    samples_per_category=50
)

print("\n" + "="*70)
print("✅ Benchmark complete!")
print(f"   End time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
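
Because a full run costs real money and time, a short smoke run with fewer samples is a cheap way to catch setup problems first. The sketch below scales this notebook's own estimates ($5-10 and 30-60 minutes for 300 comparisons) linearly. Linear scaling and the helper name `scale_estimate` are assumptions, and model loading time does not shrink with the sample count:

```python
def scale_estimate(samples_per_category, n_categories=6,
                   full_run=300, low=5.0, high=10.0):
    """Linearly scale a (low, high) estimate given for a 300-comparison run."""
    n = samples_per_category * n_categories
    return (low * n / full_run, high * n / full_run)


# Cost in USD and runtime in minutes for a 5-sample-per-category smoke run.
cost = scale_estimate(5)                        # (0.5, 1.0)
minutes = scale_estimate(5, low=30, high=60)    # (3.0, 6.0)
```

A smoke run would then pass `samples_per_category=5` to `suite.run_full_benchmark` before committing to the full 50.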

## Step 6: View Results

Analyze performance vs GPT-4 baseline.

In [None]:
import json
import pandas as pd
from pathlib import Path

# Load latest benchmark report
reports = sorted(Path(output_dir).glob("benchmark_report_*.json"))

if not reports:
    print("❌ No benchmark reports found")
else:
    # Load most recent report
    with open(reports[-1]) as f:
        report = json.load(f)
    
    # Display overall performance
    print("\n" + "="*70)
    print("📊 PHASE 1B BENCHMARK RESULTS")
    print("="*70)
    print(f"\n🎯 Overall Performance: {report['overall']['score_vs_gpt4']:.1f}% of GPT-4")
    print(f"   Rating: {report['overall']['performance_rating']}")
    print(f"   Total comparisons: {report['overall']['total_comparisons']}")
    print("")
    
    # Create detailed table
    df = pd.DataFrame(report['by_category']).T
    df = df.sort_values('score', ascending=False)
    
    print("\n📋 Results by Category:")
    print("="*70)
    for cat in df.index:
        score = df.loc[cat, 'score']
        wins = df.loc[cat, 'wins']
        losses = df.loc[cat, 'losses']
        ties = df.loc[cat, 'ties']
        
        # Status indicator
        if score >= 85:
            status = "✅ Excellent"
        elif score >= 75:
            status = "🟢 Good"
        elif score >= 65:
            status = "🟡 Needs work"
        else:
            status = "🔴 Weak"
        
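        # Cast counts to int: after the transpose they may be stored as floats, which ':2d' rejects
        print(f"{cat:12s}: {score:5.1f}% | W:{int(wins):2d} L:{int(losses):2d} T:{int(ties):2d} | {status}")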
    
    # Identify weak areas
    print("\n" + "="*70)
    weak_categories = df[df['score'] < 75].index.tolist()
    
    if weak_categories:
        print("⚠️  WEAK AREAS (need Phase 1C targeted distillation):")
        for cat in weak_categories:
            print(f"   - {cat}: {df.loc[cat, 'score']:.1f}% (target: 85%+)")
    else:
        print("✅ NO WEAK AREAS - All categories ≥75%!")
    
    print("\n" + "="*70)
    print(f"\n📁 Full report saved to: {reports[-1]}")
    print("="*70)
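
The cell above picks the newest report by sorting file names, which relies on the names sorting chronologically. If that is not guaranteed, selecting by modification time is one alternative. A minimal sketch; `latest_report` is a hypothetical helper, not part of `automated_gpt4_benchmark.py`:

```python
from pathlib import Path


def latest_report(output_dir, pattern="benchmark_report_*.json"):
    """Return the most recently modified report, or None if there are none."""
    reports = list(Path(output_dir).glob(pattern))
    return max(reports, key=lambda p: p.stat().st_mtime, default=None)
```

Returning `None` for an empty directory lets later cells check for a missing report explicitly instead of failing on an undefined `report` variable.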

## Step 7: Analyze Failure Patterns

Identify specific examples where GPT-4 outperformed your model.

In [None]:
from collections import Counter

# Extract failure examples
failure_examples = []

for category, data in suite.results.items():
    if 'details' in data:
        for result in data['details']:
            if result.get('judgment', {}).get('winner') == 'B':  # GPT-4 won
                failure_examples.append({
                    'category': category,
                    'prompt': result['prompt'],
                    'local_response': result['local_response'],
                    'gpt4_response': result['gpt4_response'],
                    'reasoning': result['judgment'].get('reasoning', '')
                })

print(f"\n📊 Failure Analysis")
print("="*70)
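total_comparisons = report['overall']['total_comparisons']
print(f"Total failures: {len(failure_examples)} / {total_comparisons}")
# Guard against an empty report to avoid ZeroDivisionError
failure_rate = len(failure_examples) / total_comparisons * 100 if total_comparisons else 0.0
print(f"Failure rate: {failure_rate:.1f}%")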
print("")

# Breakdown by category
print("Failures by category:")
category_counts = Counter(ex['category'] for ex in failure_examples)
for cat, count in category_counts.most_common():
    pct = count / 50 * 100  # 50 samples per category
    print(f"   {cat:12s}: {count:2d} failures ({pct:.1f}%)")

# Save failures for Phase 1C
failure_file = Path(output_dir) / "failure_examples.json"
with open(failure_file, 'w') as f:
    json.dump(failure_examples, f, indent=2)

print(f"\n✅ Failure examples saved to: {failure_file}")
print("   Use these for Phase 1C targeted distillation")
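
The per-category percentages above assume exactly 50 samples per category. If a run uses a different sample count, the rates can be computed from the actual totals instead. A sketch under that assumption; `failure_rates` is a hypothetical helper:

```python
from collections import Counter


def failure_rates(failures, totals):
    """Map category -> failure percentage, using each category's own sample count."""
    counts = Counter(ex["category"] for ex in failures)
    # Skip categories with zero samples to avoid dividing by zero.
    return {cat: 100.0 * counts[cat] / n for cat, n in totals.items() if n}
```

Here `totals` could be built as `{cat: len(data['details']) for cat, data in suite.results.items() if 'details' in data}`, which mirrors the structure the cell above already reads.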

## Step 8: Sample Failure Examples

Review a few examples where your model lost to GPT-4.

In [None]:
# Show 3 random failure examples
import random

if failure_examples:
    print("\n📝 Sample Failure Examples (3 random):")
    print("="*70)
    
    for i, example in enumerate(random.sample(failure_examples, min(3, len(failure_examples))), 1):
        print(f"\n[Example {i}] Category: {example['category']}")
        print("-"*70)
        print(f"Prompt: {example['prompt'][:200]}...")
        print(f"\nYour model: {example['local_response'][:200]}...")
        print(f"\nGPT-4: {example['gpt4_response'][:200]}...")
        print(f"\nWhy GPT-4 won: {example['reasoning'][:150]}...")
        print("="*70)
else:
    print("\n🎉 No failures - your model matched or beat GPT-4 on all examples!")
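
The preview above appends `...` to every field, even when the text is shorter than the cut-off. A small helper can add the ellipsis only when something was actually removed. A sketch; the name `truncate` is illustrative:

```python
def truncate(text, limit=200):
    """Shorten text to limit characters, adding '...' only if it was cut."""
    return text if len(text) <= limit else text[:limit] + "..."
```

For example, `truncate(example['prompt'])` could replace `example['prompt'][:200]` plus the literal `...` in the print statements.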

## Step 9: Download Results

**Before terminating the instance, download these files:**

In [None]:
print("\n📁 Files to Download (from Jupyter file browser):")
print("="*70)
print(f"\n1. Benchmark report: {reports[-1] if reports else 'N/A'}")
print(f"2. Failure examples: {failure_file}")
print(f"3. All results: {output_dir}/")
print("\n💡 Right-click these folders in Jupyter and select 'Download as Archive'")
print("="*70)
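
If the file browser's archive option is unavailable, the results directory can be bundled into a single file from Python. A minimal sketch using the standard library's `shutil.make_archive`; the helper name `archive_results` is an assumption:

```python
import shutil
from pathlib import Path


def archive_results(output_dir):
    """Create <output_dir>.tar.gz next to output_dir and return its path."""
    out = Path(output_dir)
    return shutil.make_archive(
        str(out), "gztar", root_dir=str(out.parent), base_dir=out.name
    )
```

`archive_results(output_dir)` writes `benchmark_results.tar.gz` beside the results directory, so only one file needs to be downloaded before the instance is terminated.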

---

## 🎯 Next Steps

### ✅ If Overall Score ≥ 80%:
**Proceed to Phase 2: Compression**
- Your model is ready for Neural Magic pruning
- Expected: 10GB → 520MB with minimal quality loss

### 📊 If Overall Score 75% to <80%:
**Optional: Phase 1C Targeted Distillation**
- Generate 10-20K examples for weak categories
- Quick fine-tune to push above 80%
- Cost: ~$50-100, Time: ~4-6 hours

### ⚠️ If Overall Score < 75%:
**Required: Phase 1C Full Distillation**
- Generate 40K GPT-5 examples for weak areas
- Comprehensive fine-tune
- Target: 88-100% of GPT-4 baseline
- Cost: ~$280, Time: ~5 days
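
The three branches above can be written as a single decision function. This is a sketch of the thresholds as stated in this section, with a score of exactly 80 counted as ready for Phase 2; `next_phase` is a hypothetical helper:

```python
def next_phase(overall_score):
    """Map the overall score (percent of GPT-4) to the recommended next step."""
    if overall_score >= 80:
        return "Phase 2: Compression"
    if overall_score >= 75:
        return "Phase 1C: Targeted Distillation (optional)"
    return "Phase 1C: Full Distillation (required)"
```

With the report from Step 6 loaded, `next_phase(report['overall']['score_vs_gpt4'])` gives the recommendation directly.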

---

## 🎊 Congratulations!

You've completed **Phase 1B - Benchmarking**!

**What you achieved:**
- ✅ Trained an 8B model on 640K examples
- ✅ Benchmarked against GPT-4 baseline
- ✅ Identified performance gaps
- ✅ Ready for next phase!

---