# Phase 1B: Automated GPT-4 Benchmarking

**Run this after Phase 1A training completes**

This notebook:
1. Loads your trained model
2. Compares it against GPT-4 on 6 categories
3. Identifies failure patterns
4. Saves failure examples to seed a targeted distillation dataset (Phase 1C)

**Prerequisites:**
- Completed Phase 1A training
- OpenAI API key
- ~$10-20 in API credits

## Step 1: Setup Environment

In [None]:
# Install required packages
!pip install openai datasets transformers accelerate bitsandbytes

## Step 2: Configure API Key

In [None]:
import os
from getpass import getpass

# Enter your OpenAI API key
OPENAI_API_KEY = getpass("Enter OpenAI API key: ")
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

print("✅ API key configured")

## Step 3: Run Quick Benchmark (50 samples/category)

**This will take ~30-60 minutes and cost ~$5-10**
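
As a rough sanity check on these figures, the quoted totals can be divided by the number of comparisons in the run. The sketch below assumes one model-vs-GPT-4 comparison per sample and uses the five categories and 50 samples per category configured in the benchmark cell; the variable names are illustrative and not part of the benchmark suite.

```python
# Back-of-envelope cost and time per comparison for the quick benchmark.
# Assumption: one model-vs-GPT-4 comparison per sample.
categories = ["math", "code", "reasoning", "knowledge", "instruction"]
samples_per_category = 50

total_comparisons = len(categories) * samples_per_category

# Ranges quoted in this notebook for the whole run (USD and minutes)
cost_low, cost_high = 5.0, 10.0
minutes_low, minutes_high = 30, 60

cost_per_comparison = (cost_low / total_comparisons,
                       cost_high / total_comparisons)
seconds_per_comparison = (minutes_low * 60 / total_comparisons,
                          minutes_high * 60 / total_comparisons)

print(f"Comparisons: {total_comparisons}")
print(f"Cost per comparison: ${cost_per_comparison[0]:.2f}-${cost_per_comparison[1]:.2f}")
print(f"Time per comparison: {seconds_per_comparison[0]:.1f}-{seconds_per_comparison[1]:.1f} s")
```

If a dry run with a smaller `samples_per_category` costs noticeably more per comparison than this, the full run will likely exceed the estimate as well.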

In [None]:
import sys
sys.path.append('/data/Cogumi-LLM')

from scripts.automated_gpt4_benchmark import BenchmarkSuite

# Initialize benchmark
suite = BenchmarkSuite(
    model_path="/data/Cogumi-LLM/checkpoints/final",
    openai_key=OPENAI_API_KEY,
    output_dir="./benchmark_results"
)

# Run on key categories
suite.run_full_benchmark(
    categories=["math", "code", "reasoning", "knowledge", "instruction"],
    samples_per_category=50
)

## Step 4: Analyze Results

In [None]:
import json
from pathlib import Path
import pandas as pd

# Load latest report
reports = sorted(Path("./benchmark_results").glob("benchmark_report_*.json"))
if not reports:
    print("❌ No benchmark reports found")
else:
    with open(reports[-1]) as f:
        report = json.load(f)
    
    # Display overall score
    print(f"\n{'='*60}")
    print(f"Overall Score vs GPT-4: {report['overall']['score_vs_gpt4']:.1f}%")
    print(f"Rating: {report['overall']['performance_rating']}")
    print(f"{'='*60}\n")
    
    # Create a DataFrame with one row per category
    df = pd.DataFrame.from_dict(report['by_category'], orient='index')
    df = df.sort_values('score', ascending=False)
    
    print("\nResults by Category:")
    print(df[['score', 'wins', 'losses', 'ties']])
    
    # Identify weak areas (< 85%)
    weak_categories = df[df['score'] < 85].index.tolist()
    
    if weak_categories:
        print("\n⚠️ Weak areas (need GPT-5 distillation):")
        for cat in weak_categories:
            print(f"   - {cat}: {df.loc[cat, 'score']:.1f}%")
    else:
        print("\n✅ All categories above 85% - excellent performance!")

## Step 5: Identify Failure Patterns

In [None]:
from collections import Counter

# Collect every comparison that GPT-4 won.
# Requires the `suite` object from Step 3 to still be in memory.
failure_examples = []

for category, category_results in suite.results.items():
    for result in category_results['details']:
        if result['judgment']['winner'] == 'B':  # GPT-4 won
            failure_examples.append({
                'category': category,
                'prompt': result['prompt'],
                'local_response': result['local_response'],
                'gpt4_response': result['gpt4_response'],
                'reasoning': result['judgment'].get('reasoning', '')
            })

print(f"\n📊 Found {len(failure_examples)} failures")
print("\nBreakdown by category:")

# Count failures per category, most frequent first
category_counts = Counter(ex['category'] for ex in failure_examples)
for cat, count in category_counts.most_common():
    print(f"   {cat}: {count} failures")

# Save failures for Phase 1C
with open('./benchmark_results/failure_examples.json', 'w') as f:
    json.dump(failure_examples, f, indent=2)

print("\n✅ Failure examples saved to: ./benchmark_results/failure_examples.json")
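
In a later session, for example when starting Phase 1C, the saved file can be reloaded without rerunning the benchmark. This is a minimal sketch that assumes the JSON layout written by the cell above; `load_failures_by_category` is a hypothetical helper, not part of `scripts.automated_gpt4_benchmark`.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_failures_by_category(path):
    """Group saved failure examples by their 'category' field."""
    with open(path) as f:
        examples = json.load(f)
    grouped = defaultdict(list)
    for example in examples:
        grouped[example['category']].append(example)
    return dict(grouped)


failures_path = Path("./benchmark_results/failure_examples.json")
if failures_path.exists():
    failures_by_category = load_failures_by_category(failures_path)
    for category, examples in failures_by_category.items():
        print(f"{category}: {len(examples)} failure prompts")
else:
    print(f"No saved failures at {failures_path}; run Step 5 first")
```

Grouping by category keeps the prompts for each weak area together, which is convenient when generating targeted examples one category at a time.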

## Next Steps

Based on results:

### If Score ≥ 90%:
✅ **Proceed to Phase 2: Compression**
- Your model is ready for pruning and quantization

### If Score ≥ 85% and < 90%:
📊 **Optional: Phase 1C Distillation**
- Generate 10K targeted examples for weak areas
- Quick fine-tune to push above 90%

### If Score < 85%:
⚠️ **Required: Phase 1C Distillation**
- Generate 40K targeted examples
- Focus on weak categories identified above
- Target: 88-100% of the GPT-4 baseline
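
The thresholds above can be expressed as a small helper. This is a sketch only: `next_step` is a hypothetical function name, and it assumes a score of exactly 90 belongs to the top tier, following the "≥ 90%" heading.

```python
def next_step(score_vs_gpt4):
    """Map the overall score (percent of the GPT-4 baseline) to the next phase."""
    if score_vs_gpt4 >= 90:
        return "Phase 2: Compression"
    if score_vs_gpt4 >= 85:
        return "Phase 1C Distillation (optional, 10K targeted examples)"
    return "Phase 1C Distillation (required, 40K targeted examples)"


# Example scores, one per tier (illustrative values, not benchmark results)
for score in (92.5, 87.0, 71.3):
    print(f"{score:.1f}% -> {next_step(score)}")
```

With the report loaded in Step 4, the call would be `next_step(report['overall']['score_vs_gpt4'])`.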