# LongBench v2 Evaluation Demo

A scaled-down version of The Token Company's LongBench v2 experiment, run against models served through OpenRouter.

**Features:**
- Multiple compression conditions (baseline, cutoff=0.3, cutoff=0.9)
- Budget-guarded execution (default $10 limit)
- Deterministic sampling with stratification
- Bootstrap confidence intervals
- Filesystem caching to avoid duplicate API calls

## 1. Setup and Dependencies

In [None]:
# Install required packages
!pip install -q requests python-dotenv tiktoken numpy datasets matplotlib

In [None]:
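# Add .env to .gitignore (safe to re-run: only missing entries are appended)
from pathlib import Path

gitignore_entries = [
    ".env",                 # Environment variables
    "runs/",                # Cache and results
    "__pycache__/",         # Python
    "*.pyc",
    ".ipynb_checkpoints/",
]

gitignore_path = Path('.gitignore')
existing = gitignore_path.read_text().splitlines() if gitignore_path.exists() else []
missing = [entry for entry in gitignore_entries if entry not in existing]

if missing:
    with open(gitignore_path, 'a') as f:
        f.write("\n" + "\n".join(missing) + "\n")
print(f"Updated .gitignore ({len(missing)} entries added)")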

In [None]:
import os
import sys
from pathlib import Path

# Load environment variables from .env file
try:
    from dotenv import load_dotenv
    if load_dotenv():
        print("Loaded .env file")
    else:
        print("No variables loaded from .env, using environment variables directly")
except ImportError:
    print("python-dotenv not installed, using environment variables directly")

# Check API key (show only the last 4 characters so the key does not leak into saved notebook output)
API_KEY = os.getenv("OPENROUTER_API_KEY")
if API_KEY:
    print(f"API Key found: ...{API_KEY[-4:]}")
else:
    print("WARNING: OPENROUTER_API_KEY not found!")
    print("Create a .env file with: OPENROUTER_API_KEY=your_key_here")
    print("Or set the environment variable directly.")

## 2. Load Compressor Module

In [None]:
# Import from the evals package
from evals import compress_text, compress_messages, count_tokens, identity_compressor

print("Compressor module loaded!")
print("Available functions: compress_text, compress_messages, count_tokens, identity_compressor")

## 3. Quick Demo: Compress One LongBench Example

In [None]:
# Load one example from LongBench v2
from datasets import load_dataset

print("Loading LongBench v2 dataset (streaming)...")
dataset = load_dataset("zai-org/LongBench-v2", split="train", streaming=True)

# Get first example
example = next(iter(dataset))

print(f"\nExample ID: {example.get('_id', 'N/A')}")
print(f"Domain: {example.get('domain', 'N/A')}")
print(f"Length category: {example.get('length', 'N/A')}")
print(f"Question: {example.get('question', 'N/A')[:100]}...")
print("\nChoices:")
print(f"  A) {example.get('choice_A', 'N/A')[:50]}...")
print(f"  B) {example.get('choice_B', 'N/A')[:50]}...")
print(f"  C) {example.get('choice_C', 'N/A')[:50]}...")
print(f"  D) {example.get('choice_D', 'N/A')[:50]}...")
print(f"\nCorrect Answer: {example.get('answer', 'N/A')}")

In [None]:
# Get context and count tokens
context = example.get('context', '')
original_tokens = count_tokens(context)

print(f"Context length: {len(context):,} characters")
print(f"Context tokens: {original_tokens:,} tokens")
print("\nContext preview (first 500 chars):")
print("-" * 50)
print(context[:500])
print("...")

In [None]:
# Compress with different cutoffs
print("=" * 60)
print("COMPRESSION COMPARISON")
print("=" * 60)

cutoffs = [0.0, 0.3, 0.5, 0.9]

for cutoff in cutoffs:
    if cutoff == 0.0:
        compressed, stats = identity_compressor(context)
        label = "baseline (no compression)"
    else:
        compressed, stats = compress_text(context, importance_cutoff=cutoff)
        label = f"cutoff={cutoff}"
    
    orig = stats['original_tokens']
    comp = stats['compressed_tokens']
    reduction = stats['reduction_ratio']
    
    print(f"\n{label}:")
    print(f"  Original: {orig:,} tokens")
    print(f"  Compressed: {comp:,} tokens")
    print(f"  Reduction: {reduction:.1%}")

In [None]:
# Show before/after for cutoff=0.5
compressed_05, stats_05 = compress_text(context, importance_cutoff=0.5)

print("BEFORE (first 300 chars):")
print("-" * 40)
print(context[:300])

print("\nAFTER compression (cutoff=0.5, first 300 chars):")
print("-" * 40)
print(compressed_05[:300])

## 4. How to Run Evaluation

### CLI Usage

```bash
# Full run with default settings
python -m evals.longbench_eval --model openai/gpt-4o-mini --n 30 --seed 42 --cutoffs 0.3 0.9 --budget-usd 10

# Dry run (no API calls)
python -m evals.longbench_eval --n 10 --no-api

# Custom pricing
python -m evals.longbench_eval --price-input-per-million 0.15 --price-output-per-million 0.60

# Custom cache directory
python -m evals.longbench_eval --cache-dir my_cache --results-dir my_results
```

### Available Flags

| Flag | Default | Description |
|------|---------|-------------|
| `--model` | `openai/gpt-4o-mini` | OpenRouter model ID |
| `--n` | 30 | Number of examples to sample |
| `--seed` | 42 | Random seed for reproducibility |
| `--cutoffs` | 0.3 0.9 | Compression cutoff values |
| `--budget-usd` | 10.0 | Maximum budget in USD |
| `--max-context-tokens-for-sampling` | 60000 | Only sample examples whose context fits within this token limit |
| `--max-output-tokens` | 8 | Max output tokens from model |
| `--price-input-per-million` | 0.15 | Input token price (USD per million tokens) |
| `--price-output-per-million` | 0.60 | Output token price (USD per million tokens) |
| `--cache-dir` | runs/cache | Cache directory |
| `--results-dir` | runs | Results directory |
| `--no-api` | False | Dry run mode (no API calls) |
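
The pricing and budget flags combine into a simple cost estimate. The sketch below is one way a budget guard could work, not the code in `evals/longbench_eval.py`: the helper names `estimate_cost_usd` and `max_affordable_n` are hypothetical, and it assumes every condition is billed at the uncompressed token count, which is an upper bound once compression is applied.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      price_input_per_million=0.15,
                      price_output_per_million=0.60):
    """Cost of one API call in USD, given per-million-token prices."""
    return (input_tokens * price_input_per_million
            + output_tokens * price_output_per_million) / 1_000_000


def max_affordable_n(n, mean_input_tokens, n_conditions,
                     budget_usd=10.0, max_output_tokens=8):
    """Largest example count (capped at n) whose worst-case cost fits the budget.

    Hypothetical helper: each example is sent once per condition, and every
    call is priced as if it used the full, uncompressed context.
    """
    per_example = n_conditions * estimate_cost_usd(mean_input_tokens, max_output_tokens)
    if per_example <= 0:
        return n
    return min(n, int(budget_usd // per_example))


# Baseline plus two cutoffs gives three conditions per example.
full_budget_n = max_affordable_n(n=30, mean_input_tokens=60_000, n_conditions=3)
tight_budget_n = max_affordable_n(n=30, mean_input_tokens=60_000, n_conditions=3,
                                  budget_usd=0.5)
print(f"N under $10 budget: {full_budget_n}, N under $0.50 budget: {tight_budget_n}")
```

With the default prices, even 30 examples at the 60,000-token sampling limit stay well under the $10 default, so the guard only reduces N for much tighter budgets or more expensive models.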

In [None]:
# Dry run with N=3 (no API calls needed)
print("Running dry-run evaluation (N=3, no API calls)...")
print("=" * 60)

from evals import run_experiment

dry_results = run_experiment(
    compressor_fn=compress_text,
    cutoffs=[0.3, 0.9],
    model="openai/gpt-4o-mini",
    n=3,
    seed=42,
    budget_usd=10.0,
    no_api=True,  # No API calls
)

print("\nDry run complete!")
print(f"Conditions tested: {len(dry_results.get('conditions', []))}")

In [None]:
# If API key is available, run a small real evaluation
if API_KEY:
    print("API key found! Running small evaluation (N=10)...")
    print("=" * 60)
    
    results = run_experiment(
        compressor_fn=compress_text,
        cutoffs=[0.3, 0.9],
        model="openai/gpt-4o-mini",
        n=10,
        seed=42,
        budget_usd=10.0,
        no_api=False,
        openrouter_api_key=API_KEY,
    )
    
    print(f"\nTotal cost: ${results.get('total_cost', 0):.4f}")
else:
    print("No API key found. Skipping real evaluation.")
    print("To run with real API calls:")
    print("1. Create a .env file with OPENROUTER_API_KEY=your_key")
    print("2. Re-run this cell")
    results = None

## 5. Display Results

In [None]:
import json
from pathlib import Path

# Load results from file (if exists)
results_file = Path("runs/results.json")

if results_file.exists():
    with open(results_file) as f:
        saved_results = json.load(f)
    print("Loaded results from runs/results.json")
    print("\nConfiguration:")
    config = saved_results.get('config', {})
    print(f"  Model: {config.get('model', 'N/A')}")
    print(f"  N examples: {config.get('n', 'N/A')}")
    print(f"  Seed: {config.get('seed', 'N/A')}")
    print(f"  Budget: ${config.get('budget_usd', 'N/A')}")
    print(f"\nTotal cost: ${saved_results.get('total_cost', 0):.4f}")
else:
    print("No results file found. Run an evaluation first.")
    saved_results = None

In [None]:
# Display results table
if saved_results and 'conditions' in saved_results:
    print("\n" + "=" * 100)
    print("RESULTS TABLE")
    print("=" * 100)
    print(f"{'Condition':<15} | {'Accuracy':>8} | {'Delta':>8} | {'95% CI':>20} | {'Token Red.':>10} | {'Invalid':>8} | {'Cost':>8}")
    print("-" * 100)
    
    for cond in saved_results['conditions']:
        name = cond['condition']
        acc = cond['accuracy']
        delta = cond['delta_vs_baseline']
        ci_low = cond['delta_ci_lower']
        ci_high = cond['delta_ci_upper']
        token_red = cond['token_reduction_vs_baseline']
        invalid = cond['invalid_rate']
        cost = cond['total_cost']
        
        if name == 'baseline':
            ci_str = "---"
            delta_str = "---"
        else:
            ci_str = f"[{ci_low:+.3f}, {ci_high:+.3f}]"
            delta_str = f"{delta:+.3f}"
        
        print(f"{name:<15} | {acc:>8.3f} | {delta_str:>8} | {ci_str:>20} | {token_red:>9.1%} | {invalid:>7.1%} | ${cost:>7.4f}")
    
    print("=" * 100)
else:
    print("No results to display.")
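
The `delta_ci_lower` and `delta_ci_upper` columns are bootstrap confidence intervals for the accuracy difference against the baseline. How the evals package computes them is not shown in this notebook; the following is a minimal sketch of a paired percentile bootstrap, assuming per-example correctness flags (1 = correct, 0 = wrong) are available for both conditions on the same examples. The function name `bootstrap_delta_ci` and its parameters are hypothetical.

```python
import numpy as np


def bootstrap_delta_ci(baseline_correct, condition_correct,
                       n_boot=10_000, alpha=0.05, seed=42):
    """Return (delta, ci_lower, ci_upper) for accuracy(condition) - accuracy(baseline).

    Paired bootstrap: the same resampled example indices are used for both
    conditions, so per-example difficulty cancels out of the difference.
    """
    base = np.asarray(baseline_correct, dtype=float)
    cond = np.asarray(condition_correct, dtype=float)
    if base.shape != cond.shape:
        raise ValueError("both conditions must cover the same examples")

    rng = np.random.default_rng(seed)
    n = base.size
    # Each row is one bootstrap replicate: n example indices drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    deltas = cond[idx].mean(axis=1) - base[idx].mean(axis=1)

    lower, upper = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return float(cond.mean() - base.mean()), float(lower), float(upper)


# Toy data: 10 examples, the compressed condition loses two answers.
baseline_flags = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
compressed_flags = [1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
delta, ci_low, ci_high = bootstrap_delta_ci(baseline_flags, compressed_flags)
print(f"delta={delta:+.3f}, 95% CI=[{ci_low:+.3f}, {ci_high:+.3f}]")
```

With small N (10 to 30 examples) these intervals are wide, so an interval that includes zero means the run cannot distinguish the condition from the baseline, not that compression is harmless.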

In [None]:
# Plot accuracy vs condition
import matplotlib.pyplot as plt

if saved_results and 'conditions' in saved_results:
    conditions = saved_results['conditions']
    
    # Extract data
    names = [c['condition'] for c in conditions]
    accuracies = [c['accuracy'] for c in conditions]
    token_reductions = [c['token_reduction_vs_baseline'] * 100 for c in conditions]
    
    # Create figure with two subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    
    # Plot 1: Accuracy by condition
    colors = ['#2ecc71' if n == 'baseline' else '#3498db' for n in names]
    bars1 = ax1.bar(names, accuracies, color=colors, edgecolor='black', linewidth=1.2)
    ax1.set_xlabel('Condition', fontsize=12)
    ax1.set_ylabel('Accuracy', fontsize=12)
    ax1.set_title('Accuracy by Compression Condition', fontsize=14)
    ax1.set_ylim(0, 1)
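    # Reference line at the baseline accuracy (assumes the first condition is the baseline)
    ax1.axhline(y=accuracies[0], color='gray', linestyle='--', alpha=0.5, label='Baseline')
    ax1.legend(loc='best')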
    
    # Add value labels
    for bar, acc in zip(bars1, accuracies):
        ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                f'{acc:.2f}', ha='center', va='bottom', fontsize=10)
    
    # Plot 2: Token reduction
    bars2 = ax2.bar(names, token_reductions, color='#e74c3c', edgecolor='black', linewidth=1.2)
    ax2.set_xlabel('Condition', fontsize=12)
    ax2.set_ylabel('Token Reduction (%)', fontsize=12)
    ax2.set_title('Token Reduction vs Baseline', fontsize=14)
    
    # Add value labels
    for bar, red in zip(bars2, token_reductions):
        ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                f'{red:.1f}%', ha='center', va='bottom', fontsize=10)
    
    plt.tight_layout()
    plt.savefig('runs/accuracy_plot.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    print("\nPlot saved to runs/accuracy_plot.png")
else:
    print("No results to plot. Run an evaluation first.")

In [None]:
# Accuracy-Token Tradeoff Plot
if saved_results and 'conditions' in saved_results:
    conditions = saved_results['conditions']
    
    fig, ax = plt.subplots(figsize=(8, 6))
    
    for cond in conditions:
        name = cond['condition']
        acc = cond['accuracy']
        tokens = cond['mean_input_tokens']
        
        color = '#2ecc71' if name == 'baseline' else '#3498db'
        marker = 'o' if name == 'baseline' else 's'
        
        ax.scatter(tokens, acc, s=200, c=color, marker=marker, 
                  edgecolors='black', linewidth=1.5, label=name, zorder=5)
        ax.annotate(name, (tokens, acc), xytext=(10, 10), 
                   textcoords='offset points', fontsize=10)
    
    ax.set_xlabel('Mean Input Tokens', fontsize=12)
    ax.set_ylabel('Accuracy', fontsize=12)
    ax.set_title('Accuracy vs Token Count Trade-off', fontsize=14)
    ax.grid(True, alpha=0.3)
    ax.legend(loc='best')
    
    plt.tight_layout()
    plt.savefig('runs/tradeoff_plot.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    print("Plot saved to runs/tradeoff_plot.png")

## Summary

This notebook demonstrates:

1. **Compression Module** (`evals/compressor.py`)
   - `compress_text(prompt, importance_cutoff)` - Compress raw text
   - `compress_messages(messages, importance_cutoff)` - Compress chat messages
   - `count_tokens(text)` - Count tokens using tiktoken

2. **Evaluation Module** (`evals/longbench_eval.py`)
   - `run_experiment(...)` - Run full evaluation programmatically
   - CLI: `python -m evals.longbench_eval --n 30 --cutoffs 0.3 0.9`

3. **Key Features**
   - Budget-guarded execution (auto-reduces N if over budget)
   - Filesystem caching (no duplicate API calls)
   - Bootstrap confidence intervals
   - Stratified sampling by domain and length
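
Stratified sampling by domain and length becomes deterministic when it uses a seeded, local random generator. The sketch below illustrates the idea under stated assumptions: `stratified_sample` is a hypothetical helper, the round-robin allocation across strata is one possible policy, and the `domain` and `length` keys are the dataset fields printed in section 3. The evals package may sample differently.

```python
import random
from collections import defaultdict


def stratified_sample(examples, n, seed=42, keys=("domain", "length")):
    """Pick up to n examples, spread evenly across strata defined by `keys`."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched

    # Group examples by their (domain, length) combination.
    strata = defaultdict(list)
    for ex in examples:
        strata[tuple(ex[k] for k in keys)].append(ex)

    # Shuffle within each stratum, visiting strata in sorted order for determinism.
    ordered = []
    for key in sorted(strata):
        group = strata[key]
        rng.shuffle(group)
        ordered.append(group)

    # Round-robin: take one example per stratum per pass until n are collected.
    picked = []
    depth = 0
    while len(picked) < n and any(depth < len(group) for group in ordered):
        for group in ordered:
            if depth < len(group) and len(picked) < n:
                picked.append(group[depth])
        depth += 1
    return picked


# Toy pool: 4 strata with 5 examples each.
combos = [("code", "short"), ("code", "long"), ("qa", "short"), ("qa", "long")]
pool = [{"_id": i, "domain": d, "length": l} for i, (d, l) in enumerate(combos * 5)]
sample = stratified_sample(pool, n=8, seed=42)
print([(ex["domain"], ex["length"]) for ex in sample])
```

Calling the function again with the same pool and seed returns the same examples in the same order, which is what makes cached API responses reusable across runs.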

### Project Structure

```
.
├── evals/
│   ├── __init__.py
│   ├── compressor.py
│   └── longbench_eval.py
├── model.ipynb
├── .env.example
├── requirements.txt
└── runs/
    ├── cache/
    ├── results.json
    └── results.csv
```
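
The `runs/cache/` directory holds cached API responses so that repeated runs do not pay for the same request twice. The cache layout used by the evals package is not documented here; as a rough sketch, a filesystem cache can key each request by a hash of its parameters and store the response as JSON. The names `cache_key`, `cached_call`, and `fake_call` below are hypothetical, and `fake_call` stands in for a real OpenRouter request.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(model, prompt, max_output_tokens):
    """Stable hash of everything that determines the response."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "max_output_tokens": max_output_tokens},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(cache_dir, model, prompt, max_output_tokens, call_fn):
    """Return the cached response if present; otherwise call call_fn and store it."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(model, prompt, max_output_tokens)}.json"

    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))

    response = call_fn(model, prompt, max_output_tokens)
    path.write_text(json.dumps(response), encoding="utf-8")
    return response


calls = []  # records how many times the (fake) API is actually hit


def fake_call(model, prompt, max_output_tokens):
    calls.append(prompt)
    return {"answer": "B"}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_call(tmp, "openai/gpt-4o-mini", "Which option?", 8, fake_call)
    second = cached_call(tmp, "openai/gpt-4o-mini", "Which option?", 8, fake_call)

print(f"API calls made: {len(calls)}")
```

Because the compressed prompt is part of the key, each compression condition gets its own cache entry, while re-running the same condition with the same seed reuses the stored response.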