# MT-bench Evaluation System on Google Colab

This notebook runs the MT-bench evaluation system on Google Colab with T4 GPU optimization.

## Features:
- Evaluates multiple language models on MT-bench
- Uses GPT-4.1-nano as judge
- Memory optimized for Colab's T4 GPU
- Optional Flash Attention 2 (requires an Ampere or newer GPU; the T4 falls back to standard attention)
- Comprehensive results analysis

## Requirements:
- Google Colab with GPU runtime
- OpenAI API key for judging
- About 2-3 hours for a full evaluation of 5 models (the default configuration below selects 3)

## 1. Setup and Installation

In [None]:
# Check GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("WARNING: No GPU detected. This will be very slow!")

In [None]:
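# Install required packages
# Version specifiers are quoted so the shell does not treat ">" as a redirect
!pip install -q "torch>=2.0.0" "transformers>=4.36.0" "accelerate>=0.24.1"
!pip install -q "openai>=1.0.0" "tenacity>=8.2.0" "aiohttp>=3.8.0"
!pip install -q pandas tqdm psutil pyyaml
!pip install -q "bitsandbytes>=0.41.0"

# Try to install Flash Attention 2 (may take a few minutes).
# A failed `!pip` command does not raise a Python exception, so the exit code
# is checked through subprocess instead of try/except.
# Note: Flash Attention 2 needs an Ampere or newer GPU, so it is not used on a T4.
import subprocess
import sys

flash_install = subprocess.run(
    [sys.executable, "-m", "pip", "install", "-q", "flash-attn>=2.4.0", "--no-build-isolation"]
)
if flash_install.returncode == 0:
    print("✅ Flash Attention 2 installed successfully")
else:
    print("⚠️ Flash Attention 2 installation failed, will use standard attention")

print("\n✅ Package installation completed!")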

In [None]:
# Create the project directory layout
import os

# For demo purposes, we create the code structure directly.
# In real usage, you would clone it from your repository.

# Create directory structure
os.makedirs('src/models', exist_ok=True)
os.makedirs('src/evaluation', exist_ok=True)
os.makedirs('src/utils', exist_ok=True)
os.makedirs('data', exist_ok=True)
os.makedirs('results', exist_ok=True)

print("✅ Code structure created")

## 2. Configuration and API Setup

In [None]:
# Set up OpenAI API key
import getpass
import os

# Get API key securely
if 'OPENAI_API_KEY' not in os.environ:
    api_key = getpass.getpass('Enter your OpenAI API key: ')
    os.environ['OPENAI_API_KEY'] = api_key
    print("✅ API key set")
else:
    print("✅ API key already configured")

In [None]:
# Select models to evaluate
# These are optimized for Colab's T4 GPU (15GB memory)

AVAILABLE_MODELS = {
    'gpt2-large': {
        'path': 'gpt2-large',
        'memory_gb': 1.5,
        'description': 'GPT-2 Large (774M params) - Fast baseline model'
    },
    'llama-3.2-1b': {
        'path': 'meta-llama/Llama-3.2-1B-Instruct',
        'memory_gb': 2.0,
        'description': 'Llama 3.2 1B - Instruction-tuned model'
    },
    'phi-3-mini': {
        'path': 'microsoft/Phi-3-mini-4k-instruct',
        'memory_gb': 7.6,  # ~3.8B parameters x 2 bytes each in fp16
        'description': 'Phi-3 Mini - Efficient 3.8B parameter model'
    },
    'gemma-2b': {
        'path': 'google/gemma-2b-it',
        'memory_gb': 4.0,
        'description': 'Gemma 2B IT - Instruction-tuned model'
    }
}

# Select models to evaluate (you can modify this list)
MODELS_TO_EVALUATE = ['gpt2-large', 'llama-3.2-1b', 'phi-3-mini']

# For quick testing, limit questions
MAX_QUESTIONS = 10  # Set to None for full evaluation (80 questions)

print("Selected models for evaluation:")
for model in MODELS_TO_EVALUATE:
    info = AVAILABLE_MODELS[model]
    print(f"  - {model}: {info['description']} (~{info['memory_gb']}GB)")

print(f"\nQuestions per model: {MAX_QUESTIONS if MAX_QUESTIONS else 80}")
print(f"Total evaluations: {len(MODELS_TO_EVALUATE) * (MAX_QUESTIONS if MAX_QUESTIONS else 80) * 2} (2 turns each)")
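
As a rough cross-check on the `memory_gb` figures and the evaluation count printed above, fp16 weights take about two bytes per parameter, and each model needs one judge request per question and turn. The sketch below illustrates that arithmetic. The names `fp16_weight_gb` and `judge_calls` and the parameter counts are assumptions introduced here for illustration, and activations and the KV cache come on top of the weight footprint.

```python
# Rough sizing helpers (hypothetical names, not part of the notebook's code).

BYTES_PER_FP16_PARAM = 2


def fp16_weight_gb(n_params: float) -> float:
    """Approximate GiB needed to hold the weights alone in fp16."""
    return n_params * BYTES_PER_FP16_PARAM / 1024**3


def judge_calls(n_models: int, n_questions: int, turns: int = 2) -> int:
    """Number of judge requests: one per model, question and turn."""
    return n_models * n_questions * turns


# Approximate parameter counts (assumed values, for illustration only).
APPROX_PARAMS = {
    "gpt2-large": 774e6,
    "phi-3-mini": 3.8e9,
}

if __name__ == "__main__":
    for name, n_params in APPROX_PARAMS.items():
        print(f"{name}: ~{fp16_weight_gb(n_params):.1f} GiB of fp16 weights")
    print(f"Judge calls for 3 models x 10 questions: {judge_calls(3, 10)}")
```

An estimate like this is only a lower bound, so it is worth leaving several gigabytes of headroom on a 15 GB GPU.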

## 3. Core Implementation

This section contains the main evaluation code. In a real deployment, this would be imported from modules.

In [None]:
# Memory monitoring utilities
import torch
import gc
import psutil

class MemoryMonitor:
    def __init__(self, memory_limit_gb=15.0):
        self.memory_limit_gb = memory_limit_gb
        self.peak_memory = 0.0
    
    def get_gpu_memory_usage(self):
        if torch.cuda.is_available():
            return torch.cuda.memory_allocated() / (1024**3)
        return 0.0
    
    def cleanup_gpu_memory(self):
        # Collect garbage first so freed tensors can actually be released
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            torch.cuda.empty_cache()
    
    def log_memory_usage(self, operation=""):
        gpu_memory = self.get_gpu_memory_usage()
        system_memory = psutil.virtual_memory().percent
        self.peak_memory = max(self.peak_memory, gpu_memory)
        
        print(f"[{operation}] GPU: {gpu_memory:.2f}GB, System: {system_memory:.1f}%, Peak: {self.peak_memory:.2f}GB")

# Initialize memory monitor
memory_monitor = MemoryMonitor()
memory_monitor.log_memory_usage("Initial")

In [None]:
# Model management with memory optimization
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import warnings

class ModelManager:
    def __init__(self, memory_monitor):
        self.memory_monitor = memory_monitor
        self.current_model = None
        self.current_tokenizer = None
        self.current_model_name = None
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
    
    def cleanup_current_model(self):
        if self.current_model is not None:
            print(f"Cleaning up model: {self.current_model_name}")
            del self.current_model
            del self.current_tokenizer
            self.current_model = None
            self.current_tokenizer = None
            self.current_model_name = None
            self.memory_monitor.cleanup_gpu_memory()
    
    def load_model(self, model_name):
        if self.current_model_name == model_name:
            return self.current_model, self.current_tokenizer
        
        self.cleanup_current_model()
        
        model_info = AVAILABLE_MODELS[model_name]
        model_path = model_info['path']
        
        print(f"Loading model: {model_name} ({model_path})")
        self.memory_monitor.log_memory_usage("Before model loading")
        
        # Load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        
        # Configure model loading
        model_kwargs = {
            "torch_dtype": torch.float16,
            "device_map": "auto",
            "low_cpu_mem_usage": True,
        }
        
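        # Use Flash Attention 2 only if it is installed and the GPU supports it
        # (Ampere or newer, compute capability >= 8.0; a T4 is 7.5)
        try:
            import flash_attn  # noqa: F401
            has_flash_attn = True
        except ImportError:
            has_flash_attn = False
        supports_fa2 = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
        if has_flash_attn and supports_fa2:
            model_kwargs["attn_implementation"] = "flash_attention_2"
            print("Using Flash Attention 2")
        else:
            print("Flash Attention 2 not available, using standard attention")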
        
        # Add trust_remote_code for certain models
        if "phi" in model_name.lower() or "qwen" in model_name.lower():
            model_kwargs["trust_remote_code"] = True
        
        # Load model
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore")
            model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)
        
        model.eval()
        
        self.current_model = model
        self.current_tokenizer = tokenizer
        self.current_model_name = model_name
        
        self.memory_monitor.log_memory_usage("After model loading")
        return model, tokenizer
    
    def generate_response(self, prompt, max_new_tokens=512):
        if self.current_model is None:
            raise RuntimeError("No model loaded")
        
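        # Leave room for the generated tokens inside the model's context window
        # (for example, GPT-2 only supports 1024 positions)
        context_window = getattr(self.current_model.config, "max_position_embeddings", 2048)
        max_input_length = max(1, min(2048, context_window - max_new_tokens))
        inputs = self.current_tokenizer(
            prompt, return_tensors="pt", truncation=True, max_length=max_input_length
        ).to(self.device)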
        
        with torch.no_grad():
            outputs = self.current_model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                do_sample=True,
                top_p=0.9,
                pad_token_id=self.current_tokenizer.eos_token_id,
                repetition_penalty=1.1
            )
        
        # Decode only the new tokens
        input_length = inputs.input_ids.shape[1]
        response_tokens = outputs[0][input_length:]
        response = self.current_tokenizer.decode(response_tokens, skip_special_tokens=True).strip()
        
        return response

# Initialize model manager
model_manager = ModelManager(memory_monitor)
print("✅ Model manager initialized")
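
Because `generate_response` appends up to `max_new_tokens` tokens to the prompt, the prompt has to leave room for them inside the model's context window. The following is a minimal sketch of that budget, not the notebook's implementation: `max_prompt_tokens` and `truncate_left` are hypothetical helpers, and the window size is passed in by the caller rather than read from a model config.

```python
from typing import List


def max_prompt_tokens(context_window: int, max_new_tokens: int) -> int:
    """Largest prompt length that still leaves room for generation."""
    if context_window <= 0 or max_new_tokens < 0:
        raise ValueError("context_window must be positive and max_new_tokens non-negative")
    budget = context_window - max_new_tokens
    if budget < 1:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return budget


def truncate_left(token_ids: List[int], limit: int) -> List[int]:
    """Keep the last `limit` tokens so the most recent turn survives."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return list(token_ids[-limit:])


if __name__ == "__main__":
    # GPT-2 models have a 1024-token context window.
    budget = max_prompt_tokens(context_window=1024, max_new_tokens=512)
    prompt_ids = list(range(700))  # stand-in for real token ids
    kept = truncate_left(prompt_ids, budget)
    print(f"Budget: {budget} tokens, kept {len(kept)} of {len(prompt_ids)}")
```

Dropping tokens from the left is a design choice made for this sketch: it preserves the end of the prompt, where the latest user turn and the assistant tag sit.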

In [None]:
# MT-bench data loader
import json
import requests
from dataclasses import dataclass
from typing import List

@dataclass
class MTBenchQuestion:
    question_id: int
    category: str
    turns: List[str]

def load_mtbench_questions():
    """Load MT-bench questions from official repository."""
    url = "https://raw.githubusercontent.com/lm-sys/FastChat/main/fastchat/llm_judge/data/mt_bench/question.jsonl"
    
    print("Downloading MT-bench questions...")
    response = requests.get(url, timeout=30)  # Fail fast instead of hanging on a stalled download
    response.raise_for_status()
    
    questions = []
    for line in response.text.strip().split('\n'):
        if line.strip():
            data = json.loads(line)
            questions.append(MTBenchQuestion(
                question_id=data['question_id'],
                category=data['category'],
                turns=data['turns']
            ))
    
    print(f"✅ Loaded {len(questions)} MT-bench questions")
    
    # Show category distribution
    categories = {}
    for q in questions:
        categories[q.category] = categories.get(q.category, 0) + 1
    
    print("Categories:")
    for cat, count in sorted(categories.items()):
        print(f"  {cat}: {count} questions")
    
    return questions

# Load questions
all_questions = load_mtbench_questions()

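# Limit questions for testing if specified.
# Note: the question file is ordered by category, so a short prefix
# covers only the first category or two.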
if MAX_QUESTIONS:
    questions = all_questions[:MAX_QUESTIONS]
    print(f"\n📝 Using first {len(questions)} questions for testing")
else:
    questions = all_questions
    print(f"\n📝 Using all {len(questions)} questions for full evaluation")
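
When only a handful of questions are used for a quick test, it can help to spread them across categories rather than taking a prefix of the file. The sketch below shows one way to do that with a round-robin pick. It works on plain dictionaries parsed from JSONL, `sample_across_categories` is a hypothetical helper, and the records in `SAMPLE` are toy data, not real MT-bench questions.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Toy records in the same shape as the MT-bench question file.
SAMPLE = """
{"question_id": 1, "category": "writing", "turns": ["a", "b"]}
{"question_id": 2, "category": "writing", "turns": ["c", "d"]}
{"question_id": 3, "category": "math", "turns": ["e", "f"]}
{"question_id": 4, "category": "math", "turns": ["g", "h"]}
{"question_id": 5, "category": "coding", "turns": ["i", "j"]}
"""


def parse_jsonl(text: str) -> List[Dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def sample_across_categories(questions: List[Dict], limit: int) -> List[Dict]:
    """Pick up to `limit` questions, cycling through categories in turn."""
    by_category = defaultdict(list)
    for question in questions:
        by_category[question["category"]].append(question)
    queues = list(by_category.values())
    picked: List[Dict] = []
    index = 0
    # Take the index-th question of every category, then move to the next index.
    while len(picked) < limit and any(index < len(queue) for queue in queues):
        for queue in queues:
            if index < len(queue) and len(picked) < limit:
                picked.append(queue[index])
        index += 1
    return picked


if __name__ == "__main__":
    subset = sample_across_categories(parse_jsonl(SAMPLE), limit=4)
    print([(q["question_id"], q["category"]) for q in subset])
```

The same idea would apply to the `MTBenchQuestion` objects by reading `question.category` instead of a dictionary key.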

In [None]:
# OpenAI judge client
import asyncio
import time
import re
from openai import AsyncOpenAI
from dataclasses import dataclass

@dataclass
class JudgeScore:
    score: float
    reasoning: str
    question_id: int
    turn: int
    model_name: str

class JudgeClient:
    def __init__(self, api_key, model="gpt-4.1-nano"):
        self.client = AsyncOpenAI(api_key=api_key)
        self.model = model
        self.request_times = []
    
    async def rate_limit(self):
        """Simple rate limiting to stay under 10 req/sec."""
        current_time = time.time()
        # Remove timestamps older than one second
        self.request_times = [t for t in self.request_times if current_time - t < 1.0]
        
        if len(self.request_times) >= 8:  # Be conservative
            wait_time = 1.0 - (current_time - self.request_times[0])
            if wait_time > 0:
                await asyncio.sleep(wait_time)
        
        # Record the time the request is actually sent, after any wait
        self.request_times.append(time.time())
    
    def parse_score(self, response_text):
        """Parse score from judge response."""
        # Look for [[score]] pattern
        match = re.search(r'\[\[(\d+(?:\.\d+)?)\]\]', response_text)
        if match:
            score = float(match.group(1))
            reasoning = response_text[:match.start()].strip()
            return score, reasoning
        
        # Fallback: look for any number 1-10
        numbers = re.findall(r'\b([1-9]|10)\b', response_text)
        if numbers:
            return float(numbers[-1]), response_text
        
        return 0.0, "Could not parse score"
    
    async def judge_response(self, question, answer, question_id, turn, model_name):
        """Judge a single response."""
        prompt = f"""Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, please rate the response on a scale of 1 to 10 by strictly following this format: "Rating: [[rating]]", for example: "Rating: [[8]]".

[Question]
{question}

[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]"""
        
        await self.rate_limit()
        
        try:
            response = await self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "You are an expert AI evaluator."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.2,
                max_tokens=1000
            )
            
            response_text = response.choices[0].message.content
            score, reasoning = self.parse_score(response_text)
            
            return JudgeScore(
                score=score,
                reasoning=reasoning,
                question_id=question_id,
                turn=turn,
                model_name=model_name
            )
        
        except Exception as e:
            print(f"Judge error: {e}")
            return JudgeScore(
                score=0.0,
                reasoning=f"Error: {e}",
                question_id=question_id,
                turn=turn,
                model_name=model_name
            )

# Initialize judge client
judge_client = JudgeClient(os.environ['OPENAI_API_KEY'])
print("✅ Judge client initialized")
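
The judge is asked to end with a rating in the form `[[rating]]`, and a failed parse is stored as `0.0`, which is outside the 1 to 10 scale and drags averages down. A stricter alternative is sketched below: it returns `None` when no valid rating is found so that callers can exclude the judgment. `parse_rating` is a hypothetical helper, and taking the last match assumes the verdict comes at the end of the judge's reply.

```python
import re
from typing import Optional

# Matches ratings such as [[8]] or [[7.5]].
RATING_PATTERN = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def parse_rating(text: str, low: float = 1.0, high: float = 10.0) -> Optional[float]:
    """Return the last [[rating]] in `text`, or None if absent or out of range."""
    matches = RATING_PATTERN.findall(text)
    if not matches:
        return None
    rating = float(matches[-1])
    return rating if low <= rating <= high else None


if __name__ == "__main__":
    for reply in ["Clear and accurate. Rating: [[8]]", "I cannot rate this.", "Rating: [[11]]"]:
        print(repr(reply), "->", parse_rating(reply))
```

Returning `None` rather than a numeric sentinel keeps parsing failures visible instead of silently mixing them into the scores.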

## 4. Run Evaluation

This section runs the actual MT-bench evaluation.

In [None]:
# Prompt templates for different models
PROMPT_TEMPLATES = {
    'gpt2-large': '{instruction}',
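    # Llama 3 chat format; the tokenizer adds <|begin_of_text|> itself
    'llama-3.2-1b': '<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n',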
    'phi-3-mini': '<|user|>\n{instruction}<|end|>\n<|assistant|>\n',
    'gemma-2b': '<start_of_turn>user\n{instruction}<end_of_turn>\n<start_of_turn>model\n'
}

def format_prompt(instruction, model_name, conversation_history=None):
    """Format prompt for specific model."""
    template = PROMPT_TEMPLATES.get(model_name, '{instruction}')
    
    if conversation_history:
        full_instruction = f"{conversation_history}\n\nUser: {instruction}\nAssistant:"
    else:
        full_instruction = instruction
    
    return template.format(instruction=full_instruction)

print("✅ Prompt templates configured")
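
Hand-written templates are easy to get wrong, and `format_prompt` folds the first turn into the second user message as plain text. An alternative is to keep the conversation as a list of role/content messages, which is the structure that chat-tuned tokenizers in `transformers` accept through `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`. The sketch below only builds that list and a plain-text fallback. `build_messages` and `render_plain` are hypothetical helpers, not part of the notebook.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_messages(turns: List[str], responses: List[str]) -> List[Message]:
    """Interleave user turns with the assistant responses produced so far.

    `responses` must hold exactly one entry fewer than `turns`, so the
    returned list always ends with a user message awaiting a reply.
    """
    if len(responses) != len(turns) - 1:
        raise ValueError("expected exactly one fewer response than turns")
    messages: List[Message] = []
    for i, turn in enumerate(turns):
        messages.append({"role": "user", "content": turn})
        if i < len(responses):
            messages.append({"role": "assistant", "content": responses[i]})
    return messages


def render_plain(messages: List[Message]) -> str:
    """Plain-text fallback for base models without a chat template."""
    labels = {"user": "User", "assistant": "Assistant"}
    lines = [f"{labels[m['role']]}: {m['content']}" for m in messages]
    lines.append("Assistant:")  # cue the model to answer the last user turn
    return "\n".join(lines)


if __name__ == "__main__":
    messages = build_messages(
        turns=["Name a prime number.", "Name a larger one."],
        responses=["2"],
    )
    print(render_plain(messages))
```

For models that ship a chat template, passing `messages` to the tokenizer would replace the per-model strings in `PROMPT_TEMPLATES`; the plain rendering is only meant for base models such as GPT-2.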

In [None]:
# Main evaluation function
import pandas as pd
from tqdm.notebook import tqdm
import numpy as np

async def evaluate_model(model_name, questions):
    """Evaluate a single model on all questions."""
    print(f"\n🔄 Starting evaluation of {model_name}")
    
    # Load model
    model, tokenizer = model_manager.load_model(model_name)
    
    results = []
    
    # Progress bar
    pbar = tqdm(questions, desc=f"Evaluating {model_name}")
    
    for question in pbar:
        conversation_history = ""
        
        # Process both turns
        for turn_num in [1, 2]:
            turn_question = question.turns[turn_num - 1]
            
            # Format prompt
            if turn_num == 1:
                prompt = format_prompt(turn_question, model_name)
            else:
                prompt = format_prompt(turn_question, model_name, conversation_history)
            
            # Generate response
            start_time = time.time()
            try:
                response = model_manager.generate_response(prompt)
                generation_time = time.time() - start_time
            except Exception as e:
                print(f"Generation error for Q{question.question_id} T{turn_num}: {e}")
                response = f"Error: {e}"
                generation_time = 0.0
            
            # Judge response
            judge_score = await judge_client.judge_response(
                question=turn_question,
                answer=response,
                question_id=question.question_id,
                turn=turn_num,
                model_name=model_name
            )
            
            # Store result
            results.append({
                'model_name': model_name,
                'question_id': question.question_id,
                'category': question.category,
                'turn': turn_num,
                'question': turn_question,
                'response': response,
                'score': judge_score.score,
                'reasoning': judge_score.reasoning,
                'generation_time': generation_time,
                'memory_gb': memory_monitor.get_gpu_memory_usage()
            })
            
            # Update conversation history for turn 2
            if turn_num == 1:
                conversation_history = f"User: {turn_question}\nAssistant: {response}"
        
        # Update progress bar
        current_scores = [r['score'] for r in results if r['model_name'] == model_name]
        avg_score = np.mean(current_scores) if current_scores else 0.0
        pbar.set_postfix({
            'avg_score': f'{avg_score:.2f}',
            'memory': f'{memory_monitor.get_gpu_memory_usage():.1f}GB'
        })
    
    # Calculate final stats
    model_scores = [r['score'] for r in results if r['model_name'] == model_name]
    avg_score = np.mean(model_scores)
    
    print(f"✅ {model_name} completed - Average score: {avg_score:.2f}")
    memory_monitor.log_memory_usage(f"Completed {model_name}")
    
    return results

# Run evaluation for all models
all_results = []

for model_name in MODELS_TO_EVALUATE:
    try:
        model_results = await evaluate_model(model_name, questions)
        all_results.extend(model_results)
        
        # Cleanup between models
        model_manager.cleanup_current_model()
        
    except Exception as e:
        print(f"❌ Failed to evaluate {model_name}: {e}")
        continue

print(f"\n🎉 Evaluation completed! Total results: {len(all_results)}")
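
The loop above keeps every result in memory until the end, so a Colab disconnect partway through a long run loses the work done so far. One way to guard against that is to append each result to a JSONL file as it is produced and skip finished items on restart. This is a sketch under assumptions: `append_result` and `load_completed` are hypothetical helpers, and the record keys mirror the dictionaries built in `evaluate_model`.

```python
import json
import os
import tempfile
from typing import Dict, Set, Tuple

ResultKey = Tuple[str, int, int]  # (model_name, question_id, turn)


def append_result(path: str, record: Dict) -> None:
    """Append one result as a JSON line so earlier results are never rewritten."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def load_completed(path: str) -> Set[ResultKey]:
    """Return the keys of results already saved, or an empty set if none exist."""
    completed: Set[ResultKey] = set()
    if not os.path.exists(path):
        return completed
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                record = json.loads(line)
                completed.add((record["model_name"], record["question_id"], record["turn"]))
    return completed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        checkpoint = os.path.join(workdir, "results.jsonl")
        append_result(checkpoint, {"model_name": "gpt2-large", "question_id": 81, "turn": 1, "score": 3.0})
        done = load_completed(checkpoint)
        # On restart, a (model, question, turn) found in `done` can be skipped.
        print(("gpt2-large", 81, 1) in done, ("gpt2-large", 81, 2) in done)
```

In Colab the checkpoint file would need to live somewhere that survives a runtime reset, such as a mounted drive, for this to help.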

## 5. Results Analysis and Visualization

In [None]:
# Create results DataFrame
df = pd.DataFrame(all_results)

print("📊 Results Summary:")
print(f"Total evaluations: {len(df)}")
print(f"Models evaluated: {df['model_name'].nunique()}")
print(f"Questions evaluated: {df['question_id'].nunique()}")
print(f"Categories covered: {df['category'].nunique()}")

# Overall scores by model
print("\n🏆 Overall Rankings:")
model_scores = df.groupby('model_name')['score'].agg(['mean', 'std', 'count']).round(3)
model_scores = model_scores.sort_values('mean', ascending=False)
print(model_scores)

# Category performance
print("\n📈 Performance by Category:")
category_scores = df.groupby(['model_name', 'category'])['score'].mean().unstack(fill_value=0).round(2)
print(category_scores)

# Turn comparison
print("\n🔄 Performance by Turn:")
turn_scores = df.groupby(['model_name', 'turn'])['score'].mean().unstack(fill_value=0).round(2)
print(turn_scores)
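
The means above include rows where judging failed, which the notebook records with a score of `0.0`. Since valid ratings run from 1 to 10, those rows can be separated out before averaging. The sketch below shows the idea on plain dictionaries. `summarize_scores` is a hypothetical helper and the rows are toy data, not evaluation results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Optional, Tuple

FAILURE_SCORE = 0.0  # value the notebook stores when judging fails

Summary = Tuple[Optional[float], int, int]  # (mean of valid scores, n_valid, n_failed)


def summarize_scores(rows: Iterable[Dict], failure_score: float = FAILURE_SCORE) -> Dict[str, Summary]:
    """Per-model mean over valid scores, with counts of valid and failed rows."""
    valid: Dict[str, List[float]] = defaultdict(list)
    failed: Dict[str, int] = defaultdict(int)
    for row in rows:
        name = row["model_name"]
        if row["score"] == failure_score:
            failed[name] += 1
        else:
            valid[name].append(row["score"])
    summary: Dict[str, Summary] = {}
    for name in set(valid) | set(failed):
        scores = valid.get(name, [])
        summary[name] = (mean(scores) if scores else None, len(scores), failed.get(name, 0))
    return summary


if __name__ == "__main__":
    toy_rows = [
        {"model_name": "model-a", "score": 8.0},
        {"model_name": "model-a", "score": 6.0},
        {"model_name": "model-a", "score": 0.0},  # failed judgment
        {"model_name": "model-b", "score": 0.0},  # failed judgment
    ]
    for name, stats in sorted(summarize_scores(toy_rows).items()):
        print(name, stats)
```

With the notebook's DataFrame, the equivalent filter would be `df[df['score'] > 0]` before grouping. Reporting the failure count alongside the mean shows how much data each average rests on.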

In [None]:
# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('default')
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Overall scores by model
model_means = df.groupby('model_name')['score'].mean().sort_values(ascending=True)
axes[0, 0].barh(model_means.index, model_means.values, color='skyblue')
axes[0, 0].set_title('Average MT-bench Scores by Model')
axes[0, 0].set_xlabel('Average Score')
for i, v in enumerate(model_means.values):
    axes[0, 0].text(v + 0.1, i, f'{v:.2f}', va='center')

# 2. Score distribution
df.boxplot(column='score', by='model_name', ax=axes[0, 1])
fig.suptitle('')  # Remove the "Boxplot grouped by model_name" title that pandas adds
axes[0, 1].set_title('Score Distribution by Model')
axes[0, 1].set_xlabel('Model')
axes[0, 1].set_ylabel('Score')

# 3. Category heatmap
category_pivot = df.pivot_table(values='score', index='model_name', columns='category', aggfunc='mean')
sns.heatmap(category_pivot, annot=True, fmt='.2f', cmap='YlOrRd', ax=axes[1, 0])
axes[1, 0].set_title('Performance Heatmap by Category')

# 4. Turn comparison
turn_data = df.groupby(['model_name', 'turn'])['score'].mean().unstack()
turn_data.plot(kind='bar', ax=axes[1, 1], width=0.8)
axes[1, 1].set_title('Turn 1 vs Turn 2 Performance')
axes[1, 1].set_ylabel('Average Score')
axes[1, 1].legend(['Turn 1', 'Turn 2'])
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Performance metrics
print("\n⚡ Performance Metrics:")
perf_metrics = df.groupby('model_name').agg({
    'generation_time': ['mean', 'sum'],
    'memory_gb': 'max'
}).round(3)
perf_metrics.columns = ['Avg_Gen_Time', 'Total_Gen_Time', 'Peak_Memory_GB']
print(perf_metrics)

print(f"\n🔥 Peak memory usage: {memory_monitor.peak_memory:.2f}GB")

In [None]:
# Export results
import datetime

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

# Save detailed results
csv_filename = f"mtbench_results_{timestamp}.csv"
df.to_csv(csv_filename, index=False)
print(f"✅ Detailed results saved to: {csv_filename}")

# Save summary
summary_filename = f"mtbench_summary_{timestamp}.txt"
with open(summary_filename, 'w') as f:
    f.write("MT-BENCH EVALUATION SUMMARY\n")
    f.write("=" * 50 + "\n\n")
    
    f.write(f"Evaluation Date: {datetime.datetime.now()}\n")
    f.write(f"Models Evaluated: {', '.join(MODELS_TO_EVALUATE)}\n")
    f.write(f"Questions per Model: {len(questions)}\n")
    f.write(f"Total Evaluations: {len(df)}\n\n")
    
    f.write("OVERALL RANKINGS:\n")
    for i, (model, score) in enumerate(model_scores['mean'].items(), 1):
        f.write(f"{i}. {model}: {score:.3f}\n")
    
    f.write(f"\nPeak Memory Usage: {memory_monitor.peak_memory:.2f}GB\n")
    f.write(f"Total Generation Time: {df['generation_time'].sum():.1f}s\n")

print(f"✅ Summary report saved to: {summary_filename}")

# Download files (in Colab)
try:
    from google.colab import files
    files.download(csv_filename)
    files.download(summary_filename)
    print("📥 Files ready for download!")
except ImportError:
    print("💾 Files saved locally (not in Colab environment)")

## 6. Conclusion and Next Steps

This notebook has successfully evaluated multiple language models on the MT-bench benchmark using GPT-4.1-nano as a judge.

### Key Results:
- **Models Evaluated**: Various sizes, from GPT-2 Large (774M parameters) to Phi-3 Mini (3.8B parameters)
- **Memory Optimization**: fp16 precision, plus Flash Attention 2 where the GPU supports it
- **Comprehensive Analysis**: Category-wise and turn-wise performance

### Next Steps:
1. **Scale Up**: Run full evaluation with all 80 questions
2. **More Models**: Evaluate larger models with quantization
3. **Custom Categories**: Focus on specific domains of interest
4. **Fine-tuning**: Use results to guide model improvement

### Resources:
- [MT-bench Paper](https://arxiv.org/abs/2306.05685)
- [FastChat Repository](https://github.com/lm-sys/FastChat)
- [Flash Attention](https://github.com/Dao-AILab/flash-attention)
