# Model Benchmarking with LM-Eval-Harness

This notebook benchmarks the GPT-2 MLA model defined in `configs/gpt-2_mla_c4.json` using the lm-eval-harness framework. The model uses Multi-head Latent Attention (MLA) with shared subspaces for the query, key, and value projections.

## Model Configuration
- **Architecture**: GPT-2 with MLA attention
- **Hidden Size**: 768
- **Layers**: 12
- **Attention Heads**: 12
- **Query Shared Dim**: 192
- **KV Shared Dim**: 96
- **Vocab Size**: 50,257
- **Max Position Embeddings**: 1,024
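
As a rough sanity check on these numbers, the sketch below derives the per-head dimension and how much narrower the shared subspaces are than the hidden size. The dictionary keys are illustrative names and may not match the fields used in `configs/gpt-2_mla_c4.json`.

```python
# Values copied from the list above; the key names are illustrative.
config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "q_shared_dim": 192,
    "kv_shared_dim": 96,
}

head_dim = config["hidden_size"] // config["num_attention_heads"]
q_ratio = config["hidden_size"] / config["q_shared_dim"]
kv_ratio = config["hidden_size"] / config["kv_shared_dim"]

print(f"Per-head dimension: {head_dim}")
print(f"Query subspace is {q_ratio:.0f}x narrower than the hidden size")
print(f"KV subspace is {kv_ratio:.0f}x narrower than the hidden size")
```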


## Setup and Imports

First, we import the necessary libraries and set up logging for the benchmarking process.


In [None]:
import json
import os
import logging
from pathlib import Path
from lm_eval import simple_evaluate

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


## Configuration

Configure the benchmarking parameters. The model path should point to your trained GPT-2 MLA model checkpoint.


In [None]:
# --- Model Configuration ---
# IMPORTANT: Update this path to point to your trained model checkpoint
# This can be a local path or a model ID on the Hugging Face Hub
MODEL_PATH = "checkpoints/gpt2_mla_c4_baseline"  # Update this path as needed

# If your model requires `trust_remote_code=True`, set this to True
# This is common for custom architectures like MLA
TRUST_REMOTE_CODE = True

# Configure the batch size. "auto" lets lm-eval-harness pick the largest
# batch size that fits in memory. You can also set a specific integer like 8 or 16.
BATCH_SIZE = "auto"

# Path to save the detailed results
OUTPUT_PATH = "benchmark_results.json"

print(f"Model path: {MODEL_PATH}")
print(f"Trust remote code: {TRUST_REMOTE_CODE}")
print(f"Batch size: {BATCH_SIZE}")


## Task Configuration

Define the benchmark tasks to evaluate. We'll test the model on a comprehensive set of tasks covering different capabilities:

- **Language Modeling**: WikiText (perplexity)
- **Cloze/Prediction**: LAMBADA
- **Commonsense Reasoning**: HellaSwag, PIQA, WinoGrande
- **Science Question Answering**: ARC Easy/Challenge, OpenBookQA
- **Knowledge**: MMLU (5-shot)
- **Mathematical Reasoning**: GSM8K (8-shot)


In [None]:
# Define the list of tasks you want to run
TASKS = [
    "wikitext",         # Perplexity on WikiText
    "lambada_openai",   # Cloze/Prediction task
    "hellaswag",        # Commonsense NLI
    "piqa",             # Physical Interaction QA
    "winogrande",       # Commonsense Reasoning (Winograd Schema)
    "arc_easy",         # AI2 Reasoning Challenge (Easy)
    "arc_challenge",    # AI2 Reasoning Challenge (Challenge)
    "openbookqa",       # Open Book Question Answering
    "mmlu",             # Massive Multitask Language Understanding
    "gsm8k",            # Grade School Math
]

# Define tasks that should be run with a specific number of few-shot examples
# The numbers here are standard for these benchmarks
FEW_SHOT_CONFIG = {
    "mmlu": 5,    # 5-shot for MMLU
    "gsm8k": 8,   # 8-shot for GSM8K
}

print(f"Total tasks: {len(TASKS)}")
print(f"0-shot tasks: {len([t for t in TASKS if t not in FEW_SHOT_CONFIG])}")
print(f"Few-shot tasks: {len(FEW_SHOT_CONFIG)}")
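
Because `num_fewshot` is a single value per `simple_evaluate` call, tasks that share a shot count can be evaluated together. A minimal sketch of that grouping, using a hypothetical helper `group_tasks_by_shots` and a trimmed task list for illustration:

```python
from collections import defaultdict

def group_tasks_by_shots(tasks, few_shot_config):
    """Map each shot count to the tasks that use it (default: 0-shot)."""
    groups = defaultdict(list)
    for task in tasks:
        groups[few_shot_config.get(task, 0)].append(task)
    # Sort by shot count so the 0-shot group comes first
    return dict(sorted(groups.items()))

example_tasks = ["wikitext", "piqa", "mmlu", "gsm8k"]
example_few_shot = {"mmlu": 5, "gsm8k": 8}
groups = group_tasks_by_shots(example_tasks, example_few_shot)
for shots, names in groups.items():
    print(f"{shots}-shot: {', '.join(names)}")
```

Each group can then be passed to a single evaluation call, which keeps the number of model loads equal to the number of distinct shot counts.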


## Model Path Validation

Check if the model path exists and is valid before proceeding with benchmarking.


In [None]:
def validate_model_path(model_path):
    """
    Validate that the model path exists and contains necessary files.
    """
    if model_path == "path/to/your/custom/decoder/model":
        logger.error("="*80)
        logger.error("ERROR: Please update the `MODEL_PATH` variable!")
        logger.error("It should point to your local model directory or a Hugging Face Hub ID.")
        logger.error("="*80)
        return False
    
    # A path that exists on disk is treated as a local checkpoint; anything
    # else is assumed to be a Hugging Face Hub model ID for lm-eval to resolve
    model_dir = Path(model_path)
    if not model_dir.exists():
        logger.warning(f"Local model path does not exist: {model_path}")
        logger.warning("This might be a Hugging Face Hub model ID, which is fine.")
    else:
        # Check for the config and at least one common weight file
        if not (model_dir / "config.json").exists():
            logger.warning("Missing model file: config.json")
        weight_files = ["model.safetensors", "pytorch_model.bin"]
        if not any((model_dir / f).exists() for f in weight_files):
            logger.warning(f"No weight file found (looked for: {weight_files})")
    
    logger.info(f"Model path validated: {model_path}")
    return True

# Validate the model path
if not validate_model_path(MODEL_PATH):
    raise ValueError("Invalid model path. Please update MODEL_PATH variable.")
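
Checkpoints are not always saved as one fixed file name; larger models are often split into several shards. The sketch below lists whatever weight files a local directory contains by matching file patterns. `find_weight_files` is a hypothetical helper, and the patterns are an assumption about typical Hugging Face checkpoint layouts.

```python
from pathlib import Path

def find_weight_files(model_dir):
    """Return the names of weight files found in a local checkpoint directory."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return []
    # Assumed patterns: single-file and sharded safetensors / PyTorch weights
    patterns = ["*.safetensors", "pytorch_model*.bin"]
    found = set()
    for pattern in patterns:
        found.update(p.name for p in model_dir.glob(pattern))
    return sorted(found)

weights = find_weight_files("checkpoints/gpt2_mla_c4_baseline")
print(f"Weight files found: {weights}")
```

An empty list for a directory that exists is a reasonable signal to stop before starting a long evaluation run.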


## Benchmark Execution

Run the benchmarks on all configured tasks. The evaluation is split into 0-shot and few-shot runs because each `simple_evaluate` call applies a single `num_fewshot` value to every task it is given.


In [None]:

logger.info(f"Starting benchmark for model: {MODEL_PATH}")

# Separate tasks into 0-shot and few-shot
zero_shot_tasks = [t for t in TASKS if t not in FEW_SHOT_CONFIG]
few_shot_tasks = {t: n for t, n in FEW_SHOT_CONFIG.items() if t in TASKS}

# Prepare model arguments
model_args = f"pretrained={MODEL_PATH},trust_remote_code={TRUST_REMOTE_CODE}"

all_results = {}

# Run 0-shot tasks
if zero_shot_tasks:
    logger.info("\n" + "="*50)
    logger.info(f"Running 0-shot tasks: {', '.join(zero_shot_tasks)}")
    logger.info("="*50)
    
    results_0_shot = simple_evaluate(
        model="hf",  # lm-eval v0.4+ registers the Hugging Face backend as "hf"
        model_args=model_args,
        tasks=zero_shot_tasks,
        num_fewshot=0,
        batch_size=BATCH_SIZE,
        # You can specify a device, e.g., device="cuda:0"
        # If not specified, it will auto-detect
    )
    all_results.update(results_0_shot['results'])

# Run few-shot tasks
for task_name, num_shots in few_shot_tasks.items():
    logger.info("\n" + "="*50)
    logger.info(f"Running {num_shots}-shot task: {task_name}")
    logger.info("="*50)
    
    results_few_shot = simple_evaluate(
        model="hf",  # lm-eval v0.4+ registers the Hugging Face backend as "hf"
        model_args=model_args,
        tasks=[task_name],  # simple_evaluate expects a list of tasks
        num_fewshot=num_shots,
        batch_size=BATCH_SIZE,
    )
    all_results.update(results_few_shot['results'])

# Results are stored in all_results
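
A full run over every task can take a long time, so a quick smoke test first can catch configuration problems early. The sketch below builds the keyword arguments for `simple_evaluate` in one place and optionally sets `limit`, which caps the number of examples evaluated per task. `build_eval_kwargs` is a hypothetical helper, and the parameter names assume lm-eval v0.4 or later. Scores from a limited run only show that the pipeline works; they are not benchmark results.

```python
def build_eval_kwargs(model_path, tasks, num_fewshot=0, batch_size="auto",
                      trust_remote_code=True, limit=None):
    """Collect simple_evaluate keyword arguments for one evaluation run."""
    kwargs = {
        "model": "hf",
        "model_args": f"pretrained={model_path},trust_remote_code={trust_remote_code}",
        "tasks": list(tasks),
        "num_fewshot": num_fewshot,
        "batch_size": batch_size,
    }
    if limit is not None:
        # Evaluate at most `limit` examples per task
        kwargs["limit"] = limit
    return kwargs

smoke_kwargs = build_eval_kwargs("checkpoints/gpt2_mla_c4_baseline", ["piqa"], limit=20)
print(smoke_kwargs)
# results = simple_evaluate(**smoke_kwargs)  # uncomment to run the smoke test
```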


## Results Processing and Display

Process the benchmark results and display a summary of the main metrics for each task.


In [None]:
def process_and_display_results(results):
    """
    Process and display the benchmark results.
    """
    logger.info("\n" + "="*50)
    logger.info("Benchmark complete!")
    logger.info("="*50)

    # Save the full results to a file
    with open(OUTPUT_PATH, "w") as f:
        json.dump(results, f, indent=4)
    logger.info(f"Detailed results saved to: {OUTPUT_PATH}")

    # Print a summary of the main metrics
    print("\n--- Benchmark Summary ---\n")
    
    for task_name, metrics in results.items():
        # Extract the main metric for each task
        main_metric = "N/A"
        if "acc,none" in metrics:
            main_metric = f"acc: {metrics['acc,none']:.4f}"
        elif "acc_norm,none" in metrics:
            main_metric = f"acc_norm: {metrics['acc_norm,none']:.4f}"
        elif "exact_match,none" in metrics:
            main_metric = f"exact_match: {metrics['exact_match,none']:.4f}"
        elif "word_perplexity,none" in metrics:
            # For perplexity, lower is better
            main_metric = f"word_perplexity: {metrics['word_perplexity,none']:.4f}"

        print(f"{task_name:<20}: {main_metric}")

    print("\n--- End of Summary ---\n")
    
    return results

# Process and display results
processed_results = process_and_display_results(all_results)
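
Metric keys in recent lm-eval releases have the form `<metric>,<filter>`, and the filter is not always `none`; generation tasks may report keys such as `exact_match,strict-match`. A sketch of a more tolerant lookup, using a hypothetical helper `pick_main_metric` and made-up values purely to show the behavior:

```python
PREFERRED_METRICS = ["acc", "acc_norm", "exact_match", "word_perplexity"]

def pick_main_metric(metrics):
    """Return (key, value) for the first preferred metric present, else None."""
    for name in PREFERRED_METRICS:
        for key, value in metrics.items():
            # Match on the metric name before the comma, ignoring the filter
            if key.split(",")[0] == name and isinstance(value, (int, float)):
                return key, value
    return None

# Placeholder numbers, not real benchmark results
example = {
    "alias": "gsm8k",
    "exact_match,strict-match": 0.0125,
    "exact_match_stderr,strict-match": 0.003,
}
print(pick_main_metric(example))
```

Matching on the name before the comma also keeps `acc_stderr` entries from being mistaken for `acc`.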


## Detailed Results Analysis

Display more detailed information about the results, including additional metrics where available.


In [None]:
def display_detailed_results(results):
    """
    Display detailed results for each task.
    """
    print("\n--- Detailed Results ---\n")
    
    for task_name, metrics in results.items():
        print(f"Task: {task_name}")
        print("-" * (len(task_name) + 6))
        
        # Display all available metrics
        for metric_name, value in metrics.items():
            if isinstance(value, (int, float)):
                print(f"  {metric_name}: {value:.6f}")
            else:
                print(f"  {metric_name}: {value}")
        
        print()  # Empty line between tasks

# Display detailed results
display_detailed_results(processed_results)


## Results Summary Table

Create a formatted table of the main results for easy comparison.


In [None]:
import pandas as pd

def create_results_table(results):
    """
    Create a formatted table of the main results.
    """
    table_data = []
    
    for task_name, metrics in results.items():
        row = {"Task": task_name}
        
        # Extract main metric
        if "acc,none" in metrics:
            row["Main Metric"] = "Accuracy"
            row["Score"] = f"{metrics['acc,none']:.4f}"
        elif "acc_norm,none" in metrics:
            row["Main Metric"] = "Normalized Accuracy"
            row["Score"] = f"{metrics['acc_norm,none']:.4f}"
        elif "exact_match,none" in metrics:
            row["Main Metric"] = "Exact Match"
            row["Score"] = f"{metrics['exact_match,none']:.4f}"
        elif "word_perplexity,none" in metrics:
            row["Main Metric"] = "Word Perplexity"
            row["Score"] = f"{metrics['word_perplexity,none']:.4f}"
        else:
            row["Main Metric"] = "Other"
            row["Score"] = "N/A"
        
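        # Add few-shot info; grouped tasks such as MMLU report subtasks
        # named "<task>_<subtask>", which inherit the parent's shot count
        shots = next(
            (n for t, n in FEW_SHOT_CONFIG.items()
             if task_name == t or task_name.startswith(f"{t}_")),
            0,
        )
        row["Few-Shot"] = f"{shots}-shot"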
        
        table_data.append(row)
    
    df = pd.DataFrame(table_data)
    return df

# Create and display the results table
results_table = create_results_table(processed_results)
print("\n--- Results Summary Table ---\n")
print(results_table.to_string(index=False))


## Next Steps

The benchmarking is complete! Here are some suggestions for next steps:

1. **Compare with baselines**: Compare these results with standard GPT-2 or other transformer models
2. **Analyze performance patterns**: Look for tasks where the MLA architecture performs particularly well or poorly
3. **Hyperparameter tuning**: Experiment with different MLA configurations (q_shared_dim, kv_shared_dim, etc.)
4. **Ablation studies**: Test the impact of different components of the MLA architecture
5. **Scale experiments**: Test how performance changes with model size or training data

The detailed results have been saved to `benchmark_results.json` for further analysis.
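
For the first suggestion, one way to compare against a baseline is to load two saved result files and subtract matching metrics. A minimal sketch, assuming both files use the task-to-metrics layout saved by this notebook; `compare_results` and the baseline file name are hypothetical:

```python
import json
from pathlib import Path

def compare_results(candidate, baseline):
    """Return {task: {metric: candidate - baseline}} for shared numeric metrics."""
    deltas = {}
    for task, metrics in candidate.items():
        base_metrics = baseline.get(task, {})
        shared = {
            name: value - base_metrics[name]
            for name, value in metrics.items()
            if isinstance(value, (int, float))
            and isinstance(base_metrics.get(name), (int, float))
        }
        if shared:
            deltas[task] = shared
    return deltas

mla_path = Path("benchmark_results.json")
baseline_path = Path("baseline_results.json")  # hypothetical file name
if mla_path.exists() and baseline_path.exists():
    mla = json.loads(mla_path.read_text())
    baseline = json.loads(baseline_path.read_text())
    for task, diff in compare_results(mla, baseline).items():
        print(task, diff)
```

A positive delta means the MLA model scored higher on that metric. For perplexity metrics, lower is better, so a negative delta is the improvement.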
