Erik Y. Wang*, Sumeet Motwani, James V. Roggeveen, Eliot Hodges, Dulhan Jayalath,
Charles London, Kalyan Ramakrishnan, Flaviu Cipcigan, Philip Torr, Alessandro Abate
University of Oxford · Benchmark · Harvard University · Princeton University · Ellison Institute of Technology
*Correspondence: erik.wang@dtc.ox.ac.uk
uv syncSome benchmark validators require SageMath (not a Python package). Install it separately:
# Ubuntu/Debian
sudo apt-get install -y sagemath
# macOS (Homebrew)
brew install sageIf Sage is installed outside your PATH, set SAGE_CMD to the executable, e.g.
SAGE_CMD=/opt/homebrew/bin/sage.
Create a .env file with your API keys:
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=AIza...HorizonMath/
├── data/ # JSON data files
│ ├── problems_full.json # Problem definitions, prompts, and numeric values
│ └── baselines.json # State-of-the-art baseline values for benchmark problems
├── numerics/ # Numerical solution scripts (one per problem)
├── validators/ # Validators for benchmark and construction problems
│ ├── utils.py # Shared utilities (ValidationResult, etc.)
│ └── *.py # Problem-specific validators (one per problem)
├── scripts/ # Utility scripts
│ ├── run_benchmark.py # Generate LLM responses (Phase 1)
│ ├── evaluate_responses.py # Evaluate saved responses (Phase 2)
│ ├── tmux_run.sh # Run both phases in a tmux session
│ ├── evaluate.py # Core evaluation logic (numeric + benchmark modes)
│ ├── validator_registry.py # Auto-discovers validators
│ ├── baseline_comparator.py # Compares results against baselines
│ ├── aggregate_results.py # Aggregate evaluations across split runs
│ └── run_numerics.py # Execute all numeric scripts
└── pyproject.toml # Python dependencies
Problems are classified with three fields in data/problems_full.json:
output_type: the kind of artifact being produced (constant, formula, construction, etc.)evaluation_mode: how references are generated (ground-truth or benchmark)solvability: difficulty level (0 = already solved / calibration, 1 = likely solvable, 2 = challenging, 3 = possibly unsolvable)
| Level | Description | Count |
|---|---|---|
| 0 | Already solved; included as calibration and sanity checks | 10 |
| 1 | Likely solvable with known techniques or near-term AI | 23 |
| 2 | Challenging; may require novel insights or methods | 60 |
| 3 | Possibly unsolvable; included for experimental exploration | 8 |
Level 0 problems have known solutions and serve to verify that the evaluation pipeline and LLMs can correctly reproduce established results.
| Output Type | Count |
|---|---|
| constant | 54 |
| function | 5 |
| formula_discovery | 3 |
| construction | 39 |
Total: 101 problems
| Evaluation Mode | Count |
|---|---|
| ground_truth_computable | 59 |
| benchmark_best_known | 33 |
| new_construction | 9 |
| Domain | Count |
|---|---|
| number_theory | 17 |
| discrete_geometry | 15 |
| lattice_models | 15 |
| continuum_physics | 13 |
| combinatorics | 12 |
| integrals | 11 |
| coding_theory | 9 |
| mathematical_constants | 9 |
Most ground_truth_computable problems have a corresponding Python script in numerics/ that computes their numerical value to high precision. For problems without a numerics script, the ground-truth value was obtained from the reference papers listed for the corresponding problem in the dataset json file. To run all numerical scripts and update problems_full.json:
# Run all numeric scripts
uv run python scripts/run_numerics.py
# Run a specific problem
uv run python scripts/run_numerics.py --problem 0
# Dry run (show what would be done)
uv run python scripts/run_numerics.py --dry-run
# Force recompute even if value exists
uv run python scripts/run_numerics.py --forceThis updates data/problems_full.json with numeric_value fields for each problem.
The benchmark has a two-phase pipeline:
- Phase 1 — Generate responses (
run_benchmark.py): Prompts models and saves raw responses toresponses.jsonl. - Phase 2 — Evaluate responses (
evaluate_responses.py): Evaluates saved responses against ground truths, validators, and baselines.
This separation means you can re-evaluate responses without re-prompting models, and long-running generation survives interruptions via --resume.
The tmux_run.sh wrapper runs both phases in a detached tmux session, so benchmark runs survive SSH disconnects and laptop lid closures. Output is logged to results/tmux_run_<timestamp>.log.
# Full benchmark with defaults (OpenRouter gpt-5.2, 5 parallel)
./scripts/tmux_run.sh
# Specify provider and model
./scripts/tmux_run.sh --provider openai --model gpt-5.2-pro
# Single problem
./scripts/tmux_run.sh --problem diff_basis_upper
# Resume an interrupted run
./scripts/tmux_run.sh --resume results/openrouter_openai-gpt-5.2_20260205_143022/
# Override parallelism
./scripts/tmux_run.sh --provider anthropic --parallel 4
# Run only generation (skip evaluation)
./scripts/tmux_run.sh --phase generate --provider openai --model gpt-5.2
# Run only evaluation on existing results
./scripts/tmux_run.sh --phase evaluate --resume results/openrouter_openai-gpt-5.2_20260205_143022/
# Explicitly run both phases (default)
./scripts/tmux_run.sh --phase bothMonitor and manage the session:
tmux attach -t openmath # Attach to see live output
# Ctrl-b d # Detach without stopping the run
tmux kill-session -t openmath # Abort the runWhen both phases finish, the session drops into an interactive shell so you can inspect results.
You can also run the phases separately:
Phase 1 — Generate responses:
# Full benchmark with OpenRouter gpt-5.2 (default)
uv run scripts/run_benchmark.py
# Single problem
uv run scripts/run_benchmark.py --problem w4_watson_integral
# Use OpenAI directly (with code execution tool)
uv run scripts/run_benchmark.py --provider openai --model gpt-5.2-pro
# Use Anthropic Claude
uv run scripts/run_benchmark.py --provider anthropic
# Use Gemini
uv run scripts/run_benchmark.py --provider gemini
# Parallel generation
uv run scripts/run_benchmark.py --parallel 10
# Resume an interrupted run
uv run scripts/run_benchmark.py --resume results/openrouter_openai-gpt-5.2_20260205_143022/Phase 2 — Evaluate responses:
# Evaluate all responses in a results directory
uv run scripts/evaluate_responses.py results/openrouter_openai-gpt-5.2_20260205_143022/
# Re-evaluate (overwrites previous evaluation results)
uv run scripts/evaluate_responses.py results/openrouter_openai-gpt-5.2_20260205_143022/ --forceResults are saved to timestamped folders in results/:
results/openai_gpt-5.2_20260205_143022/
├── config.json # Run configuration
├── prompts.jsonl # Problem prompts (saved before API calls)
├── responses.jsonl # Raw LLM responses (Phase 1 output)
├── evaluation.jsonl # Per-problem evaluation results (Phase 2 output)
└── summary.json # Detailed statistics
| Option | Description |
|---|---|
--provider |
LLM provider: openrouter (default), openai, gemini, or anthropic |
--model |
Model name (defaults: openai/gpt-5.2 for OpenRouter, gpt-5.2 for OpenAI, gemini-3-pro-preview for Gemini, claude-opus-4-6 for Anthropic) |
--problem |
Run only a single problem by ID (supports partial match) |
--range |
Slice of problems to run by 0-based index, e.g. 0-24 (inclusive) |
--resume |
Resume from an interrupted run directory |
--parallel |
Number of problems to process in parallel (default: 1; tmux_run.sh uses 5) |
--data-file |
Path to problems JSON (default: data/problems_full.json) |
--debug |
Save prompts locally without calling the model API |
| Option | Description |
|---|---|
--force |
Re-evaluate even if evaluation.jsonl already exists |
--data-file |
Path to problems JSON (default: data/problems_full.json) |
--baselines-file |
Path to baselines JSON (default: data/baselines.json) |
When a benchmark run is split across multiple result directories (e.g. using --range to parallelize generation), use aggregate_results.py to merge evaluations into a single report:
# Aggregate all directories matching a glob
uv run scripts/aggregate_results.py results/openai_gpt-5.2-pro_*/
# Save combined evaluation.jsonl, summary.json, and provenance.json
uv run scripts/aggregate_results.py results/openai_gpt-5.2-pro_*/ -o results/gpt-5.2-pro_combined/If a problem appears in multiple directories (e.g. from retried runs), the latest directory wins (by lexicographic/timestamp order). The output directory contains:
results/gpt-5.2-pro_combined/
├── evaluation.jsonl # All deduplicated per-problem evaluation results
├── summary.json # Aggregate statistics (pass rates, digit counts, etc.)
└── provenance.json # Maps each problem_id to its source directory
The evaluation script (scripts/evaluate.py) supports three modes:
For problems with known numeric answers, the LLM outputs a proposed_solution() function that returns a number. The evaluator compares digits against the ground truth.
# Evaluate a numeric problem
uv run python scripts/evaluate.py --llm-output solution.txt --problem-index 0For optimization/construction problems, the LLM outputs a proposed_solution() function that returns a JSON-serializable construction (dict/list). The evaluator:
- Executes the function to get the construction
- Validates it using problem-specific validators
- Compares metrics against state-of-the-art baselines
# Evaluate a benchmark problem (auto-detects mode)
uv run python scripts/evaluate.py --llm-output solution.txt --problem-index 34
# Explicit benchmark mode
uv run python scripts/evaluate.py --llm-output solution.txt --problem-id diff_basis_upper --mode benchmark
# List all benchmark problems
uv run python scripts/evaluate.py --list-problems --mode benchmark
# Get JSON output
uv run python scripts/evaluate.py --llm-output solution.txt --problem-index 34 --jsonFor binary existence problems (find a valid object or don't), the LLM outputs a proposed_solution() function returning a construction. The evaluator validates it using problem-specific validators but performs no baseline comparison — the result is simply valid or invalid.
# Evaluate a construction problem (auto-detects mode)
uv run python scripts/evaluate.py --llm-output solution.txt --problem-index 78
# List all construction problems
uv run python scripts/evaluate.py --list-problems --mode constructionExample output:
✓ VALID [34] diff_basis_upper
Problem ID: diff_basis_upper
Message: Verified difference basis for n=10: |B|=5, |B|²/n = 2.500000
Metrics: n=10, basis_size=5, ratio=2.5
Baseline: beats_baseline
Achieved: 2.5
Baseline: 2.639 (minimize)
Improvement: +5.27%
LLMs should output a proposed_solution() function:
def proposed_solution():
# For numeric problems: return a number
# For benchmark problems: return a JSON-serializable dict/list
return {"n": 10, "basis": [0, 1, 2, 6, 9]}Each validator documents its expected input format in its docstring. Common formats include:
- Sequences:
{"sequence": [1, -1, 1, ...]} - Graphs:
{"edges": [[0,1], [1,2], ...]}or{"adjacency": [[...], ...]} - Points:
{"points": [[x, y, z], ...]} - Matrices:
{"matrix": [[...], ...]}
We welcome new problem contributions! To propose a problem, open a GitHub issue with the following:
- Problem description — a clear mathematical statement, including the source (paper, Math Stack Exchange, etc.)
- Classification — the proposed
output_type,domain,evaluation_mode, andsolvabilitylevel (see Problem Taxonomy) - Full implementation — depending on the evaluation mode, include:
| Evaluation Mode | What to provide |
|---|---|
ground_truth_computable |
A numerics script that computes the answer to at least 50 digits of precision, or a reliable academic source providing a pre-computed numerical value |
benchmark_best_known |
A validator script and a baseline value with source citation |
new_construction |
A validator script |
You must also provide a justification of the numerics or validator script that you provide below, explaining the method(s) used.
A numerics script computes the ground-truth value to high precision. It should be a standalone Python file:
from mpmath import mp
mp.dps = 110 # at least 100 digits of precision
def compute():
# Your computation here
return result
if __name__ == "__main__":
print(str(compute()))A validator checks whether a proposed construction is mathematically valid and returns metrics. It should export a single validate(solution) function:
from . import ValidationResult, success, failure
def validate(solution):
"""
Expected input format:
{"basis": [b0, b1, b2, ...]}
"""
# 1. Parse the input
if isinstance(solution, dict) and 'basis' in solution:
basis = solution['basis']
else:
return failure("Expected dict with 'basis' key")
# 2. Check mathematical validity
if not all_differences_covered(basis):
return failure("Not all differences covered", basis_size=len(basis))
# 3. Return success with metrics
return success(
f"Valid basis of size {len(basis)}",
basis_size=len(basis),
ratio=len(basis)**2 / n
)success(message, **metrics)andfailure(message, **metrics)are the only return types needed.- Document the expected input format in the docstring.
- For benchmark problems, return metrics as keyword arguments — one of these is compared against the baseline.
- Helper utilities available from the
validatorspackage:parse_integer,parse_rational,load_solution,run_sage_script.
For benchmark_best_known problems, provide a baseline entry for data/baselines.json:
{
"problem_id": "diff_basis_upper",
"baseline": {
"value": "2.6390",
"direction": "minimize",
"metric": "ratio |B|^2/n for a difference basis of [1, n-1]",
"metric_key": "ratio"
}
}direction:"minimize"or"maximize"— whether lower or higher values are better.metric_key: which key from the validator's returned metrics to compare against the baseline.- Include a source citation for the baseline value.
@article{wang2026horizonmath,
title = {HorizonMath: Measuring AI Progress Toward Mathematical
Discovery with Automatic Verification},
author = {Wang, Erik Y. and Motwani, Sumeet and Roggeveen, James V.
and Hodges, Eliot and Jayalath, Dulhan and London, Charles
and Ramakrishnan, Kalyan and Cipcigan, Flaviu
and Torr, Philip and Abate, Alessandro},
year = {2026},
note = {Working Draft}
}