eous/bestofn

BestOfN

A framework for generating verified synthetic training data using best-of-N candidate selection with multiple LLM providers.

Given a dataset of questions, BestOfN generates N candidate responses per question using Claude or OpenAI models, then applies domain-specific verifiers to select the best response. The result is high-quality training data with verified reasoning chains.
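
The selection step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: `best_of_n`, `fake_generate`, and `exact_match_score` are hypothetical names, and a real pipeline would call a Claude or OpenAI model where the stub returns canned strings.

```python
from typing import Callable, List, Optional, Tuple


def best_of_n(
    question: str,
    generate: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    n: int = 4,
    threshold: float = 1.0,
) -> Optional[Tuple[str, float]]:
    """Return the highest-scoring candidate, or None if none reaches the threshold."""
    candidates = generate(question, n)
    if not candidates:
        return None
    # Score every candidate with the (domain-specific) verifier.
    scored = [(candidate, score(question, candidate)) for candidate in candidates]
    # max() keeps the first candidate when several share the top score.
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None


# Hypothetical stand-in for an LLM call: returns up to n canned answers.
def fake_generate(question: str, n: int) -> List[str]:
    canned = ["5", "4", "four", "22"]
    return canned[:n]


# Hypothetical stand-in for a verifier: 1.0 for the known answer, else 0.0.
def exact_match_score(question: str, candidate: str) -> float:
    return 1.0 if candidate.strip() == "4" else 0.0


selected = best_of_n("What is 2 + 2?", fake_generate, exact_match_score, n=4)
```

Returning `None` when no candidate passes lets the caller drop the question instead of keeping an unverified answer, which is one plausible way to keep the output set clean.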

Repository Structure

bestofn/
├── common/              # Shared utilities (schemas, API retry, LLM judge, etc.)
├── verifiers/           # Pluggable verification system (math, code, tool, spatial, ...)
├── claude_gen/          # Claude generation pipeline (extended thinking, tool use)
├── openai_gen/          # OpenAI generation pipeline (responses API, structured output)
├── scripts/             # Dataset generators and data processing pipelines
└── tests/               # Test suite (400+ tests)

Verifiers

The verification system uses a registry pattern to dispatch verifiers based on dataset split names. Each verifier inherits from Verifier and implements domain-specific validation.

| Verifier | Domain | Method |
| --- | --- | --- |
| MathVerifier | Math/STEM | Symbolic equivalence via SymPy |
| CodeVerifier | Python/JS | Docker-sandboxed execution |
| ToolVerifier | CLI/HTTP tool use | Schema + execution validation |
| OpenAPIToolVerifier | API tool calls | OpenAPI spec validation |
| SpatialVerifier | Hamiltonian paths | Deterministic path verification (2D/3D) |
| PolyominoVerifier | Tiling puzzles | Multi-strategy placement parsing |
| InstructionFollowingVerifier | General | LLM-backed instruction checking |
| StructuredOutputVerifier | JSON/schema | JSON schema validation |
| RefusalClassifier | Safety | Hybrid pattern + LLM refusal detection |
| PersonaVerifier | Personality | Style/character consistency checking |

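from verifiers import get_verifier, get_verifier_for_split

# Example inputs (placeholders; substitute your own data)
question = "What is 2 + 2?"
candidate_answer = "4"
reference_answer = "4"

# By name
verifier = get_verifier('math')
result = verifier.verify(question, candidate_answer, reference_answer)

# By dataset split (auto-dispatches to the verifier registered for that split)
verifier = get_verifier_for_split('gsm8k')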
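
One way such a registry could be organised is sketched below. Only `Verifier`, `get_verifier`, `get_verifier_for_split`, and the `verify(question, candidate_answer, reference_answer)` call come from this README; the `register` decorator, the two dictionaries, the boolean return type, and the toy `NumericMatchVerifier` are assumptions for illustration, and the actual package may be structured differently.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Type

# Assumed internal state: verifier name -> class, and dataset split -> verifier name.
_REGISTRY: Dict[str, Type["Verifier"]] = {}
_SPLIT_TO_VERIFIER: Dict[str, str] = {"gsm8k": "math"}


class Verifier(ABC):
    """Base class; each subclass implements domain-specific validation."""

    @abstractmethod
    def verify(self, question: str, candidate_answer: str, reference_answer: str) -> bool:
        ...


def register(name: str) -> Callable[[Type[Verifier]], Type[Verifier]]:
    """Hypothetical class decorator that records a verifier under a name."""
    def decorator(cls: Type[Verifier]) -> Type[Verifier]:
        _REGISTRY[name] = cls
        return cls
    return decorator


@register("math")
class NumericMatchVerifier(Verifier):
    """Toy stand-in that compares answers as floats rather than symbolically."""

    def verify(self, question: str, candidate_answer: str, reference_answer: str) -> bool:
        try:
            return abs(float(candidate_answer) - float(reference_answer)) < 1e-9
        except ValueError:
            # Non-numeric text cannot be compared numerically.
            return False


def get_verifier(name: str) -> Verifier:
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise KeyError(f"no verifier registered under {name!r}") from None


def get_verifier_for_split(split: str) -> Verifier:
    # Look up the verifier name for the split, then instantiate it.
    return get_verifier(_SPLIT_TO_VERIFIER[split])
```

Keeping the split-to-verifier mapping separate from the name registry means a new dataset split can reuse an existing verifier without any new class.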

Dataset Generators

The scripts/ directory contains generators for synthetic training datasets:

  • Spatial Reasoning (generate_spatial_reasoning_dataset.py) — Hamiltonian path puzzles on 2D grids (3x3-8x8) and 3D cubes (3x3x3-4x4x4), with obstacles and impossible variants
  • Polyomino Tiling (generate_polyomino_tiling_dataset.py) — Tetromino/pentomino placement puzzles with 23 piece types, 6 difficulty levels, and impossibility proofs
  • Terminal Gym (generate_terminal_gym_dataset.py, generate_terminal_gym_llm.py) — Bash/CLI tool-use trajectories with a sandboxed filesystem, 16 task subcategories, and optional plan-then-implement mode

Each generator produces JSONL output that can be converted to Harmony training format using the corresponding convert_*_to_harmony.py script.
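
For the 2D spatial reasoning puzzles, a deterministic check of a proposed path might look like the sketch below. The function name, the (row, column) coordinate convention, and the 4-neighbour adjacency rule are assumptions; the repository's SpatialVerifier may represent grids differently and also covers 3D cubes.

```python
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, int]  # (row, column), an assumed convention


def is_hamiltonian_path(
    path: List[Cell],
    rows: int,
    cols: int,
    obstacles: Iterable[Cell] = (),
) -> bool:
    """Check that path visits every free cell exactly once via adjacent steps."""
    blocked: Set[Cell] = set(obstacles)
    free = {(r, c) for r in range(rows) for c in range(cols)} - blocked
    # Same length and same set of cells means no repeats, no gaps,
    # no obstacle cells, and no out-of-bounds cells.
    if len(path) != len(free) or set(path) != free:
        return False
    # Consecutive cells must differ by exactly one orthogonal step.
    return all(
        abs(r1 - r2) + abs(c1 - c2) == 1
        for (r1, c1), (r2, c2) in zip(path, path[1:])
    )


# A snake-shaped path across a 3x3 grid.
snake = [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0), (2, 0), (2, 1), (2, 2)]
snake_ok = is_hamiltonian_path(snake, rows=3, cols=3)
```

Because the check is purely combinatorial, it needs no reference answer: any path that satisfies the constraints is accepted, which suits puzzles with many valid solutions.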

Harmony Format

Training data uses a structured multi-channel format with special tokens:

<|start|>system<|message|>You are a helpful assistant.<|end|>
<|start|>user<|message|>Solve this puzzle...<|end|>
<|start|>assistant<|channel|>analysis<|message|>Let me think step by step...<|end|>
<|start|>assistant<|channel|>final<|message|>The answer is...<|end|>

Each assistant message carries a channel (analysis, final, or planning), which keeps reasoning traces separate from final answers and allows the loss to be masked selectively per channel during training.
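
A sketch of how a conversation could be serialised into this layout follows. The `Message` type and the `render_message`, `render_harmony`, and `loss_flags` helpers are hypothetical, and the token layout is copied from the example above rather than from a formal specification. The `trainable_channels` filter shows one possible reading of selective loss masking; which channels are actually masked is not stated in this README.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Message:
    role: str
    content: str
    channel: Optional[str] = None  # only assistant messages carry a channel


def render_message(msg: Message) -> str:
    """Render one message using the special tokens shown in the example."""
    header = f"<|start|>{msg.role}"
    if msg.channel is not None:
        header += f"<|channel|>{msg.channel}"
    return f"{header}<|message|>{msg.content}<|end|>"


def render_harmony(messages: List[Message]) -> str:
    # Messages are joined with newlines, matching the example layout.
    return "\n".join(render_message(m) for m in messages)


def loss_flags(
    messages: List[Message],
    trainable_channels: Sequence[str] = ("final",),
) -> List[bool]:
    """Mark which messages would contribute to the loss (assumed policy)."""
    return [
        m.role == "assistant" and m.channel in trainable_channels
        for m in messages
    ]


conversation = [
    Message("system", "You are a helpful assistant."),
    Message("user", "Solve this puzzle..."),
    Message("assistant", "Let me think step by step...", channel="analysis"),
    Message("assistant", "The answer is...", channel="final"),
]
text = render_harmony(conversation)
flags = loss_flags(conversation)
```

Passing `trainable_channels=("analysis", "final")` would instead train on both the reasoning and the answer while still masking system and user turns.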

Quick Start

Installation

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

API Keys

export ANTHROPIC_API_KEY=your-key    # For Claude generation
export OPENAI_API_KEY=your-key       # For OpenAI generation

Generate Candidates

# Claude pipeline
python -m claude_gen.generate --config experiments/baseline/baseline.yaml

# OpenAI pipeline
python -m openai_gen.generate --config experiments/baseline/baseline.yaml

Generate Datasets

# Spatial reasoning (2D)
python scripts/generate_spatial_reasoning_dataset.py --grid-sizes 3 4 5 --num-per-size 50

# Spatial reasoning (3D)
python scripts/generate_spatial_reasoning_dataset.py --dimensions 3 --grid-sizes 3 4 --num-per-size 50

# Polyomino tiling
python scripts/generate_polyomino_tiling_dataset.py --num-puzzles 100 --difficulty-levels 1 2 3

# Terminal gym (templated)
python scripts/generate_terminal_gym_dataset.py --num-tasks 100

Run Tests

python -m pytest tests/ -x --tb=short

License

MIT License

Copyright (c) 2026

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
