CTXBench is a benchmark runner for comparing how LLMs respond to dataset-backed tasks under different execution strategies.
The current codebase is centered on a simple idea:
- keep the question set fixed
- vary how the model accesses the source information
- evaluate the response with either deterministic heuristics or a qualitative judge model
The repository currently ships a Lattes-based dataset and a resource-oriented tool layer.
This version no longer uses the legacy evaluation model based on exact/analytical/unanswerable classifications, numeric scores, or rubric dimensions.
The benchmark now uses:
- dataset instances organized by folder
- question-level `validation.type`
- instance-level `acceptedAnswers`, `contextRefs`, and `themes`
- qualitative evaluation only
- Lattes tools named as `get_<resource>`
A dataset is composed of:
- `ctxbench.dataset.json`
- `questions.json`
- `questions.instance.json`
- `context/<instanceId>/...`
`ctxbench.dataset.json` identifies the dataset package and its `datasetVersion`.

`questions.json` defines the stable question catalog.

```json
{
"datasetId": "example-lattes-v2",
"questions": [
{
"id": "q_phd_year",
"question": "In which year did the researcher obtain their PhD?",
"tags": ["objective", "factual", "simple"],
"validation": {
"type": "heuristic",
"schema": { "type": "number" }
}
},
{
"id": "q_research_summary",
"question": "Summarize the researcher's main research areas based only on the available context.",
"tags": ["subjective", "factual", "simple"],
"validation": {
"type": "judge"
}
}
]
}
```

`questions.instance.json` binds each question to one instance.

```json
{
"datasetId": "example-lattes-v2",
"instances": [
{
"instanceId": "5660469902738038",
"contextBlocks": "context/5660469902738038/blocks.json",
"questions": [
{
"id": "q_phd_year",
"acceptedAnswers": [1999]
},
{
"id": "q_research_summary",
"contextRefs": ["summary", "research"],
"themes": ["software engineering", "distributed systems"]
}
]
}
]
}
```

Each instance lives in its own directory:

```
dataset-root/
  ctxbench.dataset.json
  questions.json
  questions.instance.json
  context/
    5660469902738038/
      raw.html
      cleaned.html
      parsed.json
      blocks.json
```

Minimal package manifest:

```json
{
"id": "ctxbench/lattes",
"datasetVersion": "0.1.0",
"manifestSchemaVersion": 1
}
```

Only two validation modes exist:
- `heuristic`: used when the answer can be checked deterministically against `acceptedAnswers`.
- `judge`: used when the answer must be evaluated qualitatively against `contextRefs` and `themes`.
Judge outputs are qualitative. They include one rating and one justification for each criterion:
- `groundedness`
- `correctness`
- `completeness`
There is no score and no meanScore.
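For illustration, a judge result for one trial might look like the sketch below. The record shape and the rating vocabulary are assumptions; only the three criteria and the rating/justification pairing come from the contract above.

```json
{
  "trialId": "lattes_full_001-0001",
  "judge": "juiz-gpt",
  "criteria": {
    "groundedness": {
      "rating": "good",
      "justification": "All claims are supported by the retrieved context blocks."
    },
    "correctness": {
      "rating": "good",
      "justification": "The stated research areas match the instance themes."
    },
    "completeness": {
      "rating": "partial",
      "justification": "One listed theme is not covered in the response."
    }
  }
}
```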
An experiment selects:
- which dataset to use
- which instances and questions to include via `scope`
- which provider/model pairs to run
- which strategies and formats to test
- whether evaluation is enabled
Example:
```json
{
"id": "lattes_full_001",
"output": "/abs/path/to/outputs",
"dataset": "lattes/",
"scope": {
"instances": [],
"questions": []
},
"factors": {
"model": [
{ "provider": "openai", "name": "gpt-5.4-nano" }
],
"strategy": ["inline", "local_function", "local_mcp", "remote_mcp"],
"format": ["json", "html"]
},
"evaluation": {
"enabled": true,
"judge": {
"provider": "openai",
"model": "gpt-5.4-mini",
"temperature": 0
}
},
"execution": {
"repeats": 1
}
}
```

`scope.instances` and `scope.questions` act as filters. Empty lists mean "all available".
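For example, restricting a run to one instance and two questions from the Lattes dataset could look like this (ids taken from the examples above):

```json
{
  "scope": {
    "instances": ["5660469902738038"],
    "questions": ["q_phd_year", "q_research_summary"]
  }
}
```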
CTXBench currently supports four execution strategies.
With the `inline` strategy, the model receives the context artifact directly in the prompt.
Typical formats:
- `json` -> `parsed.json`
- `html` -> `raw.html`
- `cleaned_html` -> `cleaned.html`
With `local_function`, the benchmark controls the tool loop and exposes local Python tools directly.
For Lattes, the model interacts with:
`get_profile`, `get_expertise`, `get_education`, `get_projects`, `get_supervisions`, `get_experience`, `get_academic_activities`, `get_publications`, `get_technical_output`, `get_artistic_output`
With `local_mcp`, the benchmark still controls the loop, but the tools are accessed through a local MCP runtime.
With `remote_mcp`, the model provider controls the remote MCP interaction.
This path is less observable by design. Some metrics may be null because the benchmark cannot reliably observe provider-side tool execution.
The Lattes integration is resource-oriented.
The parsed curriculum in `parsed.json` is treated as the source of truth for tool-based execution. The tool surface is fixed and shared across `local_function`, `local_mcp`, and `remote_mcp`:
`get_profile`, `get_expertise`, `get_education`, `get_projects`, `get_supervisions`, `get_experience`, `get_academic_activities`, `get_publications`, `get_technical_output`, `get_artistic_output`
All tools are read-only. Temporal filters are exposed only where they make sense, through `start_year` and `end_year`.
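As a rough sketch, a filtered tool invocation could look like the following. The call envelope is illustrative and provider-dependent; only the tool name and the `start_year`/`end_year` arguments come from the benchmark contract.

```json
{
  "name": "get_publications",
  "arguments": {
    "start_year": 2015,
    "end_year": 2020
  }
}
```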
`get_supervisions` returns a structure grouped by supervision level:

- `masters`
- `doctoral`
- `undergraduate`
- `specialization`
- `others`

Each level contains:

- `completed`
- `ongoing`
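A sketch of that grouped shape is shown below; the fields inside each supervision entry are placeholders, only the level grouping and the `completed`/`ongoing` split are defined here.

```json
{
  "masters": {
    "completed": [{ "student": "...", "title": "...", "year": 2018 }],
    "ongoing": []
  },
  "doctoral": { "completed": [], "ongoing": [] },
  "undergraduate": { "completed": [], "ongoing": [] },
  "specialization": { "completed": [], "ongoing": [] },
  "others": { "completed": [], "ongoing": [] }
}
```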
This keeps the benchmark simpler and makes tool usage easier to compare across strategies.
The installed CLI command is ctxbench.
For remote or cached datasets, use the dataset-management commands first:
```bash
ctxbench dataset fetch \
  --dataset-url https://github.com/ctxbench/lattes/releases/download/v0.1.0-dataset/ctxbench-lattes-v0.1.0.tar.gz \
  --sha256 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef \
  --cache-dir ./.ctxbench/datasets

ctxbench dataset inspect ctxbench/lattes@0.1.0 --cache-dir ./.ctxbench/datasets
```

If your experiment already points to a local dataset root, skip fetch and inspect the root directly:
```bash
ctxbench dataset inspect datasets/lattes
```

Plan the experiment:

```bash
ctxbench plan datasets/lattes/experiment.json \
  --output outputs/lattes_baseline_001 \
  --cache-dir ./.ctxbench/datasets
```

This writes:

- `manifest.json`
- `trials.jsonl`
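Each line of `trials.jsonl` describes one planned trial. A minimal sketch of a record, assuming illustrative field names beyond `trialId`, `taskId`, strategy, format, and repetition:

```json
{
  "trialId": "lattes_full_001-0001",
  "instanceId": "5660469902738038",
  "taskId": "q_phd_year",
  "model": { "provider": "openai", "name": "gpt-5.4-nano" },
  "strategy": "inline",
  "format": "json",
  "repetition": 1
}
```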
Execute the trials:

```bash
ctxbench execute outputs/lattes_baseline_001/trials.jsonl
```

This writes:

- `responses.jsonl`
- `traces/executions/<trialId>.json`
To force re-execution even when response artifacts already exist:
```bash
ctxbench execute outputs/lattes_baseline_001/trials.jsonl --force
```

Ctrl-C stops after the current item and leaves a checkpoint behind. Rerunning the same command resumes from the last completed `trialId`.
Evaluate the responses:

```bash
ctxbench eval outputs/lattes_baseline_001/responses.jsonl
```

Optional filters and selectors:
- `--model <id>`
- `--provider <name>`
- `--instance <instanceId>`
- `--task <taskId>`
- `--strategy <name>`
- `--format <name>`
- `--repetition <n>`
- `--trial-id <trialId>`
- `--trial-id-file <path>`
- `--judge <judgeId>`
- `--status <status>`
Batch evaluation uses the same responses.jsonl input:
```bash
ctxbench eval outputs/lattes_baseline_001/responses.jsonl --judge juiz-gpt --batch
ctxbench eval outputs/lattes_baseline_001/responses.jsonl --judge juiz-gpt --batch --wait --poll-interval 60
```

This writes:

- `evals.jsonl`
- `judge_votes.jsonl`
- `traces/evals/<trialId>.json`
Export results and inspect progress:

```bash
ctxbench export outputs/lattes_baseline_001/evals.jsonl --to csv --output outputs/lattes_baseline_001/results.csv
ctxbench status outputs/lattes_baseline_001
ctxbench status outputs/lattes_baseline_001 --by judge
```

The benchmark persists:
- `manifest.json`
- `trials.jsonl`
- `responses.jsonl`
- `evals.jsonl`
- `judge_votes.jsonl`
- `results.csv`
- trace artifacts
- checkpoint files for interrupted execution or evaluation
JSONL artifacts are the default canonical source for analysis:
- `trials.jsonl`
- `responses.jsonl`
- `evals.jsonl`
Run responses include a compact metricsSummary separate from the raw trace. When a strategy does not expose a metric reliably, the field is stored as null.
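A sketch of what the summary might contain for a `remote_mcp` run, where provider-side tool activity cannot be observed; every field name here is illustrative rather than a documented schema:

```json
{
  "metricsSummary": {
    "durationMs": 5231,
    "inputTokens": 1840,
    "outputTokens": 212,
    "toolCalls": null,
    "toolDurationMs": null
  }
}
```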
Evaluation rows persist qualitative details and expose common fields directly (outcome, correctness, completeness, judge metadata, and evaluation token/duration fields) for easier CSV export.
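A sketch of one evaluation row with those common fields lifted to the top level; values and exact field names are illustrative:

```json
{
  "trialId": "lattes_full_001-0001",
  "judge": "juiz-gpt",
  "outcome": "accepted",
  "correctness": "good",
  "completeness": "good",
  "evalInputTokens": 2210,
  "evalOutputTokens": 180,
  "evalDurationMs": 4100
}
```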
Model factors can include a short id for filtering and reporting:
```json
{
"provider": "openai",
"id": "gpt-mini",
"name": "gpt-5.4-mini-2026-03-17"
}
```

Execute and eval commands accept selectors:
```bash
ctxbench execute outputs/lattes_baseline_001/trials.jsonl --model gpt-mini --task q_sup
ctxbench eval outputs/lattes_baseline_001/responses.jsonl --model gpt-mini --instance 5521922960404236
```

`--model` matches either the short `modelId` or the full model name. Selectors are available for provider, model, instance, task, strategy, format, repetition, and trial id. Evaluation also supports status and judge selection via `--judge` / `--not-judge`. Each selector also has a `--not-*` variant.
Evaluation can also use provider batch mode for supported judges. This keeps the input contract unchanged: pass the same responses.jsonl input used by synchronous evaluation.
```bash
ctxbench eval outputs/lattes_baseline_001/responses.jsonl --judge juiz-claude --batch
ctxbench eval outputs/lattes_baseline_001/responses.jsonl --judge juiz-gpt --batch --wait --poll-interval 60
ctxbench eval outputs/lattes_baseline_001/responses.jsonl --judge juiz-gemini --batch --batch-id batches/...
```

Batch evaluation currently supports one selected judge per invocation across Anthropic/Claude, OpenAI, and Google/Gemini judges, so use `--judge` when the experiment has more than one judge. The command writes `evaluation.batch.json` beside the experiment artifacts with the provider batch id and request manifest. Without `--wait`, the first command only submits the provider batch; run again with `--batch --wait` or `--batch --batch-id ... --wait` to collect and persist `evals.jsonl`, `evals-summary.json`, and optional CSV artifacts.
This migration is intentionally breaking. The public CLI, selectors, artifact names, record fields, and strategy labels use one canonical form each. Legacy public names are documented here for migration only and are not accepted as aliases.
The installed command is ctxbench. The Python distribution metadata may still be named copa during this migration.
| Deprecated term | Target |
|---|---|
| `copa` | `ctxbench` |
| `query` | `execute` |
| `exec` | prohibited abbreviation; use `execute` |
| `queries.jsonl` | `trials.jsonl` |
| `answers.jsonl` | `responses.jsonl` |
| `runId` | `trialId` |
| `questionId` | `taskId` |
| `answer` | `response` |
| `mcp` | `remote_mcp` when referring to the remote MCP strategy |
| `--question` | `--task` |
| `--repeat` | `--repetition` |
| `--ids` | `--trial-id` |
- `src/copa/cli.py`: CLI entrypoint
- `src/copa/commands/`: `plan`, `execute`, `eval`, `export`, and `status` commands
- `src/copa/benchmark/`: experiment schema, runspec generation, execution, evaluation, persistence
- `src/copa/ai/`: model adapters, strategies, trace collection, runtimes
- `src/copa/dataset/`: generic dataset loading and validation
- `src/copa/datasets/lattes/`: section-based Lattes provider, tools, and MCP server
- `examples/datasets/`: example dataset and experiment fixtures
- `datasets/lattes/`: main Lattes dataset
Install the project in editable mode:
```bash
pip install -e .[dev]
```

Run the test suite:

```bash
pytest -q
```

The refactor is aligned around simplicity:
- no compatibility with the legacy dataset/evaluation contract
- no score aggregation
- no rubric dimensions
- no broad domain-specific tool surface
- section-first retrieval for Lattes
If you are extending the benchmark, prefer preserving these constraints instead of reintroducing legacy abstractions.