Summary
MAP has tests for research persistence and token accounting, but none that check whether research-agent returns useful, compact file:line evidence. Add a small deterministic eval harness that scores research output against fixture repositories and expected relevant locations.
Paper basis
FastContext reports standalone exploration quality using patch-derived file/module/function targets and precision/recall/F1. The important product lesson is not the exact benchmark, but that repository exploration quality should be evaluated separately from the solver.
Current evidence
- tests/test_map_step_runner.py:5562-5684 validates save_research/load_research storage behavior.
- tests/test_map_token_meter.py:86-206 validates token accounting for subagent transcripts.
- tests/test_validate_spec_citations.py validates existing file:line citations for specs.
- I did not find a dedicated research-agent localization quality eval that checks returned paths/ranges against a known target set.
Proposal
Create a lightweight eval suite for the research layer, separate from skill trigger eval:
- fixture repos or mini-repos with known symbols and call paths
- natural-language research queries in English and Russian where relevant
- expected file path and line-range targets
- parser for research artifacts into predicted locations
- scoring that reports at least file-level precision/recall/F1, plus penalties for malformed or over-broad output
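
A minimal sketch of the file-level scoring step, to make the metric concrete. The names `Score` and `score_files` are hypothetical and do not exist in MAP today. Paths are de-duplicated before scoring, and penalties for malformed or over-broad output would be layered on top of this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    precision: float
    recall: float
    f1: float


def score_files(predicted: list[str], expected: list[str]) -> Score:
    """File-level precision/recall/F1 over de-duplicated paths (hypothetical helper)."""
    pred = set(predicted)
    gold = set(expected)
    hits = len(pred & gold)
    # Empty prediction or target sets score 0.0 instead of dividing by zero.
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return Score(precision, recall, f1)
```

For example, predicting `a.py` and `b.py` when the targets are `a.py` and `c.py` gives one hit out of two on both sides, so precision, recall, and F1 are all 0.5.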
Possible implementation paths
- Reuse validate_spec_citations path/line validation logic for citation existence.
- Store eval cases under tests/fixtures/research_eval or .map/eval-runs/research-agent.
- Keep it deterministic and mocked; do not require live LLM calls in CI.
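
One way to keep the suite deterministic is to store a recorded research output next to each case, so CI replays the text instead of calling a provider. The sketch below assumes a JSON case file. The schema (`id`, `query`, `targets`, `recorded_output`) and the names `EvalCase` and `load_cases` are illustrative assumptions, not an existing MAP format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    query: str
    # Expected targets as (path, start_line, end_line) tuples.
    expected: tuple[tuple[str, int, int], ...]
    # Recorded research-agent output that stands in for a live LLM call.
    recorded_output: str


def load_cases(text: str) -> list[EvalCase]:
    """Parse a JSON array of eval cases (hypothetical schema)."""
    cases = []
    for item in json.loads(text):
        targets = tuple(
            (t["path"], int(t["start"]), int(t["end"])) for t in item["targets"]
        )
        cases.append(
            EvalCase(
                case_id=item["id"],
                query=item["query"],
                expected=targets,
                recorded_output=item["recorded_output"],
            )
        )
    return cases
```

A case file under tests/fixtures/research_eval could then hold one JSON array per fixture repo, and adding a regression case means appending one object with the query, the expected targets, and the output that missed them.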
Acceptance criteria
- A parser converts research-agent output into normalized file/range citations.
- Unit tests score exact hits, partial line overlap, missing locations, over-broad output, duplicate citations, and malformed paths.
- The eval can run without provider credentials.
- Documentation explains how maintainers can add new research eval cases when a MAP workflow misses important files.
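
The parser and partial-overlap criteria could look roughly like the sketch below. It assumes research output cites locations as `path:line` or `path:start-end`, which matches the citation style used in this issue but is not confirmed as the research-agent's actual format. `Citation`, `parse_citations`, and `line_overlap` are hypothetical names, and the code needs Python 3.9+ for `str.removeprefix`.

```python
import re
from typing import NamedTuple


class Citation(NamedTuple):
    path: str
    start: int
    end: int


# Assumed citation shape: path/with.ext:START or path/with.ext:START-END
_CITATION_RE = re.compile(
    r"(?P<path>[\w./-]+\.\w+):(?P<start>\d+)(?:-(?P<end>\d+))?"
)


def parse_citations(text: str) -> tuple[list[Citation], int]:
    """Return (unique normalized citations, count of malformed ones)."""
    seen: set[Citation] = set()
    citations: list[Citation] = []
    malformed = 0
    for match in _CITATION_RE.finditer(text):
        path = match["path"]
        start = int(match["start"])
        end = int(match["end"] or match["start"])
        # Reject absolute paths, parent traversal, and impossible ranges.
        escapes_repo = path.startswith("/") or ".." in path.split("/")
        if escapes_repo or start < 1 or end < start:
            malformed += 1
            continue
        citation = Citation(path.removeprefix("./"), start, end)
        if citation not in seen:  # duplicates are collapsed, order preserved
            seen.add(citation)
            citations.append(citation)
    return citations, malformed


def line_overlap(pred: Citation, gold: Citation) -> float:
    """Fraction of the expected range covered by the predicted range."""
    if pred.path != gold.path:
        return 0.0
    shared = min(pred.end, gold.end) - max(pred.start, gold.start) + 1
    return max(shared, 0) / (gold.end - gold.start + 1)
```

Measuring overlap against the expected range makes `line_overlap` recall-like: a very wide predicted range still scores 1.0, so over-broad output needs its own penalty, for instance a cap on total cited lines per case.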