
Add localization-quality evaluation for research-agent outputs #200

Description

@azalio

Summary
MAP has tests for research persistence and token accounting, but none that check whether research-agent returns useful, compact file:line evidence. Add a small deterministic eval harness that scores research output against fixture repositories with known relevant locations.

Paper basis
FastContext reports standalone exploration quality using patch-derived file/module/function targets and precision/recall/F1. The important product lesson is not the exact benchmark, but that repository exploration quality should be evaluated separately from the solver.
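
For reference, file-level precision/recall/F1 over sets of paths can be sketched as below. This is a minimal illustration, not FastContext's or MAP's implementation; the function name `prf1` and the convention of returning 0.0 for empty sets are assumptions.

```python
def prf1(predicted: set[str], expected: set[str]) -> tuple[float, float, float]:
    """File-level precision, recall, and F1 over sets of repo-relative paths.

    Assumption: an empty prediction or empty target set scores 0.0 rather
    than raising, so a silent research run counts as a miss.
    """
    hits = len(predicted & expected)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One of three predicted files is relevant; one of two targets is found.
p, r, f1 = prf1({"a.py", "b.py", "c.py"}, {"a.py", "d.py"})
```

Here precision is 1/3, recall is 1/2, and F1 is 0.4, which shows how over-broad output lowers the score even when a target is found.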

Current evidence

  • tests/test_map_step_runner.py:5562-5684 validates save_research/load_research storage behavior.
  • tests/test_map_token_meter.py:86-206 validates token accounting for subagent transcripts.
  • tests/test_validate_spec_citations.py validates existing file:line citations for specs.
  • I did not find a dedicated research-agent localization quality eval that checks returned paths/ranges against a known target set.

Proposal
Create a lightweight eval suite for the research layer, separate from skill trigger eval:

  • fixture repos or mini-repos with known symbols and call paths
  • natural-language research queries in English and Russian where relevant
  • expected file path and line-range targets
  • parser for research artifacts into predicted locations
  • scoring with at least file-level precision/recall/F1, plus penalties for malformed or over-broad citations
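
The parser item above could look roughly like the following sketch. It assumes research artifacts cite locations as `path:LINE` or `path:START-END`, the same shape used under "Current evidence"; the real artifact format may differ. `Citation` and `parse_citations` are hypothetical names, not existing MAP APIs.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Citation:
    path: str   # normalized repo-relative POSIX path
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


# Assumed citation shape: path/with.ext:LINE or path/with.ext:START-END
_CITATION_RE = re.compile(
    r"(?P<path>[\w./-]+\.\w+):(?P<start>\d+)(?:-(?P<end>\d+))?"
)


def parse_citations(text: str) -> tuple[list[Citation], list[str]]:
    """Return (unique valid citations in order, raw malformed citations)."""
    valid: list[Citation] = []
    malformed: list[str] = []
    seen: set[Citation] = set()
    for match in _CITATION_RE.finditer(text):
        path = match.group("path")
        start = int(match.group("start"))
        end = int(match.group("end") or start)
        escapes_repo = path.startswith("/") or ".." in PurePosixPath(path).parts
        if escapes_repo or start < 1 or end < start:
            malformed.append(match.group(0))
            continue
        # PurePosixPath drops a leading "./" so duplicates compare equal.
        citation = Citation(str(PurePosixPath(path)), start, end)
        if citation not in seen:
            seen.add(citation)
            valid.append(citation)
    return valid, malformed


sample = (
    "Relevant: ./src/a.py:10-12 and src/a.py:10-12, "
    "also ../x.py:3 and src/b.py:20-10 and src/c.py:7"
)
valid, malformed = parse_citations(sample)
```

In this sample the duplicate citation collapses to one entry, while the path that escapes the repository and the reversed range are reported as malformed so the scorer can penalize them.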

Possible implementation paths

  • Reuse validate_spec_citations path/line validation logic for citation existence.
  • Store eval cases under tests/fixtures/research_eval or .map/eval-runs/research-agent.
  • Keep it deterministic and mocked; do not require live LLM calls in CI.
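
One way to keep the eval deterministic is to inject the research step as a callable, so CI passes a stub that replays canned output instead of calling a provider. The sketch below assumes a JSON case format with `id`, `query`, and `targets` fields; that schema, the file paths, and the names `run_eval` and `stub_research` are all hypothetical.

```python
import json
import re
from typing import Callable

# Hypothetical case file contents, e.g. stored under tests/fixtures/research_eval.
CASES_JSON = """
[
  {"id": "token-meter-en",
   "query": "Where is token usage counted for subagent transcripts?",
   "targets": [{"path": "src/meter.py", "start": 10, "end": 40}]},
  {"id": "research-store-ru",
   "query": "Где сохраняются результаты исследования?",
   "targets": [{"path": "src/store.py", "start": 5, "end": 30}]}
]
"""

# Canned outputs stand in for live research-agent runs.
CANNED = {
    "Where is token usage counted for subagent transcripts?":
        "src/meter.py:12-20 counts tokens; src/util.py:3 is a helper",
    "Где сохраняются результаты исследования?":
        "No relevant files found.",
}


def stub_research(query: str) -> str:
    """Deterministic replacement for the research agent; no network access."""
    return CANNED[query]


def predicted_files(output: str) -> set[str]:
    """Collect paths from path:LINE citations (assumed citation shape)."""
    return {m.group(1) for m in re.finditer(r"([\w./-]+\.\w+):\d+", output)}


def run_eval(cases: list[dict], run_research: Callable[[str], str]) -> dict:
    """Score each case at file level using the injected research callable."""
    results = {}
    for case in cases:
        predicted = predicted_files(run_research(case["query"]))
        expected = {target["path"] for target in case["targets"]}
        hits = len(predicted & expected)
        results[case["id"]] = {
            "precision": hits / len(predicted) if predicted else 0.0,
            "recall": hits / len(expected) if expected else 0.0,
        }
    return results


results = run_eval(json.loads(CASES_JSON), stub_research)
```

Because `run_research` is a parameter, the same harness could later be pointed at recorded transcripts or a live agent outside CI without changing the scoring code.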

Acceptance criteria

  • A parser converts research-agent output into normalized file/range citations.
  • Unit tests score exact hits, partial line overlap, missing locations, over-broad output, duplicate citations, and malformed paths.
  • The eval can run without provider credentials.
  • Documentation explains how maintainers can add new research eval cases when a MAP workflow misses important files.
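
The exact-hit, partial-overlap, missing, and over-broad cases in the criteria above could be distinguished with a small classifier like this sketch. The labels, the `Span` type, and the `max_ratio` threshold of 5 are illustrative assumptions rather than agreed definitions.

```python
from typing import NamedTuple


class Span(NamedTuple):
    path: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlap_lines(a: Span, b: Span) -> int:
    """Number of lines shared by two spans; 0 if paths differ or ranges are disjoint."""
    if a.path != b.path:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def classify(pred: Span, target: Span, max_ratio: float = 5.0) -> str:
    """Label a predicted span against one expected target.

    Assumption: a prediction more than max_ratio times longer than the
    target is treated as over-broad even though it overlaps the target.
    """
    if overlap_lines(pred, target) == 0:
        return "miss"
    if pred == target:
        return "exact"
    pred_len = pred.end - pred.start + 1
    target_len = target.end - target.start + 1
    if pred_len > max_ratio * target_len:
        return "over_broad"
    return "partial"


target = Span("src/a.py", 10, 19)
labels = [
    classify(Span("src/a.py", 10, 19), target),
    classify(Span("src/a.py", 15, 25), target),
    classify(Span("src/a.py", 1, 500), target),
    classify(Span("src/b.py", 10, 19), target),
]
```

For this target the four predictions are labeled exact, partial, over_broad, and miss in that order, which maps directly onto the unit-test cases listed above.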


Labels: enhancement (New feature or request)
