feat(mcp): steer agents to explore-first; fix Kotlin/Swift test detection #191
Merged
…tion

Two changes from diagnosing why Claude Code's Explore agent wasn't using codegraph_explore on a benchmark run (37 calls / ~90k tokens via search+Read+grep, vs a general-purpose agent that led with explore: 13 calls / ~55k tokens for the same question).

1. Tool guidance reframed across server-instructions.ts, instructions-template.ts, and .cursor/rules/codegraph.mdc (+ the explore/search tool descriptions): codegraph_explore is the workhorse for understanding/architecture/"how does X work" questions. Seed it with the key symbol names (a quick search/context first if the question names nothing concrete), read its output, and fill gaps with node/Read, instead of searching then Reading each file. The old "search first to find names, then explore" wording was short-circuiting: agents searched, got file:line locations, and Read them, never reaching explore.

2. isTestFile now recognizes Kotlin (*Test.kt, jvmTest/commonTest/androidTest source sets), Swift (*Tests.swift), and other camelCase test conventions, so test code is deprioritized in explore/context ranking. Previously only Java/JS/Python were known, letting tests dominate Kotlin/Swift exploration (OkHttp "trace a request" went from 8/9 test files to surfacing Call.kt/OkHttpClient.kt/Request.kt/Response.kt). Capital-led matching keeps latest.kt/manifest.kt unflagged.

An IDF common-term down-weighting was prototyped for the cold-query case but dropped: it was a measured no-op (the "common" terms weren't actually common in the test indexes); the test-detection gap was the real cause.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…usage

Tooling to measure how a Claude Code agent actually uses the codegraph MCP tools on a real repo (does it lead with codegraph_explore, how many Read/Grep follow-ups, token cost), for validating tool-guidance changes (server-instructions, tool descriptions) against real agent behavior.

- itrun.sh drives the real interactive TUI via tmux (the faithful Explore path). Hardened for unattended runs: type-and-verify prompt delivery (the ❯ glyph is drawn ~6s before the input accepts keys), auto-accepts the "trust this folder" dialog, busy-detection keys on the universal "(Ns · …)" spinner so the pre-stream thinking phase counts as busy, and fails loudly instead of capturing an empty pane.
- parse-session.mjs reports the tool breakdown + token accounting (gen / fresh-in / cached-in / billable) from the session and subagent logs, consistent across main-thread and subagent runs; counts main-thread Bash in the grep verdict.
- run-agent.sh / parse-run.mjs are the headless stream-json complement (exact per-tool tokens/cost via claude -p).
- run-interactive-test.md documents how to run it and how completion is detected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
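The busy-detection rule described for itrun.sh can be illustrated with a minimal TypeScript sketch. It assumes the spinner always renders an elapsed-seconds counter followed by a middle dot, matching the "(Ns · …)" shape quoted above. `isPaneBusy`, `SPINNER_PATTERN`, and the sample pane strings are hypothetical; itrun.sh is a shell script and its actual pattern may differ.

```typescript
// Hypothetical sketch of the busy check; not taken from itrun.sh.

// Matches the "(Ns · …)" spinner shape: an opening parenthesis, an integer
// second count, the letter "s", a space, and a middle dot.
const SPINNER_PATTERN = /\(\d+s ·/;

// Returns true if any line of the captured tmux pane shows the spinner.
function isPaneBusy(paneText: string): boolean {
  return paneText.split("\n").some((line) => SPINNER_PATTERN.test(line));
}

// Illustrative pane captures, made up for this example.
const thinkingPane = "Thinking… (6s · esc to interrupt)";
const idlePane = "❯ ";
```

Keying on the elapsed-seconds counter rather than on a status word is what lets a phase with no streamed output still register as busy.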
colbymchenry added a commit that referenced this pull request on May 20, 2026
Folds all changes since 0.7.10 into 0.7.12 (0.7.11 was unpublished from npm): size-adaptive codegraph_explore output budget (#185/#187), line numbers in explore source sections (#188), explore-first tool guidance (#191), language-neutral source-omission markers, and Kotlin/Swift test-file detection (#191). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Came out of refreshing the README benchmark, which surfaced that Claude Code's Explore agent wasn't using `codegraph_explore` at all.

Diagnosis

Same VS Code question ("how does the extension host communicate with the main process?"), two agents:

- Explore agent: 37 calls / ~90k tokens via search + Read + grep, never reaching `explore`
- general-purpose agent (led with `explore`): 13 calls / ~55k tokens

The Explore agent used `codegraph_search` + Read + grep: it treated codegraph as a search index and never reached `explore`, landing at roughly the without-CodeGraph token cost. Root causes: (1) the "search first to find names, then explore" guidance short-circuited: agents searched, got `file:line` locations, and Read them instead of feeding the names to explore; (2) explore was framed as a heavy last resort for "unfamiliar surveys."

Changes
- Guidance reframe (4 spots in sync: `server-instructions.ts`, `instructions-template.ts`, `.cursor/rules/codegraph.mdc`, plus the explore/search tool descriptions): `codegraph_explore` is the workhorse for understanding/"how does X work"/architecture questions. Seed it with the key symbol names (a quick `codegraph_search`/`codegraph_context` first only if the question names nothing concrete), read its output, and fill gaps with `node`/Read; don't search-then-Read each file.
- `isTestFile` fix: now recognizes Kotlin (`*Test.kt`, `jvmTest/`, `commonTest/`, `androidTest/` source sets), Swift (`*Tests.swift`), and other camelCase test conventions, so tests get deprioritized in explore/context ranking. Previously only Java/JS/Python were known. On OkHttp "trace a request", results went from 8/9 test files to surfacing `Call.kt`, `OkHttpClient.kt`, `Request.kt`, `Response.kt`. Capital-led matching keeps `latest.kt`/`manifest.kt`/`RealCall.kt` unflagged.

Dropped
An IDF common-term down-weighting was prototyped for the cold-query case but dropped: it measured as a no-op, because the supposedly "common" terms weren't actually common in the test indexes ("process" 0.3%, "main" 0.8% in VS Code). The test-detection gap was the real cold-query noise source.
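A small worked example shows why such a down-weighting has nothing to act on at these frequencies. It is a sketch under two assumptions: that the quoted percentages are document frequencies (the share of indexed items containing the term), and that down-weighting only applies above some commonness threshold. `idf`, `termWeight`, and the 10% threshold are hypothetical and do not describe the dropped prototype.

```typescript
// Assumption: the quoted percentages are document frequencies, i.e. the share
// of indexed items that contain the term.
function idf(docFrequency: number): number {
  return Math.log(1 / docFrequency);
}

// Hypothetical down-weighting rule (not the dropped prototype): terms at or
// below `commonThreshold` keep full weight 1; more common terms are scaled by
// their IDF relative to the IDF at the threshold.
function termWeight(docFrequency: number, commonThreshold = 0.1): number {
  if (docFrequency <= commonThreshold) return 1;
  return idf(docFrequency) / idf(commonThreshold);
}

const processWeight = termWeight(0.003); // "process" at 0.3% keeps weight 1
const mainWeight = termWeight(0.008); // "main" at 0.8% keeps weight 1
const frequentWeight = termWeight(0.5); // a term in half the items: about 0.30
```

Under these assumptions both measured terms sit far below any plausible threshold (their IDF values are roughly 5.81 and 4.83), so their weights are unchanged and the ranking is unaffected.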
Test plan
- `__tests__/is-test-file.test.ts` (Kotlin/Swift/camelCase detection + false-positive guards)

… `rpcProtocol.ts` is still missed) is intentionally left to the seed-then-explore guidance rather than a retrieval change; flag if you'd rather chase it in retrieval.
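The detection rules exercised by that test file can be sketched as follows. This is a minimal illustration, not the project's `isTestFile`: it assumes "capital-led matching" means a case-sensitive `Test`/`Tests` suffix on the file stem, and it covers only the Kotlin/Swift conventions named in this PR. `isTestFileSketch`, both constants, and the sample paths are hypothetical.

```typescript
// Hypothetical sketch; the real isTestFile also handles Java/JS/Python
// conventions that are not reproduced here.

// Source-set directories named in the PR (Kotlin Multiplatform / Android).
const TEST_SOURCE_SETS = new Set(["jvmTest", "commonTest", "androidTest"]);

// Case-sensitive suffix match: the stem must end in "Test" or "Tests" with a
// capital T, so "latest" and "manifest" (lowercase "test") do not match.
const CAPITAL_LED_TEST_SUFFIX = /Tests?$/;

function isTestFileSketch(filePath: string): boolean {
  const segments = filePath.split(/[\\/]/);
  const fileName = segments[segments.length - 1];
  const directories = segments.slice(0, -1);

  // Any file under a known test source set counts as test code.
  if (directories.some((dir) => TEST_SOURCE_SETS.has(dir))) return true;

  // Otherwise decide from the file stem (name without its extension).
  const dot = fileName.lastIndexOf(".");
  const stem = dot > 0 ? fileName.slice(0, dot) : fileName;
  return CAPITAL_LED_TEST_SUFFIX.test(stem);
}
```

With this rule, `CallTest.kt` and `AppTests.swift` are flagged, while `latest.kt`, `manifest.kt`, and `RealCall.kt` are not, which matches the false-positive guards listed in the PR.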