feat: show effective-token delta per MCP tool call in agent log#33629
Conversation
Correlates rpc-messages.jsonl timestamps with token-usage.jsonl to compute the effective-token delta introduced by each MCP tool call result. Go CLI: adds a "Tool Call Timeline (Effective Token Δ)" table to the MCP tool usage section when delta data is available. JS step summary: adds a ΔET column to the REQUEST table in the MCP Gateway Activity section. Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds visibility into how much each MCP tool call result increases the LLM context by annotating tool calls with an effective-token delta (ΔET), computed by correlating tool-call timestamps with adjacent LLM API calls in token-usage.jsonl.
Changes:
- Go CLI: parse/sort raw token usage entries and correlate tool calls with ΔET; store it on
MCPToolCall. - Go CLI: render a “Tool Call Timeline (Effective Token Δ)” table when delta data exists.
- GitHub Actions JS summary: compute per-request ΔET and optionally add a ΔET column to the REQUEST table; adds unit tests.
Show a summary per file
| File | Description |
|---|---|
| pkg/cli/token_usage.go | Adds token-usage JSONL reader and correlates MCP tool calls with effective-token deltas. |
| pkg/cli/token_usage_test.go | Adds unit tests for the Go correlation logic. |
| pkg/cli/gateway_logs_mcp.go | Invokes tool-call/token-usage correlation when building MCP tool call records. |
| pkg/cli/audit_report.go | Extends MCPToolCall with EffectiveTokenDelta for JSON/reporting. |
| pkg/cli/audit_report_render_tools.go | Renders a tool-call timeline table including ΔET when available. |
| actions/setup/js/parse_mcp_gateway_log.cjs | Computes per-request ΔET from token-usage.jsonl and renders an optional ΔET column in the REQUEST table. |
| actions/setup/js/parse_mcp_gateway_log.test.cjs | Adds unit tests for JS ΔET computation and rendering. |
| .changeset/patch-agent-log-token-delta.md | Declares a patch release note describing the new ΔET annotations. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comments suppressed due to low confidence (1)
pkg/cli/token_usage.go:556
- The prev/next search scans the entire etEntries slice for every tool call (O(toolCalls × tokenUsageEntries)) even though etEntries is sorted. This can become slow on large runs. Consider using sort.Search (binary search) to find the insertion point for callTS and derive prev/next indices, or a two-pointer walk if toolCalls are chronological, and break early once nextIdx is found.
// Find prev (last entry with ts < callTS) and next (first entry with ts > callTS)
prevIdx := -1
nextIdx := -1
for j, e := range etEntries {
if e.ts.Before(callTS) {
prevIdx = j // keep updating to get the last one before callTS
} else if e.ts.After(callTS) && nextIdx == -1 {
nextIdx = j // first entry after callTS
}
}
- Files reviewed: 8/8 changed files
- Comments generated: 4
| // Resolve weights once for all entries | ||
| multipliers, classWeights := resolveEffectiveWeights(nil) |
| // Correlate tool calls with effective-token deltas from token-usage.jsonl | ||
| tokenUsageFile := findTokenUsageFile(logDir) | ||
| toolCalls = correlateToolCallsWithTokenDelta(toolCalls, tokenUsageFile) |
| ts := tc.Timestamp | ||
| if len(ts) > 19 { | ||
| ts = ts[:19] + "Z" | ||
| } |
| for (let i = 0; i < requests.length; i++) { | ||
| const req = requests[i]; | ||
| if (!req.timestamp) continue; | ||
| const callTs = new Date(req.timestamp).getTime(); | ||
| if (isNaN(callTs)) continue; | ||
|
|
||
| let prevIdx = -1; | ||
| let nextIdx = -1; | ||
| for (let j = 0; j < etEntries.length; j++) { | ||
| if (etEntries[j].ts < callTs) { | ||
| prevIdx = j; // keep updating to get the last one before callTs | ||
| } else if (etEntries[j].ts > callTs && nextIdx === -1) { | ||
| nextIdx = j; // first entry after callTs | ||
| } | ||
| } |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
🧪 Test Quality Sentinel completed test quality analysis. |
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. |
There was a problem hiding this comment.
Code Quality Review Summary
This PR adds effective-token delta tracking for MCP tool calls, which is a valuable observability improvement. However, there are 4 issues that should be addressed before merging:
Issues Found
- Performance inefficiency in JavaScript (line 633,
parse_mcp_gateway_log.cjs) — O(n2) nested loop can be optimized to O(n log n) using binary search - Inconsistent token weight handling (lines 515 & 547,
token_usage.go) — Ignores custom weights fromaw_info.json, causing deltas to be inconsistent with other parts of the report - Redundant file operations (line 83,
gateway_logs_mcp.go) —findTokenUsageFilecalled twice instead of once - Timestamp rendering issue (line 145,
audit_report_render_tools.go) — String slicing can misrepresent non-UTC timestamps and produce invalid output
All four issues were previously identified by the automated reviewer and remain valid concerns.
What Works Well
✅ Comprehensive test coverage for both Go and JavaScript implementations
✅ Clean separation of concerns (correlation logic isolated in dedicated functions)
✅ Proper error handling (non-fatal when token-usage.jsonl is missing)
✅ Clear algorithm documentation in both code and PR description
Recommendation
Address the 4 flagged issues, particularly the custom weights inconsistency (#2) which affects correctness, and the performance issue (#1) which could be noticeable on runs with many tool calls.
🔎 Code quality review by PR Code Quality Reviewer · ● 3.4M
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /diagnose and /tdd based on the feature addition with algorithmic implementation and test coverage.
Key Themes
Performance bottleneck — Both JS and Go implementations use O(n·m) nested loops instead of binary search on already-sorted data. For audit logs with hundreds of tool calls and API entries, this will cause noticeable performance degradation. Since both implementations already sort their data arrays, switching to binary search (O(log m) per lookup) is straightforward and drops overall complexity from O(n·m) to O(n·log m).
Missing edge case tests — The test suites cover happy paths well but don't document behavior for:
- Tool call timestamp exactly equal to an API call timestamp (current logic uses strict
</>, so exact matches return delta=0) - Negative deltas when effective tokens decrease (cache-heavy follow-ups can reduce billable tokens)
These edge cases should have explicit tests to prevent future regressions and clarify intended behavior.
Positive Highlights
✅ Algorithm documentation — Both implementations have clear docstrings explaining the prev/next correlation logic
✅ Graceful degradation — Missing token-usage.jsonl is non-fatal; delta column is simply omitted
✅ Consistent cross-language implementation — JS and Go versions use identical logic, making maintenance easier
✅ Good test coverage for sequential cases — The multi-call test scenarios validate correct delta assignment
Verdict
Requesting changes to address the performance issue before merge. The binary search optimization is a straightforward improvement that prevents future performance complaints. Edge case tests are recommended but not blocking.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · ● 5.9M
Comments that could not be inline-anchored
actions/setup/js/parse_mcp_gateway_log.cjs:75
[/diagnose] This nested loop creates O(n·m) performance where n = requests.length and m = etEntries.length. For workflows with hundreds of tool calls and API entries, this will cause noticeable lag.
Optimization: Since etEntries is already sorted by timestamp (line 64), use binary search to find prevIdx and nextIdx in O(log m) instead of O(m):
for (let i = 0; i < requests.length; i++) {
const req = requests[i];
if (!req.timestamp) continue;
const callTs = new…
</details>
<details><summary>pkg/cli/token_usage.go:513</summary>
**[/diagnose]** Same O(n·m) performance issue as the JavaScript implementation. The Go version scans all `etEntries` for each tool call.
**Optimization**: Use `sort.Search` for binary search since `etEntries` is already sorted:
```go
for i, tc := range updated {
callTS, ok := parseTokenUsageTimestamp(tc.Timestamp)
if !ok {
continue
}
// Binary search for first entry after callTS
nextIdx := sort.Search(len(etEntries), func(j int) bool {
return etEntries[j].…
</details>
<details><summary>actions/setup/js/parse_mcp_gateway_log.test.cjs:198</summary>
**[/tdd]** Missing edge case tests for timestamp boundary conditions. Consider adding:
```javascript
test("handles tool call with timestamp equal to API call", () => {
// What happens when callTs === etEntries[i].ts?
// Current logic uses < and >, so exact match won't find prev or next
const tokenContent = [
tuLine("2026-05-19T21:10:00.000Z", 1000, 50),
tuLine("2026-05-19T21:10:05.000Z", 1500, 80) // exact match
].join("\\n");
const requests = [req("2026-05-19T21:10:05.000Z"…
</details>
<details><summary>pkg/cli/token_usage_test.go:544</summary>
**[/tdd]** The Go tests cover happy path well, but missing the same edge cases as the JS tests. Add:
```go
t.Run("tool call timestamp equal to API call timestamp", func(t *testing.T) {
tmpDir := testutil.TempDir(t, "token-delta-exact")
filePath := filepath.Join(tmpDir, "token-usage.jsonl")
// Tool call at exactly the same timestamp as an API call
content := `{"timestamp":"2026-05-19T21:10:00.000Z","model":"unknown","provider":"test","input_tokens":1000,"output_tokens":50,"cache…
</details>|
✅ PR Code Quality Reviewer completed the code quality review. No new issues found beyond those already identified by previous reviewer. All 4 existing review comments are valid and should be addressed: (1) custom token weights not propagated to correlateToolCallsWithTokenDelta, (2) duplicate findTokenUsageFile calls, (3) unsafe timestamp string truncation, (4) O(n2) nested loop in JS that could use binary search. |
… call Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. |
This comment has been minimized.
This comment has been minimized.
|
🧪 Test Quality Sentinel completed test quality analysis. |
✅ Design Decision Gate — ADR VerifiedThe implementation in this PR aligns with the linked Architecture Decision Record. ADR reviewed: ADR-33629: Effective-Token Delta per MCP Tool Call via Timestamp Correlation Verification SummaryThe ADR specifies a timestamp-bracketing algorithm ( Per-requirement conformance checkDelta Computation
Rendering
Data Model
Unit tests cover the bracketing algorithm in both languages ( The design decision has been recorded and the implementation follows it. 🏗️ References: §26193570741
|
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /tdd and /zoom-out based on this being a new feature that adds observable behavior (rendering effective-token deltas in agent logs).
Key Themes
Test Coverage (✅ Strong)
- Both Go and JS implementations have comprehensive unit tests covering edge cases
- Tests follow table-driven patterns with descriptive names that read as specifications
- Edge cases covered: no-prev, no-next, empty inputs, multi-call scenarios
Architecture Fit (✅ Good)
- Timestamp-bracketing algorithm is well-documented and reasonable for the use case
- Graceful degradation when
token-usage.jsonlis missing (non-fatal, additive feature) - Reuses existing effective-token computation logic (good DRY principle)
- Excellent ADR documenting alternatives considered and trade-offs
Areas for Minor Improvement
- Algorithm Performance — O(n*m) complexity in the nested loop (lines 604-609 in Go, 75-80 in JS). Could be optimized to O(n log m) with binary search if needed for large logs.
- Missing Edge Case Tests — Clock skew / out-of-order timestamps and concurrent tool calls at identical timestamps not explicitly tested.
- Code Duplication — Bracketing algorithm is duplicated in Go and JS with slight implementation differences (acknowledged in ADR as acceptable trade-off).
Positive Highlights
- ✅ Outstanding ADR — Clear narrative explaining context, decision, alternatives, and consequences. The normative specification section is excellent.
- ✅ Defensive immutability — Go implementation creates a new slice rather than mutating input (line 593).
- ✅ Non-fatal failure mode — Missing
token-usage.jsonldoesn't break the feature, just omits the delta column. - ✅ Clear documentation — JSDoc and Go comments explain the algorithm well.
- ✅ Consistent weight computation — Both implementations reuse existing effective-token logic for consistency.
Verdict
COMMENT — No blocking issues. The feature is well-tested, fits the architecture cleanly, and includes excellent documentation. The inline comments suggest optional future optimizations and additional edge case tests, but none are critical for launch.
Great work on the ADR — it's a model example of documenting design decisions! 🎯
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · ● 6.6M
Comments that could not be inline-anchored
pkg/cli/token_usage.go:609
[/zoom-out] This nested loop creates O(n*m) complexity where n = tool calls and m = token usage entries. For typical runs this is fine, but consider optimizing if you observe performance issues with large logs.
Optimization approach (optional future work): Since etEntries is already sorted, you could use binary search to find prevIdx and nextIdx in O(log m) time, reducing overall complexity to O(n log m).
// Future optimization example:
prevIdx := sort.Search(len(etEntries…
</details>
<details><summary>actions/setup/js/parse_mcp_gateway_log.cjs:79</summary>
**[/zoom-out]** Same O(n*m) complexity issue as the Go implementation. The nested loop searches all `etEntries` for each request.
Consider using binary search here too for consistency and performance:
```js
// Binary search for prevIdx
let left = 0, right = etEntries.length - 1;
while (left <= right) {
const mid = Math.floor((left + right) / 2);
if (etEntries[mid].ts < callTs) {
prevIdx = mid;
left = mid + 1;
} else {
right = mid - 1;
}
}Not blocking for initial impl…
docs/adr/33629-effective-token-delta-per-mcp-tool-call.md:341
[/tdd] Excellent acknowledgment of the approximation in your ADR! Consider adding a test case that demonstrates this limitation:
func TestToolCallsDeltaWithClockSkew(t *testing.T) {
// Tool call timestamp is AFTER the "next" API call due to clock skew
// Expected: delta should be 0 (graceful degradation)
}This would document the behavior when timestamps are out-of-order and prevent future regressions if the algorithm changes.
pkg/cli/token_usage.go:554
[/zoom-out] The function creates a new slice copy of toolCalls (line 593) rather than mutating the input. This is excellent defensive programming!
Consider documenting this choice in a comment so future maintainers understand it's intentional:
// Create a new slice to avoid mutating the input (caller may rely on original)
updated := make([]MCPToolCall, len(toolCalls))
copy(updated, toolCalls)This makes the "returns a new slice (not mutate the input)" promise from the ADR mor…
actions/setup/js/parse_mcp_gateway_log.test.cjs:198
[/tdd] Great test coverage! The test names are descriptive and read like specifications.
One additional edge case worth testing: concurrent tool calls at the same timestamp. What happens if two tool calls have identical timestamps?
test("handles tool calls with identical timestamps", () => {
const requests = [
req("2026-05-19T21:10:05.000Z", "tool-a"),
req("2026-05-19T21:10:05.000Z", "tool-b"), // same timestamp
];
// Both should get the same delta or both should g…
</details>
🧪 Test Quality Sentinel ReportTest Quality Score: 93/100✅ Excellent test quality
Test Classification DetailsView detailed test breakdown (13 tests)JavaScript:
|
| Test | Classification | Issues Detected |
|---|---|---|
computes delta for a tool call bracketed by two API calls |
✅ Design | None |
returns empty map when tool call has no preceding API call |
✅ Design | None |
returns empty map when tool call has no following API call |
✅ Design | None |
returns empty map for empty inputs |
✅ Design | None |
computes correct deltas for multiple sequential tool calls |
✅ Design | None |
shows ΔET column when deltas are provided |
✅ Design | None |
omits ΔET column when no deltas are provided |
✅ Design | None |
shows dash for requests without a delta |
✅ Design | None |
Go: pkg/cli/token_usage_test.go (5 subtests)
| Test | Classification | Issues Detected |
|---|---|---|
assigns delta to tool calls bracketed by API calls |
✅ Design | None |
leaves delta zero when tool call has no preceding API call |
✅ Design | None |
leaves delta zero when tool call has no following API call |
✅ Design | None |
handles empty token usage file path |
✅ Design | None |
assigns correct deltas to multiple sequential tool calls |
✅ Design | None |
Analysis Highlights
✅ Strengths:
- 100% behavioral coverage — All tests verify observable outputs and system contracts
- Excellent edge case handling — Tests cover empty inputs, missing data, and boundary conditions
- No test inflation — Test-to-production ratio is healthy (JS: 1:1, Go: 0.65:1)
- Clean code practices — No prohibited mocks, proper build tags, descriptive assertion messages
- Parallel test design — JavaScript and Go tests validate the same algorithm independently
Test Design Pattern Analysis:
Both test suites follow the same behavioral contract:
- Core algorithm correctness — Delta calculation between two API calls
- Edge case: Missing previous call — Returns zero/empty when no preceding API call exists
- Edge case: Missing next call — Returns zero/empty when no following API call exists
- Edge case: Empty inputs — Handles null/empty data gracefully
- Multiple calls — Correctly processes sequential tool calls with different deltas
This parallelism across languages strengthens confidence in the algorithm's correctness.
Test Inflation Analysis
| File | Test Lines | Production Lines | Ratio | Status |
|---|---|---|---|---|
parse_mcp_gateway_log.test.cjs |
+105 | +105 | 1.00:1 | ✅ Healthy |
token_usage_test.go |
+92 | +141 | 0.65:1 | ✅ Healthy |
Threshold: 2:1 (test lines should not exceed 2× production lines)
Language Support
Tests analyzed:
- 🟨 JavaScript (
*.test.cjs): 8 tests (vitest) - 🐹 Go (
*_test.go): 5 subtests (testify, unit build tag)
Verdict
✅ Check passed. 0% of new tests are implementation tests (threshold: 30%). All tests verify behavioral contracts and would catch regressions if deleted.
Minor Suggestion
The JavaScript tests could benefit from error path testing if computeToolCallTokenDeltas can throw exceptions for malformed input (e.g., invalid JSON in tokenContent). Current tests focus on edge cases (empty/missing data) but don't validate error handling for corrupt data.
However, this is a minor enhancement opportunity — the current test quality is excellent.
📖 Understanding Test Classifications
Design Tests (High Value) verify what the system does:
- Assert on observable outputs, return values, or state changes
- Cover error paths and boundary conditions
- Would catch a behavioral regression if deleted
- Remain valid even after internal refactoring
Implementation Tests (Low Value) verify how the system does it:
- Assert on internal function calls (mocking internals)
- Only test the happy path with typical inputs
- Break during legitimate refactoring even when behavior is correct
- Give false assurance: they pass even when the system is wrong
Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.
🧪 Test quality analysis by Test Quality Sentinel · ● 12.9M · ◷
pr-sous-chef: nudged
Please address unresolved review feedback and post a short completion plan when done. Also consider refreshing the branch and re-running checks.
|
Each MCP tool call result grows the LLM context window by some amount, but the exact cost was invisible. This adds a ΔET (effective-token delta) annotation to each tool call by correlating
rpc-messages.jsonltimestamps withtoken-usage.jsonlAPI call data.Algorithm
For each tool call at timestamp
T:token-usage.jsonlentry withts < T(the API call that made the tool decision)ts > T(the API call that consumed the result)ΔET = effectiveTokens(next) − effectiveTokens(prev)Changes
Go CLI (
pkg/cli/)audit_report.go: addsEffectiveTokenDelta inttoMCPToolCalltoken_usage.go: addsreadTokenUsageEntries(raw sorted JSONL reader) andcorrelateToolCallsWithTokenDelta(correlation using existing ET weights/multipliers)gateway_logs_mcp.go: calls correlation after building tool calls for bothrpc-messages.jsonlandgateway.jsonlpathsaudit_report_render_tools.go: adds a Tool Call Timeline (Effective Token Δ) table when deltas are availableJS step summary (
actions/setup/js/parse_mcp_gateway_log.cjs)computeToolCallTokenDeltas(tokenUsageContent, requests)using the existingcomputeEffectiveTokensfromeffective_tokens.cjstokenDeltasparameter togenerateRpcMessagesSummary; renders a ΔET column in the REQUEST table when presentmain()readsTOKEN_USAGE_PATH, computes deltas, passes them into the summary — non-fatal if file is missingExample output (REQUEST table in step summary)
push_to_pull_request_branchupdate_pull_requestupdate_pull_request