feat: add tool calls diff and tokens-per-turn to audit diff mode#28494
feat: add tool calls diff and tokens-per-turn to audit diff mode#28494
Conversation
- Add ToolCallDiffEntry, ToolCallsDiff, ToolCallsDiffSummary, BashCommandsDiff types - Add tokens-per-turn fields (Run1TokensPerTurn, Run2TokensPerTurn, TokensPerTurnChange) to RunMetricsDiff - Add ToolCallsDiff field to RunMetricsDiff for engine-level tool call analysis - Implement computeToolCallsDiff() diffing LogMetrics.ToolCalls between runs - Implement computeBashCommandsDiff() for bash-specific analysis (handles generic bash/Bash and per-command bash_* entries from Codex) - Implement isBashTool() helper for bash tool name matching - Update computeRunMetricsDiff() to compute tokens-per-turn and include tool calls diff - Add renderToolCallsDiffPrettySection() and renderBashCommandsDiffPrettySection() - Add renderToolCallsDiffMarkdownSection() and renderBashCommandsDiffMarkdownSection() - Update Run Metrics table in both renderers to show Tokens/turn row - Export ToolCallInfo type alias in pkg/cli/logs_models.go - Add 15 new unit tests covering all new functionality Agent-Logs-Url: https://github.com/github/gh-aw/sessions/dbe42488-aa10-4336-bfeb-170f75f44adf Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…raversal in bash diff - Extract formatMaxSizeCell() helper to remove duplicated max-size formatting in pretty and markdown renderers - Collect bash tools during main iteration in computeToolCallsDiff() so computeBashCommandsDiff() receives pre-filtered maps, avoiding a second traversal - Update tests to pass pre-filtered bash tool maps to computeBashCommandsDiff() directly Agent-Logs-Url: https://github.com/github/gh-aw/sessions/dbe42488-aa10-4336-bfeb-170f75f44adf Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds more granular cost-regression context to audit diff by surfacing per-turn token efficiency and a detailed engine-level tool call breakdown (including bash command detail) in both pretty and markdown output.
Changes:
- Extend
RunMetricsDiffwith tokens-per-turn metrics and a newToolCallsDiff/BashCommandsDiffstructure. - Compute tool call diffs (new/removed/changed/all) and bash command aggregation during diff generation.
- Render new rows/sections in both markdown and pretty output; add comprehensive unit tests.
Show a summary per file
| File | Description |
|---|---|
| pkg/cli/logs_models.go | Exposes ToolCallInfo as a CLI-level alias to the shared workflow type. |
| pkg/cli/audit_diff.go | Adds tokens-per-turn computation plus tool calls + bash command diff computation and new diff structs. |
| pkg/cli/audit_diff_render.go | Renders tokens/turn row and new tool-call + bash-command breakdown sections (pretty + markdown). |
| pkg/cli/audit_diff_test.go | Adds unit tests for tool-call diffs, bash-command diffs, and tokens-per-turn behavior. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comments suppressed due to low confidence (1)
pkg/cli/audit_diff.go:547
- The early-return "no meaningful data" check ignores engine-level tool call metrics. If both runs have 0 tokens/duration/turns and no token/rate-limit details, but
summary*.Metrics.ToolCallsis populated, this function returns nil and the new Tool Call Breakdown / Bash Commands sections can never render. Include tool-call presence in the meaningful-data check (e.g.,hasToolCalls := (m1 != nil && len(m1.ToolCalls) > 0) || (m2 != nil && len(m2.ToolCalls) > 0)), or computecomputeToolCallsDifffirst and consider it in the condition.
// Skip if there is no meaningful data
hasTokenDetails := tu1 != nil || tu2 != nil
hasRateLimitDetails := rl1 != nil || rl2 != nil
if run1Tokens == 0 && run2Tokens == 0 && run1Duration == 0 && run2Duration == 0 && run1Turns == 0 && run2Turns == 0 && !hasTokenDetails && !hasRateLimitDetails {
return nil
- Files reviewed: 4/4 changed files
- Comments generated: 2
| if m1 != nil { | ||
| for _, tc := range m1.ToolCalls { | ||
| run1Tools[tc.Name] = tc | ||
| } | ||
| } | ||
| if m2 != nil { | ||
| for _, tc := range m2.ToolCalls { | ||
| run2Tools[tc.Name] = tc |
There was a problem hiding this comment.
computeToolCallsDiff builds run1Tools/run2Tools by overwriting entries with the same tool name (run1Tools[tc.Name] = tc). In this codebase, LogMetrics.ToolCalls can contain duplicate tool names because metrics are appended across multiple log files (pkg/cli/logs_metrics.go:208-210), and other consumers aggregate duplicates (e.g., buildToolUsageInfo sums counts). This overwrite will undercount call totals and lose max-size info. When inserting into the map, aggregate: sum CallCount, and take the max of MaxInputSize/MaxOutputSize (and any other relevant fields) across duplicates.
| if m1 != nil { | |
| for _, tc := range m1.ToolCalls { | |
| run1Tools[tc.Name] = tc | |
| } | |
| } | |
| if m2 != nil { | |
| for _, tc := range m2.ToolCalls { | |
| run2Tools[tc.Name] = tc | |
| aggregateToolCall := func(tools map[string]ToolCallInfo, tc ToolCallInfo) { | |
| if existing, ok := tools[tc.Name]; ok { | |
| existing.CallCount += tc.CallCount | |
| if tc.MaxInputSize > existing.MaxInputSize { | |
| existing.MaxInputSize = tc.MaxInputSize | |
| } | |
| if tc.MaxOutputSize > existing.MaxOutputSize { | |
| existing.MaxOutputSize = tc.MaxOutputSize | |
| } | |
| tools[tc.Name] = existing | |
| return | |
| } | |
| tools[tc.Name] = tc | |
| } | |
| if m1 != nil { | |
| for _, tc := range m1.ToolCalls { | |
| aggregateToolCall(run1Tools, tc) | |
| } | |
| } | |
| if m2 != nil { | |
| for _, tc := range m2.ToolCalls { | |
| aggregateToolCall(run2Tools, tc) |
| } | ||
| for _, entry := range diff.AllTools { | ||
| change := entry.CallCountChange | ||
| if change == "" && entry.Status == "unchanged" { |
There was a problem hiding this comment.
In the pretty renderer, new/removed tools currently render an empty "Change" cell because CallCountChange is only set for Status == "changed", and the code only substitutes "—" for Status == "unchanged". This leads to inconsistent output vs. the markdown renderer (which shows "—" whenever change is empty) and leaves a blank table cell in the console. Consider using the same rule as markdown here: if change == "" set it to "—" regardless of status.
| if change == "" && entry.Status == "unchanged" { | |
| if change == "" { |
🧪 Test Quality Sentinel ReportTest Quality Score: 90/100✅ Excellent test quality
Test Classification DetailsView all 15 test classifications
Verdict
📖 Understanding Test ClassificationsDesign Tests (High Value) verify what the system does:
Implementation Tests (Low Value) verify how the system does it:
Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators. References: §24940226955
|
There was a problem hiding this comment.
✅ Test Quality Sentinel: 90/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). All 15 tests verify behavioral contracts with good edge case coverage (nil inputs, empty metrics, zero-division guard, capitalization handling, sorting invariants).
…-turn in audit diff Generated by the Design Decision Gate workflow for PR #28494. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Commit pushed:
|
🏗️ Design Decision Gate — ADR RequiredThis PR makes significant changes to core business logic (797 new lines in AI has analyzed the PR diff and generated a draft ADR to help you get started: 📄 Draft ADR: What to do next
Once an ADR is linked in the PR body, this gate will re-run and verify the implementation matches the decision. Why ADRs Matter
ADRs create a searchable, permanent record of why the codebase looks the way it does. Future contributors (and your future self) will thank you. 📋 Michael Nygard ADR Format ReferenceAn ADR must contain these four sections to be considered complete:
All ADRs are stored in
|
|
@copilot review all comments |
…ange cells - In computeToolCallsDiff: aggregate duplicate tool entries (sum CallCount, take max of sizes) instead of overwriting, matching how other consumers handle metrics appended from multiple log files - In renderToolCallsDiffPrettySection: always substitute "—" for empty change cells regardless of status, consistent with markdown renderer - Add TestComputeToolCallsDiff_DuplicateToolNames test Agent-Logs-Url: https://github.com/github/gh-aw/sessions/60183031-f19f-4034-bf21-75f6983738ad Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Both review issues addressed in 874e7a8:
Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
|
✅ Smoke CI completed successfully! |
Comment MemoryNote This comment is managed by comment memory.It stores persistent context for this thread in the code block at the top of this comment. |
- audit.md: document tokens-per-turn and tool calls/bash command breakdown in diff output (PR #28494) - frontmatter.md: add Observability section documenting observability.otlp configuration, object-form OTLP headers (PR #28524), and gen_ai.* span attribute conventions replacing gh-aw.* (PR #28511) - cli.md: update audit diff description to include new output fields (PR #28494) - cache-memory.md: document cache path misconfiguration detection via missing_data signal (PR #28516) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Agents investigating cost regressions in diff mode lacked the granular context needed to pinpoint why token usage changed. This adds precomputed per-turn token efficiency and a full engine-level tool call breakdown — including bash command detail — directly into the diff output.
New data in
RunMetricsDiffTokens per turn — uses effective tokens from the firewall proxy when available, falls back to engine-level count:
Run1TokensPerTurn,Run2TokensPerTurn,TokensPerTurnChangeTool calls diff (
ToolCallsDiff) — sourced fromRunSummary.Metrics.ToolCalls(populated by engine log parsers):AllTools— complete view of every tool type with run1/run2 call counts and changeNewTools/RemovedTools/ChangedTools— bucketed diffsMaxInputSize/MaxOutputSizefor size contextBash-specific breakdown (
BashCommandsDiff):bash,Bash(Claude), andbash_*per-command entries (Codex)New output sections (pretty + markdown)
Implementation notes
computeToolCallsDiffsocomputeBashCommandsDiffreceives pre-filtered maps (no second traversal)formatMaxSizeCell()helper extracted to deduplicate size formatting across pretty and markdown renderersToolCallInfoexported as a type alias inlogs_models.go(alongside existingLogMetricsalias)