[cli-tools-test] Audit tool reports phantom tool usage with inconsistent naming in ToolCalls metrics #26165

@github-actions

Problem Description

When auditing a workflow run with a failed detection job (where the squid proxy container crashed), the ToolCalls metrics in run_summary.json contain 126 entries with 68 unique tool names — but the agent only executed ~11 descriptive shell commands and 1 MCP noop call. The metrics include:

  • 54 phantom GitHub/safeoutputs tool calls that the agent never actually invoked
  • 22 tools duplicated under two naming conventions: github-get_me (dash) and github___get_me (triple-underscore) both appear independently
  • noop listed multiple times with different counts (totaling 7 across entries instead of 1)
  • Shell command labels treated as tool names — descriptive names like Check, Extract, Inspect, Load, Save, Select, Update, Fetching appear because they are labels Claude Code assigns to shell calls

The audit MCP tool surfaces this corrupted data, showing tool_types=68 in observability insights (implying 68 distinct tool types were used) when in reality only ~2 were used (shell + MCP safeoutputs).

Reproduction

  • Run: §24299193223, GPL Dependency Cleaner (gpclean), April 12 2026
  • Root cause event: Detection job failed because awf-squid container exited with code 1 mid-run
  • Tool: audit MCP tool with run_id_or_url: 24299193223

Steps to Reproduce

  1. Identify a run where the detection job failed due to a container crash (e.g., squid proxy exit)
  2. Call the audit tool on that run ID
  3. Inspect the tool_usage array in the response, or metrics.ToolCalls in run_summary.json
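
Step 3 can be scripted. The following is a minimal sketch, not the audit tool's own code. It assumes each ToolCalls entry is an object with hypothetical `Name` and `CallCount` fields; only the `metrics.ToolCalls` path comes from this report, so adjust the field names to the real schema.

```python
from collections import Counter


def summarize_tool_calls(summary: dict) -> dict:
    """Summarize metrics.ToolCalls from a parsed run_summary.json.

    Assumption: each entry looks like {"Name": str, "CallCount": int}.
    A null or missing ToolCalls (as seen in successful runs) is treated as empty.
    """
    entries = (summary.get("metrics") or {}).get("ToolCalls") or []
    per_name = Counter(entry["Name"] for entry in entries)
    return {
        "entries": len(entries),
        "unique_names": len(per_name),
        # Names that appear in more than one entry, e.g. repeated noop rows.
        "repeated_names": sorted(n for n, c in per_name.items() if c > 1),
        "total_calls": sum(entry.get("CallCount", 0) for entry in entries),
    }


if __name__ == "__main__":
    import json
    import sys

    # Usage: python summarize.py path/to/run_summary.json
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(json.dumps(summarize_tool_calls(json.load(handle)), indent=2))
```

For the run described here, `entries` and `unique_names` would be expected to come out as 126 and 68 if the assumed schema matches.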

Expected Behavior

ToolCalls should list only tools actually invoked by the agent:

  • Shell tool calls (with their descriptive labels, e.g., Load cache-memory state, Check transitive deps)
  • MCP tools (e.g., noop)
  • Total: ~12 entries for this run

Actual Behavior

126 entries in ToolCalls, including:

github-list_branches: 3    github___list_branches: 2   ← same tool, two formats
github-get_me: 3            github___get_me: 2           ← same tool, two formats
safeoutputs-create_issue: 3  safeoutputs___create_issue: 2
noop: 1, noop: 1, noop: 4, noop: 1                      ← duplicate noop entries
Check: 2, Inspect: 2, Extract: 2, Load: 1, Save: 1, Select: 1, Update: 1, Fetching: 1
tool: 2, tool: 1                                         ← word "tool" as a tool name

All 22 GitHub API tools and all 5 safeoutputs tools appear in both the dash format (github-xxx, safeoutputs-xxx) and the triple-underscore format (github___xxx, safeoutputs___xxx), each counted 2-3 times. These tools were never called; the agent only ran shell commands and one noop.

Hypothesis

The tool call parser scans a log stream that includes both the system prompt (which enumerates all available tools) and the actual agent output. When the detection job crashes partway through, the partial data stream causes:

  1. Every available tool to be counted as "called" (from the tool list in the prompt context)
  2. The same tool appearing in two formats depending on which log section was scanned
  3. Descriptive shell command labels being treated as tool names
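
The hypothesis can be illustrated with a toy log. This sketch does not use the real log format: the JSON-lines layout, the `system` and `tool_use` event types, and the field names are all assumptions made for illustration. It contrasts a parser that counts every textual mention of a tool name with one that counts only invocation events.

```python
import json
import re
from collections import Counter

# Hypothetical log: one JSON object per line. The last line is cut off,
# as might happen when a container exits mid-run.
LOG_LINES = [
    json.dumps({"type": "system",
                "tools": ["github-get_me", "github-list_branches", "noop"]}),
    json.dumps({"type": "tool_use", "name": "Bash",
                "description": "Check transitive deps"}),
    json.dumps({"type": "tool_use", "name": "noop"}),
    '{"type": "tool_use", "na',
]


def naive_count(lines, known_tools):
    """Count every textual mention of a known tool name, prompt included."""
    counts = Counter()
    for line in lines:
        for tool in known_tools:
            counts[tool] += len(re.findall(re.escape(tool), line))
    return +counts  # unary plus drops zero-count entries


def event_count(lines):
    """Count only well-formed tool_use events; skip truncated lines."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # partial line from an interrupted stream
        if isinstance(event, dict) and event.get("type") == "tool_use":
            counts[event["name"]] += 1
    return counts
```

With this toy input, the naive scan reports the two GitHub tools as used even though they appear only in the tool list, while the event-based count reports only `Bash` and `noop`.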

Comparison: successful runs from the same session (24381951145, 24381981463) have ToolCalls: null in run_summary.json and show correct tool_types=0 in the audit output.

Impact

  • Severity: Medium — misleading observability data
  • Frequency: Reproducible when detection job fails due to container crash
  • User impact: audit reports tool_types=68 for a simple read-only run, causing incorrect optimization recommendations and inflated resource profiling

Environment

Suggested Fix

  1. Deduplicate ToolCalls entries before writing to run_summary.json, merging entries with the same base name (normalizing github___xxx and github-xxx to a canonical form)
  2. Add a filter to exclude tool names that appear in the system prompt tool list but have no corresponding invocation event
  3. Skip ToolCalls population when the detection job exits with a container failure code
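
The three suggestions could be combined roughly as follows. This is a sketch under stated assumptions, not the project's implementation: entries are modeled as `(name, count)` pairs, `invoked_names` stands for whatever set of names has a real invocation event, and `container_failed` is a hypothetical flag derived from the detection job's exit status.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def canonical_name(name: str) -> str:
    """Normalize server___tool to the dash form server-tool."""
    return name.replace("___", "-", 1)


def clean_tool_calls(
    entries: Iterable[Tuple[str, int]],
    invoked_names: Iterable[str],
    container_failed: bool = False,
) -> Optional[Dict[str, int]]:
    """Apply the three suggested fixes to raw ToolCalls entries.

    Returns None (written as null) when the container failed, mirroring
    what successful runs in this report show for ToolCalls.
    """
    if container_failed:
        return None  # fix 3: skip population entirely

    invoked = {canonical_name(name) for name in invoked_names}
    merged: Counter = Counter()
    for name, count in entries:
        key = canonical_name(name)  # fix 1: one canonical form per tool
        if key in invoked:          # fix 2: drop names with no invocation
            merged[key] += count
    return dict(merged)
```

Note that merging sums the counts it is given, so any inflation already present in the inputs (such as noop totaling 7 instead of 1) survives fix 1 alone; taking counts directly from invocation events would avoid that.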

Generated by Daily CLI Tools Exploratory Tester

  • expires on Apr 21, 2026, 5:35 AM UTC
