[cli-tools-test] Audit tool reports phantom tool usage with inconsistent naming in ToolCalls metrics

### Problem Description

When auditing a workflow run with a failed detection job (where the squid proxy container crashed), the `ToolCalls` metrics in `run_summary.json` contain **126 entries with 68 unique tool names** — but the agent only executed ~11 descriptive shell commands and 1 MCP `noop` call. The metrics include:

- **54 phantom GitHub/safeoutputs tool calls** that the agent never actually invoked
- **22 tools duplicated under two naming conventions** — `github-get_me` (dash) and `github___get_me` (triple-underscore) both appear independently
- **`noop` listed multiple times** with different counts (totaling 7 across entries instead of 1)
- **Shell command labels treated as tool names** — descriptive names like `Check`, `Extract`, `Inspect`, `Load`, `Save`, `Select`, `Update`, `Fetching` appear because they are labels Claude Code assigns to `shell` calls

The `audit` MCP tool surfaces this corrupted data, showing `tool_types=68` in observability insights (implying 68 distinct tool types were used) when in reality only ~2 were used (shell + MCP safeoutputs).

### Reproduction

- **Run**: [§24299193223](https://github.com/github/gh-aw/actions/runs/24299193223) — `GPL Dependency Cleaner (gpclean)`, April 12 2026
- **Root cause event**: Detection job failed because `awf-squid` container exited with code 1 mid-run
- **Tool**: `audit` MCP tool with `run_id_or_url: 24299193223`

### Steps to Reproduce

1. Identify a run where the detection job failed due to a container crash (e.g., squid proxy exit)
2. Call the `audit` tool on that run ID
3. Inspect the `tool_usage` array in the response or `run_summary.json` → `metrics.ToolCalls`

### Expected Behavior

`ToolCalls` should list only tools actually invoked by the agent:
- Shell tool calls (with their descriptive labels, e.g., `Load cache-memory state`, `Check transitive deps`)
- MCP tools (e.g., `noop`)
- Total: ~12 entries for this run

### Actual Behavior

126 entries in `ToolCalls`, including:

```
github-list_branches: 3    github___list_branches: 2   ← same tool, two formats
github-get_me: 3            github___get_me: 2           ← same tool, two formats
safeoutputs-create_issue: 3  safeoutputs___create_issue: 2
noop: 1, noop: 1, noop: 4, noop: 1                      ← duplicate noop entries
Check: 2, Inspect: 2, Extract: 2, Load: 1, Save: 1, Select: 1, Update: 1, Fetching: 1
tool: 2, tool: 1                                         ← word "tool" as a tool name
```

All 22 GitHub API tools and all 5 safeoutputs tools appear in both `github-xxx` and `github___xxx` format, each counted 2-3 times. These tools were **never called** — the agent only did shell commands and one `noop`.

### Hypothesis

The tool call parser scans a log stream that includes both the system prompt (which enumerates all available tools) and the actual agent output. When the detection job crashes partway through, the partial data stream causes:
1. Every available tool to be counted as "called" (from the tool list in the prompt context)
2. The same tool appearing in two formats depending on which log section was scanned
3. Descriptive shell command labels being treated as tool names

Comparison: successful runs from the same session (`24381951145`, `24381981463`) have `ToolCalls: null` in `run_summary.json` and show correct `tool_types=0` in the audit output.

### Impact

- **Severity**: Medium — misleading observability data
- **Frequency**: Reproducible when detection job fails due to container crash
- **User impact**: `audit` reports `tool_types=68` for a simple read-only run, causing incorrect optimization recommendations and inflated resource profiling

### Environment

- **Repository**: github/gh-aw
- **Test Run ID**: [§24382493929](https://github.com/github/gh-aw/actions/runs/24382493929)
- **Affected Run**: [§24299193223](https://github.com/github/gh-aw/actions/runs/24299193223)
- **CLI Version**: `0eb9816`
- **Date**: 2026-04-14

### Suggested Fix

1. Deduplicate `ToolCalls` entries before writing to `run_summary.json`, merging entries with the same base name (normalizing `github___xxx` and `github-xxx` to a canonical form)
2. Add a filter to exclude tool names that appear in the system prompt tool list but have no corresponding invocation event
3. Skip `ToolCalls` population when the detection job exits with a container failure code




> Generated by [Daily CLI Tools Exploratory Tester](https://github.com/github/gh-aw/actions/runs/24382493929/agentic_workflow) · ● 2M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-cli-tools-tester%22&type=issues)
> - [x] expires  on Apr 21, 2026, 5:35 AM UTC

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cli-tools-test] Audit tool reports phantom tool usage with inconsistent naming in ToolCalls metrics #26165

Problem Description

Reproduction

Steps to Reproduce

Expected Behavior

Actual Behavior

Hypothesis

Impact

Environment

Suggested Fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[cli-tools-test] Audit tool reports phantom tool usage with inconsistent naming in ToolCalls metrics #26165

Description

Problem Description

Reproduction

Steps to Reproduce

Expected Behavior

Actual Behavior

Hypothesis

Impact

Environment

Suggested Fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions