diff --git a/.github/aw/debug-agentic-workflow.md b/.github/aw/debug-agentic-workflow.md index 8c301248ce8..1fbea8e92c8 100644 --- a/.github/aw/debug-agentic-workflow.md +++ b/.github/aw/debug-agentic-workflow.md @@ -62,6 +62,7 @@ Report back with specific findings and actionable fixes. - `gh aw run ` → run a workflow (requires workflow_dispatch trigger) - `gh aw logs [workflow-name] --json` → download and analyze workflow logs with JSON output - `gh aw audit --json` → investigate a specific run with JSON output +- `gh aw audit [...] --json` → diff two or more runs to detect regressions (firewall, MCP, metrics) - `gh aw status` → show status of agentic workflows in the repository > [!IMPORTANT] @@ -91,7 +92,7 @@ Report back with specific findings and actionable fixes. > - `status` tool → equivalent to `gh aw status` > - `compile` tool → equivalent to `gh aw compile` > - `logs` tool → equivalent to `gh aw logs` -> - `audit` tool → equivalent to `gh aw audit` +> - `audit` tool → equivalent to `gh aw audit` (single run: audit report; multiple run IDs: diff mode) > - `checks` tool → equivalent to `gh aw checks` > - `update` tool → equivalent to `gh aw update` > - `add` tool → equivalent to `gh aw add` @@ -183,6 +184,18 @@ When the user provides a workflow run URL (e.g., `https://github.com/github/gh-a - Provides comprehensive JSON analysis - Stores artifacts in `logs/run-/` for offline inspection - Reports missing tools, errors, and execution metrics + + **Comparing two runs (regression detection)**: + Pass a second run ID to produce a diff — no separate `audit diff` command needed: + ```bash + gh aw audit --json + # Or compare base against multiple runs at once: + gh aw audit --json + ``` + Or via the `agentic-workflows` tool: + ``` + Use the audit tool with run_ids_or_urls: ["", ""] + ``` 3. 
**Analyze Missing Tools** diff --git a/debug.md b/debug.md index 1889c663fdb..213b7dec376 100644 --- a/debug.md +++ b/debug.md @@ -127,6 +127,10 @@ gh aw logs # Audit a specific workflow run gh aw audit +# Diff two or more workflow runs (multi-run diff mode) +gh aw audit +gh aw audit + # Compile workflows after fixing gh aw compile @@ -137,5 +141,6 @@ gh aw status ## Key Debugging Commands - `gh aw audit --json` → Detailed run analysis with missing tools and errors +- `gh aw audit --json` → Diff two runs to detect regressions (firewall, MCP, metrics) - `gh aw logs --json` → Download and analyze recent workflow logs - `gh aw compile --strict` → Validate workflow with strict security checks diff --git a/docs/adr/28483-unify-audit-multi-run-diff-into-main-command.md b/docs/adr/28483-unify-audit-multi-run-diff-into-main-command.md new file mode 100644 index 00000000000..671b57e8bea --- /dev/null +++ b/docs/adr/28483-unify-audit-multi-run-diff-into-main-command.md @@ -0,0 +1,80 @@ +# ADR-28483: Unify Multi-Run Diff Mode into the Main `audit` Command + +**Date**: 2026-04-25 +**Status**: Draft +**Deciders**: pelikhan, Copilot + +--- + +## Part 1 — Narrative (Human-Friendly) + +### Context + +The `gh aw audit` command previously accepted exactly one run ID or URL and produced a single-run audit report. Comparing two runs required invoking a separate subcommand, `audit diff`, which users had to discover independently. The same limitation existed in the MCP tool wrapper, which exposed `audit` only via a single `run_id_or_url` string field. This two-entry-point design created a discoverability gap: agents and users performing regression detection had to know that multi-run comparison was a distinct subcommand rather than a natural extension of `audit`. + +### Decision + +We will unify multi-run diff mode into the main `audit` command by changing its argument signature from `ExactArgs(1)` to `MinimumNArgs(1)`. 
When exactly one argument is provided the command behaves as before; when two or more are provided the first is treated as the base run and the remaining arguments are compared against it, delegating to the existing `RunAuditDiff` implementation. The `audit diff` subcommand will be hidden (`Hidden: true`) and retained only for backward compatibility. In the MCP tool wrapper, we will add a new `run_ids_or_urls []string` field as the preferred input, while keeping the old `run_id_or_url string` field as a deprecated fallback.
+
+### Alternatives Considered
+
+#### Alternative 1: Keep `audit diff` as the Primary Interface, Improve Documentation
+
+The status quo could be preserved and discoverability improved through documentation updates and help text alone. This was rejected because documentation cannot help agents that parse command output programmatically, and it would not simplify the MCP tool schema. The fundamental UX problem — that comparison requires a different command — would remain.
+
+#### Alternative 2: Add a `--compare` Flag to `audit`
+
+A flag-based approach (e.g., `gh aw audit 12345 --compare 12346`) would keep the argument list unambiguous (first positional arg is always the base). This was rejected because it is more verbose and less natural when comparing against multiple runs. Positional arguments are consistent with how `audit diff` already worked, so migration is straightforward for existing users and scripts.
+
+### Consequences
+
+#### Positive
+- Users and agents have a single entry point for all audit use cases; no need to remember `audit diff`.
+- The MCP tool schema gains a typed `run_ids_or_urls` array that makes multi-run diff intent explicit.
+- Validation logic (self-comparison rejection, duplicate ID rejection, invalid ID rejection) is shared between the subcommand and the new path.
+
+#### Negative
+- The `audit diff` subcommand must be kept indefinitely as a hidden backward-compatibility alias, adding maintenance surface.
+- The `--parse` flag silently becomes a no-op in multi-run mode, which is a subtle inconsistency that may surprise users who upgrade from single-run workflows.
+- Agent instruction files and documentation required a sweep to replace `audit diff ` with `audit `.
+
+#### Neutral
+- The error envelope returned by the MCP tool was updated to use `run_ids_or_urls` (array) instead of `run_id_or_url` (string), which is a breaking change for any consumer that inspects the error structure. Callers relying on the old field name will need to update.
+- Test coverage for the new `runAuditMulti` function and MCP tool multi-run path was added in the same PR.
+
+---
+
+## Part 2 — Normative Specification (RFC 2119)
+
+> The key words **MUST**, **MUST NOT**, **REQUIRED**, **SHALL**, **SHALL NOT**, **SHOULD**, **SHOULD NOT**, **RECOMMENDED**, **MAY**, and **OPTIONAL** in this section are to be interpreted as described in [RFC 2119](https://www.rfc-editor.org/rfc/rfc2119).
+
+### CLI Command Signature
+
+1. The `audit` command **MUST** accept one or more positional arguments, each being a numeric run ID or a supported GitHub Actions URL format.
+2. When exactly one argument is provided, the command **MUST** produce a single-run audit report identical in structure to the previous behavior.
+3. When two or more arguments are provided, the command **MUST** treat the first argument as the base run and all subsequent arguments as comparison runs, delegating to the multi-run diff implementation.
+4. The command **MUST NOT** accept a self-comparison (base run ID equal to any comparison run ID) and **MUST** return a descriptive error in that case.
+5. The command **MUST NOT** accept duplicate comparison run IDs and **MUST** return a descriptive error in that case.
+6. The `--parse` flag **MUST** be accepted in multi-run mode but **SHOULD** be documented as a no-op; implementations **MUST NOT** fail if `--parse` is passed alongside multiple run IDs.
+
+### `audit diff` Subcommand
+
+1. The `audit diff` subcommand **MUST** remain present in the CLI binary and **MUST** continue to function as before.
+2. The `audit diff` subcommand **MUST** be hidden from help output (`Hidden: true`) to discourage new usage.
+3. The `audit diff` subcommand **MUST NOT** be removed in any release that does not provide a documented migration path.
+
+### MCP Tool Schema
+
+1. The MCP `audit` tool **MUST** accept a `run_ids_or_urls` field of type `[]string` as the primary input.
+2. The MCP `audit` tool **MUST** accept the deprecated `run_id_or_url` field of type `string` as a fallback when `run_ids_or_urls` is absent or empty.
+3. When both fields are provided, `run_ids_or_urls` **MUST** take precedence.
+4. The tool **MUST** return an MCP `InvalidParams` error when neither field provides at least one run ID.
+5. Error envelopes returned by the tool **MUST** include a `run_ids_or_urls` array field (not `run_id_or_url`) reflecting the resolved list of run IDs that were attempted.
+
+### Conformance
+
+An implementation is considered conformant with this ADR if it satisfies all **MUST** and **MUST NOT** requirements above. Failure to meet any **MUST** or **MUST NOT** requirement constitutes non-conformance.
+
+---
+
+*This is a DRAFT ADR generated by the [Design Decision Gate](https://github.com/github/gh-aw/actions/runs/24936666718) workflow.
The PR author must review, complete, and finalize this document before the PR can merge.* diff --git a/docs/src/content/docs/guides/audit-with-agents.md b/docs/src/content/docs/guides/audit-with-agents.md index 2fac6bdabe6..191680439ce 100644 --- a/docs/src/content/docs/guides/audit-with-agents.md +++ b/docs/src/content/docs/guides/audit-with-agents.md @@ -9,7 +9,7 @@ When running locally, all three audit commands accept `--json` to write structur | --------- | ---------- | | `gh aw audit --json` | Single run — `key_findings`, `recommendations`, `metrics` | | `gh aw logs [workflow] --last 10 --json` | Trend analysis — `per_run_breakdown`, `domain_inventory` | -| `gh aw audit diff --json` | Before/after — `run_metrics_diff`, `firewall_diff` | +| `gh aw audit --json` | Before/after — `run_metrics_diff`, `firewall_diff` | Inside GitHub Actions workflows, agents access these commands through the `agentic-workflows` MCP tool rather than calling the CLI directly. @@ -65,7 +65,7 @@ permissions: # Regression Detection -Use the `agentic-workflows` MCP tool `audit diff` with base run ID ${{ inputs.base_run_id }} and current run ID ${{ inputs.current_run_id }}. Check for new blocked domains, increased MCP error rates, cost increase > 20%, or token usage increase > 50%. If regressions are found, open a GitHub issue with a table from `run_metrics_diff`, affected domains from `firewall_diff`, and affected MCP tools from `mcp_tools_diff`. +Use the `agentic-workflows` MCP tool `audit` with run IDs ${{ inputs.base_run_id }} and ${{ inputs.current_run_id }} to compare the two runs. Check for new blocked domains, increased MCP error rates, cost increase > 20%, or token usage increase > 50%. If regressions are found, open a GitHub issue with a table from `run_metrics_diff`, affected domains from `firewall_diff`, and affected MCP tools from `mcp_tools_diff`. 
``` ## Filing issues from audit findings diff --git a/docs/src/content/docs/guides/maintaining-repos.md b/docs/src/content/docs/guides/maintaining-repos.md index 39a414ebfa8..aafaa0e6c04 100644 --- a/docs/src/content/docs/guides/maintaining-repos.md +++ b/docs/src/content/docs/guides/maintaining-repos.md @@ -175,7 +175,7 @@ gh aw logs --filtered-integrity # only runs with DIFC-filtered events **Compare two runs for regressions:** ```bash -gh aw audit diff BASELINE_ID CURRENT_ID +gh aw audit BASELINE_ID CURRENT_ID ``` ### Common Failure Patterns @@ -195,7 +195,7 @@ gh aw audit diff BASELINE_ID CURRENT_ID 2. Run `gh aw audit RUN_ID` for a structured breakdown. 3. For complex issues, use `/agent agentic-workflows` in Copilot Chat. 4. Edit the `.md` file → run `gh aw compile` to validate → trigger a new run. -5. Compare the new run against the baseline with `gh aw audit diff`. +5. Compare the new run against the baseline with `gh aw audit BASELINE_ID NEW_ID`. ## Related Documentation diff --git a/docs/src/content/docs/patterns/monitoring.md b/docs/src/content/docs/patterns/monitoring.md index 3717203e101..c0f1072bca4 100644 --- a/docs/src/content/docs/patterns/monitoring.md +++ b/docs/src/content/docs/patterns/monitoring.md @@ -114,7 +114,7 @@ Use `gh aw status` to see which workflows are enabled and their latest run state For deeper investigation, the audit commands are the primary monitoring tool for agentic workflows: - `gh aw audit ` — single-run report with tool usage, MCP failures, firewall activity, and cost metrics -- `gh aw audit diff ` — compare two runs to detect behavioral regressions or new network accesses +- `gh aw audit ` — compare two runs to detect behavioral regressions or new network accesses (pass additional IDs to compare base against multiple runs) - `gh aw logs --format markdown [workflow]` — cross-run security and performance report for trend monitoring ```bash @@ -122,7 +122,10 @@ For deeper investigation, the audit commands are the primary 
monitoring tool for gh aw audit 12345678 # Compare two runs for regressions -gh aw audit diff 12345678 12345679 +gh aw audit 12345678 12345679 + +# Compare base against multiple runs at once +gh aw audit 12345678 12345679 12345680 # Trend report across the last 10 runs of a workflow gh aw logs my-workflow --format markdown --count 10 diff --git a/docs/src/content/docs/reference/audit.md b/docs/src/content/docs/reference/audit.md index 8a347901ff8..ca7fd702ebf 100644 --- a/docs/src/content/docs/reference/audit.md +++ b/docs/src/content/docs/reference/audit.md @@ -7,17 +7,18 @@ sidebar: The `gh aw audit` commands download workflow run artifacts and logs, analyze MCP tool usage and network behavior, and produce structured reports suited for security reviews, debugging, and feeding to AI agents. -## `gh aw audit ` +## `gh aw audit [...]` -Audit a single workflow run and generate a detailed Markdown report. +Audit one or more workflow runs. When a single run is provided, a detailed Markdown report is generated. When two or more runs are provided, the first is used as the base (reference) run and the remaining runs are compared against it, producing a diff report. **Arguments:** | Argument | Description | |----------|-------------| | `` | A numeric run ID, GitHub Actions run URL, job URL, or job URL with step anchor | +| `[...]` | Additional run IDs or URLs to compare against the first (diff mode) | -**Accepted input formats:** +**Accepted input formats (per argument):** - Numeric run ID: `1234567890` - Run URL: `https://github.com/owner/repo/actions/runs/1234567890` @@ -26,7 +27,11 @@ Audit a single workflow run and generate a detailed Markdown report. - Short run URL: `https://github.com/owner/repo/runs/1234567890` - GitHub Enterprise URLs using the same formats above -When a job URL is provided without a step anchor, the command extracts the output of the first failing step. When a step anchor is included, it extracts that specific step. 
+When a job URL is provided without a step anchor (single-run mode), the command extracts the output of the first failing step. When a step anchor is included, it extracts that specific step. + +In diff mode, job URLs and step-anchored URLs are accepted for any argument — the job/step specificity is silently normalized to the parent run ID, so it is always a run-level diff. + +Self-comparisons and duplicate run IDs are rejected when using diff mode. **Flags:** @@ -34,11 +39,12 @@ When a job URL is provided without a step anchor, the command extracts the outpu |------|---------|-------------| | `-o, --output ` | `./logs` | Directory to write downloaded artifacts and report files | | `--json` | off | Output report as JSON to stdout | -| `--parse` | off | Run JavaScript parsers on agent and firewall logs, writing `log.md` and `firewall.md` | +| `--parse` | off | Run JavaScript parsers on agent and firewall logs, writing `log.md` and `firewall.md` (single-run only) | | `--repo ` | auto | Specify repository when the run ID is not from a URL | | `--verbose` | off | Print detailed progress information | +| `--format ` | `pretty` | Diff output format: `pretty` or `markdown` (multi-run only) | -**Examples:** +**Single-run examples:** ```bash gh aw audit 1234567890 @@ -49,38 +55,24 @@ gh aw audit 1234567890 -o ./audit-reports gh aw audit 1234567890 --repo owner/repo ``` -**Report sections** (rendered in Markdown or JSON): Overview, Comparison, Task/Domain, Behavior Fingerprint, Agentic Assessments, Metrics, Key Findings, Recommendations, Observability Insights, Performance Metrics, Engine Config, Prompt Analysis, Session Analysis, Safe Output Summary, MCP Server Health, Jobs, Downloaded Files, Missing Tools, Missing Data, Noops, MCP Failures, Firewall Analysis, Policy Analysis, Redacted Domains, Errors, Warnings, Tool Usage, MCP Tool Usage, Created Items. 
+**Multi-run diff examples:** + +```bash +gh aw audit 12345 12346 # Compare two runs +gh aw audit 12345 12346 12347 12348 # Compare base against 3 runs +gh aw audit 12345 12346 --format markdown # Markdown output for PR comments +gh aw audit 12345 12346 --json # JSON for CI integration +gh aw audit 12345 12346 --repo owner/repo # Specify repository +``` + +**Single-run report sections** (rendered in Markdown or JSON): Overview, Comparison, Task/Domain, Behavior Fingerprint, Agentic Assessments, Metrics, Key Findings, Recommendations, Observability Insights, Performance Metrics, Engine Config, Prompt Analysis, Session Analysis, Safe Output Summary, MCP Server Health, Jobs, Downloaded Files, Missing Tools, Missing Data, Noops, MCP Failures, Firewall Analysis, Policy Analysis, Redacted Domains, Errors, Warnings, Tool Usage, MCP Tool Usage, Created Items. The Metrics section includes an `ambient_context` object when available. Ambient context captures the first LLM inference footprint for the run: - `ambient_context.input_tokens` — input tokens for the first invocation - `ambient_context.cached_tokens` — cache-read tokens reused by the first invocation - `ambient_context.effective_tokens` — `input_tokens + cached_tokens` -## `gh aw audit diff [...]` - -Compare behavior between workflow runs. Detects policy regressions, new unauthorized domains, behavioral drift, and changes in MCP tool usage or run metrics. - -**Arguments:** - -| Argument | Description | -|----------|-------------| -| `` | Numeric run ID for the baseline run | -| `` | Numeric run ID for the comparison run | -| `[...]` | Additional run IDs to compare against the same base | - -The base run is downloaded once and reused when multiple comparison runs are provided. Self-comparisons and duplicate run IDs are rejected. 
- -**Flags:** - -| Flag | Default | Description | -|------|---------|-------------| -| `--format ` | `pretty` | Output format: `pretty` or `markdown` | -| `--json` | off | Output diff as JSON | -| `--repo ` | auto | Specify repository | -| `-o, --output ` | `./logs` | Directory for downloaded artifacts | -| `--verbose` | off | Print detailed progress | - -The diff output includes: +**Diff output** includes: - New and removed network domains - Domain status changes (allowed ↔ denied) - Volume changes (request count changes above a 100% threshold) @@ -89,20 +81,10 @@ The diff output includes: - Run metrics comparison (token usage, duration, turns) - Token usage breakdown: input tokens, output tokens, cache read/write tokens, effective tokens, total API requests, and cache efficiency per run -**Output behavior with multiple comparisons:** +**Diff output behavior with multiple comparisons:** - `--json` outputs a single object for one comparison, or an array for multiple - `--format pretty` and `--format markdown` separate multiple diffs with dividers -**Examples:** - -```bash -gh aw audit diff 12345 12346 -gh aw audit diff 12345 12346 12347 12348 -gh aw audit diff 12345 12346 --format markdown -gh aw audit diff 12345 12346 --json -gh aw audit diff 12345 12346 --repo owner/repo -``` - ## `gh aw logs --format ` Generate a cross-run security and performance audit report across multiple recent workflow runs. diff --git a/docs/src/content/docs/reference/glossary.md b/docs/src/content/docs/reference/glossary.md index 02b49deea47..dd410c7d269 100644 --- a/docs/src/content/docs/reference/glossary.md +++ b/docs/src/content/docs/reference/glossary.md @@ -429,9 +429,9 @@ An interactive web-based editor for authoring, compiling, and previewing agentic A CLI command that downloads workflow run artifacts and logs, analyzes MCP tool usage and network behavior, and generates a structured Markdown or JSON report. 
The report covers failure analysis, tool usage, MCP server status, firewall activity, token/cost metrics, behavior fingerprint, and safe-output summary. Accepts a numeric run ID or any GitHub Actions run or job URL. See [Audit Commands](/gh-aw/reference/audit/). -### Audit Diff (`gh aw audit diff`) +### Audit Diff (multi-run mode) -A `gh aw audit` subcommand that compares behavior across two workflow runs across firewall, MCP tool usage, and run metrics dimensions. Reports domain additions and removals, allowed/denied status changes, request volume drift, and anomaly flags. Useful for detecting regressions and behavioral drift between runs. See [Audit Commands](/gh-aw/reference/audit/#gh-aw-audit-diff-base-run-id-comparison-run-id-comparison-run-id). +Passing two or more run IDs to `gh aw audit` activates diff mode: the first ID is the base and the rest are compared against it. Reports domain additions and removals, allowed/denied status changes, request volume drift, and anomaly flags across firewall, MCP tool usage, and run metrics dimensions. Useful for detecting regressions and behavioral drift between runs. See [Audit Commands](/gh-aw/reference/audit/). ### Behavior Fingerprint @@ -462,7 +462,7 @@ The token footprint of the first LLM invocation in a workflow run, used as a pro ### Firewall Analysis -A section of the `gh aw audit` report that breaks down all network requests made during a workflow run — showing allowed domains, denied domains, request volumes, and policy attribution. Derived from AWF firewall logs. Use `gh aw audit diff` to compare firewall behavior across runs and identify new or removed domain accesses. See [Audit Commands](/gh-aw/reference/audit/) and [Network Permissions](/gh-aw/reference/network/). +A section of the `gh aw audit` report that breaks down all network requests made during a workflow run — showing allowed domains, denied domains, request volumes, and policy attribution. Derived from AWF firewall logs. 
Pass multiple run IDs to `gh aw audit` (e.g. `gh aw audit `) to compare firewall behavior across runs and identify new or removed domain accesses. See [Audit Commands](/gh-aw/reference/audit/) and [Network Permissions](/gh-aw/reference/network/). ### Frontmatter Hash diff --git a/docs/src/content/docs/reference/mcp-gateway.md b/docs/src/content/docs/reference/mcp-gateway.md index 6d8ec34d995..08e2a222d26 100644 --- a/docs/src/content/docs/reference/mcp-gateway.md +++ b/docs/src/content/docs/reference/mcp-gateway.md @@ -1272,7 +1272,7 @@ The gateway SHOULD: 5. Update readiness based on critical server status > [!TIP] -> To inspect MCP server health for a specific workflow run at runtime, use `gh aw audit `. The **MCP Server Health** section of the audit report shows connection failures, timeout errors, tool call counts, and error rates per server — providing a post-run view of gateway behavior. For recurring MCP failures, `gh aw audit diff` compares MCP tool usage between two runs to identify regressions. See [Audit Commands](/gh-aw/reference/audit/). +> To inspect MCP server health for a specific workflow run at runtime, use `gh aw audit `. The **MCP Server Health** section of the audit report shows connection failures, timeout errors, tool call counts, and error rates per server — providing a post-run view of gateway behavior. For recurring MCP failures, pass two run IDs to `gh aw audit` (e.g. `gh aw audit `) to compare MCP tool usage between runs and identify regressions. See [Audit Commands](/gh-aw/reference/audit/). --- diff --git a/docs/src/content/docs/reference/network.md b/docs/src/content/docs/reference/network.md index 130954d3f5d..a864323a608 100644 --- a/docs/src/content/docs/reference/network.md +++ b/docs/src/content/docs/reference/network.md @@ -304,14 +304,14 @@ If you encounter network access blocked errors, verify that required domains or Use `gh aw logs --run-id ` to view firewall activity and identify blocked domains. 
See the [Network Configuration Guide](/gh-aw/guides/network-configuration/#troubleshooting-firewall-blocking) for detailed troubleshooting steps and common solutions. -To understand domain allow/block behavior in detail, use `gh aw audit ` — the **Firewall Analysis** section of the report lists every domain request, its allowed or denied status, request volume, and policy attribution. To compare firewall behavior between two runs and spot new or removed domain accesses, use `gh aw audit diff`: +To understand domain allow/block behavior in detail, use `gh aw audit ` — the **Firewall Analysis** section of the report lists every domain request, its allowed or denied status, request volume, and policy attribution. To compare firewall behavior between two runs and spot new or removed domain accesses, pass both run IDs to `audit`: ```bash # Inspect firewall activity for a single run gh aw audit 12345678 # Compare firewall behavior between two runs -gh aw audit diff 12345678 12345679 +gh aw audit 12345678 12345679 ``` See [Audit Commands](/gh-aw/reference/audit/) for full documentation. diff --git a/docs/src/content/docs/setup/cli.md b/docs/src/content/docs/setup/cli.md index 9cf13e59a34..200c41e1ba6 100644 --- a/docs/src/content/docs/setup/cli.md +++ b/docs/src/content/docs/setup/cli.md @@ -441,15 +441,16 @@ Logs are saved to `logs/run-{id}/` with filenames indicating the extraction leve | **Jobs** | Status of each GitHub Actions job in the run | | **Artifacts** | Downloaded artifacts and their contents | -##### `audit diff` +##### Multi-run diff mode -Compare behavior between two workflow runs to detect policy regressions, new unauthorized domains, behavioral drift, and changes in MCP tool usage or run metrics. +Compare behavior between two or more workflow runs to detect policy regressions, new unauthorized domains, behavioral drift, and changes in MCP tool usage or run metrics. 
Pass multiple run IDs directly to `audit` — the first is the base, the rest are comparisons: ```bash wrap -gh aw audit diff 12345 12346 # Compare two runs -gh aw audit diff 12345 12346 --format markdown # Markdown output for PR comments -gh aw audit diff 12345 12346 --json # JSON for CI integration -gh aw audit diff 12345 12346 --repo owner/repo # Specify repository +gh aw audit 12345 12346 # Compare two runs +gh aw audit 12345 12346 12347 12348 # Compare base against 3 runs +gh aw audit 12345 12346 --format markdown # Markdown output for PR comments +gh aw audit 12345 12346 --json # JSON for CI integration +gh aw audit 12345 12346 --repo owner/repo # Specify repository ``` The diff output shows: new or removed network domains, status changes (allowed ↔ denied), volume changes (>100% threshold), MCP tool invocation changes, and run metric comparisons (token usage, duration, turns). diff --git a/docs/src/content/docs/troubleshooting/debugging.md b/docs/src/content/docs/troubleshooting/debugging.md index c0fb102bfcc..13c6b3d464a 100644 --- a/docs/src/content/docs/troubleshooting/debugging.md +++ b/docs/src/content/docs/troubleshooting/debugging.md @@ -105,11 +105,11 @@ Audit output includes: - **Token/cost metrics** — per-run inference spend and token usage - **Safe-outputs** — structured outputs the agent produced -To compare behavior between two runs and detect regressions across firewall, MCP, and metrics dimensions, use `audit diff`: +To compare behavior between two runs and detect regressions across firewall, MCP, and metrics dimensions, pass multiple run IDs directly to `audit`: ```bash -gh aw audit diff 12345678 12345679 -gh aw audit diff 12345678 12345679 --format markdown +gh aw audit 12345678 12345679 +gh aw audit 12345678 12345679 --format markdown ``` For security and performance trends across multiple runs, use `gh aw logs --format`: diff --git a/pkg/cli/audit.go b/pkg/cli/audit.go index 5f50a4a87f2..46e71eab35d 100644 --- a/pkg/cli/audit.go +++ 
b/pkg/cli/audit.go @@ -27,12 +27,16 @@ var auditLog = logger.New("cli:audit") // NewAuditCommand creates the audit command func NewAuditCommand() *cobra.Command { cmd := &cobra.Command{ - Use: "audit ", + Use: "audit [run-id-or-url...]", Short: "Audit a workflow run and generate a detailed report", - Long: `Audit a single workflow run by downloading artifacts and logs, detecting errors, -analyzing MCP tool usage, and generating a concise Markdown report suitable for AI agents. + Long: `Audit one or more workflow runs by downloading artifacts and logs, detecting errors, +analyzing MCP tool usage, and generating a concise report suitable for AI agents. -This command accepts: +When a single run is provided, generates a detailed Markdown report for that run. +When two or more runs are provided, the first is used as the base (reference) and the +remaining runs are compared against it, producing a diff report. + +Each argument accepts: - A numeric run ID (e.g., 1234567890) - A GitHub Actions run URL (e.g., https://github.com/owner/repo/actions/runs/1234567890) - A GitHub Actions job URL (e.g., https://github.com/owner/repo/actions/runs/1234567890/job/9876543210) @@ -40,33 +44,27 @@ This command accepts: - A GitHub workflow run URL (e.g., https://github.com/owner/repo/runs/1234567890) - GitHub Enterprise URLs (e.g., https://github.example.com/owner/repo/actions/runs/1234567890) -When a job URL is provided: +When a job URL is provided (single-run mode only): - If a step number is included (#step:7:1), extracts that specific step's output - If no step number, finds and extracts the first failing step's output - Saves job logs to the output directory Examples: - ` + string(constants.CLIExtensionPrefix) + ` audit 1234567890 # Audit run with ID 1234567890 + ` + string(constants.CLIExtensionPrefix) + ` audit 1234567890 # Audit run with ID 1234567890 ` + string(constants.CLIExtensionPrefix) + ` audit https://github.com/owner/repo/actions/runs/1234567890 # Audit from run URL ` + 
string(constants.CLIExtensionPrefix) + ` audit https://github.com/owner/repo/actions/runs/1234567890/job/9876543210 # Audit job and extract first failing step ` + string(constants.CLIExtensionPrefix) + ` audit https://github.com/owner/repo/actions/runs/1234567890/job/9876543210#step:7:1 # Extract step 7 output ` + string(constants.CLIExtensionPrefix) + ` audit https://github.com/owner/repo/runs/1234567890 # Audit from workflow run URL ` + string(constants.CLIExtensionPrefix) + ` audit https://github.example.com/owner/repo/actions/runs/1234567890 # Audit from GitHub Enterprise - ` + string(constants.CLIExtensionPrefix) + ` audit 1234567890 -o ./audit-reports # Custom output directory - ` + string(constants.CLIExtensionPrefix) + ` audit 1234567890 -v # Verbose output - ` + string(constants.CLIExtensionPrefix) + ` audit 1234567890 --parse # Parse agent logs and firewall logs, generating log.md and firewall.md - ` + string(constants.CLIExtensionPrefix) + ` audit 1234567890 --repo owner/repo # Audit run from a specific repository`, - Args: cobra.ExactArgs(1), + ` + string(constants.CLIExtensionPrefix) + ` audit 1234567890 -o ./audit-reports # Custom output directory + ` + string(constants.CLIExtensionPrefix) + ` audit 1234567890 -v # Verbose output + ` + string(constants.CLIExtensionPrefix) + ` audit 1234567890 --parse # Parse agent logs and firewall logs, generating log.md and firewall.md + ` + string(constants.CLIExtensionPrefix) + ` audit 1234567890 --repo owner/repo # Audit run from a specific repository + ` + string(constants.CLIExtensionPrefix) + ` audit 1234567890 1234567891 # Diff two runs (base vs comparison) + ` + string(constants.CLIExtensionPrefix) + ` audit 1234567890 1234567891 1234567892 # Diff base against multiple runs + ` + string(constants.CLIExtensionPrefix) + ` audit 1234567890 1234567891 --format markdown # Markdown diff output for PR comments`, + Args: cobra.MinimumNArgs(1), RunE: func(cmd *cobra.Command, args []string) error { - runIDOrURL := 
args[0] - - // Parse run information from input (either numeric ID or URL) - // Use extended parsing to capture job ID and step information - components, err := parser.ParseRunURLExtended(runIDOrURL) - if err != nil { - return err - } - outputDir, _ := cmd.Flags().GetString("output") verbose, _ := cmd.Flags().GetBool("verbose") jsonOutput, _ := cmd.Flags().GetBool("json") @@ -74,30 +72,46 @@ Examples: repoFlag, _ := cmd.Flags().GetString("repo") artifacts, _ := cmd.Flags().GetStringSlice("artifacts") - // If --repo is provided and owner/repo were not parsed from a URL, apply them - if repoFlag != "" && components.Owner == "" { - parts := strings.SplitN(repoFlag, "/", 2) - if len(parts) != 2 || parts[0] == "" || parts[1] == "" { - return fmt.Errorf("invalid repository format '%s': expected 'owner/repo'", repoFlag) + if len(args) == 1 { + // Single run: existing audit behavior + runIDOrURL := args[0] + + // Parse run information from input (either numeric ID or URL) + // Use extended parsing to capture job ID and step information + components, err := parser.ParseRunURLExtended(runIDOrURL) + if err != nil { + return err } - components.Owner = parts[0] - components.Repo = parts[1] + + // If --repo is provided and owner/repo were not parsed from a URL, apply them + if repoFlag != "" && components.Owner == "" { + parts := strings.SplitN(repoFlag, "/", 2) + if len(parts) != 2 || parts[0] == "" || parts[1] == "" { + return fmt.Errorf("invalid repository format '%s': expected 'owner/repo'", repoFlag) + } + components.Owner = parts[0] + components.Repo = parts[1] + } + + return AuditWorkflowRun( + cmd.Context(), + components.Number, + components.Owner, + components.Repo, + components.Host, + outputDir, + verbose, + parse, + jsonOutput, + components.JobID, + components.StepNumber, + artifacts, + ) } - return AuditWorkflowRun( - cmd.Context(), - components.Number, - components.Owner, - components.Repo, - components.Host, - outputDir, - verbose, - parse, - jsonOutput, - 
components.JobID, - components.StepNumber, - artifacts, - ) + // Multiple runs: diff mode (first is base, rest are comparisons) + format, _ := cmd.Flags().GetString("format") + return runAuditMulti(cmd.Context(), args, repoFlag, outputDir, verbose, jsonOutput, format, artifacts) }, } @@ -106,6 +120,7 @@ Examples: addJSONFlag(cmd) addRepoFlag(cmd) cmd.Flags().Bool("parse", false, "Run JavaScript parsers on agent logs and firewall logs, writing Markdown to log.md and firewall.md") + cmd.Flags().String("format", "pretty", "Diff output format for multi-run mode: pretty, markdown") cmd.Flags().StringSlice("artifacts", nil, "Artifact sets to download (default: all). Valid sets: "+strings.Join(ValidArtifactSetNames(), ", ")) // Register completions for audit command @@ -117,6 +132,51 @@ Examples: return cmd } +// runAuditMulti handles the multi-run diff mode for the audit command. +// The first argument is the base run; remaining arguments are comparison runs. +// Each argument may be a numeric run ID, a GitHub Actions run URL, or a job/step +// URL — job and step specificity is silently normalized to the parent run ID. 
+func runAuditMulti(ctx context.Context, args []string, repoFlag, outputDir string, verbose, jsonOutput bool, format string, artifacts []string) error { + // Parse base run (job/step URLs are accepted; only the run number is used) + baseComponents, err := parser.ParseRunURLExtended(args[0]) + if err != nil { + return fmt.Errorf("invalid base run %q: %w", args[0], err) + } + + // Resolve owner/repo/hostname from --repo flag or base URL + owner := baseComponents.Owner + repo := baseComponents.Repo + hostname := baseComponents.Host + if repoFlag != "" && owner == "" { + parts := strings.SplitN(repoFlag, "/", 2) + if len(parts) != 2 || parts[0] == "" || parts[1] == "" { + return fmt.Errorf("invalid repository format '%s': expected 'owner/repo'", repoFlag) + } + owner = parts[0] + repo = parts[1] + } + + // Parse comparison run IDs (job/step URLs are accepted; only the run number is used) + seen := make(map[int64]bool) + compareRunIDs := make([]int64, 0, len(args)-1) + for _, arg := range args[1:] { + c, err := parser.ParseRunURLExtended(arg) + if err != nil { + return fmt.Errorf("invalid comparison run %q: %w", arg, err) + } + if c.Number == baseComponents.Number { + return fmt.Errorf("comparison run ID %d is the same as the base run ID: cannot diff a run against itself", c.Number) + } + if seen[c.Number] { + return fmt.Errorf("duplicate comparison run ID %d: each run ID must appear only once", c.Number) + } + seen[c.Number] = true + compareRunIDs = append(compareRunIDs, c.Number) + } + + return RunAuditDiff(ctx, baseComponents.Number, compareRunIDs, owner, repo, hostname, outputDir, verbose, jsonOutput, format, artifacts) +} + // isPermissionError checks if an error is related to permissions/authentication func isPermissionError(err error) bool { if err == nil { diff --git a/pkg/cli/audit_diff_command.go b/pkg/cli/audit_diff_command.go index afa39f88b8e..8947042d845 100644 --- a/pkg/cli/audit_diff_command.go +++ b/pkg/cli/audit_diff_command.go @@ -12,12 +12,19 @@ 
import ( "github.com/spf13/cobra" ) -// NewAuditDiffSubcommand creates the audit diff subcommand +// NewAuditDiffSubcommand creates the audit diff subcommand. +// Deprecated: pass multiple run IDs directly to `audit` instead (e.g. `gh aw audit `). +// This subcommand is hidden and kept for backward compatibility only. func NewAuditDiffSubcommand() *cobra.Command { cmd := &cobra.Command{ - Use: "diff ...", - Short: "Compare behavior across workflow runs", - Long: `Compare workflow run behavior between a base run and one or more comparison runs + Use: "diff ...", + Short: "Compare behavior across workflow runs", + Hidden: true, + Long: `Deprecated: pass multiple run IDs directly to the audit command instead. + + gh aw audit ... + +Compare workflow run behavior between a base run and one or more comparison runs to detect policy regressions, new unauthorized domains, behavioral drift, and changes in MCP tool usage, token usage, or run metrics. diff --git a/pkg/cli/audit_test.go b/pkg/cli/audit_test.go index 31cb5f3bc37..7388a3c87e7 100644 --- a/pkg/cli/audit_test.go +++ b/pkg/cli/audit_test.go @@ -16,6 +16,8 @@ import ( "github.com/github/gh-aw/pkg/testutil" "github.com/github/gh-aw/pkg/workflow" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" ) func TestIsPermissionError(t *testing.T) { @@ -994,3 +996,56 @@ func TestResolveWorkflowDisplayNameFromLocalFile(t *testing.T) { t.Errorf("extractWorkflowNameFromYAML() = %q, want %q", name, "My Test Workflow") } } + +// TestRunAuditMulti_Validation verifies that runAuditMulti rejects invalid +// argument combinations before attempting to download any run data. 
+func TestRunAuditMulti_Validation(t *testing.T) { + tests := []struct { + name string + args []string + wantErr string + }{ + { + name: "self comparison rejected", + args: []string{"1234567890", "1234567890"}, + wantErr: "cannot diff a run against itself", + }, + { + name: "duplicate comparison run ID rejected", + args: []string{"1234567890", "1111111111", "1111111111"}, + wantErr: "duplicate comparison run ID", + }, + { + name: "invalid base run ID rejected", + args: []string{"not-a-run-id", "1111111111"}, + wantErr: "invalid base run", + }, + { + name: "invalid comparison run ID rejected", + args: []string{"1234567890", "not-a-run-id"}, + wantErr: "invalid comparison run", + }, + { + // Job URL as base is normalized to its parent run ID (1234567890), so + // a self-comparison against the same run ID should still be caught. + name: "base job URL normalized and self-comparison rejected", + args: []string{"https://github.com/owner/repo/actions/runs/1234567890/job/9876543210", "1234567890"}, + wantErr: "cannot diff a run against itself", + }, + { + // Job URL as comparison is normalized to its parent run ID (1111111111), + // so duplicate detection should still work. 
+ name: "comparison job URL normalized and duplicate detected", + args: []string{"1234567890", "https://github.com/owner/repo/actions/runs/1111111111/job/9876543210", "1111111111"}, + wantErr: "duplicate comparison run ID", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + err := runAuditMulti(t.Context(), tt.args, "", "", false, false, "pretty", nil) + require.Error(t, err, "runAuditMulti should return an error for invalid input") + assert.Contains(t, err.Error(), tt.wantErr, "error message should be descriptive") + }) + } +} diff --git a/pkg/cli/mcp_tools_privileged.go b/pkg/cli/mcp_tools_privileged.go index f15314f8d94..f2eb971feb3 100644 --- a/pkg/cli/mcp_tools_privileged.go +++ b/pkg/cli/mcp_tools_privileged.go @@ -248,9 +248,10 @@ from where the previous request stopped due to timeout.`, // Returns an error if schema generation fails. func registerAuditTool(server *mcp.Server, execCmd execCmdFunc, actor string, validateActor bool) error { type auditArgs struct { - RunIDOrURL string `json:"run_id_or_url" jsonschema:"GitHub Actions workflow run ID or URL. Accepts: numeric run ID (e.g., 1234567890), run URL (https://github.com/owner/repo/actions/runs/1234567890), job URL (https://github.com/owner/repo/actions/runs/1234567890/job/9876543210), or job URL with step (https://github.com/owner/repo/actions/runs/1234567890/job/9876543210#step:7:1)"` - Artifacts []string `json:"artifacts,omitempty" jsonschema:"Artifact sets to download (default: all). Valid sets: all, activation, agent, detection, firewall, github-api, mcp"` - MaxTokens int `json:"max_tokens,omitempty" jsonschema:"Deprecated: accepted for backward compatibility but ignored."` + RunIDOrURL string `json:"run_id_or_url,omitempty" jsonschema:"Deprecated: use run_ids_or_urls instead. Single GitHub Actions workflow run ID or URL."` + RunIDsOrURLs []string `json:"run_ids_or_urls,omitempty" jsonschema:"One or more workflow run IDs or URLs. Single item: detailed audit report. 
Multiple items: diff mode with first as base (see tool description for accepted formats)."` + Artifacts []string `json:"artifacts,omitempty" jsonschema:"Artifact sets to download (default: all). Valid sets: all, activation, agent, detection, firewall, github-api, mcp"` + MaxTokens int `json:"max_tokens,omitempty" jsonschema:"Deprecated: accepted for backward compatibility but ignored."` } // Generate schema for audit tool @@ -267,20 +268,25 @@ func registerAuditTool(server *mcp.Server, execCmd execCmdFunc, actor string, va IdempotentHint: true, OpenWorldHint: boolPtr(true), }, - Description: `Investigate a workflow run, job, or specific step and generate a concise report. + Description: `Investigate one or more workflow runs and generate a concise report. -Accepts multiple input formats: +When a single run is provided, generates a detailed audit report. +When two or more runs are provided, the first is the base (reference) run and +the remaining runs are compared against it (diff mode), showing changes in +firewall domains, MCP tool usage, and run metrics. 
+ +Each run accepts: - Numeric run ID: 1234567890 - Run URL: https://github.com/owner/repo/actions/runs/1234567890 - Job URL: https://github.com/owner/repo/actions/runs/1234567890/job/9876543210 - Job URL with step: https://github.com/owner/repo/actions/runs/1234567890/job/9876543210#step:7:1 -When a job URL is provided: +When a job URL is provided (single-run mode only): - If a step number is included (#step:7:1), extracts that specific step's output - If no step number, finds and extracts the first failing step's output - Saves job logs and step-specific logs to the output directory -Returns JSON with the following structure: +Single-run returns JSON with: - overview: Basic run information (run_id, workflow_name, status, conclusion, created_at, started_at, updated_at, duration, event, branch, url, logs_path) - metrics: Execution metrics (token_usage, estimated_cost, turns, error_count, warning_count) - jobs: List of job details (name, status, conclusion, duration) @@ -290,7 +296,9 @@ Returns JSON with the following structure: - errors: Error details (file, line, type, message) - warnings: Warning details (file, line, type, message) - tool_usage: Tool usage statistics (name, call_count, max_output_size, max_duration) -- firewall_analysis: Network firewall analysis if available (total_requests, allowed_requests, blocked_requests, allowed_domains, blocked_domains)`, +- firewall_analysis: Network firewall analysis if available (total_requests, allowed_requests, blocked_requests, allowed_domains, blocked_domains) + +Multi-run diff returns JSON describing changes between the base and each comparison run.`, InputSchema: auditSchema, Icons: []mcp.Icon{ {Source: "🔍"}, @@ -308,18 +316,30 @@ Returns JSON with the following structure: default: } - // Build command arguments - // Force output directory to /tmp/gh-aw/aw-mcp/logs for MCP server (same as logs) - // Use --json flag to output structured JSON for MCP consumption - // Pass the run ID or URL directly - the audit 
command will parse it - cmdArgs := []string{"audit", args.RunIDOrURL, "-o", "/tmp/gh-aw/aw-mcp/logs", "--json"} + // Resolve the list of run IDs/URLs to pass to the audit command. + // run_ids_or_urls takes precedence; fall back to the deprecated run_id_or_url field. + runItems := args.RunIDsOrURLs + if len(runItems) == 0 && args.RunIDOrURL != "" { + runItems = []string{args.RunIDOrURL} + } + if len(runItems) == 0 { + return nil, nil, newMCPError(jsonrpc.CodeInvalidParams, "at least one run ID or URL must be provided via run_ids_or_urls or run_id_or_url", nil) + } + + // Build command arguments. + // Force output directory to /tmp/gh-aw/aw-mcp/logs for MCP server (same as logs). + // Use --json flag to output structured JSON for MCP consumption. + // Pass all run IDs/URLs directly - the audit command handles single vs. diff mode. + cmdArgs := []string{"audit"} + cmdArgs = append(cmdArgs, runItems...) + cmdArgs = append(cmdArgs, "-o", "/tmp/gh-aw/aw-mcp/logs", "--json") if len(args.Artifacts) > 0 { cmdArgs = append(cmdArgs, "--artifacts", strings.Join(args.Artifacts, ",")) } cmdArgs = appendRepoFlagFromEnv(cmdArgs) - // Execute the CLI command + // Execute the CLI command. // Use separate stdout/stderr capture instead of CombinedOutput because: // - Stdout contains JSON output (--json flag) // - Stderr contains console messages and debug logs that shouldn't be mixed with JSON @@ -350,12 +370,16 @@ Returns JSON with the following structure: } // Return a JSON error envelope instead of an MCP protocol error so - // callers always receive consistent JSON and the run ID is always present. + // callers always receive consistent JSON and the run IDs are always present. // IsError must be false so that callers (e.g. mcp_cli_bridge) treat this as // a graceful not-found / failure response rather than a fatal protocol error. 
+ errorMsg := "failed to audit workflow run: " + mainMsg + if len(runItems) > 1 { + errorMsg = "failed to audit workflow runs: " + mainMsg + } errorEnvelope := map[string]any{ - "error": "failed to audit workflow run: " + mainMsg, - "run_id_or_url": args.RunIDOrURL, + "error": errorMsg, + "run_ids_or_urls": runItems, "suggestions": []string{ "Verify the run ID is correct", "Use the 'logs' tool to list recent run IDs", @@ -363,7 +387,7 @@ Returns JSON with the following structure: } jsonBytes, jsonErr := json.Marshal(errorEnvelope) if jsonErr != nil { - return nil, nil, newMCPError(jsonrpc.CodeInternalError, "failed to audit workflow run: "+mainMsg, nil) + return nil, nil, newMCPError(jsonrpc.CodeInternalError, errorMsg, nil) } return &mcp.CallToolResult{ IsError: false, diff --git a/pkg/cli/mcp_tools_privileged_test.go b/pkg/cli/mcp_tools_privileged_test.go index 25839d9db2b..f5ccca37d65 100644 --- a/pkg/cli/mcp_tools_privileged_test.go +++ b/pkg/cli/mcp_tools_privileged_test.go @@ -287,7 +287,12 @@ func TestAuditToolErrorEnvelopeSetsIsErrorFalse(t *testing.T) { var envelope map[string]any require.NoError(t, json.Unmarshal([]byte(textContent.Text), &envelope), "error response should be valid JSON") - assert.Equal(t, "9999999999", envelope["run_id_or_url"], "error envelope should include original run ID") + runIDsRaw, hasRunIDs := envelope["run_ids_or_urls"] + require.True(t, hasRunIDs, "error envelope should include run_ids_or_urls field") + runIDs, ok := runIDsRaw.([]any) + require.True(t, ok, "run_ids_or_urls should be an array") + require.Len(t, runIDs, 1, "run_ids_or_urls should contain the single run ID") + assert.Equal(t, "9999999999", runIDs[0], "error envelope should include original run ID") errorMessage, ok := envelope["error"].(string) require.True(t, ok, "error envelope should include string error field") assert.Contains(t, errorMessage, "failed to audit workflow run", "error envelope should include contextual prefix") @@ -335,6 +340,66 @@ func 
TestAuditTool_AcceptsDeprecatedMaxTokensParameter(t *testing.T) { assert.NotContains(t, strings.Join(capturedArgs, " "), "max_tokens", "audit command args should ignore max_tokens") } +// TestAuditTool_MultiRunDiffMode verifies that when run_ids_or_urls contains +// multiple entries the audit tool passes all of them as positional arguments +// to the audit command (which then runs in diff mode). +func TestAuditTool_MultiRunDiffMode(t *testing.T) { + const expectedStdout = `[{"base_run_id":111,"compare_run_id":222}]` + + var capturedArgs []string + mockExecCmd := func(ctx context.Context, args ...string) *exec.Cmd { + capturedArgs = slices.Clone(args) + return exec.CommandContext(ctx, "sh", "-c", `printf '%s' "$1"`, "sh", expectedStdout) + } + + server := mcp.NewServer(&mcp.Implementation{Name: "test", Version: "1.0"}, nil) + err := registerAuditTool(server, mockExecCmd, "", false) + require.NoError(t, err, "registerAuditTool should succeed") + + session := connectInMemory(t, server) + result, err := session.CallTool(context.Background(), &mcp.CallToolParams{ + Name: "audit", + Arguments: map[string]any{ + "run_ids_or_urls": []string{"111", "222", "333"}, + }, + }) + require.NoError(t, err, "audit tool should succeed with multiple run IDs") + require.NotNil(t, result, "result should not be nil") + assert.False(t, result.IsError, "result should not be an error") + + textContent, ok := result.Content[0].(*mcp.TextContent) + require.True(t, ok, "expected text content in audit response") + assert.JSONEq(t, expectedStdout, textContent.Text, "audit tool should return subprocess stdout") + + // All three run IDs must appear as positional args immediately after "audit" + require.GreaterOrEqual(t, len(capturedArgs), 4, "captured args should include audit + 3 run IDs: %v", capturedArgs) + assert.Equal(t, "audit", capturedArgs[0], "first arg should be 'audit'") + assert.Equal(t, "111", capturedArgs[1], "second arg should be first run ID") + assert.Equal(t, "222", 
capturedArgs[2], "third arg should be second run ID") + assert.Equal(t, "333", capturedArgs[3], "fourth arg should be third run ID") +} + +// TestAuditTool_FailsWhenNoRunIDProvided verifies that the audit tool +// returns an error when neither run_id_or_url nor run_ids_or_urls is provided. +func TestAuditTool_FailsWhenNoRunIDProvided(t *testing.T) { + mockExecCmd := func(ctx context.Context, args ...string) *exec.Cmd { + return exec.CommandContext(ctx, "nonexistent-command-for-testing-only") + } + + server := mcp.NewServer(&mcp.Implementation{Name: "test", Version: "1.0"}, nil) + err := registerAuditTool(server, mockExecCmd, "", false) + require.NoError(t, err, "registerAuditTool should succeed") + + session := connectInMemory(t, server) + result, err := session.CallTool(context.Background(), &mcp.CallToolParams{ + Name: "audit", + Arguments: map[string]any{}, + }) + // The MCP SDK surfaces InvalidParams as a protocol-level error + assert.True(t, err != nil || (result != nil && result.IsError), + "audit tool should return an error when no run ID is provided") +} + func TestAuditDiffToolErrorEnvelopeSetsIsErrorFalse(t *testing.T) { mockExecCmd := func(ctx context.Context, args ...string) *exec.Cmd { cmd := exec.CommandContext(ctx, os.Args[0], "-test.run=TestAuditDiffToolErrorEnvelopeHelperProcess") diff --git a/scratchpad/dev.md b/scratchpad/dev.md index 24836832dbd..4d16d9d620f 100644 --- a/scratchpad/dev.md +++ b/scratchpad/dev.md @@ -1099,7 +1099,7 @@ gh aw [flags] [arguments] **Command Categories**: - **Workflow Management**: `run`, `compile`, `validate`, `fix` - **Safe Outputs**: `safe-outputs` -- **Audit**: `audit diff`, `audit report`, `logs` +- **Audit**: `audit`, `audit diff` (hidden, use `audit ` instead), `logs` - **Utilities**: `version`, `help` ### Logger Namespace Convention @@ -2644,7 +2644,7 @@ type Everything interface { | `gh aw validate` | Validate workflow | `gh aw validate workflow.md` | | `gh aw safe-outputs` | Test safe outputs | `gh aw 
safe-outputs --staged` | | `gh aw fix` | Run migration codemods | `gh aw fix` | -| `gh aw audit diff ` | Compare firewall behavior across runs | `gh aw audit diff 12345 67890` | +| `gh aw audit ` | Compare firewall behavior across runs | `gh aw audit 12345 67890` | | `gh aw audit report` | Cross-run security audit report | `gh aw audit report --format markdown` | | `gh aw logs` | Retrieve workflow run logs | `gh aw logs 12345` |
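
Reviewer note: the argument checks that `runAuditMulti` performs in `pkg/cli/audit.go` can be read in isolation as the sketch below. It is a minimal, standalone illustration, not code from this change. It assumes the run IDs have already been reduced to `int64` values (the real function obtains them via `parser.ParseRunURLExtended`), and `validateCompareRuns` is a hypothetical name introduced only for this example. The error strings are copied from the diff above.

```go
package main

import "fmt"

// validateCompareRuns is a hypothetical helper (not part of this change) that
// mirrors the two checks runAuditMulti applies to comparison runs:
//  1. a comparison run must differ from the base run, and
//  2. each comparison run may appear only once.
// It returns the comparison IDs in their original order when both checks pass.
func validateCompareRuns(base int64, candidates []int64) ([]int64, error) {
	seen := make(map[int64]bool, len(candidates))
	compareRunIDs := make([]int64, 0, len(candidates))
	for _, id := range candidates {
		if id == base {
			return nil, fmt.Errorf("comparison run ID %d is the same as the base run ID: cannot diff a run against itself", id)
		}
		if seen[id] {
			return nil, fmt.Errorf("duplicate comparison run ID %d: each run ID must appear only once", id)
		}
		seen[id] = true
		compareRunIDs = append(compareRunIDs, id)
	}
	return compareRunIDs, nil
}

func main() {
	// Valid input: base plus two distinct comparison runs.
	ids, err := validateCompareRuns(1234567890, []int64{1111111111, 2222222222})
	fmt.Println(ids, err)

	// Self-comparison is rejected before any run data would be downloaded.
	_, err = validateCompareRuns(1234567890, []int64{1234567890})
	fmt.Println(err)

	// A repeated comparison run is rejected as a duplicate.
	_, err = validateCompareRuns(1234567890, []int64{1111111111, 1111111111})
	fmt.Println(err)
}
```

Because job and step URLs are normalized to their parent run ID before these checks run, a job URL that points at the base run is caught by the same self-comparison branch, which is the case `TestRunAuditMulti_Validation` exercises.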