15 changes: 14 additions & 1 deletion .github/aw/debug-agentic-workflow.md
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,7 @@ Report back with specific findings and actionable fixes.
- `gh aw run <workflow-name>` → run a workflow (requires workflow_dispatch trigger)
- `gh aw logs [workflow-name] --json` → download and analyze workflow logs with JSON output
- `gh aw audit <run-id> --json` → investigate a specific run with JSON output
- `gh aw audit <base-run-id> <compare-run-id> [<compare-run-id>...] --json` → diff two or more runs to detect regressions (firewall, MCP, metrics)
- `gh aw status` → show status of agentic workflows in the repository

> [!IMPORTANT]
Expand Down Expand Up @@ -91,7 +92,7 @@ Report back with specific findings and actionable fixes.
> - `status` tool → equivalent to `gh aw status`
> - `compile` tool → equivalent to `gh aw compile`
> - `logs` tool → equivalent to `gh aw logs`
> - `audit` tool → equivalent to `gh aw audit`
> - `audit` tool → equivalent to `gh aw audit` (single run: audit report; multiple run IDs: diff mode)
> - `checks` tool → equivalent to `gh aw checks`
> - `update` tool → equivalent to `gh aw update`
> - `add` tool → equivalent to `gh aw add`
Expand Down Expand Up @@ -183,6 +184,18 @@ When the user provides a workflow run URL (e.g., `https://github.com/github/gh-a
- Provides comprehensive JSON analysis
- Stores artifacts in `logs/run-<run-id>/` for offline inspection
- Reports missing tools, errors, and execution metrics

**Comparing two runs (regression detection)**:
Pass a second run ID to produce a diff — no separate `audit diff` command needed:
```bash
gh aw audit <base-run-id> <compare-run-id> --json
# Or compare base against multiple runs at once:
gh aw audit <base-run-id> <run-id-2> <run-id-3> --json
```
Or via the `agentic-workflows` tool:
```
Use the audit tool with run_ids_or_urls: ["<base-run-id>", "<compare-run-id>"]
```

3. **Analyze Missing Tools**

5 changes: 5 additions & 0 deletions debug.md
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,10 @@ gh aw logs <workflow-name>
# Audit a specific workflow run
gh aw audit <run-id>

# Diff two or more workflow runs (multi-run diff mode)
gh aw audit <base-run-id> <compare-run-id>
gh aw audit <base-run-id> <compare-run-id-1> <compare-run-id-2>

# Compile workflows after fixing
gh aw compile <workflow-name>

Expand All @@ -137,5 +141,6 @@ gh aw status
## Key Debugging Commands

- `gh aw audit <run-id> --json` → Detailed run analysis with missing tools and errors
- `gh aw audit <base-run-id> <compare-run-id> --json` → Diff two runs to detect regressions (firewall, MCP, metrics)
- `gh aw logs <workflow-name> --json` → Download and analyze recent workflow logs
- `gh aw compile <workflow-name> --strict` → Validate workflow with strict security checks
80 changes: 80 additions & 0 deletions docs/adr/28483-unify-audit-multi-run-diff-into-main-command.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
# ADR-28483: Unify Multi-Run Diff Mode into the Main `audit` Command

**Date**: 2026-04-25
**Status**: Draft
**Deciders**: pelikhan, Copilot

---

## Part 1 — Narrative (Human-Friendly)

### Context

The `gh aw audit` command previously accepted exactly one run ID or URL and produced a single-run audit report. Comparing two runs required invoking a separate subcommand, `audit diff`, which users had to discover independently. The same limitation existed in the MCP tool wrapper, which exposed `audit` only via a single `run_id_or_url` string field. This two-entry-point design created a discoverability gap: agents and users performing regression detection had to know that multi-run comparison was a distinct subcommand rather than a natural extension of `audit`.

### Decision

We will unify multi-run diff mode into the main `audit` command by changing its argument signature from `ExactArgs(1)` to `MinimumNArgs(1)`. When exactly one argument is provided the command behaves as before; when two or more are provided the first is treated as the base run and the remaining arguments are compared against it, delegating to the existing `RunAuditDiff` implementation. The `audit diff` subcommand will be hidden (`Hidden: true`) and retained only for backward compatibility. In the MCP tool wrapper, we will add a new `run_ids_or_urls []string` field as the preferred input, while keeping the old `run_id_or_url string` field as a deprecated fallback.

### Alternatives Considered

#### Alternative 1: Keep `audit diff` as the Primary Interface, Improve Documentation

The status quo could be preserved and discoverability improved through documentation updates and help text alone. This was rejected because documentation cannot help agents that parse command output programmatically, and it would not simplify the MCP tool schema. The fundamental UX problem — that comparison requires a different command — would remain.

#### Alternative 2: Add a `--compare` Flag to `audit`

A flag-based approach (e.g., `gh aw audit 12345 --compare 12346`) would keep the argument list unambiguous (first positional arg is always the base). This was rejected because it is more verbose and less natural when comparing against multiple runs. Positional arguments are consistent with how `audit diff` already worked, so migration is straightforward for existing users and scripts.

### Consequences

#### Positive
- Users and agents have a single entry point for all audit use cases; no need to remember `audit diff`.
- The MCP tool schema gains a typed `run_ids_or_urls` array that makes multi-run diff intent explicit.
- Validation logic (self-comparison rejection, duplicate ID rejection, invalid ID rejection) is shared between the subcommand and the new path.

#### Negative
- The `audit diff` subcommand must be kept indefinitely as a hidden backward-compatibility alias, adding maintenance surface.
- The `--parse` flag silently becomes a no-op in multi-run mode, which is a subtle inconsistency that may surprise users who upgrade from single-run workflows.
- Agent instruction files and documentation required a sweep to replace `audit diff <id1> <id2>` with `audit <id1> <id2>`.

#### Neutral
- The error envelope returned by the MCP tool was updated to use `run_ids_or_urls` (array) instead of `run_id_or_url` (string), which is a breaking change for any consumer that inspects the error structure. Callers relying on the old field name will need to update.
- Test coverage for the new `runAuditMulti` function and MCP tool multi-run path was added in the same PR.

---

## Part 2 — Normative Specification (RFC 2119)

> The key words **MUST**, **MUST NOT**, **REQUIRED**, **SHALL**, **SHALL NOT**, **SHOULD**, **SHOULD NOT**, **RECOMMENDED**, **MAY**, and **OPTIONAL** in this section are to be interpreted as described in [RFC 2119](https://www.rfc-editor.org/rfc/rfc2119).

### CLI Command Signature

1. The `audit` command **MUST** accept one or more positional arguments, each being a numeric run ID or a supported GitHub Actions URL format.
2. When exactly one argument is provided, the command **MUST** produce a single-run audit report identical in structure to the previous behavior.
3. When two or more arguments are provided, the command **MUST** treat the first argument as the base run and all subsequent arguments as comparison runs, delegating to the multi-run diff implementation.
4. The command **MUST NOT** accept a self-comparison (base run ID equal to any comparison run ID) and **MUST** return a descriptive error in that case.
5. The command **MUST NOT** accept duplicate comparison run IDs and **MUST** return a descriptive error in that case.
6. The `--parse` flag **MUST** be accepted in multi-run mode but **SHOULD** be documented as a no-op; implementations **MUST NOT** fail if `--parse` is passed alongside multiple run IDs.
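
Requirements 1 through 5 can be illustrated with a small shell sketch. This is not the CLI's implementation (which the ADR describes as Go code delegating to `RunAuditDiff`); the function name `audit_mode` is hypothetical, and the sketch assumes every argument has already been reduced to a numeric run ID.

```bash
# Hypothetical sketch: decide which audit mode a list of run IDs selects.
# Prints "single" or "diff" on success; prints an error to stderr and
# returns 1 when the argument list violates the rules above.
audit_mode() {
  if [ "$#" -lt 1 ]; then
    echo "error: at least one run ID is required" >&2
    return 1
  fi
  if [ "$#" -eq 1 ]; then
    echo "single"            # exactly one argument: single-run report
    return 0
  fi
  base=$1                    # first argument is the base run
  shift
  seen=" "                   # space-delimited list of comparison IDs seen so far
  for id in "$@"; do
    if [ "$id" = "$base" ]; then
      echo "error: cannot compare run $base with itself" >&2
      return 1
    fi
    case $seen in
      *" $id "*)
        echo "error: duplicate comparison run $id" >&2
        return 1
        ;;
    esac
    seen="$seen$id "
  done
  echo "diff"                # two or more arguments: diff against the base
}
```

For example, `audit_mode 12345 12346 12347` selects diff mode, while `audit_mode 12345 12345` fails as a self-comparison.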

### `audit diff` Subcommand

1. The `audit diff` subcommand **MUST** remain present in the CLI binary and **MUST** continue to function as before.
2. The `audit diff` subcommand **MUST** be hidden from help output (`Hidden: true`) to discourage new usage.
3. The `audit diff` subcommand **MUST NOT** be removed in any release that does not provide a documented migration path.

### MCP Tool Schema

1. The MCP `audit` tool **MUST** accept a `run_ids_or_urls` field of type `[]string` as the primary input.
2. The MCP `audit` tool **MUST** accept the deprecated `run_id_or_url` field of type `string` as a fallback when `run_ids_or_urls` is absent or empty.
3. When both fields are provided, `run_ids_or_urls` **MUST** take precedence.
4. The tool **MUST** return an MCP `InvalidParams` error when neither field provides at least one run ID.
5. Error envelopes returned by the tool **MUST** include a `run_ids_or_urls` array field (not `run_id_or_url`) reflecting the resolved list of run IDs that were attempted.
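
The input precedence in requirements 1 through 4 can be sketched as follows. The real tool receives these values as JSON fields; this shell function, `resolve_run_ids`, is a hypothetical stand-in that takes the deprecated `run_id_or_url` value as its first argument (possibly empty) and the `run_ids_or_urls` entries as the remaining arguments.

```bash
# Hypothetical sketch of the field precedence, not the tool's actual code.
# $1   = deprecated run_id_or_url value (may be empty)
# $2.. = run_ids_or_urls entries (may be none)
resolve_run_ids() {
  single=${1-}
  if [ "$#" -gt 0 ]; then
    shift
  fi
  if [ "$#" -gt 0 ]; then
    printf '%s\n' "$@"        # run_ids_or_urls takes precedence when non-empty
  elif [ -n "$single" ]; then
    printf '%s\n' "$single"   # fall back to the deprecated single field
  else
    echo "InvalidParams: no run ID provided" >&2
    return 1
  fi
}
```

Calling `resolve_run_ids 111 222 333` prints `222` and `333`, ignoring the deprecated value, while `resolve_run_ids 111` falls back to `111`.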

### Conformance

An implementation is considered conformant with this ADR if it satisfies all **MUST** and **MUST NOT** requirements above. Failure to meet any **MUST** or **MUST NOT** requirement constitutes non-conformance.

---

*This is a DRAFT ADR generated by the [Design Decision Gate](https://github.com/github/gh-aw/actions/runs/24936666718) workflow. The PR author must review, complete, and finalize this document before the PR can merge.*
4 changes: 2 additions & 2 deletions docs/src/content/docs/guides/audit-with-agents.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ When running locally, all three audit commands accept `--json` to write structur
| --------- | ---------- |
| `gh aw audit <run-id> --json` | Single run — `key_findings`, `recommendations`, `metrics` |
| `gh aw logs [workflow] --last 10 --json` | Trend analysis — `per_run_breakdown`, `domain_inventory` |
| `gh aw audit diff <id1> <id2> --json` | Before/after — `run_metrics_diff`, `firewall_diff` |
| `gh aw audit <id1> <id2> --json` | Before/after — `run_metrics_diff`, `firewall_diff` |

Inside GitHub Actions workflows, agents access these commands through the `agentic-workflows` MCP tool rather than calling the CLI directly.

Expand Down Expand Up @@ -65,7 +65,7 @@ permissions:

# Regression Detection

Use the `agentic-workflows` MCP tool `audit diff` with base run ID ${{ inputs.base_run_id }} and current run ID ${{ inputs.current_run_id }}. Check for new blocked domains, increased MCP error rates, cost increase > 20%, or token usage increase > 50%. If regressions are found, open a GitHub issue with a table from `run_metrics_diff`, affected domains from `firewall_diff`, and affected MCP tools from `mcp_tools_diff`.
Use the `agentic-workflows` MCP tool `audit` with run IDs ${{ inputs.base_run_id }} and ${{ inputs.current_run_id }} to compare the two runs. Check for new blocked domains, increased MCP error rates, cost increase > 20%, or token usage increase > 50%. If regressions are found, open a GitHub issue with a table from `run_metrics_diff`, affected domains from `firewall_diff`, and affected MCP tools from `mcp_tools_diff`.
```

## Filing issues from audit findings
4 changes: 2 additions & 2 deletions docs/src/content/docs/guides/maintaining-repos.md
Original file line number Diff line number Diff line change
Expand Up @@ -175,7 +175,7 @@ gh aw logs --filtered-integrity # only runs with DIFC-filtered events
**Compare two runs for regressions:**

```bash
gh aw audit diff BASELINE_ID CURRENT_ID
gh aw audit BASELINE_ID CURRENT_ID
```

### Common Failure Patterns
Expand All @@ -195,7 +195,7 @@ gh aw audit diff BASELINE_ID CURRENT_ID
2. Run `gh aw audit RUN_ID` for a structured breakdown.
3. For complex issues, use `/agent agentic-workflows` in Copilot Chat.
4. Edit the `.md` file → run `gh aw compile` to validate → trigger a new run.
5. Compare the new run against the baseline with `gh aw audit diff`.
5. Compare the new run against the baseline with `gh aw audit BASELINE_ID NEW_ID`.

## Related Documentation

7 changes: 5 additions & 2 deletions docs/src/content/docs/patterns/monitoring.md
Original file line number Diff line number Diff line change
Expand Up @@ -114,15 +114,18 @@ Use `gh aw status` to see which workflows are enabled and their latest run state
For deeper investigation, the audit commands are the primary monitoring tool for agentic workflows:

- `gh aw audit <run-id>` — single-run report with tool usage, MCP failures, firewall activity, and cost metrics
- `gh aw audit diff <run-id-1> <run-id-2>` — compare two runs to detect behavioral regressions or new network accesses
- `gh aw audit <run-id-1> <run-id-2>` — compare two runs to detect behavioral regressions or new network accesses (pass additional IDs to compare base against multiple runs)
- `gh aw logs --format markdown [workflow]` — cross-run security and performance report for trend monitoring

```bash
# Audit a specific run by ID
gh aw audit 12345678

# Compare two runs for regressions
gh aw audit diff 12345678 12345679
gh aw audit 12345678 12345679

# Compare base against multiple runs at once
gh aw audit 12345678 12345679 12345680

# Trend report across the last 10 runs of a workflow
gh aw logs my-workflow --format markdown --count 10
68 changes: 25 additions & 43 deletions docs/src/content/docs/reference/audit.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,17 +7,18 @@ sidebar:

The `gh aw audit` commands download workflow run artifacts and logs, analyze MCP tool usage and network behavior, and produce structured reports suited for security reviews, debugging, and feeding to AI agents.

## `gh aw audit <run-id-or-url>`
## `gh aw audit <run-id-or-url> [<run-id-or-url>...]`

Audit a single workflow run and generate a detailed Markdown report.
Audit one or more workflow runs. When a single run is provided, a detailed Markdown report is generated. When two or more runs are provided, the first is used as the base (reference) run and the remaining runs are compared against it, producing a diff report.

**Arguments:**

| Argument | Description |
|----------|-------------|
| `<run-id-or-url>` | A numeric run ID, GitHub Actions run URL, job URL, or job URL with step anchor |
| `[<run-id-or-url>...]` | Additional run IDs or URLs to compare against the first (diff mode) |

**Accepted input formats:**
**Accepted input formats (per argument):**

- Numeric run ID: `1234567890`
- Run URL: `https://github.com/owner/repo/actions/runs/1234567890`
Expand All @@ -26,19 +27,24 @@ Audit a single workflow run and generate a detailed Markdown report.
- Short run URL: `https://github.com/owner/repo/runs/1234567890`
- GitHub Enterprise URLs using the same formats above

When a job URL is provided without a step anchor, the command extracts the output of the first failing step. When a step anchor is included, it extracts that specific step.
When a job URL is provided without a step anchor (single-run mode), the command extracts the output of the first failing step. When a step anchor is included, it extracts that specific step.

In diff mode, job URLs and step-anchored URLs are accepted for any argument. The job or step component is silently dropped and each URL is normalized to its parent run ID, so the comparison is always run-level.

Self-comparisons and duplicate run IDs are rejected when using diff mode.
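
The run-level normalization described above can be sketched in shell. This is an illustration rather than the CLI's parser: the helper name `run_id_from_arg` is hypothetical, and the job URL shape `/actions/runs/<run-id>/job/<job-id>` (optionally followed by a `#step:...` anchor) is assumed, since the hunk elides those list entries.

```bash
# Hypothetical helper: reduce a run ID or GitHub Actions URL to its numeric run ID.
run_id_from_arg() {
  arg=$1
  case $arg in
    */runs/*)
      rest=${arg#*/runs/}      # drop everything through the first "/runs/"
      id=${rest%%[!0-9]*}      # keep only the leading digits (discards /job/... and #step...)
      ;;
    *)
      id=$arg                  # not a URL: treat the argument as a bare run ID
      ;;
  esac
  case $id in
    ''|*[!0-9]*)
      echo "invalid run ID or URL: $arg" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$id"
}
```

With this sketch, a job URL and its parent run URL both resolve to the same run ID, which is why the self-comparison check also applies to two different URLs that point into the same run.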

**Flags:**

| Flag | Default | Description |
|------|---------|-------------|
| `-o, --output <dir>` | `./logs` | Directory to write downloaded artifacts and report files |
| `--json` | off | Output report as JSON to stdout |
| `--parse` | off | Run JavaScript parsers on agent and firewall logs, writing `log.md` and `firewall.md` |
| `--parse` | off | Run JavaScript parsers on agent and firewall logs, writing `log.md` and `firewall.md` (single-run only) |
| `--repo <owner/repo>` | auto | Specify repository when the run ID is not from a URL |
| `--verbose` | off | Print detailed progress information |
| `--format <fmt>` | `pretty` | Diff output format: `pretty` or `markdown` (multi-run only) |

Copilot AI Apr 25, 2026

`gh aw audit` is documented here with a `--format` flag and `--format markdown` examples for multi-run diff mode, but the `audit` Cobra command in this PR doesn’t currently define or support `--format` (only the hidden `audit diff` subcommand does). Either add `--format` support to `audit` multi-run mode or adjust these docs and examples to match the actual CLI behavior.

Suggested change
| `--format <fmt>` | `pretty` | Diff output format: `pretty` or `markdown` (multi-run only) |


**Examples:**
**Single-run examples:**

```bash
gh aw audit 1234567890
Expand All @@ -49,38 +55,24 @@ gh aw audit 1234567890 -o ./audit-reports
gh aw audit 1234567890 --repo owner/repo
```

**Report sections** (rendered in Markdown or JSON): Overview, Comparison, Task/Domain, Behavior Fingerprint, Agentic Assessments, Metrics, Key Findings, Recommendations, Observability Insights, Performance Metrics, Engine Config, Prompt Analysis, Session Analysis, Safe Output Summary, MCP Server Health, Jobs, Downloaded Files, Missing Tools, Missing Data, Noops, MCP Failures, Firewall Analysis, Policy Analysis, Redacted Domains, Errors, Warnings, Tool Usage, MCP Tool Usage, Created Items.
**Multi-run diff examples:**

```bash
gh aw audit 12345 12346 # Compare two runs
gh aw audit 12345 12346 12347 12348 # Compare base against 3 runs
gh aw audit 12345 12346 --format markdown # Markdown output for PR comments
gh aw audit 12345 12346 --json # JSON for CI integration
gh aw audit 12345 12346 --repo owner/repo # Specify repository
```

**Single-run report sections** (rendered in Markdown or JSON): Overview, Comparison, Task/Domain, Behavior Fingerprint, Agentic Assessments, Metrics, Key Findings, Recommendations, Observability Insights, Performance Metrics, Engine Config, Prompt Analysis, Session Analysis, Safe Output Summary, MCP Server Health, Jobs, Downloaded Files, Missing Tools, Missing Data, Noops, MCP Failures, Firewall Analysis, Policy Analysis, Redacted Domains, Errors, Warnings, Tool Usage, MCP Tool Usage, Created Items.

The Metrics section includes an `ambient_context` object when available. Ambient context captures the first LLM inference footprint for the run:
- `ambient_context.input_tokens` — input tokens for the first invocation
- `ambient_context.cached_tokens` — cache-read tokens reused by the first invocation
- `ambient_context.effective_tokens` — `input_tokens + cached_tokens`

## `gh aw audit diff <base-run-id> <comparison-run-id> [<comparison-run-id>...]`

Compare behavior between workflow runs. Detects policy regressions, new unauthorized domains, behavioral drift, and changes in MCP tool usage or run metrics.

**Arguments:**

| Argument | Description |
|----------|-------------|
| `<base-run-id>` | Numeric run ID for the baseline run |
| `<comparison-run-id>` | Numeric run ID for the comparison run |
| `[<comparison-run-id>...]` | Additional run IDs to compare against the same base |

The base run is downloaded once and reused when multiple comparison runs are provided. Self-comparisons and duplicate run IDs are rejected.

**Flags:**

| Flag | Default | Description |
|------|---------|-------------|
| `--format <fmt>` | `pretty` | Output format: `pretty` or `markdown` |
| `--json` | off | Output diff as JSON |
| `--repo <owner/repo>` | auto | Specify repository |
| `-o, --output <dir>` | `./logs` | Directory for downloaded artifacts |
| `--verbose` | off | Print detailed progress |

The diff output includes:
**Diff output** includes:
- New and removed network domains
- Domain status changes (allowed ↔ denied)
- Volume changes (request count changes above a 100% threshold)
Expand All @@ -89,20 +81,10 @@ The diff output includes:
- Run metrics comparison (token usage, duration, turns)
- Token usage breakdown: input tokens, output tokens, cache read/write tokens, effective tokens, total API requests, and cache efficiency per run

**Output behavior with multiple comparisons:**
**Diff output behavior with multiple comparisons:**
- `--json` outputs a single object for one comparison, or an array for multiple
- `--format pretty` and `--format markdown` separate multiple diffs with dividers

**Examples:**

```bash
gh aw audit diff 12345 12346
gh aw audit diff 12345 12346 12347 12348
gh aw audit diff 12345 12346 --format markdown
gh aw audit diff 12345 12346 --json
gh aw audit diff 12345 12346 --repo owner/repo
```

## `gh aw logs --format <fmt>`

Generate a cross-run security and performance audit report across multiple recent workflow runs.