Add AI-powered build failure analysis agentic workflow #13835
YuliiaKovalova wants to merge 5 commits into
Adds a gh-aw (GitHub Agentic Workflows) workflow that runs on every PR:
when ./build.sh --binaryLog fails, an AI agent reads the binlog (via the
binlog-mcp dotnet global tool), groups errors by root cause, and posts a
single PR comment plus inline `suggestion` blocks tied to the diff.
New files:
* .github/agents/build-failure-analyst.agent.md
Repo-tailored agent prompt covering MSBuild-specific error patterns
(no-warnings policy, WarnAsError promotions, .xlf/.resx workflow,
PublicAPI.Unshipped.txt, multi-TFM guards, breaking-change rules).
* .github/workflows/build-failure-analysis.md (+ .lock.yml)
Pull-request trigger; runs ./build.sh --binaryLog with continue-on-
error, installs AITools.BinlogMcp + NuGet.Mcp.Server, dumps the
binlog to /tmp/binlog-data/*.json via the DumpBinlog helper, and
delegates to the analyst agent. Advisory only -- does not gate.
* .github/workflows/build-failure-analysis-command.md (+ .lock.yml)
/analyze-build-failure slash command for re-running the analysis
after force-pushes or comment dismissals.
* .github/workflows/shared/build-failure-analysis-shared.md
Shared delegation body imported by both workflows; launches the
agent as a background task and noops immediately.
* .github/workflows/scripts/DumpBinlog/
Standalone C# console app (net9.0, ModelContextProtocol 1.3.0) that
speaks MCP stdio to binlog-mcp and writes overview/errors/warnings
JSON. Used as a pre-agent step because the gh-aw MCP gateway does
not support non-containerized stdio MCP servers.
* .github/workflows/agentic_commands.yml
Auto-generated slash-command routing manifest.
Modified:
* .github/aw/actions-lock.json
Minor bump of gh-aw-actions/setup (v0.68.1 -> v0.74.8) and
actions/github-script (v9 -> v9.0.0) -- side effect of the
newer gh-aw CLI compiling the new workflows.
The workflow is fork-PR-safe (forks: [] skips them) and restricted to
admin/maintainer/write roles for manual reruns. Inline suggestion comments
are capped at 10 per run; the summary comment uses hide-older-comments to
collapse prior runs on update.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔍 Skill Validator Results
✅ All checks passed
Summary
Full validator output
```text
Found 1 agent(s)
Validated 1 agent(s)
✅ All checks passed (1 agent(s))
```
Pull request overview
Adds an advisory GitHub Agentic Workflows (gh-aw) automation that runs ./build.sh --binaryLog and, on failure, delegates to a repo-tailored “build-failure-analyst” agent to summarize clustered root causes and (optionally) post inline fix suggestions.
Changes:
- Introduces PR-trigger and slash-command gh-aw workflows for build-failure analysis, plus compiled lock workflows.
- Adds a small standalone `DumpBinlog` console app to extract overview/errors/warnings JSON from a `.binlog` via `binlog-mcp`.
- Adds a dedicated `build-failure-analyst` agent prompt with MSBuild-specific guidance.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| .github/workflows/shared/build-failure-analysis-shared.md | Shared prompt body that spawns the build-failure analyst sub-agent. |
| .github/workflows/scripts/DumpBinlog/Program.cs | Console app that calls MCP tools and writes binlog-derived JSON outputs. |
| .github/workflows/scripts/DumpBinlog/DumpBinlog.csproj | Project file for the DumpBinlog helper tool. |
| .github/workflows/build-failure-analysis.md | PR-triggered advisory workflow wiring (build, binlog dump, agent delegation). |
| .github/workflows/build-failure-analysis.lock.yml | gh-aw compiled lock workflow for the PR-triggered workflow. |
| .github/workflows/build-failure-analysis-command.md | Slash-command workflow wiring for rerunning analysis on demand. |
| .github/workflows/build-failure-analysis-command.lock.yml | gh-aw compiled lock workflow for the slash-command workflow. |
| .github/workflows/agentic_commands.yml | gh-aw routing workflow for the new /analyze-build-failure command. |
| .github/aw/actions-lock.json | Updates pinned action entries used by gh-aw compilation. |
| .github/agents/build-failure-analyst.agent.md | The build-failure analyst agent prompt specialized for this repo. |
* Agent prompt: fix unclosed triple-backtick fence in frontmatter description
-- use inline `suggestion` blocks instead, so the prompt doesn't start a
stray markdown code fence.
* Agent prompt: replace GitHub Actions expression syntax (github.* templates)
with runtime env vars (GITHUB_SERVER_URL, GITHUB_REPOSITORY, GITHUB_RUN_ID)
-- Actions expressions are not interpolated inside an agent prompt and
would have been posted literally in the summary comment.
* Agent prompt: fix tool-name mismatch -- the example said
nuget_fix_vulnerable_packages but the documented tool is
fix_vulnerable_packages. Aligned the example with the actual tool name.
* PR-trigger workflow + shared body: same fence-closing fix in the
description / shared prompt body so importers don't start a stray
markdown code fence.
* Command (slash-command) workflow: add NUGET_MCP_VERSION env, an
"Install NuGet MCP Server" step, and NuGet.Mcp.Server in the bash tool
allowlist -- previously the command workflow couldn't run the NuGet
remediation path the analyst prompt asks for, so /analyze-build-failure
reruns were silently degraded vs the PR-trigger flow.
* Both workflows: resolve the binlog path to an absolute path in
find-binlog (via realpath) so GH_AW_BINLOG_PATH is absolute as the
agent prompt expects. Drop the now-redundant GITHUB_WORKSPACE prefix
when invoking DumpBinlog.
* DumpBinlog: in the fatal outer catch, write placeholder binlog-*.json
files with an { "error": ... } payload so downstream `cat` steps and
the agent always have something structured to read, even when the MCP
client cannot be constructed.
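The absolute-path fix above can be illustrated with a short shell sketch. This is not the workflow's actual `find-binlog` step; the function name `find_binlog` and the "newest file wins" selection rule are assumptions made for illustration. Only `realpath` and the `GH_AW_BINLOG_PATH` name come from the commit message.

```bash
#!/usr/bin/env bash
# Sketch only: find the newest .binlog under a search root and print its
# absolute path. The function name and selection rule are illustrative.
find_binlog() {
  local root="${1:-.}" newest="" candidate
  # -print0 with read -d '' keeps paths containing spaces intact.
  while IFS= read -r -d '' candidate; do
    if [ -z "$newest" ] || [ "$candidate" -nt "$newest" ]; then
      newest="$candidate"
    fi
  done < <(find "$root" -type f -name '*.binlog' -print0)

  # Fail if no binlog was found at all.
  [ -n "$newest" ] || return 1
  # realpath turns a relative hit (for example ./artifacts/log/Build.binlog)
  # into an absolute path, so later steps need no workspace prefix.
  realpath "$newest"
}

# Possible use inside a workflow step (GITHUB_ENV is provided by GitHub Actions):
#   if path="$(find_binlog .)"; then
#     echo "GH_AW_BINLOG_PATH=$path" >> "$GITHUB_ENV"
#   fi
```

Because the printed path is already absolute, a consumer such as the `DumpBinlog` invocation can pass it through unchanged, which is why the `GITHUB_WORKSPACE` prefix becomes redundant.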
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔍 Expert MSBuild Review — PR #13835
Verdict: COMMENT (no blocking issues found)
Summary
This PR adds a well-structured build-failure-analysis agentic workflow with good defensive patterns. After reviewing the actual branch content (which differs significantly from the initially-provided diff — the real implementation is more mature), the code is solid across all 24 review dimensions.
Dimension Results
| Dimension | Status | Notes |
|---|---|---|
| 1. Backwards Compatibility | ✅ LGTM | No MSBuild source changes; purely additive CI infrastructure |
| 2. ChangeWave | N/A | No MSBuild behavioral changes |
| 3. Performance | ✅ LGTM | DumpBinlog is a CI tool, not a hot path |
| 4. Test Coverage | | CI workflow tools are not unit-tested (standard practice) |
| 5. Error Messages | ✅ LGTM | Clear diagnostics in DumpBinlog; agent has comprehensive fallback |
| 6. Logging & Diagnostics | ✅ LGTM | tee /tmp/build-output.log captures raw output; JSON dumps cover structured data |
| 7–9 | N/A | No MSBuild source code changes |
| 10. Design | ✅ Good | Layered architecture (workflow → DumpBinlog → agent); imports: for shared body |
| 11. Cross-Platform | ✅ LGTM | Runs on ubuntu-latest; shell scripts use POSIX patterns |
| 12. Code Simplification | ✅ LGTM | DumpBinlog is minimal and focused |
| 13. Concurrency | ✅ LGTM | Proper concurrency: with cancel-in-progress: true |
| 14. Naming | ✅ LGTM | Consistent naming across files |
| 15–16. SDK/C# Patterns | ✅ LGTM | Modern top-level statements, proper async/await, collection expressions |
| 17. File I/O | | Empty-string edge case in `Path.GetFullPath` (see inline) |
| 18. Documentation | ✅ LGTM | Agent instructions accurately reflect workflow behavior |
| 19. Build Infrastructure | ✅ Good | Proper continue-on-error, step outcomes, timeout, network restrictions |
| 20. Scope | ✅ LGTM | Focused PR with clear purpose |
| 21. Evaluation Model | N/A | No MSBuild evaluation changes |
| 22. Correctness | ✅ LGTM | Build outcome uses steps.build.outcome (correct); fork blocking; role restrictions |
| 23. Dependency Mgmt | | NuGet MCP Server missing from command workflow (see inline) |
| 24. Security | ✅ Good | Actions SHA-pinned; forks: []; env-var-based interpolation (no script injection); roles restricted |
Highlights (well done)
- Correct build outcome detection: `continue-on-error: true` + `steps.build.outcome` properly captures failure without aborting the job
- Security-first: Fork PRs blocked, roles restricted, env vars used instead of direct `${{ }}` interpolation in `run:` blocks
- Graceful degradation: `continue-on-error: true` on DumpBinlog + defensive agent instructions mean partial data still produces useful analysis
- CPM opt-out: Explicit `<ManagePackageVersionsCentrally>false</ManagePackageVersionsCentrally>` with clear comment -- correct pattern for standalone CI tools
- Stable dependency: `ModelContextProtocol v1.3.0` (GA, not preview)
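The "capture the failure without aborting" pattern praised above has a rough shell analogue, sketched here under stated assumptions. In the workflow itself this role is played by `continue-on-error: true` and `steps.build.outcome`; the helper name `run_and_record` and the `outcome=` output line are hypothetical, while `tee` to a log file mirrors the `tee /tmp/build-output.log` note in the table.

```bash
#!/usr/bin/env bash
# Sketch only: run a build command, mirror its output to a log file, and
# report the outcome without failing the caller.
run_and_record() {
  local log="$1"; shift
  # pipefail makes the pipeline report the build's exit status rather than
  # tee's; the subshell keeps that option from leaking into the caller.
  if ( set -o pipefail; "$@" 2>&1 | tee "$log" ); then
    echo "outcome=success"
  else
    echo "outcome=failure"
  fi
}

# Possible use:
#   run_and_record /tmp/build-output.log ./build.sh --binaryLog
```

The function always returns zero, so later steps still run and can branch on the recorded outcome instead of on the job being aborted.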
Minor Observations (not blocking)
- Command workflow missing NuGet MCP (inline comment): Inconsistency between the two invocation paths
- Every-push trigger cost: The `pull_request: [opened, synchronize, reopened]` trigger runs a full 2-3 min build on every push, duplicating the existing Azure DevOps CI. The `cancel-in-progress: true` mitigates rapid-fire pushes, and the advisory nature is well-documented. Consider whether a label-gated or CI-failure-gated trigger would better balance cost vs. value.
Note
🔒 Integrity filter blocked 2 items
The following items were blocked because they don't meet the GitHub integrity level.
- #13835 `pull_request_read`: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
- #13835 `pull_request_read`: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
To allow these resources, lower min-integrity in your GitHub frontmatter:
    tools:
      github:
        min-integrity: approved # merged | approved | unapproved | none
Generated by Expert Code Review (on open) for issue #13835 · ● 5.1M
Addresses Dimension 17 (File I/O & Path Handling) NIT from the Expert
MSBuild Review. Previously, an empty `args[0]` would flow through
`Path.GetFullPath("")` (which resolves to the current working directory),
then through `File.Exists(<cwd>)` (false because cwd is a directory),
producing a confusing "Binlog not found: /path/to/cwd" error.
Add an explicit `string.IsNullOrWhiteSpace` check up front so the failure
mode is obvious. Applied to both repos symmetrically.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…vers
The previous CI failure at "Start MCP Gateway" (run 26238445223,
job 77219093390) was caused by gh-aw emitting malformed JSON for a
`mcp-servers.nuget: { command: ..., args: [...] }` (process / stdio MCP
server) declaration: the generated config contained `"nuget": {` with
no body and no closing brace, immediately followed by `"safeoutputs":`,
which crashed the gateway with "Expected ',' or '}' after property
value in JSON at position 1001 (line 42 column 1)".
This change rewires NuGet MCP Server the same way `dotnet/sdk` and
`dotnet/msbuild` use it:
* `NuGet.Mcp.Server` is installed as a dotnet global tool in a
pre-agent step (already present here).
* It is listed in the workflow's `bash:` allowlist so the agent can
invoke `NuGet.Mcp.Server` directly via the shell tool.
* The MCP gateway never sees a `nuget` server entry, so the
malformed-JSON path that crashes the gateway is bypassed entirely.
Verified: the regenerated `*.lock.yml` MCP config now contains only the
`github` and `safeoutputs` server entries and parses as valid JSON.
Also applies the same Copilot-review fixes that landed in
`dotnet/sdk#54401` and `dotnet/msbuild#13835`:
* Triple-backtick fence inside YAML `description:` blocks rewritten
as inline backtick spans so the description is not interpreted as
an unterminated code fence.
* `${{ github.* }}` expressions inside the agent prompt body
rewritten as runtime env vars (`${GITHUB_SERVER_URL}`,
`${GITHUB_REPOSITORY}`, `${GITHUB_RUN_ID}`). The agent prompt body
is passed verbatim to the LLM and is not YAML, so `${{ ... }}`
expressions stay literal.
* Agent description corrected to point at the C# `DumpBinlog`
helper instead of the removed `dump-binlog.js`.
* `Locate binlog` step now resolves the binlog path to absolute via
`realpath` so downstream consumers can treat `GH_AW_BINLOG_PATH`
as absolute, and the `Dump binlog as JSON` step drops the redundant
`$GITHUB_WORKSPACE/` prefix.
* Bumped the `dotnet run` timeout for `DumpBinlog` from 120s to 180s
to match the sdk/msbuild workflows (large binlogs in those repos
needed the extra headroom).
* `DumpBinlog/Program.cs` rejects an empty-string binlog argument
before calling `Path.GetFullPath`, and if `McpClient.CreateAsync`
throws it now writes a stub `{ "error": "DumpBinlog fatal: …" }`
payload to each of the three output JSON files so the agent's
`cat` step always has structured input to read.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The same Copilot-review feedback that was addressed on the equivalent
PRs in `dotnet/sdk` (#54401) and `dotnet/msbuild` (#13835) also applies
to the VMR version of this workflow. Bringing those three fixes here:
* `build-failure-analyst.agent.md` — fix stale reference to the
removed `dump-binlog.js` helper. The C# `DumpBinlog` program has
replaced it; update the agent description so the agent looks for
the right artifact.
* `build-failure-analyst.agent.md` — replace `${{ github.* }}`
expressions in the agent prompt body with runtime env vars
(`${GITHUB_SERVER_URL}`, `${GITHUB_REPOSITORY}`, `${GITHUB_RUN_ID}`,
`${GH_AW_PR_HEAD_SHA}`). The agent prompt body is passed verbatim
to the LLM and is *not* YAML, so `${{ ... }}` expressions stay
literal in the rendered comment instead of being substituted at
runtime.
* `DumpBinlog/Program.cs` — reject an empty-string binlog argument
explicitly before calling `Path.GetFullPath` (a NIT from the
msbuild Expert MSBuild Review), and, if `McpClient.CreateAsync`
throws (e.g., `binlog-mcp` is not on PATH), write a stub
`{ "error": "DumpBinlog fatal: …" }` payload to each of the three
expected output JSON files so the agent's `cat` step always has
structured input to read. Without this, a fatal MCP-client failure
silently produced no files at all and the agent could not even
report the failure.
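The two `DumpBinlog` behaviours described above (reject an empty argument early; always leave structured output behind) can be sketched as a shell analogue. The real helper is a C# program, so this is only an illustration of the idea. The function names, exit codes, and the `binlog-<name>.json` file names are assumptions; the `{ "error": "DumpBinlog fatal: …" }` payload shape is taken from the commit message.

```bash
#!/usr/bin/env bash
# Sketch only: shell analogue of the fallback behaviour; not the C# code.

# Write a stub error payload to each of the three expected output files.
write_error_stubs() {
  local out_dir="$1" message="$2" name escaped
  mkdir -p "$out_dir"
  # Escape backslashes and double quotes so the payload stays valid JSON.
  # (Control characters such as newlines are not handled in this sketch.)
  escaped="${message//\\/\\\\}"
  escaped="${escaped//\"/\\\"}"
  for name in overview errors warnings; do
    printf '{ "error": "DumpBinlog fatal: %s" }\n' "$escaped" \
      > "$out_dir/binlog-$name.json"
  done
}

# Validate the binlog argument before any path resolution happens.
check_binlog_arg() {
  local binlog="${1:-}"
  # Empty or whitespace-only input gets its own, unambiguous message.
  if [ -z "${binlog//[[:space:]]/}" ]; then
    echo "Binlog path argument is empty" >&2
    return 2
  fi
  if [ ! -f "$binlog" ]; then
    echo "Binlog not found: $binlog" >&2
    return 1
  fi
}
```

With stubs in place, a later `cat` of the output directory yields parseable JSON even when the dump step failed, so the agent can report the failure instead of finding nothing to read.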
The lock file (`build-failure-analysis.lock.yml`) is regenerated by
`gh aw compile --strict`.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…otnet
- Remove NuGet.Mcp.Server from bash tools list (causes empty
'nuget': {} block in MCP Gateway JSON, crashing the gateway)
- Add 'dotnet' to bash allowlist so agent can invoke global tools
- Update agent to use 'dotnet NuGet.Mcp.Server' instead of stdin JSON
- Recompile both lock files with gh aw compile --strict
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The output example: YuliiaKovalova/dotnet#3 (comment)
…ogMcp
The dotnet-tools feed package that backs the `binlog-mcp` CLI has been
renamed from `AITools.BinlogMcp` to `Microsoft.AITools.BinlogMcp` (now
under the canonical `Microsoft.*` namespace). The CLI command exposed
by the tool (`binlog-mcp`) is unchanged, so only the
`dotnet tool install --global …` line and the package version need to
move.
* Bump `BINLOG_MCP_VERSION` from `1.0.0-preview.26268.3` to
`1.0.0-preview.26272.1` (the first version published under the new
package id).
* Update `dotnet tool install --global` invocations in both
`build-failure-analysis.md` and `build-failure-analysis-command.md`.
* Update the AzDO feed permalink in the agent's comment-footer
template to point at the new package.
* Regenerate `.lock.yml` files via `gh aw compile --strict`.
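As a rough illustration of the rename, the install step might look like the sketch below. The package id and version come from the commit message, and `dotnet tool install --global … --version …` is the standard CLI form. The `install_binlog_mcp` wrapper and its `DRY_RUN` switch are hypothetical, and the sketch assumes the feed that hosts the package is already configured as a NuGet source.

```bash
#!/usr/bin/env bash
# Sketch only: pin the tool version in one variable and install by package id.
BINLOG_MCP_VERSION="${BINLOG_MCP_VERSION:-1.0.0-preview.26272.1}"

install_binlog_mcp() {
  # Package id changed from AITools.BinlogMcp; the exposed command
  # (binlog-mcp) is unchanged.
  local cmd=(dotnet tool install --global Microsoft.AITools.BinlogMcp
             --version "$BINLOG_MCP_VERSION")
  # DRY_RUN=1 prints the command instead of running it (hypothetical switch
  # added for this sketch only).
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

Keeping the version in a single variable means a later bump touches one line per workflow, with the lock files regenerated afterwards.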
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds a gh-aw (GitHub Agentic Workflows) workflow that runs on every PR: when `./build.sh --binaryLog` fails, an AI agent reads the binlog and inspects NuGet package metadata through two MCP servers --
- `binlog-mcp` -- structured access to the MSBuild binary log (errors, warnings, target timeline, project graph).
- `NuGet.Mcp.Server` -- package version / dependency / vulnerability lookup against nuget.org, used to disambiguate package-related failures (NU1605 downgrades, missing transitive deps, MSB3277 conflicts, `Microsoft.Build*` package version drift).
-- groups errors by root cause, and posts:
- … `suggestion` blocks tied to the diff for changes the reviewer can accept with one click.
The workflow is advisory only -- it never gates the merge. It is also fork-PR-safe: `forks: []` is set so PRs from forks do not trigger the agent (avoids token/secret exposure).
What's in the PR
- `.github/agents/build-failure-analyst.agent.md` -- repo-tailored agent prompt covering MSBuild-specific concerns: no-warnings policy (`TreatWarningsAsErrors` + `WarnAsError`), public-API surface stability (`Microsoft.Build*` packages), the boundary between MSBuild engine and Roslyn/SDK, and how to handle TFM-conditioned source files.
- `.github/workflows/build-failure-analysis.md` (+ generated `.lock.yml`) -- pull-request trigger on `[main, vs*]`; runs the build with `continue-on-error`, installs the two MCP servers (`Microsoft.AITools.BinlogMcp` + `NuGet.Mcp.Server`) as `dotnet tool`s, dumps the binlog to JSON via the `DumpBinlog` helper, and delegates to the analyst agent. `NuGet.Mcp.Server` is registered with gh-aw as a long-running MCP service the agent can call during analysis. Timeout 30 min.
- `.github/workflows/build-failure-analysis-command.md` (+ `.lock.yml`) -- `/analyze-build-failure` slash command for re-running the analysis after force-pushes. Restricted to `admin`/`maintainer`/`write` roles.
- `.github/workflows/shared/build-failure-analysis-shared.md` -- shared delegation body imported by both workflows.
- `.github/workflows/scripts/DumpBinlog/` -- standalone C# console app (`net9.0`, `ModelContextProtocol` 1.3.0) that speaks MCP stdio to `binlog-mcp` and writes `overview.json` / `errors.json` / `warnings.json`. Used as a pre-agent step because the gh-aw MCP gateway does not support non-containerized stdio MCP servers.
- `.github/aw/actions-lock.json` -- minor bump of `gh-aw-actions/setup` (v0.68.1 -> v0.74.8) and `github-script` (v9 -> v9.0.0), side effect of the newer `gh aw` CLI used to compile the new workflows.
How it runs
- … `./build.sh --binaryLog -c Release`.
- … `.binlog` is produced, `DumpBinlog` extracts errors/warnings/overview to `/tmp/binlog-data/*.json`.
- … `NuGet.Mcp.Server` for the actual published versions, dependency graphs, and known advisories.
- … (`hide-older-comments`); up to 10 inline suggestion comments are added.
Safety properties
- `forks: []` -> agent never runs against an outside-contributor PR (no token leakage).
- `timeout_minutes: 30` -> hard cap.
- `permissions: contents: read, pull-requests: write` -> minimum needed.
- … `[defaults, dotnet]`.
- … `[admin, maintainer, write]`.
Why this is useful for MSBuild
The most common build failures on this repo cluster into a few categories where an LLM with the binlog can do meaningful triage:
- `WarnAsError` regressions blocking the build.
- … (`RS0016`/`RS0017`) violations after adding/removing methods on `Microsoft.Build*` types.
- … `net472` but not on `net8.0` (or vice versa).
Grouping these by root cause and posting an actionable suggestion saves a round trip for the contributor.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>