Add repo health check agentic workflows #7583
… and known baseline
Pull request overview
This PR adds a 3-tier agentic repo health monitoring system to the dotnet/machinelearning repository using GitHub's gh-aw (agentic workflows) framework. The system automatically monitors issues, PRs, and CI pipelines, and takes automated actions on the dashboard.
Changes:
- Adds an Orchestrator workflow (`repo-health-check`) that runs daily at 6:00 UTC, collects health data, diffs against previous runs, updates a pinned dashboard issue, and dispatches investigators for critical findings.
- Adds an Investigator workflow (`repo-health-investigate`) dispatched for critical/high findings to perform deep-dive analysis and post results back to the dashboard.
- Adds a Groomer workflow (`repo-health-groom`) that runs daily at 9:00 UTC, links investigation results, hides stale comments, and enforces dashboard structure.
- Adds `.github/health-baseline.md` cataloguing 24 known P0/P1 issues and 6 long-running PRs to suppress false positives.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
.github/workflows/repo-health-check.md |
Orchestrator prompt/playbook for the daily health check agentic workflow |
.github/workflows/repo-health-check.lock.yml |
Auto-generated compiled workflow for the orchestrator |
.github/workflows/repo-health-investigate.md |
Investigator prompt/playbook for deep-dive analysis |
.github/workflows/repo-health-investigate.lock.yml |
Auto-generated compiled workflow for the investigator |
.github/workflows/repo-health-groom.md |
Groomer prompt/playbook for dashboard maintenance |
.github/workflows/repo-health-groom.lock.yml |
Auto-generated compiled workflow for the groomer |
.github/health-baseline.md |
Known baseline of accepted issues/PRs to suppress from new finding alerts |
.github/aw/actions-lock.json |
Action version pins for gh-aw toolchain |
Comments suppressed due to low confidence (2)
**.github/workflows/repo-health-groom.md:155**
* The `repo-health-groom.md` playbook instructs the agent to hide comments using `gh api graphql` (line 145), but the compiled lock file's agent prompt explicitly states "The gh CLI is NOT authenticated. Do NOT use gh commands for GitHub operations." The agent cannot use `gh api graphql` for the minimize mutation; instead it should use the `hide_comment` safe-output tool (which is defined in the lock file's safe outputs config). The `hide-comment` operation in the playbook should reference the safe-output tool call rather than a `gh` command, otherwise the hide operation will silently fail or error on every run.

````markdown
### Hide Operation
```bash
# Minimize comment (hide with reason)
gh api graphql -f query='
mutation {
  minimizeComment(input: {
    subjectId: "COMMENT_NODE_ID",
    classifier: OUTDATED
  }) {
    minimizedComment { isMinimized }
  }
}
'
````

**.github/workflows/repo-health-check.md:413**
* In `repo-health-check.md`, Step 5 instructs the agent to dispatch the investigator using `gh workflow run repo-health-investigate.lock.yml ...`. However, the gh CLI is not authenticated, so this command will fail. The agent should instead use the `repo_health_investigate` safe-output tool (which is defined in the lock file as a `dispatch_workflow` safe output) to trigger the investigation workflow.

```bash
# Budget: max 5 dispatches
DISPATCHED=0
for finding in critical_and_high_findings; do
  if [ $DISPATCHED -ge 5 ]; then
    break
  fi
  gh workflow run repo-health-investigate.lock.yml \
    --repo dotnet/machinelearning \
    -f finding_id="$FINDING_ID" \
    -f category="$CATEGORY" \
    -f severity="$SEVERITY" \
    -f summary="$SUMMARY" \
    -f health_issue_number="$ISSUE"
  DISPATCHED=$((DISPATCHED + 1))
done
```
```bash
# Only if validation passes
gh issue edit "$ISSUE" --repo dotnet/machinelearning --body "$UPDATED_BODY"
```
In `repo-health-groom.md`, Step 6 instructs the agent to apply the issue body update via `gh issue edit ...` (line 204). Since the gh CLI is not authenticated, this will always fail. The agent should instead use the `update_issue` safe-output tool to apply the validated update, consistent with how the safe-outputs config is set up in the lock file.
This issue also appears on line 141 of the same file.
```diff
- # Only if validation passes
- gh issue edit "$ISSUE" --repo dotnet/machinelearning --body "$UPDATED_BODY"
+ # Only if validation passes; use the safe-output tool to apply the update
+ update_issue "$ISSUE" "$UPDATED_BODY"
```
````markdown
```bash
# Check if AZDO_PAT is available; skip AzDO checks if not
if [ -z "$AZDO_PAT" ]; then
  echo "AZDO_PAT not set — skipping Azure DevOps pipeline checks"
else
  for pipeline in vsts-ci codecoverage-ci night-build outer-loop-build; do
    curl -s -u ":$AZDO_PAT" \
      "https://dev.azure.com/dnceng/public/_apis/build/builds?definitions=$pipeline&\$top=1&api-version=7.0" \
      | jq '.value[0] | {id, buildNumber, status, result, queueTime, finishTime}'
  done
```

**A2. Pipeline failure rate (last 7 days)**

```bash
SINCE=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-7d +%Y-%m-%dT%H:%M:%SZ)
for pipeline in vsts-ci codecoverage-ci night-build outer-loop-build; do
  curl -s -u ":$AZDO_PAT" \
    "https://dev.azure.com/dnceng/public/_apis/build/builds?definitions=$pipeline&minTime=$SINCE&api-version=7.0" \
    | jq '[.value[] | .result] | group_by(.) | map({result: .[0], count: length})'
done
```

**A3. Queue times**

```bash
for pipeline in vsts-ci codecoverage-ci night-build outer-loop-build; do
  curl -s -u ":$AZDO_PAT" \
    "https://dev.azure.com/dnceng/public/_apis/build/builds?definitions=$pipeline&\$top=10&api-version=7.0" \
    | jq '[.value[] | {queueTime, startTime} | {wait: ((.startTime | fromdateiso8601) - (.queueTime | fromdateiso8601))}] | {avg_wait_seconds: (map(.wait) | add / length)}'
done
fi
````
The AzDO bash code in the orchestrator instructions is split across three separate fenced code blocks (A1, A2, A3), but the `if`/`else` control flow spans all three:
- A1 block opens `if [ -z "$AZDO_PAT" ]; then ... else` but has no `fi` and no closing of the block.
- A2 block contains just part of the `else` body (a `for` loop) with no `if`/`fi`.
- A3 block contains the final `for` loop followed by `fi` to close the A1 `if`.

Since these are separate fenced Markdown code blocks, the agent will treat them as distinct snippets. The `fi` at the end of A3 appears to close a non-existent `if`, and A2 has no guard at all. If the agent executes A2 or A3 as standalone blocks, the A2 and A3 `curl` commands will run unconditionally even when `AZDO_PAT` is not set, causing authentication errors on every run. The entire `if`/`else`/`fi` guard should be contained within a single code block, or each block should have its own `if [ -n "$AZDO_PAT" ]` guard.
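
A minimal sketch of the second option, where each block carries its own guard and can therefore run in isolation. The function names `azdo_enabled` and `pipeline_failure_rate` are hypothetical and not part of this PR; the pipeline list, URL, and `jq` filter are copied unchanged from the quoted A2 block, so whether they match the real AzDO API usage is not verified here.

```shell
#!/usr/bin/env bash
# Sketch only: azdo_enabled and pipeline_failure_rate are hypothetical names,
# introduced for illustration and not present in the PR.

PIPELINES="vsts-ci codecoverage-ci night-build outer-loop-build"

# Succeeds only when a PAT is available; otherwise reports why checks are skipped.
azdo_enabled() {
  if [ -z "${AZDO_PAT:-}" ]; then
    echo "AZDO_PAT not set; skipping Azure DevOps pipeline checks" >&2
    return 1
  fi
  return 0
}

# A2-style block with its own guard: a no-op (exit status 0) without a PAT.
pipeline_failure_rate() {
  azdo_enabled || return 0
  local since pipeline
  # GNU date first, BSD date as fallback (same approach as the quoted block).
  since=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null \
    || date -u -v-7d +%Y-%m-%dT%H:%M:%SZ)
  for pipeline in $PIPELINES; do
    curl -s -u ":$AZDO_PAT" \
      "https://dev.azure.com/dnceng/public/_apis/build/builds?definitions=$pipeline&minTime=$since&api-version=7.0" \
      | jq '[.value[] | .result] | group_by(.) | map({result: .[0], count: length})'
  done
}
```

With this shape, A1 and A3 would each start with the same `azdo_enabled || return 0` line, so no block depends on an `if` opened in a different fence.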
````markdown
```bash
# Find existing dashboard issue
ISSUE=$(gh issue list --repo dotnet/machinelearning \
  --label "repo-health" --state open \
  --json number --jq '.[0].number')

if [ -z "$ISSUE" ]; then
  # Create new dashboard issue
  ISSUE=$(gh issue create --repo dotnet/machinelearning \
    --title "🏥 Repo Health Dashboard" \
    --label "repo-health" \
    --body "$DASHBOARD_BODY")
  # Pin the issue
  gh issue pin "$ISSUE" --repo dotnet/machinelearning
fi
```

### Update Issue Body

Replace the entire issue body with the current state using the dashboard format. Include:

1. **Header** — Last updated timestamp, overall status emoji and counts
2. **Summary** — Executive summary (1-2 sentences)
3. **Findings tables** — Critical, Warning, Recently Resolved, Baselined
4. **Trends (7-day)** — Key metrics with directional arrows
5. **Footer** — Link to workflow run and baseline file

### Post Daily Comment

```bash
gh issue comment "$ISSUE" --repo dotnet/machinelearning \
  --body "$DELTA_SUMMARY"
```
````
The `repo-health-check.md` orchestrator instructs the agent to use `gh issue create` and `gh issue pin` to create/pin the dashboard issue (lines 356–362), and `gh issue comment` (line 379), but the compiled lock file explicitly states "The gh CLI is NOT authenticated." The agent should instead use the `create_issue`, `update_issue`, and `add_comment` safe-output tool calls for these operations. As written, these `gh` commands will fail silently on every run, meaning the dashboard issue will never be created or commented on.
This issue also appears on line 394 of the same file.
````markdown
Post a single comment on the dashboard issue (#${{ inputs.health_issue_number }}).

```bash
gh issue comment ${{ inputs.health_issue_number }} --repo dotnet/machinelearning --body "$REPORT"
````
In `repo-health-investigate.md`, the Step 5 report-back uses a `gh issue comment` command (line 199) to post to the dashboard issue. However, the compiled lock file explicitly notes "The gh CLI is NOT authenticated." The agent should instead use the `add_comment` safe-output tool to post the investigation report. As written, this command will fail and the investigation result will never be posted back to the dashboard.
```diff
- gh issue comment ${{ inputs.health_issue_number }} --repo dotnet/machinelearning --body "$REPORT"
+ add_comment "$REPORT"
```
Adds the 3-tier repo health monitoring system:
- Orchestrator (`repo-health-check`): daily at 6:00 UTC, collects issues/PRs/CI data, maintains dashboard issue
- Investigator (`repo-health-investigate`): dispatched for critical findings, deep-dive analysis
- Groomer (`repo-health-groom`): daily at 9:00 UTC, links results, hides stale comments

Also includes `.github/health-baseline.md` with 24 known P0/P1 issues and 6 long-running PRs.