
Add repo health check agentic workflows #7583

Merged
JanKrivanek merged 1 commit into main from repo-health-check
Mar 6, 2026

Conversation

@JanKrivanek
Member

Adds the 3-tier repo health monitoring system:

  • Orchestrator (repo-health-check): daily at 6:00 UTC, collects issues/PRs/CI data, maintains dashboard issue
  • Investigator (repo-health-investigate): dispatched for critical findings, deep-dive analysis
  • Groomer (repo-health-groom): daily at 9:00 UTC, links results, hides stale comments

Also includes .github/health-baseline.md with 24 known P0/P1 issues and 6 long-running PRs.

Copilot AI review requested due to automatic review settings March 6, 2026 19:18
JanKrivanek merged commit df690c2 into main on Mar 6, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a 3-tier agentic repo health monitoring system to the dotnet/machinelearning repository using GitHub's gh-aw (agentic workflows) framework. The system automatically monitors issues, PRs, and CI pipelines, and takes automated actions on the dashboard.

Changes:

  • Adds an Orchestrator workflow (repo-health-check) that runs daily at 6:00 UTC, collects health data, diffs against previous runs, updates a pinned dashboard issue, and dispatches investigators for critical findings.
  • Adds an Investigator workflow (repo-health-investigate) dispatched for critical/high findings to perform deep-dive analysis and post results back to the dashboard.
  • Adds a Groomer workflow (repo-health-groom) that runs daily at 9:00 UTC, links investigation results, hides stale comments, and enforces dashboard structure.
  • Adds .github/health-baseline.md cataloguing 24 known P0/P1 issues and 6 long-running PRs to suppress false positives.
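
The baseline suppression in the last bullet amounts to a set difference between current findings and accepted ones. The sketch below is only an illustration of that idea. It assumes both sides have already been reduced to plain text files with one issue number per line; the real layout of .github/health-baseline.md is not shown in this PR, and `filter_new_findings` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch only: assumes both inputs are plain text files with one issue number per line.
# filter_new_findings is a hypothetical helper name, not part of the PR.

filter_new_findings() {
  local baseline_file="$1"   # issue numbers already accepted in the baseline
  local current_file="$2"    # issue numbers found in the current run
  # -v: invert match, -x: whole-line match, -F: fixed strings, -f: read patterns from a file.
  # grep exits 1 when it prints nothing, so "|| true" keeps "no new findings"
  # from being treated as an error.
  grep -vxF -f "$baseline_file" "$current_file" || true
}
```

Whole-line matching (`-x`) matters here: without it, a baselined issue 10 would also suppress issue 101.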

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| .github/workflows/repo-health-check.md | Orchestrator prompt/playbook for the daily health check agentic workflow |
| .github/workflows/repo-health-check.lock.yml | Auto-generated compiled workflow for the orchestrator |
| .github/workflows/repo-health-investigate.md | Investigator prompt/playbook for deep-dive analysis |
| .github/workflows/repo-health-investigate.lock.yml | Auto-generated compiled workflow for the investigator |
| .github/workflows/repo-health-groom.md | Groomer prompt/playbook for dashboard maintenance |
| .github/workflows/repo-health-groom.lock.yml | Auto-generated compiled workflow for the groomer |
| .github/health-baseline.md | Known baseline of accepted issues/PRs to suppress from new finding alerts |
| .github/aw/actions-lock.json | Action version pins for gh-aw toolchain |

Comments suppressed due to low confidence (2)

.github/workflows/repo-health-groom.md:155

  • The repo-health-groom.md playbook instructs the agent to hide comments using gh api graphql (line 145), but the compiled lock file's agent prompt explicitly states "The gh CLI is NOT authenticated. Do NOT use gh commands for GitHub operations." The agent cannot use gh api graphql for the minimize mutation; instead it should use the hide_comment safe-output tool (which is defined in the lock file's safe outputs config). The hide-comment operation in the playbook should reference the safe-output tool call rather than a gh command, otherwise the hide operation will silently fail or error on every run.

### Hide Operation

```bash
# Minimize comment (hide with reason)
gh api graphql -f query='
  mutation {
    minimizeComment(input: {
      subjectId: "COMMENT_NODE_ID",
      classifier: OUTDATED
    }) {
      minimizedComment { isMinimized }
    }
  }
'
```

.github/workflows/repo-health-check.md:413

  • In `repo-health-check.md`, Step 5 instructs the agent to dispatch the investigator using `gh workflow run repo-health-investigate.lock.yml ...`. However, the gh CLI is not authenticated, so this command will fail. The agent should instead use the `repo_health_investigate` safe-output tool (which is defined in the lock file as a `dispatch_workflow` safe output) to trigger the investigation workflow.

```bash
# Budget: max 5 dispatches
DISPATCHED=0

for finding in critical_and_high_findings; do
  if [ $DISPATCHED -ge 5 ]; then
    break
  fi

  gh workflow run repo-health-investigate.lock.yml \
    --repo dotnet/machinelearning \
    -f finding_id="$FINDING_ID" \
    -f category="$CATEGORY" \
    -f severity="$SEVERITY" \
    -f summary="$SUMMARY" \
    -f health_issue_number="$ISSUE"

  DISPATCHED=$((DISPATCHED + 1))
done
```
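
The loop quoted above is pseudocode (`critical_and_high_findings` is a bare word, not a variable), so the budget logic itself is sketched here in runnable form. This is a minimal illustration, not the PR's implementation: `dispatch_investigation` is a hypothetical stand-in for whatever mechanism actually triggers the investigator, and `MAX_DISPATCHES` is an assumed name for the limit of 5 stated in the playbook.

```bash
#!/usr/bin/env bash
# Sketch only: caps the number of investigator dispatches per run.
# MAX_DISPATCHES and dispatch_investigation are illustrative names, not from the PR.
MAX_DISPATCHES=5

dispatch_investigation() {
  # Hypothetical stand-in: the real workflow would trigger the investigator here.
  echo "dispatch: $1"
}

dispatch_with_budget() {
  local dispatched=0
  local finding
  for finding in "$@"; do
    if [ "$dispatched" -ge "$MAX_DISPATCHES" ]; then
      # Report the cut-off on stderr so it is visible in the run log.
      echo "budget of $MAX_DISPATCHES reached; remaining findings not dispatched" >&2
      break
    fi
    dispatch_investigation "$finding"
    dispatched=$((dispatched + 1))
  done
}
```

Calling `dispatch_with_budget "${FINDINGS[@]}"` with seven findings dispatches the first five and logs that the rest were skipped.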

Comment on lines +203 to +204
# Only if validation passes
gh issue edit "$ISSUE" --repo dotnet/machinelearning --body "$UPDATED_BODY"

Copilot AI Mar 6, 2026


In repo-health-groom.md, Step 6 instructs the agent to apply the issue body update via gh issue edit ... (line 204). Since the gh CLI is not authenticated, this will always fail. The agent should instead use the update_issue safe-output tool to apply the validated update, consistent with how the safe-outputs config is set up in the lock file.

This issue also appears on line 141 of the same file.

Suggested change
-# Only if validation passes
-gh issue edit "$ISSUE" --repo dotnet/machinelearning --body "$UPDATED_BODY"
+# Only if validation passes; use the safe-output tool to apply the update
+update_issue "$ISSUE" "$UPDATED_BODY"

Comment on lines +237 to +268
```bash
# Check if AZDO_PAT is available; skip AzDO checks if not
if [ -z "$AZDO_PAT" ]; then
echo "AZDO_PAT not set — skipping Azure DevOps pipeline checks"
else
for pipeline in vsts-ci codecoverage-ci night-build outer-loop-build; do
curl -s -u ":$AZDO_PAT" \
"https://dev.azure.com/dnceng/public/_apis/build/builds?definitions=$pipeline&\$top=1&api-version=7.0" \
| jq '.value[0] | {id, buildNumber, status, result, queueTime, finishTime}'
done
```

**A2. Pipeline failure rate (last 7 days)**

```bash
SINCE=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-7d +%Y-%m-%dT%H:%M:%SZ)
for pipeline in vsts-ci codecoverage-ci night-build outer-loop-build; do
curl -s -u ":$AZDO_PAT" \
"https://dev.azure.com/dnceng/public/_apis/build/builds?definitions=$pipeline&minTime=$SINCE&api-version=7.0" \
| jq '[.value[] | .result] | group_by(.) | map({result: .[0], count: length})'
done
```

**A3. Queue times**

```bash
for pipeline in vsts-ci codecoverage-ci night-build outer-loop-build; do
curl -s -u ":$AZDO_PAT" \
"https://dev.azure.com/dnceng/public/_apis/build/builds?definitions=$pipeline&\$top=10&api-version=7.0" \
| jq '[.value[] | {queueTime, startTime} | {wait: ((.startTime | fromdateiso8601) - (.queueTime | fromdateiso8601))}] | {avg_wait_seconds: (map(.wait) | add / length)}'
done
fi
```

Copilot AI Mar 6, 2026


The AzDO bash code in the orchestrator instructions is split across three separate fenced code blocks (A1, A2, A3), but the if/else control flow spans all three:

  • A1 block opens if [ -z "$AZDO_PAT" ]; then ... else but has no fi and no closing of the block.
  • A2 block contains only part of the else body (a single for loop) with no if/fi of its own.
  • A3 block contains the final for loop followed by fi to close the A1 if.

Since these are separate fenced Markdown code blocks, the agent will treat them as distinct snippets. The fi at the end of A3 appears to close a non-existent if, and A2 has no guard at all. If the agent executes A2 or A3 as standalone blocks, the A2 and A3 curl commands will run unconditionally even when AZDO_PAT is not set, causing authentication errors on every run. The entire if/else/fi guard should be contained within a single code block, or each block should have its own if [ -n "$AZDO_PAT" ] guard.
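
A minimal sketch of the second option, where each snippet carries its own guard and is therefore safe to run in isolation. It rests on assumptions: `fetch_latest_build` is a hypothetical placeholder for the curl call against the AzDO REST API, and `PIPELINES` and `check_pipelines` are illustrative names rather than anything defined in the PR.

```bash
#!/usr/bin/env bash
# Sketch only: a self-contained check that guards itself instead of relying on
# an if/else opened in a different code block.
PIPELINES="vsts-ci codecoverage-ci night-build outer-loop-build"

fetch_latest_build() {
  # Hypothetical placeholder; the playbook would call the AzDO REST API here.
  echo "would query latest build for $1"
}

check_pipelines() {
  # Guard lives inside the function, so no caller can skip it.
  if [ -z "${AZDO_PAT:-}" ]; then
    echo "AZDO_PAT not set; skipping Azure DevOps pipeline checks"
    return 0
  fi
  local pipeline
  for pipeline in $PIPELINES; do
    fetch_latest_build "$pipeline"
  done
}
```

Repeating the same early return in the failure-rate and queue-time snippets would make all three blocks independent of one another.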

Comment on lines +348 to +381

```bash
# Find existing dashboard issue
ISSUE=$(gh issue list --repo dotnet/machinelearning \
--label "repo-health" --state open \
--json number --jq '.[0].number')

if [ -z "$ISSUE" ]; then
# Create new dashboard issue
ISSUE=$(gh issue create --repo dotnet/machinelearning \
--title "🏥 Repo Health Dashboard" \
--label "repo-health" \
--body "$DASHBOARD_BODY")
# Pin the issue
gh issue pin "$ISSUE" --repo dotnet/machinelearning
fi
```

### Update Issue Body

Replace the entire issue body with the current state using the dashboard format. Include:

1. **Header** — Last updated timestamp, overall status emoji and counts
2. **Summary** — Executive summary (1-2 sentences)
3. **Findings tables** — Critical, Warning, Recently Resolved, Baselined
4. **Trends (7-day)** — Key metrics with directional arrows
5. **Footer** — Link to workflow run and baseline file

### Post Daily Comment

```bash
gh issue comment "$ISSUE" --repo dotnet/machinelearning \
--body "$DELTA_SUMMARY"
```

Copilot AI Mar 6, 2026


The repo-health-check.md orchestrator instructs the agent to use gh issue create and gh issue pin to create/pin the dashboard issue (lines 356–362), and gh issue comment (line 379), but the compiled lock file explicitly states "The gh CLI is NOT authenticated." The agent should instead use the create_issue, update_issue, and add_comment safe-output tool calls for these operations. As written, these gh commands will fail silently on every run, meaning the dashboard issue will never be created or commented on.

This issue also appears on line 394 of the same file.
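
Separately from switching to the safe-output tools, any snippet that still shells out to gh could check authentication first so that the failure is loud rather than silent. The sketch below is one possible preflight, offered as an assumption rather than something the PR contains: `require_gh_auth` is a hypothetical helper name, and it relies only on `gh auth status` exiting non-zero when the CLI has no valid login.

```bash
#!/usr/bin/env bash
# Sketch only: fail fast when the gh CLI is missing or unauthenticated.
# require_gh_auth is a hypothetical helper name, not part of the PR.

require_gh_auth() {
  if ! command -v gh >/dev/null 2>&1; then
    echo "gh CLI not found; cannot run gh-based steps" >&2
    return 1
  fi
  # gh auth status exits non-zero when no valid authentication is configured.
  if ! gh auth status >/dev/null 2>&1; then
    echo "gh CLI is not authenticated; use the safe-output tools instead" >&2
    return 1
  fi
  return 0
}
```

A step would then start with `require_gh_auth || exit 1`, which turns a silently skipped dashboard update into a visible error in the run log.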

Post a single comment on the dashboard issue (#${{ inputs.health_issue_number }}).

```bash
gh issue comment ${{ inputs.health_issue_number }} --repo dotnet/machinelearning --body "$REPORT"
```

Copilot AI Mar 6, 2026


In repo-health-investigate.md, the Step 5 report-back uses a gh issue comment command (line 199) to post to the dashboard issue. However, the compiled lock file explicitly notes "The gh CLI is NOT authenticated." The agent should instead use the add_comment safe-output tool to post the investigation report. As written, this command will fail and the investigation result will never be posted back to the dashboard.

Suggested change
-gh issue comment ${{ inputs.health_issue_number }} --repo dotnet/machinelearning --body "$REPORT"
+add_comment "$REPORT"
