feat: add e2e triage CI workflow with Slack integration #741

Merged
alishakawaguchi merged 32 commits into main from alisha/e2e-triage-ci-job
Mar 23, 2026

@alishakawaguchi (Contributor) commented Mar 20, 2026

Context

When E2E tests fail on main, a Slack alert is posted but there's no easy way to kick off triage. This adds a triage workflow that can be triggered via workflow_dispatch (manually or via a Cloudflare Worker). Triage results post back to the same Slack thread.

E2E fails → bot posts alert to Slack
  → user triggers triage (manual or via Cloudflare Worker)
    → e2e-triage.yml runs → posts results back to Slack thread
    → generates fix plan → "Fix It" link dispatches e2e-fix.yml

Testing limitations

The claude-code-action requires workflows to be on the default branch (main) to run. This means the triage and fix workflows cannot be tested end-to-end until this PR is merged. The workflow YAML has been validated for syntax and all unit/integration/canary tests pass.

Follow-up work

  • Update e2e.yml to add a "Run Triage" link to the Slack failure alert (separate PR to keep this one focused)
  • The e2e.yml Slack notification is unchanged — still uses the incoming webhook

Summary

  • Add e2e-triage.yml — triages E2E failures per agent using claude-code-action (read-only analysis by default)
  • Add e2e-fix.yml — applies fix plans from triage, runs verification, creates draft PR
  • Add rerun toggle to triage workflow — when enabled, installs agent CLIs and re-runs failing tests to detect flaky vs real bugs (costs API tokens)
  • Plan generation step has EnterPlanMode, Write, and Bash(mise:*) tools
  • Triage uses chat.postMessage for Slack thread replies (incoming webhooks don't support thread_ts)
  • Add plan mode instruction to /e2e:implement skill

Secret/config changes needed

  • Add repo secret SLACK_BOT_TOKEN — bot token with chat:write scope (for triage/fix thread replies)
  • Repo secret ANTHROPIC_API_KEY already exists (used by claude-code-action), so no change is needed
  • Existing E2E_SLACK_WEBHOOK_URL is unchanged (used by e2e.yml for top-level alerts)

Test plan

  • mise run fmt && mise run lint && mise run test:ci — all pass (51 canary tests)
  • Merge to main
  • Trigger e2e-triage.yml via workflow_dispatch with a failed run URL
  • Verify triage output in job summary and artifacts
  • Test with rerun: true to verify flaky detection path
  • Verify Slack thread replies (requires SLACK_BOT_TOKEN secret)
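
For the manual dispatch step in the test plan, a minimal sketch of how the arguments for `gh workflow run` could be assembled. The input names `run_url` and `rerun` are assumptions for illustration (the PR text only says the workflow takes a failed run URL and has a rerun toggle), and `build_triage_dispatch` is a hypothetical helper, not something in this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print, one per line, the arguments for a `gh` invocation that dispatches
# the triage workflow. Input names (run_url, rerun) are assumed, not confirmed.
build_triage_dispatch() {
  local run_url="$1"
  local rerun="${2:-false}"   # read-only analysis unless rerun is requested
  printf '%s\n' workflow run e2e-triage.yml --ref main \
    -f "run_url=${run_url}" \
    -f "rerun=${rerun}"
}

# Example use (needs an authenticated gh CLI, so it is left as a comment):
#   mapfile -t args < <(build_triage_dispatch "https://github.com/OWNER/REPO/actions/runs/123" true)
#   gh "${args[@]}"
```

Keeping the argument list separate from the `gh` call makes it possible to inspect exactly what would be dispatched before spending API tokens on a rerun.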

🤖 Generated with Claude Code


Note

Medium Risk
Adds new GitHub Actions workflows that invoke claude-code-action, download artifacts, and push branches/create PRs with contents/pull-requests write permissions; misconfiguration could cause unintended repo writes or noisy Slack posting.

Overview
Introduces an on-demand E2E triage → fix pipeline via new workflow_dispatch GitHub Actions.

e2e-triage.yml builds an agent matrix (auto-detecting failed agents/SHA from a run URL), runs /e2e:triage-ci (optionally re-running tests after installing agent CLIs), persists triage + plan artifacts, and posts threaded Slack updates including a generated Fix It link.

e2e-fix.yml consumes those plan artifacts, uses claude-code-action to apply the specified fixes, runs fmt/lint/canary verification, then pushes a fix/e2e-* branch and opens a draft PR, with Slack success/failure notifications. Also updates /e2e:implement guidance to require entering plan mode before making changes.

Written by Cursor Bugbot for commit 8551955. Configure here.

alishakawaguchi and others added 15 commits March 17, 2026 15:15
Make sha and failed_agents optional for workflow_dispatch triggers.
When omitted, these values are derived from the run URL via the
GitHub API, reducing friction when triggering triage from the UI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 4a44db7b807d
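
As a rough illustration of deriving values from the run URL, the sketch below pulls the repository and run ID out of an Actions run URL with a bash regex. `parse_run_url` is a hypothetical helper; the workflow's actual parsing is not shown in this PR page.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Extract "owner/repo" and the numeric run ID from an Actions run URL.
# Prints "<owner/repo> <run_id>", or returns 1 if the URL does not match.
parse_run_url() {
  local url="$1"
  local re='^https://github\.com/([^/]+/[^/]+)/actions/runs/([0-9]+)'
  if [[ "$url" =~ $re ]]; then
    printf '%s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    return 1
  fi
}

# With the parsed values, a single request can return both the head SHA and
# the job list (needs an authenticated gh CLI, so it is left as a comment):
#   read -r repo run_id < <(parse_run_url "$RUN_URL")
#   gh run view "$run_id" --repo "$repo" --json headSha,jobs
```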
- Consolidate two gh API calls into one (headSha + jobs in single request)
- Extract duplicated CSV-to-JSON jq pattern into csv_to_json function
- Add "null" guard to agents_json validation
- Use shallow clone (fetch-depth: 1) for triage jobs
- Add server-side error logging in HTTP handler
- Fix gosec nolint placement and noctx lint errors in tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 0f803598ba36
Copilot AI review requested due to automatic review settings March 20, 2026 17:08
@alishakawaguchi alishakawaguchi self-assigned this Mar 20, 2026
Copilot AI left a comment

Pull request overview

Adds a Slack-triggered E2E triage path that bridges Slack thread replies (triage e2e) to a new GitHub Actions triage workflow, so failing CI runs can be triaged and reported back to Slack with minimal manual steps.

Changes:

  • Introduce .github/workflows/e2e-triage.yml (workflow_dispatch + repository_dispatch) to run /e2e:triage-ci per failed agent and post Slack thread updates.
  • Add cmd/e2e-triage-dispatch/ HTTP service plus internal/slacktriage/ helpers to validate Slack events, parse parent alert metadata, and dispatch GitHub events.
  • Add machine-readable meta: data to E2E Slack alerts, plus docs and a runner script for the triage workflow.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| scripts/run-e2e-triage.sh | Runner script invoked by the triage workflow to execute the Claude E2E triage command and tee logs to artifacts. |
| internal/slacktriage/parent_message.go | Parses the meta: line from Slack alerts into structured metadata for dispatch. |
| internal/slacktriage/normalize.go | Normalizes Slack reply text and checks for the exact triage trigger phrase. |
| internal/slacktriage/dispatch.go | Builds the repository_dispatch payload from parsed metadata + Slack thread info. |
| internal/slacktriage/*_test.go | Unit tests for trigger normalization, parent metadata parsing, and dispatch payload creation. |
| cmd/e2e-triage-dispatch/main.go | Slack event receiver: verifies signatures, fetches parent message, parses metadata, dispatches to GitHub. |
| cmd/e2e-triage-dispatch/main_test.go | Handler + dispatcher unit tests (signature verification, ignore cases, dispatch path). |
| .github/workflows/e2e.yml | Adds machine-readable meta: metadata to the Slack failure alert. |
| .github/workflows/e2e-triage.yml | New triage workflow that validates payload, derives sha/agents when needed, runs triage, posts Slack updates, uploads artifacts. |
| docs/plans/2026-03-17-slack-triggered-e2e-triage-design.md | Design doc describing the Slack→GitHub triage system and contract. |
| docs/architecture/slack-e2e-triage.md | Architecture/runbook-style overview for operating the Slack-triggered triage. |
| README.md | Documents Slack-triggered E2E triage and points to the architecture doc. |

alishakawaguchi and others added 8 commits March 20, 2026 10:25
Adds push-triggered test mode that runs with the vogon canary agent
(no API costs) when workflow-related files change on this branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: eec73e0fab92
This reverts commit f8a82d6.

Entire-Checkpoint: 363c74b4a8c5
The triage workflow was checking out the failed run's SHA, which
doesn't contain the triage script. Now checks out the workflow's
own branch and passes the target SHA as an env var instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: d4ec0a1e350d
Use --allowedTools with explicit per-command scoping instead of
--dangerously-skip-permissions. Each gh command is locked to the
specific repo, workflow, and agent. No generic shell access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 4004148a4e05
Instead of giving Claude shell access to gh/scripts, download
artifacts in the script before invoking Claude. Claude only gets
Read, Grep, and Glob — pure analysis, no shell execution.

Also improve job summary to show helpful message when log is empty.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: ca4f43d851a5
…aries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 4fc3a7119ed8
…lack triage

Replace the Go-based Slack Events API dispatch service with a lightweight
Cloudflare Worker that bridges Slack links to GitHub workflow_dispatch.
The e2e.yml alert now posts via bot token (chat.postMessage) to capture
thread context, then includes a clickable "Run Triage" link.

- Add workers/e2e-triage-trigger/ (Cloudflare Worker)
- Switch e2e.yml Slack alert from webhook to bot token + curl
- Remove repository_dispatch trigger from e2e-triage.yml
- Delete cmd/e2e-triage-dispatch/ and internal/slacktriage/
- Update docs and README

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: b4154940fa73
The e2e-triage-trigger worker belongs in the infra repo
(cloudflare/workers/e2e-triage-trigger/), not the CLI repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: c67902090531
@alishakawaguchi
Contributor Author

bugbot run

@cursor cursor bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

alishakawaguchi and others added 8 commits March 20, 2026 15:20
…orkflow

jq capture() crashes the pipeline when a failed job name doesn't match
the expected (agent) pattern. Wrap in try-catch to gracefully skip
non-matching jobs. Add concurrency group to e2e-triage workflow to
prevent duplicate runs from Slack retries or re-dispatches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 6db4ca1d788d
The tee to stdout made the "Run triage" step log unreadable since
GitHub Actions logs render markdown as plain text. The rendered
report is already written to $GITHUB_STEP_SUMMARY.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 748308902ac5
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 9835d73e7909
Migrate triage to claude-code-action, add plan generation step after
triage, post "Fix It" link to Slack, and create e2e-fix.yml workflow
that applies plans and opens draft PRs via claude-code-action.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: b0e6e608e807
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: f8fee1d345e0
The claude-code-action can succeed without producing an execution_file.
Add the same non-empty guard used by the triage extraction step, and
gate downstream steps (Slack post, artifact upload) on plan_output
succeeding rather than just the plan step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 7f2ce308957d
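
A minimal sketch of the kind of non-empty guard described in the commit above, assuming the action's output path is available in a shell variable. `has_output` and `EXECUTION_FILE` are hypothetical names, and the workflow's real step may differ.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Succeed only when a path was provided and the file exists with size > 0.
# This catches an upstream step that "succeeded" without producing any output.
has_output() {
  local file="${1:-}"
  [[ -n "$file" && -s "$file" ]]
}

# Example gate for downstream steps (EXECUTION_FILE is an assumed variable):
#   if has_output "${EXECUTION_FILE:-}"; then
#     echo "plan_ready=true" >> "$GITHUB_OUTPUT"
#   else
#     echo "warning: no plan output produced" >&2
#     exit 1
#   fi
```

Failing the extraction step itself means later steps can gate on that step's outcome rather than on the action's exit status alone.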
Revert Slack notification from chat.postMessage API back to the
original incoming webhook approach for minimal changes. Remove
premature architecture doc, design plan, and README section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: eeb2507079c5
…riage tools

Revert e2e.yml jq change to match main. Delete unused run-e2e-triage.sh
script. Add `rerun` boolean input to e2e-triage.yml that installs agent
CLIs and enables Bash tools for flaky detection via test re-runs. Update
plan generation step with EnterPlanMode, Write, and Bash(mise:*) tools.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 2f2c12c4092e
@alishakawaguchi
Contributor Author

bugbot run

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Shell injection in Python URL-encoding command
    • Replaced bash variable interpolation '$RUN_URL' with os.environ['RUN_URL'] in the Python command to avoid shell injection via crafted URL input.
  • ✅ Fixed: Duplicated Slack helper script across two workflows
    • Extracted the duplicated inline Slack helper heredoc from both workflows into a shared scripts/post-slack-message.sh script, and moved checkout steps before Slack notification steps so the script is available.
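
The injection fix comes down to keeping untrusted input out of the program text and passing it through the environment instead. The PR applies this to a `python3 -c` one-liner. The sketch below shows the same principle with a `bash -c` child process, a swapped-in stand-in chosen so the example has no Python dependency. `echo_via_env` is a hypothetical name.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Safe pattern: the child's program text is a fixed single-quoted string, and
# the untrusted value reaches it only as data through the environment.
echo_via_env() {
  VALUE="$1" bash -c 'printf "%s\n" "$VALUE"'
}

# Unsafe pattern, shown as a comment only and never executed: the value is
# spliced into the program text, so a quote in the input can end the string
# literal and run arbitrary commands.
#   bash -c "printf '%s\n' '$1'"
```

With the safe form, an input such as `x'; echo INJECTED; echo '` is printed back verbatim rather than interpreted.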

To push these changes, comment:

@cursor push dfe21a7917
Preview (dfe21a7917)
diff --git a/.github/workflows/e2e-fix.yml b/.github/workflows/e2e-fix.yml
--- a/.github/workflows/e2e-fix.yml
+++ b/.github/workflows/e2e-fix.yml
@@ -43,38 +43,11 @@
       SLACK_CHANNEL: ${{ inputs.slack_channel }}
       SLACK_THREAD_TS: ${{ inputs.slack_thread_ts }}
     steps:
-      - name: Write Slack helper
-        if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }}
-        shell: bash
-        run: |
-          set -euo pipefail
+      - name: Checkout repository
+        uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
 
-          helper="$RUNNER_TEMP/post-slack-message.sh"
-          cat > "$helper" <<'EOF'
-          #!/usr/bin/env bash
-          set -euo pipefail
-
-          text="${1:?message is required}"
-          payload="$(jq -n \
-            --arg channel "$SLACK_CHANNEL" \
-            --arg thread_ts "$SLACK_THREAD_TS" \
-            --arg text "$text" \
-            '{channel: $channel, thread_ts: $thread_ts, text: $text}')"
-
-          if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \
-            -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \
-            -H 'Content-type: application/json; charset=utf-8' \
-            --data "$payload")"; then
-            echo "warning: slack notification failed" >&2
-            exit 0
-          fi
-
-          if ! jq -e '.ok == true' >/dev/null <<<"$response"; then
-            echo "warning: slack notification returned non-ok response" >&2
-          fi
-          EOF
-          chmod +x "$helper"
-
       - name: Post fix started
         if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }}
         shell: bash
@@ -83,13 +56,8 @@
         run: |
           set -euo pipefail
 
-          "$RUNNER_TEMP/post-slack-message.sh" "Starting E2E fix for \`${FAILED_AGENTS}\`."
+          scripts/post-slack-message.sh "Starting E2E fix for \`${FAILED_AGENTS}\`."
 
-      - name: Checkout repository
-        uses: actions/checkout@v6
-        with:
-          fetch-depth: 0
-
       - name: Setup mise
         uses: jdx/mise-action@v4
 
@@ -157,7 +125,7 @@
             message="E2E fix complete — changes applied but no PR was created. Check the <${RUN_URL}|workflow run> for details."
           fi
 
-          "$RUNNER_TEMP/post-slack-message.sh" "$message"
+          scripts/post-slack-message.sh "$message"
 
       - name: Post failure to Slack
         if: failure() && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != ''
@@ -169,4 +137,4 @@
 
           message="E2E fix failed. Check the <${RUN_URL}|workflow run> for details."
 
-          "$RUNNER_TEMP/post-slack-message.sh" "$message"
+          scripts/post-slack-message.sh "$message"

diff --git a/.github/workflows/e2e-triage.yml b/.github/workflows/e2e-triage.yml
--- a/.github/workflows/e2e-triage.yml
+++ b/.github/workflows/e2e-triage.yml
@@ -129,51 +129,19 @@
       matrix:
         agent: ${{ fromJson(needs.matrix-setup.outputs.agents) }}
     steps:
-      - name: Write Slack helper
-        if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }}
-        shell: bash
-        run: |
-          set -euo pipefail
+      - name: Checkout repository
+        uses: actions/checkout@v6
+        with:
+          fetch-depth: 1
 
-          helper="$RUNNER_TEMP/post-slack-message.sh"
-          cat > "$helper" <<'EOF'
-          #!/usr/bin/env bash
-          set -euo pipefail
-
-          text="${1:?message is required}"
-          payload="$(jq -n \
-            --arg channel "$SLACK_CHANNEL" \
-            --arg thread_ts "$SLACK_THREAD_TS" \
-            --arg text "$text" \
-            '{channel: $channel, thread_ts: $thread_ts, text: $text}')"
-
-          if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \
-            -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \
-            -H 'Content-type: application/json; charset=utf-8' \
-            --data "$payload")"; then
-            echo "warning: slack notification failed" >&2
-            exit 0
-          fi
-
-          if ! jq -e '.ok == true' >/dev/null <<<"$response"; then
-            echo "warning: slack notification returned non-ok response" >&2
-          fi
-          EOF
-          chmod +x "$helper"
-
       - name: Post triage started
         if: ${{ env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != '' }}
         shell: bash
         run: |
           set -euo pipefail
 
-          "$RUNNER_TEMP/post-slack-message.sh" "Starting E2E triage for \`$E2E_AGENT\` on <$RUN_URL|this run>."
+          scripts/post-slack-message.sh "Starting E2E triage for \`$E2E_AGENT\` on <$RUN_URL|this run>."
 
-      - name: Checkout repository
-        uses: actions/checkout@v6
-        with:
-          fetch-depth: 1
-
       - name: Setup mise
         uses: jdx/mise-action@v4
 
@@ -317,7 +285,7 @@
             message="E2E triage failed for \`$E2E_AGENT\`."
           fi
 
-          "$RUNNER_TEMP/post-slack-message.sh" "$message"
+          scripts/post-slack-message.sh "$message"
 
       - name: Post fix plan to Slack
         if: steps.plan_output.outcome == 'success' && env.SLACK_BOT_TOKEN != '' && env.SLACK_CHANNEL != '' && env.SLACK_THREAD_TS != ''
@@ -332,7 +300,7 @@
           summary="$(head -20 "$PLAN_FILE" | sed '/^$/d' | head -5)"
 
           # Construct Fix It URL
-          encoded_run_url="$(python3 -c "import urllib.parse; print(urllib.parse.quote('$RUN_URL', safe=''))")"
+          encoded_run_url="$(python3 -c "import urllib.parse, os; print(urllib.parse.quote(os.environ['RUN_URL'], safe=''))")"
           fix_url="https://e2e-triage.entireio.workers.dev/fix?triage_run_id=${TRIAGE_RUN_ID}&run_url=${encoded_run_url}&failed_agents=${E2E_AGENT}&slack_channel=${SLACK_CHANNEL}&slack_thread_ts=${SLACK_THREAD_TS}"
 
           message="Fix plan ready for \`$E2E_AGENT\`:
@@ -340,7 +308,7 @@
 
           <${fix_url}|Fix It> — applies the plan and creates a draft PR"
 
-          "$RUNNER_TEMP/post-slack-message.sh" "$message"
+          scripts/post-slack-message.sh "$message"
 
       - name: Upload triage output
         if: always()

diff --git a/scripts/post-slack-message.sh b/scripts/post-slack-message.sh
new file mode 100644
--- /dev/null
+++ b/scripts/post-slack-message.sh
@@ -1,0 +1,21 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+text="${1:?message is required}"
+payload="$(jq -n \
+  --arg channel "$SLACK_CHANNEL" \
+  --arg thread_ts "$SLACK_THREAD_TS" \
+  --arg text "$text" \
+  '{channel: $channel, thread_ts: $thread_ts, text: $text}')"
+
+if ! response="$(curl -fsS https://slack.com/api/chat.postMessage \
+  -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \
+  -H 'Content-type: application/json; charset=utf-8' \
+  --data "$payload")"; then
+  echo "warning: slack notification failed" >&2
+  exit 0
+fi
+
+if ! jq -e '.ok == true' >/dev/null <<<"$response"; then
+  echo "warning: slack notification returned non-ok response" >&2
+fi

Extract duplicated Slack post-message heredoc from e2e-triage.yml and
e2e-fix.yml into scripts/post-slack-message.sh. Fix shell injection in
Python URL-encoding by reading RUN_URL from os.environ instead of
interpolating into the code string.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 91d990ef6fb1
@alishakawaguchi alishakawaguchi marked this pull request as ready for review March 23, 2026 17:08
@alishakawaguchi alishakawaguchi requested a review from a team as a code owner March 23, 2026 17:08
@alishakawaguchi alishakawaguchi merged commit b56f302 into main Mar 23, 2026
3 checks passed
@alishakawaguchi alishakawaguchi deleted the alisha/e2e-triage-ci-job branch March 23, 2026 17:22
3 participants