Overview
Seven scenarios were tested across two runs (2026-05-16 and 2026-05-17), covering all five software worker personas. Overall average score: 4.2/5.0. The agent consistently excels at PR-triggered analysis workflows and struggles with scenarios requiring external infrastructure tooling (cloud CLIs, credentials, network allowlisting).
Key Findings
PR-triggered analysis workflows are the sweet spot — path-filtered pull_request triggers with add-comment safe-output consistently scored 4.4–4.8/5
Scheduled report workflows are reliable — schedule + create-issue/create-discussion with skip-if-match dedup works cleanly
Complex DevOps scenarios score lower due to out-of-scope concerns (cloud credentials, binary installation, large timeout risk), not framework limitations
Security posture is consistently correct — read-only agent job + safe-outputs delegation scored 4–5 across all scenarios
The visual-regression.md and test-coverage.md reference prompts give the agent a major quality boost for those domains
Quality Scores by Scenario
| # | Persona | Task | Avg |
|---|---|---|---|
| S3 | QA Tester | Test coverage PR comment | 4.8 |
| S5 | Frontend | Visual regression report | 4.6 |
| S1 | Backend | Schema migration review | 4.4 |
| S4 | PM | Weekly feature digest | 4.4 |
| S7 | Backend | API docs diff on PR | 4.4 |
| S6 | DevOps | Terraform drift detection | 3.8 |
| S2 | DevOps | Deployment log monitoring | 3.2 |
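The 4.2/5.0 overall average and the two score bands used later in this report (≥ 4.4 and ≤ 3.8) can be reproduced from the per-scenario averages. The snippet below is only a sanity check. It assumes the overall figure is an unweighted mean of the seven averages, which the report does not state explicitly, and the function names are illustrative.

```python
# Per-scenario average scores, copied from the table above.
SCORES = {
    "S3": 4.8, "S5": 4.6, "S1": 4.4, "S4": 4.4,
    "S7": 4.4, "S6": 3.8, "S2": 3.2,
}


def overall_average(scores):
    """Unweighted mean of the per-scenario averages, rounded to one decimal.

    Assumption: the report's overall score is a plain mean of these values.
    """
    return round(sum(scores.values()) / len(scores), 1)


def split_by_band(scores, high=4.4, low=3.8):
    """Group scenario IDs by the report's two cut-offs (>= high, <= low)."""
    strong = sorted(s for s, v in scores.items() if v >= high)
    weak = sorted(s for s, v in scores.items() if v <= low)
    return strong, weak


if __name__ == "__main__":
    print(overall_average(SCORES))
    print(split_by_band(SCORES))
```

Under that assumption the mean rounds to 4.2, and the two DevOps scenarios (S2, S6) are the only ones in the lower band.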
Top Patterns Observed
Most common trigger: pull_request (opened/synchronize) with optional paths: filter
Universal security: read-only agent job + safe-outputs for writes + skip-if-match deduplication on scheduled runs
High Quality Responses (scores ≥ 4.4)
S5 — Visual Regression (Frontend, 4.6/5)
Near-perfect scenario fit. The agent maps directly to .github/aw/visual-regression.md as a reference, applies Playwright with SSRF protection (localhost-only), uses cache-memory for baseline persistence, and correctly rate-limits add-comment with max:1. Only gap: should ask which app server command to use (storybook, vite, etc.).
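To illustrate what "localhost-only" means in practice, here is a minimal sketch of a URL check that a screenshot step could run before navigating. It is not how the framework or Playwright enforces the restriction; the function name, the allowed-host set, and the example port are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: only loopback hosts may be visited.
ALLOWED_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_local_url(url: str) -> bool:
    """Return True only for http(s) URLs whose host is a loopback name."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # rejects file://, ftp://, and similar schemes
    # parsed.hostname is lowercased and excludes port, userinfo, and brackets.
    return parsed.hostname in ALLOWED_HOSTS


if __name__ == "__main__":
    for candidate in (
        "http://localhost:6006/iframe.html",
        "http://localhost.example.com/",
        "http://127.0.0.1@example.com/",
    ):
        print(candidate, is_local_url(candidate))
```

Comparing the parsed hostname, rather than searching the raw string for "localhost", is what keeps look-alike hosts and userinfo tricks from passing the check.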
S3 — Test Coverage (QA, 4.8/5)
The agent correctly routes to the test-coverage.md prompt and applies coverage diff analysis with clear guidance on thresholds. Best-performing scenario across both runs.
S7 — API Docs (Backend, 4.4/5)
Clean path-filtered trigger, correct token-cost mitigation (scope analysis to changed files only), and update-comment pattern to avoid PR comment spam on re-runs.
Areas for Improvement (scores ≤ 3.8)
S2 — Deployment Monitoring (DevOps, 3.2/5)
The workflow_run trigger is correct but the required actions:read permission is easy to miss. The scenario implies multi-stage behavior (wait for deployment to stabilize) that the single-job model cannot support — the agent should explicitly call this out.
S6 — Terraform Drift (DevOps, 3.8/5)
The framework mechanics are sound, but the agent may generate a syntactically valid workflow that silently fails at runtime because:
Cloud API hostnames must be explicitly allowlisted (non-trivial for AWS/GCP/Azure)
Large workspaces risk the default 20-minute timeout
The agent should proactively surface these blockers rather than generate a workflow that appears complete.
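One way to surface these blockers up front is a keyword-triggered checklist, as proposed in the recommendations. The sketch below is hypothetical: the keywords and the four topics come from this report, but the function, the question wording, and the prefix-matching rule are assumptions rather than existing agent behavior.

```python
import re

# Keywords named in this report's "complex ops checklist" recommendation.
OPS_KEYWORDS = ("cloud", "terraform", "kubernetes", "deploy")

# Hypothetical wording for the four topics the recommendation lists.
OPS_QUESTIONS = (
    "Which cloud provider does this target?",
    "How will credentials be supplied?",
    "Which binaries must be installed on the runner?",
    "What is the expected run time?",
)


def ops_checklist(request: str) -> list[str]:
    """Return the checklist questions if the request looks like an ops task.

    Matching is by word prefix, so "deployment" and "deploys" both
    trigger on the "deploy" keyword.
    """
    words = re.findall(r"[a-z]+", request.lower())
    hit = any(w.startswith(k) for w in words for k in OPS_KEYWORDS)
    return list(OPS_QUESTIONS) if hit else []


if __name__ == "__main__":
    print(ops_checklist("Detect Terraform drift nightly"))
    print(ops_checklist("Comment test coverage on each PR"))
```

Asking all four questions on any keyword hit is deliberately blunt. A false positive costs the user a few extra prompts, while a miss yields a workflow that looks complete but fails at runtime.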
Recommendations
Add a "complex ops checklist" to .github/aw/create-agentic-workflow.md — when the agent detects keywords like "cloud", "terraform", "kubernetes", or "deploy", it should automatically prompt the user for: cloud provider, credentials strategy, required binaries, and expected run time. This prevents silent workflow failures on DevOps scenarios.
Expand .github/aw/create-agentic-workflow.md with workflow_run trigger guidance: document the actions:read permission requirement and the single-job constraint (no waiting for deployment stabilization). S2 and similar monitoring scenarios are blocked by this knowledge gap.
Add update-comment pattern to the PR workflow template in .github/aw/create-agentic-workflow.md — the "check for existing comment before posting" pattern (to avoid spam on synchronize events) came up in multiple scenarios (S1, S5, S7) and should be a first-class recommendation for any PR-triggered workflow that posts comments.
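The "check for existing comment before posting" pattern can be sketched as a small decision function. This is an illustration under stated assumptions, not the framework's update-comment implementation. The hidden HTML marker and the function name are hypothetical, and the comment dictionaries only borrow the id and body fields that GitHub's REST API returns for issue comments.

```python
# Hypothetical hidden marker used to recognise this workflow's own comment.
MARKER = "<!-- agentic-report -->"


def plan_comment(existing_comments, new_body, marker=MARKER):
    """Decide whether to update a previous comment or create a new one.

    existing_comments: iterable of dicts with "id" and "body" keys.
    Returns a dict describing the action; it performs no network calls.
    """
    body = f"{marker}\n{new_body}"
    for comment in existing_comments:
        if marker in comment["body"]:
            # A previous run already posted: edit that comment in place.
            return {"action": "update", "comment_id": comment["id"], "body": body}
    # First run on this PR: post a fresh comment.
    return {"action": "create", "body": body}


if __name__ == "__main__":
    first = plan_comment([], "Coverage: 81%")
    print(first["action"])
    posted = [{"id": 101, "body": first["body"]}]
    print(plan_comment(posted, "Coverage: 83%")["action"])
```

Keeping the decision separate from the API call makes the behavior on repeated synchronize events easy to test without touching a real pull request.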
Tools used across scenarios: github (gh-proxy), bash, playwright (browser scenarios), cache-memory (persistence)
Runs: §25995345892 | Previous: 2026-05-16, 2026-05-15