plan: add reasoning_depth A/B experiment#42955
Conversation
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR does not have the 'implementation' label and has 0 new lines of code in business logic directories (threshold: 100). |
|
❌ Test Quality Sentinel failed during test quality analysis. |
There was a problem hiding this comment.
Pull request overview
This PR adds an A/B/n experiment (reasoning_depth) to the plan workflow to compare three planning instruction variants (shallow/baseline/deep), and updates the prompt to conditionally inject variant-specific “Planning Approach” guidance before the planning begins.
Changes:
- Adds
experiments.reasoning_depthfrontmatter with variants, weights, minimum samples, and guardrail metrics. - Adds a conditional
## Planning Approachblock that changes planning instructions based on the selected experiment variant. - Regenerates
plan.lock.ymlto wire experiment selection/state persistence into the compiled GitHub Actions workflow.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/plan.md | Defines the reasoning_depth experiment and conditionally injects variant-specific planning instructions into the prompt. |
| .github/workflows/plan.lock.yml | Compiled workflow update that introduces experiment pick/state steps and updates runtime/action/container pins. |
Review details
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 2/2 changed files
- Comments generated: 3
- Review effort level: Low
| - name: workflow_success_rate | ||
| direction: min | ||
| threshold: 0.95 |
| - name: Setup Scripts | ||
| id: setup | ||
| uses: ./actions/setup | ||
| uses: github/gh-aw-actions/setup@v0.81.6 |
| @@ -147,15 +143,16 @@ jobs: | |||
| GH_AW_INFO_ENGINE_ID: "copilot" | |||
| GH_AW_INFO_ENGINE_NAME: "GitHub Copilot CLI" | |||
| GH_AW_INFO_MODEL: ${{ vars.GH_AW_MODEL_AGENT_COPILOT || vars.GH_AW_DEFAULT_MODEL_COPILOT || 'claude-sonnet-4.6' }} | |||
| GH_AW_INFO_VERSION: "1.0.67" | |||
| GH_AW_INFO_AGENT_VERSION: "1.0.67" | |||
| GH_AW_INFO_VERSION: "1.0.65" | |||
| GH_AW_INFO_AGENT_VERSION: "1.0.65" | |||
| GH_AW_INFO_CLI_VERSION: "v0.81.6" | |||
🤖 PR Triage
Score breakdown: Impact 20 · Urgency 5 · Quality 5 Draft/WIP. Adds A/B experiment campaign for reasoning depth to
|
There was a problem hiding this comment.
REQUEST_CHANGES: three high-severity issues block this experiment from working correctly.
Blocking issues (3 HIGH)
- Unresolvable primary metric — has no collection mechanism; the experiment cannot be evaluated against its own hypothesis.
- Supply chain risk — is the only action in the manifest pinned by mutable tag rather than commit SHA; all other actions use immutable SHAs.
- Container image regression — firewall images downgraded from 0.27.20 to 0.27.11 and from v1.5.0 to v1.4.0 with no documented reason; this could re-introduce fixed bugs on every invocation.
Non-blocking concerns (3 MEDIUM)
- **No ** — experiment runs indefinitely, requires manual cleanup PR.
- ** v5.0.5 vs v6.1.0** — one step selectively uses an older version with no explanation.
- Artifact retention 30 days, no end_date — run data older than 30 days will be deleted before the sample set is complete.
🔎 Code quality review by PR Code Quality Reviewer · 93.9 AIC · ⌖ 10.2 AIC · ⊞ 1.6K
Comment /review to run again
| # - actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0 | ||
| # - actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e # v6.4.0 | ||
| # - actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1 | ||
| # - github/gh-aw-actions/setup@v0.81.6 |
There was a problem hiding this comment.
** is pinned by mutable tag, not a commit SHA** — every other action in this manifest has an immutable SHA; this one has , silently re-pointable.
💡 Details
The manifest header shows:
Every other action entry (e.g. , ) carries a real 40-hex commit SHA. A tag can be force-pushed without any PR or audit trail. If is force-pushed to a malicious commit, all jobs using this action will silently execute untrusted code with permissions.
Resolve by pinning to the commit SHA the tag currently points to:
| variants: [shallow, baseline, deep] | ||
| description: "Tests whether lighter or more reflective planning instructions produce better agent-ready sub-issues for /plan requests." | ||
| hypothesis: "H0: no change in plan quality score or success rate. H1: deep improves quality score by >=8%, or shallow reduces run cost/latency by >=15% with no material quality loss." | ||
| metric: plan_quality_score |
There was a problem hiding this comment.
** is declared as the primary metric but has no defined collection path in this PR** — the experiment cannot be analysed.
💡 Details
The hypothesis states:
H1: deep improves quality score by >=8%
But neither nor the new pipeline emits a value. Without instrumentation, the guardrail check against this metric is a no-op and the experiment will only ever produce secondary metrics (, , ).
Before merging, either:
- Add a scoring step that emits to the experiment artifact, or
- Replace the primary metric with one already collected (e.g. ) and document the proxy rationale in the field.
| threshold: 0.02 | ||
| min_samples: 137 | ||
| weight: [34, 33, 33] | ||
| start_date: "2026-07-02" |
There was a problem hiding this comment.
No — this experiment will run indefinitely past its intended window.
💡 Details
The workflow classifies an experiment as while ; without it never terminates automatically. only gates analysis readiness, not experiment shutdown.
A permanently active experiment continuously splits production traffic among all three variants, making the prompt permanently larger and requiring a manual follow-up PR to remove the conditionals.
Add an explicit :
| if: ${{ env.GH_AW_MAX_DAILY_AI_CREDITS != '' }} | ||
| continue-on-error: true | ||
| uses: actions/cache/restore@55cc8345863c7cc4c66a329aec7e433d2d1c52a9 # v6.1.0 | ||
| uses: actions/cache/restore@27d5ce7f107fe9357f9df03efb73ab90386fccae # v5.0.5 |
There was a problem hiding this comment.
** selectively downgraded from v6.1.0 to v5.0.5 for this one step only** — unexplained inconsistency across the same lock file.
💡 Details
All other cache/restore usages in this file use (v6.1.0), but the daily-AIC-cache restore step was changed to (v5.0.5). The manifest header now carries two separate SHAs for the same action.
This inconsistency has no explanation in the PR description and is suspicious: if v5 was needed here for compatibility, that should be documented. If it is an error, it silently pins one cache step to an older version that may lack bugfixes.
Either upgrade to v6.1.0 consistently, or add a comment explaining the required downgrade.
| name: plan-experiment | ||
| path: /tmp/gh-aw/experiments | ||
| if-no-files-found: ignore | ||
| retention-days: 30 |
There was a problem hiding this comment.
Experiment artifact retention (30 days) is shorter than the experiment could run — early-run data will be deleted before analysis completes.
💡 Details
With required and no , this experiment could easily run for more than 30 days if the command is not invoked frequently. GitHub Actions artifacts are automatically deleted after the window; any run artifacts older than 30 days will be gone by the time you are ready to analyse the full sample set.
Recommendations:
- Set an (see the separate comment on ) so you can size the retention window to match.
- Or increase to at least 60-90 days as a buffer.
- Or ensure the job writes cumulative state to the branch on every run, so the branch (not the artifact) is the source of truth for analysis.
| # - ghcr.io/github/gh-aw-firewall/cli-proxy:0.27.20 | ||
| # - ghcr.io/github/gh-aw-firewall/squid:0.27.20 | ||
| # - ghcr.io/github/gh-aw-mcpg:v0.3.32@sha256:63e46b56dfd70895a701b6fc6dd0189e11e2d875f327f1781e81b31848735477 | ||
| # - ghcr.io/github/gh-aw-firewall/agent:0.27.11@sha256:979723c628182da7729333f2208bb249fd25ddee579645cf9a3892d681a929c7 |
There was a problem hiding this comment.
All firewall container images and MCP server were downgraded without explanation — 0.27.20 → 0.27.11, v1.5.0 → v1.4.0.
💡 Details
The previous lock file used firewall image and . This PR downgrades both to earlier versions. Lock files are compiled outputs, and the new lock was compiled with compiler v0.81.6 (added field), which implies these older image tags are what v0.81.6 emits.
If this is intentional (the new compiler pins to an older runtime for compatibility), that should be documented. If it is unintentional, it may re-introduce bugs or CVEs fixed in the higher versions, and the workflow will run against an older MCP server API surface.
Verify that the downgraded image versions are intentional and compatible with the new experiment features added in this PR.
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /grill-with-docs and /tdd — commenting with 5 findings, no blocking issues but a few items worth addressing before the experiment concludes.
📋 Key Themes & Highlights
Key Themes
- Undefined primary metric (line 48):
plan_quality_scorehas no collection mechanism described in this PR — the experiment's H1 threshold can't be evaluated without it. - No experiment lifetime bound (line 57): Missing
end_datemeans the experiment runs indefinitely if/planvolume is low or results are not acted on. - Fragile positional weight mapping (line 58):
weight: [34, 33, 33]maps silently to[shallow, baseline, deep]— a future reorder would silently mis-assign weights. - Implicit baseline branch (line 174): The
{{#else}}silently catches any unknown variant as baseline; naming it explicitly makes the contract clearer. - Minor: shallow lower bound (line 162): "3-5 deliverables" anchors a lower bound that the other variants don't have; intentional or not, it should be documented.
Positive Highlights
- ✅ Well-structured experiment spec with clear hypothesis, guardrail metrics, and secondary metrics
- ✅ Variant text is precise and actionable — the deep reflection steps are particularly well-specified
- ✅ Experiment artifact upload and state-restore steps are properly guarded with
if: always() - ✅ Guardrail thresholds (
workflow_success_rate >= 0.95,empty_plan_rate <= 0.02) are sensible safety nets
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 77.9 AIC · ⌖ 9 AIC · ⊞ 6.6K
Comment /matt to run again
| variants: [shallow, baseline, deep] | ||
| description: "Tests whether lighter or more reflective planning instructions produce better agent-ready sub-issues for /plan requests." | ||
| hypothesis: "H0: no change in plan quality score or success rate. H1: deep improves quality score by >=8%, or shallow reduces run cost/latency by >=15% with no material quality loss." | ||
| metric: plan_quality_score |
There was a problem hiding this comment.
[/grill-with-docs] plan_quality_score is listed as the primary metric, but there is no definition of how it is measured or collected anywhere in this PR.
This metric drives the experiment conclusion (H1 threshold is >=8%), yet without a collection mechanism the experiment cannot be evaluated against it.
💡 Suggestion
Either:
- Add a note in the frontmatter or a linked doc explaining where
plan_quality_scoreis produced and how it is persisted (e.g., a separate evaluation workflow, a manual labeling step), or - Replace it with a metric that is already automatically collected such as
issue_creation_countoroutput_token_estimateuntil a quality-scoring pipeline exists.
@copilot please address this.
| - name: empty_plan_rate | ||
| direction: min | ||
| threshold: 0.02 | ||
| min_samples: 137 |
There was a problem hiding this comment.
[/grill-with-docs] No end_date or maximum-duration guardrail is defined for this experiment.
Without a stop condition, the experiment could run indefinitely if min_samples is never reached (e.g., the /plan workflow is rarely triggered), or if the results are not actioned after completion.
💡 Suggestion
Add an end_date field to bound the experiment lifetime:
min_samples: 137
end_date: "2026-09-01" # auto-conclude after this dateThis also makes it clear when to revisit and decide whether to promote a winner or abort.
@copilot please address this.
| direction: min | ||
| threshold: 0.02 | ||
| min_samples: 137 | ||
| weight: [34, 33, 33] |
There was a problem hiding this comment.
[/grill-with-docs] The weight array [34, 33, 33] maps positionally to variants: [shallow, baseline, deep], but this mapping is implicit and fragile — reordering either array would silently change which variant gets which weight.
A reader or future editor has no in-line signal that 34 corresponds to shallow.
💡 Suggestion
If the schema supports named weights, prefer that form:
weight:
shallow: 34
baseline: 33
deep: 33If only array form is supported, add a comment immediately after the weights:
weight: [34, 33, 33] # shallow, baseline, deep@copilot please address this.
|
|
||
| {{#if experiments.reasoning_depth == 'shallow'}} | ||
| ## Planning Approach | ||
|
|
There was a problem hiding this comment.
[/grill-with-docs] The shallow variant prompt says "Scan the issue for 3-5 concrete deliverables" but the workflow-level safe-outputs hard-cap is also max: 5. The shallow description anchors the user to 3-5, but the baseline and deep variants only say "at most 5" (from ## Your Mission).
This introduces a soft inconsistency: shallow has a tighter self-imposed lower bound (3) while the other variants have no stated lower bound.
💡 Suggestion
If a minimum of 3 sub-issues is intentional for shallow, document why (to prevent trivially small plans). If not intentional, align to "up to 5" to match the other variants:
Scan the issue for concrete deliverables (up to 5) and immediately draft sub-issues.
@copilot please address this.
| 4. **Acceptance-criteria check**: Ensure each task has at least one testable success condition. | ||
|
|
||
| Only after completing this reflection, proceed to create the sub-issues. | ||
| {{#else}} |
There was a problem hiding this comment.
[/tdd] The {{#else}} branch represents the baseline variant, but the condition is unnamed — if a new variant is added in the future and the #elseif chain is not updated, that new variant silently falls through to baseline behavior.
💡 Suggestion
Make the fallthrough explicit by checking for baseline:
This makes the experiment's variant contract machine-readable and prevents silent behavioral drift when variants are added or renamed.
@copilot please address this.
|
🎉 This pull request is included in a new release. Release: |
Instruments the
planworkflow with a three-variantreasoning_depthexperiment to measure whether a lighter one-pass prompt (shallow) or an explicit reflection step (deep) improves sub-issue quality and/or reduces cost versus the current behavior (baseline).Frontmatter
experiments: reasoning_depthblock with variants[shallow, baseline, deep], weights[34, 33, 33], 137 minimum samples per variant, and guardrails onworkflow_success_rate(≥0.95) andempty_plan_rate(≤0.02).Prompt body
Adds a
## Planning Approachconditional block immediately before## Begin Planning:Lock file
Regenerated
plan.lock.ymlviagh aw compile plan— zero errors.