plan: add reasoning_depth A/B experiment by Copilot · Pull Request #42955 · github/gh-aw

Copilot · 2026-07-02T12:32:58Z

Instruments the plan workflow with a three-variant reasoning_depth experiment to measure whether a lighter one-pass prompt (shallow) or an explicit reflection step (deep) improves sub-issue quality and/or reduces cost versus the current behavior (baseline).

Frontmatter

Adds experiments: reasoning_depth block with variants [shallow, baseline, deep], weights [34, 33, 33], 137 minimum samples per variant, and guardrails on workflow_success_rate (≥0.95) and empty_plan_rate (≤0.02).

Prompt body

Adds a ## Planning Approach conditional block immediately before ## Begin Planning:

{{#if experiments.reasoning_depth == 'shallow'}}
## Planning Approach
Use a concise, single-pass approach. Scan the issue for 3-5 concrete deliverables
and immediately draft sub-issues. Skip elaborate reflection…
{{#elseif experiments.reasoning_depth == 'deep'}}
## Planning Approach
Before drafting any sub-issue, complete an explicit reflection step:
1. Enumerate all distinct work items…
2. Sequence check…
3. Independence check…
4. Acceptance-criteria check…
{{#else}}
## Planning Approach
Use the standard planning approach…
{{/if}}

Lock file

Regenerated plan.lock.yml via gh aw compile plan — zero errors.

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

github-actions · 2026-07-02T12:57:10Z

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

github-actions · 2026-07-02T12:57:22Z

✅ PR Code Quality Reviewer completed the code quality review.

github-actions · 2026-07-02T12:57:32Z

✅ Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR does not have the 'implementation' label and has 0 new lines of code in business logic directories (threshold: 100).

github-actions · 2026-07-02T12:57:38Z

❌ Test Quality Sentinel failed during test quality analysis.

Copilot

Pull request overview

This PR adds an A/B/n experiment (reasoning_depth) to the plan workflow to compare three planning instruction variants (shallow/baseline/deep), and updates the prompt to conditionally inject variant-specific “Planning Approach” guidance before the planning begins.

Changes:

Adds experiments.reasoning_depth frontmatter with variants, weights, minimum samples, and guardrail metrics.
Adds a conditional ## Planning Approach block that changes planning instructions based on the selected experiment variant.
Regenerates plan.lock.yml to wire experiment selection/state persistence into the compiled GitHub Actions workflow.

Show a summary per file

File	Description
.github/workflows/plan.md	Defines the `reasoning_depth` experiment and conditionally injects variant-specific planning instructions into the prompt.
.github/workflows/plan.lock.yml	Compiled workflow update that introduces experiment pick/state steps and updates runtime/action/container pins.

Review details

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Files reviewed: 2/2 changed files
Comments generated: 3
Review effort level: Low

+      - name: workflow_success_rate
+        direction: min
+        threshold: 0.95


      - name: Setup Scripts
        id: setup
-        uses: ./actions/setup
+        uses: github/gh-aw-actions/setup@v0.81.6


@@ -147,15 +143,16 @@ jobs:
          GH_AW_INFO_ENGINE_ID: "copilot"
          GH_AW_INFO_ENGINE_NAME: "GitHub Copilot CLI"
          GH_AW_INFO_MODEL: ${{ vars.GH_AW_MODEL_AGENT_COPILOT || vars.GH_AW_DEFAULT_MODEL_COPILOT || 'claude-sonnet-4.6' }}
-          GH_AW_INFO_VERSION: "1.0.67"
-          GH_AW_INFO_AGENT_VERSION: "1.0.67"
+          GH_AW_INFO_VERSION: "1.0.65"
+          GH_AW_INFO_AGENT_VERSION: "1.0.65"
+          GH_AW_INFO_CLI_VERSION: "v0.81.6"


github-actions · 2026-07-02T13:02:20Z

🤖 PR Triage

Field	Value
Category	feature
Risk	🟡 Medium
Score	30 / 100
Action	⏸️ defer

Score breakdown: Impact 20 · Urgency 5 · Quality 5

Draft/WIP. Adds A/B experiment campaign for reasoning depth to plan.md + recompiled lock file. CI unstable. Low priority until experiment design is finalized and PR is ready for review.

Run §28591321991

Generated by 🔧 PR Triage Agent · 76 AIC · ⌖ 9.74 AIC · ⊞ 1.6K · ◷

github-actions

REQUEST_CHANGES: three high-severity issues block this experiment from working correctly.

Blocking issues (3 HIGH)

Unresolvable primary metric — has no collection mechanism; the experiment cannot be evaluated against its own hypothesis.
Supply chain risk — is the only action in the manifest pinned by mutable tag rather than commit SHA; all other actions use immutable SHAs.
Container image regression — firewall images downgraded from 0.27.20 to 0.27.11 and from v1.5.0 to v1.4.0 with no documented reason; this could re-introduce fixed bugs on every invocation.

Non-blocking concerns (3 MEDIUM)

**No ** — experiment runs indefinitely, requires manual cleanup PR.
** v5.0.5 vs v6.1.0** — one step selectively uses an older version with no explanation.
Artifact retention 30 days, no end_date — run data older than 30 days will be deleted before the sample set is complete.

🔎 Code quality review by PR Code Quality Reviewer · 93.9 AIC · ⌖ 10.2 AIC · ⊞ 1.6K
_{Comment /review to run again}

github-actions · 2026-07-02T13:08:04Z

 #   - actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0
 #   - actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e # v6.4.0
 #   - actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
+#   - github/gh-aw-actions/setup@v0.81.6


** is pinned by mutable tag, not a commit SHA** — every other action in this manifest has an immutable SHA; this one has , silently re-pointable.

💡 Details

The manifest header shows:

Every other action entry (e.g. , ) carries a real 40-hex commit SHA. A tag can be force-pushed without any PR or audit trail. If is force-pushed to a malicious commit, all jobs using this action will silently execute untrusted code with permissions.

Resolve by pinning to the commit SHA the tag currently points to:

github-actions · 2026-07-02T13:08:04Z

+    variants: [shallow, baseline, deep]
+    description: "Tests whether lighter or more reflective planning instructions produce better agent-ready sub-issues for /plan requests."
+    hypothesis: "H0: no change in plan quality score or success rate. H1: deep improves quality score by >=8%, or shallow reduces run cost/latency by >=15% with no material quality loss."
+    metric: plan_quality_score


** is declared as the primary metric but has no defined collection path in this PR** — the experiment cannot be analysed.

💡 Details

The hypothesis states:

H1: deep improves quality score by >=8%

But neither nor the new pipeline emits a value. Without instrumentation, the guardrail check against this metric is a no-op and the experiment will only ever produce secondary metrics (, , ).

Before merging, either:

Add a scoring step that emits to the experiment artifact, or

Replace the primary metric with one already collected (e.g. ) and document the proxy rationale in the field.

github-actions · 2026-07-02T13:08:04Z

+        threshold: 0.02
+    min_samples: 137
+    weight: [34, 33, 33]
+    start_date: "2026-07-02"


No — this experiment will run indefinitely past its intended window.

💡 Details

The workflow classifies an experiment as while ; without it never terminates automatically. only gates analysis readiness, not experiment shutdown.

A permanently active experiment continuously splits production traffic among all three variants, making the prompt permanently larger and requiring a manual follow-up PR to remove the conditionals.

Add an explicit :

github-actions · 2026-07-02T13:08:04Z

        if: ${{ env.GH_AW_MAX_DAILY_AI_CREDITS != '' }}
        continue-on-error: true
-        uses: actions/cache/restore@55cc8345863c7cc4c66a329aec7e433d2d1c52a9 # v6.1.0
+        uses: actions/cache/restore@27d5ce7f107fe9357f9df03efb73ab90386fccae # v5.0.5


** selectively downgraded from v6.1.0 to v5.0.5 for this one step only** — unexplained inconsistency across the same lock file.

💡 Details

All other cache/restore usages in this file use (v6.1.0), but the daily-AIC-cache restore step was changed to (v5.0.5). The manifest header now carries two separate SHAs for the same action.

This inconsistency has no explanation in the PR description and is suspicious: if v5 was needed here for compatibility, that should be documented. If it is an error, it silently pins one cache step to an older version that may lack bugfixes.

Either upgrade to v6.1.0 consistently, or add a comment explaining the required downgrade.

github-actions · 2026-07-02T13:08:04Z

+          name: plan-experiment
+          path: /tmp/gh-aw/experiments
+          if-no-files-found: ignore
+          retention-days: 30


Experiment artifact retention (30 days) is shorter than the experiment could run — early-run data will be deleted before analysis completes.

💡 Details

With required and no , this experiment could easily run for more than 30 days if the command is not invoked frequently. GitHub Actions artifacts are automatically deleted after the window; any run artifacts older than 30 days will be gone by the time you are ready to analyse the full sample set.

Recommendations:

Set an (see the separate comment on ) so you can size the retention window to match.

Or increase to at least 60-90 days as a buffer.

Or ensure the job writes cumulative state to the branch on every run, so the branch (not the artifact) is the source of truth for analysis.

github-actions · 2026-07-02T13:08:04Z

-#   - ghcr.io/github/gh-aw-firewall/cli-proxy:0.27.20
-#   - ghcr.io/github/gh-aw-firewall/squid:0.27.20
-#   - ghcr.io/github/gh-aw-mcpg:v0.3.32@sha256:63e46b56dfd70895a701b6fc6dd0189e11e2d875f327f1781e81b31848735477
+#   - ghcr.io/github/gh-aw-firewall/agent:0.27.11@sha256:979723c628182da7729333f2208bb249fd25ddee579645cf9a3892d681a929c7


All firewall container images and MCP server were downgraded without explanation — 0.27.20 → 0.27.11, v1.5.0 → v1.4.0.

💡 Details

The previous lock file used firewall image and . This PR downgrades both to earlier versions. Lock files are compiled outputs, and the new lock was compiled with compiler v0.81.6 (added field), which implies these older image tags are what v0.81.6 emits.

If this is intentional (the new compiler pins to an older runtime for compatibility), that should be documented. If it is unintentional, it may re-introduce bugs or CVEs fixed in the higher versions, and the workflow will run against an older MCP server API surface.

Verify that the downgraded image versions are intentional and compatible with the new experiment features added in this PR.

github-actions

Skills-Based Review 🧠

Applied /grill-with-docs and /tdd — commenting with 5 findings, no blocking issues but a few items worth addressing before the experiment concludes.

📋 Key Themes & Highlights

Key Themes

Undefined primary metric (line 48): plan_quality_score has no collection mechanism described in this PR — the experiment's H1 threshold can't be evaluated without it.
No experiment lifetime bound (line 57): Missing end_date means the experiment runs indefinitely if /plan volume is low or results are not acted on.
Fragile positional weight mapping (line 58): weight: [34, 33, 33] maps silently to [shallow, baseline, deep] — a future reorder would silently mis-assign weights.
Implicit baseline branch (line 174): The {{#else}} silently catches any unknown variant as baseline; naming it explicitly makes the contract clearer.
Minor: shallow lower bound (line 162): "3-5 deliverables" anchors a lower bound that the other variants don't have; intentional or not, it should be documented.

Positive Highlights

✅ Well-structured experiment spec with clear hypothesis, guardrail metrics, and secondary metrics
✅ Variant text is precise and actionable — the deep reflection steps are particularly well-specified
✅ Experiment artifact upload and state-restore steps are properly guarded with if: always()
✅ Guardrail thresholds (workflow_success_rate >= 0.95, empty_plan_rate <= 0.02) are sensible safety nets

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 77.9 AIC · ⌖ 9 AIC · ⊞ 6.6K
_{Comment /matt to run again}

github-actions · 2026-07-02T13:09:09Z

+    variants: [shallow, baseline, deep]
+    description: "Tests whether lighter or more reflective planning instructions produce better agent-ready sub-issues for /plan requests."
+    hypothesis: "H0: no change in plan quality score or success rate. H1: deep improves quality score by >=8%, or shallow reduces run cost/latency by >=15% with no material quality loss."
+    metric: plan_quality_score


[/grill-with-docs] plan_quality_score is listed as the primary metric, but there is no definition of how it is measured or collected anywhere in this PR.

This metric drives the experiment conclusion (H1 threshold is >=8%), yet without a collection mechanism the experiment cannot be evaluated against it.

💡 Suggestion

Either:

Add a note in the frontmatter or a linked doc explaining where plan_quality_score is produced and how it is persisted (e.g., a separate evaluation workflow, a manual labeling step), or

Replace it with a metric that is already automatically collected such as issue_creation_count or output_token_estimate until a quality-scoring pipeline exists.

@copilot please address this.

github-actions · 2026-07-02T13:09:09Z

+      - name: empty_plan_rate
+        direction: min
+        threshold: 0.02
+    min_samples: 137


[/grill-with-docs] No end_date or maximum-duration guardrail is defined for this experiment.

Without a stop condition, the experiment could run indefinitely if min_samples is never reached (e.g., the /plan workflow is rarely triggered), or if the results are not actioned after completion.

💡 Suggestion

Add an end_date field to bound the experiment lifetime:

min_samples: 137 end_date: "2026-09-01" # auto-conclude after this date

This also makes it clear when to revisit and decide whether to promote a winner or abort.

@copilot please address this.

github-actions · 2026-07-02T13:09:09Z

+        direction: min
+        threshold: 0.02
+    min_samples: 137
+    weight: [34, 33, 33]


[/grill-with-docs] The weight array [34, 33, 33] maps positionally to variants: [shallow, baseline, deep], but this mapping is implicit and fragile — reordering either array would silently change which variant gets which weight.

A reader or future editor has no in-line signal that 34 corresponds to shallow.

💡 Suggestion

If the schema supports named weights, prefer that form:

weight: shallow: 34 baseline: 33 deep: 33

If only array form is supported, add a comment immediately after the weights:

weight: [34, 33, 33] # shallow, baseline, deep

@copilot please address this.

github-actions · 2026-07-02T13:09:09Z


+{{#if experiments.reasoning_depth == 'shallow'}}
+## Planning Approach
+


[/grill-with-docs] The shallow variant prompt says "Scan the issue for 3-5 concrete deliverables" but the workflow-level safe-outputs hard-cap is also max: 5. The shallow description anchors the user to 3-5, but the baseline and deep variants only say "at most 5" (from ## Your Mission).

This introduces a soft inconsistency: shallow has a tighter self-imposed lower bound (3) while the other variants have no stated lower bound.

💡 Suggestion

If a minimum of 3 sub-issues is intentional for shallow, document why (to prevent trivially small plans). If not intentional, align to "up to 5" to match the other variants:

Scan the issue for concrete deliverables (up to 5) and immediately draft sub-issues.

@copilot please address this.

github-actions · 2026-07-02T13:09:09Z

+4. **Acceptance-criteria check**: Ensure each task has at least one testable success condition.
+
+Only after completing this reflection, proceed to create the sub-issues.
+{{#else}}


[/tdd] The {{#else}} branch represents the baseline variant, but the condition is unnamed — if a new variant is added in the future and the #elseif chain is not updated, that new variant silently falls through to baseline behavior.

💡 Suggestion

Make the fallthrough explicit by checking for baseline:

{{#elseif experiments.reasoning_depth == 'baseline'}} ## Planning Approach Use the standard planning approach... {{#else}} {{!-- Unknown variant: treat as baseline, but log a warning if possible --}} ## Planning Approach Use the standard planning approach... {{/if}}

This makes the experiment's variant contract machine-readable and prevents silent behavioral drift when variants are added or renamed.

@copilot please address this.

github-actions · 2026-07-02T17:18:41Z

🎉 This pull request is included in a new release.

Release: v0.82.2

Initial plan

5168512

Copilot AI assigned Copilot and pelikhan Jul 2, 2026

Copilot started work on behalf of pelikhan July 2, 2026 12:43 View session

Copilot AI linked an issue Jul 2, 2026 that may be closed by this pull request

[ab-advisor] Experiment campaign for plan: A/B test reasoning_depth #42941

Closed

7 tasks

Add reasoning_depth A/B experiment to plan workflow

25e804a

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add experiment campaign for A/B test on reasoning depth~~ plan: add reasoning_depth A/B experiment Jul 2, 2026

Copilot finished work on behalf of pelikhan July 2, 2026 12:53

Copilot AI requested a review from pelikhan July 2, 2026 12:53

pelikhan marked this pull request as ready for review July 2, 2026 12:56

Copilot AI review requested due to automatic review settings July 2, 2026 12:56

Copilot started reviewing on behalf of pelikhan July 2, 2026 12:56 View session

pelikhan merged commit 281d68a into main Jul 2, 2026
47 of 60 checks passed

pelikhan deleted the copilot/ab-advisor-experiment-campaign branch July 2, 2026 12:56

github-actions Bot added pr-action:defer pr-agent:copilot-swe-agent pr-priority:low pr-risk:medium pr-type:feature labels Jul 2, 2026

Copilot AI reviewed Jul 2, 2026

View reviewed changes

github-actions Bot mentioned this pull request Jul 2, 2026

[PR Triage Report] PR Triage Report — 2026-07-02 (Run §28591321991) #42966

Open

github-actions Bot requested changes Jul 2, 2026

View reviewed changes

github-actions Bot reviewed Jul 2, 2026

View reviewed changes


		{{#if experiments.reasoning_depth == 'shallow'}}
		## Planning Approach

Uh oh!

Conversation

Copilot AI commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Frontmatter

Prompt body

Lock file

Uh oh!

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Review details

Uh oh!

github-actions Bot commented Jul 2, 2026

🤖 PR Triage

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Skills-Based Review 🧠

Key Themes

Positive Highlights

Uh oh!

github-actions Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

Copilot AI commented Jul 2, 2026 •

edited

Loading

github-actions Bot commented Jul 2, 2026 •

edited

Loading

github-actions Bot commented Jul 2, 2026 •

edited

Loading

github-actions Bot commented Jul 2, 2026 •

edited

Loading

github-actions Bot commented Jul 2, 2026 •

edited

Loading