Skip to content

plan: add reasoning_depth A/B experiment#42955

Merged
pelikhan merged 2 commits into
mainfrom
copilot/ab-advisor-experiment-campaign
Jul 2, 2026
Merged

plan: add reasoning_depth A/B experiment#42955
pelikhan merged 2 commits into
mainfrom
copilot/ab-advisor-experiment-campaign

Conversation

Copilot AI commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Instruments the plan workflow with a three-variant reasoning_depth experiment to measure whether a lighter one-pass prompt (shallow) or an explicit reflection step (deep) improves sub-issue quality and/or reduces cost versus the current behavior (baseline).

Frontmatter

  • Adds experiments: reasoning_depth block with variants [shallow, baseline, deep], weights [34, 33, 33], 137 minimum samples per variant, and guardrails on workflow_success_rate (≥0.95) and empty_plan_rate (≤0.02).

Prompt body

Adds a ## Planning Approach conditional block immediately before ## Begin Planning:

{{#if experiments.reasoning_depth == 'shallow'}}
## Planning Approach
Use a concise, single-pass approach. Scan the issue for 3-5 concrete deliverables
and immediately draft sub-issues. Skip elaborate reflection…
{{#elseif experiments.reasoning_depth == 'deep'}}
## Planning Approach
Before drafting any sub-issue, complete an explicit reflection step:
1. Enumerate all distinct work items…
2. Sequence check…
3. Independence check…
4. Acceptance-criteria check…
{{#else}}
## Planning Approach
Use the standard planning approach…
{{/if}}

Lock file

Regenerated plan.lock.yml via gh aw compile plan — zero errors.

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title [WIP] Add experiment campaign for A/B test on reasoning depth plan: add reasoning_depth A/B experiment Jul 2, 2026
Copilot AI requested a review from pelikhan July 2, 2026 12:53
@pelikhan pelikhan marked this pull request as ready for review July 2, 2026 12:56
Copilot AI review requested due to automatic review settings July 2, 2026 12:56
@pelikhan pelikhan merged commit 281d68a into main Jul 2, 2026
47 of 60 checks passed
@pelikhan pelikhan deleted the copilot/ab-advisor-experiment-campaign branch July 2, 2026 12:56
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

PR Code Quality Reviewer completed the code quality review.

@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR does not have the 'implementation' label and has 0 new lines of code in business logic directories (threshold: 100).

@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Test Quality Sentinel failed during test quality analysis.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an A/B/n experiment (reasoning_depth) to the plan workflow to compare three planning instruction variants (shallow/baseline/deep), and updates the prompt to conditionally inject variant-specific “Planning Approach” guidance before the planning begins.

Changes:

  • Adds experiments.reasoning_depth frontmatter with variants, weights, minimum samples, and guardrail metrics.
  • Adds a conditional ## Planning Approach block that changes planning instructions based on the selected experiment variant.
  • Regenerates plan.lock.yml to wire experiment selection/state persistence into the compiled GitHub Actions workflow.
Show a summary per file
File Description
.github/workflows/plan.md Defines the reasoning_depth experiment and conditionally injects variant-specific planning instructions into the prompt.
.github/workflows/plan.lock.yml Compiled workflow update that introduces experiment pick/state steps and updates runtime/action/container pins.

Review details

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 2/2 changed files
  • Comments generated: 3
  • Review effort level: Low

Comment thread .github/workflows/plan.md
Comment on lines +51 to +53
- name: workflow_success_rate
direction: min
threshold: 0.95
Comment on lines 123 to +125
- name: Setup Scripts
id: setup
uses: ./actions/setup
uses: github/gh-aw-actions/setup@v0.81.6
Comment on lines 135 to +148
@@ -147,15 +143,16 @@ jobs:
GH_AW_INFO_ENGINE_ID: "copilot"
GH_AW_INFO_ENGINE_NAME: "GitHub Copilot CLI"
GH_AW_INFO_MODEL: ${{ vars.GH_AW_MODEL_AGENT_COPILOT || vars.GH_AW_DEFAULT_MODEL_COPILOT || 'claude-sonnet-4.6' }}
GH_AW_INFO_VERSION: "1.0.67"
GH_AW_INFO_AGENT_VERSION: "1.0.67"
GH_AW_INFO_VERSION: "1.0.65"
GH_AW_INFO_AGENT_VERSION: "1.0.65"
GH_AW_INFO_CLI_VERSION: "v0.81.6"
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

🤖 PR Triage

Field Value
Category feature
Risk 🟡 Medium
Score 30 / 100
Action ⏸️ defer

Score breakdown: Impact 20 · Urgency 5 · Quality 5

Draft/WIP. Adds A/B experiment campaign for reasoning depth to plan.md + recompiled lock file. CI unstable. Low priority until experiment design is finalized and PR is ready for review.

Run §28591321991

Generated by 🔧 PR Triage Agent · 76 AIC · ⌖ 9.74 AIC · ⊞ 1.6K ·

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REQUEST_CHANGES: three high-severity issues block this experiment from working correctly.

Blocking issues (3 HIGH)
  1. Unresolvable primary metric — has no collection mechanism; the experiment cannot be evaluated against its own hypothesis.
  2. Supply chain risk — is the only action in the manifest pinned by mutable tag rather than commit SHA; all other actions use immutable SHAs.
  3. Container image regression — firewall images downgraded from 0.27.20 to 0.27.11 and from v1.5.0 to v1.4.0 with no documented reason; this could re-introduce fixed bugs on every invocation.
Non-blocking concerns (3 MEDIUM)
  1. **No ** — experiment runs indefinitely, requires manual cleanup PR.
  2. ** v5.0.5 vs v6.1.0** — one step selectively uses an older version with no explanation.
  3. Artifact retention 30 days, no end_date — run data older than 30 days will be deleted before the sample set is complete.

🔎 Code quality review by PR Code Quality Reviewer · 93.9 AIC · ⌖ 10.2 AIC · ⊞ 1.6K
Comment /review to run again

# - actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0
# - actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e # v6.4.0
# - actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
# - github/gh-aw-actions/setup@v0.81.6

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

** is pinned by mutable tag, not a commit SHA** — every other action in this manifest has an immutable SHA; this one has , silently re-pointable.

💡 Details

The manifest header shows:

Every other action entry (e.g. , ) carries a real 40-hex commit SHA. A tag can be force-pushed without any PR or audit trail. If is force-pushed to a malicious commit, all jobs using this action will silently execute untrusted code with permissions.

Resolve by pinning to the commit SHA the tag currently points to:

Comment thread .github/workflows/plan.md
variants: [shallow, baseline, deep]
description: "Tests whether lighter or more reflective planning instructions produce better agent-ready sub-issues for /plan requests."
hypothesis: "H0: no change in plan quality score or success rate. H1: deep improves quality score by >=8%, or shallow reduces run cost/latency by >=15% with no material quality loss."
metric: plan_quality_score

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

** is declared as the primary metric but has no defined collection path in this PR** — the experiment cannot be analysed.

💡 Details

The hypothesis states:

H1: deep improves quality score by >=8%

But neither nor the new pipeline emits a value. Without instrumentation, the guardrail check against this metric is a no-op and the experiment will only ever produce secondary metrics (, , ).

Before merging, either:

  • Add a scoring step that emits to the experiment artifact, or
  • Replace the primary metric with one already collected (e.g. ) and document the proxy rationale in the field.

Comment thread .github/workflows/plan.md
threshold: 0.02
min_samples: 137
weight: [34, 33, 33]
start_date: "2026-07-02"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No — this experiment will run indefinitely past its intended window.

💡 Details

The workflow classifies an experiment as while ; without it never terminates automatically. only gates analysis readiness, not experiment shutdown.

A permanently active experiment continuously splits production traffic among all three variants, making the prompt permanently larger and requiring a manual follow-up PR to remove the conditionals.

Add an explicit :

if: ${{ env.GH_AW_MAX_DAILY_AI_CREDITS != '' }}
continue-on-error: true
uses: actions/cache/restore@55cc8345863c7cc4c66a329aec7e433d2d1c52a9 # v6.1.0
uses: actions/cache/restore@27d5ce7f107fe9357f9df03efb73ab90386fccae # v5.0.5

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

** selectively downgraded from v6.1.0 to v5.0.5 for this one step only** — unexplained inconsistency across the same lock file.

💡 Details

All other cache/restore usages in this file use (v6.1.0), but the daily-AIC-cache restore step was changed to (v5.0.5). The manifest header now carries two separate SHAs for the same action.

This inconsistency has no explanation in the PR description and is suspicious: if v5 was needed here for compatibility, that should be documented. If it is an error, it silently pins one cache step to an older version that may lack bugfixes.

Either upgrade to v6.1.0 consistently, or add a comment explaining the required downgrade.

name: plan-experiment
path: /tmp/gh-aw/experiments
if-no-files-found: ignore
retention-days: 30

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Experiment artifact retention (30 days) is shorter than the experiment could run — early-run data will be deleted before analysis completes.

💡 Details

With required and no , this experiment could easily run for more than 30 days if the command is not invoked frequently. GitHub Actions artifacts are automatically deleted after the window; any run artifacts older than 30 days will be gone by the time you are ready to analyse the full sample set.

Recommendations:

  1. Set an (see the separate comment on ) so you can size the retention window to match.
  2. Or increase to at least 60-90 days as a buffer.
  3. Or ensure the job writes cumulative state to the branch on every run, so the branch (not the artifact) is the source of truth for analysis.

# - ghcr.io/github/gh-aw-firewall/cli-proxy:0.27.20
# - ghcr.io/github/gh-aw-firewall/squid:0.27.20
# - ghcr.io/github/gh-aw-mcpg:v0.3.32@sha256:63e46b56dfd70895a701b6fc6dd0189e11e2d875f327f1781e81b31848735477
# - ghcr.io/github/gh-aw-firewall/agent:0.27.11@sha256:979723c628182da7729333f2208bb249fd25ddee579645cf9a3892d681a929c7

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All firewall container images and MCP server were downgraded without explanation — 0.27.20 → 0.27.11, v1.5.0 → v1.4.0.

💡 Details

The previous lock file used firewall image and . This PR downgrades both to earlier versions. Lock files are compiled outputs, and the new lock was compiled with compiler v0.81.6 (added field), which implies these older image tags are what v0.81.6 emits.

If this is intentional (the new compiler pins to an older runtime for compatibility), that should be documented. If it is unintentional, it may re-introduce bugs or CVEs fixed in the higher versions, and the workflow will run against an older MCP server API surface.

Verify that the downgraded image versions are intentional and compatible with the new experiment features added in this PR.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skills-Based Review 🧠

Applied /grill-with-docs and /tdd — commenting with 5 findings, no blocking issues but a few items worth addressing before the experiment concludes.

📋 Key Themes & Highlights

Key Themes

  • Undefined primary metric (line 48): plan_quality_score has no collection mechanism described in this PR — the experiment's H1 threshold can't be evaluated without it.
  • No experiment lifetime bound (line 57): Missing end_date means the experiment runs indefinitely if /plan volume is low or results are not acted on.
  • Fragile positional weight mapping (line 58): weight: [34, 33, 33] maps silently to [shallow, baseline, deep] — a future reorder would silently mis-assign weights.
  • Implicit baseline branch (line 174): The {{#else}} silently catches any unknown variant as baseline; naming it explicitly makes the contract clearer.
  • Minor: shallow lower bound (line 162): "3-5 deliverables" anchors a lower bound that the other variants don't have; intentional or not, it should be documented.

Positive Highlights

  • ✅ Well-structured experiment spec with clear hypothesis, guardrail metrics, and secondary metrics
  • ✅ Variant text is precise and actionable — the deep reflection steps are particularly well-specified
  • ✅ Experiment artifact upload and state-restore steps are properly guarded with if: always()
  • ✅ Guardrail thresholds (workflow_success_rate >= 0.95, empty_plan_rate <= 0.02) are sensible safety nets

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 77.9 AIC · ⌖ 9 AIC · ⊞ 6.6K
Comment /matt to run again

Comment thread .github/workflows/plan.md
variants: [shallow, baseline, deep]
description: "Tests whether lighter or more reflective planning instructions produce better agent-ready sub-issues for /plan requests."
hypothesis: "H0: no change in plan quality score or success rate. H1: deep improves quality score by >=8%, or shallow reduces run cost/latency by >=15% with no material quality loss."
metric: plan_quality_score

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/grill-with-docs] plan_quality_score is listed as the primary metric, but there is no definition of how it is measured or collected anywhere in this PR.

This metric drives the experiment conclusion (H1 threshold is >=8%), yet without a collection mechanism the experiment cannot be evaluated against it.

💡 Suggestion

Either:

  1. Add a note in the frontmatter or a linked doc explaining where plan_quality_score is produced and how it is persisted (e.g., a separate evaluation workflow, a manual labeling step), or
  2. Replace it with a metric that is already automatically collected such as issue_creation_count or output_token_estimate until a quality-scoring pipeline exists.

@copilot please address this.

Comment thread .github/workflows/plan.md
- name: empty_plan_rate
direction: min
threshold: 0.02
min_samples: 137

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/grill-with-docs] No end_date or maximum-duration guardrail is defined for this experiment.

Without a stop condition, the experiment could run indefinitely if min_samples is never reached (e.g., the /plan workflow is rarely triggered), or if the results are not actioned after completion.

💡 Suggestion

Add an end_date field to bound the experiment lifetime:

    min_samples: 137
    end_date: "2026-09-01"  # auto-conclude after this date

This also makes it clear when to revisit and decide whether to promote a winner or abort.

@copilot please address this.

Comment thread .github/workflows/plan.md
direction: min
threshold: 0.02
min_samples: 137
weight: [34, 33, 33]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/grill-with-docs] The weight array [34, 33, 33] maps positionally to variants: [shallow, baseline, deep], but this mapping is implicit and fragile — reordering either array would silently change which variant gets which weight.

A reader or future editor has no in-line signal that 34 corresponds to shallow.

💡 Suggestion

If the schema supports named weights, prefer that form:

    weight:
      shallow: 34
      baseline: 33
      deep: 33

If only array form is supported, add a comment immediately after the weights:

    weight: [34, 33, 33]  # shallow, baseline, deep

@copilot please address this.

Comment thread .github/workflows/plan.md

{{#if experiments.reasoning_depth == 'shallow'}}
## Planning Approach

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/grill-with-docs] The shallow variant prompt says "Scan the issue for 3-5 concrete deliverables" but the workflow-level safe-outputs hard-cap is also max: 5. The shallow description anchors the user to 3-5, but the baseline and deep variants only say "at most 5" (from ## Your Mission).

This introduces a soft inconsistency: shallow has a tighter self-imposed lower bound (3) while the other variants have no stated lower bound.

💡 Suggestion

If a minimum of 3 sub-issues is intentional for shallow, document why (to prevent trivially small plans). If not intentional, align to "up to 5" to match the other variants:

Scan the issue for concrete deliverables (up to 5) and immediately draft sub-issues.

@copilot please address this.

Comment thread .github/workflows/plan.md
4. **Acceptance-criteria check**: Ensure each task has at least one testable success condition.

Only after completing this reflection, proceed to create the sub-issues.
{{#else}}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The {{#else}} branch represents the baseline variant, but the condition is unnamed — if a new variant is added in the future and the #elseif chain is not updated, that new variant silently falls through to baseline behavior.

💡 Suggestion

Make the fallthrough explicit by checking for baseline:

{{#elseif experiments.reasoning_depth == 'baseline'}}
## Planning Approach

Use the standard planning approach...
{{#else}}
{{!-- Unknown variant: treat as baseline, but log a warning if possible --}}
## Planning Approach

Use the standard planning approach...
{{/if}}

This makes the experiment's variant contract machine-readable and prevents silent behavioral drift when variants are added or renamed.

@copilot please address this.

@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

🎉 This pull request is included in a new release.

Release: v0.82.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ab-advisor] Experiment campaign for plan: A/B test reasoning_depth

3 participants