
feat: rewrite daily-experiment-report workflow with statistical analysis, bar charts, and discussion output #29635

Merged
pelikhan merged 5 commits into main from copilot/create-daily-experiment-report-workflow
May 1, 2026

Conversation

Contributor

Copilot AI commented May 1, 2026

Automated experiment reporting was missing — results from A/B experiment runs were invisible without manual artifact downloads and analysis.

Changes

Workflow rewrite (.github/workflows/daily-experiment-report.md)

Replaces the previous status-only report with a statistically rigorous pipeline that posts a rich Discussion with embedded charts:

  • Frontmatter: fuzzy schedule: daily around 8:00, engine: copilot, toolsets: [default, actions], imports shared/daily-audit-charts.md (Python env + create-discussion + OTLP observability); upload-asset for chart embedding
  • Artifact aggregation: downloads state.json from each run's experiment artifact; infers per-run variant assignment by diffing consecutive cumulative counts, with explicit handling for missing artifacts, zero-delta (unknown), and multi-delta (ambiguous) cases (see the inference sketch after this list)
  • Statistics per variant: mean duration, sample variance (Bessel-corrected), 95% CI via t-distribution, success rate — guards for n < 2 (N/A) and zero variance (skip Welch test)
  • Significance detection (p < 0.05): two-proportion z-test for success rate, Welch's t-test for duration; recommends scipy.stats for precision with interpolation-table fallback (see the statistics sketch after this list)
  • Bar charts per experiment generated with Matplotlib/Seaborn:
    • Chart A — Success rate by variant with percentage labels, sample-size annotations, and control baseline dashed line
    • Chart B — Mean duration by variant with 95% CI error bars and control baseline
  • ASCII comparison table per experiment (variant, n, success %, mean duration, 95% CI, p-value)
  • Output: single GitHub Discussion per run (title-prefix [experiments], 3-day expiry, older discussions auto-closed) containing all embedded charts, ASCII tables, and recommendations
  • Decision rules: PROMOTE (p < 0.05, improvement), EXTEND (p ≥ 0.05, n < 20), ABANDON (p ≥ 0.05, n ≥ 20 or degradation); includes low-power warning for n < 20
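
To make the variant-inference step concrete, here is a minimal sketch of the diffing logic, assuming state.json exposes cumulative per-variant counters as a flat mapping (the actual artifact schema may differ):

```python
def infer_variant(prev_counts: dict[str, int], curr_counts: dict[str, int]) -> str | None:
    """Infer a run's variant by diffing consecutive cumulative counters.

    Returns the variant whose counter advanced, or None for the exclusion
    cases described above: zero deltas (unknown) and multiple deltas
    (ambiguous). Runs with missing artifacts are skipped before this point.
    """
    increased = [
        variant
        for variant, count in curr_counts.items()
        if count > prev_counts.get(variant, 0)
    ]
    if len(increased) == 1:
        return increased[0]  # exactly one counter advanced: unambiguous assignment
    return None  # 0 deltas (unknown) or >1 deltas (ambiguous): exclude the run
```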

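The statistics and significance tests can be sketched the same way, using scipy.stats as the workflow recommends (function names and data shapes here are illustrative assumptions, not the workflow's actual code):

```python
import math
from scipy import stats

def variant_stats(durations: list[float], confidence: float = 0.95):
    """Mean, Bessel-corrected sample variance, and t-based CI half-width.
    Variance and CI are None when n < 2, mirroring the N/A guard above."""
    n = len(durations)
    mean = sum(durations) / n
    if n < 2:
        return mean, None, None
    var = sum((x - mean) ** 2 for x in durations) / (n - 1)  # Bessel correction
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean, var, t_crit * math.sqrt(var / n)

def welch_p_value(a: list[float], b: list[float]) -> float | None:
    """Welch's t-test p-value for durations; None for the skip cases above."""
    if len(a) < 2 or len(b) < 2:
        return None  # n < 2: excluded from the test, reported as N/A
    if len(set(a)) == 1 or len(set(b)) == 1:
        return None  # zero variance: test skipped, reported as N/A with a note
    _, p = stats.ttest_ind(a, b, equal_var=False)
    return p

def two_proportion_p_value(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float | None:
    """Two-sided two-proportion z-test for success rates, pooled estimate."""
    p_pool = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return None  # all successes or all failures across both variants
    z = (succ_a / n_a - succ_b / n_b) / se
    return 2 * (1 - stats.norm.cdf(abs(z)))

def recommend(p_value: float | None, n: int, improved: bool) -> str:
    """Decision rules as listed above: PROMOTE on a significant improvement,
    EXTEND while underpowered (n < 20), otherwise ABANDON."""
    if p_value is not None and p_value < 0.05:
        return "PROMOTE" if improved else "ABANDON"  # significant degradation
    return "EXTEND" if n < 20 else "ABANDON"
```
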
Compiled lock file

.github/workflows/daily-experiment-report.lock.yml regenerated via ./gh-aw compile --approve. The approval covers the standard OTEL secrets introduced by the shared/daily-audit-base.md import, which are already used by all other audit workflows in this repo.

Copilot AI and others added 3 commits May 1, 2026 21:50
- Rewrites daily-experiment-report.md with proper frontmatter:
  - cron: "0 8 * * *" (08:00 UTC daily) + workflow_dispatch
  - engine: copilot with github toolsets [default, actions]
  - safe-outputs add-comment with hide-older-comments
  - mentions/reference filtering to avoid notification noise
- Prompt instructs the agent to:
  1. Find workflows declaring experiments: in frontmatter
  2. List last 30 runs and download state.json artifacts
  3. Infer per-run variant from consecutive cumulative count diffs
  4. Compute mean, variance, 95% CI, success rate per variant
  5. Detect significance: two-proportion z-test (binary) or
     Welch t-test (continuous), threshold p < 0.05
  6. Render ASCII comparison table per experiment
  7. Post to tracking issue (if issue: set) or step summary
  8. Recommend PROMOTE / EXTEND / ABANDON with rationale
- Recompiles cleanly (make recompile passes with no other diffs)

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/bb347623-6d55-453b-82f4-0e16d9f922bd

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
- Clarify "last 30 completed runs" means any final state (success/failure/cancelled/skipped)
- Add variant inference edge case handling: missing artifacts (skip), zero increases
  (unknown/excluded), multiple increases (ambiguous/excluded)
- Recommend scipy.stats for precise t/z critical values with linear interpolation fallback
- Add n=15 and n=60 entries to the t-critical value table
- Add n < 2 edge case: show N/A for variance/CI, exclude from Welch t-test
- Add zero-variance edge case for Welch t-test: show N/A with note
- Add statistical power note: n < 20 tests have low power, use EXTEND to gather more data

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/bb347623-6d55-453b-82f4-0e16d9f922bd

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
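
The interpolation fallback described in this commit might look like the following sketch (the critical values are standard two-tailed 95% t-table entries keyed by sample size n, i.e. df = n − 1; the workflow's actual table may differ):

```python
# Two-tailed 95% t critical values keyed by sample size n (df = n - 1);
# n=15 and n=60 are the entries this commit adds.
T_CRIT_95 = {2: 12.706, 5: 2.776, 10: 2.262, 15: 2.145,
             20: 2.093, 30: 2.045, 60: 2.001, 120: 1.980}

def t_critical_95(n: int) -> float:
    """Linearly interpolate between tabulated n values; used only when
    scipy.stats is unavailable for an exact stats.t.ppf lookup."""
    keys = sorted(T_CRIT_95)
    if n <= keys[0]:
        return T_CRIT_95[keys[0]]
    if n >= keys[-1]:
        return T_CRIT_95[keys[-1]]
    lo = max(k for k in keys if k <= n)
    hi = min(k for k in keys if k >= n)
    if lo == hi:  # n is tabulated exactly
        return T_CRIT_95[lo]
    frac = (n - lo) / (hi - lo)
    return T_CRIT_95[lo] + frac * (T_CRIT_95[hi] - T_CRIT_95[lo])
```
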
Copilot AI changed the title from "[WIP] Create daily-experiment-report workflow for statistical analysis" to "feat: rewrite daily-experiment-report workflow with statistical analysis and significance detection" May 1, 2026
Copilot AI requested a review from pelikhan May 1, 2026 21:57
Collaborator

pelikhan left a comment


@copilot create report in discussion, fancy bar charts

on:
  schedule:
    - cron: "daily around 9:00"
    - cron: "0 8 * * *" # 08:00 UTC daily
Collaborator


@copilot keep fuzzy schedule

Contributor Author


Done in e03f3a8 — changed to schedule: daily around 8:00.

- Use fuzzy schedule `daily around 8:00` instead of fixed cron
- Import shared/daily-audit-charts.md for Python env + discussion creation
- Add discussions: read permission
- Add two bar charts per experiment (success rate + mean duration with 95% CI)
- Replace add-comment output with create-discussion via shared import
- Upload charts as assets and embed in discussion body
- Recompile lock file (approved new OTEL secrets from standard audit import chain)

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/ff4c25af-0a3f-4ddc-a398-bcdc73f9a248

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented May 1, 2026

@copilot create report in discussion, fancy bar charts

Done in e03f3a8. The workflow now:

  • Imports shared/daily-audit-charts.md which bundles the Python env (NumPy/Pandas/Matplotlib/Seaborn/SciPy) and configures a create-discussion output with [experiments] title-prefix and 3-day expiry
  • Generates two bar charts per experiment: success rate by variant (with n annotations and control baseline dashed line) and mean duration by variant with 95% CI error bars (see the Matplotlib sketch after this list)
  • Uploads each chart as an asset and embeds it inline in the discussion body
  • Posts a single Discussion per run with all experiments, embedded charts, ASCII comparison tables, and PROMOTE/EXTEND/ABANDON recommendations
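
As an illustration of what the duration chart's generation might look like in the bundled Python environment (a minimal Matplotlib sketch; function and file names are assumptions, Seaborn styling omitted):

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering for CI runners
import matplotlib.pyplot as plt

def duration_chart(variants, means, ci_half_widths, control="control", out="duration.png"):
    """Mean duration by variant with 95% CI error bars and a dashed
    control-baseline line, roughly as described above."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(variants, means, yerr=ci_half_widths, capsize=6)
    if control in variants:
        ax.axhline(means[variants.index(control)], linestyle="--",
                   linewidth=1, label=f"{control} baseline")
        ax.legend()
    ax.set_ylabel("Mean run duration (s)")
    ax.set_title("Mean duration by variant (95% CI)")
    fig.tight_layout()
    fig.savefig(out, dpi=150)  # the saved PNG is then uploaded as an asset

duration_chart(["control", "variant-a"], [42.0, 37.5], [3.1, 2.8])  # illustrative numbers
```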

Copilot AI changed the title from "feat: rewrite daily-experiment-report workflow with statistical analysis and significance detection" to "feat: rewrite daily-experiment-report workflow with statistical analysis, bar charts, and discussion output" May 1, 2026
Copilot AI requested a review from pelikhan May 1, 2026 22:06
pelikhan marked this pull request as ready for review May 1, 2026 22:07
Copilot AI review requested due to automatic review settings May 1, 2026 22:07
pelikhan merged commit 25d1705 into main May 1, 2026
28 checks passed
pelikhan deleted the copilot/create-daily-experiment-report-workflow branch May 1, 2026 22:08
Contributor

Copilot AI left a comment


Pull request overview

Rewrites the daily experiment reporting workflow to aggregate experiment artifacts across recent runs, compute statistical summaries and significance tests, generate charts/tables, and publish a consolidated GitHub Discussion report.

Changes:

  • Replaced the previous daily experiment status report prompt with a detailed artifact-based statistical analysis + visualization/reporting prompt.
  • Added chart generation + asset upload support and updated discussion configuration via shared imports/safe-outputs.
  • Regenerated compiled lock workflows (including updated experiment spec serialization for an existing workflow).
| File | Description |
| --- | --- |
| .github/workflows/daily-experiment-report.md | New end-to-end instructions for discovering experiments, aggregating artifacts, computing stats/significance, generating charts/tables, and posting a Discussion. |
| .github/workflows/daily-experiment-report.lock.yml | Recompiled workflow lock to include discussion/charts imports, OTLP env, Python setup, asset upload plumbing, and related runtime wiring. |
| .github/workflows/daily-community-attribution.lock.yml | Updates the serialized experiment spec format used by the compiled workflow. |

Copilot's findings


Comments suppressed due to low confidence (1)

.github/workflows/daily-experiment-report.md:60

  • The 'no active experiments' behavior is contradictory: Step 1 says to write a notice to $GITHUB_STEP_SUMMARY and exit, but Step 7 later says to create a discussion with a notice when no workflows declare experiments:. Please pick one behavior (or clarify conditional logic) so the agent doesn't produce both/incorrectly exit before creating the discussion.
> If no workflows declare `experiments:`, append the following to `$GITHUB_STEP_SUMMARY` and exit:
>
> No active experiments found in ${{ github.repository }} — nothing to report.

  • Files reviewed: 3/3 changed files
  • Comments generated: 3

Comment on lines +41 to +44

> You are a **statistical analyst** for agentic workflow A/B experiments. Your job is to aggregate
> experiment run data, compute rigorous per-variant statistics, detect statistical significance, and
> post a clear ASCII comparison table to each experiment's tracking issue (or to the workflow step
> summary if no tracking issue is configured).

## Phase 1: Discover Active Experiments

Read each workflow's frontmatter and identify those that declare an `experiments:` section. Collect:

- Workflow file name (e.g. `daily-report.lock.yml`)
- `run_id`
- `conclusion` (`success`, `failure`, `cancelled`, …)
- `created_at` and `updated_at`
- `run_duration_ms` (derived from `created_at` and `updated_at`)
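
Deriving `run_duration_ms` from those two timestamps is straightforward over the GitHub API's ISO-8601 strings; a sketch:

```python
from datetime import datetime

def run_duration_ms(created_at: str, updated_at: str) -> int:
    """Milliseconds between two GitHub API timestamps, e.g. '2026-05-01T21:50:00Z'."""
    start = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    end = datetime.fromisoformat(updated_at.replace("Z", "+00:00"))
    return int((end - start).total_seconds() * 1000)
```
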
github-actions bot mentioned this pull request May 1, 2026


Development

Successfully merging this pull request may close these issues.

[plan] Create daily-experiment-report workflow for statistical analysis
