
feat: rewrite daily-experiment-report workflow with statistical analysis, bar charts, and discussion output #29635

Merged
pelikhan merged 5 commits into main from copilot/create-daily-experiment-report-workflow
May 1, 2026

Conversation

Contributor

Copilot AI commented May 1, 2026

Automated experiment reporting was missing — results from A/B experiment runs were invisible without manual artifact downloads and analysis.

Changes

Workflow rewrite (.github/workflows/daily-experiment-report.md)

Replaces the previous status-only report with a statistically rigorous pipeline that posts a rich Discussion with embedded charts:

  • Frontmatter: fuzzy schedule: daily around 8:00, engine: copilot, toolsets: [default, actions], imports shared/daily-audit-charts.md (Python env + create-discussion + OTLP observability); upload-asset for chart embedding
  • Artifact aggregation: downloads state.json from each run's experiment artifact; infers per-run variant assignment by diffing consecutive cumulative counts, with explicit handling for missing artifacts, zero-delta (unknown), and multi-delta (ambiguous) cases (see the inference sketch after this list)
  • Statistics per variant: mean duration, sample variance (Bessel-corrected), 95% CI via t-distribution, success rate — guards for n < 2 (N/A) and zero variance (skip Welch test)
  • Significance detection (p < 0.05): two-proportion z-test for success rate, Welch's t-test for duration; recommends scipy.stats for precision with interpolation-table fallback (see the statistics sketch after this list)
  • Bar charts per experiment generated with Matplotlib/Seaborn:
    • Chart A — Success rate by variant with percentage labels, sample-size annotations, and control baseline dashed line
    • Chart B — Mean duration by variant with 95% CI error bars and control baseline
  • ASCII comparison table per experiment (variant, n, success %, mean duration, 95% CI, p-value)
  • Output: single GitHub Discussion per run (title-prefix [experiments], 3-day expiry, older discussions auto-closed) containing all embedded charts, ASCII tables, and recommendations
  • Decision rules: PROMOTE (p < 0.05, improvement), EXTEND (p ≥ 0.05, n < 20), ABANDON (p ≥ 0.05, n ≥ 20 or degradation); includes low-power warning for n < 20
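
To make the variant-inference step concrete, here is a minimal sketch of the diffing logic, assuming state.json exposes cumulative per-variant counters as a flat mapping (the actual artifact schema may differ):

```python
def infer_variant(prev_counts: dict[str, int], curr_counts: dict[str, int]) -> str | None:
    """Infer a run's variant by diffing consecutive cumulative counters.

    Returns the variant whose counter advanced, or None for the exclusion
    cases described above: zero deltas (unknown) and multiple deltas
    (ambiguous). Runs with missing artifacts are skipped before this point.
    """
    increased = [
        variant
        for variant, count in curr_counts.items()
        if count > prev_counts.get(variant, 0)
    ]
    if len(increased) == 1:
        return increased[0]  # exactly one counter advanced: unambiguous assignment
    return None  # 0 deltas (unknown) or >1 deltas (ambiguous): exclude the run
```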

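The statistics and significance tests can be sketched the same way, using scipy.stats as the workflow recommends (function names and data shapes here are illustrative assumptions, not the workflow's actual code):

```python
import math
from scipy import stats

def variant_stats(durations: list[float], confidence: float = 0.95):
    """Mean, Bessel-corrected sample variance, and t-based CI half-width.
    Variance and CI are None when n < 2, mirroring the N/A guard above."""
    n = len(durations)
    mean = sum(durations) / n
    if n < 2:
        return mean, None, None
    var = sum((x - mean) ** 2 for x in durations) / (n - 1)  # Bessel correction
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean, var, t_crit * math.sqrt(var / n)

def welch_p_value(a: list[float], b: list[float]) -> float | None:
    """Welch's t-test p-value for durations; None for the skip cases above."""
    if len(a) < 2 or len(b) < 2:
        return None  # n < 2: excluded from the test, reported as N/A
    if len(set(a)) == 1 or len(set(b)) == 1:
        return None  # zero variance: test skipped, reported as N/A with a note
    _, p = stats.ttest_ind(a, b, equal_var=False)
    return p

def two_proportion_p_value(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float | None:
    """Two-sided two-proportion z-test for success rates, pooled estimate."""
    p_pool = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return None  # all successes or all failures across both variants
    z = (succ_a / n_a - succ_b / n_b) / se
    return 2 * (1 - stats.norm.cdf(abs(z)))

def recommend(p_value: float | None, n: int, improved: bool) -> str:
    """Decision rules as listed above: PROMOTE on a significant improvement,
    EXTEND while underpowered (n < 20), otherwise ABANDON."""
    if p_value is not None and p_value < 0.05:
        return "PROMOTE" if improved else "ABANDON"  # significant degradation
    return "EXTEND" if n < 20 else "ABANDON"
```
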
Compiled lock file

.github/workflows/daily-experiment-report.lock.yml regenerated via ./gh-aw compile --approve. The approval covers the standard OTEL secrets introduced by the shared/daily-audit-base.md import, which are already used by all other audit workflows in this repo.

Copilot AI and others added 3 commits May 1, 2026 21:50
- Rewrites daily-experiment-report.md with proper frontmatter:
  - cron: "0 8 * * *" (08:00 UTC daily) + workflow_dispatch
  - engine: copilot with github toolsets [default, actions]
  - safe-outputs add-comment with hide-older-comments
  - mentions/reference filtering to avoid notification noise
- Prompt instructs the agent to:
  1. Find workflows declaring experiments: in frontmatter
  2. List last 30 runs and download state.json artifacts
  3. Infer per-run variant from consecutive cumulative count diffs
  4. Compute mean, variance, 95% CI, success rate per variant
  5. Detect significance: two-proportion z-test (binary) or
     Welch t-test (continuous), threshold p < 0.05
  6. Render ASCII comparison table per experiment
  7. Post to tracking issue (if issue: set) or step summary
  8. Recommend PROMOTE / EXTEND / ABANDON with rationale
- Recompiles cleanly (make recompile passes with no other diffs)

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/bb347623-6d55-453b-82f4-0e16d9f922bd

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
- Clarify "last 30 completed runs" means any final state (success/failure/cancelled/skipped)
- Add variant inference edge case handling: missing artifacts (skip), zero increases
  (unknown/excluded), multiple increases (ambiguous/excluded)
- Recommend scipy.stats for precise t/z critical values with linear interpolation fallback
- Add n=15 and n=60 entries to the t-critical value table
- Add n < 2 edge case: show N/A for variance/CI, exclude from Welch t-test
- Add zero-variance edge case for Welch t-test: show N/A with note
- Add statistical power note: n < 20 tests have low power, use EXTEND to gather more data

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/bb347623-6d55-453b-82f4-0e16d9f922bd

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
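
The interpolation fallback described in this commit might look like the following sketch (the critical values are standard two-tailed 95% t-table entries keyed by sample size n, i.e. df = n − 1; the workflow's actual table may differ):

```python
# Two-tailed 95% t critical values keyed by sample size n (df = n - 1);
# n=15 and n=60 are the entries this commit adds.
T_CRIT_95 = {2: 12.706, 5: 2.776, 10: 2.262, 15: 2.145,
             20: 2.093, 30: 2.045, 60: 2.001, 120: 1.980}

def t_critical_95(n: int) -> float:
    """Linearly interpolate between tabulated n values; used only when
    scipy.stats is unavailable for an exact stats.t.ppf lookup."""
    keys = sorted(T_CRIT_95)
    if n <= keys[0]:
        return T_CRIT_95[keys[0]]
    if n >= keys[-1]:
        return T_CRIT_95[keys[-1]]
    lo = max(k for k in keys if k <= n)
    hi = min(k for k in keys if k >= n)
    if lo == hi:  # n is tabulated exactly
        return T_CRIT_95[lo]
    frac = (n - lo) / (hi - lo)
    return T_CRIT_95[lo] + frac * (T_CRIT_95[hi] - T_CRIT_95[lo])
```
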
Copilot AI changed the title from "[WIP] Create daily-experiment-report workflow for statistical analysis" to "feat: rewrite daily-experiment-report workflow with statistical analysis and significance detection" May 1, 2026
Copilot AI requested a review from pelikhan May 1, 2026 21:57
Collaborator

pelikhan left a comment


@copilot create report in discussion, fancy bar charts

on:
  schedule:
    - cron: "daily around 9:00"
    - cron: "0 8 * * *" # 08:00 UTC daily
Collaborator


@copilot keep fuzzy schedule

Contributor Author


Done in e03f3a8 — changed to schedule: daily around 8:00.

- Use fuzzy schedule `daily around 8:00` instead of fixed cron
- Import shared/daily-audit-charts.md for Python env + discussion creation
- Add discussions: read permission
- Add two bar charts per experiment (success rate + mean duration with 95% CI)
- Replace add-comment output with create-discussion via shared import
- Upload charts as assets and embed in discussion body
- Recompile lock file (approved new OTEL secrets from standard audit import chain)

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/ff4c25af-0a3f-4ddc-a398-bcdc73f9a248

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented May 1, 2026

@copilot create report in discussion, fancy bar charts

Done in e03f3a8. The workflow now:

  • Imports shared/daily-audit-charts.md which bundles the Python env (NumPy/Pandas/Matplotlib/Seaborn/SciPy) and configures a create-discussion output with [experiments] title-prefix and 3-day expiry
  • Generates two bar charts per experiment: success rate by variant (with n annotations and control baseline dashed line) and mean duration by variant with 95% CI error bars (see the Matplotlib sketch after this list)
  • Uploads each chart as an asset and embeds it inline in the discussion body
  • Posts a single Discussion per run with all experiments, embedded charts, ASCII comparison tables, and PROMOTE/EXTEND/ABANDON recommendations
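
As an illustration of what the duration chart's generation might look like in the bundled Python environment (a minimal Matplotlib sketch; function and file names are assumptions, Seaborn styling omitted):

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering for CI runners
import matplotlib.pyplot as plt

def duration_chart(variants, means, ci_half_widths, control="control", out="duration.png"):
    """Mean duration by variant with 95% CI error bars and a dashed
    control-baseline line, roughly as described above."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(variants, means, yerr=ci_half_widths, capsize=6)
    if control in variants:
        ax.axhline(means[variants.index(control)], linestyle="--",
                   linewidth=1, label=f"{control} baseline")
        ax.legend()
    ax.set_ylabel("Mean run duration (s)")
    ax.set_title("Mean duration by variant (95% CI)")
    fig.tight_layout()
    fig.savefig(out, dpi=150)  # the saved PNG is then uploaded as an asset

duration_chart(["control", "variant-a"], [42.0, 37.5], [3.1, 2.8])  # illustrative numbers
```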

Copilot AI changed the title from "feat: rewrite daily-experiment-report workflow with statistical analysis and significance detection" to "feat: rewrite daily-experiment-report workflow with statistical analysis, bar charts, and discussion output" May 1, 2026
Copilot AI requested a review from pelikhan May 1, 2026 22:06
pelikhan marked this pull request as ready for review May 1, 2026 22:07
Copilot AI review requested due to automatic review settings May 1, 2026 22:07
pelikhan merged commit 25d1705 into main May 1, 2026
28 checks passed
pelikhan deleted the copilot/create-daily-experiment-report-workflow branch May 1, 2026 22:08
Contributor

Copilot AI left a comment


Pull request overview

Rewrites the daily experiment reporting workflow to aggregate experiment artifacts across recent runs, compute statistical summaries and significance tests, generate charts/tables, and publish a consolidated GitHub Discussion report.

Changes:

  • Replaced the previous daily experiment status report prompt with a detailed artifact-based statistical analysis + visualization/reporting prompt.
  • Added chart generation + asset upload support and updated discussion configuration via shared imports/safe-outputs.
  • Regenerated compiled lock workflows (including updated experiment spec serialization for an existing workflow).
| File | Description |
| --- | --- |
| .github/workflows/daily-experiment-report.md | New end-to-end instructions for discovering experiments, aggregating artifacts, computing stats/significance, generating charts/tables, and posting a Discussion. |
| .github/workflows/daily-experiment-report.lock.yml | Recompiled workflow lock to include discussion/charts imports, OTLP env, Python setup, asset upload plumbing, and related runtime wiring. |
| .github/workflows/daily-community-attribution.lock.yml | Updates the serialized experiment spec format used by the compiled workflow. |

Copilot's findings


Comments suppressed due to low confidence (1)

.github/workflows/daily-experiment-report.md:60

  • The 'no active experiments' behavior is contradictory: Step 1 says to write a notice to $GITHUB_STEP_SUMMARY and exit, but Step 7 later says to create a discussion with a notice when no workflows declare experiments:. Please pick one behavior (or clarify conditional logic) so the agent doesn't produce both/incorrectly exit before creating the discussion.
> If no workflows declare `experiments:`, append the following to `$GITHUB_STEP_SUMMARY` and exit:
>
> No active experiments found in ${{ github.repository }} — nothing to report.

  • Files reviewed: 3/3 changed files
  • Comments generated: 3

Comment on lines +41 to +44

> You are a **statistical analyst** for agentic workflow A/B experiments. Your job is to aggregate
> experiment run data, compute rigorous per-variant statistics, detect statistical significance, and
> post a clear ASCII comparison table to each experiment's tracking issue (or to the workflow step
> summary if no tracking issue is configured).

## Phase 1: Discover Active Experiments

Read each workflow's frontmatter and identify those that declare an `experiments:` section. Collect:

- Workflow file name (e.g. `daily-report.lock.yml`)
- `run_id`
- `conclusion` (`success`, `failure`, `cancelled`, …)
- `created_at` and `updated_at`
- `run_duration_ms` (derived from `created_at` and `updated_at`)
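
Deriving `run_duration_ms` from those two timestamps is straightforward over the GitHub API's ISO-8601 strings; a sketch:

```python
from datetime import datetime

def run_duration_ms(created_at: str, updated_at: str) -> int:
    """Milliseconds between two GitHub API timestamps, e.g. '2026-05-01T21:50:00Z'."""
    start = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    end = datetime.fromisoformat(updated_at.replace("Z", "+00:00"))
    return int((end - start).total_seconds() * 1000)
```
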
github-actions bot mentioned this pull request May 1, 2026


Development

Successfully merging this pull request may close these issues.

[plan] Create daily-experiment-report workflow for statistical analysis
