Agent-Logs-Url: https://github.com/github/gh-aw/sessions/bb347623-6d55-453b-82f4-0e16d9f922bd
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
- Rewrites daily-experiment-report.md with proper frontmatter:
- cron: "0 8 * * *" (08:00 UTC daily) + workflow_dispatch
- engine: copilot with github toolsets [default, actions]
- safe-outputs add-comment with hide-older-comments
- mentions/reference filtering to avoid notification noise
- Prompt instructs the agent to:
1. Find workflows declaring experiments: in frontmatter
2. List last 30 runs and download state.json artifacts
3. Infer per-run variant from consecutive cumulative count diffs
4. Compute mean, variance, 95% CI, success rate per variant
5. Detect significance: two-proportion z-test (binary) or
Welch t-test (continuous), threshold p < 0.05
6. Render ASCII comparison table per experiment
7. Post to tracking issue (if issue: set) or step summary
8. Recommend PROMOTE / EXTEND / ABANDON with rationale
- Recompiles cleanly (make recompile passes with no other diffs)
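The significance tests named in steps 4-5 can be sketched in plain Python. This is a minimal illustration of the two techniques, not the workflow's actual implementation; it avoids scipy by using the standard-normal survival function for the z-test p-value:

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-test for binary outcomes such as run success rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, standard normal
    return z, p_value

def welch_t(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Welch t-test statistic and degrees of freedom for continuous
    outcomes (e.g. run duration); does not assume equal variances."""
    se2 = var_a / n_a + var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / ((var_a / n_a) ** 2 / (n_a - 1)
                     + (var_b / n_b) ** 2 / (n_b - 1))
    return t, df
```

With 45/50 versus 30/50 successes, for example, the z-test reports z ≈ 3.46 with a p-value well under the 0.05 threshold.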
- Clarify "last 30 completed runs" means any final state (success/failure/cancelled/skipped)
- Add variant inference edge case handling: missing artifacts (skip), zero increases (unknown/excluded), multiple increases (ambiguous/excluded)
- Recommend scipy.stats for precise t/z critical values, with linear interpolation fallback
- Add n=15 and n=60 entries to the t-critical value table
- Add n < 2 edge case: show N/A for variance/CI, exclude from Welch t-test
- Add zero-variance edge case for Welch t-test: show N/A with note
- Add statistical power note: n < 20 tests have low power, use EXTEND to gather more data
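The variant-inference rule with the edge cases listed above (skip runs with missing artifacts; exclude zero-increase and multi-increase runs) could look roughly like this. It is a sketch that assumes `state.json` carries a per-variant cumulative count map, which is an inference from the commit message, not code from the PR:

```python
def infer_variant(prev_counts, curr_counts):
    """Infer which variant a run exercised from the change in cumulative
    per-variant counts between two consecutive state.json snapshots.
    Returns the variant name, or None when the run must be excluded."""
    increased = [v for v in curr_counts
                 if curr_counts[v] > prev_counts.get(v, 0)]
    if len(increased) == 1:
        return increased[0]  # exactly one counter moved: unambiguous
    return None  # zero increases (unknown) or several (ambiguous): exclude
```

Runs whose `state.json` artifact is missing never reach this function; they are skipped during artifact download.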
Copilot (AI) changed the title from "[WIP] Create daily-experiment-report workflow for statistical analysis" to "feat: rewrite daily-experiment-report workflow with statistical analysis and significance detection" on May 1, 2026.
pelikhan reviewed on May 1, 2026.
- Use fuzzy schedule `daily around 8:00` instead of fixed cron
- Import shared/daily-audit-charts.md for Python env + discussion creation
- Add `discussions: read` permission
- Add two bar charts per experiment (success rate + mean duration with 95% CI)
- Replace add-comment output with create-discussion via shared import
- Upload charts as assets and embed in discussion body
- Recompile lock file (approved new OTEL secrets from standard audit import chain)

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/ff4c25af-0a3f-4ddc-a398-bcdc73f9a248
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
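The 95% CI error bars on the duration chart, together with the interpolation-table fallback recommended in the review round, can be sketched as follows. The lookup table holds standard two-sided 95% t critical values (including the df = 14 and df = 59 rows corresponding to n = 15 and n = 60); it is an illustration standing in for `scipy.stats.t.ppf`, not code from the PR:

```python
import math

# Two-sided 95% t critical values indexed by degrees of freedom (df = n - 1).
T_TABLE = [(1, 12.706), (2, 4.303), (4, 2.776), (9, 2.262), (14, 2.145),
           (19, 2.093), (29, 2.045), (59, 2.001), (120, 1.980)]

def t_crit_95(df):
    """Approximate a two-sided 95% t critical value by table interpolation."""
    if df >= T_TABLE[-1][0]:
        return T_TABLE[-1][1]
    for (d0, t0), (d1, t1) in zip(T_TABLE, T_TABLE[1:]):
        if d0 <= df <= d1:
            return t0 + (t1 - t0) * (df - d0) / (d1 - d0)
    return T_TABLE[0][1]  # df < 1 cannot occur when n >= 2

def ci95(values):
    """Mean 95% confidence interval; None when n < 2 (report N/A)."""
    n = len(values)
    if n < 2:
        return None
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / (n - 1)  # sample variance
    half = t_crit_95(n - 1) * math.sqrt(var / n)
    return mean - half, mean + half
```

Returning `None` for n < 2 matches the review-round decision to show N/A for variance/CI rather than plot a meaningless bar.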
Contributor (Author)
Done in e03f3a8. The workflow now:
Copilot (AI) changed the title from "feat: rewrite daily-experiment-report workflow with statistical analysis and significance detection" to "feat: rewrite daily-experiment-report workflow with statistical analysis, bar charts, and discussion output" on May 1, 2026.
Contributor
Pull request overview
Rewrites the daily experiment reporting workflow to aggregate experiment artifacts across recent runs, compute statistical summaries and significance tests, generate charts/tables, and publish a consolidated GitHub Discussion report.
Changes:
- Replaced the previous daily experiment status report prompt with a detailed artifact-based statistical analysis + visualization/reporting prompt.
- Added chart generation + asset upload support and updated discussion configuration via shared imports/safe-outputs.
- Regenerated compiled lock workflows (including updated experiment spec serialization for an existing workflow).
Summary per file:
| File | Description |
|---|---|
| .github/workflows/daily-experiment-report.md | New end-to-end instructions for discovering experiments, aggregating artifacts, computing stats/significance, generating charts/tables, and posting a Discussion. |
| .github/workflows/daily-experiment-report.lock.yml | Recompiled workflow lock to include discussion/charts imports, OTLP env, Python setup, asset upload plumbing, and related runtime wiring. |
| .github/workflows/daily-community-attribution.lock.yml | Updates the serialized experiment spec format used by the compiled workflow. |
Copilot's findings
Comments suppressed due to low confidence (1)
.github/workflows/daily-experiment-report.md:60
- The "no active experiments" behavior is contradictory: Step 1 says to write a notice to `$GITHUB_STEP_SUMMARY` and exit, but Step 7 later says to create a discussion with a notice when no workflows declare `experiments:`. Please pick one behavior (or clarify the conditional logic) so the agent doesn't produce both or incorrectly exit before creating the discussion.
If no workflows declare `experiments:`, append the following to `$GITHUB_STEP_SUMMARY` and exit:
No active experiments found in ${{ github.repository }} — nothing to report.
- Files reviewed: 3/3 changed files
- Comments generated: 3
Comment on lines +41 to +44
> You are a **statistical analyst** for agentic workflow A/B experiments. Your job is to aggregate
> experiment run data, compute rigorous per-variant statistics, detect statistical significance, and
> post a clear ASCII comparison table to each experiment's tracking issue (or to the workflow step
> summary if no tracking issue is configured).

> ## Phase 1: Discover Active Experiments
>
> read its frontmatter and identify those that declare an `experiments:` section. Collect:
>
> - Workflow file name (e.g. `daily-report.lock.yml`)

> - `run_id`
> - `conclusion` (`success`, `failure`, `cancelled`, …)
> - `created_at` and `updated_at`
> - `run_duration_ms` (derived from `created_at` and `updated_at`)
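Deriving `run_duration_ms` from the two timestamps is straightforward; a sketch assuming the GitHub API's `YYYY-MM-DDTHH:MM:SSZ` timestamp format:

```python
from datetime import datetime, timezone

def run_duration_ms(created_at: str, updated_at: str) -> int:
    """Derive run duration in milliseconds from ISO-8601 run timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    end = datetime.strptime(updated_at, fmt).replace(tzinfo=timezone.utc)
    return int((end - start).total_seconds() * 1000)
```

Note that `created_at` marks run creation, so this duration includes queue time, not just execution time.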
Automated experiment reporting was missing — results from A/B experiment runs were invisible without manual artifact downloads and analysis.
Changes
Workflow rewrite (`.github/workflows/daily-experiment-report.md`)

Replaces the previous status-only report with a statistically rigorous pipeline that posts a rich Discussion with embedded charts:

- `schedule: daily around 8:00`, `engine: copilot`, `toolsets: [default, actions]`; imports `shared/daily-audit-charts.md` (Python env + `create-discussion` + OTLP observability); `upload-asset` for chart embedding
- Downloads `state.json` from each run's `experiment` artifact; infers per-run variant assignment by diffing consecutive cumulative counts, with explicit handling for missing artifacts, zero-delta (unknown), and multi-delta (ambiguous) cases
- Uses `scipy.stats` for precision, with an interpolation-table fallback
- Posts a Discussion (`[experiments]`, 3-day expiry, older discussions auto-closed) containing all embedded charts, ASCII tables, and recommendations

Compiled lock file

`.github/workflows/daily-experiment-report.lock.yml` regenerated via `./gh-aw compile --approve` (approved standard OTEL secrets introduced by the `shared/daily-audit-base.md` import, already used by all other audit workflows in this repo).
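Pulling the pieces named above together, the rewritten workflow's frontmatter plausibly looks something like the following. This is an illustrative sketch only: the exact gh-aw field names and nesting are assumptions, not copied from the PR, and only fields mentioned in this conversation are shown.

```yaml
---
on:
  schedule: daily around 8:00   # fuzzy schedule instead of a fixed cron
  workflow_dispatch:
permissions:
  discussions: read             # other permissions omitted here
engine: copilot
tools:
  github:
    toolsets: [default, actions]
imports:
  - shared/daily-audit-charts.md  # Python env + create-discussion + OTLP
safe-outputs:
  upload-asset:                   # charts embedded in the discussion body
---
```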