
feat: update daily-experiment-report to use experiments CLI commands #30044

Merged
pelikhan merged 2 commits into main from copilot/update-daily-experiments-report
May 4, 2026

Conversation

Contributor

Copilot AI commented May 4, 2026

Summary

Updates the `daily-experiment-report` workflow to use the `gh aw experiments` CLI command and the `gh aw experiments analyze` tool, replacing the manual frontmatter-parsing approach with dedicated CLI tooling.

Changes

  • Added `cli-proxy: true` to the `tools` section so the agent can call `gh aw` commands
  • Step 1 (Discovery): Now calls `gh aw experiments list --json --repo ${{ github.repository }}` to discover active experiments, then calls `gh aw experiments analyze <id> --json --repo ${{ github.repository }}` per experiment to retrieve:
    • Per-variant counts and percentages from git branch state
    • Chi-square balance test results (`chi_square`, `p_value`, `is_balanced`)
    • Min-samples readiness gate (`recommendation`: `EXTEND` / `READY_FOR_ANALYSIS`)
    • Hypothesis text, analysis type, guardrail thresholds from frontmatter
    • Bonferroni-corrected alpha for 3+ variant experiments
  • Step 2 (Run Data): Uses the `recent_runs` array from the analyze output for explicit variant assignments, supplemented by GitHub MCP tools for per-run outcome data (success rates, durations)
  • Step 3 (Statistics): References analyze output fields directly (`count`, `min_samples`, balance test) instead of recomputing them; only outcome metrics (success rate, duration CI) are computed from per-run records
  • Step 4 (Significance): Uses the CLI's `recommendation` field for the min-samples gate instead of computing it independently
  • Updated the description to reflect the new CLI-driven approach
  • Recompiled lock file (`daily-experiment-report.lock.yml`)
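
For readers unfamiliar with the balance test and corrected alpha listed under Step 1, here is a minimal Python sketch. It assumes the balance check is a chi-square goodness-of-fit test against an equal split, and that the Bonferroni correction divides alpha by the number of pairwise variant comparisons. The PR states neither detail, so both are assumptions, and all function names are hypothetical rather than part of `gh aw`.

```python
from itertools import combinations

# Critical values of the chi-square distribution at alpha = 0.05,
# keyed by degrees of freedom (standard table values).
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_balance(counts):
    """Goodness-of-fit statistic against an equal split (assumed design)."""
    total = sum(counts)
    if total == 0:
        return 0.0  # no assignments yet, nothing to test
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def is_balanced(counts):
    """True when the statistic stays below the 5% critical value."""
    df = len(counts) - 1
    return chi_square_balance(counts) < CHI2_CRIT_05[df]


def bonferroni_alpha(num_variants, alpha=0.05):
    """Divide alpha by the number of pairwise comparisons (assumption)."""
    comparisons = len(list(combinations(range(num_variants), 2)))
    return alpha / max(comparisons, 1)
```

With two variants the corrected alpha stays at 0.05; with three variants there are three pairwise comparisons, so each test would use roughly 0.0167 under this assumption.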

@pelikhan pelikhan marked this pull request as ready for review May 4, 2026 02:50
Copilot AI review requested due to automatic review settings May 4, 2026 02:50
@pelikhan pelikhan merged commit 34dfc2a into main May 4, 2026
19 checks passed
@pelikhan pelikhan deleted the copilot/update-daily-experiments-report branch May 4, 2026 02:51
Copilot stopped work on behalf of pelikhan due to an error May 4, 2026 02:51
Copilot AI requested a review from pelikhan May 4, 2026 02:51
@github-actions github-actions Bot mentioned this pull request May 4, 2026
Contributor

Copilot AI left a comment


Pull request overview

Updates the daily-experiment-report workflow spec to rely on the gh aw experiments CLI for experiment discovery/analysis, and regenerates compiled workflow lockfiles.

Changes:

  • Update daily-experiment-report.md to use gh aw experiments list/analyze outputs (and add tools.cli-proxy: true).
  • Recompile daily-experiment-report.lock.yml to reflect the updated spec.
  • Regenerate multiple other *.lock.yml files, changing Copilot CLI --allow-tool shell(...) allowlist prefixes.
| File | Description |
| --- | --- |
| .github/workflows/daily-experiment-report.md | Switches the workflow instructions to CLI-driven experiment discovery/analysis; adds cli-proxy: true. |
| .github/workflows/daily-experiment-report.lock.yml | Recompiled lockfile reflecting updated prompt/spec and environment. |
| .github/workflows/workflow-skill-extractor.lock.yml | Lockfile recompile; --allow-tool shell(...) prefixes became more generic. |
| .github/workflows/ubuntu-image-analyzer.lock.yml | Lockfile recompile; broader find allowlist prefix. |
| .github/workflows/spec-librarian.lock.yml | Lockfile recompile; broader find/grep/git log allowlist prefixes. |
| .github/workflows/spec-extractor.lock.yml | Lockfile recompile; broader find/grep allowlist prefixes. |
| .github/workflows/layout-spec-maintainer.lock.yml | Lockfile recompile; broader find/grep allowlist prefixes and removed constrained yq form. |
| .github/workflows/discussion-task-miner.lock.yml | Lockfile recompile; broader find allowlist prefix. |
| .github/workflows/delight.lock.yml | Lockfile recompile; broader find allowlist prefixes. |
| .github/workflows/daily-testify-uber-super-expert.lock.yml | Lockfile recompile; broader find/grep allowlist prefixes. |
| .github/workflows/daily-safe-output-integrator.lock.yml | Lockfile recompile; broader find/grep allowlist prefixes. |
| .github/workflows/daily-mcp-concurrency-analysis.lock.yml | Lockfile recompile; broader find/git log/grep/jq allowlist prefixes. |
| .github/workflows/daily-file-diet.lock.yml | Lockfile recompile; broader find/grep allowlist prefixes. |
| .github/workflows/daily-compiler-quality.lock.yml | Lockfile recompile; broader find/git log/grep allowlist prefixes. |
| .github/workflows/copilot-cli-deep-research.lock.yml | Lockfile recompile; broader find allowlist prefixes. |
| .github/workflows/ab-testing-advisor.lock.yml | Lockfile recompile; broader find/grep allowlist prefixes. |

Copilot's findings


Comments suppressed due to low confidence (3)

.github/workflows/daily-safe-output-integrator.lock.yml:768

  • --allow-tool shell(grep -n) / shell(grep -rn) are very generic prefixes compared to the previous file/pattern-scoped grep invocations. With prefix matching, this broadens allowed shell activity; consider updating the source workflow’s tools.bash commands to avoid single quotes so the compiled lockfile can keep more restrictive prefixes.
        # --allow-tool shell(grep -n)
        # --allow-tool shell(grep -rn)

.github/workflows/layout-spec-maintainer.lock.yml:724

  • --allow-tool shell(grep -r) is much broader than the previous constrained grep -r '.*' pkg/workflow/... commands. With prefix matching, this expands what can be searched/parsed via shell beyond the apparent intent. Consider changing the underlying tools.bash patterns to avoid single quotes (use double quotes) so the lockfile can keep tighter prefixes.
        # --allow-tool shell(grep -r)

.github/workflows/daily-testify-uber-super-expert.lock.yml:822

  • --allow-tool shell(grep -r) is a very generic prefix compared to the previous constrained grep -r ... --include=... entries. With prefix matching, this increases the shell surface area beyond the original intent; consider adjusting the underlying tools.bash entries to avoid single quotes so the lockfile can retain narrower prefixes.
        # --allow-tool shell(grep -r)
  • Files reviewed: 16/16 changed files
  • Comments generated: 18
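
The concern raised in the suppressed comments above is easiest to see with a toy model of prefix matching. The sketch below assumes that an allowlist entry permits any command whose text starts with that entry. The actual matching rules of the Copilot CLI are not shown in this PR, so `is_allowed` is a hypothetical stand-in, and the example commands and paths are made up for illustration.

```python
def is_allowed(command, allowlist):
    """Toy prefix match: a command is allowed if it starts with any entry.

    This is an assumed model of the prefix matching the review mentions,
    not the real Copilot CLI implementation.
    """
    return any(command.startswith(entry) for entry in allowlist)


# A pattern- and path-scoped entry versus a generic prefix (both hypothetical).
scoped = ["grep -r TODO docs/"]
generic = ["grep -r"]

inside = "grep -r TODO docs/guide.md"   # within the original intent
outside = "grep -r password /etc"       # outside the original intent

print(is_allowed(inside, scoped))    # the scoped entry admits the intended search
print(is_allowed(outside, scoped))   # and rejects the unrelated one
print(is_allowed(outside, generic))  # the generic prefix admits both
```

Under this model, replacing the scoped entry with the generic `grep -r` prefix turns a rejected command into an allowed one, which is the broadening the review points out.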

Comment on lines +780 to +781
# --allow-tool shell(find pkg -name)
# --allow-tool shell(find pkg -type f -name)
Comment on lines +777 to +778
# --allow-tool shell(find .github/workflows -name)
# --allow-tool shell(find docs/src/content/docs -name)
Comment on lines +240 to +242
recommendation (regardless of p-value) and show the per-variant progress toward `min_samples`
from `analyses[].variants[].min_samples_reached`. Only proceed with `PROMOTE` or `ABANDON` when
the CLI returns `READY_FOR_ANALYSIS`.
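
The gate described in this excerpt can be sketched in Python as follows. The field names `recommendation`, `variants`, and `min_samples_reached` come from the excerpt; the `name` key, the position of `recommendation` inside each analysis entry, and the `significant` and `improved` inputs are assumptions made for illustration, as is the choice to return `EXTEND` when the result is not significant.

```python
def decide(analysis, significant, improved):
    """Return EXTEND unless the CLI reports READY_FOR_ANALYSIS.

    `analysis` mirrors one entry of the assumed `analyses` array;
    `significant` and `improved` are hypothetical inputs standing in
    for the outcome-metric comparison done elsewhere in the report.
    """
    if analysis.get("recommendation") != "READY_FOR_ANALYSIS":
        return "EXTEND"  # gate closed, regardless of p-value
    if not significant:
        return "EXTEND"  # assumption: keep collecting data
    return "PROMOTE" if improved else "ABANDON"


def pending_variants(analysis):
    """Names of variants that have not yet reached min_samples."""
    return [
        v["name"]
        for v in analysis.get("variants", [])
        if not v.get("min_samples_reached", False)
    ]
```

Keeping the gate as the first check means a small p-value on an under-sampled experiment can never produce `PROMOTE` or `ABANDON`.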
# --allow-tool shell(date)
# --allow-tool shell(echo)
# --allow-tool shell(find pkg/cli/workflows -name 'test-*.md' -type f)
# --allow-tool shell(find pkg/cli/workflows -name)
Comment on lines +816 to +818
# --allow-tool shell(find . -name)
# --allow-tool shell(find pkg -name)
# --allow-tool shell(find pkg -type f -name)
Comment on lines +713 to +715
# --allow-tool shell(find .github -name)
# --allow-tool shell(find .github -type f -exec cat {} +)
# --allow-tool shell(find pkg -name 'copilot*.go')
# --allow-tool shell(find pkg -name)
# For more information: https://github.github.com/gh-aw/introduction/overview/
#
# Daily statistical report that aggregates experiment-state artifacts across recent runs, computes per-variant statistics (mean, variance, 95% CI, success rate), detects significance via Welch t-test or two-proportion z-test (p < 0.05), checks guardrail metric thresholds, renders bar charts and an ASCII comparison table per experiment, and posts a discussion with a promote/extend/abandon recommendation; notifies tracking issues when experiments reach statistical significance or min_samples
# Daily statistical report that uses the experiments CLI command to list active experiments and the experiments analyze tool to get per-variant statistics and statistical significance, then computes per-variant success rates and durations from run artifacts, renders bar charts and an ASCII comparison table per experiment, and posts a discussion with a promote/extend/abandon recommendation; notifies tracking issues when experiments reach statistical significance or min_samples
Comment on lines +848 to 852
# --allow-tool shell(find actions/setup/js -name)
# --allow-tool shell(git log -1 --format=)
# --allow-tool shell(git log -3 --format=)
# --allow-tool shell(grep -r)
# --allow-tool shell(grep)
Comment on lines +724 to +726
# --allow-tool shell(grep -l)
# --allow-tool shell(grep -rL)
# --allow-tool shell(grep -rn)
Comment on lines +152 to +156
Use the `analyses` array from `gh aw experiments analyze` (Step 1) for the following fields — no
recomputation is needed:

- **n** (variant count): from `analyses[].variants[].count`
- **min_samples**: from `analyses[].min_samples`
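
A minimal Python sketch of this division of labor: `n` and `min_samples` are copied from the analyze output, and only the success rate is computed from per-run records. The `analyses[].variants[].count` and `analyses[].min_samples` paths come from the excerpt above; the `name` key and the shape of each run record (`variant`, `conclusion`) are assumptions, since the PR does not show them.

```python
from collections import defaultdict


def success_rates(runs):
    """Success rate per variant from per-run records.

    Each record is assumed to look like
    {"variant": "A", "conclusion": "success"}; the real shape of the
    run data is not shown in this PR.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for run in runs:
        totals[run["variant"]] += 1
        if run["conclusion"] == "success":
            wins[run["variant"]] += 1
    return {variant: wins[variant] / totals[variant] for variant in totals}


def variant_rows(analysis, runs):
    """Combine CLI-provided n and min_samples with computed success rates."""
    rates = success_rates(runs)
    return [
        {
            "variant": v["name"],
            "n": v["count"],  # taken from the CLI output, not recomputed
            "min_samples": analysis["min_samples"],
            "success_rate": rates.get(v["name"]),  # None when no runs were seen
        }
        for v in analysis["variants"]
    ]
```

Each returned row corresponds to one line of a per-variant comparison table.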