Add output_format A/B experiment to daily-code-metrics workflow #32535
Merged
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Experiment with output format for daily-code-metrics" to "Add output_format A/B experiment to daily-code-metrics workflow" on May 16, 2026
Pull request overview
Adds an output_format A/B experiment to the daily-code-metrics workflow so real-run metrics (engagement, tokens, runtime, chart volume) can be compared between full_detail and executive_summary report variants.
Changes:
- Adds `experiments.output_format` frontmatter metadata and gates chart + report templates on the selected variant.
- Updates the compiled workflow (`.lock.yml`) to restore/pick/persist experiment state and plumb `GH_AW_EXPERIMENTS_OUTPUT_FORMAT`.
- Regenerates `daily-model-inventory.lock.yml` (appears unrelated to the PR’s stated scope).
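The restore/pick/persist cycle mentioned above can be pictured with a minimal Python sketch. This is not the compiled workflow's implementation, which is not shown here. The JSON state layout, the function names, and the reading of `min_samples` as a per-variant threshold are all assumptions made for illustration; only the variant names and the `weight: [50, 50]` and `min_samples: 20` values come from the PR.

```python
import json
import random
from pathlib import Path

VARIANTS = ["full_detail", "executive_summary"]
WEIGHTS = [50, 50]  # mirrors `weight: [50, 50]` in the frontmatter


def load_state(path: Path) -> dict:
    """Restore per-variant run counts, or start empty if no state exists yet.

    Hypothetical layout: {"output_format": {"full_detail": 3, ...}}
    """
    if path.exists():
        return json.loads(path.read_text())
    return {}


def pick_variant(rng: random.Random) -> str:
    """Pick one variant at random, proportionally to WEIGHTS."""
    return rng.choices(VARIANTS, weights=WEIGHTS, k=1)[0]


def persist_state(path: Path, state: dict, experiment: str, variant: str) -> dict:
    """Increment the run count for the chosen variant and write the state back."""
    counts = state.setdefault(experiment, {})
    counts[variant] = counts.get(variant, 0) + 1
    path.write_text(json.dumps(state, indent=2, sort_keys=True))
    return state


def has_enough_samples(state: dict, experiment: str, min_samples: int = 20) -> bool:
    """Assumption: min_samples applies to each variant separately."""
    counts = state.get(experiment, {})
    return all(counts.get(v, 0) >= min_samples for v in VARIANTS)
```

Passing an explicit `random.Random` instance keeps the pick reproducible in tests; a real run would presumably use an unseeded generator.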
| File | Description |
|---|---|
| `.github/workflows/daily-code-metrics.md` | Defines the experiment and adds variant-gated instructions for chart generation and report structure. |
| `.github/workflows/daily-code-metrics.lock.yml` | Compiled workflow updates to select variants, pass env vars, and persist experiment state. |
| `.github/workflows/daily-model-inventory.lock.yml` | Lock file regen (heredoc marker/hash churn; likely incidental). |
Copilot's findings
- Files reviewed: 3/3 changed files
- Comments generated: 5
| ## Data Visualization with Python | ||
|
|
||
| {{#if experiments.output_format == 'full_detail' }} | ||
| Generate **6 high-quality charts** to visualize code metrics and trends using Python, matplotlib, and seaborn. All charts must be uploaded as assets and embedded in the discussion report. |
Comment on lines +258 to +262
| # Generate required charts for selected variant | ||
| # Chart: Quality Score Breakdown | ||
| # ... implementation ... | ||
|
|
||
| # Chart 6: Historical Trends | ||
| # Chart: Historical Trends |
| min_samples: 20 | ||
| weight: [50, 50] | ||
| start_date: "2026-05-16" | ||
| issue: 1 |
Comment on lines +581 to +583
| cat > "${RUNNER_TEMP}/gh-aw/safeoutputs/config.json" << GH_AW_SAFE_OUTPUTS_CONFIG_4230166978ab76c7_EOF | ||
| {"create_discussion":{"category":"audits","close_older_discussions":true,"expires":72,"fallback_to_issue":true,"max":1,"title_prefix":"[daily-code-metrics] "},"create_report_incomplete_issue":{},"missing_data":{},"missing_tool":{},"noop":{"max":1,"report-as-issue":"true"},"push_repo_memory":{"memories":[{"dir":"/tmp/gh-aw/repo-memory/default","id":"default","max_file_count":100,"max_file_size":102400,"max_patch_size":51200}]},"report_incomplete":{},"upload_asset":{"allowed-exts":[".png",".jpg",".jpeg",".svg"],"branch":"assets/${GITHUB_WORKFLOW}","max":5,"max-size":10240}} | ||
| GH_AW_SAFE_OUTPUTS_CONFIG_7874ab997f8429ba_EOF | ||
| GH_AW_SAFE_OUTPUTS_CONFIG_4230166978ab76c7_EOF |
Comment on lines +206 to +218
| cat << 'GH_AW_PROMPT_04c8251975e742fc_EOF' | ||
| <system> | ||
| GH_AW_PROMPT_4beb7ac5e590862b_EOF | ||
| GH_AW_PROMPT_04c8251975e742fc_EOF | ||
| cat "${RUNNER_TEMP}/gh-aw/prompts/xpia.md" | ||
| cat "${RUNNER_TEMP}/gh-aw/prompts/temp_folder_prompt.md" | ||
| cat "${RUNNER_TEMP}/gh-aw/prompts/markdown.md" | ||
| cat "${RUNNER_TEMP}/gh-aw/prompts/playwright_prompt.md" | ||
| cat "${RUNNER_TEMP}/gh-aw/prompts/safe_outputs_prompt.md" | ||
| cat << 'GH_AW_PROMPT_4beb7ac5e590862b_EOF' | ||
| cat << 'GH_AW_PROMPT_04c8251975e742fc_EOF' | ||
| <safe-output-tools> | ||
| Tools: create_issue, missing_tool, missing_data, noop | ||
| </safe-output-tools> | ||
| GH_AW_PROMPT_4beb7ac5e590862b_EOF | ||
| GH_AW_PROMPT_04c8251975e742fc_EOF |
This updates `daily-code-metrics` to run an A/B experiment on report verbosity (`full_detail` vs `executive_summary`) so engagement, token cost, runtime, and chart volume can be measured from real runs. The workflow prompt now supports variant-specific chart generation and report structure while preserving the existing full-detail behavior as control.

- **Experiment wiring in workflow frontmatter**
  - Adds `experiments.output_format` with variants, hypothesis, primary/secondary metrics, guardrails, weights, and sampling window.
  - Adds the `issue` placeholder required by the current schema.
- **Variant-gated chart instructions**
  - `full_detail`: existing 6-chart specification.
  - `executive_summary`: reduced to 2 charts (`quality_score_breakdown`, `historical_trends`).
- **Variant-gated report template**
  - `executive_summary`: [...] `<details>` metric tables [...] `else` branch for control.
- **Prompt guidance alignment**
  - References `GH_AW_EXPERIMENTS_OUTPUT_FORMAT` in Python logic for branch selection.
- **Compiled workflow output**
  - Updates `.github/workflows/daily-code-metrics.lock.yml` to include experiment activation/state plumbing.
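As a rough illustration of the branch selection described above, the following Python sketch reads `GH_AW_EXPERIMENTS_OUTPUT_FORMAT` and narrows the chart list for the `executive_summary` variant. The helper names and the fallback to `full_detail` when the variable is unset or unrecognized are assumptions, not the workflow's actual code. The PR names only two charts, so the remaining entries in the example list are placeholders.

```python
import os

CONTROL = "full_detail"
TREATMENT = "executive_summary"
# The two charts the PR keeps for the executive_summary variant.
EXECUTIVE_CHARTS = ("quality_score_breakdown", "historical_trends")


def selected_variant(env=None) -> str:
    """Read the variant from the environment.

    Assumption: unset or unknown values fall back to the control variant.
    """
    env = os.environ if env is None else env
    value = env.get("GH_AW_EXPERIMENTS_OUTPUT_FORMAT", "").strip()
    return value if value in (CONTROL, TREATMENT) else CONTROL


def charts_for_variant(variant: str, all_charts: list) -> list:
    """Return the charts to generate for the given variant, preserving order."""
    if variant == TREATMENT:
        return [c for c in all_charts if c in EXECUTIVE_CHARTS]
    return list(all_charts)


# Placeholder names except for the two charts named in the PR.
EXAMPLE_CHARTS = [
    "quality_score_breakdown",
    "placeholder_chart_2",
    "placeholder_chart_3",
    "placeholder_chart_4",
    "placeholder_chart_5",
    "historical_trends",
]
```

Calling `charts_for_variant(selected_variant(), EXAMPLE_CHARTS)` would then yield either the full list or only the two executive-summary charts, depending on the environment.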