Add output_format A/B experiment to daily-code-metrics workflow#32535

Merged
pelikhan merged 4 commits into main from copilot/ab-advisor-experiment-campaign-again on May 16, 2026
Copilot AI commented May 16, 2026

This updates daily-code-metrics to run an A/B experiment on report verbosity (full_detail vs executive_summary) so engagement, token cost, runtime, and chart volume can be measured from real runs. The workflow prompt now supports variant-specific chart generation and report structure while preserving the existing full-detail behavior as control.

  • Experiment wiring in workflow frontmatter

    • Added experiments.output_format with variants, hypothesis, primary/secondary metrics, guardrails, weights, and sampling window.
    • Included a numeric issue placeholder (issue: 1) so the frontmatter compiles against the current schema.
  • Variant-gated chart instructions

    • Wrapped chart requirements in conditional template blocks.
    • full_detail: existing 6-chart specification.
    • executive_summary: reduced to 2 charts (quality_score_breakdown, historical_trends).
  • Variant-gated report template

    • Added conditional report body branch for executive_summary:
      • key metrics line
      • 2 chart embeds
      • 3 recommendation bullets
      • no <details> metric tables
    • Kept existing full report template in else branch for control.
  • Prompt guidance alignment

    • Updated chart/report guidance text to explicitly describe variant-dependent output expectations.
    • Added note to read GH_AW_EXPERIMENTS_OUTPUT_FORMAT in Python logic for branch selection.
  • Compiled workflow output

    • Regenerated .github/workflows/daily-code-metrics.lock.yml to include experiment activation/state plumbing.
experiments:
  output_format:
    variants: [full_detail, executive_summary]
    metric: discussion_engagement_score
    secondary_metrics: [output_token_count, run_duration_seconds, chart_count]

{{#if experiments.output_format == 'executive_summary' }}
### 📊 Key Visualizations
...2 charts...
{{else}}
### 📊 Visualizations
...existing full-detail sections...
{{/if}}
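
The description notes that the Python logic should read GH_AW_EXPERIMENTS_OUTPUT_FORMAT to pick a branch. A minimal sketch of what that selection could look like follows. It assumes the environment variable holds the variant name verbatim and that an unset or unrecognized value should fall back to the control; the helper names `selected_variant` and `charts_for` are hypothetical and not part of the workflow.

```python
import os

# Variant names and the two executive_summary charts come from the PR description.
KNOWN_VARIANTS = ("full_detail", "executive_summary")
EXECUTIVE_SUMMARY_CHARTS = ("quality_score_breakdown", "historical_trends")


def selected_variant(env=None):
    """Return the active variant name, defaulting to the control.

    Assumption: GH_AW_EXPERIMENTS_OUTPUT_FORMAT contains the variant name
    as plain text. Anything else is treated as full_detail.
    """
    env = os.environ if env is None else env
    value = env.get("GH_AW_EXPERIMENTS_OUTPUT_FORMAT", "").strip()
    return value if value in KNOWN_VARIANTS else "full_detail"


def charts_for(variant, all_charts):
    """Pick which charts to generate for a variant.

    all_charts is the caller-supplied full-detail chart list (six charts in
    the workflow). The executive_summary variant keeps only the two charts
    named in the PR description, preserving the caller's order.
    """
    if variant == "executive_summary":
        return [name for name in all_charts if name in EXECUTIVE_SUMMARY_CHARTS]
    return list(all_charts)
```

Defaulting to full_detail keeps the existing behavior whenever the experiment plumbing is absent, which matches the PR's intent of treating full detail as the control.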

Copilot AI and others added 3 commits May 16, 2026 02:45
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Experiment with output format for daily-code-metrics" to "Add output_format A/B experiment to daily-code-metrics workflow" May 16, 2026
Copilot AI requested a review from pelikhan May 16, 2026 02:54
@pelikhan pelikhan marked this pull request as ready for review May 16, 2026 02:55
Copilot AI review requested due to automatic review settings May 16, 2026 02:55
@pelikhan pelikhan merged commit 0392b07 into main May 16, 2026
@pelikhan pelikhan deleted the copilot/ab-advisor-experiment-campaign-again branch May 16, 2026 02:55
Copilot AI left a comment

Pull request overview

Adds an output_format A/B experiment to the daily-code-metrics workflow so real-run metrics (engagement, tokens, runtime, chart volume) can be compared between full_detail and executive_summary report variants.

Changes:

  • Adds experiments.output_format frontmatter metadata and gates chart + report templates on the selected variant.
  • Updates the compiled workflow (.lock.yml) to restore/pick/persist experiment state and plumb GH_AW_EXPERIMENTS_OUTPUT_FORMAT.
  • Regenerates daily-model-inventory.lock.yml (appears unrelated to the PR’s stated scope).
| File | Description |
| --- | --- |
| .github/workflows/daily-code-metrics.md | Defines the experiment and adds variant-gated instructions for chart generation and report structure. |
| .github/workflows/daily-code-metrics.lock.yml | Compiled workflow updates to select variants, pass env vars, and persist experiment state. |
| .github/workflows/daily-model-inventory.lock.yml | Lock file regen (heredoc marker/hash churn; likely incidental). |
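
The compiled workflow is described as restoring, picking, and persisting experiment state. The page does not show how the lock file does this, so the following is only an illustrative sketch of the general pattern: a weighted random pick followed by a counter update in a JSON state file. The function names, the JSON layout, and the state path are all assumptions.

```python
import json
import random
from pathlib import Path


def pick_variant(variants, weights, rng=random):
    """Choose one variant at random, proportionally to its weight."""
    return rng.choices(variants, weights=weights, k=1)[0]


def record_assignment(state_path, experiment, variant):
    """Restore prior state (if any), count this assignment, and persist it.

    Assumed layout: {experiment_name: {variant_name: run_count}}.
    """
    path = Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {}
    counts = state.setdefault(experiment, {})
    counts[variant] = counts.get(variant, 0) + 1
    path.write_text(json.dumps(state, indent=2, sort_keys=True))
    return state
```

With the 50/50 weights from the frontmatter, a call such as `pick_variant(["full_detail", "executive_summary"], [50, 50])` would return either variant with equal probability.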

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 5

## Data Visualization with Python

{{#if experiments.output_format == 'full_detail' }}
Generate **6 high-quality charts** to visualize code metrics and trends using Python, matplotlib, and seaborn. All charts must be uploaded as assets and embedded in the discussion report.
Comment on lines +258 to +262
# Generate required charts for selected variant
# Chart: Quality Score Breakdown
# ... implementation ...

-# Chart 6: Historical Trends
+# Chart: Historical Trends
min_samples: 20
weight: [50, 50]
start_date: "2026-05-16"
issue: 1
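
The min_samples and start_date fields above suggest a readiness gate before the two variants are compared. The exact semantics are not stated on this page, so the sketch below assumes that min_samples applies per variant and that runs only count once start_date has been reached. The function name `ready_to_compare` is hypothetical.

```python
from datetime import date


def ready_to_compare(counts, variants, min_samples, start_date, today):
    """Return True once every variant has enough runs to compare.

    Assumptions: min_samples is a per-variant threshold, and the sampling
    window opens on start_date (inclusive).
    """
    if today < start_date:
        return False
    return all(counts.get(name, 0) >= min_samples for name in variants)
```

For example, with `min_samples=20`, counts of 20 and 19 would not yet qualify under this reading, because the second variant is one run short.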
Comment on lines +581 to +583
cat > "${RUNNER_TEMP}/gh-aw/safeoutputs/config.json" << GH_AW_SAFE_OUTPUTS_CONFIG_4230166978ab76c7_EOF
{"create_discussion":{"category":"audits","close_older_discussions":true,"expires":72,"fallback_to_issue":true,"max":1,"title_prefix":"[daily-code-metrics] "},"create_report_incomplete_issue":{},"missing_data":{},"missing_tool":{},"noop":{"max":1,"report-as-issue":"true"},"push_repo_memory":{"memories":[{"dir":"/tmp/gh-aw/repo-memory/default","id":"default","max_file_count":100,"max_file_size":102400,"max_patch_size":51200}]},"report_incomplete":{},"upload_asset":{"allowed-exts":[".png",".jpg",".jpeg",".svg"],"branch":"assets/${GITHUB_WORKFLOW}","max":5,"max-size":10240}}
-GH_AW_SAFE_OUTPUTS_CONFIG_7874ab997f8429ba_EOF
+GH_AW_SAFE_OUTPUTS_CONFIG_4230166978ab76c7_EOF
Comment on lines +206 to +218
cat << 'GH_AW_PROMPT_04c8251975e742fc_EOF'
<system>
-GH_AW_PROMPT_4beb7ac5e590862b_EOF
+GH_AW_PROMPT_04c8251975e742fc_EOF
cat "${RUNNER_TEMP}/gh-aw/prompts/xpia.md"
cat "${RUNNER_TEMP}/gh-aw/prompts/temp_folder_prompt.md"
cat "${RUNNER_TEMP}/gh-aw/prompts/markdown.md"
cat "${RUNNER_TEMP}/gh-aw/prompts/playwright_prompt.md"
cat "${RUNNER_TEMP}/gh-aw/prompts/safe_outputs_prompt.md"
-cat << 'GH_AW_PROMPT_4beb7ac5e590862b_EOF'
+cat << 'GH_AW_PROMPT_04c8251975e742fc_EOF'
<safe-output-tools>
Tools: create_issue, missing_tool, missing_data, noop
</safe-output-tools>
-GH_AW_PROMPT_4beb7ac5e590862b_EOF
+GH_AW_PROMPT_04c8251975e742fc_EOF
Development

Successfully merging this pull request may close these issues.

[ab-advisor] Experiment campaign for daily-code-metrics: A/B test output_format

3 participants