
[ab-advisor] Experiment campaign for daily-code-metrics: A/B test output_format #32524


🧪 Experiment Campaign: daily-code-metrics

Workflow file: .github/workflows/daily-code-metrics.md
Selected dimension: output_format
Triggered by: ab-testing-advisor on 2026-05-16


Background

The daily-code-metrics workflow runs on a daily schedule and produces a comprehensive code health report with 6 data visualizations, detailed metric tables, a quality score, and 3–5 actionable recommendations. The report is highly information-dense (6 charts plus multiple collapsible tables), which raises an open question: does all that detail actually translate into reader engagement, or would a leaner executive summary be equally useful at lower token cost? Testing output_format directly addresses the signal-to-noise tradeoff in this workflow's report output.

Hypothesis

Null hypothesis (H0): Changing the report format from a full-detail layout (6 charts + full metric tables) to an executive summary (key metrics + 2 charts + recommendations) does not change discussion engagement or quality score accuracy.

Alternative hypothesis (H1): The executive_summary variant produces reports with measurably higher discussion reaction/comment rates (a proxy for engagement), because readers are more likely to scan a concise report, while the full_detail variant produces longer but potentially under-read reports.

Experiment Configuration

Add the following `experiments:` block to the workflow frontmatter:

experiments:
  output_format:
    variants: [full_detail, executive_summary]
    description: "Tests whether a concise executive summary report drives higher reader engagement than the current full-detail 6-chart report."
    hypothesis: "H0: no change in discussion engagement rate. H1: executive_summary variant increases discussion reactions+comments by ≥20% due to improved readability."
    metric: discussion_engagement_score
    secondary_metrics: [output_token_count, run_duration_seconds, chart_count]
    guardrail_metrics:
      - name: report_empty_rate
        direction: min
        threshold: "<=0"
      - name: quality_score_present
        direction: min
        threshold: ">=1"
    min_samples: 20
    weight: [50, 50]
    start_date: "2026-05-16"
    issue: PLACEHOLDER

Variant descriptions:

  • full_detail: Current behaviour — 6 charts (LOC by language, top directories, quality score breakdown, test coverage, code churn, historical trends), full metric tables wrapped in <details> blocks, 3–5 recommendations. Maximally informative.
  • executive_summary: Condensed format — 2 charts only (quality score breakdown + historical trends), a 5-row key-metrics table, and 3 bullet-point recommendations. No <details> tables. Targets quick reading.

Workflow Changes Required

Add the `experiments:` frontmatter block shown above, then wrap the chart generation and detailed tables sections in `{{#if}}` conditional blocks.

Before (current prompt excerpt in the Python script section):

Generate **6 high-quality charts** to visualize code metrics and trends using Python, matplotlib, and seaborn.

### Required Charts

#### 1. LOC by Language (`loc_by_language.png`)
...
#### 2. Top Directories (`top_directories.png`)
...
#### 3. Quality Score Breakdown (`quality_score_breakdown.png`)
...
#### 4. Test Coverage (`test_coverage.png`)
...
#### 5. Code Churn (`code_churn.png`)
...
#### 6. Historical Trends (`historical_trends.png`)

After (with experiment conditional):

{{#if experiments.output_format == "full_detail" }}
Generate **6 high-quality charts** to visualize code metrics and trends using Python, matplotlib, and seaborn.

### Required Charts

#### 1. LOC by Language (`loc_by_language.png`)
...
#### 2. Top Directories (`top_directories.png`)
...
#### 3. Quality Score Breakdown (`quality_score_breakdown.png`)
...
#### 4. Test Coverage (`test_coverage.png`)
...
#### 5. Code Churn (`code_churn.png`)
...
#### 6. Historical Trends (`historical_trends.png`)
{{else}}
Generate **2 high-quality charts** focusing on the most actionable signals:

### Required Charts (Executive Summary Mode)

#### 1. Quality Score Breakdown (`quality_score_breakdown.png`)
... (same spec as full_detail)

#### 2. Historical Trends (`historical_trends.png`)
... (same spec as full_detail)
{{/if}}

Similarly wrap the report body section:

Before (Discussion Structure excerpt):

Brief 2-3 paragraph executive summary highlighting key findings...

### 📊 Visualizations
[6 chart embeds with analysis paragraphs]

<details><summary>📈 Detailed Metrics</summary>
[full tables for size, quality, tests, churn, workflows, docs]
</details>

### 💡 Insights & Recommendations
[3-5 items]

After:

{{#if experiments.output_format == "executive_summary" }}
**Key metrics today**: LOC: X,XXX | Quality score: XX/100 | Test ratio: X.XX | Active files (7d): XXX

### 📊 Key Visualizations
[2 chart embeds: quality score + historical trends]

### 💡 Top Recommendations
- [Recommendation 1]
- [Recommendation 2]
- [Recommendation 3]

*For full metric tables, switch to `full_detail` variant.*
{{else}}
Brief 2-3 paragraph executive summary...
[existing full template]
{{/if}}

Success Metrics

| Metric | Type | Target |
| --- | --- | --- |
| discussion_engagement_score (reactions + comments on the daily report discussion) | Primary | ≥20% higher for the winning variant |
| output_token_count (approximate prompt + completion tokens) | Secondary | executive_summary expected ≥30% lower |
| run_duration_seconds | Secondary | executive_summary expected ≥20% faster |
| chart_count (charts generated per run) | Secondary | 6 vs. 2 (confirms variant assignment worked) |
| report_empty_rate (runs with no discussion created) | Guardrail | Must remain 0 in both variants |
| quality_score_present (quality score in report body) | Guardrail | Must appear in every run |
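
The primary metric maps directly to counts exposed by the GitHub GraphQL API. As a minimal sketch, the score for a given report discussion can be tallied as reactions + comments, matching the definition in the table above. The repository and discussion number below are hypothetical placeholders; the reactions.totalCount and comments.totalCount fields are standard GitHub GraphQL, but how the gh-aw runtime records this metric internally may differ.

```python
# Sketch: tally discussion_engagement_score = reactions + comments
# for one report discussion via the GitHub GraphQL API.
# Assumes GITHUB_TOKEN is set; repo and discussion number are placeholders.
import os
import requests

QUERY = """
query($owner: String!, $name: String!, $number: Int!) {
  repository(owner: $owner, name: $name) {
    discussion(number: $number) {
      reactions { totalCount }
      comments { totalCount }
    }
  }
}
"""

def engagement_score(owner: str, name: str, number: int) -> int:
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY,
              "variables": {"owner": owner, "name": name, "number": number}},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    d = resp.json()["data"]["repository"]["discussion"]
    return d["reactions"]["totalCount"] + d["comments"]["totalCount"]

print(engagement_score("octo-org", "octo-repo", 42))  # hypothetical values
```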

Statistical Design

  • Variants: full_detail (control), executive_summary (treatment)
  • Assignment: Balanced round-robin via gh-aw experiments runtime (cache-based), 50/50 split
  • Minimum runs per variant: 20 (set as min_samples: 20)
  • Runs per day: 1 (daily schedule)
  • Expected experiment duration: ~40 days until minimum sample size reached (20 per variant)
  • Analysis approach: Mann-Whitney U test on discussion engagement scores (non-parametric, appropriate for count data with potential outliers); a minimal sketch follows this list
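
A minimal sketch of that analysis step, assuming per-run engagement scores have already been collected into one list per variant (the numbers here are placeholders, not real data):

```python
# Sketch: Mann-Whitney U test on per-run engagement scores.
# Score lists are placeholder values for illustration only.
from scipy.stats import mannwhitneyu

full_detail = [3, 1, 4, 0, 2, 5, 1, 2, 3, 1]        # placeholder scores
executive_summary = [4, 6, 2, 5, 3, 7, 4, 2, 5, 6]  # placeholder scores

stat, p_value = mannwhitneyu(full_detail, executive_summary, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: engagement differs between variants.")
else:
    print("Fail to reject H0: no detectable difference in engagement.")
```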

Implementation Steps

  • Add the `experiments:` section to the frontmatter (update the `issue:` field after this issue is created)
  • Add `{{#if experiments.output_format == "full_detail" }}` / `{{else}}` / `{{/if}}` conditional blocks around the chart generation instructions in the Python section
  • Add a matching conditional block around the discussion report template section
  • Run `gh aw compile daily-code-metrics` to regenerate the lock file
  • Monitor the experiment artifact uploaded on each run to `/tmp/gh-aw/experiments/state.json`
  • After ~40 runs, download the state.json artifacts from the workflow run history and compute a Mann-Whitney U test on discussion engagement per variant (see the aggregation sketch after this list)
  • Document findings and promote the winning variant
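
To illustrate the last two steps, the sketch below groups engagement scores by variant from downloaded state.json files and runs the test from the Statistical Design section. The schema assumed here (a top-level variant field plus a metrics map containing discussion_engagement_score) is an illustrative guess rather than gh-aw's documented format; adapt the key names to whatever the artifacts actually contain.

```python
# Sketch: aggregate per-run engagement by variant from downloaded
# state.json artifacts (e.g. fetched with `gh run download`), then
# run the Mann-Whitney U test. The keys run["variant"] and
# run["metrics"]["discussion_engagement_score"] are assumed for
# illustration and may not match gh-aw's actual artifact schema.
import json
from pathlib import Path
from scipy.stats import mannwhitneyu

scores = {"full_detail": [], "executive_summary": []}
for path in Path("artifacts").glob("**/state.json"):
    run = json.loads(path.read_text())
    variant = run.get("variant")
    score = run.get("metrics", {}).get("discussion_engagement_score")
    if variant in scores and score is not None:
        scores[variant].append(score)

if all(len(v) >= 20 for v in scores.values()):  # min_samples per variant
    stat, p = mannwhitneyu(scores["full_detail"], scores["executive_summary"],
                           alternative="two-sided")
    print(f"U = {stat:.1f}, p = {p:.4f}")
else:
    print("Not enough samples yet:", {k: len(v) for k, v in scores.items()})
```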

Generated by 🧪 Daily A/B Testing Advisor · expires on May 30, 2026, 2:02 AM UTC