# 🧪 Experiment Campaign: daily-code-metrics

**Workflow file:** `.github/workflows/daily-code-metrics.md`
**Selected dimension:** `output_format`
**Triggered by:** ab-testing-advisor on 2026-05-16
## Background

The `daily-code-metrics` workflow runs on a daily schedule and produces a comprehensive code health report with 6 data visualizations, detailed metric tables, a quality score, and 3–5 actionable recommendations. The report is very information-dense (6 charts plus multiple collapsible tables), which raises an open question: does all that detail actually translate into reader engagement, or would a leaner executive summary be equally useful at lower token cost? Testing `output_format` directly addresses the signal-to-noise tradeoff in this workflow's report output.
## Hypothesis

**Null hypothesis (H0):** Changing the report format from a full-detail layout (6 charts + full metric tables) to an executive summary (key metrics + 2 charts + recommendations) does not change discussion engagement or quality-score accuracy.

**Alternative hypothesis (H1):** The `executive_summary` variant produces reports with measurably higher discussion reaction/comment rates (a proxy for engagement) because readers are more likely to scan a concise report, while the `full_detail` variant produces longer but potentially under-read reports.
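Whether 20 runs per variant can actually detect a 20% lift is worth checking up front. Below is a hedged, simulation-based power estimate; the Poisson model and the baseline of 5 reactions+comments per report are illustrative assumptions, not measured values:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
n_per_variant = 20      # matches min_samples below
baseline_rate = 5.0     # assumed mean engagement per report (illustrative)
lift = 1.20             # the 20% improvement hypothesized in H1

# Estimate power: the fraction of simulated experiments in which a
# one-sided Mann-Whitney U test detects the lift at alpha = 0.05.
hits = 0
n_sims = 2000
for _ in range(n_sims):
    control = rng.poisson(baseline_rate, n_per_variant)
    treatment = rng.poisson(baseline_rate * lift, n_per_variant)
    _, p = mannwhitneyu(treatment, control, alternative="greater")
    hits += p < 0.05

print(f"Estimated power at n={n_per_variant}/variant: {hits / n_sims:.2f}")
```

If the estimated power comes out low, a null result at `min_samples: 20` should be read as inconclusive rather than as evidence for H0.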
## Experiment Configuration

Add the following `experiments:` block to the workflow frontmatter:

```yaml
experiments:
  output_format:
    variants: [full_detail, executive_summary]
    description: "Tests whether a concise executive summary report drives higher reader engagement than the current full-detail 6-chart report."
    hypothesis: "H0: no change in discussion engagement rate. H1: executive_summary variant increases discussion reactions+comments by ≥20% due to improved readability."
    metric: discussion_engagement_score
    secondary_metrics: [output_token_count, run_duration_seconds, chart_count]
    guardrail_metrics:
      - name: report_empty_rate
        direction: min
        threshold: "<=0"
      - name: quality_score_present
        direction: min
        threshold: ">=1"
    min_samples: 20
    weight: [50, 50]
    start_date: "2026-05-16"
    issue: PLACEHOLDER
```
**Variant descriptions:**

- `full_detail` (control): the current behaviour, i.e. 6 charts (LOC by language, top directories, quality score breakdown, test coverage, code churn, historical trends), full metric tables wrapped in `<details>` blocks, and 3–5 recommendations. Maximally informative.
- `executive_summary` (treatment): a condensed format with only 2 charts (quality score breakdown + historical trends), a 5-row key-metrics table, and 3 bullet-point recommendations; no `<details>` tables. Targets quick reading.
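For illustration only, a minimal sketch of what the condensed variant's chart generation could look like in the Python section. The component names and values here are hypothetical; the authoritative chart specs remain in the workflow prompt:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering for CI runners
import matplotlib.pyplot as plt

# Hypothetical component scores; the real workflow derives these
# from repository metrics before this step runs.
quality_components = {
    "Test coverage": 72,
    "Documentation": 64,
    "Code churn": 81,
    "Lint cleanliness": 90,
}

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(list(quality_components), list(quality_components.values()))
ax.set_xlim(0, 100)
ax.set_xlabel("Score (0-100)")
ax.set_title("Quality Score Breakdown")
fig.tight_layout()
fig.savefig("quality_score_breakdown.png", dpi=150)
```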
## Workflow Changes Required

Add the `experiments:` frontmatter block above, then wrap the chart-generation and detailed-tables sections in `{{#if}}` conditional blocks.
**Before** (current prompt excerpt in the Python script section):

```markdown
Generate **6 high-quality charts** to visualize code metrics and trends using Python, matplotlib, and seaborn.

### Required Charts

#### 1. LOC by Language (`loc_by_language.png`)
...
#### 2. Top Directories (`top_directories.png`)
...
#### 3. Quality Score Breakdown (`quality_score_breakdown.png`)
...
#### 4. Test Coverage (`test_coverage.png`)
...
#### 5. Code Churn (`code_churn.png`)
...
#### 6. Historical Trends (`historical_trends.png`)
```
**After** (with experiment conditional):

```markdown
{{#if experiments.output_format == "full_detail" }}
Generate **6 high-quality charts** to visualize code metrics and trends using Python, matplotlib, and seaborn.

### Required Charts

#### 1. LOC by Language (`loc_by_language.png`)
...
#### 2. Top Directories (`top_directories.png`)
...
#### 3. Quality Score Breakdown (`quality_score_breakdown.png`)
...
#### 4. Test Coverage (`test_coverage.png`)
...
#### 5. Code Churn (`code_churn.png`)
...
#### 6. Historical Trends (`historical_trends.png`)
{{else}}
Generate **2 high-quality charts** focusing on the most actionable signals:

### Required Charts (Executive Summary Mode)

#### 1. Quality Score Breakdown (`quality_score_breakdown.png`)
... (same spec as full_detail)
#### 2. Historical Trends (`historical_trends.png`)
... (same spec as full_detail)
{{/if}}
```
Similarly wrap the report body section:
**Before** (Discussion Structure excerpt):

```markdown
Brief 2-3 paragraph executive summary highlighting key findings...

### 📊 Visualizations
[6 chart embeds with analysis paragraphs]

<details><summary>📈 Detailed Metrics</summary>
[full tables for size, quality, tests, churn, workflows, docs]
</details>

### 💡 Insights & Recommendations
[3-5 items]
```
**After:**

```markdown
{{#if experiments.output_format == "executive_summary" }}
**Key metrics today**: LOC: X,XXX | Quality score: XX/100 | Test ratio: X.XX | Active files (7d): XXX

### 📊 Key Visualizations
[2 chart embeds: quality score + historical trends]

### 💡 Top Recommendations
- [Recommendation 1]
- [Recommendation 2]
- [Recommendation 3]

*For full metric tables, switch to the `full_detail` variant.*
{{else}}
Brief 2-3 paragraph executive summary...
[existing full template]
{{/if}}
```
Success Metrics
| Metric |
Type |
Target |
discussion_engagement_score (reactions + comments on daily report discussion) |
Primary |
≥20% higher for winning variant |
output_token_count (approximate prompt+completion tokens) |
Secondary |
executive_summary expected ≥30% lower |
run_duration_seconds |
Secondary |
executive_summary expected ≥20% faster |
chart_count (charts generated per run) |
Secondary |
6 vs 2 (confirming variant assignment worked) |
report_empty_rate (runs with no discussion created) |
Guardrail |
Must remain 0 in both variants |
quality_score_present (quality score in report body) |
Guardrail |
Must appear in every run |
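For reference, the primary metric can be computed straight from the GitHub GraphQL API. A minimal sketch, assuming the report's discussion number is known; the owner, repo, and discussion-number arguments are placeholders, and the token is read from the conventional `GITHUB_TOKEN` environment variable:

```python
import os
import requests

QUERY = """
query($owner: String!, $name: String!, $number: Int!) {
  repository(owner: $owner, name: $name) {
    discussion(number: $number) {
      reactions { totalCount }
      comments { totalCount }
    }
  }
}
"""

def engagement_score(owner: str, name: str, number: int) -> int:
    """Reactions + comments on a discussion, per the primary metric."""
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY,
              "variables": {"owner": owner, "name": name, "number": number}},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    d = resp.json()["data"]["repository"]["discussion"]
    return d["reactions"]["totalCount"] + d["comments"]["totalCount"]
```

Note this counts reactions on the discussion body itself; if reactions on individual comments should also count toward engagement, the query needs to be extended.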
## Statistical Design

- **Variants**: `full_detail` (control), `executive_summary` (treatment)
- **Assignment**: balanced round-robin via the `gh-aw` experiments runtime (cache-based), 50/50 split
- **Minimum runs per variant**: 20 (set as `min_samples: 20`)
- **Runs per day**: 1 (daily schedule)
- **Expected duration**: ~40 days until the minimum sample size is reached (20 per variant)
- **Analysis approach**: Mann-Whitney U test on discussion engagement scores (non-parametric, appropriate for count data with potential outliers); see the sketch below
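To make the analysis step concrete, here is a minimal sketch of the final comparison, assuming the per-run engagement scores have already been extracted into one list per variant (the numbers below are placeholders, and the exact extraction from the downloaded `state.json` artifacts depends on the gh-aw experiments schema):

```python
from scipy.stats import mannwhitneyu

# Engagement scores (reactions + comments) collected per run,
# one list per variant. Placeholder values only.
full_detail = [3, 5, 2, 7, 4, 6, 3, 5, 4, 2, 6, 5, 3, 4, 7, 2, 5, 4, 3, 6]
executive_summary = [6, 8, 4, 7, 9, 5, 6, 8, 7, 4, 9, 6, 5, 7, 8, 6, 4, 7, 9, 5]

# One-sided test matching H1: executive_summary stochastically greater.
stat, p_value = mannwhitneyu(executive_summary, full_detail, alternative="greater")
print(f"U = {stat}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: executive_summary shows higher engagement.")
else:
    print("Fail to reject H0 at alpha = 0.05.")
```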
## Implementation Steps

1. Add the `experiments:` section to the frontmatter (update the `issue:` field after this issue is created)
2. Add `{{#if experiments.output_format == "full_detail" }}` / `{{else}}` / `{{/if}}` conditional blocks around the chart-generation instructions in the Python section
3. Run `gh aw compile daily-code-metrics` to regenerate the lock file
4. Monitor variant assignment via `/tmp/gh-aw/experiments/state.json`
5. Once both variants reach `min_samples`, download the `state.json` artifacts from the workflow run history and compute the Mann-Whitney U statistic on discussion engagement per variant (see the sketch after this list)
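A sketch of the artifact-download step under two assumptions: the compiled workflow is named `daily-code-metrics.lock.yml`, and each run uploads its experiment state under the artifact name `experiments-state` (both names are guesses; check the actual run artifacts):

```python
import json
import subprocess

# List recent runs of the compiled workflow (assumed lock-file name).
runs = json.loads(subprocess.check_output([
    "gh", "run", "list",
    "--workflow", "daily-code-metrics.lock.yml",
    "--limit", "40",
    "--json", "databaseId",
]))

for run in runs:
    run_id = str(run["databaseId"])
    # Artifact name is a placeholder; adjust to the real one.
    subprocess.run([
        "gh", "run", "download", run_id,
        "--name", "experiments-state",
        "--dir", f"state/{run_id}",
    ], check=False)  # tolerate runs that did not upload the artifact
```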
## References

- `.github/workflows/daily-code-metrics.md`
*Generated by 🧪 Daily A/B Testing Advisor*