
[ab-advisor] Experiment campaign for daily-code-metrics: A/B test output_format #32524


🧪 Experiment Campaign: daily-code-metrics

Workflow file: .github/workflows/daily-code-metrics.md
Selected dimension: output_format
Triggered by: ab-testing-advisor on 2026-05-16


Background

The daily-code-metrics workflow runs on a daily schedule and produces a comprehensive code health report with 6 data visualizations, detailed metric tables, a quality score, and 3–5 actionable recommendations. The report is highly information-dense (6 charts plus multiple collapsible tables), which raises an open question: does all that detail actually translate into reader engagement, or would a leaner executive summary be equally useful at lower token cost? Testing output_format directly addresses the signal-to-noise tradeoff in this workflow's report output.

Hypothesis

Null hypothesis (H0): Changing the report format from a full-detail layout (6 charts + full metric tables) to an executive summary (key metrics + 2 charts + recommendations) does not change discussion engagement or quality score accuracy.

Alternative hypothesis (H1): The executive_summary variant produces reports with measurably higher discussion reaction/comment rates (a proxy for engagement), because readers are more likely to scan a concise report, while the full_detail variant produces longer but potentially under-read reports.

Experiment Configuration

Add the following `experiments:` block to the workflow frontmatter:

experiments:
  output_format:
    variants: [full_detail, executive_summary]
    description: "Tests whether a concise executive summary report drives higher reader engagement than the current full-detail 6-chart report."
    hypothesis: "H0: no change in discussion engagement rate. H1: executive_summary variant increases discussion reactions+comments by ≥20% due to improved readability."
    metric: discussion_engagement_score
    secondary_metrics: [output_token_count, run_duration_seconds, chart_count]
    guardrail_metrics:
      - name: report_empty_rate
        direction: min
        threshold: "<=0"
      - name: quality_score_present
        direction: min
        threshold: ">=1"
    min_samples: 20
    weight: [50, 50]
    start_date: "2026-05-16"
    issue: PLACEHOLDER

Variant descriptions:

  • full_detail: Current behaviour — 6 charts (LOC by language, top directories, quality score breakdown, test coverage, code churn, historical trends), full metric tables wrapped in <details> blocks, 3–5 recommendations. Maximally informative.
  • executive_summary: Condensed format — 2 charts only (quality score breakdown + historical trends), a 5-row key-metrics table, and 3 bullet-point recommendations. No <details> tables. Targets quick reading.

Workflow Changes Required

Add the `experiments:` frontmatter block shown above, then wrap the chart generation and detailed tables sections in `{{#if}}` conditional blocks.

Before (current prompt excerpt in the Python script section):

Generate **6 high-quality charts** to visualize code metrics and trends using Python, matplotlib, and seaborn.

### Required Charts

#### 1. LOC by Language (`loc_by_language.png`)
...
#### 2. Top Directories (`top_directories.png`)
...
#### 3. Quality Score Breakdown (`quality_score_breakdown.png`)
...
#### 4. Test Coverage (`test_coverage.png`)
...
#### 5. Code Churn (`code_churn.png`)
...
#### 6. Historical Trends (`historical_trends.png`)

After (with experiment conditional):

{{#if experiments.output_format == "full_detail" }}
Generate **6 high-quality charts** to visualize code metrics and trends using Python, matplotlib, and seaborn.

### Required Charts

#### 1. LOC by Language (`loc_by_language.png`)
...
#### 2. Top Directories (`top_directories.png`)
...
#### 3. Quality Score Breakdown (`quality_score_breakdown.png`)
...
#### 4. Test Coverage (`test_coverage.png`)
...
#### 5. Code Churn (`code_churn.png`)
...
#### 6. Historical Trends (`historical_trends.png`)
{{else}}
Generate **2 high-quality charts** focusing on the most actionable signals:

### Required Charts (Executive Summary Mode)

#### 1. Quality Score Breakdown (`quality_score_breakdown.png`)
... (same spec as full_detail)

#### 2. Historical Trends (`historical_trends.png`)
... (same spec as full_detail)
{{/if}}

Similarly wrap the report body section:

Before (Discussion Structure excerpt):

Brief 2-3 paragraph executive summary highlighting key findings...

### 📊 Visualizations
[6 chart embeds with analysis paragraphs]

<details><summary>📈 Detailed Metrics</summary>
[full tables for size, quality, tests, churn, workflows, docs]
</details>

### 💡 Insights & Recommendations
[3-5 items]

After:

{{#if experiments.output_format == "executive_summary" }}
**Key metrics today**: LOC: X,XXX | Quality score: XX/100 | Test ratio: X.XX | Active files (7d): XXX

### 📊 Key Visualizations
[2 chart embeds: quality score + historical trends]

### 💡 Top Recommendations
- [Recommendation 1]
- [Recommendation 2]
- [Recommendation 3]

*For full metric tables, switch to `full_detail` variant.*
{{else}}
Brief 2-3 paragraph executive summary...
[existing full template]
{{/if}}

Success Metrics

| Metric | Type | Target |
| --- | --- | --- |
| discussion_engagement_score (reactions + comments on the daily report discussion) | Primary | ≥20% higher for the winning variant |
| output_token_count (approximate prompt + completion tokens) | Secondary | executive_summary expected ≥30% lower |
| run_duration_seconds | Secondary | executive_summary expected ≥20% faster |
| chart_count (charts generated per run) | Secondary | 6 vs. 2 (confirms variant assignment worked) |
| report_empty_rate (runs with no discussion created) | Guardrail | Must remain 0 in both variants |
| quality_score_present (quality score in report body) | Guardrail | Must appear in every run |
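
The primary metric maps directly to counts exposed by the GitHub GraphQL API. As a minimal sketch, the score for a given report discussion can be tallied as reactions + comments, matching the definition in the table above. The repository and discussion number below are hypothetical placeholders; the reactions.totalCount and comments.totalCount fields are standard GitHub GraphQL, but how the gh-aw runtime records this metric internally may differ.

```python
# Sketch: tally discussion_engagement_score = reactions + comments
# for one report discussion via the GitHub GraphQL API.
# Assumes GITHUB_TOKEN is set; repo and discussion number are placeholders.
import os
import requests

QUERY = """
query($owner: String!, $name: String!, $number: Int!) {
  repository(owner: $owner, name: $name) {
    discussion(number: $number) {
      reactions { totalCount }
      comments { totalCount }
    }
  }
}
"""

def engagement_score(owner: str, name: str, number: int) -> int:
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY,
              "variables": {"owner": owner, "name": name, "number": number}},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    d = resp.json()["data"]["repository"]["discussion"]
    return d["reactions"]["totalCount"] + d["comments"]["totalCount"]

print(engagement_score("octo-org", "octo-repo", 42))  # hypothetical values
```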

Statistical Design

  • Variants: full_detail (control), executive_summary (treatment)
  • Assignment: Balanced round-robin via gh-aw experiments runtime (cache-based), 50/50 split
  • Minimum runs per variant: 20 (set as min_samples: 20)
  • Runs per day: 1 (daily schedule)
  • Expected experiment duration: ~40 days until minimum sample size reached (20 per variant)
  • Analysis approach: Mann-Whitney U test on discussion engagement scores (non-parametric, appropriate for count data with potential outliers); a minimal sketch follows this list
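
A minimal sketch of that analysis step, assuming per-run engagement scores have already been collected into one list per variant (the numbers here are placeholders, not real data):

```python
# Sketch: Mann-Whitney U test on per-run engagement scores.
# Score lists are placeholder values for illustration only.
from scipy.stats import mannwhitneyu

full_detail = [3, 1, 4, 0, 2, 5, 1, 2, 3, 1]        # placeholder scores
executive_summary = [4, 6, 2, 5, 3, 7, 4, 2, 5, 6]  # placeholder scores

stat, p_value = mannwhitneyu(full_detail, executive_summary, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: engagement differs between variants.")
else:
    print("Fail to reject H0: no detectable difference in engagement.")
```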

Implementation Steps

  • Add the `experiments:` section to the frontmatter (update the `issue:` field after this issue is created)
  • Add `{{#if experiments.output_format == "full_detail" }}` / `{{else}}` / `{{/if}}` conditional blocks around the chart generation instructions in the Python section
  • Add a matching conditional block around the discussion report template section
  • Run `gh aw compile daily-code-metrics` to regenerate the lock file
  • Monitor the experiment artifact uploaded on each run to `/tmp/gh-aw/experiments/state.json`
  • After ~40 runs, download the state.json artifacts from the workflow run history and compute a Mann-Whitney U test on discussion engagement per variant (see the aggregation sketch after this list)
  • Document findings and promote the winning variant
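
To illustrate the last two steps, the sketch below groups engagement scores by variant from downloaded state.json files and runs the test from the Statistical Design section. The schema assumed here (a top-level variant field plus a metrics map containing discussion_engagement_score) is an illustrative guess rather than gh-aw's documented format; adapt the key names to whatever the artifacts actually contain.

```python
# Sketch: aggregate per-run engagement by variant from downloaded
# state.json artifacts (e.g. fetched with `gh run download`), then
# run the Mann-Whitney U test. The keys run["variant"] and
# run["metrics"]["discussion_engagement_score"] are assumed for
# illustration and may not match gh-aw's actual artifact schema.
import json
from pathlib import Path
from scipy.stats import mannwhitneyu

scores = {"full_detail": [], "executive_summary": []}
for path in Path("artifacts").glob("**/state.json"):
    run = json.loads(path.read_text())
    variant = run.get("variant")
    score = run.get("metrics", {}).get("discussion_engagement_score")
    if variant in scores and score is not None:
        scores[variant].append(score)

if all(len(v) >= 20 for v in scores.values()):  # min_samples per variant
    stat, p = mannwhitneyu(scores["full_detail"], scores["executive_summary"],
                           alternative="two-sided")
    print(f"U = {stat:.1f}, p = {p:.4f}")
else:
    print("Not enough samples yet:", {k: len(v) for k, v in scores.items()})
```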

Generated by 🧪 Daily A/B Testing Advisor · expires on May 30, 2026, 2:02 AM UTC