[ab-advisor] Experiment campaign for deep-report: A/B test output_format #30335

### 🧪 Experiment Campaign: deep-report

**Workflow file**: `.github/workflows/deep-report.md`
**Selected dimension**: `output_format`
**Triggered by**: `ab-testing-advisor` on 2026-05-05

---

### Background

DeepReport is an intelligence-gathering agent that runs every weekday (~15:00 UTC) and synthesizes patterns, trends, and actionable tasks from agent-generated discussions, workflow logs, and recent issues. Its prompt is long (~250 lines), and the current report format prescribes seven distinct sections with heavy formatting requirements. This verbosity may inflate token usage, lengthen runs, and reduce readability, which makes output format a good dimension to experiment on.

### Hypothesis

**Null hypothesis (H0):** The report format has no significant effect on discussion engagement score or token consumption compared to the current full-briefing baseline.

**Alternative hypothesis (H1):** A denser `executive_brief` format reduces token consumption by ≥20% while maintaining or improving discussion engagement (reactions + replies), and the inline source citations added by the `annotated_brief` variant make findings more actionable.

### Experiment Configuration

Add the following `experiments:` block to the workflow frontmatter:

```yaml
experiments:
  output_format:
    variants: [full_briefing, executive_brief, annotated_brief]
    description: "Tests whether report verbosity and structure affect token cost and discussion engagement"
    hypothesis: "H0: no change in discussion engagement or token cost. H1: executive_brief reduces token usage by ≥20% without reducing engagement; annotated_brief improves actionability."
    metric: token_count
    secondary_metrics: [discussion_reactions, discussion_replies, output_char_length, run_duration_ms]
    guardrail_metrics:
      - name: empty_output_rate
        direction: max
        threshold: 0
      - name: issue_creation_success_rate
        direction: min
        threshold: 0.8
    min_samples: 15
    owner: "@team-agents"
    weight: [34, 33, 33]
    start_date: "2026-05-06"
    analysis_type: mann_whitney
    tags: [output-format, token-cost, engagement, daily]
    issue: 0
```

**Variant descriptions**:
- `full_briefing`: Current behavior — seven verbose sections (Executive Summary, Pattern Analysis, Trend Intelligence, Notable Findings, Predictions & Recommendations, Actionable Tasks, Source Attribution).
- `executive_brief`: Condensed single-pass report — 3-sentence executive summary, a flat list of top-5 patterns/findings with bullet points, and the 7 actionable tasks. No trend prose or prediction section.
- `annotated_brief`: Same condensed structure as `executive_brief` but every finding includes an inline citation (discussion/issue/run URL) directly next to the claim rather than in a separate attribution section.
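
Before committing the block, it can help to sanity-check that the variants and weights line up. The sketch below is not part of `gh-aw`; it mirrors the relevant fields in a plain Python dict and assumes the weights are percentages that should sum to 100 (the schema may accept other scales). The `check_experiment` helper is hypothetical.

```python
# Values copied from the `output_format` block above.
experiment = {
    "variants": ["full_briefing", "executive_brief", "annotated_brief"],
    "weight": [34, 33, 33],
    "min_samples": 15,
}


def check_experiment(cfg):
    """Return a list of problems; an empty list means the fields are consistent."""
    problems = []
    variants, weights = cfg["variants"], cfg["weight"]
    if len(set(variants)) != len(variants):
        problems.append("variant names must be unique")
    if len(weights) != len(variants):
        problems.append("need exactly one weight per variant")
    # Assumption: weights are percentages, so they should total 100.
    if sum(weights) != 100:
        problems.append(f"weights sum to {sum(weights)}, expected 100")
    if cfg["min_samples"] < 1:
        problems.append("min_samples must be positive")
    return problems


print(check_experiment(experiment))  # prints []
```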

### Workflow Changes Required

Replace the **Report Structure** guidance in the prompt body with a variant-aware version:

**Before:**
```
## Report Structure

Generate an intelligence briefing with the following sections:

### 🔍 Executive Summary
...
### 📊 Pattern Analysis
...
### 📈 Trend Intelligence
...
### 🚨 Notable Findings
...
### 🔮 Predictions and Recommendations
...
### ✅ Actionable Agentic Tasks (Quick Wins)
...
### 📚 Source Attribution
...
```

**After (using Handlebars conditional blocks):**
```handlebars
## Report Structure

{{#if experiments.output_format "executive_brief"}}
Generate a **condensed intelligence brief** with these sections only:
1. **🔍 Executive Summary** — 3 sentences: overall health, top finding, urgent action.
2. **🚨 Top 5 Findings** — Flat bullet list, one line each, most impactful first.
3. **✅ Actionable Agentic Tasks** — Exactly 7 items as before.
{{else}}{{#if experiments.output_format "annotated_brief"}}
Generate a **condensed intelligence brief with inline citations** with these sections only:
1. **🔍 Executive Summary** — 3 sentences with at least one cited source link per sentence.
2. **🚨 Top 5 Findings** — Flat bullet list, one line each, each ending with `([source](url))`.
3. **✅ Actionable Agentic Tasks** — Exactly 7 items as before, each linking its evidence.
{{else}}
Generate an intelligence briefing with the following sections:

### 🔍 Executive Summary
...
### 📊 Pattern Analysis
...
### 📈 Trend Intelligence
...
### 🚨 Notable Findings
...
### 🔮 Predictions and Recommendations
...
### ✅ Actionable Agentic Tasks (Quick Wins)
...
### 📚 Source Attribution
...
{{/if}}{{/if}}
```
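
To make the branching explicit, here is a minimal Python sketch of how the nested conditionals above resolve, assuming each `{{#if experiments.output_format "<variant>"}}` behaves as an equality check on the assigned variant. The function name `report_sections` is hypothetical and only illustrates the control flow; it is not how the runtime renders the prompt.

```python
CONDENSED = ["Executive Summary", "Top 5 Findings", "Actionable Agentic Tasks"]
FULL = [
    "Executive Summary",
    "Pattern Analysis",
    "Trend Intelligence",
    "Notable Findings",
    "Predictions and Recommendations",
    "Actionable Agentic Tasks (Quick Wins)",
    "Source Attribution",
]


def report_sections(variant):
    """Return (section titles, inline_citations) for the assigned variant."""
    if variant == "executive_brief":
        return CONDENSED, False
    if variant == "annotated_brief":
        return CONDENSED, True
    # Final {{else}} branch: `full_briefing` and any unrecognized value
    # both fall through to the seven-section baseline.
    return FULL, False


for name in ("full_briefing", "executive_brief", "annotated_brief"):
    sections, cited = report_sections(name)
    print(name, len(sections), cited)
```

One consequence of this structure is that a misspelled variant name silently produces the baseline report, so per-variant sample counts are worth checking during the run.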

### Success Metrics

| Metric | Type | Target |
|--------|------|--------|
| `token_count` | Primary | ≥20% reduction vs `full_briefing` for `executive_brief` |
| `discussion_reactions` | Secondary | Must not drop vs baseline |
| `discussion_replies` | Secondary | Must not drop vs baseline |
| `output_char_length` | Secondary | Observe directionality |
| `run_duration_ms` | Secondary | Expect reduction for brief variants |
| `empty_output_rate` | Guardrail | Must remain 0 |
| `issue_creation_success_rate` | Guardrail | Must stay ≥0.8 |
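
The primary target and the two guardrails can be checked mechanically once per-run numbers are available. The following is a sketch under stated assumptions: it compares medians (the issue does not say whether the ≥20% target refers to the mean or the median), the helper names are hypothetical, and the token counts are made up purely for illustration.

```python
from statistics import median


def token_reduction(baseline_tokens, variant_tokens):
    """Fractional reduction in median token count relative to the baseline."""
    base = median(baseline_tokens)
    return (base - median(variant_tokens)) / base


def meets_targets(reduction, empty_output_rate, issue_creation_success_rate):
    """Apply the primary target (>=20% reduction) and both guardrails."""
    return (
        reduction >= 0.20
        and empty_output_rate == 0
        and issue_creation_success_rate >= 0.8
    )


# Illustrative numbers only, not measurements from any run.
full_briefing = [100_000, 120_000, 110_000]
executive_brief = [80_000, 90_000, 85_000]

reduction = token_reduction(full_briefing, executive_brief)
print(f"{reduction:.1%}")                  # prints 22.7%
print(meets_targets(reduction, 0, 0.9))    # prints True
```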

### Statistical Design

- **Variants**: `full_briefing` (34%), `executive_brief` (33%), `annotated_brief` (33%)
- **Assignment**: Round-robin via `gh-aw` experiments runtime (cache-based, repo storage)
- **Minimum runs per variant**: 15 (total ≥45 runs)
- **Expected frequency**: ~5 runs/week (weekdays only)
- **Expected experiment duration**: ~9 weeks from `start_date`
- **Analysis approach**: Pairwise Mann-Whitney U tests on token counts and output length, comparing each brief variant against `full_briefing` (non-parametric, robust to non-normal distributions)
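
For reference, the Mann-Whitney U comparison can be sketched in dependency-free Python. This is an illustrative implementation using the normal approximation with tie and continuity corrections, not the code behind `analysis_type: mann_whitney`; with only 15 samples per variant an exact test (for example from a statistics library) may be preferable. The sample values are invented.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic for sample x, p-value).
    """
    n1, n2 = len(x), len(y)
    # Tag each value with its group (0 for x, 1 for y) and sort by value.
    combined = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks = [0.0] * len(combined)
    tie_term = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    r1 = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0
    # Continuity correction; clamp at 0 so p never exceeds 1.
    z = max((abs(u1 - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    return u1, math.erfc(z / math.sqrt(2))


# Invented token counts, for illustration only.
u, p = mann_whitney_u([81_000, 85_000, 90_000], [100_000, 110_000, 120_000])
print(u, round(p, 3))
```

Because two brief variants are each compared against the baseline, two tests are run; a multiple-comparison adjustment such as Bonferroni (testing each at α/2) is one common way to account for that.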

### Implementation Steps

- [ ] Add `experiments:` section to frontmatter (YAML block above)
- [ ] Add conditional blocks to workflow prompt body using the `{{#if experiments.output_format "<variant>"}}...{{else}}...{{/if}}` syntax
- [ ] Run `gh aw compile deep-report` to regenerate lock file
- [ ] Monitor the experiment state file (`/tmp/gh-aw/experiments/state.json`) that is uploaded as an artifact on each run
- [ ] After 45 total runs (≥15 per variant), analyze variant distribution via workflow run artifacts
- [ ] Document findings and promote winning variant
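
A small readiness check can tell when the 15-per-variant threshold has been reached. The sketch below takes a plain list of variant names, one per completed run; how that list is extracted from the run artifacts is left open, since the layout of `state.json` is not described here. The `readiness` helper and the 5 runs/week constant (taken from the statistical design) are assumptions for illustration.

```python
import math
from collections import Counter

VARIANTS = ["full_briefing", "executive_brief", "annotated_brief"]
MIN_SAMPLES = 15
RUNS_PER_WEEK = 5  # weekdays only


def readiness(assignments):
    """Summarize progress toward MIN_SAMPLES runs for every variant."""
    counts = Counter(assignments)
    remaining = {v: max(0, MIN_SAMPLES - counts[v]) for v in VARIANTS}
    # Lower bound: assumes every future run lands on a variant that still needs samples.
    weeks_left = math.ceil(sum(remaining.values()) / RUNS_PER_WEEK)
    return {
        "remaining": remaining,
        "ready": all(r == 0 for r in remaining.values()),
        "weeks_left": weeks_left,
    }


print(readiness([])["weeks_left"])  # prints 9
```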

### Infrastructure Status

✅ **Infrastructure is complete.** All three advanced experiment schema fields (`analysis_type`, `tags`, `notify`) are fully implemented in both `pkg/workflow/compiler_experiments.go` and `actions/setup/js/pick_experiment.cjs`. No sub-issue is needed.

### References

- [A/B Testing in gh-aw](https://github.com/github/gh-aw/blob/main/.github/aw/github-agentic-workflows.md)
- Workflow file: `.github/workflows/deep-report.md`

> Generated by [Daily A/B Testing Advisor](https://github.com/github/gh-aw/actions/runs/25372257202/agentic_workflow) · ● 568.7K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fab-testing-advisor%22&type=issues)
> - [x] expires on May 19, 2026, 11:02 AM UTC
