[ab-advisor] Experiment campaign for deep-report: A/B test output_format #30335

@github-actions

🧪 Experiment Campaign: deep-report

Workflow file: .github/workflows/deep-report.md
Selected dimension: output_format
Triggered by: ab-testing-advisor on 2026-05-05


Background

DeepReport is an intelligence-gathering agent that runs daily (weekdays ~15:00 UTC) and synthesizes patterns, trends, and actionable tasks from across all agent-generated discussions, workflow logs, and recent issues. The prompt is extremely detailed (~250 lines) and the current report format prescribes seven distinct sections with heavy formatting requirements. This verbosity may inflate token usage, increase run duration, and reduce readability — making output format an ideal dimension to experiment on.

Hypothesis

Null hypothesis (H0): The report format has no significant effect on discussion engagement score or token consumption compared to the current full-briefing baseline.

Alternative hypothesis (H1): A denser executive_brief format reduces token consumption by ≥20% while maintaining or improving discussion engagement (reactions + replies), whereas the annotated_brief variant adds inline source citations that improve the actionability of findings.

Experiment Configuration

Add the following experiments: block to the workflow frontmatter:

experiments:
  output_format:
    variants: [full_briefing, executive_brief, annotated_brief]
    description: "Tests whether report verbosity and structure affect token cost and discussion engagement"
    hypothesis: "H0: no change in discussion engagement or token cost. H1: executive_brief reduces token usage by ≥20% without reducing engagement; annotated_brief improves actionability."
    metric: token_count
    secondary_metrics: [discussion_reactions, discussion_replies, output_char_length, run_duration_ms]
    guardrail_metrics:
      - name: empty_output_rate
        direction: min
        threshold: 0
      - name: issue_creation_success_rate
        direction: min
        threshold: 0.8
    min_samples: 15
    owner: "@team-agents"
    weight: [34, 33, 33]
    start_date: "2026-05-06"
    analysis_type: mann_whitney
    tags: [output-format, token-cost, engagement, daily]
    issue: 0

Variant descriptions:

  • full_briefing: Current behavior — seven verbose sections (Executive Summary, Pattern Analysis, Trend Intelligence, Notable Findings, Predictions & Recommendations, Actionable Tasks, Source Attribution).
  • executive_brief: Condensed single-pass report — 3-sentence executive summary, a flat list of top-5 patterns/findings with bullet points, and the 7 actionable tasks. No trend prose or prediction section.
  • annotated_brief: Same condensed structure as executive_brief but every finding includes an inline citation (discussion/issue/run URL) directly next to the claim rather than in a separate attribution section.
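The configuration above can be sanity-checked before it is committed. The sketch below is illustrative only: it assumes that `weight` entries are percentages aligned positionally with `variants` (the schema is not spelled out in this issue) and that the weekday schedule described in the Background yields about 5 runs per week. The helper names `check_config` and `weeks_to_complete` are hypothetical and not part of gh-aw.

```python
import math

# Values copied from the experiments block above.
config = {
    "variants": ["full_briefing", "executive_brief", "annotated_brief"],
    "weight": [34, 33, 33],
    "min_samples": 15,
}

def check_config(cfg):
    """Return a list of problems; an empty list means the basic invariants hold."""
    problems = []
    if len(cfg["weight"]) != len(cfg["variants"]):
        problems.append("weight and variants differ in length")
    if sum(cfg["weight"]) != 100:  # assumption: weights are percentages
        problems.append("weights do not sum to 100")
    if len(set(cfg["variants"])) != len(cfg["variants"]):
        problems.append("duplicate variant names")
    return problems

def weeks_to_complete(cfg, runs_per_week=5):
    """Weeks needed to reach min_samples per variant, assuming an even split of runs."""
    total_runs = cfg["min_samples"] * len(cfg["variants"])
    return math.ceil(total_runs / runs_per_week)

print(check_config(config))       # []
print(weeks_to_complete(config))  # 9
```

With 15 samples for each of three variants, 45 runs are needed in total, which at 5 runs per week comes to 9 weeks.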

Workflow Changes Required

Replace the Report Structure section guidance in the prompt body to make it variant-aware:

Before:

## Report Structure

Generate an intelligence briefing with the following sections:

### 🔍 Executive Summary
...
### 📊 Pattern Analysis
...
### 📈 Trend Intelligence
...
### 🚨 Notable Findings
...
### 🔮 Predictions and Recommendations
...
### ✅ Actionable Agentic Tasks (Quick Wins)
...
### 📚 Source Attribution
...

After (using Handlebars conditional blocks):

## Report Structure

{{#if experiments.output_format "executive_brief"}}
Generate a **condensed intelligence brief** with these sections only:
1. **🔍 Executive Summary** — 3 sentences: overall health, top finding, urgent action.
2. **🚨 Top 5 Findings** — Flat bullet list, one line each, most impactful first.
3. **✅ Actionable Agentic Tasks** — Exactly 7 items as before.
{{else}}{{#if experiments.output_format "annotated_brief"}}
Generate a **condensed intelligence brief with inline citations** with these sections only:
1. **🔍 Executive Summary** — 3 sentences with at least one cited source link per sentence.
2. **🚨 Top 5 Findings** — Flat bullet list, one line each, each ending with `([source](url))`.
3. **✅ Actionable Agentic Tasks** — Exactly 7 items as before, each linking its evidence.
{{else}}
Generate an intelligence briefing with the following sections:

### 🔍 Executive Summary
...
### 📊 Pattern Analysis
...
### 📈 Trend Intelligence
...
### 🚨 Notable Findings
...
### 🔮 Predictions and Recommendations
...
### ✅ Actionable Agentic Tasks (Quick Wins)
...
### 📚 Source Attribution
...
{{/if}}{{/if}}

Success Metrics

| Metric | Type | Target |
| --- | --- | --- |
| token_count | Primary | ≥20% reduction vs full_briefing for executive_brief |
| discussion_reactions | Secondary | Must not drop vs baseline |
| output_char_length | Secondary | Observe directionality |
| run_duration_ms | Secondary | Expect reduction for brief variants |
| empty_output_rate | Guardrail | Must remain 0 |
| issue_creation_success_rate | Guardrail | Must stay ≥0.8 |
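As a rough illustration of how the primary target and the two guardrails could be evaluated once per-run data is collected, here is a minimal sketch. It assumes token counts are compared by median, which this issue does not specify. The function names and the sample numbers are hypothetical, not measured results.

```python
from statistics import median

def relative_reduction(baseline, candidate):
    """Fractional reduction of the candidate median relative to the baseline median."""
    base = median(baseline)
    return (base - median(candidate)) / base

def meets_targets(baseline_tokens, variant_tokens,
                  empty_output_rate, issue_creation_success_rate):
    """Check the primary token target (>=20% reduction) and both guardrails."""
    return {
        "token_reduction_ok": relative_reduction(baseline_tokens, variant_tokens) >= 0.20,
        "empty_output_ok": empty_output_rate == 0,
        "issue_creation_ok": issue_creation_success_rate >= 0.8,
    }

# Illustrative numbers only: medians 1100 vs 850, a reduction of about 22.7%.
result = meets_targets([1000, 1100, 1200], [800, 850, 900], 0.0, 0.9)
print(result)
```

A variant would count as a candidate for promotion only if all three flags are true. The secondary metrics would still need a separate review.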

Statistical Design

  • Variants: full_briefing (34%), executive_brief (33%), annotated_brief (33%)
  • Assignment: Round-robin via gh-aw experiments runtime (cache-based, repo storage)
  • Minimum runs per variant: 15 (total ≥45 runs)
  • Expected frequency: ~5 runs/week (weekdays only)
  • Expected experiment duration: ~9 weeks from start_date
  • Analysis approach: Mann-Whitney U test on token counts and output length (non-parametric, robust to non-normal distributions)
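For readers unfamiliar with the analysis step, the following is a self-contained sketch of a two-sided Mann-Whitney U test. It uses the normal approximation with a tie correction and no continuity correction, so it is not necessarily the computation that gh-aw performs for `analysis_type: mann_whitney`. With three variants, the test would be run pairwise (for example, each brief variant against full_briefing), and a multiple-comparison adjustment may be worth considering.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U, two-sided p) using the tie-corrected normal approximation."""
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    n = n1 + n2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * max(0.0, (n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0  # all observations identical: no evidence of a difference
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u, min(p, 1.0)
```

Calling `mann_whitney_u(full_briefing_tokens, executive_brief_tokens)` on two lists of per-run token counts would give the U statistic and an approximate p-value. At 15 samples per variant the normal approximation is usually adequate, though an exact test is an option for small or heavily tied samples.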

Implementation Steps

  • Add experiments: section to frontmatter (YAML block above)
  • Add conditional blocks to workflow prompt body using the {{#if experiments.output_format "<variant>"}}...{{else}}...{{/if}} syntax
  • Run gh aw compile deep-report to regenerate the lock file
  • Monitor the experiment state that each run writes to /tmp/gh-aw/experiments/state.json and uploads as an artifact
  • After 45 total runs (≥15 per variant), analyze variant distribution via workflow run artifacts
  • Document findings and promote winning variant

Infrastructure Status

Infrastructure is complete. All three advanced experiment schema fields (analysis_type, tags, notify) are fully implemented in both pkg/workflow/compiler_experiments.go and actions/setup/js/pick_experiment.cjs. No sub-issue is needed.

References

Generated by Daily A/B Testing Advisor · ● 568.7K ·

  • expires on May 19, 2026, 11:02 AM UTC
