[ab-advisor] Experiment campaign for blog-auditor: A/B test prompt_style #32603

### 🧪 Experiment Campaign: blog-auditor

**Workflow file**: `.github/workflows/blog-auditor.md`
**Selected dimension**: `prompt_style`
**Triggered by**: `ab-testing-advisor` on 2026-05-16

---

### Background

The `blog-auditor` workflow performs a weekly automated audit of the GitHub Next Agentic Workflows blog page, using Playwright to validate accessibility, content integrity, and code snippet correctness. The current prompt is highly prescriptive: it includes exact bash commands, explicit code blocks for every step, and full Markdown templates for both pass and fail reports. This level of detail may consume unnecessary tokens and force the agent into rigid step-by-step execution instead of letting it adapt to what it finds. Testing a concise prompt style against the current detailed one will show whether the extra verbosity adds measurable quality or only cost.

### Hypothesis

**Null hypothesis (H0)**: A concise prompt variant produces the same discussion quality and validation correctness as the current detailed prompt.

**Alternative hypothesis (H1)**: A concise prompt reduces effective token consumption and run duration by ≥20% while maintaining equivalent audit correctness (zero false-negative audits, same keyword-check pass rate).

### Experiment Configuration

Add the following `experiments:` block to the workflow frontmatter:

```yaml
experiments:
  prompt_style:
    variants: [detailed, concise]
    description: "Tests whether a high-level goal-oriented prompt produces the same audit quality as the current step-by-step detailed instructions"
    hypothesis: "H0: no change in audit correctness or discussion quality. H1: concise variant reduces token cost ≥20% with no degradation in validation accuracy"
    metric: effective_token_count
    secondary_metrics: [run_duration_ms, discussion_created, validation_pass_rate]
    guardrail_metrics:
      - name: empty_output_rate
        direction: min
        threshold: 0
      - name: missed_validation_failures
        direction: min
        threshold: 0
    min_samples: 20
    weight: [50, 50]
    start_date: "2026-05-16"
    analysis_type: mann_whitney
    tags: [prompt-engineering, cost-optimization, blog-auditor]
    notify:
      issue: 0
```

> **Note**: Replace `issue: 0` with this issue's number after creation.
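
Before compiling, it can help to sanity-check the block for the easy-to-miss problems mentioned above, in particular the `issue: 0` placeholder. The following is a minimal sketch, not part of `gh-aw`: `check_experiment` is a hypothetical helper that operates on the experiment already loaded as a Python dict, and the rule that `weight` needs one entry per variant is an assumption about the schema.

```python
def check_experiment(name: str, spec: dict) -> list[str]:
    """Return human-readable problems found in one experiment definition.

    Hypothetical helper: the checks below are assumptions about what a
    well-formed experiment looks like, not rules enforced by gh-aw itself.
    """
    problems = []
    variants = spec.get("variants", [])
    weights = spec.get("weight", [])
    if len(variants) < 2:
        problems.append(f"{name}: need at least two variants")
    # Assumed rule: one weight per variant when weights are given.
    if weights and len(weights) != len(variants):
        problems.append(
            f"{name}: weight has {len(weights)} entries for {len(variants)} variants"
        )
    # The issue body asks for the placeholder 0 to be replaced after creation.
    if spec.get("notify", {}).get("issue", 0) == 0:
        problems.append(f"{name}: notify.issue is still the placeholder 0")
    return problems


# Subset of the configuration shown above, as a plain dict.
spec = {
    "variants": ["detailed", "concise"],
    "weight": [50, 50],
    "min_samples": 20,
    "notify": {"issue": 0},
}
for problem in check_experiment("prompt_style", spec):
    print(problem)
```

With the configuration as written, the only problem reported is the unreplaced `notify.issue` placeholder.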

**Variant descriptions**:
- `detailed`: Current behavior, with full step-by-step instructions, explicit bash commands, exact Markdown templates, and an exhaustive success-criteria checklist.
- `concise`: High-level goal-oriented instructions that describe *what* to achieve (navigate, validate, report) without prescribing exact commands or pre-written markdown templates. The agent selects its own approach.

### Workflow Changes Required

Wrap the detailed instruction body with a conditional block and add a concise alternative. The `experiments.prompt_style` reference is resolved at compile time.

**Before** (current body opening):
```markdown
## Audit Process

### Phase 1: Navigate and Capture Blog Content

Use Playwright to navigate to the target URL and capture the accessibility snapshot:

1. **Navigate to URL**: Run `playwright-cli browser_navigate --url (githubnext.com/redacted)` to load the page
...
```

**After** (conditional wrap):
```markdown
{{#if experiments.prompt_style == "concise" }}
## Audit Process

Navigate to `(githubnext.com/redacted)` using Playwright, capture the accessibility snapshot, and validate:

- HTTP status is 200
- Final URL is within `githubnext.com` / `www.githubnext.com`
- Content length exceeds 5,000 characters
- All required keywords present: `agentic-workflows`, `GitHub`, `workflow`, `compiler`
- Any YAML/Markdown workflow code snippets pass `gh aw compile --no-emit --validate`

Create a discussion in the **Audits** category titled `[audit] Agentic Workflows blog audit - PASSED` (or `FAILED`). Include a summary table of each check with pass/fail status and the values observed. For failures, add suggested remediation steps.
{{else}}
## Audit Process

### Phase 1: Navigate and Capture Blog Content
... (existing detailed content unchanged)
{{/if}}
```
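
To make the effect of the conditional wrap concrete, here is a rough Python sketch of how such a block could be resolved once a variant is assigned. It is an illustration only: the regular expression and the `resolve_experiment_blocks` helper are assumptions, handle only non-nested `{{#if}}` / `{{else}}` / `{{/if}}` blocks, and do not reflect how the `gh-aw` compiler actually processes templates.

```python
import re

# Hypothetical pattern for the block syntax shown above; the real gh-aw
# compiler may parse these conditionals differently.
_IF_BLOCK = re.compile(
    r'\{\{#if\s+experiments\.(\w+)\s*==\s*"([^"]*)"\s*\}\}\n?'  # opening tag
    r'(.*?)'                                                     # "if" branch
    r'(?:\{\{else\}\}\n?(.*?))?'                                 # optional "else" branch
    r'\{\{/if\}\}\n?',                                           # closing tag
    re.DOTALL,
)


def resolve_experiment_blocks(body: str, assignment: dict[str, str]) -> str:
    """Keep only the branch that matches the assigned variant.

    `assignment` maps experiment names to the chosen variant,
    for example {"prompt_style": "concise"}.
    """
    def pick(match: re.Match) -> str:
        name, expected, if_branch, else_branch = match.groups()
        if assignment.get(name) == expected:
            return if_branch
        return else_branch or ""

    return _IF_BLOCK.sub(pick, body)


template = (
    '{{#if experiments.prompt_style == "concise" }}\n'
    "## Audit Process (concise)\n"
    "{{else}}\n"
    "## Audit Process (detailed)\n"
    "{{/if}}\n"
)
print(resolve_experiment_blocks(template, {"prompt_style": "concise"}))
print(resolve_experiment_blocks(template, {"prompt_style": "detailed"}))
```

The point of the sketch is that each run sees exactly one of the two bodies, so the `detailed` branch must keep the existing instructions verbatim for the baseline to stay comparable.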

### Success Metrics

| Metric | Type | Target |
|--------|------|--------|
| effective_token_count | Primary | ≥20% reduction in concise variant |
| run_duration_ms | Secondary | ≥15% reduction in concise variant |
| discussion_created | Secondary | 100% in both variants |
| validation_pass_rate | Secondary | Equal across variants |
| empty_output_rate | Guardrail | Must remain 0% |
| missed_validation_failures | Guardrail | Must remain 0% |
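
The targets in the table can be checked mechanically once per-variant samples are available. The sketch below is one possible reading of them, under stated assumptions: reductions are measured on the median (the table does not say whether mean or median is intended), the input dict layout and the `evaluate_targets` name are hypothetical, and the `validation_pass_rate` comparison is left out because "equal" needs a tolerance that the table does not define.

```python
from statistics import median


def relative_reduction(baseline: list[float], treatment: list[float]) -> float:
    """Fractional drop in the median from baseline to treatment (0.2 == 20%)."""
    base = median(baseline)
    return (base - median(treatment)) / base


def evaluate_targets(detailed: dict, concise: dict) -> dict[str, bool]:
    """Check the table's targets; True means the target is met.

    Each argument is a hypothetical per-variant summary with the keys
    "tokens", "duration_ms", "discussion_created" (one bool per run),
    "empty_outputs", and "missed_failures" (counts).
    """
    return {
        "effective_token_count": relative_reduction(
            detailed["tokens"], concise["tokens"]) >= 0.20,
        "run_duration_ms": relative_reduction(
            detailed["duration_ms"], concise["duration_ms"]) >= 0.15,
        "discussion_created": all(detailed["discussion_created"])
        and all(concise["discussion_created"]),
        "guardrails": all(
            v["empty_outputs"] == 0 and v["missed_failures"] == 0
            for v in (detailed, concise)
        ),
    }
```

A point estimate like this only says whether the observed reduction clears the threshold; whether the difference is statistically distinguishable from noise is a separate question for the analysis step.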

### Statistical Design

- **Variants**: `detailed` (baseline), `concise` (treatment)
- **Assignment**: Round-robin via `gh-aw` experiments runtime (cache-based)
- **Minimum runs per variant**: 20 (40 runs in total across both variants)
- **Expected experiment duration**: ~40 weeks at the current weekly cadence; consider adding a `workflow_dispatch` trigger for the duration of the experiment so that manual runs can accelerate sampling
- **Analysis approach**: Mann-Whitney U test on token count distributions (non-parametric, appropriate for skewed token distributions)
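
For reference, the Mann-Whitney U test named above can be sketched in a few lines of dependency-free Python. This is an illustrative implementation, not the code behind `analysis_type: mann_whitney`: it uses the normal approximation with a tie correction and no continuity correction, which is reasonable at around 20 samples per variant but not for very small samples. The sample values in the usage lines are made up.

```python
import math


def mann_whitney_u(x: list[float], y: list[float]) -> tuple[float, float]:
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for x, p-value). Ties receive midranks and the
    variance is tie-corrected; no continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2          # average of ranks i+1 .. j
        t = j - i                          # size of this tie group
        tie_term += t**3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:                         # every value identical
        return u1, 1.0
    z = (u1 - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_value


# Made-up token counts, for illustration only.
detailed_tokens = [41200, 39800, 43100, 40500, 42000]
concise_tokens = [30100, 33400, 29800, 31900, 32500]
u, p = mann_whitney_u(detailed_tokens, concise_tokens)
print(f"U = {u}, p = {p:.4f}")
```

In practice, `scipy.stats.mannwhitneyu` offers the same test with an exact method for small samples and is likely the better choice when SciPy is available.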

### Implementation Steps

- [ ] Add `experiments:` section to frontmatter (update `issue:` field with this issue number)
- [ ] Add `{{#if experiments.prompt_style == "concise" }}` conditional block around audit instructions
- [ ] Run `gh aw compile blog-auditor` to regenerate lock file
- [ ] Monitor the experiment state artifact (`/tmp/gh-aw/experiments/state.json`) uploaded with each run
- [ ] After 20+ runs per variant, analyze token distributions via workflow run artifacts
- [ ] Document findings and promote winning variant

### Infrastructure Status

✅ All three advanced experiment schema fields (`analysis_type`, `tags`, `notify`) are **fully implemented** in `pkg/workflow/compiler_experiments.go` and `actions/setup/js/pick_experiment.cjs`. No infrastructure sub-issue is required.

### References

- [A/B Testing in gh-aw](https://github.com/github/gh-aw/blob/main/.github/aw/github-agentic-workflows.md)
- Workflow file: `.github/workflows/blog-auditor.md`







> Generated by [🧪 Daily A/B Testing Advisor](https://github.com/github/gh-aw/actions/runs/25959988571) · ● 3.6M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fab-testing-advisor%22&type=issues)
> - [x] expires on May 30, 2026, 10:52 AM UTC
