[ab-advisor] Experiment campaign for blog-auditor: A/B test prompt_style #32603

@github-actions

🧪 Experiment Campaign: blog-auditor

Workflow file: .github/workflows/blog-auditor.md
Selected dimension: prompt_style
Triggered by: ab-testing-advisor on 2026-05-16


Background

The blog-auditor workflow performs a weekly automated audit of the GitHub Next Agentic Workflows blog page, validating accessibility, content integrity, and code snippet correctness using Playwright. The current prompt is extremely prescriptive — it includes exact bash commands, explicit code blocks for every step, and full Markdown templates for both pass and fail reports. This level of detail may be consuming unnecessary tokens and forcing the agent into rigid step-by-step execution rather than adapting intelligently. Testing a concise prompt style vs. the current detailed one will reveal whether the extra verbosity adds measurable quality or just cost.

Hypothesis

Null hypothesis (H0): A concise prompt variant produces the same discussion quality and validation correctness as the current detailed prompt.

Alternative hypothesis (H1): A concise prompt reduces effective token consumption by ≥20% and run duration by ≥15% while maintaining equivalent audit correctness (zero false-negative audits, same keyword-check pass rate).

Experiment Configuration

Add the following experiments: block to the workflow frontmatter:

experiments:
  prompt_style:
    variants: [detailed, concise]
    description: "Tests whether a high-level goal-oriented prompt produces the same audit quality as the current step-by-step detailed instructions"
    hypothesis: "H0: no change in audit correctness or discussion quality. H1: concise variant reduces token cost ≥20% with no degradation in validation accuracy"
    metric: effective_token_count
    secondary_metrics: [run_duration_ms, discussion_created, validation_pass_rate]
    guardrail_metrics:
      - name: empty_output_rate
        direction: min
        threshold: 0
      - name: missed_validation_failures
        direction: min
        threshold: 0
    min_samples: 20
    weight: [50, 50]
    start_date: "2026-05-16"
    analysis_type: mann_whitney
    tags: [prompt-engineering, cost-optimization, blog-auditor]
    notify:
      issue: 0

Note: Replace issue: 0 with this issue's number after creation.

Variant descriptions:

  • detailed: Current behavior — full step-by-step instructions with explicit bash commands, exact markdown templates, and exhaustive success criteria checklist.
  • concise: High-level goal-oriented instructions that describe what to achieve (navigate, validate, report) without prescribing exact commands or pre-written markdown templates. The agent selects its own approach.
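
Before compiling, the experiments block above can be sanity-checked by hand or with a small script. The sketch below is only an illustration: the dictionary mirrors a few fields of the YAML, and the rules it applies (one weight per variant, weights summing to 100, a non-zero notify.issue) are assumptions made for this example, not the validation that gh aw compile actually performs. The function name check_experiment is hypothetical.

```python
def check_experiment(name, spec):
    """Return a list of human-readable problems found in one experiment spec.

    The rules here are illustrative assumptions, not the gh-aw compiler's own checks.
    """
    problems = []
    variants = spec.get("variants", [])
    weights = spec.get("weight", [])
    if len(variants) < 2:
        problems.append(f"{name}: need at least two variants")
    if len(weights) != len(variants):
        problems.append(
            f"{name}: weight has {len(weights)} entries for {len(variants)} variants"
        )
    elif sum(weights) != 100:
        problems.append(f"{name}: weights sum to {sum(weights)}, expected 100")
    if spec.get("notify", {}).get("issue", 0) == 0:
        problems.append(f"{name}: notify.issue is still the placeholder 0")
    return problems


# A subset of the prompt_style block from this issue, written as a Python dict.
PROMPT_STYLE = {
    "variants": ["detailed", "concise"],
    "weight": [50, 50],
    "min_samples": 20,
    "notify": {"issue": 0},
}

for problem in check_experiment("prompt_style", PROMPT_STYLE):
    print(problem)
```

With the block exactly as written in this issue, the only problem reported is the placeholder issue number.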

Workflow Changes Required

Wrap the detailed instruction body with a conditional block and add a concise alternative. The experiments.prompt_style reference is resolved at compile time.

Before (current body opening):

## Audit Process

### Phase 1: Navigate and Capture Blog Content

Use Playwright to navigate to the target URL and capture the accessibility snapshot:

1. **Navigate to URL**: Run `playwright-cli browser_navigate --url (githubnext.com/redacted)` to load the page
...

After (conditional wrap):

{{#if experiments.prompt_style == "concise" }}
## Audit Process

Navigate to `(githubnext.com/redacted)` using Playwright, capture the accessibility snapshot, and validate:

- HTTP status is 200
- Final URL is within `githubnext.com` / `www.githubnext.com`
- Content length exceeds 5,000 characters
- All required keywords present: `agentic-workflows`, `GitHub`, `workflow`, `compiler`
- Any YAML/Markdown workflow code snippets pass `gh aw compile --no-emit --validate`

Create a discussion in the **Audits** category titled `[audit] Agentic Workflows blog audit - PASSED` (or `FAILED`). Include a summary table of each check with pass/fail status and the values observed. For failures, add suggested remediation steps.
{{else}}
## Audit Process

### Phase 1: Navigate and Capture Blog Content
... (existing detailed content unchanged)
{{/if}}
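
For reference when reviewing runs of either variant, the pass/fail checks named in the concise prompt can be written down as a small function. This is a sketch under assumptions: the function and constant names are hypothetical, keyword matching is taken to be a case-sensitive substring test, and the gh aw compile snippet check is left out because it needs the CLI. It is not how the agent or the workflow implements the audit.

```python
from urllib.parse import urlparse

# Values taken from the concise prompt; the names are hypothetical.
ALLOWED_HOSTS = {"githubnext.com", "www.githubnext.com"}
REQUIRED_KEYWORDS = ("agentic-workflows", "GitHub", "workflow", "compiler")
MIN_CONTENT_LENGTH = 5000


def run_checks(status, final_url, content):
    """Evaluate each check and return {check_name: (passed, observed_value)}."""
    host = urlparse(final_url).hostname or ""
    # Assumption: a keyword counts as present if it appears as a
    # case-sensitive substring of the captured page content.
    missing = [k for k in REQUIRED_KEYWORDS if k not in content]
    return {
        "http_status": (status == 200, status),
        "final_host": (host in ALLOWED_HOSTS, host),
        "content_length": (len(content) > MIN_CONTENT_LENGTH, len(content)),
        "keywords": (not missing, missing),
    }


def verdict(checks):
    """Return the suffix used in the discussion title."""
    return "PASSED" if all(ok for ok, _ in checks.values()) else "FAILED"
```

The (passed, observed_value) pairs map directly onto the summary table the concise prompt asks for: one row per check with its status and the value observed.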

Success Metrics

| Metric | Type | Target |
| --- | --- | --- |
| effective_token_count | Primary | ≥20% reduction in concise variant |
| run_duration_ms | Secondary | ≥15% reduction in concise variant |
| discussion_created | Secondary | 100% in both variants |
| validation_pass_rate | Secondary | Equal across variants |
| empty_output_rate | Guardrail | Must remain 0% |
| missed_validation_failures | Guardrail | Must remain 0% |
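
The percentage targets above could be evaluated as a relative reduction of the concise variant against the detailed baseline. The sketch below assumes the comparison is made on medians, which suits skewed token counts; the issue does not specify the summary statistic, so that choice and the function names are assumptions.

```python
from statistics import median


def relative_reduction(baseline, treatment):
    """Fractional reduction of the treatment median relative to the baseline median."""
    base = median(baseline)
    return (base - median(treatment)) / base


def meets_target(baseline, treatment, target):
    """True if the treatment achieves at least `target` (e.g. 0.20) reduction."""
    return relative_reduction(baseline, treatment) >= target
```

For example, a baseline median of 110 and a treatment median of 85 give a reduction of 25/110, roughly 22.7%, which clears the 20% token target.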

Statistical Design

  • Variants: detailed (baseline), concise (treatment)
  • Assignment: Round-robin via gh-aw experiments runtime (cache-based)
  • Minimum runs per variant: 20 (40 runs in total across both variants)
  • Expected experiment duration: ~40 weeks at the weekly cadence; adding a workflow_dispatch trigger for the duration of the experiment would accelerate sampling
  • Analysis approach: Mann-Whitney U test on token count distributions (non-parametric, appropriate for skewed token distributions)
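
The analysis step can be sketched without external dependencies. The code below is a minimal two-sided Mann-Whitney U test using the normal approximation with tie and continuity corrections. It is an illustration of the technique only, not the implementation behind analysis_type: mann_whitney in gh-aw, and the function names are hypothetical. With 20 runs per variant the normal approximation is generally considered adequate.

```python
import math


def rank_with_ties(values):
    """Return 1-based average ranks for `values` and the sizes of all tie groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def mann_whitney_u(x, y):
    """Return (U, two-sided p-value) via the normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, tie_sizes = rank_with_ties(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    mean_u = n1 * n2 / 2
    # Tie correction reduces the variance when values repeat.
    tie_term = sum(t**3 - t for t in tie_sizes) / (n * (n - 1))
    sd_u = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))
    if sd_u == 0:  # every observation identical: no evidence of a difference
        return min(u1, u2), 1.0
    z = max((abs(u1 - mean_u) - 0.5) / sd_u, 0.0)  # continuity correction
    p = math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))
    return min(u1, u2), p
```

Calling mann_whitney_u(detailed_tokens, concise_tokens) on the per-run token counts returns the U statistic and a p-value; a small p-value indicates that the two token distributions differ, and the direction is read from the medians.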

Implementation Steps

  • Add experiments: section to frontmatter (update issue: field with this issue number)
  • Add {{#if experiments.prompt_style == "concise" }} conditional block around audit instructions
  • Run gh aw compile blog-auditor to regenerate lock file
  • Monitor the experiment state artifact uploaded on each run (/tmp/gh-aw/experiments/state.json)
  • After 20+ runs per variant, analyze token distributions via workflow run artifacts
  • Document findings and promote winning variant

Infrastructure Status

✅ All three advanced experiment schema fields (analysis_type, tags, notify) are fully implemented in pkg/workflow/compiler_experiments.go and actions/setup/js/pick_experiment.cjs. No infrastructure sub-issue is required.

Generated by 🧪 Daily A/B Testing Advisor · ● 3.6M ·

  • expires on May 30, 2026, 10:52 AM UTC
