[ab-advisor] Improve experiment infrastructure: schema, reporting & audit #29604

@github-actions

Overview

This sub-issue tracks concrete improvements to the gh-aw experiments infrastructure across three areas: frontmatter schema, reporting/dashboards, and audit/observability integration. It was identified during the A/B testing campaign design for #29602.


Area 1: Frontmatter Schema Enhancements

The current schema supports only a flat experiments: { name: [variant1, variant2] } mapping. This is enough to pick variants but provides no machine-readable metadata for tooling to act on.

Proposed enhanced schema:

experiments:
  prompt_style:
    variants: [concise, verbose]
    description: "Test whether concise vs verbose prompts reduce token consumption without quality loss"
    metric: effective_tokens
    weight: [50, 50]
    issue: 1234
    start_date: "2026-05-01"
    end_date: "2026-06-15"

Field definitions:

  • variants (required): List of variant strings (replaces the current bare array).
  • description (optional): Human-readable hypothesis/purpose, surfaced in step summaries and audit.
  • metric (optional): Primary metric name (links to OTEL attribute or artifact field for automated collection).
  • weight (optional): Traffic allocation percentages, one entry per variant in the same order as variants (must sum to 100). Enables an 80/20 holdout or a gradual rollout. Default: equal weights.
  • issue (optional): GitHub issue number that tracks this experiment. Used to auto-close or comment when the experiment concludes.
  • start_date / end_date (optional): ISO-8601 dates. pick_experiment.cjs skips assignment outside this window and uses the control variant.
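
One way the weight and start_date / end_date fields could combine at assignment time is sketched below. The helper names (isWithinWindow, pickVariant), the inclusive window bounds, the use of a random draw, and treating the first listed variant as the control are all assumptions made for illustration; the schema above does not define them, and the real pick_experiment.cjs may balance assignments differently.

```javascript
// Hypothetical helpers, not the actual pick_experiment.cjs code.

// True when `now` falls inside [start_date, end_date]; both bounds are
// optional and assumed inclusive. Dates are compared as UTC "YYYY-MM-DD".
function isWithinWindow(spec, now) {
  const day = now.toISOString().slice(0, 10);
  if (spec.start_date && day < spec.start_date) return false;
  if (spec.end_date && day > spec.end_date) return false;
  return true;
}

// Picks a variant by weight. `rand` is a number in [0, 1), e.g. Math.random().
function pickVariant(spec, now, rand) {
  const { variants } = spec;
  // ASSUMPTION: the first listed variant acts as the control.
  if (!isWithinWindow(spec, now)) return variants[0];

  const weights = spec.weight ?? variants.map(() => 100 / variants.length);
  if (weights.length !== variants.length) {
    throw new Error("weight must have one entry per variant");
  }
  const total = weights.reduce((a, b) => a + b, 0);
  if (Math.abs(total - 100) > 1e-9) {
    throw new Error("weights must sum to 100");
  }

  // Walk the cumulative weights until the draw is used up.
  let remaining = rand * 100;
  for (let i = 0; i < variants.length; i++) {
    remaining -= weights[i];
    if (remaining < 0) return variants[i];
  }
  return variants[variants.length - 1]; // guard against floating-point drift
}
```

Passing the random draw in as a parameter keeps the function deterministic, which makes the weight-based selection straightforward to unit test:

```javascript
const spec = { variants: ["concise", "verbose"], weight: [80, 20], end_date: "2026-06-15" };
pickVariant(spec, new Date("2026-05-10T12:00:00Z"), 0.79); // "concise"
pickVariant(spec, new Date("2026-05-10T12:00:00Z"), 0.81); // "verbose"
pickVariant(spec, new Date("2026-07-01T00:00:00Z"), 0.99); // "concise" (window closed)
```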

Backward compatibility: the bare-array form (prompt_style: [concise, verbose]) should continue to work via a schema migration shim in extractExperimentsFromFrontmatter.
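
The shim itself would live in the Go compiler, but its intended behavior can be sketched in JavaScript: accept either form and always hand downstream code the object form. The function names below are hypothetical, and the error handling is only one reasonable choice.

```javascript
// Hypothetical normalizer illustrating the migration shim's behavior.
// Accepts the legacy bare-array form or the proposed object form and
// always returns the object form.
function normalizeExperiment(name, raw) {
  if (Array.isArray(raw)) {
    // Legacy form: prompt_style: [concise, verbose]
    return { variants: raw.slice() };
  }
  if (raw === null || typeof raw !== "object" || !Array.isArray(raw.variants)) {
    throw new Error(
      `experiment "${name}": expected an array or an object with "variants"`
    );
  }
  // Object form: keep all metadata fields, copy the variants list.
  return { ...raw, variants: raw.variants.slice() };
}

// Normalizes every entry under the frontmatter's `experiments` key.
function normalizeExperiments(frontmatter) {
  const out = {};
  for (const [name, raw] of Object.entries(frontmatter.experiments ?? {})) {
    out[name] = normalizeExperiment(name, raw);
  }
  return out;
}
```

With this shape, code that reads weight, metric, or the date fields never needs to branch on which form the workflow author used.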


Area 2: Reporting & Dashboards

Currently, no automated analysis runs over the accumulated experiment state. A new daily-experiment-report workflow (or an extension to an existing reporting workflow) should:

  1. Aggregate state artifacts: Download the experiment artifact from the last N runs of each workflow that declares experiments. Parse each state.json to extract { experiment, variant, run_id, primary_metric, duration_ms, conclusion }, where primary_metric holds the token count when metric is effective_tokens.

  2. Compute running statistics per variant:

    • Mean, variance, and 95% CI for the primary metric
    • Sample size per variant
    • Proportion of successful runs

  3. Detect significance: Apply Welch's t-test (for continuous metrics) or a two-proportion z-test (for binary outcomes). Flag when p-value < 0.05 or the Bayes factor exceeds a threshold.

  4. Generate an ASCII comparison table artifact:

experiment: prompt_style  (daily-community-attribution)
┌─────────┬──────┬────────────┬────────────┬──────────┐
│ variant │  n   │ mean_tok   │ p50_tok    │ p-value  │
├─────────┼──────┼────────────┼────────────┼──────────┤
│ concise │  12  │  8 420     │  8 210     │          │
│ verbose │  13  │ 11 350     │ 11 100     │  0.003 ✅ │
└─────────┴──────┴────────────┴────────────┴──────────┘
Winner: concise  (−25.8% tokens, p=0.003)

  5. Post a discussion comment with the table, the current winner, and a recommendation to promote or extend the experiment.
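
The statistics in steps 2 and 3 can be sketched as follows. This is a minimal illustration, not a proposed implementation: the function names are hypothetical, and the sketch stops at the test statistics. Turning the Welch t statistic into a p-value or a 95% CI needs the Student t distribution, which would presumably come from a statistics library.

```javascript
// Sample size, mean, and unbiased sample variance (n - 1 denominator).
function summarize(xs) {
  const n = xs.length;
  const mean = xs.reduce((a, b) => a + b, 0) / n;
  const variance =
    n > 1 ? xs.reduce((a, x) => a + (x - mean) ** 2, 0) / (n - 1) : NaN;
  return { n, mean, variance };
}

// Welch's t statistic and Welch-Satterthwaite degrees of freedom
// for two independent samples with possibly unequal variances.
function welch(a, b) {
  const sa = summarize(a);
  const sb = summarize(b);
  const va = sa.variance / sa.n; // squared standard error of mean(a)
  const vb = sb.variance / sb.n; // squared standard error of mean(b)
  const t = (sa.mean - sb.mean) / Math.sqrt(va + vb);
  const df = (va + vb) ** 2 / (va ** 2 / (sa.n - 1) + vb ** 2 / (sb.n - 1));
  return { t, df };
}

// Two-proportion z statistic using the pooled standard error.
function twoProportionZ(successA, nA, successB, nB) {
  const pA = successA / nA;
  const pB = successB / nB;
  const pooled = (successA + successB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  return (pA - pB) / se;
}

// Relative change of `value` against `baseline`, in percent.
function relativeChange(value, baseline) {
  return ((value - baseline) / baseline) * 100;
}
```

Applied to the means in the sample table, relativeChange(8420, 11350).toFixed(1) gives "-25.8", matching the reported token reduction for the concise variant.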

Area 3: Audit & Logs Integration

OTEL span attributes: The compiled workflow step that invokes the agent should set these span attributes when an experiment is active:

experiment.name    = "prompt_style"
experiment.variant = "concise"
experiment.run_seq = 7   # invocation count for this variant

This allows filtering OTEL traces by experiment variant in any compatible backend (e.g., Honeycomb, Jaeger).
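
As a rough sketch, the three attributes could be derived from the state.json contents like this. The helper name is hypothetical, and the sketch assumes the per-variant count already includes the current run, which the issue does not specify.

```javascript
// Hypothetical helper: builds the experiment.* span attributes for one
// assignment. ASSUMPTION: state.counts already includes the current run,
// so the count doubles as the run sequence number.
function experimentSpanAttributes(state, experiment, variant) {
  const runSeq = state.counts?.[experiment]?.[variant] ?? 0;
  return {
    "experiment.name": experiment,
    "experiment.variant": variant,
    "experiment.run_seq": runSeq,
  };
}
```

If the step uses the OpenTelemetry JavaScript API, the returned object could be passed to span.setAttributes(...) as-is; how the compiled workflow actually emits spans is outside what this sketch assumes.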

gh aw audit integration:

  • Surface experiment assignments in gh aw audit output:
    Run #25229568147  workflow=daily-community-attribution  [experiment: prompt_style=concise]
    
  • Add --experiment NAME and --variant VALUE flags to filter audit log output to a specific experiment or variant, enabling side-by-side comparison of failure modes.
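
The intended filter semantics might look like the sketch below. It is illustrative only: the real command is part of the Go CLI, and the record shape (an experiments map on each run) and the function name are assumptions.

```javascript
// Hypothetical filter mirroring the proposed --experiment / --variant flags.
// ASSUMPTION: each run record carries an `experiments` map such as
// { prompt_style: "concise" }.
function filterRuns(runs, { experiment, variant } = {}) {
  return runs.filter((run) => {
    const assigned = run.experiments ?? {};
    if (experiment === undefined) {
      // Only --variant given: match the value under any experiment.
      return variant === undefined || Object.values(assigned).includes(variant);
    }
    if (!(experiment in assigned)) return false;
    return variant === undefined || assigned[experiment] === variant;
  });
}
```

Running the filter once per variant yields the two run lists needed for a side-by-side comparison of failure modes.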

Step summary enhancements in pick_experiment.cjs:

The existing step summary already shows the assignment table. Extend it to include:

  • The description field from the enhanced frontmatter schema
  • A direct link to the tracking issue (if issue: is set)
  • The p-value from the most recent report artifact (if available in the cache)
  • A notice if the experiment window has closed (end_date has passed) so engineers know to promote a winner

Artifact schema additions (state.json):

{
  "counts": { "prompt_style": { "concise": 7, "verbose": 6 } },
  "assignments": [
    {
      "run_id": "25229568147",
      "run_at": "2026-05-01T19:00:00Z",
      "experiment": "prompt_style",
      "variant": "concise",
      "conclusion": "success",
      "duration_ms": 42000,
      "primary_metric": 8420
    }
  ]
}

Because each artifact is cumulative, the assignments array lets the reporting workflow reconstruct the full history from the most recent artifact alone, without downloading every run's artifact individually.
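
A minimal sketch of how this cumulative state could be written and read back follows. Both helper names are hypothetical. The sketch also assumes each entry is complete when it is recorded; in practice conclusion, duration_ms, and primary_metric are only known after the agent run, so a later step would probably have to fill them in.

```javascript
// Hypothetical helper: appends one assignment and bumps its counter.
function recordAssignment(state, entry) {
  state.counts = state.counts ?? {};
  const perVariant = (state.counts[entry.experiment] =
    state.counts[entry.experiment] ?? {});
  perVariant[entry.variant] = (perVariant[entry.variant] ?? 0) + 1;
  state.assignments = state.assignments ?? [];
  state.assignments.push(entry);
  return state;
}

// Hypothetical helper: groups recorded primary_metric values by variant
// for one experiment, skipping entries without a numeric metric.
function metricsByVariant(state, experiment) {
  const groups = {};
  for (const a of state.assignments ?? []) {
    if (a.experiment !== experiment) continue;
    if (typeof a.primary_metric !== "number") continue;
    (groups[a.variant] = groups[a.variant] ?? []).push(a.primary_metric);
  }
  return groups;
}
```

The per-variant arrays returned by metricsByVariant are the inputs a reporting step would need to compute sample sizes, means, and significance tests.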


Implementation Steps

  • Update extractExperimentsFromFrontmatter in pkg/workflow/compiler_experiments.go to support the enhanced object schema alongside the existing bare-array form
  • Update pick_experiment.cjs to read weight, start_date, end_date, and description from the spec; emit these in the step summary and append each assignment to the assignments array in state.json
  • Add --experiment and --variant filter flags to gh aw audit
  • Inject experiment.* OTEL span attributes in the compiled workflow step template
  • Create daily-experiment-report.md workflow that aggregates artifacts, computes statistics, and posts to a discussion
  • Update JSON schema for workflow frontmatter (schemas/) to document the new fields
  • Add unit tests for the weight-based variant selection algorithm in pick_experiment.cjs

References

Generated by Daily A/B Testing Advisor · ● 484.4K ·

  • expires on May 15, 2026, 7:35 PM UTC
