## Overview
This sub-issue tracks concrete improvements to the gh-aw experiments infrastructure across three areas: frontmatter schema, reporting/dashboards, and audit/observability integration. It was identified during the A/B testing campaign design for #29602.
## Area 1: Frontmatter Schema Enhancements
The current schema supports only a flat `experiments: { name: [variant1, variant2] }` mapping. This is enough to pick variants but provides no machine-readable metadata for tooling to act on.
Proposed enhanced schema:

```yaml
experiments:
  prompt_style:
    variants: [concise, verbose]
    description: "Test whether concise vs verbose prompts reduce token consumption without quality loss"
    metric: effective_tokens
    weight: [50, 50]
    issue: 1234
    start_date: "2026-05-01"
    end_date: "2026-06-15"
```
Field definitions:
- `variants` (required): List of variant strings (replaces the current bare array).
- `description` (optional): Human-readable hypothesis/purpose, surfaced in step summaries and audit output.
- `metric` (optional): Primary metric name (links to an OTEL attribute or artifact field for automated collection).
- `weight` (optional): Traffic allocation percentages (must sum to 100). Enables an 80/20 holdout or gradual rollout. Default: equal weight. See the sketch after this list.
- `issue` (optional): GitHub issue number that tracks this experiment. Used to auto-close or comment when the experiment concludes.
- `start_date` / `end_date` (optional): ISO-8601 dates. `pick_experiment.cjs` skips assignment outside this window and uses the control variant.
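As an illustration of how the picker could honor `weight` and the date window, here is a sketch in Go. The real logic would live in `pick_experiment.cjs` as JavaScript; `Spec` and `pickVariant` are hypothetical names, and the control variant is assumed to be the first one listed:

```go
package experiments

import (
	"math/rand"
	"time"
)

// Spec is a hypothetical in-memory form of one experiment's frontmatter.
type Spec struct {
	Variants  []string
	Weights   []int      // percentages; validated to sum to 100 when present
	StartDate *time.Time // optional experiment window
	EndDate   *time.Time
}

// pickVariant returns the control variant (assumed: first listed) outside the
// [StartDate, EndDate] window; otherwise it samples according to Weights,
// falling back to equal weights when none are declared.
func pickVariant(s Spec, now time.Time) string {
	if (s.StartDate != nil && now.Before(*s.StartDate)) ||
		(s.EndDate != nil && now.After(*s.EndDate)) {
		return s.Variants[0]
	}
	if len(s.Weights) != len(s.Variants) {
		return s.Variants[rand.Intn(len(s.Variants))] // equal-weight default
	}
	roll := rand.Intn(100) // 0..99, mapped onto cumulative weights
	for i, w := range s.Weights {
		if roll < w {
			return s.Variants[i]
		}
		roll -= w
	}
	return s.Variants[len(s.Variants)-1] // guard against rounding
}
```

A deterministic alternative is to assign whichever variant is furthest below its target share according to the `counts` object already kept in `state.json`, which keeps small samples balanced.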
**Backward compatibility**: the bare-array form (`prompt_style: [concise, verbose]`) should continue to work via a schema migration shim in `extractExperimentsFromFrontmatter`.
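A minimal sketch of that shim, assuming hypothetical types (the actual signature in `pkg/workflow/compiler_experiments.go` may differ):

```go
package workflow

import "fmt"

// normalizeExperiment accepts either frontmatter form for a single experiment
// entry — a bare variant array (legacy) or an object with a "variants" key
// (enhanced) — and returns the enhanced-object representation.
func normalizeExperiment(raw any) (map[string]any, error) {
	switch v := raw.(type) {
	case []any: // legacy form: prompt_style: [concise, verbose]
		return map[string]any{"variants": v}, nil
	case map[string]any: // enhanced object form
		if _, ok := v["variants"]; !ok {
			return nil, fmt.Errorf("experiment object is missing required field 'variants'")
		}
		return v, nil
	default:
		return nil, fmt.Errorf("experiment must be an array or an object, got %T", raw)
	}
}
```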
## Area 2: Reporting & Dashboards
Currently, no automated analysis is run over accumulated experiment state. A new `daily-experiment-report` workflow (or an extension to an existing reporting workflow) should:
- **Aggregate state artifacts**: Download the `experiment` artifact from the last N runs of each workflow that declares experiments. Parse each `state.json` to extract `{ experiment, variant, run_id, token_count, duration_ms, conclusion }`.
- **Compute running statistics per variant**:
  - Mean, variance, and 95% CI for the primary metric
  - Sample size per variant
  - Proportion of successful runs
- **Detect significance**: Apply a Welch t-test (continuous metrics) or a two-proportion z-test (binary outcomes). Flag when p-value < 0.05 or the Bayes factor exceeds a threshold. A Welch sketch follows this list.
- **Generate an ASCII comparison table artifact**:

  ```
  experiment: prompt_style (daily-community-attribution)
  ┌─────────┬──────┬────────────┬────────────┬──────────┐
  │ variant │  n   │  mean_tok  │  p50_tok   │ p-value  │
  ├─────────┼──────┼────────────┼────────────┼──────────┤
  │ concise │  12  │    8 420   │    8 210   │          │
  │ verbose │  13  │   11 350   │   11 100   │ 0.003 ✅ │
  └─────────┴──────┴────────────┴────────────┴──────────┘
  Winner: concise (−25.8% tokens, p=0.003)
  ```
- **Post a discussion comment** with the table, the current winner, and a recommendation to promote or extend the experiment.
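For the statistics above, a self-contained Welch t-test sketch in Go — a stand-in for whatever the report workflow actually uses; a stats library's Student's t CDF would be more precise for small samples than the normal approximation shown:

```go
package report

import "math"

// meanVar returns the sample mean and unbiased sample variance.
func meanVar(x []float64) (mean, variance float64) {
	for _, v := range x {
		mean += v
	}
	mean /= float64(len(x))
	for _, v := range x {
		variance += (v - mean) * (v - mean)
	}
	variance /= float64(len(x) - 1)
	return mean, variance
}

// welch returns the t statistic and Welch–Satterthwaite degrees of freedom
// for two independent samples of a continuous metric (e.g. effective_tokens).
func welch(a, b []float64) (t, df float64) {
	ma, va := meanVar(a)
	mb, vb := meanVar(b)
	sa, sb := va/float64(len(a)), vb/float64(len(b))
	t = (ma - mb) / math.Sqrt(sa+sb)
	df = (sa + sb) * (sa + sb) /
		(sa*sa/float64(len(a)-1) + sb*sb/float64(len(b)-1))
	return t, df
}

// pTwoSided approximates the two-sided p-value from t using the normal
// distribution; adequate once df is not tiny.
func pTwoSided(t float64) float64 {
	return math.Erfc(math.Abs(t) / math.Sqrt2)
}
```

Under the same approximation, the per-variant 95% CI is mean ± 1.96·√(variance/n), and the success proportions feed the two-proportion z-test instead.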
## Area 3: Audit & Logs Integration
**OTEL span attributes**: The compiled workflow step that invokes the agent should set these span attributes when an experiment is active:

```
experiment.name    = "prompt_style"
experiment.variant = "concise"
experiment.run_seq = 7   # invocation count for this variant
```
This allows filtering OTEL traces by experiment variant in any compatible backend (e.g., Honeycomb, Jaeger).
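If the step is instrumented with the OpenTelemetry Go SDK, the attribute wiring could look like the following; the tracer name and function are illustrative, not the actual gh-aw code:

```go
package steps

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// invokeAgent starts a span for the agent invocation and tags it with the
// active experiment so traces can be filtered by variant downstream.
func invokeAgent(ctx context.Context, expName, variant string, runSeq int) {
	ctx, span := otel.Tracer("gh-aw").Start(ctx, "agent.invoke")
	defer span.End()
	span.SetAttributes(
		attribute.String("experiment.name", expName),
		attribute.String("experiment.variant", variant),
		attribute.Int("experiment.run_seq", runSeq),
	)
	// ... run the agent with ctx so child spans attach under this span ...
	_ = ctx
}
```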
**`gh aw audit` integration**:

- Surface experiment assignments in `gh aw audit` output:

  ```
  Run #25229568147 workflow=daily-community-attribution [experiment: prompt_style=concise]
  ```
- Add `--experiment NAME` and `--variant VALUE` flags to filter audit log output to a specific experiment or variant, enabling side-by-side comparison of failure modes. A wiring sketch follows.
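Since the `gh aw` CLI is Go, the flags could plausibly be wired cobra-style; the command variable and filter names here are hypothetical, not the real gh-aw source:

```go
package cli

import "github.com/spf13/cobra"

var auditCmd = &cobra.Command{Use: "audit"} // stand-in for the real command

var (
	filterExperiment string
	filterVariant    string
)

func init() {
	auditCmd.Flags().StringVar(&filterExperiment, "experiment", "",
		"only show runs assigned to this experiment")
	auditCmd.Flags().StringVar(&filterVariant, "variant", "",
		"only show runs assigned to this variant")
}

// keep reports whether a run's experiment assignment passes the filters;
// empty flags match everything.
func keep(experiment, variant string) bool {
	return (filterExperiment == "" || filterExperiment == experiment) &&
		(filterVariant == "" || filterVariant == variant)
}
```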
**Step summary enhancements in `pick_experiment.cjs`**:

The existing step summary already shows the assignment table. Extend it to include:
- The `description` field from the enhanced frontmatter schema
- A direct link to the tracking issue (if `issue:` is set)
- The p-value from the most recent report artifact (if available in the cache)
- A notice if the experiment window has closed (`end_date` has passed), so engineers know to promote a winner (a sketch of this notice follows the list)
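One concrete piece, the window-closed notice, could work as sketched below — in Go for consistency with the other sketches, though the real code would be JavaScript in `pick_experiment.cjs`. `GITHUB_STEP_SUMMARY` is the file GitHub Actions renders as the step summary:

```go
package summary

import (
	"fmt"
	"os"
	"time"
)

// appendWindowNotice appends a warning to the job's step summary when the
// experiment window has closed; it is a no-op while the window is open.
func appendWindowNotice(experiment string, endDate time.Time) error {
	if time.Now().Before(endDate) {
		return nil // window still open; nothing to report
	}
	f, err := os.OpenFile(os.Getenv("GITHUB_STEP_SUMMARY"),
		os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f,
		"\n> ⚠️ Experiment `%s` ended on %s — promote a winner.\n",
		experiment, endDate.Format("2006-01-02"))
	return err
}
```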
**Artifact schema additions (`state.json`)**:

```json
{
  "counts": { "prompt_style": { "concise": 7, "verbose": 6 } },
  "assignments": [
    {
      "run_id": "25229568147",
      "run_at": "2026-05-01T19:00:00Z",
      "experiment": "prompt_style",
      "variant": "concise",
      "conclusion": "success",
      "duration_ms": 42000,
      "primary_metric": 8420
    }
  ]
}
```
The `assignments` array enables the reporting workflow to reconstruct the full history from any single artifact, without downloading every run's artifact individually (each artifact is cumulative).
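A sketch of the structs the reporting workflow might use to consume this schema (names are illustrative):

```go
package report

import (
	"encoding/json"
	"os"
	"time"
)

// Assignment mirrors one entry of the proposed assignments array.
type Assignment struct {
	RunID         string    `json:"run_id"`
	RunAt         time.Time `json:"run_at"`
	Experiment    string    `json:"experiment"`
	Variant       string    `json:"variant"`
	Conclusion    string    `json:"conclusion"`
	DurationMS    int64     `json:"duration_ms"`
	PrimaryMetric float64   `json:"primary_metric"`
}

// State mirrors the proposed state.json layout.
type State struct {
	Counts      map[string]map[string]int `json:"counts"`
	Assignments []Assignment              `json:"assignments"`
}

// loadState reads a downloaded state.json artifact; because each artifact is
// cumulative, the latest one suffices to reconstruct the full history.
func loadState(path string) (*State, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var s State
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

// metricsByVariant groups primary_metric samples of successful runs for one
// experiment, ready to feed the Welch t-test sketch from Area 2.
func metricsByVariant(s *State, experiment string) map[string][]float64 {
	out := map[string][]float64{}
	for _, a := range s.Assignments {
		if a.Experiment == experiment && a.Conclusion == "success" {
			out[a.Variant] = append(out[a.Variant], a.PrimaryMetric)
		}
	}
	return out
}
```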
## Implementation Steps

1. Extend `extractExperimentsFromFrontmatter` in `pkg/workflow/compiler_experiments.go` to support the enhanced object schema alongside the existing bare-array form
2. Update `pick_experiment.cjs` to read `weight`, `start_date`, `end_date`, and `description` from the spec; emit these in the step summary and append to the `assignments` array in `state.json`
3. Add `--experiment` and `--variant` filter flags to `gh aw audit`
4. Emit `experiment.*` OTEL span attributes in the compiled workflow step template
5. Add a `daily-experiment-report.md` workflow that aggregates artifacts, computes statistics, and posts to a discussion
6. Update the frontmatter schema (`schemas/`) to document the new fields
7. Add tests for `pick_experiment.cjs`
## References

- `pkg/workflow/compiler_experiments.go`
- `actions/setup/js/pick_experiment.cjs`
- Related to: [ab-advisor] Experiment campaign for daily-community-attribution: A/B test prompt_style (#29602)
Generated by Daily A/B Testing Advisor