## Overview
This sub-issue tracks concrete improvements to the gh-aw experiments infrastructure across three areas: frontmatter schema, reporting/dashboards, and audit/observability integration. It was identified during the A/B testing campaign design for #29602.
## Area 1: Frontmatter Schema Enhancements
The current schema supports only a flat `experiments: { name: [variant1, variant2] }` mapping. This is enough to pick variants but provides no machine-readable metadata for tooling to act on.
Proposed enhanced schema:

```yaml
experiments:
  prompt_style:
    variants: [concise, verbose]
    description: "Test whether concise vs verbose prompts reduce token consumption without quality loss"
    metric: effective_tokens
    weight: [50, 50]
    issue: 1234
    start_date: "2026-05-01"
    end_date: "2026-06-15"
```
Field definitions:
- `variants` (required): List of variant strings (replaces the current bare array).
- `description` (optional): Human-readable hypothesis/purpose, surfaced in step summaries and audit output.
- `metric` (optional): Primary metric name (links to an OTEL attribute or artifact field for automated collection).
- `weight` (optional): Traffic allocation percentages (must sum to 100). Enables an 80/20 holdout or gradual rollout. Default: equal weight. See the sketch after this list.
- `issue` (optional): GitHub issue number that tracks this experiment. Used to auto-close or comment when the experiment concludes.
- `start_date` / `end_date` (optional): ISO-8601 dates. `pick_experiment.cjs` skips assignment outside this window and uses the control variant.
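As an illustration of how the picker could honor `weight` and the date window, here is a sketch in Go. The real logic would live in `pick_experiment.cjs` as JavaScript; `Spec` and `pickVariant` are hypothetical names, and the control variant is assumed to be the first one listed:

```go
package experiments

import (
	"math/rand"
	"time"
)

// Spec is a hypothetical in-memory form of one experiment's frontmatter.
type Spec struct {
	Variants  []string
	Weights   []int      // percentages; validated to sum to 100 when present
	StartDate *time.Time // optional experiment window
	EndDate   *time.Time
}

// pickVariant returns the control variant (assumed: first listed) outside the
// [StartDate, EndDate] window; otherwise it samples according to Weights,
// falling back to equal weights when none are declared.
func pickVariant(s Spec, now time.Time) string {
	if (s.StartDate != nil && now.Before(*s.StartDate)) ||
		(s.EndDate != nil && now.After(*s.EndDate)) {
		return s.Variants[0]
	}
	if len(s.Weights) != len(s.Variants) {
		return s.Variants[rand.Intn(len(s.Variants))] // equal-weight default
	}
	roll := rand.Intn(100) // 0..99, mapped onto cumulative weights
	for i, w := range s.Weights {
		if roll < w {
			return s.Variants[i]
		}
		roll -= w
	}
	return s.Variants[len(s.Variants)-1] // guard against rounding
}
```

A deterministic alternative is to assign whichever variant is furthest below its target share according to the `counts` object already kept in `state.json`, which keeps small samples balanced.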
**Backward compatibility**: the bare-array form (`prompt_style: [concise, verbose]`) should continue to work via a schema migration shim in `extractExperimentsFromFrontmatter`.
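A minimal sketch of that shim, assuming hypothetical types (the actual signature in `pkg/workflow/compiler_experiments.go` may differ):

```go
package workflow

import "fmt"

// normalizeExperiment accepts either frontmatter form for a single experiment
// entry — a bare variant array (legacy) or an object with a "variants" key
// (enhanced) — and returns the enhanced-object representation.
func normalizeExperiment(raw any) (map[string]any, error) {
	switch v := raw.(type) {
	case []any: // legacy form: prompt_style: [concise, verbose]
		return map[string]any{"variants": v}, nil
	case map[string]any: // enhanced object form
		if _, ok := v["variants"]; !ok {
			return nil, fmt.Errorf("experiment object is missing required field 'variants'")
		}
		return v, nil
	default:
		return nil, fmt.Errorf("experiment must be an array or an object, got %T", raw)
	}
}
```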
## Area 2: Reporting & Dashboards
Currently, no automated analysis is run over accumulated experiment state. A new `daily-experiment-report` workflow (or an extension to an existing reporting workflow) should:
- **Aggregate state artifacts**: Download the `experiment` artifact from the last N runs of each workflow that declares experiments. Parse each `state.json` to extract `{ experiment, variant, run_id, token_count, duration_ms, conclusion }`.
- **Compute running statistics per variant**:
  - Mean, variance, and 95% CI for the primary metric
  - Sample size per variant
  - Proportion of successful runs
- **Detect significance**: Apply a Welch t-test (continuous metrics) or a two-proportion z-test (binary outcomes). Flag when p-value < 0.05 or the Bayes factor exceeds a threshold. A Welch sketch follows this list.
- **Generate an ASCII comparison table artifact**:

  ```
  experiment: prompt_style (daily-community-attribution)
  ┌─────────┬──────┬────────────┬────────────┬──────────┐
  │ variant │  n   │  mean_tok  │  p50_tok   │ p-value  │
  ├─────────┼──────┼────────────┼────────────┼──────────┤
  │ concise │  12  │    8 420   │    8 210   │          │
  │ verbose │  13  │   11 350   │   11 100   │ 0.003 ✅ │
  └─────────┴──────┴────────────┴────────────┴──────────┘
  Winner: concise (−25.8% tokens, p=0.003)
  ```
- **Post a discussion comment** with the table, the current winner, and a recommendation to promote or extend the experiment.
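For the statistics above, a self-contained Welch t-test sketch in Go — a stand-in for whatever the report workflow actually uses; a stats library's Student's t CDF would be more precise for small samples than the normal approximation shown:

```go
package report

import "math"

// meanVar returns the sample mean and unbiased sample variance.
func meanVar(x []float64) (mean, variance float64) {
	for _, v := range x {
		mean += v
	}
	mean /= float64(len(x))
	for _, v := range x {
		variance += (v - mean) * (v - mean)
	}
	variance /= float64(len(x) - 1)
	return mean, variance
}

// welch returns the t statistic and Welch–Satterthwaite degrees of freedom
// for two independent samples of a continuous metric (e.g. effective_tokens).
func welch(a, b []float64) (t, df float64) {
	ma, va := meanVar(a)
	mb, vb := meanVar(b)
	sa, sb := va/float64(len(a)), vb/float64(len(b))
	t = (ma - mb) / math.Sqrt(sa+sb)
	df = (sa + sb) * (sa + sb) /
		(sa*sa/float64(len(a)-1) + sb*sb/float64(len(b)-1))
	return t, df
}

// pTwoSided approximates the two-sided p-value from t using the normal
// distribution; adequate once df is not tiny.
func pTwoSided(t float64) float64 {
	return math.Erfc(math.Abs(t) / math.Sqrt2)
}
```

Under the same approximation, the per-variant 95% CI is mean ± 1.96·√(variance/n), and the success proportions feed the two-proportion z-test instead.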
## Area 3: Audit & Logs Integration
**OTEL span attributes**: The compiled workflow step that invokes the agent should set these span attributes when an experiment is active:

```
experiment.name    = "prompt_style"
experiment.variant = "concise"
experiment.run_seq = 7   # invocation count for this variant
```
This allows filtering OTEL traces by experiment variant in any compatible backend (e.g., Honeycomb, Jaeger).
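If the step is instrumented with the OpenTelemetry Go SDK, the attribute wiring could look like the following; the tracer name and function are illustrative, not the actual gh-aw code:

```go
package steps

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// invokeAgent starts a span for the agent invocation and tags it with the
// active experiment so traces can be filtered by variant downstream.
func invokeAgent(ctx context.Context, expName, variant string, runSeq int) {
	ctx, span := otel.Tracer("gh-aw").Start(ctx, "agent.invoke")
	defer span.End()
	span.SetAttributes(
		attribute.String("experiment.name", expName),
		attribute.String("experiment.variant", variant),
		attribute.Int("experiment.run_seq", runSeq),
	)
	// ... run the agent with ctx so child spans attach under this span ...
	_ = ctx
}
```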
**`gh aw audit` integration**:

- Surface experiment assignments in `gh aw audit` output:

  ```
  Run #25229568147 workflow=daily-community-attribution [experiment: prompt_style=concise]
  ```
- Add `--experiment NAME` and `--variant VALUE` flags to filter audit log output to a specific experiment or variant, enabling side-by-side comparison of failure modes. A wiring sketch follows.
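Since the `gh aw` CLI is Go, the flags could plausibly be wired cobra-style; the command variable and filter names here are hypothetical, not the real gh-aw source:

```go
package cli

import "github.com/spf13/cobra"

var auditCmd = &cobra.Command{Use: "audit"} // stand-in for the real command

var (
	filterExperiment string
	filterVariant    string
)

func init() {
	auditCmd.Flags().StringVar(&filterExperiment, "experiment", "",
		"only show runs assigned to this experiment")
	auditCmd.Flags().StringVar(&filterVariant, "variant", "",
		"only show runs assigned to this variant")
}

// keep reports whether a run's experiment assignment passes the filters;
// empty flags match everything.
func keep(experiment, variant string) bool {
	return (filterExperiment == "" || filterExperiment == experiment) &&
		(filterVariant == "" || filterVariant == variant)
}
```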
**Step summary enhancements in `pick_experiment.cjs`**:

The existing step summary already shows the assignment table. Extend it to include:
- The `description` field from the enhanced frontmatter schema
- A direct link to the tracking issue (if `issue:` is set)
- The p-value from the most recent report artifact (if available in the cache)
- A notice if the experiment window has closed (`end_date` has passed), so engineers know to promote a winner (a sketch of this notice follows the list)
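One concrete piece, the window-closed notice, could work as sketched below — in Go for consistency with the other sketches, though the real code would be JavaScript in `pick_experiment.cjs`. `GITHUB_STEP_SUMMARY` is the file GitHub Actions renders as the step summary:

```go
package summary

import (
	"fmt"
	"os"
	"time"
)

// appendWindowNotice appends a warning to the job's step summary when the
// experiment window has closed; it is a no-op while the window is open.
func appendWindowNotice(experiment string, endDate time.Time) error {
	if time.Now().Before(endDate) {
		return nil // window still open; nothing to report
	}
	f, err := os.OpenFile(os.Getenv("GITHUB_STEP_SUMMARY"),
		os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f,
		"\n> ⚠️ Experiment `%s` ended on %s — promote a winner.\n",
		experiment, endDate.Format("2006-01-02"))
	return err
}
```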
**Artifact schema additions (`state.json`)**:

```json
{
  "counts": { "prompt_style": { "concise": 7, "verbose": 6 } },
  "assignments": [
    {
      "run_id": "25229568147",
      "run_at": "2026-05-01T19:00:00Z",
      "experiment": "prompt_style",
      "variant": "concise",
      "conclusion": "success",
      "duration_ms": 42000,
      "primary_metric": 8420
    }
  ]
}
```
The `assignments` array enables the reporting workflow to reconstruct the full history from any single artifact, without downloading every run's artifact individually (each artifact is cumulative).
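A sketch of the structs the reporting workflow might use to consume this schema (names are illustrative):

```go
package report

import (
	"encoding/json"
	"os"
	"time"
)

// Assignment mirrors one entry of the proposed assignments array.
type Assignment struct {
	RunID         string    `json:"run_id"`
	RunAt         time.Time `json:"run_at"`
	Experiment    string    `json:"experiment"`
	Variant       string    `json:"variant"`
	Conclusion    string    `json:"conclusion"`
	DurationMS    int64     `json:"duration_ms"`
	PrimaryMetric float64   `json:"primary_metric"`
}

// State mirrors the proposed state.json layout.
type State struct {
	Counts      map[string]map[string]int `json:"counts"`
	Assignments []Assignment              `json:"assignments"`
}

// loadState reads a downloaded state.json artifact; because each artifact is
// cumulative, the latest one suffices to reconstruct the full history.
func loadState(path string) (*State, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var s State
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

// metricsByVariant groups primary_metric samples of successful runs for one
// experiment, ready to feed the Welch t-test sketch from Area 2.
func metricsByVariant(s *State, experiment string) map[string][]float64 {
	out := map[string][]float64{}
	for _, a := range s.Assignments {
		if a.Experiment == experiment && a.Conclusion == "success" {
			out[a.Variant] = append(out[a.Variant], a.PrimaryMetric)
		}
	}
	return out
}
```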
## Implementation Steps

1. Extend `extractExperimentsFromFrontmatter` in `pkg/workflow/compiler_experiments.go` to support the enhanced object schema alongside the existing bare-array form
2. Update `pick_experiment.cjs` to read `weight`, `start_date`, `end_date`, and `description` from the spec; emit these in the step summary and append to the `assignments` array in `state.json`
3. Add `--experiment` and `--variant` filter flags to `gh aw audit`
4. Emit `experiment.*` OTEL span attributes in the compiled workflow step template
5. Add a `daily-experiment-report.md` workflow that aggregates artifacts, computes statistics, and posts to a discussion
6. Update the frontmatter schema (`schemas/`) to document the new fields
7. Add tests for `pick_experiment.cjs`
## References

- `pkg/workflow/compiler_experiments.go`
- `actions/setup/js/pick_experiment.cjs`
- Related to: [ab-advisor] Experiment campaign for daily-community-attribution: A/B test prompt_style (#29602)
Generated by Daily A/B Testing Advisor