[ab-advisor] Improve experiment infrastructure: schema, reporting & audit #34635

@github-actions

🔧 Experiment Infrastructure Improvements

Parent campaign: #aw_campaign1
Triggered by: ab-testing-advisor on 2026-05-25

Area 1 gate: All three candidate fields (analysis_type, tags, notify) were checked via field-presence-checker. All three are partially implemented — parsed by the Go compiler but never acted on by pick_experiment.cjs at runtime. This sub-issue is therefore warranted.


Area 1: Frontmatter Schema — Complete the Half-Implemented Fields

The field-presence-checker found all three fields parsed on the Go side but dead on the JS runtime side:

| Field | Go compiler | pick_experiment.cjs | Gap |
|---|---|---|---|
| `analysis_type` | ✅ Parsed → `cfg.AnalysisType` | ❌ JSDoc-only, never read | Picker never selects or surfaces the declared test type |
| `tags` | ✅ Parsed → `cfg.Tags` | ❌ JSDoc-only, never filtered | Tags cannot be used to filter/group experiments in dashboards |
| `notify` | ✅ Parsed → `cfg.Notify{Discussion,Issue}` | ❌ JSDoc-only, no alerts dispatched | Significance alerts are never sent |

Proposed pick_experiment.cjs changes

analysis_type — surface in the step summary so downstream analysis scripts know which test to apply:

// In writeSummary(), add:
if (config.analysis_type) {
  core.summary.addRaw(`- **Analysis type**: \`${config.analysis_type}\`\n`);
}

tags — write tags into the experiment state artifact so dashboards can filter:

// In writeState(), add to the state object:
tags: config.tags ?? [],

notify — dispatch a notification when min_samples is reached for all variants:

// After variant counts are updated, check if experiment is mature:
if (config.notify && allVariantsReachedMinSamples(state, config)) {
  await dispatchNotification(config.notify, experimentName, state);
}

The dispatchNotification function should post a comment to the issue in notify.issue or a discussion reply to notify.discussion, summarising current variant tallies and prompting a human to run analysis.
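
The maturity check and the comment body described above could look roughly like the sketch below. The shapes of `config` (`variants`, `min_samples`) and `state` (`counts` keyed by variant) are assumptions made for illustration, as is the helper name `formatTallyComment`; the real picker may store these differently.

```javascript
// Hypothetical shapes (not confirmed by the picker source):
//   config = { variants: string[], min_samples: number }
//   state  = { counts: { [variant]: number } }
function allVariantsReachedMinSamples(state, config) {
  const min = config.min_samples;
  // Without a positive integer threshold, never report maturity.
  if (!Number.isInteger(min) || min <= 0) return false;
  const counts = (state && state.counts) || {};
  const variants = config.variants || [];
  return variants.length > 0 && variants.every((v) => (counts[v] || 0) >= min);
}

// Hypothetical helper: builds the text that dispatchNotification would post.
function formatTallyComment(experimentName, state, config) {
  const counts = (state && state.counts) || {};
  const lines = (config.variants || []).map(
    (v) => `- \`${v}\`: ${counts[v] || 0} / ${config.min_samples} runs`
  );
  return [
    `Experiment \`${experimentName}\` has reached min_samples for all variants.`,
    '',
    ...lines,
    '',
    'Please run the analysis.',
  ].join('\n');
}
```

As written, the check stays true on every run after maturity, so the caller would probably want to record a "notified" flag in the state to avoid posting repeatedly.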


Area 2: Reporting & Dashboards

Propose a daily-experiment-report workflow that:

  1. Aggregates run artifacts from the last N days: downloads experiments/state.json from each workflow run and merges variant counters per experiment name.
  2. Computes running statistics: for proportion metrics — sample size, observed rate, Wilson confidence interval per variant; for continuous metrics — mean, variance, Mann-Whitney U statistic.
  3. Detects significance: flags when p < 0.05 (two-sample z-test for proportions; Mann-Whitney for continuous). Logs the current leading variant.
  4. Generates an ASCII comparison table as a workflow step summary artifact:
Experiment: daily-doc-healer / prompt_style  (n=18 detailed, n=17 concise)
┌──────────┬──────────────────┬──────────┬──────────┐
│ Variant  │ pr_creation_rate │  tokens  │  p-value │
├──────────┼──────────────────┼──────────┼──────────┤
│ detailed │  0.78 [0.55,0.92]│  12 400  │          │
│ concise  │  0.76 [0.53,0.91]│   9 800* │  0.031 ✓ │
└──────────┴──────────────────┴──────────┴──────────┘
* statistically significant at p<0.05
  5. Posts to a discussion with label experiment-results, including the table and the current winning variant.
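
For the proportion case in steps 2 and 3, a minimal Node sketch of the Wilson interval and the pooled two-sample z-test is shown here. The function names are illustrative, not part of gh-aw, and the normal CDF uses a standard polynomial approximation of erf rather than a statistics library. With samples as small as the n=18 / n=17 in the example table, the z-test is only a rough approximation.

```javascript
// Wilson score interval for a binomial proportion (z = 1.96 gives ~95% coverage).
function wilsonInterval(successes, n, z = 1.96) {
  if (n === 0) return { rate: NaN, low: 0, high: 1 };
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { rate: p, low: center - half, high: center + half };
}

// erf via Abramowitz & Stegun 7.1.26 (absolute error around 1.5e-7).
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

const normalCdf = (x) => 0.5 * (1 + erf(x / Math.SQRT2));

// Two-sided, pooled two-proportion z-test.
function twoProportionZTest(s1, n1, s2, n2) {
  const pooled = (s1 + s2) / (n1 + n2);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  if (se === 0) return { z: 0, pValue: 1 }; // identical all-success or all-failure samples
  const z = (s1 / n1 - s2 / n2) / se;
  return { z, pValue: 2 * (1 - normalCdf(Math.abs(z))) };
}
```

The Mann-Whitney test for continuous metrics is not covered here; it needs the raw per-run values rather than merged counters.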

Implementation checklist:

  • Create .github/workflows/daily-experiment-report.md
  • Script: download artifacts via gh run download --name experiments-state
  • Script: merge JSON state files and compute statistics (Python or Node)
  • Wire analysis_type field so the report uses the declared test (once Area 1 is implemented)
  • Post result to discussion via safeoutputs
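
The merge step in the checklist might be sketched as follows. The layout of `state.json` assumed here (experiment name mapped to `counts` and `tags`) is a guess for illustration, and summing is only correct if each artifact records that run's own increments rather than cumulative totals.

```javascript
// Assumed shape of each downloaded state.json (not confirmed by the source):
//   { "<experiment>": { "counts": { "<variant>": <number> }, "tags": [...] } }
function mergeStates(states) {
  const merged = {};
  for (const state of states) {
    for (const [name, exp] of Object.entries(state || {})) {
      const target = (merged[name] ??= { counts: {}, tags: [] });
      // Sum per-variant counters across runs.
      for (const [variant, n] of Object.entries(exp.counts || {})) {
        target.counts[variant] = (target.counts[variant] || 0) + n;
      }
      // Keep the union of tags, preserving first-seen order.
      for (const tag of exp.tags || []) {
        if (!target.tags.includes(tag)) target.tags.push(tag);
      }
    }
  }
  return merged;
}
```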

Area 3: Audit & OTEL Integration

OTEL span attributes — in pick_experiment.cjs, after a variant is assigned, emit span attributes:

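// Append to any existing value so resource attributes set earlier are preserved.
const existing = process.env.OTEL_RESOURCE_ATTRIBUTES;
const attrs = `experiment.name=${experimentName},experiment.variant=${chosenVariant}`;
core.exportVariable('OTEL_RESOURCE_ATTRIBUTES', existing ? `${existing},${attrs}` : attrs);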

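This tags every OTEL span emitted by later steps in the same job with the experiment context (an exported variable is not visible to steps that already ran or to other jobs), enabling trace filtering by variant in Grafana/Jaeger.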
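
Because `OTEL_RESOURCE_ATTRIBUTES` is a comma-separated list of `key=value` pairs, an experiment or variant name containing `,` or `=` would corrupt the list unless it is percent-encoded. A small helper along these lines could build the value; `buildResourceAttributes` is a hypothetical name, not an existing gh-aw or OpenTelemetry function.

```javascript
// Build an OTEL_RESOURCE_ATTRIBUTES value from key/value pairs, percent-encoding
// each side so that ',' and '=' inside a name cannot split the list.
// `existing` is any value already present in the environment (kept verbatim).
function buildResourceAttributes(attrs, existing = '') {
  const pairs = Object.entries(attrs).map(
    ([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`
  );
  return [existing, ...pairs].filter(Boolean).join(',');
}
```

The result would then be passed to `core.exportVariable('OTEL_RESOURCE_ATTRIBUTES', ...)` in place of a hand-built template string.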

gh aw audit surface — add experiment assignment to the audit log entry for each run:

run #12345  workflow=daily-doc-healer  variant=concise  experiment=prompt_style

Filterable via gh aw audit --experiment prompt_style --variant concise.

Step summary: pick_experiment.cjs should already append to $GITHUB_STEP_SUMMARY; ensure it includes:

| Field | Value |
|---|---|
| Experiment | `prompt_style` |
| Assigned variant | `concise` |
| Analysis type | `proportion_test` |
| Run count (this variant) | 7 / 30 min_samples |
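
One way to produce the table above is a pure function that returns the markdown, which the picker can then hand to `core.summary.addRaw()`. The function name and the parameter names are assumptions for illustration.

```javascript
// Hypothetical helper: renders the step-summary table as a markdown string.
function renderSummaryTable({ experiment, variant, analysisType, runCount, minSamples }) {
  const rows = [
    ['Experiment', `\`${experiment}\``],
    ['Assigned variant', `\`${variant}\``],
  ];
  // analysis_type is optional in the frontmatter, so only show it when declared.
  if (analysisType) rows.push(['Analysis type', `\`${analysisType}\``]);
  rows.push(['Run count (this variant)', `${runCount} / ${minSamples} min_samples`]);
  return [
    '| Field | Value |',
    '|---|---|',
    ...rows.map(([field, value]) => `| ${field} | ${value} |`),
  ].join('\n');
}
```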

Implementation checklist:

  • Emit OTEL_RESOURCE_ATTRIBUTES in pick_experiment.cjs after variant selection
  • Update gh aw audit to read and display experiment metadata from run annotations
  • Add experiment fields to step summary output in pick_experiment.cjs
  • Document the --experiment / --variant filter flags in gh aw audit --help

References

  • Compiler: pkg/workflow/compiler_experiments.go
  • Picker: actions/setup/js/pick_experiment.cjs
  • Parent campaign: #aw_campaign1
  • A/B Testing in gh-aw

Generated by 🧪 Daily A/B Testing Advisor · sonnet46 1.7M ·

  • expires on Jun 8, 2026, 11:46 AM UTC
