🔬 Improve Experiment Infrastructure: Schema, Reporting & Audit
Triggered by: ab-testing-advisor on 2026-05-30
Parent campaign: #aw_filedieta
Background
The field-presence-checker agent verified the current state of three candidate schema fields (analysis_type, tags, notify). Two of the three — tags and notify — are only partially implemented: they are parsed and rendered in the picker summary table but have no downstream behavioral effect. This issue tracks completing those fields and proposes concrete reporting and audit-trail improvements.
Field-presence-checker findings summary
| Field | Status | Gap |
|---|---|---|
| analysis_type | ✅ fully implemented | Consumed by experiments_analyze_statistics.go to select the statistical test; no action needed. |
| tags | ⚠️ partial | Parsed and displayed in picker output, but no Go code filters, routes, or acts on tags at runtime. |
| notify | ⚠️ partial | Parsed into ExperimentNotify{Discussion, Issue} struct and displayed in picker output, but no code delivers notifications. |
Area 1: Frontmatter Schema — Complete tags and notify
1a. tags — Add Runtime Filtering
Current gap: cfg.Tags is populated but never consumed outside display.
Proposed change in pkg/workflow/compiler_experiments.go and the CLI:
- Surface tags in the compiled lock-file env block so pick_experiment.cjs can filter experiments by tag when a --tag flag is supplied.
- In pkg/cli/experiments_analyze_statistics.go, allow --tag <label> to restrict analysis to experiments bearing that tag (useful for bulk analysis of cost-reduction or quality campaigns).
- In the daily experiment report workflow, use tags to group experiments by theme in the summary table.
# Example frontmatter usage
experiments:
  prompt_style:
    variants: [detailed, concise]
    tags: [cost-reduction, prompt-engineering]
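The --tag restriction described above could be sketched roughly as follows. This is a minimal illustration, not the project's code: the ExperimentConfig type and the filterByTag helper are hypothetical names, and only the Tags field is taken from the issue.

```go
package main

import "fmt"

// ExperimentConfig is a hypothetical stand-in for the parsed frontmatter entry.
// Only the Tags field is mentioned in the issue; Name is added for illustration.
type ExperimentConfig struct {
	Name string
	Tags []string
}

// filterByTag returns the experiments that carry the given tag.
// An empty tag disables filtering and returns the input unchanged.
func filterByTag(experiments []ExperimentConfig, tag string) []ExperimentConfig {
	if tag == "" {
		return experiments
	}
	var matched []ExperimentConfig
	for _, exp := range experiments {
		for _, t := range exp.Tags {
			if t == tag {
				matched = append(matched, exp)
				break // one match is enough; avoid adding the experiment twice
			}
		}
	}
	return matched
}

func main() {
	experiments := []ExperimentConfig{
		{Name: "prompt_style", Tags: []string{"cost-reduction", "prompt-engineering"}},
		{Name: "retry_policy", Tags: []string{"quality"}},
	}
	// Simulates a --tag cost-reduction invocation.
	for _, exp := range filterByTag(experiments, "cost-reduction") {
		fmt.Println(exp.Name)
	}
}
```

Treating an empty tag as "no filter" keeps the default behavior unchanged for callers that never pass --tag.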
1b. notify — Implement Notification Delivery
Current gap: cfg.Notify.Discussion and cfg.Notify.Issue are parsed but no code posts the notification.
Proposed change in pkg/cli/experiments_analyze_statistics.go (or a new experiments_notify.go):
// After significance is detected, post the same report to each configured target.
if result.PValue < 0.05 {
	report := formatSignificanceReport(result)
	if cfg.Notify.Issue != 0 {
		postIssueComment(cfg.Notify.Issue, report)
	}
	if cfg.Notify.Discussion != 0 {
		postDiscussionComment(cfg.Notify.Discussion, report)
	}
}
The pick_experiment.cjs step summary should also surface notify targets so operators can see at a glance where results will be delivered.
# Example frontmatter usage
experiments:
  prompt_style:
    variants: [detailed, concise]
    notify:
      issue: 1234
Area 2: Reporting & Dashboards
Propose a daily-experiment-report workflow (or extension of the existing one) that:
- Aggregates run data: downloads the experiments/state.json artifact from each recent workflow run via gh run download, extracts variant and outcome metrics per run.
- Computes running statistics: per variant, n, mean, variance, p_value (using analysis_type to select the right test).
- Detects significance: when p_value < 0.05 AND n >= min_samples for all variants, marks the experiment as concluded and identifies the winner.
- Generates a visual table — ASCII comparison table artifact:
┌─────────────────────────────────────────────────┐
│ Experiment: prompt_style (daily-file-diet) │
│ Runs: detailed=52 concise=51 total=103 │
├─────────────────┬───────────────┬───────────────┤
│ Metric │ detailed │ concise │
├─────────────────┼───────────────┼───────────────┤
│ Completeness │ 0.91 ± 0.08 │ 0.88 ± 0.11 │
│ Token count │ 4,820 ± 310 │ 3,940 ± 290 ✅ │
│ Duration (ms) │ 48,200 │ 41,100 │
├─────────────────┴───────────────┴───────────────┤
│ p-value: 0.031 Winner: concise (token savings) │
└─────────────────────────────────────────────────┘
- Posts results: uses cfg.Notify to post the report to the designated discussion or issue, and calls safeoutputs add_comment with the table.
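The running-statistics and significance steps in the list above might look something like the sketch below. It assumes the p-value is computed elsewhere by whichever test analysis_type selects; VariantStats and isConcluded are hypothetical names, and Welford's online algorithm is one reasonable choice for the running mean and variance, not something the issue prescribes.

```go
package main

import "fmt"

// VariantStats accumulates n, mean, and variance incrementally using
// Welford's online algorithm, so metrics can be folded in one run at a time.
type VariantStats struct {
	N    int
	mean float64
	m2   float64 // running sum of squared deviations from the mean
}

// Add folds one observation into the running statistics.
func (s *VariantStats) Add(x float64) {
	s.N++
	delta := x - s.mean
	s.mean += delta / float64(s.N)
	s.m2 += delta * (x - s.mean)
}

// Mean returns the running mean (0 when no observations were added).
func (s *VariantStats) Mean() float64 { return s.mean }

// Variance returns the sample variance (n-1 denominator), or 0 when n < 2.
func (s *VariantStats) Variance() float64 {
	if s.N < 2 {
		return 0
	}
	return s.m2 / float64(s.N-1)
}

// isConcluded applies the rule from the list above: the p-value must be
// below alpha AND every variant must have at least minSamples observations.
func isConcluded(pValue, alpha float64, minSamples int, variants map[string]*VariantStats) bool {
	if len(variants) == 0 || pValue >= alpha {
		return false
	}
	for _, s := range variants {
		if s.N < minSamples {
			return false
		}
	}
	return true
}

func main() {
	detailed, concise := &VariantStats{}, &VariantStats{}
	for _, x := range []float64{4800, 4900, 4750} {
		detailed.Add(x)
	}
	for _, x := range []float64{3900, 4000} {
		concise.Add(x)
	}
	variants := map[string]*VariantStats{"detailed": detailed, "concise": concise}
	// The p-value here is a placeholder; a real run would compute it.
	fmt.Println(isConcluded(0.031, 0.05, 3, variants)) // false: concise has only 2 samples
}
```

Gating on min_samples for every variant, not just the total, prevents an experiment from being declared concluded while one arm is still under-sampled.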
Area 3: Audit & OTEL Integration
3a. OTEL Span Attributes
In pick_experiment.cjs, after assignment, export the experiment as OTEL resource attributes:
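const existingAttrs = process.env.OTEL_RESOURCE_ATTRIBUTES;
const experimentAttrs = `experiment.name=${experimentName},experiment.variant=${assignedVariant}`;
// Append to any existing attributes; skip the separator when none are set,
// so the value never ends in a stray comma or the string "undefined".
core.exportVariable('OTEL_RESOURCE_ATTRIBUTES',
  existingAttrs ? `${existingAttrs},${experimentAttrs}` : experimentAttrs
);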
This surfaces experiment.name and experiment.variant on every span in the run, enabling Grafana/Jaeger dashboards to facet traces by experiment without any post-hoc joining.
3b. gh aw audit Integration
- Add experiment_name and variant columns to the audit log emitted by gh aw audit.
- Enable gh aw audit --experiment prompt_style --variant concise to filter audit entries to only runs of a given variant.
- This allows direct comparison of failure modes (e.g., noop-without-output errors) across variants without needing to join on run IDs externally.
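A rough sketch of the proposed --experiment / --variant filtering is shown below. The AuditEntry type, its fields, and filterAudit are hypothetical; the real audit log schema may differ, and the sample error strings are made up for illustration.

```go
package main

import "fmt"

// AuditEntry is a hypothetical audit row carrying the two proposed columns.
type AuditEntry struct {
	RunID          int64
	ExperimentName string
	Variant        string
	Error          string
}

// filterAudit keeps entries matching the given experiment and variant.
// An empty string for either argument means "do not filter on this field".
func filterAudit(entries []AuditEntry, experiment, variant string) []AuditEntry {
	var out []AuditEntry
	for _, e := range entries {
		if experiment != "" && e.ExperimentName != experiment {
			continue
		}
		if variant != "" && e.Variant != variant {
			continue
		}
		out = append(out, e)
	}
	return out
}

func main() {
	entries := []AuditEntry{
		{RunID: 1, ExperimentName: "prompt_style", Variant: "concise", Error: "noop-without-output"},
		{RunID: 2, ExperimentName: "prompt_style", Variant: "detailed"},
		{RunID: 3, ExperimentName: "prompt_style", Variant: "concise"},
	}
	// Equivalent in spirit to: gh aw audit --experiment prompt_style --variant concise
	for _, e := range filterAudit(entries, "prompt_style", "concise") {
		fmt.Printf("run %d error=%q\n", e.RunID, e.Error)
	}
}
```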
3c. Step Summary Enrichment
In pick_experiment.cjs, append to the GitHub Actions step summary:
### 🧪 Experiment Assignment
| Field | Value |
|---|---|
| Name | `prompt_style` |
| Variant | `concise` |
| Run # | 37 |
| Notify targets | issue #1234 |
| Tags | cost-reduction, prompt-engineering |
This makes experiment metadata immediately visible in the Actions run summary without needing to download artifacts.
Implementation Steps
- tags: Add --tag filtering to gh aw experiments analyze and lock-file env expansion
- notify: Implement notification delivery in experiments_analyze_statistics.go (or new experiments_notify.go)
- Create or extend the daily-experiment-report workflow to aggregate artifacts, compute stats, render ASCII table, and post via notify
- Emit experiment.name / experiment.variant resource attributes from pick_experiment.cjs
- Add experiment_name and variant columns to gh aw audit output, plus --experiment / --variant filter flags
- Enrich the pick_experiment.cjs step summary with full experiment metadata table
References
- A/B Testing in gh-aw
- pkg/workflow/compiler_experiments.go
- actions/setup/js/pick_experiment.cjs
- pkg/cli/experiments_analyze_statistics.go
Generated by 🧪 Daily A/B Testing Advisor · sonnet46 1.7M · ◷