feat: experiment infrastructure — schema additions, per-run state, OTEL attributes, audit accuracy #29985

Merged
pelikhan merged 8 commits into main from copilot/improve-experiment-infrastructure-again
May 3, 2026

Copilot AI commented May 3, 2026

The experiment infrastructure lacked traceability (state.json had no per-run history), OTEL observability, and three useful schema fields (analysis_type, tags, notify).

Schema additions (1a/1b/1c)

New fields on the experiments object form:

experiments:
  prompt_style:
    variants: [concise, detailed]
    analysis_type: t_test          # t_test | mann_whitney | proportion_test | bayesian_ab
    tags: [cost, prompting]
    notify:
      issue: 5678                  # post significance alert here (favors issue over discussion)
  • ExperimentConfig Go struct, main_workflow_schema.json, and JS typedef updated in sync
  • Extracted extractIntField() helper in the Go compiler to remove duplicated numeric coercion; now validates inputs (rejects negatives, non-integral float64, and out-of-range values)

Per-run metadata in state.json (1d)

pick_experiment.cjs now appends a run record to state.runs on every invocation:

{
  "counts": { "prompt_style": { "concise": 18, "detailed": 17 } },
  "runs": [
    { "run_id": "25276903561", "timestamp": "2026-05-03T10:41:00.000Z", "assignments": { "prompt_style": "concise" } }
  ]
}

state.runs is capped to the last 512 entries (MAX_RUN_HISTORY) to prevent unbounded growth in state.json size over time.
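
The append-and-cap behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from pick_experiment.cjs: the helper name appendRunRecord and its parameter order are hypothetical, and only MAX_RUN_HISTORY, state.runs, and the record fields (run_id, timestamp, assignments) come from the description.

```javascript
// Hypothetical sketch: only MAX_RUN_HISTORY, state.runs, and the record
// fields (run_id, timestamp, assignments) are taken from the PR description.
const MAX_RUN_HISTORY = 512;

/**
 * Append a run record and keep only the newest MAX_RUN_HISTORY entries.
 * @param {{counts: object, runs?: object[]}} state
 * @param {Record<string, string>} assignments
 * @param {string} runId
 * @param {Date} [now]
 */
function appendRunRecord(state, assignments, runId, now = new Date()) {
  // Nothing was assigned, so there is nothing to trace.
  if (Object.keys(assignments).length === 0) return state;
  // Legacy state files have no runs field yet.
  if (!Array.isArray(state.runs)) state.runs = [];
  state.runs.push({ run_id: runId, timestamp: now.toISOString(), assignments: { ...assignments } });
  if (state.runs.length > MAX_RUN_HISTORY) {
    // Drop the oldest entries so state.json stays bounded.
    state.runs = state.runs.slice(-MAX_RUN_HISTORY);
  }
  return state;
}

const state = { counts: { prompt_style: { concise: 18, detailed: 17 } } };
for (let i = 0; i < 600; i++) {
  appendRunRecord(state, { prompt_style: "concise" }, String(i));
}
console.log(state.runs.length); // 512
console.log(state.runs[0].run_id); // 88 (oldest retained run)
```

Trimming with slice(-MAX_RUN_HISTORY) keeps the most recent records, which is what the audit path needs since it reads the last entry.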

extractExperimentData in audit_report_experiments.go now reads the last run record directly when present, replacing the fragile max-count heuristic. Legacy state files (no runs field) fall back to the heuristic transparently.
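
The lookup order can be illustrated with a short JavaScript sketch. The actual change is in Go (extractExperimentData); the function name latestAssignments, the source field, and the legacyHeuristic callback below are hypothetical stand-ins, and the max-count heuristic itself is deliberately left as a placeholder because its details are not given here.

```javascript
// Illustrative JavaScript port: the PR implements this in Go (extractExperimentData).
// `legacyHeuristic` is a hypothetical stand-in for the existing max-count heuristic.
function latestAssignments(state, legacyHeuristic) {
  const runs = Array.isArray(state.runs) ? state.runs : [];
  const lastRun = runs[runs.length - 1];
  // Prefer the most recent run record when it carries at least one assignment.
  if (lastRun && lastRun.assignments && Object.keys(lastRun.assignments).length > 0) {
    return { source: "runs", assignments: lastRun.assignments, cumulativeCounts: state.counts };
  }
  // Legacy state (no runs), an empty runs array, or an empty last record: fall back.
  return { source: "heuristic", assignments: legacyHeuristic(state.counts || {}), cumulativeCounts: state.counts };
}

const counts = { prompt_style: { concise: 18, detailed: 17 } };
const stubHeuristic = () => ({ prompt_style: "concise" }); // placeholder only

const fromRuns = latestAssignments(
  {
    counts,
    runs: [{ run_id: "25276903561", timestamp: "2026-05-03T10:41:00.000Z", assignments: { prompt_style: "detailed" } }],
  },
  stubHeuristic
);
const fromLegacy = latestAssignments({ counts }, stubHeuristic);

console.log(fromRuns.source, fromRuns.assignments.prompt_style); // runs detailed
console.log(fromLegacy.source, fromLegacy.assignments.prompt_style); // heuristic concise
```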

OTEL resource attributes (3a)

After variant selection, pick_experiment.cjs exports experiment assignments into OTEL_RESOURCE_ATTRIBUTES:

experiment.prompt_style=concise,experiment.model_temp=low

The assignments are appended to any pre-existing value, so other OTEL instrumentation is not clobbered.
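
A minimal sketch of the append behaviour, assuming experiment names and variants contain no commas or equals signs (which would otherwise need escaping). The helper name mergeOtelResourceAttributes is hypothetical, and how the resulting string is exported to later steps is outside this sketch.

```javascript
// Hypothetical helper: builds the OTEL_RESOURCE_ATTRIBUTES value without
// discarding whatever was already set by other instrumentation.
function mergeOtelResourceAttributes(existing, assignments) {
  const pairs = Object.entries(assignments).map(([name, variant]) => `experiment.${name}=${variant}`);
  // No experiments assigned: leave the existing value untouched.
  if (pairs.length === 0) return existing || "";
  return existing ? `${existing},${pairs.join(",")}` : pairs.join(",");
}

const fresh = mergeOtelResourceAttributes("", { prompt_style: "concise", model_temp: "low" });
const appended = mergeOtelResourceAttributes("service.name=agent", { prompt_style: "concise" });

console.log(fresh); // experiment.prompt_style=concise,experiment.model_temp=low
console.log(appended); // service.name=agent,experiment.prompt_style=concise
```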

Step summary format (3c)

The summary table now shows Counts (current/total): the count for the selected variant over the total number of assigned runs (derived from state.runs when present, falling back to the incremented counts). The total therefore reflects all runs, including those outside the active date window.
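
One way the current/total cell could be computed is sketched below. The function name summaryCounts is hypothetical, and the rule used for the runs-based total (counting run records that carry an assignment for the experiment) is an assumption, since the exact derivation is not spelled out here.

```javascript
// Hypothetical sketch of the "current / total" cell in the step summary.
function summaryCounts(state, name, selected) {
  const counts = (state.counts && state.counts[name]) || {};
  const current = counts[selected] || 0;
  const runs = Array.isArray(state.runs) ? state.runs : [];
  let total;
  if (runs.length > 0) {
    // Assumption: total = run records that include an assignment for this experiment.
    total = runs.filter(r => r.assignments && name in r.assignments).length;
  } else {
    // Legacy fallback: sum the per-variant counters.
    const values = /** @type {number[]} */ (Object.values(counts));
    total = values.reduce((a, b) => a + b, 0);
  }
  return `${current} / ${total}`;
}

const withRuns = {
  counts: { prompt_style: { concise: 2, detailed: 1 } },
  runs: [
    { run_id: "1", timestamp: "2026-05-03T10:00:00.000Z", assignments: { prompt_style: "concise" } },
    { run_id: "2", timestamp: "2026-05-03T10:20:00.000Z", assignments: { prompt_style: "detailed" } },
    { run_id: "3", timestamp: "2026-05-03T10:41:00.000Z", assignments: { prompt_style: "concise" } },
  ],
};
const legacyState = { counts: { prompt_style: { concise: 18, detailed: 17 } } };

console.log(summaryCounts(withRuns, "prompt_style", "concise")); // 2 / 3
console.log(summaryCounts(legacyState, "prompt_style", "concise")); // 18 / 35
```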

Copilot AI and others added 3 commits May 3, 2026 18:41
…n metadata, and OTEL attributes

- Schema additions (1a/1b/1c): add analysis_type, tags, notify fields to ExperimentConfig Go struct, JSON schema, and JS typedef
- Per-run metadata (1d): extend state.json with runs array tracking run_id, timestamp, and assignments per invocation; update audit extraction to prefer direct run records over count heuristic
- OTEL attributes (3a): emit experiment.NAME=VARIANT as OTEL_RESOURCE_ATTRIBUTES after each run
- Step summary (3c): update table to show Counts (this/total) format
- Tests: update and extend JS and Go unit tests for all new features

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/0d88e2f1-f86f-43f6-bda4-5e39f2e410a2

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…ify JSDoc, improve comment

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/0d88e2f1-f86f-43f6-bda4-5e39f2e410a2

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…ment, improve summary header wording

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/0d88e2f1-f86f-43f6-bda4-5e39f2e410a2

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Improve experiment infrastructure for schema, reporting, and audit" to "feat: experiment infrastructure — schema additions, per-run state, OTEL attributes, audit accuracy" May 3, 2026
Copilot AI requested a review from pelikhan May 3, 2026 18:49
@pelikhan pelikhan marked this pull request as ready for review May 3, 2026 18:52
Copilot AI review requested due to automatic review settings May 3, 2026 18:52
Copilot AI left a comment

Pull request overview

Enhances the experiment infrastructure to improve traceability, schema richness, observability (OTEL), and audit accuracy by recording per-run assignments and adding new experiment metadata fields.

Changes:

  • Add analysis_type, tags, and notify fields to the experiment rich schema across Go types, JSON schema, and JS typedefs.
  • Persist per-run assignment history in state.json and update audit extraction to use the latest run record when present (fallback to legacy heuristic).
  • Export experiment assignments into OTEL_RESOURCE_ATTRIBUTES and update the step summary table format.
Show a summary per file
| File | Description |
| --- | --- |
| pkg/workflow/frontmatter_types.go | Extends ExperimentConfig with analysis_type, tags, and notify (new ExperimentNotify type). |
| pkg/workflow/compiler_experiments.go | Parses the new experiment fields and factors numeric coercion into extractIntField(). |
| pkg/parser/schemas/main_workflow_schema.json | Updates the workflow schema to include the new experiment fields and an example. |
| pkg/cli/workflows/test-experiments-rich-schema.md | Updates the rich-schema test workflow frontmatter with new experiment fields. |
| pkg/cli/compile_experiments_integration_test.go | Adds integration assertions for analysis_type, tags, and notify. |
| pkg/cli/audit_report_experiments_test.go | Adds tests for extractExperimentData using the new runs history and legacy fallback. |
| pkg/cli/audit_report_experiments.go | Adds runs support to state parsing and uses last run record when available. |
| actions/setup/js/pick_experiment.test.cjs | Extends JS tests for legacy state handling, per-run run history, and OTEL env exporting. |
| actions/setup/js/pick_experiment.cjs | Writes per-run records into state.runs, exports OTEL resource attributes, and changes the summary table to current/total counts. |

Copilot's findings

  • Files reviewed: 9/9 changed files
  • Comments generated: 5

Comment on lines +97 to +108
// When per-run records are available, use the most recent run's assignments directly
// instead of inferring them from cumulative counts.
if len(state.Runs) > 0 {
    lastRun := state.Runs[len(state.Runs)-1]
    if len(lastRun.Assignments) > 0 {
        experimentDataLog.Printf("Using run record from run_id=%s (timestamp=%s)", lastRun.RunID, lastRun.Timestamp)
        return &ExperimentData{
            Assignments:      lastRun.Assignments,
            CumulativeCounts: state.Counts,
        }
    }
}
Comment thread actions/setup/js/pick_experiment.cjs Outdated
lines.push(`| \`${name}\` | **${selected}** | ${variants.join(", ")} | ${countsStr} |`);
const thisCount = counts[selected] || 0;
// counts values are always numbers (set by recordVariant which uses integer arithmetic).
const countValues = /** @type {number[]} */ Object.values(counts);
Comment on lines 219 to +225
const selected = assignments[name];
const variants = configs[name]?.variants || [];
const counts = state.counts[name] || {};
const countsStr = variants.map(v => `${v}: ${counts[v] || 0}`).join(", ");
lines.push(`| \`${name}\` | **${selected}** | ${variants.join(", ")} | ${countsStr} |`);
const thisCount = counts[selected] || 0;
// counts values are always numbers (set by recordVariant which uses integer arithmetic).
const countValues = /** @type {number[]} */ Object.values(counts);
const totalCount = countValues.reduce((a, b) => a + b, 0);
lines.push(`| \`${name}\` | **${selected}** | ${thisCount} / ${totalCount} |`);
Comment on lines +379 to +387
if (Object.keys(assignments).length > 0) {
  // Append a per-run record to state.runs so each assignment is traceable.
  const runId = process.env.GITHUB_RUN_ID || "";
  const timestamp = new Date().toISOString();
  if (!state.runs) {
    state.runs = [];
  }
  state.runs.push({ run_id: runId, timestamp, assignments: { ...assignments } });
}
Comment on lines +171 to +186
// extractIntField converts a numeric any value to int.
// Returns (int(value), true) on success; (0, false) when val is nil or not a supported numeric type.
// Note: uint64 values larger than math.MaxInt are truncated on 32-bit systems (int is 32-bit there).
// Experiment counts and IDs are expected to be well within int range, so this is acceptable.
func extractIntField(val any) (int, bool) {
    switch n := val.(type) {
    case int:
        return n, true
    case int64:
        return int(n), true
    case uint64:
        return int(n), true
    case float64:
        return int(n), true
    }
    return 0, false
…egration, and schema extensions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
github-actions Bot commented May 3, 2026

Commit pushed: 33222ed

🏗️ ADR gate enforced by Design Decision Gate 🏗️

github-actions Bot commented May 3, 2026

🏗️ Design Decision Gate — ADR Required

This PR makes significant changes to core business logic (230 new lines in pkg/ and actions/) but does not have a linked Architecture Decision Record (ADR).

AI has analyzed the PR diff and generated a draft ADR to help you get started:

📄 Draft ADR: docs/adr/29985-experiment-per-run-state-otel-integration-and-schema-extensions.md

The draft covers the three interlocking decisions this PR makes:

  1. Per-run state in state.json — each variant selection is appended as a record with run_id, timestamp, and assignments; the Go audit reporter reads the last record directly instead of using the fragile max-count heuristic.
  2. OTEL resource attributes — experiment assignments are exported into OTEL_RESOURCE_ATTRIBUTES so every span in the job carries the cohort for filtering in Honeycomb/Grafana.
  3. Schema extensions (analysis_type, tags, notify) — three new optional fields added to the experiment object form, propagated through the JSON schema, Go structs, and GH_AW_EXPERIMENT_SPEC.

What to do next

  1. Review the draft ADR committed to your branch — it was generated from the PR diff

  2. Complete any missing sections — add context the AI could not infer, refine the decision rationale, and verify the listed alternatives match what was actually considered

  3. Commit the finalized ADR to docs/adr/ on your branch

  4. Reference the ADR in this PR body by adding a line such as:

    ADR: ADR-29985: Experiment Per-Run State, OTEL Integration, and Schema Extensions

Once an ADR is linked in the PR body, this gate will re-run and verify the implementation matches the decision.

Why ADRs Matter

ADRs create a searchable, permanent record of why the codebase looks the way it does. Future contributors (and your future self) will thank you.


📋 Michael Nygard ADR Format Reference

An ADR must contain these four sections to be considered complete:

  • Context — What is the problem? What forces are at play?
  • Decision — What did you decide? Why?
  • Alternatives Considered — What else could have been done?
  • Consequences — What are the trade-offs (positive and negative)?

All ADRs are stored in docs/adr/ as Markdown files numbered by PR number.

🔒 This PR cannot merge until an ADR is linked in the PR body.

References: §25287740913

🏗️ ADR gate enforced by Design Decision Gate 🏗️ · ● 223.7K ·

github-actions Bot commented May 3, 2026

test

🧪 Test quality analysis by Test Quality Sentinel · ● 825.7K ·

@github-actions github-actions Bot left a comment

🧪 Test Quality Sentinel Report

Test Quality Score: 78/100

⚠️ Acceptable — some edge-case coverage gaps

| Metric | Value |
| --- | --- |
| New/modified tests analyzed | 15 |
| ✅ Design tests (behavioral contracts) | 15 (100%) |
| ⚠️ Implementation tests (low value) | 0 (0%) |
| Tests with error/edge cases | 9 (60%) |
| Duplicate test clusters | 0 |
| Test inflation detected | Yes — 2 files exceed 2:1 ratio |
| 🚨 Coding-guideline violations | 0 |

Test Classification Details

View all 15 tests
| Test | File | Classification | Notes |
| --- | --- | --- | --- |
| "returns empty state when file does not exist" | pick_experiment.test.cjs | ✅ Design | Edge case: nonexistent file |
| "returns empty state on invalid JSON" | pick_experiment.test.cjs | ✅ Design | Edge case: invalid JSON |
| "round-trips state through save and load" | pick_experiment.test.cjs | ✅ Design | Happy path — save/load round-trip |
| "initialises runs to [] when loading legacy state without runs field" | pick_experiment.test.cjs | ✅ Design | Edge case: backward-compat migration |
| "preserves existing runs array when loading state" | pick_experiment.test.cjs | ✅ Design | Happy path — state fidelity |
| "appends a run record to state.runs after picking variants" | pick_experiment.test.cjs | ✅ Design | Happy path — new behavioral contract |
| "accumulates run records across multiple runs" | pick_experiment.test.cjs | ✅ Design | Happy path — multi-run accumulation |
| "does not append a run record when no experiments are assigned" | pick_experiment.test.cjs | ✅ Design | Edge case: empty spec |
| "exports OTEL_RESOURCE_ATTRIBUTES with experiment assignments" | pick_experiment.test.cjs | ✅ Design | Happy path — OTEL integration |
| "appends to existing OTEL_RESOURCE_ATTRIBUTES" | pick_experiment.test.cjs | ✅ Design | Edge case: pre-existing env var |
| "does not export OTEL_RESOURCE_ATTRIBUTES when no experiments are assigned" | pick_experiment.test.cjs | ✅ Design | Edge case: empty spec |
| TestExtractExperimentDataWithRuns / "uses last run record when runs array is present" | audit_report_experiments_test.go | ✅ Design | Happy path — new lookup strategy |
| TestExtractExperimentDataWithRuns / "falls back to heuristic when runs array is empty" | audit_report_experiments_test.go | ✅ Design | Edge case: empty runs |
| TestExtractExperimentDataWithRuns / "falls back to heuristic when runs field is absent (legacy state)" | audit_report_experiments_test.go | ✅ Design | Edge case: legacy state compat |
| TestExtractExperimentDataWithRuns / "skips last run record with empty assignments" | audit_report_experiments_test.go | ✅ Design | Edge case: empty assignments map |

Test Inflation (Advisory)

Two test files have a line-addition ratio > 2:1 compared to their production counterparts:

| Test file | Test lines added | Production lines added | Ratio |
| --- | --- | --- | --- |
| pick_experiment.test.cjs | +108 | +42 | 2.6:1 |
| audit_report_experiments_test.go | +90 | +23 | 3.9:1 |

This is not a quality failure here — the new tests are substantive and cover genuine behavioral contracts, not padding. However, the high ratio for audit_report_experiments_test.go (3.9:1) is worth noting: 90 lines of tests for 23 lines of production code. Inspect whether all four subtests are truly exercising distinct branches, or whether some can be collapsed.

Missing Coverage (Suggestions)

A few scenarios not yet covered by the new tests:

  1. pick_experiment.test.cjs — OTEL attribute format: No test verifies the exact experiment.<name>=<variant> key format for multi-experiment specs (e.g., two features assigned simultaneously). The current tests only exercise single-feature specs.
  2. audit_report_experiments_test.go — multi-feature runs records: TestExtractExperimentDataWithRuns only tests single-feature assignment maps. A test with assignments: {"a": "X", "b": "Y"} would confirm the function handles all features in the last run record.
  3. compile_experiments_integration_test.go — no new test function: The 15 new lines are additional assertions appended to the existing TestCompileExperimentsRichSchema. These are correctly structured and high value; consider splitting into a dedicated sub-test for analysis_type/tags/notify fields to improve isolation.

Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 5 test scenarios (4 via table-driven TestExtractExperimentDataWithRuns + integration assertions) — unit (//go:build !integration) and integration (//go:build integration) build tags both present ✅
  • 🟨 JavaScript (*.test.cjs): 11 tests (vitest, vi.fn() / mockCore for external I/O — legitimate mocking targets)

Verdict

Check passed. 0% of new tests are implementation tests (threshold: 30%). All 15 tests verify behavioral contracts. Test inflation is advisory only — the tests themselves are substantive. No coding-guideline violations detected.


📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

pelikhan commented May 3, 2026

@copilot merge main and recompile

Review all comments.
Favor issues for notify.

Copilot AI and others added 2 commits May 3, 2026 19:12
…iment-infrastructure-again

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…validate extractIntField, favor issue in notify, robust tests

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/394e4a02-4cb4-4c79-b88f-417d7c638103

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI commented May 3, 2026

@copilot merge main and recompile

Review all comments....

Done in b1e7cb8:

  • Merged main and recompiled all workflows
  • Favor issue in notify: updated schema example, test fixture, and integration test to use notify.issue
  • All review comments addressed:
    • extractExperimentData doc comment updated to describe the new runs-based path and legacy fallback
    • extractIntField now rejects negatives, non-integral float64, and out-of-range values
    • totalCount in step summary now counts from state.runs when present (no more 0/0 for date-gated runs)
    • state.runs is capped to last 100 entries (MAX_RUN_HISTORY) to prevent unbounded growth
    • JSDoc cast fixed to (/** @type {number[]} */ (expr)) parenthesized form

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/graphql
    • Triggering command: /usr/bin/gh /usr/bin/gh api graphql -f query=query($owner: String!, $name: String!) { repository(owner: $owner, name: $name) { hasDiscussionsEnabled } } -f owner=github -f name=gh-aw (http block)
    • Triggering command: /usr/bin/gh /usr/bin/gh api graphql -f query=query($owner: String!, $name: String!) { repository(owner: $owner, name: $name) { hasDiscussionsEnabled } } -f owner=github -f name=gh-aw %H %ct %D (http block)
    • Triggering command: /usr/bin/gh /usr/bin/gh api graphql -f query=query($owner: String!, $name: String!) { repository(owner: $owner, name: $name) { hasDiscussionsEnabled } } -f owner=github -f name=gh-aw b/gh-aw/pkg/workapi (http block)
  • https://api.github.com/repos/actions/checkout/git/ref/tags/v6
    • Triggering command: /usr/bin/gh gh api /repos/actions/checkout/git/ref/tags/v6 --jq [.object.sha, .object.type] | @tsv nt (http block)
    • Triggering command: /usr/bin/gh gh api /repos/actions/checkout/git/ref/tags/v6 --jq [.object.sha, .object.type] | @tsv ithub/workflows (http block)
    • Triggering command: /usr/bin/gh gh api /repos/actions/checkout/git/ref/tags/v6 --jq [.object.sha, .object.type] | @tsv /ref/tags/v9 (http block)
  • https://api.github.com/repos/actions/download-artifact/git/ref/tags/v8
    • Triggering command: /usr/bin/gh gh api /repos/actions/download-artifact/git/ref/tags/v8 --jq [.object.sha, .object.type] | @tsv vitest run $name) { hasDiscussionsEnabled } } ignore (http block)
    • Triggering command: /usr/bin/gh gh api /repos/actions/download-artifact/git/ref/tags/v8 --jq [.object.sha, .object.type] | @tsv "vitest" run --no-file-parallelism pick_experiment (http block)
  • https://api.github.com/repos/actions/github-script/git/ref/tags/v9
    • Triggering command: /usr/bin/gh gh api /repos/actions/github-script/git/ref/tags/v9 --jq [.object.sha, .object.type] | @tsv nt (http block)
    • Triggering command: /usr/bin/gh gh api /repos/actions/github-script/git/ref/tags/v9 --jq [.object.sha, .object.type] | @tsv (http block)
    • Triggering command: /usr/bin/gh gh api /repos/actions/github-script/git/ref/tags/v9 --jq [.object.sha, .object.type] | @tsv ns.json and pkg/workflow/data/action_pins.json..." structure-again de r (http block)
  • https://api.github.com/repos/actions/setup-node/git/ref/tags/v6
    • Triggering command: /usr/bin/gh gh api /repos/actions/setup-node/git/ref/tags/v6 --jq [.object.sha, .object.type] | @tsv (http block)
    • Triggering command: /usr/bin/gh gh api /repos/actions/setup-node/git/ref/tags/v6 --jq [.object.sha, .object.type] | @tsv te 'scripts/**/*.js' --ignore-path .prettierignore --log-level=e!../../../pkg/workflow/js/**/*.jinfocmp (http block)
  • https://api.github.com/repos/astral-sh/setup-uv/git/ref/tags/eac588ad8def6316056a12d4907a9d4d84ff7a3b
    • Triggering command: /usr/bin/gh gh api /repos/astral-sh/setup-uv/git/ref/tags/eac588ad8def6316056a12d4907a9d4d84ff7a3b --jq [.object.sha, .object.type] | @tsv w/js/**/*.json' --ignore-path (http block)
  • https://api.github.com/repos/docker/metadata-action/git/ref/tags/v6
    • Triggering command: /usr/bin/gh gh api /repos/docker/metadata-action/git/ref/tags/v6 --jq [.object.sha, .object.type] | @tsv .js' --ignore-paremote.origin.url (http block)
    • Triggering command: /usr/bin/gh gh api /repos/docker/metadata-action/git/ref/tags/v6 --jq [.object.sha, .object.type] | @tsv .js' --ignore-path .prettierignoremote.origin.url (http block)
  • https://api.github.com/repos/docker/setup-buildx-action/git/ref/tags/v4
    • Triggering command: /usr/bin/gh gh api /repos/docker/setup-buildx-action/git/ref/tags/v4 --jq [.object.sha, .object.type] | @tsv .js' --ignore-path .prettierignore --log-level=error security ode -nxv (http block)
    • Triggering command: /usr/bin/gh gh api /repos/docker/setup-buildx-action/git/ref/tags/v4 --jq [.object.sha, .object.type] | @tsv .js' --ignore-path .prettierigno-f nux-x86_64 ules/.bin/node (http block)
  • https://api.github.com/repos/github/gh-aw
    • Triggering command: /usr/bin/gh gh api /repos/github/gh-aw --jq .default_branch ithub/workflows --no-file-parallowner=github /home/REDACTED/.lo-f (http block)

If you need me to access, download, or install something from one of these locations, you can either:

pelikhan commented May 3, 2026

@copilot set MAX_RUN_HISTORY to 512

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/9acb139f-07b1-4e87-bc05-7d0026cfdaad

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI commented May 3, 2026

@copilot set MAX_RUN_HISTORY to 512

Done in 505ee93: MAX_RUN_HISTORY updated to 512.

@pelikhan pelikhan merged commit bc8b014 into main May 3, 2026
@pelikhan pelikhan deleted the copilot/improve-experiment-infrastructure-again branch May 3, 2026 19:38
Development

Successfully merging this pull request may close these issues.

[ab-advisor] Improve experiment infrastructure: schema, reporting & audit

3 participants