Conversation
…n metadata, and OTEL attributes - Schema additions (1a/1b/1c): add analysis_type, tags, notify fields to ExperimentConfig Go struct, JSON schema, and JS typedef - Per-run metadata (1d): extend state.json with runs array tracking run_id, timestamp, and assignments per invocation; update audit extraction to prefer direct run records over count heuristic - OTEL attributes (3a): emit experiment.NAME=VARIANT as OTEL_RESOURCE_ATTRIBUTES after each run - Step summary (3c): update table to show Counts (this/total) format - Tests: update and extend JS and Go unit tests for all new features" Agent-Logs-Url: https://github.com/github/gh-aw/sessions/0d88e2f1-f86f-43f6-bda4-5e39f2e410a2 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…ify JSDoc, improve comment Agent-Logs-Url: https://github.com/github/gh-aw/sessions/0d88e2f1-f86f-43f6-bda4-5e39f2e410a2 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…ment, improve summary header wording Agent-Logs-Url: https://github.com/github/gh-aw/sessions/0d88e2f1-f86f-43f6-bda4-5e39f2e410a2 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Pull request overview
Enhances the experiment infrastructure to improve traceability, schema richness, observability (OTEL), and audit accuracy by recording per-run assignments and adding new experiment metadata fields.
Changes:
- Add `analysis_type`, `tags`, and `notify` fields to the experiment rich schema across Go types, JSON schema, and JS typedefs.
- Persist per-run assignment history in `state.json` and update audit extraction to use the latest run record when present (falling back to the legacy heuristic).
- Export experiment assignments into `OTEL_RESOURCE_ATTRIBUTES` and update the step summary table format.
Summary per file
| File | Description |
|---|---|
| pkg/workflow/frontmatter_types.go | Extends ExperimentConfig with analysis_type, tags, and notify (new ExperimentNotify type). |
| pkg/workflow/compiler_experiments.go | Parses the new experiment fields and factors numeric coercion into extractIntField(). |
| pkg/parser/schemas/main_workflow_schema.json | Updates the workflow schema to include the new experiment fields and an example. |
| pkg/cli/workflows/test-experiments-rich-schema.md | Updates the rich-schema test workflow frontmatter with new experiment fields. |
| pkg/cli/compile_experiments_integration_test.go | Adds integration assertions for analysis_type, tags, and notify. |
| pkg/cli/audit_report_experiments_test.go | Adds tests for extractExperimentData using the new runs history and legacy fallback. |
| pkg/cli/audit_report_experiments.go | Adds runs support to state parsing and uses last run record when available. |
| actions/setup/js/pick_experiment.test.cjs | Extends JS tests for legacy state handling, per-run run history, and OTEL env exporting. |
| actions/setup/js/pick_experiment.cjs | Writes per-run records into state.runs, exports OTEL resource attributes, and changes the summary table to current/total counts. |
Copilot's findings
- Files reviewed: 9/9 changed files
- Comments generated: 5
```go
// When per-run records are available, use the most recent run's assignments directly
// instead of inferring them from cumulative counts.
if len(state.Runs) > 0 {
	lastRun := state.Runs[len(state.Runs)-1]
	if len(lastRun.Assignments) > 0 {
		experimentDataLog.Printf("Using run record from run_id=%s (timestamp=%s)", lastRun.RunID, lastRun.Timestamp)
		return &ExperimentData{
			Assignments:      lastRun.Assignments,
			CumulativeCounts: state.Counts,
		}
	}
}
```
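The legacy fallback path is not visible in this hunk. As a rough JS sketch of the overall lookup strategy this PR describes (newest run record first, then the legacy max-count heuristic), with illustrative names and shapes rather than the actual Go implementation:

```javascript
// Illustrative sketch only (the real code is the Go hunk above): prefer the
// newest run record's assignments; otherwise fall back to guessing each
// experiment's variant from the highest cumulative count.
function extractAssignments(state) {
  const runs = state.runs || [];
  const last = runs[runs.length - 1];
  if (last && last.assignments && Object.keys(last.assignments).length > 0) {
    return last.assignments; // direct, per-run record
  }
  // Legacy heuristic: pick the variant with the highest count per experiment.
  const guessed = {};
  for (const [name, counts] of Object.entries(state.counts || {})) {
    const best = Object.entries(counts).sort((a, b) => b[1] - a[1])[0];
    if (best) guessed[name] = best[0];
  }
  return guessed;
}
```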
```js
const selected = assignments[name];
const variants = configs[name]?.variants || [];
const counts = state.counts[name] || {};
const countsStr = variants.map(v => `${v}: ${counts[v] || 0}`).join(", ");
lines.push(`| \`${name}\` | **${selected}** | ${variants.join(", ")} | ${countsStr} |`);
const thisCount = counts[selected] || 0;
// counts values are always numbers (set by recordVariant which uses integer arithmetic).
const countValues = /** @type {number[]} */ (Object.values(counts));
const totalCount = countValues.reduce((a, b) => a + b, 0);
lines.push(`| \`${name}\` | **${selected}** | ${thisCount} / ${totalCount} |`);
```
```js
if (Object.keys(assignments).length > 0) {
  // Append a per-run record to state.runs so each assignment is traceable.
  const runId = process.env.GITHUB_RUN_ID || "";
  const timestamp = new Date().toISOString();
  if (!state.runs) {
    state.runs = [];
  }
  state.runs.push({ run_id: runId, timestamp, assignments: { ...assignments } });
}
```
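Later in the thread a cap is requested (`MAX_RUN_HISTORY` = 512). A minimal sketch of how such a cap could wrap the append above; the function name and surrounding shapes are hypothetical, only the constant name comes from the thread:

```javascript
// Hypothetical sketch of the run-history cap discussed in this PR.
const MAX_RUN_HISTORY = 512;

function appendRunRecord(state, runId, assignments) {
  if (Object.keys(assignments).length === 0) return state; // nothing assigned, nothing to record
  if (!state.runs) state.runs = [];
  state.runs.push({
    run_id: runId,
    timestamp: new Date().toISOString(),
    assignments: { ...assignments },
  });
  // Keep only the most recent MAX_RUN_HISTORY records so state.json stays bounded.
  if (state.runs.length > MAX_RUN_HISTORY) {
    state.runs = state.runs.slice(-MAX_RUN_HISTORY);
  }
  return state;
}
```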
```go
// extractIntField converts a numeric any value to int.
// Returns (int(value), true) on success; (0, false) when val is nil or not a supported numeric type.
// Note: uint64 values larger than math.MaxInt are truncated on 32-bit systems (int is 32-bit there).
// Experiment counts and IDs are expected to be well within int range, so this is acceptable.
func extractIntField(val any) (int, bool) {
	switch n := val.(type) {
	case int:
		return n, true
	case int64:
		return int(n), true
	case uint64:
		return int(n), true
	case float64:
		return int(n), true
	}
	return 0, false
}
```
…egration, and schema extensions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Commit pushed:
🏗️ Design Decision Gate — ADR Required

This PR makes significant changes to core business logic (230 new lines). AI has analyzed the PR diff and generated a draft ADR to help you get started.

📄 Draft ADR:

The draft covers the three interlocking decisions this PR makes:
What to do next
Once an ADR is linked in the PR body, this gate will re-run and verify the implementation matches the decision.

Why ADRs Matter

ADRs create a searchable, permanent record of why the codebase looks the way it does. Future contributors (and your future self) will thank you.

📋 Michael Nygard ADR Format Reference

An ADR must contain these four sections to be considered complete:
All ADRs are stored in
References: §25287740913
test
🧪 Test Quality Sentinel Report
Test Quality Score: 78/100
| Metric | Value |
|---|---|
| New/modified tests analyzed | 15 |
| ✅ Design tests (behavioral contracts) | 15 (100%) |
| ❌ Implementation tests | 0 (0%) |
| Tests with error/edge cases | 9 (60%) |
| Duplicate test clusters | 0 |
| Test inflation detected | Yes — 2 files exceed 2:1 ratio |
| 🚨 Coding-guideline violations | 0 |
Test Classification Details
View all 15 tests
| Test | File | Classification | Notes |
|---|---|---|---|
| "returns empty state when file does not exist" | `pick_experiment.test.cjs` | ✅ Design | Edge case: nonexistent file |
| "returns empty state on invalid JSON" | `pick_experiment.test.cjs` | ✅ Design | Edge case: invalid JSON |
| "round-trips state through save and load" | `pick_experiment.test.cjs` | ✅ Design | Happy path — save/load round-trip |
| "initialises runs to [] when loading legacy state without runs field" | `pick_experiment.test.cjs` | ✅ Design | Edge case: backward-compat migration |
| "preserves existing runs array when loading state" | `pick_experiment.test.cjs` | ✅ Design | Happy path — state fidelity |
| "appends a run record to state.runs after picking variants" | `pick_experiment.test.cjs` | ✅ Design | Happy path — new behavioral contract |
| "accumulates run records across multiple runs" | `pick_experiment.test.cjs` | ✅ Design | Happy path — multi-run accumulation |
| "does not append a run record when no experiments are assigned" | `pick_experiment.test.cjs` | ✅ Design | Edge case: empty spec |
| "exports OTEL_RESOURCE_ATTRIBUTES with experiment assignments" | `pick_experiment.test.cjs` | ✅ Design | Happy path — OTEL integration |
| "appends to existing OTEL_RESOURCE_ATTRIBUTES" | `pick_experiment.test.cjs` | ✅ Design | Edge case: pre-existing env var |
| "does not export OTEL_RESOURCE_ATTRIBUTES when no experiments are assigned" | `pick_experiment.test.cjs` | ✅ Design | Edge case: empty spec |
| `TestExtractExperimentDataWithRuns` / "uses last run record when runs array is present" | `audit_report_experiments_test.go` | ✅ Design | Happy path — new lookup strategy |
| `TestExtractExperimentDataWithRuns` / "falls back to heuristic when runs array is empty" | `audit_report_experiments_test.go` | ✅ Design | Edge case: empty runs |
| `TestExtractExperimentDataWithRuns` / "falls back to heuristic when runs field is absent (legacy state)" | `audit_report_experiments_test.go` | ✅ Design | Edge case: legacy state compat |
| `TestExtractExperimentDataWithRuns` / "skips last run record with empty assignments" | `audit_report_experiments_test.go` | ✅ Design | Edge case: empty assignments map |
Test Inflation (Advisory)
Two test files have a line-addition ratio > 2:1 compared to their production counterparts:
| Test file | Test lines added | Production lines added | Ratio |
|---|---|---|---|
| `pick_experiment.test.cjs` | +108 | +42 | 2.6:1 |
| `audit_report_experiments_test.go` | +90 | +23 | 3.9:1 |
This is not a quality failure here — the new tests are substantive and cover genuine behavioral contracts, not padding. However, the high ratio for audit_report_experiments_test.go (3.9:1) is worth noting: 90 lines of tests for 23 lines of production code. Inspect whether all four subtests are truly exercising distinct branches, or whether some can be collapsed.
Missing Coverage (Suggestions)
A few scenarios not yet covered by the new tests:
- `pick_experiment.test.cjs` — OTEL attribute format: no test verifies the exact `experiment.<name>=<variant>` key format for multi-experiment specs (e.g., two features assigned simultaneously). The current tests only exercise single-feature specs.
- `audit_report_experiments_test.go` — multi-feature `runs` records: `TestExtractExperimentDataWithRuns` only tests single-feature assignment maps. A test with `assignments: {"a": "X", "b": "Y"}` would confirm the function handles all features in the last run record.
- `compile_experiments_integration_test.go` — no new test function: the 15 new lines are additional assertions appended to the existing `TestCompileExperimentsRichSchema`. These are correctly structured and high value; consider splitting the `analysis_type`/`tags`/`notify` assertions into a dedicated sub-test to improve isolation.
Language Support
Tests analyzed:
- 🐹 Go (`*_test.go`): 5 test scenarios (4 via table-driven `TestExtractExperimentDataWithRuns` + integration assertions) — unit (`//go:build !integration`) and integration (`//go:build integration`) build tags both present ✅
- 🟨 JavaScript (`*.test.cjs`): 11 tests (vitest; `vi.fn()`/`mockCore` for external I/O — legitimate mocking targets)
Verdict
✅ Check passed. 0% of new tests are implementation tests (threshold: 30%). All 15 tests verify behavioral contracts. Test inflation is advisory only — the tests themselves are substantive. No coding-guideline violations detected.
📖 Understanding Test Classifications
Design Tests (High Value) verify what the system does:
- Assert on observable outputs, return values, or state changes
- Cover error paths and boundary conditions
- Would catch a behavioral regression if deleted
- Remain valid even after internal refactoring
Implementation Tests (Low Value) verify how the system does it:
- Assert on internal function calls (mocking internals)
- Only test the happy path with typical inputs
- Break during legitimate refactoring even when behavior is correct
- Give false assurance: they pass even when the system is wrong
Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.
References:
@copilot merge main and recompile. Review all comments.
…iment-infrastructure-again Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…validate extractIntField, favor issue in notify, robust tests Agent-Logs-Url: https://github.com/github/gh-aw/sessions/394e4a02-4cb4-4c79-b88f-417d7c638103 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Done in b1e7cb8:
Warning: Firewall rules blocked me from connecting to one or more addresses. I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
@copilot set MAX_RUN_HISTORY to 512
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/9acb139f-07b1-4e87-bc05-7d0026cfdaad Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
The experiment infrastructure lacked traceability (`state.json` had no per-run history), OTEL observability, and three useful schema fields (`analysis_type`, `tags`, `notify`).

Schema additions (1a/1b/1c)

New fields on the `experiments` object form: the `ExperimentConfig` Go struct, `main_workflow_schema.json`, and the JS typedef are updated in sync. A new `extractIntField()` helper in the Go compiler removes duplicated numeric coercion and now validates inputs (rejecting negatives, non-integral `float64` values, and out-of-range values).

Per-run metadata in `state.json` (1d)

`pick_experiment.cjs` now appends a run record to `state.runs` on every invocation:

```json
{
  "counts": { "prompt_style": { "concise": 18, "detailed": 17 } },
  "runs": [
    {
      "run_id": "25276903561",
      "timestamp": "2026-05-03T10:41:00.000Z",
      "assignments": { "prompt_style": "concise" }
    }
  ]
}
```

`state.runs` is capped to the last 512 entries (`MAX_RUN_HISTORY`) to prevent unbounded growth of `state.json` over time. `extractExperimentData` in `audit_report_experiments.go` now reads the last run record directly when present, replacing the fragile max-count heuristic. Legacy state files (no `runs` field) fall back to the heuristic transparently.

OTEL resource attributes (3a)
After variant selection, `pick_experiment.cjs` exports experiment assignments into `OTEL_RESOURCE_ATTRIBUTES`. It appends to any pre-existing value so other OTEL instrumentation is not clobbered.
Summary table now shows
Counts (current/total)— count for the selected variant over total assigned runs (derived fromstate.runswhen present, falling back to incremented counts). This correctly reflects all runs including those outside the active date window.