test(voice-evals): add golden fixture adapters #773
Checkpoint: step 1 adds deterministic support-billing voice fixtures and fake adapters for issue #757.
Greptile Summary
Confidence Score: 4/5

Safe to merge; the package is entirely self-contained with no external dependencies, and the deterministic test validates byte-stable output across repeated runs. In backend/internal/voicefixtures/fixtures.go, the seed guard and the hardcoded timing metric value warrant a closer look before dependent PRs build on this API.

| Filename | Overview |
|---|---|
| backend/internal/voicefixtures/fixtures.go | Adds all fixture types, fake adapters, and RunSupportBillingScenario. The seed parameter is accepted but only the constant 42 is valid; timing metric value is hardcoded rather than derived from segment data. |
| backend/internal/voicefixtures/fixtures_test.go | Deterministic golden test comparing two runs and committed files byte-for-byte; no -update flag to regenerate committed goldens. |
| backend/internal/voicefixtures/testdata/support_billing/challenge_pack.yaml | Well-formed voice challenge pack YAML; passes challengepack.ParseYAML validation as tested. |
| backend/internal/voicefixtures/testdata/support_billing/expected_trace.json | 8-segment trace golden with alphabetically-sorted JSON keys in embedded payloads, consistent with decodeCanonical normalization. |
| backend/internal/voicefixtures/testdata/support_billing/expected_scorecard.json | Scorecard golden matches what supportBillingScorecard() generates; all checks pass with correct keys and values. |
| backend/internal/voicefixtures/testdata/support_billing/scripted_user_turns.json | Single scripted turn for the duplicate-charge billing scenario; consistent with fixtures_test expectations for exactly one turn. |
Sequence Diagram
sequenceDiagram
participant Test as TestGoldensAreDeterministic
participant FS as embed.FS (testdata)
participant Sim as FakeUserSimulator
participant Agent as FakeVoiceAgentDeployment
participant Tool as FakeToolEndpoint
participant Media as FakeMediaTransport
participant Clock as FakeClock
Test->>FS: LoadSupportBillingFixture()
FS-->>Test: SupportBillingFixture (8 files)
Test->>Test: "RunSupportBillingScenario(seed=42) x2"
Note over Test,Clock: Inside RunSupportBillingScenario
Test->>Sim: Turns()[0]
Sim-->>Test: ScriptedUserTurn
Test->>Agent: Respond(turn)
Agent->>Tool: Call(expectedCall)
Tool-->>Agent: ToolResultFixture
Agent-->>Test: AgentResponse
Test->>Clock: AtOffset(0ms)
Test->>Media: "ReceiveUserAudio(turn, t=0ms)"
Test->>Clock: AtOffset(6s)
Test->>Media: "SendAgentAudio(response, t=6s)"
Test->>Test: Build multimodaltrace.Trace (8 segments)
Test->>Test: marshalGolden(trace) → TraceJSON
Test->>Test: marshalGolden(scorecard) → ScorecardJSON
Note over Test,FS: Assertions
Test->>Test: assertBytesEqual(run1 vs run2)
Test->>Test: assertBytesEqual(generated vs FS goldens)
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
backend/internal/voicefixtures/fixtures.go:255-262
**`seed` parameter provides no actual behavioral variation**
`RunSupportBillingScenario(seed int64)` accepts a `seed` parameter but immediately rejects anything other than the hardcoded constant `SupportBillingSeed = 42`. The value isn't used to seed any RNG — all outputs are fully hardcoded. Callers from later voice-eval PRs who discover the constraint only at runtime (via an opaque error message) may be surprised. If only one seed is ever valid, consider removing the parameter entirely or accepting `SupportBillingSeed` as a named parameter so the contract is obvious at the call site.
### Issue 2 of 3
backend/internal/voicefixtures/fixtures.go:401-406
**Hardcoded timing metric diverges from fixture data**
`TimingMarkerPayload.ValueMS: 3200` is correct today — it's derived from `seg-005.OccurredAt (5 s) − (turn.OccurredAtOffsetMS + turn.DurationMS) (0 + 1800 ms) = 3200 ms` — but the computation lives only in a developer's head. If the scripted user turn's `duration_ms` or `occurred_at_offset_ms` values are ever updated in `scripted_user_turns.json`, the hardcoded constant will silently become wrong, causing the golden scorecard to diverge without a compile-time or immediate test failure at the source of the inconsistency. Consider deriving the value from the segment timestamps already available in the trace.
### Issue 3 of 3
backend/internal/voicefixtures/fixtures_test.go:11-41
**No golden-update mechanism for committed fixture files**
The test does byte-exact comparison against the committed golden files but there is no `-update` flag (the standard Go pattern: `var update = flag.Bool("update", false, "update golden files")`). If the trace schema, scorecard schema, or any fixture format ever changes, developers must manually regenerate and re-commit all eight golden files in lock-step. A missing update path is easy to overlook and risks committed goldens drifting from what the code would actually generate.
Reviews (1): Last reviewed commit: "test(voice-evals): add golden fixture ad..."
    func RunSupportBillingScenario(seed int64) (ScenarioRun, error) {
        fixture, err := LoadSupportBillingFixture()
        if err != nil {
            return ScenarioRun{}, err
        }
        if seed != SupportBillingSeed {
            return ScenarioRun{}, fmt.Errorf("unsupported support billing fixture seed %d", seed)
        }

seed parameter provides no actual behavioral variation
RunSupportBillingScenario(seed int64) accepts a seed parameter but immediately rejects anything other than the hardcoded constant SupportBillingSeed = 42. The value isn't used to seed any RNG — all outputs are fully hardcoded. Callers from later voice-eval PRs who discover the constraint only at runtime (via an opaque error message) may be surprised. If only one seed is ever valid, consider removing the parameter entirely or accepting SupportBillingSeed as a named parameter so the contract is obvious at the call site.
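
One way to act on this comment is to drop the parameter and let the constant carry the contract. The sketch below assumes a simplified `scenarioRun` type and a lowercase `runSupportBillingScenario`; both are hypothetical stand-ins, and only `SupportBillingSeed` and its value 42 come from the review.

```go
package main

import "fmt"

// SupportBillingSeed mirrors the constant named in the review (value 42).
const SupportBillingSeed int64 = 42

// scenarioRun is a simplified, hypothetical stand-in for the PR's ScenarioRun.
type scenarioRun struct {
	Seed int64
}

// runSupportBillingScenario takes no seed argument, so callers cannot pass
// an unsupported value and discover the constraint only at runtime.
// The seed is still recorded in the result for traceability.
func runSupportBillingScenario() (scenarioRun, error) {
	return scenarioRun{Seed: SupportBillingSeed}, nil
}

func main() {
	run, err := runSupportBillingScenario()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(run.Seed) // 42
}
```

With this shape the "unsupported seed" error path disappears entirely, since the only valid value is fixed at compile time.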
        Actor: multimodaltrace.ActorEvaluator,
        OccurredAt: clock.AtOffset(7 * time.Second),
        TimingMarker: &multimodaltrace.TimingMarkerPayload{
            Key: "end_of_speech_to_first_agent_text",
            ValueMS: 3200,
        },

Hardcoded timing metric diverges from fixture data
TimingMarkerPayload.ValueMS: 3200 is correct today — it's derived from seg-005.OccurredAt (5 s) − (turn.OccurredAtOffsetMS + turn.DurationMS) (0 + 1800 ms) = 3200 ms — but the computation lives only in a developer's head. If the scripted user turn's duration_ms or occurred_at_offset_ms values are ever updated in scripted_user_turns.json, the hardcoded constant will silently become wrong, causing the golden scorecard to diverge without a compile-time or immediate test failure at the source of the inconsistency. Consider deriving the value from the segment timestamps already available in the trace.
    func TestSupportBillingScenarioGoldensAreDeterministic(t *testing.T) {
        fixture, err := LoadSupportBillingFixture()
        if err != nil {
            t.Fatalf("LoadSupportBillingFixture returned error: %v", err)
        }
        if _, err := challengepack.ParseYAML(fixture.ChallengePackYAML); err != nil {
            t.Fatalf("challenge pack fixture failed validation: %v", err)
        }

        first, err := RunSupportBillingScenario(SupportBillingSeed)
        if err != nil {
            t.Fatalf("first RunSupportBillingScenario returned error: %v", err)
        }
        second, err := RunSupportBillingScenario(SupportBillingSeed)
        if err != nil {
            t.Fatalf("second RunSupportBillingScenario returned error: %v", err)
        }

        assertBytesEqual(t, "trace output repeated run", first.TraceJSON, second.TraceJSON)
        assertBytesEqual(t, "scorecard output repeated run", first.ScorecardJSON, second.ScorecardJSON)
        if !reflect.DeepEqual(first.EventTimestamps, second.EventTimestamps) {
            t.Fatalf("event timestamps mismatch\nwant: %#v\n got: %#v", first.EventTimestamps, second.EventTimestamps)
        }
        assertBytesEqual(t, "tool-call arguments repeated run", first.ToolCallArgumentsJSON, second.ToolCallArgumentsJSON)
        assertBytesEqual(t, "tool call golden", fixture.ExpectedToolCallJSON, first.ToolCallJSON)
        assertBytesEqual(t, "tool result golden", fixture.ExpectedToolResultJSON, first.ToolResultJSON)
        assertBytesEqual(t, "agent text output golden", fixture.ExpectedAgentTextOutput, first.AgentTextOutput)
        assertBytesEqual(t, "structured output golden", fixture.ExpectedStructuredJSON, first.StructuredOutputJSON)
        assertBytesEqual(t, "trace golden", fixture.ExpectedTraceJSON, first.TraceJSON)
        assertBytesEqual(t, "scorecard golden", fixture.ExpectedScorecardJSON, first.ScorecardJSON)
    }

No golden-update mechanism for committed fixture files
The test does byte-exact comparison against the committed golden files but there is no -update flag (the standard Go pattern: var update = flag.Bool("update", false, "update golden files")). If the trace schema, scorecard schema, or any fixture format ever changes, developers must manually regenerate and re-commit all eight golden files in lock-step. A missing update path is easy to overlook and risks committed goldens drifting from what the code would actually generate.
Checkpoint: step 2 removes the misleading seed parameter, derives timing metrics, and adds golden update support.
Closes #757
Parent: #754
Depends on: #755, #756
Summary

- Adds `backend/internal/voicefixtures`, a reusable deterministic support-billing voice fixture package for later voice-evals PRs.

Testing

- `cd backend && go test ./internal/voicefixtures`
- `cd backend && go test ./internal/voicefixtures ./internal/challengepack ./internal/multimodaltrace`
- `cd backend && go test ./...`

Review Checkpoint

The review-checkpoint contract and JSON were intentionally kept in `/tmp` and not committed. The JSON/YAML files committed in this PR are product fixtures required by #757, not checkpoint scratch files.

Review Update: Greptile Comments

Addressed Greptile review feedback in commit `5597c9a2`:

- Removed the `seed` parameter from `RunSupportBillingScenario`; the fixture seed remains explicit in the generated scorecard via `SupportBillingSeed`.
- Derived `end_of_speech_to_first_agent_text_ms` from fake-clock timestamps and scripted user audio duration instead of hardcoding `3200`.
- Added `go test ./internal/voicefixtures -update` support for generated golden outputs.

Checkpoint status: pass. Focused fixture tests, the review-contract smoke set, full backend tests, and the new `-update` path all passed.

Checkpoint JSON

{
  "branch": "codex/voice-golden-fixtures",
  "test_contract": "/tmp/codex-voice-golden-fixtures-test-contract.md",
  "created_at": "2026-05-13T10:44:55Z",
  "steps": [
    {
      "step_number": 1,
      "title": "Add deterministic voice fixtures and fake adapters",
      "timestamp": "2026-05-13T10:48:08Z",
      "files_changed": [
        "backend/internal/voicefixtures/fixtures.go",
        "backend/internal/voicefixtures/fixtures_test.go",
        "backend/internal/voicefixtures/testdata/support_billing/challenge_pack.yaml",
        "backend/internal/voicefixtures/testdata/support_billing/scripted_user_turns.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_tool_call.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_tool_result.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_agent_text_output.txt",
        "backend/internal/voicefixtures/testdata/support_billing/expected_structured_output.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_trace.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_scorecard.json"
      ],
      "what_changed": "Added a reusable voicefixtures package with embedded support-billing goldens, fake user simulator, fake voice-agent deployment, fake tool endpoint, fake clock, and fake media transport. The deterministic runner emits trace, scorecard, tool, structured-output, text-output, and timestamp results from a fixed seed.",
      "review_instructions": "Verify fixtures are committed because #757 explicitly requires product JSON/YAML goldens, while review-checkpoint scratch files stay in /tmp. Check that no fake uses network, API keys, LLMs, STT/TTS, telephony, or environment variables. Confirm repeated runs are byte-stable and the challenge pack parses.",
      "review_result": {
        "status": "pass",
        "issues_found": [],
        "notes": "The package is self-contained under backend/internal/voicefixtures. The test compares repeated generated bytes and generated bytes against committed goldens."
      },
      "cumulative_review": {
        "previous_steps_still_valid": true,
        "integration_issues": [],
        "notes": "Single-step implementation; it depends on the merged multimodal trace and voice pack validation contracts from #755 and #756."
      }
    },
    {
      "step_number": "final",
      "title": "Final review against test contract",
      "timestamp": "2026-05-13T10:48:08Z",
      "test_contract_review": {
        "functional_behavior": "pass - committed readable support-billing fixtures and reusable deterministic fake adapters for simulator, deployment, tool endpoint, clock, and media transport.",
        "unit_tests": "pass - TestSupportBillingScenarioGoldensAreDeterministic runs the fake scenario twice and verifies trace, scorecard, timestamps, tool-call arguments, tool result, agent text output, structured output, and challenge-pack parsing.",
        "integration_tests": "N/A - local fixtures/adapters only.",
        "smoke_tests": "pass - go test ./internal/voicefixtures ./internal/challengepack ./internal/multimodaltrace and go test ./... passed from backend/.",
        "e2e_tests": "N/A - execution mode is deferred.",
        "manual_tests": "N/A - no API endpoint behavior changed."
      },
      "overall_verdict": "ready",
      "blocking_issues": []
    }
  ]
}