
test(voice-evals): add golden fixture adapters #773

Merged
Atharva-Kanherkar merged 2 commits into main from codex/voice-golden-fixtures
May 13, 2026

@Atharva-Kanherkar Atharva-Kanherkar commented May 13, 2026

Closes #757
Parent: #754
Depends on: #755, #756

Summary

  • Adds backend/internal/voicefixtures, a reusable deterministic support-billing voice fixture package for later voice-evals PRs.
  • Commits the issue-required product fixtures: challenge-pack YAML, scripted user turns, expected tool call/result, expected agent text output, expected structured output, expected trace JSON, and expected scorecard JSON.
  • Adds fake adapters for user simulation, voice-agent deployment, tool endpoint, fake clock, and fake media transport.
  • Pins deterministic behavior with a test that runs the fake scenario twice and byte-compares generated outputs against committed goldens.

Testing

  • cd backend && go test ./internal/voicefixtures
  • cd backend && go test ./internal/voicefixtures ./internal/challengepack ./internal/multimodaltrace
  • cd backend && go test ./...

Review Checkpoint

The review-checkpoint contract and JSON were intentionally kept in /tmp and not committed. The JSON/YAML files committed in this PR are product fixtures required by #757, not checkpoint scratch files.

Test Contract

# codex/voice-golden-fixtures - Test Contract

## Functional Behavior

Implement issue #757: add deterministic golden fixtures and fake adapters for a voice support-agent billing scenario.

The implementation must:

- Commit small, readable fixtures for a billing refund voice scenario.
- Include a voice challenge-pack YAML fixture that parses with current challenge-pack validation.
- Include scripted user turns, expected tool call, expected tool result, expected agent text output, expected structured output, expected trace JSON, and expected scorecard JSON fixtures.
- Provide reusable fake adapters for a user simulator, voice-agent deployment, tool endpoint, clock, and media transport.
- Ensure all fake adapters are deterministic for a fixed seed and use the fake clock for timestamps.
- Avoid real LLM, STT, TTS, telephony, media provider, API key, network, or environment-variable dependencies.

## Unit Tests

- `TestSupportBillingScenarioGoldensAreDeterministic` - runs the fake scenario twice and proves identical trace JSON, scorecard JSON, event timestamps, tool-call arguments, tool result, agent text output, and structured output.
- The same test compares generated trace and scorecard bytes against committed golden fixtures.
- The voice challenge-pack YAML fixture must parse through `challengepack.ParseYAML`.

## Integration / Functional Tests

N/A - this slice only adds local fixtures and fake adapters for later voice-evals PRs.

## Smoke Tests

Run from `backend/`:

`go test ./internal/voicefixtures ./internal/challengepack ./internal/multimodaltrace`

Expected: pass with no network, API keys, live LLMs, STT/TTS providers, or telephony providers.

## E2E Tests

N/A - execution-mode integration is deferred to later issues.

## Manual / cURL Tests

N/A - no API endpoint behavior changes.

Review Update: Greptile Comments

Addressed Greptile review feedback in commit 5597c9a2:

  • Removed the misleading seed parameter from RunSupportBillingScenario; the fixture seed remains explicit in the generated scorecard via SupportBillingSeed.
  • Derived end_of_speech_to_first_agent_text_ms from fake-clock timestamps and scripted user audio duration instead of hardcoding 3200.
  • Added go test ./internal/voicefixtures -update support for generated golden outputs.

Checkpoint status: pass. Focused fixture tests, the review-contract smoke set, full backend tests, and the new -update path all passed.

Checkpoint JSON

{
  "branch": "codex/voice-golden-fixtures",
  "test_contract": "/tmp/codex-voice-golden-fixtures-test-contract.md",
  "created_at": "2026-05-13T10:44:55Z",
  "steps": [
    {
      "step_number": 1,
      "title": "Add deterministic voice fixtures and fake adapters",
      "timestamp": "2026-05-13T10:48:08Z",
      "files_changed": [
        "backend/internal/voicefixtures/fixtures.go",
        "backend/internal/voicefixtures/fixtures_test.go",
        "backend/internal/voicefixtures/testdata/support_billing/challenge_pack.yaml",
        "backend/internal/voicefixtures/testdata/support_billing/scripted_user_turns.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_tool_call.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_tool_result.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_agent_text_output.txt",
        "backend/internal/voicefixtures/testdata/support_billing/expected_structured_output.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_trace.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_scorecard.json"
      ],
      "what_changed": "Added a reusable voicefixtures package with embedded support-billing goldens, fake user simulator, fake voice-agent deployment, fake tool endpoint, fake clock, and fake media transport. The deterministic runner emits trace, scorecard, tool, structured-output, text-output, and timestamp results from a fixed seed.",
      "review_instructions": "Verify fixtures are committed because #757 explicitly requires product JSON/YAML goldens, while review-checkpoint scratch files stay in /tmp. Check that no fake uses network, API keys, LLMs, STT/TTS, telephony, or environment variables. Confirm repeated runs are byte-stable and the challenge pack parses.",
      "review_result": {
        "status": "pass",
        "issues_found": [],
        "notes": "The package is self-contained under backend/internal/voicefixtures. The test compares repeated generated bytes and generated bytes against committed goldens."
      },
      "cumulative_review": {
        "previous_steps_still_valid": true,
        "integration_issues": [],
        "notes": "Single-step implementation; it depends on the merged multimodal trace and voice pack validation contracts from #755 and #756."
      }
    },
    {
      "step_number": "final",
      "title": "Final review against test contract",
      "timestamp": "2026-05-13T10:48:08Z",
      "test_contract_review": {
        "functional_behavior": "pass - committed readable support-billing fixtures and reusable deterministic fake adapters for simulator, deployment, tool endpoint, clock, and media transport.",
        "unit_tests": "pass - TestSupportBillingScenarioGoldensAreDeterministic runs the fake scenario twice and verifies trace, scorecard, timestamps, tool-call arguments, tool result, agent text output, structured output, and challenge-pack parsing.",
        "integration_tests": "N/A - local fixtures/adapters only.",
        "smoke_tests": "pass - go test ./internal/voicefixtures ./internal/challengepack ./internal/multimodaltrace and go test ./... passed from backend/.",
        "e2e_tests": "N/A - execution mode is deferred.",
        "manual_tests": "N/A - no API endpoint behavior changed."
      },
      "overall_verdict": "ready",
      "blocking_issues": []
    }
  ]
}

Checkpoint: step 1 adds deterministic support-billing voice fixtures and fake adapters for issue #757.

greptile-apps Bot commented May 13, 2026

Greptile Summary

Introduces backend/internal/voicefixtures, a self-contained package of deterministic golden fixtures and fake adapters for a voice support-agent billing scenario, laying the groundwork for later voice-eval PRs (#757).

  • Fake adapters added: FakeClock, FakeUserSimulator, FakeToolEndpoint, FakeVoiceAgentDeployment, and FakeMediaTransport — all fully in-process with no network, LLM, STT/TTS, or env-var dependencies.
  • Golden files committed: challenge-pack YAML, scripted user turns, expected tool call/result, agent text output, structured output, trace JSON, and scorecard JSON — all embedded via embed.FS and byte-compared in TestSupportBillingScenarioGoldensAreDeterministic.
  • decodeCanonical normalizes embedded json.RawMessage fields through an untyped JSON round-trip, which sorts object keys alphabetically; the committed golden files were written to match this ordering.
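
The key-sorting behavior described in the last bullet follows from how Go's `encoding/json` marshals maps. The sketch below shows the general technique with a hypothetical `canonicalize` helper; it is not the PR's `decodeCanonical`, whose body is not shown in this conversation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// canonicalize (hypothetical helper) round-trips raw JSON through an
// untyped value. Objects decode into map[string]any, and json.Marshal
// writes map keys in sorted order, so the output key order is stable
// regardless of the input order.
func canonicalize(raw []byte) ([]byte, error) {
	var v any
	if err := json.Unmarshal(raw, &v); err != nil {
		return nil, err
	}
	return json.Marshal(v)
}

func main() {
	out, err := canonicalize([]byte(`{"b": 1, "a": {"d": true, "c": "x"}}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	// prints {"a":{"c":"x","d":true},"b":1}
}
```

One caveat of this approach: numbers decode as `float64` by default, so very large integers can lose precision unless a `json.Decoder` with `UseNumber` is used instead.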

Confidence Score: 4/5

Safe to merge; the package is entirely self-contained with no external dependencies, and the deterministic test validates byte-stable output across repeated runs.

The seed parameter accepts only the single constant 42, which misrepresents the API's flexibility to downstream consumers. The timing marker value (3200 ms) is hardcoded rather than derived from the segment timestamps already in scope, so a change to the user-turn duration in the JSON fixture would silently produce a wrong metric without an obvious failure at the site of the inconsistency. Neither issue blocks the current test, but both create fragility for future voice-eval PRs that build on this package.

backend/internal/voicefixtures/fixtures.go — the seed guard and hardcoded timing metric value warrant a closer look before dependent PRs build on this API.

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/internal/voicefixtures/fixtures.go | Adds all fixture types, fake adapters, and RunSupportBillingScenario. The seed parameter is accepted but only the constant 42 is valid; timing metric value is hardcoded rather than derived from segment data. |
| backend/internal/voicefixtures/fixtures_test.go | Deterministic golden test comparing two runs and committed files byte-for-byte; no -update flag to regenerate committed goldens. |
| backend/internal/voicefixtures/testdata/support_billing/challenge_pack.yaml | Well-formed voice challenge pack YAML; passes challengepack.ParseYAML validation as tested. |
| backend/internal/voicefixtures/testdata/support_billing/expected_trace.json | 8-segment trace golden with alphabetically-sorted JSON keys in embedded payloads, consistent with decodeCanonical normalization. |
| backend/internal/voicefixtures/testdata/support_billing/expected_scorecard.json | Scorecard golden matches what supportBillingScorecard() generates; all checks pass with correct keys and values. |
| backend/internal/voicefixtures/testdata/support_billing/scripted_user_turns.json | Single scripted turn for the duplicate-charge billing scenario; consistent with fixtures_test expectations for exactly one turn. |

Sequence Diagram

sequenceDiagram
    participant Test as TestGoldensAreDeterministic
    participant FS as embed.FS (testdata)
    participant Sim as FakeUserSimulator
    participant Agent as FakeVoiceAgentDeployment
    participant Tool as FakeToolEndpoint
    participant Media as FakeMediaTransport
    participant Clock as FakeClock

    Test->>FS: LoadSupportBillingFixture()
    FS-->>Test: SupportBillingFixture (8 files)
    Test->>Test: "RunSupportBillingScenario(seed=42) x2"

    Note over Test,Clock: Inside RunSupportBillingScenario
    Test->>Sim: Turns()[0]
    Sim-->>Test: ScriptedUserTurn
    Test->>Agent: Respond(turn)
    Agent->>Tool: Call(expectedCall)
    Tool-->>Agent: ToolResultFixture
    Agent-->>Test: AgentResponse

    Test->>Clock: AtOffset(0ms)
    Test->>Media: "ReceiveUserAudio(turn, t=0ms)"
    Test->>Clock: AtOffset(6s)
    Test->>Media: "SendAgentAudio(response, t=6s)"

    Test->>Test: Build multimodaltrace.Trace (8 segments)
    Test->>Test: marshalGolden(trace) → TraceJSON
    Test->>Test: marshalGolden(scorecard) → ScorecardJSON

    Note over Test,FS: Assertions
    Test->>Test: assertBytesEqual(run1 vs run2)
    Test->>Test: assertBytesEqual(generated vs FS goldens)


Reviews (1): Last reviewed commit: "test(voice-evals): add golden fixture ad..."

Comment on lines +255 to +262
func RunSupportBillingScenario(seed int64) (ScenarioRun, error) {
	fixture, err := LoadSupportBillingFixture()
	if err != nil {
		return ScenarioRun{}, err
	}
	if seed != SupportBillingSeed {
		return ScenarioRun{}, fmt.Errorf("unsupported support billing fixture seed %d", seed)
	}

P2 seed parameter provides no actual behavioral variation

RunSupportBillingScenario(seed int64) accepts a seed parameter but immediately rejects anything other than the hardcoded constant SupportBillingSeed = 42. The value isn't used to seed any RNG — all outputs are fully hardcoded. Callers from later voice-eval PRs who discover the constraint only at runtime (via an opaque error message) may be surprised. If only one seed is ever valid, consider removing the parameter entirely or accepting SupportBillingSeed as a named parameter so the contract is obvious at the call site.


Comment on lines +401 to +406
	Actor:      multimodaltrace.ActorEvaluator,
	OccurredAt: clock.AtOffset(7 * time.Second),
	TimingMarker: &multimodaltrace.TimingMarkerPayload{
		Key:     "end_of_speech_to_first_agent_text",
		ValueMS: 3200,
	},

P2 Hardcoded timing metric diverges from fixture data

TimingMarkerPayload.ValueMS: 3200 is correct today — it's derived from seg-005.OccurredAt (5 s) − (turn.OccurredAtOffsetMS + turn.DurationMS) (0 + 1800 ms) = 3200 ms — but the computation lives only in a developer's head. If the scripted user turn's duration_ms or occurred_at_offset_ms values are ever updated in scripted_user_turns.json, the hardcoded constant will silently become wrong, causing the golden scorecard to diverge without a compile-time or immediate test failure at the source of the inconsistency. Consider deriving the value from the segment timestamps already available in the trace.


Comment on lines +11 to +41
func TestSupportBillingScenarioGoldensAreDeterministic(t *testing.T) {
	fixture, err := LoadSupportBillingFixture()
	if err != nil {
		t.Fatalf("LoadSupportBillingFixture returned error: %v", err)
	}
	if _, err := challengepack.ParseYAML(fixture.ChallengePackYAML); err != nil {
		t.Fatalf("challenge pack fixture failed validation: %v", err)
	}

	first, err := RunSupportBillingScenario(SupportBillingSeed)
	if err != nil {
		t.Fatalf("first RunSupportBillingScenario returned error: %v", err)
	}
	second, err := RunSupportBillingScenario(SupportBillingSeed)
	if err != nil {
		t.Fatalf("second RunSupportBillingScenario returned error: %v", err)
	}

	assertBytesEqual(t, "trace output repeated run", first.TraceJSON, second.TraceJSON)
	assertBytesEqual(t, "scorecard output repeated run", first.ScorecardJSON, second.ScorecardJSON)
	if !reflect.DeepEqual(first.EventTimestamps, second.EventTimestamps) {
		t.Fatalf("event timestamps mismatch\nwant: %#v\n got: %#v", first.EventTimestamps, second.EventTimestamps)
	}
	assertBytesEqual(t, "tool-call arguments repeated run", first.ToolCallArgumentsJSON, second.ToolCallArgumentsJSON)
	assertBytesEqual(t, "tool call golden", fixture.ExpectedToolCallJSON, first.ToolCallJSON)
	assertBytesEqual(t, "tool result golden", fixture.ExpectedToolResultJSON, first.ToolResultJSON)
	assertBytesEqual(t, "agent text output golden", fixture.ExpectedAgentTextOutput, first.AgentTextOutput)
	assertBytesEqual(t, "structured output golden", fixture.ExpectedStructuredJSON, first.StructuredOutputJSON)
	assertBytesEqual(t, "trace golden", fixture.ExpectedTraceJSON, first.TraceJSON)
	assertBytesEqual(t, "scorecard golden", fixture.ExpectedScorecardJSON, first.ScorecardJSON)
}

P2 No golden-update mechanism for committed fixture files

The test does byte-exact comparison against the committed golden files but there is no -update flag (the standard Go pattern: var update = flag.Bool("update", false, "update golden files")). If the trace schema, scorecard schema, or any fixture format ever changes, developers must manually regenerate and re-commit all eight golden files in lock-step. A missing update path is easy to overlook and risks committed goldens drifting from what the code would actually generate.


Checkpoint: step 2 removes misleading seed parameter, derives timing metrics, and adds golden update support.
@Atharva-Kanherkar Atharva-Kanherkar merged commit 1f4a9c7 into main May 13, 2026
3 checks passed
@Atharva-Kanherkar Atharva-Kanherkar deleted the codex/voice-golden-fixtures branch May 13, 2026 11:24

Development

Successfully merging this pull request may close these issues.

[Voice evals 03] Add golden voice eval fixtures and fake adapters
