test(voice-evals): add golden fixture adapters by Atharva-Kanherkar · Pull Request #773 · agentclash/agentclash

Atharva-Kanherkar · 2026-05-13T10:49:06Z

Closes #757
Parent: #754
Depends on: #755, #756

Summary

- Adds `backend/internal/voicefixtures`, a reusable deterministic support-billing voice fixture package for later voice-evals PRs.
- Commits the issue-required product fixtures: challenge-pack YAML, scripted user turns, expected tool call/result, expected agent text output, expected structured output, expected trace JSON, and expected scorecard JSON.
- Adds fake adapters for user simulation, voice-agent deployment, tool endpoint, fake clock, and fake media transport.
- Pins deterministic behavior with a test that runs the fake scenario twice and byte-compares generated outputs against committed goldens.

Testing

cd backend && go test ./internal/voicefixtures
cd backend && go test ./internal/voicefixtures ./internal/challengepack ./internal/multimodaltrace
cd backend && go test ./...

Review Checkpoint

The review-checkpoint contract and JSON were intentionally kept in /tmp and not committed. The JSON/YAML files committed in this PR are product fixtures required by #757, not checkpoint scratch files.

Test Contract

# codex/voice-golden-fixtures - Test Contract

## Functional Behavior

Implement issue #757: add deterministic golden fixtures and fake adapters for a voice support-agent billing scenario.

The implementation must:

- Commit small, readable fixtures for a billing refund voice scenario.
- Include a voice challenge-pack YAML fixture that parses with current challenge-pack validation.
- Include scripted user turns, expected tool call, expected tool result, expected agent text output, expected structured output, expected trace JSON, and expected scorecard JSON fixtures.
- Provide reusable fake adapters for a user simulator, voice-agent deployment, tool endpoint, clock, and media transport.
- Ensure all fake adapters are deterministic for a fixed seed and use the fake clock for timestamps.
- Avoid real LLM, STT, TTS, telephony, media provider, API key, network, or environment-variable dependencies.

## Unit Tests

- `TestSupportBillingScenarioGoldensAreDeterministic` - runs the fake scenario twice and proves identical trace JSON, scorecard JSON, event timestamps, tool-call arguments, tool result, agent text output, and structured output.
- The same test compares generated trace and scorecard bytes against committed golden fixtures.
- The voice challenge-pack YAML fixture must parse through `challengepack.ParseYAML`.

## Integration / Functional Tests

N/A - this slice only adds local fixtures and fake adapters for later voice-evals PRs.

## Smoke Tests

Run from `backend/`:

`go test ./internal/voicefixtures ./internal/challengepack ./internal/multimodaltrace`

Expected: pass with no network, API keys, live LLMs, STT/TTS providers, or telephony providers.

## E2E Tests

N/A - execution-mode integration is deferred to later issues.

## Manual / cURL Tests

N/A - no API endpoint behavior changes.

Review Update: Greptile Comments

Addressed Greptile review feedback in commit 5597c9a2:

- Removed the misleading `seed` parameter from `RunSupportBillingScenario`; the fixture seed remains explicit in the generated scorecard via `SupportBillingSeed`.
- Derived `end_of_speech_to_first_agent_text_ms` from fake-clock timestamps and scripted user audio duration instead of hardcoding 3200.
- Added `go test ./internal/voicefixtures -update` support for generated golden outputs.

Checkpoint status: pass. Focused fixture tests, the review-contract smoke set, full backend tests, and the new -update path all passed.

Checkpoint JSON

{
  "branch": "codex/voice-golden-fixtures",
  "test_contract": "/tmp/codex-voice-golden-fixtures-test-contract.md",
  "created_at": "2026-05-13T10:44:55Z",
  "steps": [
    {
      "step_number": 1,
      "title": "Add deterministic voice fixtures and fake adapters",
      "timestamp": "2026-05-13T10:48:08Z",
      "files_changed": [
        "backend/internal/voicefixtures/fixtures.go",
        "backend/internal/voicefixtures/fixtures_test.go",
        "backend/internal/voicefixtures/testdata/support_billing/challenge_pack.yaml",
        "backend/internal/voicefixtures/testdata/support_billing/scripted_user_turns.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_tool_call.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_tool_result.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_agent_text_output.txt",
        "backend/internal/voicefixtures/testdata/support_billing/expected_structured_output.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_trace.json",
        "backend/internal/voicefixtures/testdata/support_billing/expected_scorecard.json"
      ],
      "what_changed": "Added a reusable voicefixtures package with embedded support-billing goldens, fake user simulator, fake voice-agent deployment, fake tool endpoint, fake clock, and fake media transport. The deterministic runner emits trace, scorecard, tool, structured-output, text-output, and timestamp results from a fixed seed.",
      "review_instructions": "Verify fixtures are committed because #757 explicitly requires product JSON/YAML goldens, while review-checkpoint scratch files stay in /tmp. Check that no fake uses network, API keys, LLMs, STT/TTS, telephony, or environment variables. Confirm repeated runs are byte-stable and the challenge pack parses.",
      "review_result": {
        "status": "pass",
        "issues_found": [],
        "notes": "The package is self-contained under backend/internal/voicefixtures. The test compares repeated generated bytes and generated bytes against committed goldens."
      },
      "cumulative_review": {
        "previous_steps_still_valid": true,
        "integration_issues": [],
        "notes": "Single-step implementation; it depends on the merged multimodal trace and voice pack validation contracts from #755 and #756."
      }
    },
    {
      "step_number": "final",
      "title": "Final review against test contract",
      "timestamp": "2026-05-13T10:48:08Z",
      "test_contract_review": {
        "functional_behavior": "pass - committed readable support-billing fixtures and reusable deterministic fake adapters for simulator, deployment, tool endpoint, clock, and media transport.",
        "unit_tests": "pass - TestSupportBillingScenarioGoldensAreDeterministic runs the fake scenario twice and verifies trace, scorecard, timestamps, tool-call arguments, tool result, agent text output, structured output, and challenge-pack parsing.",
        "integration_tests": "N/A - local fixtures/adapters only.",
        "smoke_tests": "pass - go test ./internal/voicefixtures ./internal/challengepack ./internal/multimodaltrace and go test ./... passed from backend/.",
        "e2e_tests": "N/A - execution mode is deferred.",
        "manual_tests": "N/A - no API endpoint behavior changed."
      },
      "overall_verdict": "ready",
      "blocking_issues": []
    }
  ]
}

Checkpoint: step 1 adds deterministic support-billing voice fixtures and fake adapters for issue #757.

greptile-apps · 2026-05-13T10:55:01Z

Greptile Summary

Introduces backend/internal/voicefixtures, a self-contained package of deterministic golden fixtures and fake adapters for a voice support-agent billing scenario, laying the groundwork for later voice-eval PRs (#757).

- Fake adapters added: `FakeClock`, `FakeUserSimulator`, `FakeToolEndpoint`, `FakeVoiceAgentDeployment`, and `FakeMediaTransport`, all fully in-process with no network, LLM, STT/TTS, or env-var dependencies.
- Golden files committed: challenge-pack YAML, scripted user turns, expected tool call/result, agent text output, structured output, trace JSON, and scorecard JSON, all embedded via `embed.FS` and byte-compared in `TestSupportBillingScenarioGoldensAreDeterministic`.
- `decodeCanonical` normalizes embedded `json.RawMessage` fields through an untyped JSON round-trip, which sorts object keys alphabetically; the committed golden files were written to match this ordering.

Confidence Score: 4/5

Safe to merge; the package is entirely self-contained with no external dependencies, and the deterministic test validates byte-stable output across repeated runs.

The seed parameter accepts only the single constant 42, which misrepresents the API's flexibility to downstream consumers. The timing marker value (3200 ms) is hardcoded rather than derived from the segment timestamps already in scope, so a change to the user-turn duration in the JSON fixture would silently produce a wrong metric without an obvious failure at the site of the inconsistency. Neither issue blocks the current test, but both create fragility for future voice-eval PRs that build on this package.

backend/internal/voicefixtures/fixtures.go — the seed guard and hardcoded timing metric value warrant a closer look before dependent PRs build on this API.

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/internal/voicefixtures/fixtures.go | Adds all fixture types, fake adapters, and RunSupportBillingScenario. The seed parameter is accepted but only the constant 42 is valid; timing metric value is hardcoded rather than derived from segment data. |
| backend/internal/voicefixtures/fixtures_test.go | Deterministic golden test comparing two runs and committed files byte-for-byte; no -update flag to regenerate committed goldens. |
| backend/internal/voicefixtures/testdata/support_billing/challenge_pack.yaml | Well-formed voice challenge pack YAML; passes challengepack.ParseYAML validation as tested. |
| backend/internal/voicefixtures/testdata/support_billing/expected_trace.json | 8-segment trace golden with alphabetically-sorted JSON keys in embedded payloads, consistent with decodeCanonical normalization. |
| backend/internal/voicefixtures/testdata/support_billing/expected_scorecard.json | Scorecard golden matches what supportBillingScorecard() generates; all checks pass with correct keys and values. |
| backend/internal/voicefixtures/testdata/support_billing/scripted_user_turns.json | Single scripted turn for the duplicate-charge billing scenario; consistent with fixtures_test expectations for exactly one turn. |

Sequence Diagram

sequenceDiagram
    participant Test as TestGoldensAreDeterministic
    participant FS as embed.FS (testdata)
    participant Sim as FakeUserSimulator
    participant Agent as FakeVoiceAgentDeployment
    participant Tool as FakeToolEndpoint
    participant Media as FakeMediaTransport
    participant Clock as FakeClock

    Test->>FS: LoadSupportBillingFixture()
    FS-->>Test: SupportBillingFixture (8 files)
    Test->>Test: "RunSupportBillingScenario(seed=42) x2"

    Note over Test,Clock: Inside RunSupportBillingScenario
    Test->>Sim: Turns()[0]
    Sim-->>Test: ScriptedUserTurn
    Test->>Agent: Respond(turn)
    Agent->>Tool: Call(expectedCall)
    Tool-->>Agent: ToolResultFixture
    Agent-->>Test: AgentResponse

    Test->>Clock: AtOffset(0ms)
    Test->>Media: "ReceiveUserAudio(turn, t=0ms)"
    Test->>Clock: AtOffset(6s)
    Test->>Media: "SendAgentAudio(response, t=6s)"

    Test->>Test: Build multimodaltrace.Trace (8 segments)
    Test->>Test: marshalGolden(trace) → TraceJSON
    Test->>Test: marshalGolden(scorecard) → ScorecardJSON

    Note over Test,FS: Assertions
    Test->>Test: assertBytesEqual(run1 vs run2)
    Test->>Test: assertBytesEqual(generated vs FS goldens)

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
backend/internal/voicefixtures/fixtures.go:255-262
**`seed` parameter provides no actual behavioral variation**

`RunSupportBillingScenario(seed int64)` accepts a `seed` parameter but immediately rejects anything other than the hardcoded constant `SupportBillingSeed = 42`. The value isn't used to seed any RNG — all outputs are fully hardcoded. Callers from later voice-eval PRs who discover the constraint only at runtime (via an opaque error message) may be surprised. If only one seed is ever valid, consider removing the parameter entirely or accepting `SupportBillingSeed` as a named parameter so the contract is obvious at the call site.

### Issue 2 of 3
backend/internal/voicefixtures/fixtures.go:401-406
**Hardcoded timing metric diverges from fixture data**

`TimingMarkerPayload.ValueMS: 3200` is correct today — it's derived from `seg-005.OccurredAt (5 s) − (turn.OccurredAtOffsetMS + turn.DurationMS) (0 + 1800 ms) = 3200 ms` — but the computation lives only in a developer's head. If the scripted user turn's `duration_ms` or `occurred_at_offset_ms` values are ever updated in `scripted_user_turns.json`, the hardcoded constant will silently become wrong, causing the golden scorecard to diverge without a compile-time or immediate test failure at the source of the inconsistency. Consider deriving the value from the segment timestamps already available in the trace.

### Issue 3 of 3
backend/internal/voicefixtures/fixtures_test.go:11-41
**No golden-update mechanism for committed fixture files**

The test does byte-exact comparison against the committed golden files but there is no `-update` flag (the standard Go pattern: `var update = flag.Bool("update", false, "update golden files")`). If the trace schema, scorecard schema, or any fixture format ever changes, developers must manually regenerate and re-commit all eight golden files in lock-step. A missing update path is easy to overlook and risks committed goldens drifting from what the code would actually generate.

Reviews (1): Last reviewed commit: "test(voice-evals): add golden fixture ad..."

greptile-apps · 2026-05-13T10:55:05Z

+func RunSupportBillingScenario(seed int64) (ScenarioRun, error) {
+	fixture, err := LoadSupportBillingFixture()
+	if err != nil {
+		return ScenarioRun{}, err
+	}
+	if seed != SupportBillingSeed {
+		return ScenarioRun{}, fmt.Errorf("unsupported support billing fixture seed %d", seed)
+	}


seed parameter provides no actual behavioral variation

RunSupportBillingScenario(seed int64) accepts a seed parameter but immediately rejects anything other than the hardcoded constant SupportBillingSeed = 42. The value isn't used to seed any RNG — all outputs are fully hardcoded. Callers from later voice-eval PRs who discover the constraint only at runtime (via an opaque error message) may be surprised. If only one seed is ever valid, consider removing the parameter entirely or accepting SupportBillingSeed as a named parameter so the contract is obvious at the call site.
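One way to read the "remove the parameter" suggestion is sketched below. The names `fixtureSeed`, `scorecard`, and `runScenario` are hypothetical stand-ins (for `SupportBillingSeed`, the real scorecard type, and `RunSupportBillingScenario`); the actual post-review signature may differ.

```go
package main

import "fmt"

// fixtureSeed stands in for SupportBillingSeed; the value 42 is the one the
// review comment cites.
const fixtureSeed int64 = 42

// scorecard is a hypothetical, reduced scorecard that only records the seed.
type scorecard struct {
	Seed int64
}

// runScenario takes no seed argument, so callers cannot pass an unsupported
// value; the fixed seed is still recorded explicitly in the output.
func runScenario() scorecard {
	return scorecard{Seed: fixtureSeed}
}

func main() {
	fmt.Println(runScenario().Seed) // 42
}
```

With this shape the single-seed constraint is visible in the signature rather than discovered through a runtime error.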

greptile-apps · 2026-05-13T10:55:06Z

+				Actor:          multimodaltrace.ActorEvaluator,
+				OccurredAt:     clock.AtOffset(7 * time.Second),
+				TimingMarker: &multimodaltrace.TimingMarkerPayload{
+					Key:     "end_of_speech_to_first_agent_text",
+					ValueMS: 3200,
+				},


Hardcoded timing metric diverges from fixture data

TimingMarkerPayload.ValueMS: 3200 is correct today — it's derived from seg-005.OccurredAt (5 s) − (turn.OccurredAtOffsetMS + turn.DurationMS) (0 + 1800 ms) = 3200 ms — but the computation lives only in a developer's head. If the scripted user turn's duration_ms or occurred_at_offset_ms values are ever updated in scripted_user_turns.json, the hardcoded constant will silently become wrong, causing the golden scorecard to diverge without a compile-time or immediate test failure at the source of the inconsistency. Consider deriving the value from the segment timestamps already available in the trace.

greptile-apps · 2026-05-13T10:55:07Z

+func TestSupportBillingScenarioGoldensAreDeterministic(t *testing.T) {
+	fixture, err := LoadSupportBillingFixture()
+	if err != nil {
+		t.Fatalf("LoadSupportBillingFixture returned error: %v", err)
+	}
+	if _, err := challengepack.ParseYAML(fixture.ChallengePackYAML); err != nil {
+		t.Fatalf("challenge pack fixture failed validation: %v", err)
+	}
+
+	first, err := RunSupportBillingScenario(SupportBillingSeed)
+	if err != nil {
+		t.Fatalf("first RunSupportBillingScenario returned error: %v", err)
+	}
+	second, err := RunSupportBillingScenario(SupportBillingSeed)
+	if err != nil {
+		t.Fatalf("second RunSupportBillingScenario returned error: %v", err)
+	}
+
+	assertBytesEqual(t, "trace output repeated run", first.TraceJSON, second.TraceJSON)
+	assertBytesEqual(t, "scorecard output repeated run", first.ScorecardJSON, second.ScorecardJSON)
+	if !reflect.DeepEqual(first.EventTimestamps, second.EventTimestamps) {
+		t.Fatalf("event timestamps mismatch\nwant: %#v\n got: %#v", first.EventTimestamps, second.EventTimestamps)
+	}
+	assertBytesEqual(t, "tool-call arguments repeated run", first.ToolCallArgumentsJSON, second.ToolCallArgumentsJSON)
+	assertBytesEqual(t, "tool call golden", fixture.ExpectedToolCallJSON, first.ToolCallJSON)
+	assertBytesEqual(t, "tool result golden", fixture.ExpectedToolResultJSON, first.ToolResultJSON)
+	assertBytesEqual(t, "agent text output golden", fixture.ExpectedAgentTextOutput, first.AgentTextOutput)
+	assertBytesEqual(t, "structured output golden", fixture.ExpectedStructuredJSON, first.StructuredOutputJSON)
+	assertBytesEqual(t, "trace golden", fixture.ExpectedTraceJSON, first.TraceJSON)
+	assertBytesEqual(t, "scorecard golden", fixture.ExpectedScorecardJSON, first.ScorecardJSON)
+}


No golden-update mechanism for committed fixture files

The test does byte-exact comparison against the committed golden files but there is no -update flag (the standard Go pattern: var update = flag.Bool("update", false, "update golden files")). If the trace schema, scorecard schema, or any fixture format ever changes, developers must manually regenerate and re-commit all eight golden files in lock-step. A missing update path is easy to overlook and risks committed goldens drifting from what the code would actually generate.

Checkpoint: step 2 removes misleading seed parameter, derives timing metrics, and adds golden update support.

test(voice-evals): add golden fixture adapters

03fc523

greptile-apps Bot reviewed May 13, 2026

test(voice-evals): address fixture review feedback

5597c9a

Atharva-Kanherkar merged commit 1f4a9c7 into main May 13, 2026
3 checks passed

Atharva-Kanherkar deleted the codex/voice-golden-fixtures branch May 13, 2026 11:24
