[Voice evals 16] Add support-agent end-to-end smoke test #787

Merged
Atharva-Kanherkar merged 2 commits into main from codex/voice-support-e2e-smoke on May 13, 2026

@Atharva-Kanherkar
Collaborator

Summary

  • Adds backend/internal/voicee2e with a deterministic support-agent voice eval smoke test.
  • The smoke test loads the existing billing/refund voice pack, runs text_sim with fake deployment/tool behavior, validates canonical events plus artifact manifest checksums, builds a replay projection, generates scorecards, and evaluates the voice compare gates.
  • Documents the one-command verification path in the test contract and test file: go test ./internal/voicee2e -run TestSupportAgentVoiceEvalLoopSmoke -count=1 from backend/.

Closes #770.
Parent: #754.

Tests

  • go test ./internal/voicee2e -run TestSupportAgentVoiceEvalLoopSmoke -count=1
  • go test ./internal/voicee2e ./internal/voicetextsim ./internal/voicereplay ./internal/voicescorecard ./internal/releasegate
  • go test ./...

Test Contract

codex/voice-support-e2e-smoke — Test Contract

Functional Behavior

  • Add a deterministic support-agent voice smoke test using the existing support billing/refund fixture pack.
  • The test must exercise modality: voice, text_sim mode, scripted user simulator data, fake voice-agent deployment, fake tool call, canonical voice events, artifact manifest validation, replay projection, scorecard generation, and voice compare gate evaluation.
  • A developer can verify the full first voice-evals slice after merge with one command: go test ./internal/voicee2e -run TestSupportAgentVoiceEvalLoopSmoke -count=1 from backend/.
  • The sample must not require an LLM API key, telephony/WebRTC, object storage, or external network calls.
  • The happy path must produce a passing scorecard and non-empty replay projection.
  • A companion failing candidate must fail the compare gate deterministically.
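
The pass/fail gate requirement can be illustrated with a minimal sketch. The `scorecard` and `gateResult` types and the `compareGate` function below are hypothetical stand-ins, not the real `voicescorecard` or `releasegate` API; the only detail taken from this PR is the `scorecard_not_passed` reason reported in the review comments. The point is that a gate built from pure comparisons has no source of nondeterminism, so a known-bad candidate fails the same way on every run.

```go
package main

import "fmt"

// scorecard and gateResult are hypothetical stand-ins for illustration;
// they are not the real voicescorecard or releasegate types.
type scorecard struct {
	Passed bool
}

type gateResult struct {
	Pass   bool
	Reason string
}

// compareGate fails whenever either scorecard did not pass. It uses no
// clock, randomness, or I/O, so identical inputs always give the same verdict.
func compareGate(baseline, candidate scorecard) gateResult {
	if !baseline.Passed || !candidate.Passed {
		return gateResult{Pass: false, Reason: "scorecard_not_passed"}
	}
	return gateResult{Pass: true}
}

func main() {
	baseline := scorecard{Passed: true}

	// Happy path: both scorecards pass, so the gate passes.
	fmt.Printf("%+v\n", compareGate(baseline, scorecard{Passed: true}))

	// Known-bad candidate: the gate fails with a stable reason string.
	fmt.Printf("%+v\n", compareGate(baseline, scorecard{Passed: false}))
}
```

A test built this way can assert on the exact reason string rather than only on the boolean verdict, which makes a regression in the failing path easier to spot.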

Unit Tests

  • TestSupportAgentVoiceEvalLoopSmoke loads the support-agent pack, runs text-sim with fake adapters, validates events/artifact manifest, builds replay projection, generates scorecards, and evaluates pass/fail compare gates.

Integration / Functional Tests

  • go test ./internal/voicee2e from backend/ verifies the full deterministic smoke path.

Smoke Tests

  • go test ./internal/voicee2e -run TestSupportAgentVoiceEvalLoopSmoke -count=1 from backend/.
  • go test ./internal/voicee2e ./internal/voicetextsim ./internal/voicereplay ./internal/voicescorecard ./internal/releasegate from backend/.
  • go test ./... from backend/ if focused tests pass.

E2E Tests

  • Covered by the deterministic internal smoke test; no external service is required.

Manual / cURL Tests

  • N/A — this is a Go smoke test, not an HTTP route.

Review Checkpoint JSON

{
  "branch": "codex/voice-support-e2e-smoke",
  "test_contract": "/tmp/codex-voice-support-e2e-smoke-test-contract.md",
  "created_at": "2026-05-13T17:45:00Z",
  "steps": [
    {
      "step_number": 1,
      "title": "Add support-agent voice eval smoke test",
      "timestamp": "2026-05-13T17:52:00Z",
      "files_changed": [
        "backend/internal/voicee2e/support_agent_smoke_test.go"
      ],
      "what_changed": "Added a deterministic support-agent voice evaluation smoke test that loads the support billing fixture pack, runs text-sim with fake deployment/tool call behavior, validates canonical events and artifact manifest checksums, builds a replay projection, generates voice scorecards, and evaluates the voice compare gate for both happy and known-bad candidates.",
      "review_instructions": "Verify the test exercises the full #770 contract without external LLM, telephony/WebRTC, object storage, or network calls; verify the documented command works and that both the happy pass and failing gate paths are deterministic.",
      "review_result": {
        "status": "pass",
        "issues_found": [],
        "notes": "Self-review checked the one-command smoke path, fixture pack modality/text_sim checks, fake deployment/tool call wiring, event validation, manifest checksum verification, replay projection, scorecard pass, and compare gate failure. Focused, neighboring, and full backend tests passed."
      },
      "cumulative_review": {
        "previous_steps_still_valid": true,
        "integration_issues": [],
        "notes": "The smoke test composes existing voice packages without changing their behavior."
      }
    },
    {
      "step_number": "final",
      "title": "Final review against test contract",
      "timestamp": "2026-05-13T17:53:00Z",
      "test_contract_review": {
        "functional_behavior": "pass - the smoke test covers the support billing voice pack, text_sim, fake deployment/tool call, canonical events, manifest checksums, replay projection, scorecards, and pass/fail compare gate paths without external services.",
        "unit_tests": "pass - TestSupportAgentVoiceEvalLoopSmoke is present and covers the contract.",
        "integration_tests": "pass - go test ./internal/voicee2e verifies the full deterministic internal smoke path.",
        "smoke_tests": "pass - documented command, neighboring package set, and go test ./... from backend passed.",
        "e2e_tests": "pass - deterministic internal E2E-style smoke test covers the final voice-evals slice.",
        "manual_tests": "N/A - no HTTP/manual route in this slice."
      },
      "overall_verdict": "ready",
      "blocking_issues": []
    }
  ]
}

@greptile-apps
Contributor

greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR adds backend/internal/voicee2e/support_agent_smoke_test.go, a single deterministic smoke test that exercises the full voice-eval pipeline — fixture loading, text-sim execution, canonical event validation, replay projection, scorecard generation, and compare-gate evaluation — without any external services, LLM keys, or telephony.

  • Introduces TestSupportAgentVoiceEvalLoopSmoke covering both the happy-path pass verdict and a deterministic fail-gate scenario using voicedeployment.OutcomeFail.
  • Wires together voicetextsim, voicedeployment, voicereplay, voicescorecard, and releasegate packages end-to-end with the existing support_billing fixture pack.

Confidence Score: 4/5

The change is a test-only addition with no production code changes; it is safe to merge and will not affect runtime behavior.

The end-to-end flow is logically correct: all key timing, latency-fallback, and scorecard-pass conditions trace through cleanly against the fixture data. Two gaps hold the score below clean: the local contains helper silently disagrees with voicetextsim.contains on whitespace handling, and five fixture fields loaded via loadSupportFixture (including ExpectedAgentTextOutput) are never consumed, leaving the canonical fixture expectations unverified by the test.

backend/internal/voicee2e/support_agent_smoke_test.go — the only changed file; focus on the unused fixture fields and the contains inconsistency.

Important Files Changed

Filename: backend/internal/voicee2e/support_agent_smoke_test.go
Overview: New deterministic smoke test; end-to-end flow is logically sound, but the local contains helper diverges from voicetextsim.contains in whitespace handling, and several loaded fixture fields (ExpectedAgentTextOutput, ExpectedTraceJSON, etc.) are never consumed, creating silent drift risk.


Reviews (1): Last reviewed commit: "test(voice): add support eval smoke" | Re-trigger Greptile

Comment on lines +280 to +285
	for _, value := range values {
		if value == want {
			return true
		}
	}
	return false

P2 Local contains silently disagrees with voicetextsim.contains

The local helper does exact string comparison, while voicetextsim.validateInput trims whitespace before comparing. If the fixture YAML ever has a transport value padded with whitespace (e.g., " text_sim"), parseSupportPack would fatalf with "want text_sim transport" even though the actual voicetextsim.Run call would succeed — making the failure message actively misleading.
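
One possible fix, sketched under assumptions: the helper name `containsTrimmed` is hypothetical, and the exact trimming rules inside `voicetextsim` are not shown in this thread, so this assumes plain `strings.TrimSpace` on both sides of the comparison.

```go
package main

import (
	"fmt"
	"strings"
)

// containsTrimmed reports whether values includes want, ignoring leading and
// trailing whitespace on both the candidates and the wanted value.
// Assumption: this mirrors the whitespace handling the review attributes to
// voicetextsim; the real helper may differ.
func containsTrimmed(values []string, want string) bool {
	want = strings.TrimSpace(want)
	for _, value := range values {
		if strings.TrimSpace(value) == want {
			return true
		}
	}
	return false
}

func main() {
	// A transport padded with whitespace, as in the review's example.
	transports := []string{" text_sim"}

	fmt.Println(containsTrimmed(transports, "text_sim"))
	fmt.Println(containsTrimmed(transports, "webrtc"))
}
```

With trimming on both sides, the pre-check in the test and the production validation would agree on padded values, so a failure message from the test would point at a real problem.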


Comment on lines +85 to +105
	if err != nil {
		t.Fatalf("LoadSupportBillingFixture returned error: %v", err)
	}
	return fixture
}

func parseSupportPack(t *testing.T, fixture voicefixtures.SupportBillingFixture) challengepack.Bundle {
	t.Helper()
	bundle, err := challengepack.ParseYAML(fixture.ChallengePackYAML)
	if err != nil {
		t.Fatalf("ParseYAML returned error: %v", err)
	}
	if bundle.Modality != challengepack.ModalityVoice {
		t.Fatalf("bundle modality = %q, want voice", bundle.Modality)
	}
	if bundle.InterfaceSpec == nil || !contains(bundle.InterfaceSpec.Transports, "text_sim") {
		t.Fatalf("bundle interface spec = %+v, want text_sim transport", bundle.InterfaceSpec)
	}
	return bundle
}


P2 Several fixture fields are loaded but never consumed

loadSupportFixture populates fixture.ExpectedAgentTextOutput, fixture.ExpectedTraceJSON, fixture.ExpectedScorecardJSON, fixture.ScriptedUserTurnsJSON, and fixture.ExpectedToolResultJSON, but none of these are referenced anywhere in the test. In particular, fakeDeploymentScript sources the happy-path agent text from script.Steps[0].ExpectedAgentText (the separate voicesim script file), completely bypassing the canonical ExpectedAgentTextOutput embedded in the fixture pack. If these two sources drift apart, the smoke test will keep passing while the fixtures are inconsistent.
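
A drift check along these lines could close the gap. This is a sketch only: `scriptStep` and `checkAgentTextDrift` are hypothetical names, the field names are borrowed from the comment above, and trimming whitespace before comparing is an assumption rather than something the fixtures are known to need.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// scriptStep is a hypothetical stand-in for one step of the voicesim script.
type scriptStep struct {
	ExpectedAgentText string
}

// checkAgentTextDrift compares the fixture's canonical agent text with the
// first script step and returns an error when the two sources disagree.
func checkAgentTextDrift(fixtureText string, steps []scriptStep) error {
	if len(steps) == 0 {
		return errors.New("script has no steps to compare against the fixture")
	}
	got := strings.TrimSpace(steps[0].ExpectedAgentText)
	want := strings.TrimSpace(fixtureText)
	if got != want {
		return fmt.Errorf("agent text drift: script has %q, fixture has %q", got, want)
	}
	return nil
}

func main() {
	// Illustrative text only; not taken from the real fixture pack.
	steps := []scriptStep{{ExpectedAgentText: "Your refund has been issued."}}

	fmt.Println(checkAgentTextDrift("Your refund has been issued.\n", steps))
	fmt.Println(checkAgentTextDrift("Your refund is pending.", steps))
}
```

In a real test the returned error would feed `t.Fatalf`, so the smoke test stops passing as soon as the script file and the fixture pack disagree.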


@Atharva-Kanherkar
Collaborator Author

Verdict: approve

Blocking issues: none found.

Step review:

  1. Add support-agent voice eval smoke test - pass
    Notes: Verified against the actual single-file diff in backend/internal/voicee2e/support_agent_smoke_test.go. The test loads the support billing/refund fixture pack, asserts modality: voice and text_sim, uses scripted simulator data, builds a fake voice-agent deployment with the expected fake tool call, validates persisted canonical voice events, verifies local artifact manifest checksums, builds a non-empty replay projection, generates passing scorecards for baseline/candidate, passes the voice compare gate for the happy path, and fails the known-bad candidate deterministically with scorecard_not_passed.
    Issues: none.

Final test contract review:

  • Functional behavior: pass - Covers issue #770's deterministic support-agent voice eval loop using the billing/refund fixture and the required voice/text-sim/fake deployment/tool/replay/scorecard/gate pieces.
  • Unit tests: pass - TestSupportAgentVoiceEvalLoopSmoke is present and covers the contract.
  • Integration tests: pass - go test ./internal/voicee2e path is covered by the documented focused command.
  • Smoke tests: pass - Documented focused command, neighboring package set, and full backend suite passed locally.
  • E2E tests: pass - The deterministic internal smoke test covers the requested end-to-end voice-evals slice without external services.
  • Manual tests: N/A - No HTTP/manual route in this slice.

External dependency check:

  • pass - The added test imports local/internal packages only and uses voicedeployment.NewFake; no LLM API key, telephony/WebRTC, object storage, or external network dependency is introduced.
  • pass - Fixture artifact manifest entries are all local_path; checksum validation reads local testdata.
  • pass - Searched the new test and relevant voice packages for API-key/network/telephony/object-storage usage. Only object-storage validation code exists in the generic manifest package; the smoke fixture does not use it.
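
For readers unfamiliar with the local checksum check, a minimal sketch follows. It assumes SHA-256 digests stored as lowercase hex, which this thread does not confirm, and `manifestEntry`, `verifyChecksum`, and `verifyEntry` are hypothetical names rather than the manifest package's real API. It shows why the check needs nothing beyond the local filesystem.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// manifestEntry is a hypothetical shape for a local_path manifest entry.
type manifestEntry struct {
	LocalPath string
	SHA256    string // lowercase hex digest (assumed format)
}

// verifyChecksum hashes data and compares it with the expected hex digest.
func verifyChecksum(data []byte, wantHex string) error {
	sum := sha256.Sum256(data)
	got := hex.EncodeToString(sum[:])
	if got != wantHex {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, wantHex)
	}
	return nil
}

// verifyEntry reads the artifact from local disk only; no network or
// object-storage client is involved.
func verifyEntry(entry manifestEntry) error {
	data, err := os.ReadFile(entry.LocalPath)
	if err != nil {
		return fmt.Errorf("read %s: %w", entry.LocalPath, err)
	}
	return verifyChecksum(data, entry.SHA256)
}

func main() {
	// Write a throwaway artifact so the example does not depend on testdata.
	dir, err := os.MkdirTemp("", "manifest")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "trace.json")
	content := []byte(`{"event":"example"}`)
	if err := os.WriteFile(path, content, 0o600); err != nil {
		panic(err)
	}

	sum := sha256.Sum256(content)
	entry := manifestEntry{LocalPath: path, SHA256: hex.EncodeToString(sum[:])}
	fmt.Println("matching digest error:", verifyEntry(entry))

	// A tampered digest must be rejected.
	entry.SHA256 = strings.Repeat("0", 64)
	fmt.Println("tampered digest rejected:", verifyEntry(entry) != nil)
}
```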

Commands run from backend/:

  • go test ./internal/voicee2e -run TestSupportAgentVoiceEvalLoopSmoke -count=1 - passed
  • go test ./internal/voicee2e ./internal/voicetextsim ./internal/voicereplay ./internal/voicescorecard ./internal/releasegate - passed
  • go test ./... - passed

Review JSON:

{
  "steps": [
    {
      "step_number": 1,
      "title": "Add support-agent voice eval smoke test",
      "review_result": {
        "status": "pass",
        "issues_found": [],
        "notes": "Independently verified the actual diff and local tests. The smoke covers the #770 contract, uses fake/local-only voice evaluation pieces, validates deterministic happy-path output, and proves the known-bad candidate fails the compare gate."
      },
      "cumulative_review": {
        "previous_steps_still_valid": true,
        "integration_issues": [],
        "notes": "No integration drift found across fixture loading, text_sim execution, artifact verification, replay, scorecard generation, or releasegate comparison."
      }
    },
    {
      "step_number": "final",
      "title": "Final review against test contract",
      "test_contract_review": {
        "functional_behavior": "pass - issue #770 contract verified against the actual diff and local tests.",
        "unit_tests": "pass - TestSupportAgentVoiceEvalLoopSmoke covers the required smoke path.",
        "integration_tests": "pass - focused package smoke passed.",
        "smoke_tests": "pass - documented command, neighboring packages, and go test ./... passed.",
        "e2e_tests": "pass - deterministic internal E2E-style smoke covers the requested slice without external services.",
        "manual_tests": "N/A - no manual HTTP route required."
      },
      "overall_verdict": "approve",
      "blocking_issues": []
    }
  ]
}

@Atharva-Kanherkar
Collaborator Author

Verdict: approve

Blocking issues: none found.

Note: GitHub rejected a formal approving review from this local token because it belongs to the PR author, so I am posting the independent review as a PR conversation comment.

Step review:

  1. Add support-agent voice eval smoke test - pass
    Notes: Reviewed the actual current diff at head 1a2a0986 against issue #770 and the PR test contract. The new smoke test loads the support billing voice fixture pack, asserts modality: voice and text_sim, runs voicetextsim through voicedeployment.NewFake, validates deterministic trace/events output, verifies canonical persisted events, checks local artifact manifest checksums, builds a non-empty replay projection, generates passing happy-path scorecards, and proves the known-bad candidate fails the compare gate with scorecard_not_passed.
    Issues: none.

Greptile follow-up verification:

  • pass - 1a2a0986 addresses fixture-field usage by adding assertSupportFixtureGoldens, consuming ExpectedToolResultJSON, ExpectedAgentTextOutput, ExpectedTraceJSON, ExpectedScorecardJSON, and ScriptedUserTurnsJSON, and asserting the generated trace text equals the fixture agent text.
  • pass - fakeDeploymentScript now sources happy-path agent text from fixture.ExpectedAgentTextOutput, so the smoke no longer bypasses the canonical fixture output.
  • pass - the local contains helper now trims whitespace before matching, matching the production voicetextsim validation behavior.

Final test contract review:

  • Functional behavior: pass - Covers issue #770's deterministic support-agent voice eval loop using the required voice/text-sim/scripted simulator/fake deployment/fake tool/canonical events/artifact manifest/replay/scorecard/compare-gate pieces.
  • Unit tests: pass - TestSupportAgentVoiceEvalLoopSmoke is present and covers the contract.
  • Integration tests: pass - Focused and neighboring backend package tests passed uncached.
  • Smoke tests: pass - The documented one-command smoke test passed locally.
  • E2E tests: pass - The deterministic internal smoke covers the requested end-to-end voice-evals slice without external services.
  • Manual tests: N/A - This is a Go smoke test, not an HTTP route.

External dependency check:

  • pass - The changed test imports only local/internal packages and uses fake deployment/tool behavior; no LLM API key, telephony/WebRTC, object storage, or network call requirement was introduced.
  • pass - The artifact manifest used by the smoke contains local-path artifacts; checksum validation reads local testdata.

Commands run from backend/:

  • go test ./internal/voicee2e -run TestSupportAgentVoiceEvalLoopSmoke -count=1 - passed
  • go test ./internal/voicee2e ./internal/voicetextsim ./internal/voicereplay ./internal/voicescorecard ./internal/releasegate - passed, cached
  • go test ./... - passed, cached
  • go test -count=1 ./internal/voicee2e ./internal/voicetextsim ./internal/voicereplay ./internal/voicescorecard ./internal/releasegate - passed
  • go test -count=1 ./... - passed
  • git diff --check origin/main...HEAD - passed

Review JSON:

{
  "steps": [
    {
      "step_number": 1,
      "title": "Add support-agent voice eval smoke test",
      "review_result": {
        "status": "pass",
        "issues_found": [],
        "notes": "Actual diff and local tests satisfy issue #770 and the PR test contract. Greptile follow-up commit 1a2a0986 addressed fixture-field usage and whitespace matching."
      },
      "cumulative_review": {
        "previous_steps_still_valid": true,
        "integration_issues": [],
        "notes": "No integration drift found across fixture loading, text-sim execution, artifact verification, replay, scorecard generation, or releasegate comparison."
      }
    },
    {
      "step_number": "final",
      "title": "Final review against test contract",
      "test_contract_review": {
        "functional_behavior": "pass",
        "unit_tests": "pass",
        "integration_tests": "pass",
        "smoke_tests": "pass",
        "e2e_tests": "pass",
        "manual_tests": "N/A"
      },
      "overall_verdict": "approve",
      "blocking_issues": []
    }
  ]
}

@Atharva-Kanherkar Atharva-Kanherkar merged commit a3b9a36 into main May 13, 2026
3 checks passed
@Atharva-Kanherkar Atharva-Kanherkar deleted the codex/voice-support-e2e-smoke branch May 13, 2026 18:09
Development

Successfully merging this pull request may close these issues.

[Voice evals 16] Add support-agent sample pack and end-to-end smoke test