[Voice evals 15] Add production-call import and redaction contract#786
Conversation
|
Verdict: approve Blocking issues: None found. Step review:
Final test contract review:
Commands run:
Review JSON: {
"steps": [
{
"step_number": 1,
"title": "Add deterministic production-call import and redaction contract",
"review_result": {
"status": "pass",
"issues_found": [],
"notes": "Verified actual PR diff, issue #769 contract, strict import validation, trace/manifest conversion, approved-only promotion, artifact path/checksum preservation via manifest, deterministic labels/artifact ordering, and absence of provider/network/LLM dependencies. Focused, neighboring, and full backend tests passed."
},
"cumulative_review": {
"previous_steps_still_valid": true,
"integration_issues": [],
"notes": "No drift found between the new voiceimport package and existing multimodaltrace, voiceartifacts, or challengepack contracts."
}
},
{
"step_number": "final",
"title": "Final review against test contract",
"test_contract_review": {
"functional_behavior": "pass - satisfies #769 import format, redaction metadata/status, promotion gating, trace/manifest conversion, and deterministic challenge case requirements.",
"unit_tests": "pass - named contract tests exist and pass locally.",
"integration_tests": "N/A - internal contract/conversion helper only.",
"smoke_tests": "pass - go test ./internal/voiceimport, neighboring package tests, and backend go test ./... passed locally.",
"e2e_tests": "N/A - no production provider import in this slice.",
"manual_tests": "N/A - deterministic unit tests cover the stated contract."
},
"overall_verdict": "approve",
"blocking_issues": []
}
]
} |
Greptile SummaryThis PR adds the
Confidence Score: 3/5The package introduces a well-structured, provider-neutral import contract, but a gap in Fixture.Validate() means callers cannot rely on validation as a stable pre-flight check before calling ToTrace(). A fixture with duplicate segment IDs passes Fixture.Validate() but fails inside ToTrace() when trace.Validate() catches the duplicate, breaking the stated contract that validated imports produce valid traces. backend/internal/voiceimport/import.go needs a cross-entry uniqueness check in Fixture.Validate() for all segment IDs contributed to the trace.
|
| Filename | Overview |
|---|---|
| backend/internal/voiceimport/import.go | New voiceimport package defining the production call import contract; has a gap where Fixture.Validate() omits duplicate segment-ID checks, letting fixtures that validate still fail in ToTrace(), plus redundant triple-validation in PromoteToChallengeCase(). |
| backend/internal/voiceimport/import_test.go | Five unit tests covering the main import/promote/validate paths; determinism check marshals the same struct twice and is trivially true rather than testing cross-invocation determinism. |
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
backend/internal/voiceimport/import.go:164-180
**Segment ID uniqueness not checked in `Fixture.Validate()`**
`Fixture.Validate()` iterates over transcript entries and validates each in isolation, but never checks that `TranscriptEntry.SegmentID` values are unique across the slice, that `AudioReference.SegmentID` values are unique, or that an audio segment ID doesn't collide with a transcript segment ID. Both sets of IDs feed directly into `trace.Segments` in `ToTrace()`, so a fixture with duplicate segment IDs will pass `Fixture.Validate()` and then fail inside `ToTrace()` when `trace.Validate()` rejects the duplicate — contradicting the documented contract: *"Parsed imports must validate into the existing models"* and defeating the purpose of a standalone `Validate()` gate.
### Issue 2 of 3
backend/internal/voiceimport/import_test.go:111-128
**Determinism check is trivially true and proves nothing**
The test serializes the same `got` value twice and asserts the two byte strings match. Since `got` is the same struct instance both times and Go's `json.Marshal` is deterministic for the same in-memory value, this comparison will always pass regardless of whether `PromoteToChallengeCase()` is actually deterministic across independent calls. A meaningful test would call `PromoteToChallengeCase()` a second time on an equivalent freshly-decoded fixture and compare the two serialized outputs — that would catch any ordering-instability introduced by, for example, an unsorted `map[string]any` or a non-stable sort.
### Issue 3 of 3
backend/internal/voiceimport/import.go:211-237
**`PromoteToChallengeCase()` triggers `f.Validate()` three times**
`PromoteToChallengeCase()` calls `f.Validate()` once at line 212, then delegates to `f.ToTrace()` (which calls `f.Validate()` again at line 185) and `f.ToArtifactManifest()` (which calls `f.Validate()` a third time at line 200). Each `Validate()` also re-runs `validateArtifactManifest()`. For the happy-path the work is wasted; for large manifests or transcript slices this could be measurable. Internal helper variants (e.g., `toTraceUnchecked`) or inlining the post-validation logic would avoid the redundancy.
Reviews (1): Last reviewed commit: "feat(voice): add call import redaction c..." | Re-trigger Greptile
| for idx, event := range f.ProviderEvents { | ||
| if err := event.Validate(); err != nil { | ||
| return fmt.Errorf("provider_events[%d]: %w", idx, err) | ||
| } | ||
| } | ||
| if f.Redaction == nil { | ||
| return ErrMissingRedaction | ||
| } | ||
| if err := f.Redaction.Validate(); err != nil { | ||
| return err | ||
| } | ||
| if strings.TrimSpace(f.ExpectedOutcome) == "" { | ||
| return errors.New("expected_outcome is required") | ||
| } | ||
| if strings.TrimSpace(f.FailureCategory) == "" { | ||
| return errors.New("failure_category is required") | ||
| } |
There was a problem hiding this comment.
Segment ID uniqueness not checked in
Fixture.Validate()
Fixture.Validate() iterates over transcript entries and validates each in isolation, but never checks that TranscriptEntry.SegmentID values are unique across the slice, that AudioReference.SegmentID values are unique, or that an audio segment ID doesn't collide with a transcript segment ID. Both sets of IDs feed directly into trace.Segments in ToTrace(), so a fixture with duplicate segment IDs will pass Fixture.Validate() and then fail inside ToTrace() when trace.Validate() rejects the duplicate — contradicting the documented contract: "Parsed imports must validate into the existing models" and defeating the purpose of a standalone Validate() gate.
Prompt To Fix With AI
This is a comment left during a code review.
Path: backend/internal/voiceimport/import.go
Line: 164-180
Comment:
**Segment ID uniqueness not checked in `Fixture.Validate()`**
`Fixture.Validate()` iterates over transcript entries and validates each in isolation, but never checks that `TranscriptEntry.SegmentID` values are unique across the slice, that `AudioReference.SegmentID` values are unique, or that an audio segment ID doesn't collide with a transcript segment ID. Both sets of IDs feed directly into `trace.Segments` in `ToTrace()`, so a fixture with duplicate segment IDs will pass `Fixture.Validate()` and then fail inside `ToTrace()` when `trace.Validate()` rejects the duplicate — contradicting the documented contract: *"Parsed imports must validate into the existing models"* and defeating the purpose of a standalone `Validate()` gate.
How can I resolve this? If you propose a fix, please make it concise.| if originalByKey[key].ChecksumSHA256 != importedByKey[key].ChecksumSHA256 { | ||
| t.Fatalf("%s checksum = %q, want %q", key, importedByKey[key].ChecksumSHA256, originalByKey[key].ChecksumSHA256) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| func assertCaseShape(t *testing.T, got challengepack.CaseDefinition) { | ||
| t.Helper() | ||
| if got.ChallengeKey != "voice-support-regression" || got.CaseKey != "prod-call-001" || got.ItemKey != "prod-call-001" { | ||
| t.Fatalf("case identity = %s/%s/%s, want voice-support-regression/prod-call-001/prod-call-001", got.ChallengeKey, got.CaseKey, got.ItemKey) | ||
| } | ||
| if got.Payload["source_import_id"] != "import-prod-call-001" || | ||
| got.Payload["source_provider"] != "acme-contact-center" || | ||
| got.Payload["source_call_id"] != "acme-call-001" || | ||
| got.Payload["redaction_status"] != string(RedactionStatusApprovedForRegression) || | ||
| got.Payload["failure_category"] != "billing_refund_policy" || | ||
| got.Payload["promotion_pack_slug"] != "voice-support-regressions" || | ||
| got.Payload["promotion_input_set"] != "approved-production-calls" { |
There was a problem hiding this comment.
Determinism check is trivially true and proves nothing
The test serializes the same got value twice and asserts the two byte strings match. Since got is the same struct instance both times and Go's json.Marshal is deterministic for the same in-memory value, this comparison will always pass regardless of whether PromoteToChallengeCase() is actually deterministic across independent calls. A meaningful test would call PromoteToChallengeCase() a second time on an equivalent freshly-decoded fixture and compare the two serialized outputs — that would catch any ordering-instability introduced by, for example, an unsorted map[string]any or a non-stable sort.
Prompt To Fix With AI
This is a comment left during a code review.
Path: backend/internal/voiceimport/import_test.go
Line: 111-128
Comment:
**Determinism check is trivially true and proves nothing**
The test serializes the same `got` value twice and asserts the two byte strings match. Since `got` is the same struct instance both times and Go's `json.Marshal` is deterministic for the same in-memory value, this comparison will always pass regardless of whether `PromoteToChallengeCase()` is actually deterministic across independent calls. A meaningful test would call `PromoteToChallengeCase()` a second time on an equivalent freshly-decoded fixture and compare the two serialized outputs — that would catch any ordering-instability introduced by, for example, an unsorted `map[string]any` or a non-stable sort.
How can I resolve this? If you propose a fix, please make it concise.| Kind: audioKindForSpeaker(entry.Speaker), | ||
| Actor: actorForSpeaker(entry.Speaker), | ||
| OccurredAt: entry.OccurredAt.UTC(), | ||
| Audio: &multimodaltrace.AudioPayload{ | ||
| ArtifactRef: strings.TrimSpace(entry.Audio.ArtifactRef), | ||
| Format: strings.TrimSpace(entry.Audio.Format), | ||
| Channel: strings.TrimSpace(entry.Audio.Channel), | ||
| DurationMS: entry.Audio.DurationMS, | ||
| }, | ||
| }) | ||
| } | ||
| sequence++ | ||
| kind := multimodaltrace.SegmentKindTranscriptPartial | ||
| if entry.Final { | ||
| kind = multimodaltrace.SegmentKindTranscriptFinal | ||
| } | ||
| trace.Segments = append(trace.Segments, multimodaltrace.Segment{ | ||
| SegmentID: strings.TrimSpace(entry.SegmentID), | ||
| SequenceNumber: sequence, | ||
| Kind: kind, | ||
| Actor: actorForSpeaker(entry.Speaker), | ||
| OccurredAt: entry.OccurredAt.UTC(), | ||
| Transcript: &multimodaltrace.TranscriptPayload{ | ||
| Text: entry.Text, | ||
| Language: strings.TrimSpace(entry.Language), | ||
| Confidence: entry.Confidence, | ||
| SourceSegmentID: sourceSegmentID, |
There was a problem hiding this comment.
PromoteToChallengeCase() triggers f.Validate() three times
PromoteToChallengeCase() calls f.Validate() once at line 212, then delegates to f.ToTrace() (which calls f.Validate() again at line 185) and f.ToArtifactManifest() (which calls f.Validate() a third time at line 200). Each Validate() also re-runs validateArtifactManifest(). For the happy-path the work is wasted; for large manifests or transcript slices this could be measurable. Internal helper variants (e.g., toTraceUnchecked) or inlining the post-validation logic would avoid the redundancy.
Prompt To Fix With AI
This is a comment left during a code review.
Path: backend/internal/voiceimport/import.go
Line: 211-237
Comment:
**`PromoteToChallengeCase()` triggers `f.Validate()` three times**
`PromoteToChallengeCase()` calls `f.Validate()` once at line 212, then delegates to `f.ToTrace()` (which calls `f.Validate()` again at line 185) and `f.ToArtifactManifest()` (which calls `f.Validate()` a third time at line 200). Each `Validate()` also re-runs `validateArtifactManifest()`. For the happy-path the work is wasted; for large manifests or transcript slices this could be measurable. Internal helper variants (e.g., `toTraceUnchecked`) or inlining the post-validation logic would avoid the redundancy.
How can I resolve this? If you propose a fix, please make it concise.|
Verdict: approve No blocking issues found. Step review:
Final test contract review:
Commands run:
Review JSON: {
"overall_verdict": "approve",
"blocking_issues": [],
"steps": [
{
"step_number": 1,
"title": "Add deterministic production-call import and redaction contract",
"review_result": {
"status": "pass",
"issues_found": [],
"notes": "Verified implementation against issue #769 and PR contract; Greptile fixes are present; focused and full backend tests passed."
},
"cumulative_review": {
"previous_steps_still_valid": true,
"integration_issues": [],
"notes": "No drift found across multimodaltrace, voiceartifacts, or challengepack integration."
}
},
{
"step_number": "final",
"title": "Final review against test contract",
"test_contract_review": {
"functional_behavior": "pass",
"unit_tests": "pass",
"integration_tests": "N/A",
"smoke_tests": "pass",
"e2e_tests": "N/A",
"manual_tests": "N/A"
},
"overall_verdict": "approve",
"blocking_issues": []
}
]
} |
Summary
backend/internal/voiceimportwith a strict JSON contract for externally captured production voice calls.multimodaltrace.Traceandvoiceartifacts.Manifestmodels while preserving refs/checksums.approved_for_regressionredaction status and emits deterministicchallengepack.CaseDefinitioncases.Closes #769.
Parent: #754.
Tests
go test ./internal/voiceimportgo test ./internal/voiceimport ./internal/multimodaltrace ./internal/voiceartifacts ./internal/challengepackgo test ./...Test Contract
codex/voice-call-import-redaction — Test Contract
Functional Behavior
multimodaltrace.Traceandvoiceartifacts.Manifestmodels without mutating artifact references or checksums.unreviewed,redacted,approved_for_regression, orrejected.approved_for_regressionimports with explicit redaction metadata.Unit Tests
TestImportValidRedactedCallFixtureparses a valid redacted fixture and validates the derived trace and artifact manifest.TestPromotionRejectsUnreviewedCallasserts unreviewed imports cannot produce regression cases.TestImportRejectsMissingRedactionMetadataasserts missing PII/redaction metadata fails validation.TestApprovedFixtureProducesDeterministicChallengeCaseasserts approved imports produce the exact deterministic challenge case shape.TestImportPreservesOriginalArtifactReferencesAndChecksumsasserts imported artifact refs and SHA-256 checksums are preserved.Integration / Functional Tests
Smoke Tests
go test ./internal/voiceimportfrombackend/.go test ./internal/voiceimport ./internal/multimodaltrace ./internal/voiceartifacts ./internal/challengepackfrombackend/.go test ./...frombackend/if focused tests pass.E2E Tests
Manual / cURL Tests
Review Checkpoint JSON
{ "branch": "codex/voice-call-import-redaction", "test_contract": "/tmp/codex-voice-call-import-redaction-test-contract.md", "created_at": "2026-05-13T17:31:00Z", "steps": [ { "step_number": 1, "title": "Add deterministic production-call import and redaction contract", "timestamp": "2026-05-13T17:41:00Z", "files_changed": [ "backend/internal/voiceimport/import.go", "backend/internal/voiceimport/import_test.go" ], "what_changed": "Added the voiceimport package with a strict JSON fixture contract for externally captured calls, including transcript entries, audio artifact refs, provider event fragments, voice artifact manifests, explicit redaction metadata/status, reviewer labels, expected outcome, failure category, and promotion targets. Added conversion to multimodaltrace.Trace, manifest preservation, and approved-only promotion into deterministic challengepack.CaseDefinition values.", "review_instructions": "Verify the import format is provider-neutral, has no provider/network/LLM dependency, validates redaction metadata explicitly, rejects non-approved promotion, preserves original artifact refs/checksums, and produces deterministic trace/manifest/challenge case outputs.", "review_result": { "status": "pass", "issues_found": [], "notes": "Self-review checked strict decode, redaction status validation, trace reference ordering, manifest preservation, approved-only promotion, provider event metadata, and the exact tests. Focused, neighboring, and full backend tests passed." }, "cumulative_review": { "previous_steps_still_valid": true, "integration_issues": [], "notes": "This package builds on existing multimodaltrace, voiceartifacts, and challengepack contracts without changing their behavior." } }, { "step_number": "final", "title": "Final review against test contract", "timestamp": "2026-05-13T17:42:00Z", "test_contract_review": { "functional_behavior": "pass - the import contract includes transcript, audio artifact refs, provider events, redaction metadata/status, reviewer labels, expected outcome, failure category, and promotion target; it validates into trace and artifact models and only approved imports can promote.", "unit_tests": "pass - tests cover valid redacted import, unreviewed promotion rejection, missing redaction rejection, deterministic approved challenge case, and artifact ref/checksum preservation.", "integration_tests": "N/A - internal contract package only.", "smoke_tests": "pass - go test ./internal/voiceimport, go test ./internal/voiceimport ./internal/multimodaltrace ./internal/voiceartifacts ./internal/challengepack, and go test ./... from backend passed.", "e2e_tests": "N/A - no production provider import in this slice.", "manual_tests": "N/A - deterministic unit tests cover the contract." }, "overall_verdict": "ready", "blocking_issues": [] } ] }