test(#752,#792): unique table names + require.* in pkg/repl#804
Merged
Conversation
…/repl block#752: setupTestTables now derives unique table names from t.Name() so one test cannot inherit a stale schema from another in the package. This is the structural fix for the column-index-3 row-image-length-2 flake on TestBufferedMapFlushUnderLockBypassesWatermark. block#792: migrate setup-path and dereference-after-check assertions from assert.* to require.* so transient failures fail fast at the source rather than cascading into confusing follow-on errors. Part of block#779. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nd table-driven loops keep assert Per block#779 follow-up: assert.* is now allowed ONLY inside goroutine bodies (where require.FailNow is unsafe) and inside table-driven test loops (where seeing all case failures together helps). Everything else uses require.*. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aparajon
approved these changes
May 2, 2026
…kg-repl # Conflicts: # pkg/repl/subscription_buffered_test.go # pkg/repl/subscription_map_test.go
morgo
added a commit
to morgo/spirit
that referenced
this pull request
May 3, 2026
…t prior workarounds Composite chunker had the same IsNil-returns-TRUE bug in two places: - The "no chunkPtrs and no checkpoint" branch (mirror of optimistic chunker). - The "no chunkPtrs but above checkpointHighPtr" branch (resume path). Both flip to returning FALSE for the same reason: the doc comment on KeyAboveHighWatermark says ambiguity should return FALSE, and these two branches were exactly the ambiguous "we haven't dispatched yet" state. Tests updated to match: chunker_composite_test.go's two assertions that encoded the buggy behavior (`KeyAboveHighWatermark` returns TRUE before Open() / before OpenAtWatermark()) now assert FALSE. Also reverts two prior workarounds that the issue-block#746 root cause makes unnecessary: - cutover_test.go: replaces `require.Eventually(5s, 250ms, ...)` around the _old/_new checksum compare with a strict single-shot `require.Equal`. The original wrapper was added under the theory that "Docker MySQL has variable I/O latency that delays final event delivery" (commits 9ab22fe, f0f44fb), but a 1-row deficit cannot be masked by 5 s of polling once cutover has completed. With the IsNil fix the deficit doesn't occur in the first place, so the polling is pure noise. - cutover.go: removes the second back-to-back `FlushUnderTableLock` call introduced by 3001db2. Each call already does flush → BlockWait → flush internally, so the flush's own binlog events (REPLACE INTO _new / DELETE FROM _new) are read back in the same call's BlockWait. The doubled call was a speculative "rare case" hedge for the same divergence the IsNil bug actually caused. A single call is sufficient and matches BUG.md's protocol. All four changes verified: - go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/ — 10/10 PASS. - Full pkg/migration suite — PASS (~4 min). - Full pkg/table, pkg/repl, pkg/copier suites — PASS. Refs block#746, block#801, block#797, block#795, block#804. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
morgo
added a commit
to morgo/spirit
that referenced
this pull request
May 3, 2026
… state `chunkerOptimistic.KeyAboveHighWatermark` returned TRUE when both `chunkPtr` and `checkpointHighPtr` were Nil, with the comment "every key is above because we haven't started copying." This caused the deltaMap and bufferedMap HasChanged paths to silently drop binlog events for every key during the window between `runner.go`'s `SetWatermarkOptimization(true)` (which enables the filter, just before copier startup) and the first `chunker.Next()` call (which sets `chunkPtr` non-nil). The intent was: "the chunker's later SELECT will pick the row up from the source table." That holds only for rows committed before the SELECT's snapshot. A writer that commits AFTER the snapshot timestamp but BEFORE `chunkPtr` is updated is in a dead zone — invisible to the SELECT (too late) AND silently dropped by the applier (KeyAboveHighWatermark returns TRUE). The row is lost. This is the root cause of issue block#746. PR block#764 disabled FixDifferences in TestCutoverAtomicityWithConcurrentWrites so copy-phase loss surfaces at the checksum step instead of being masked by repair. After that, the failure surfaced deterministically across block#801, block#797, block#795, block#804 with a consistent signature: a contiguous run of 3-5 PKs missing from chunk `[1, 1001)` (`pk=62,63,64,65`; `pk=102-105`; etc.) — exactly the shape of "several writer transactions committed during the dead zone, autoinc funnels them through in PK order, all silently dropped." Fix: return FALSE in the IsNil branch, per this method's own contract above ("if there is any ambiguity, it's important to return FALSE"). The change is buffered into the delta map. When the chunker advances and the relevant chunk's `Feedback` opens up `KeyBelowLowWatermark`, the next flush applies it via `REPLACE INTO _new ... SELECT FROM original WHERE id IN (...)`. Idempotent with anything the chunker's own SELECT picked up. Also updates two tests that asserted the buggy behavior: - `TestReplClientComplex` previously expected `GetDeltaLen() == 0` after a DELETE issued before the chunker's first `Next()`. Now the deltas are buffered and drained correctly when feedback advances the watermark; the post-final-flush count assertion is unchanged. - `TestOptimisticChunkerBasic` previously expected `KeyAboveHighWatermark(1) == true` immediately after `chunker.Open()` (before the first `Next()`). Now that returns FALSE for the same reason. Verified locally: - `go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/` — 10/10 PASS. - Full `pkg/migration` suite — PASS (~3 min). - Full `pkg/repl`, `pkg/copier`, `pkg/table` suites — PASS. Refs block#746, block#801, block#797, block#795, block#804. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
morgo
added a commit
to morgo/spirit
that referenced
this pull request
May 3, 2026
…t prior workarounds Composite chunker had the same IsNil-returns-TRUE bug in two places: - The "no chunkPtrs and no checkpoint" branch (mirror of optimistic chunker). - The "no chunkPtrs but above checkpointHighPtr" branch (resume path). Both flip to returning FALSE for the same reason: the doc comment on KeyAboveHighWatermark says ambiguity should return FALSE, and these two branches were exactly the ambiguous "we haven't dispatched yet" state. Tests updated to match: chunker_composite_test.go's two assertions that encoded the buggy behavior (`KeyAboveHighWatermark` returns TRUE before Open() / before OpenAtWatermark()) now assert FALSE. Also reverts two prior workarounds that the issue-block#746 root cause makes unnecessary: - cutover_test.go: replaces `require.Eventually(5s, 250ms, ...)` around the _old/_new checksum compare with a strict single-shot `require.Equal`. The original wrapper was added under the theory that "Docker MySQL has variable I/O latency that delays final event delivery" (commits 9ab22fe, f0f44fb), but a 1-row deficit cannot be masked by 5 s of polling once cutover has completed. With the IsNil fix the deficit doesn't occur in the first place, so the polling is pure noise. - cutover.go: removes the second back-to-back `FlushUnderTableLock` call introduced by 3001db2. Each call already does flush → BlockWait → flush internally, so the flush's own binlog events (REPLACE INTO _new / DELETE FROM _new) are read back in the same call's BlockWait. The doubled call was a speculative "rare case" hedge for the same divergence the IsNil bug actually caused. A single call is sufficient and matches BUG.md's protocol. All four changes verified: - go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/ — 10/10 PASS. - Full pkg/migration suite — PASS (~4 min). - Full pkg/table, pkg/repl, pkg/copier suites — PASS. Refs block#746, block#801, block#797, block#795, block#804. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
morgo
added a commit
to morgo/spirit
that referenced
this pull request
May 3, 2026
… state `chunkerOptimistic.KeyAboveHighWatermark` returned TRUE when both `chunkPtr` and `checkpointHighPtr` were Nil, with the comment "every key is above because we haven't started copying." This caused the deltaMap and bufferedMap HasChanged paths to silently drop binlog events for every key during the window between `runner.go`'s `SetWatermarkOptimization(true)` (which enables the filter, just before copier startup) and the first `chunker.Next()` call (which sets `chunkPtr` non-nil). The intent was: "the chunker's later SELECT will pick the row up from the source table." That holds only for rows committed before the SELECT's snapshot. A writer that commits AFTER the snapshot timestamp but BEFORE `chunkPtr` is updated is in a dead zone — invisible to the SELECT (too late) AND silently dropped by the applier (KeyAboveHighWatermark returns TRUE). The row is lost. This is the root cause of issue block#746. PR block#764 disabled FixDifferences in TestCutoverAtomicityWithConcurrentWrites so copy-phase loss surfaces at the checksum step instead of being masked by repair. After that, the failure surfaced deterministically across block#801, block#797, block#795, block#804 with a consistent signature: a contiguous run of 3-5 PKs missing from chunk `[1, 1001)` (`pk=62,63,64,65`; `pk=102-105`; etc.) — exactly the shape of "several writer transactions committed during the dead zone, autoinc funnels them through in PK order, all silently dropped." Fix: return FALSE in the IsNil branch, per this method's own contract above ("if there is any ambiguity, it's important to return FALSE"). The change is buffered into the delta map. When the chunker advances and the relevant chunk's `Feedback` opens up `KeyBelowLowWatermark`, the next flush applies it via `REPLACE INTO _new ... SELECT FROM original WHERE id IN (...)`. Idempotent with anything the chunker's own SELECT picked up. Also updates two tests that asserted the buggy behavior: - `TestReplClientComplex` previously expected `GetDeltaLen() == 0` after a DELETE issued before the chunker's first `Next()`. Now the deltas are buffered and drained correctly when feedback advances the watermark; the post-final-flush count assertion is unchanged. - `TestOptimisticChunkerBasic` previously expected `KeyAboveHighWatermark(1) == true` immediately after `chunker.Open()` (before the first `Next()`). Now that returns FALSE for the same reason. Verified locally: - `go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/` — 10/10 PASS. - Full `pkg/migration` suite — PASS (~3 min). - Full `pkg/repl`, `pkg/copier`, `pkg/table` suites — PASS. Refs block#746, block#801, block#797, block#795, block#804. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
morgo
added a commit
to morgo/spirit
that referenced
this pull request
May 3, 2026
…t prior workarounds Composite chunker had the same IsNil-returns-TRUE bug in two places: - The "no chunkPtrs and no checkpoint" branch (mirror of optimistic chunker). - The "no chunkPtrs but above checkpointHighPtr" branch (resume path). Both flip to returning FALSE for the same reason: the doc comment on KeyAboveHighWatermark says ambiguity should return FALSE, and these two branches were exactly the ambiguous "we haven't dispatched yet" state. Tests updated to match: chunker_composite_test.go's two assertions that encoded the buggy behavior (`KeyAboveHighWatermark` returns TRUE before Open() / before OpenAtWatermark()) now assert FALSE. Also reverts two prior workarounds that the issue-block#746 root cause makes unnecessary: - cutover_test.go: replaces `require.Eventually(5s, 250ms, ...)` around the _old/_new checksum compare with a strict single-shot `require.Equal`. The original wrapper was added under the theory that "Docker MySQL has variable I/O latency that delays final event delivery" (commits 9ab22fe, f0f44fb), but a 1-row deficit cannot be masked by 5 s of polling once cutover has completed. With the IsNil fix the deficit doesn't occur in the first place, so the polling is pure noise. - cutover.go: removes the second back-to-back `FlushUnderTableLock` call introduced by 3001db2. Each call already does flush → BlockWait → flush internally, so the flush's own binlog events (REPLACE INTO _new / DELETE FROM _new) are read back in the same call's BlockWait. The doubled call was a speculative "rare case" hedge for the same divergence the IsNil bug actually caused. A single call is sufficient and matches BUG.md's protocol. All four changes verified: - go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/ — 10/10 PASS. - Full pkg/migration suite — PASS (~4 min). - Full pkg/table, pkg/repl, pkg/copier suites — PASS. Refs block#746, block#801, block#797, block#795, block#804. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
morgo
added a commit
to morgo/spirit
that referenced
this pull request
May 3, 2026
…fered+buffered The cutover atomicity test (TestCutoverAtomicityWithConcurrentWrites) has proven its value as a probe for issue block#746 — it surfaced the KeyAboveHighWatermark IsNil bug across multiple unrelated PRs (block#801, block#797, block#795, block#804) once FixDifferences was disabled for it. Until now it only exercised the optimistic chunker in unbuffered mode. Both the composite chunker and the buffered copier path have the same shape and the same dead-zone window between SetWatermarkOptimization(true) and the chunker's first Next(), so they should be probed too. Refactor the test body into a helper runCutoverAtomicityTest(t, tableName, schema, buffered) and wrap it in four subtests: - optimistic_unbuffered (table t1concurrent_oub) - optimistic_buffered (table t1concurrent_obu) - composite_unbuffered (table t1concurrent_cub) - composite_buffered (table t1concurrent_cbu) The composite variants use a (id, x_token) PK so NewChunker selects the composite chunker (optimistic requires single-column auto_increment). Buffered variants opt in via WithBuffered(true); they Skip on the minimal-RBR test runner since buffered requires binlog_row_image=FULL. The four subtests run in parallel under t.Parallel() with unique table names so they don't collide on _<name>_old / _<name>_new artifacts. The failure-diagnostic block (max(id), missing PKs, diverged rows) now parameterizes the table name so each subtest's logs name themselves correctly. migrationConcurrentWriteThread / doOneMigrationWriteLoop now take a tableName parameter and build their SQL with fmt.Sprintf instead of hard-coding `t1concurrent`. Verified locally: all 4 variants pass standalone and together (go test -count=5 -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/), full pkg/migration suite passes (~3 min). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #752. Closes #792. Part of #779.
Summary
setupTestTablesnow derives unique table names fromt.Name(), so a previous test's schema can't bleed into the next via the sharedsubscription_test/_subscription_test_newnames. Eliminates the "column index 3 exceeds row image length 2" failure mode by construction.assert.*→require.*inpkg/repl/*_test.gowhere appropriate (setup, dereference-after-check). Loop assertions kept asassert.*where useful.Test plan
🤖 Generated with Claude Code