Skip to content

test(#752,#792): unique table names + require.* in pkg/repl#804

Merged
morgo merged 8 commits into
block:mainfrom
morgo:fix-issue-752-792-pkg-repl
May 3, 2026
Merged

test(#752,#792): unique table names + require.* in pkg/repl#804
morgo merged 8 commits into
block:mainfrom
morgo:fix-issue-752-792-pkg-repl

Conversation

@morgo
Copy link
Copy Markdown
Collaborator

@morgo morgo commented May 2, 2026

Closes #752. Closes #792. Part of #779.

Summary

Test plan

🤖 Generated with Claude Code

morgo and others added 3 commits May 1, 2026 20:35
…/repl

block#752: setupTestTables now derives unique table names from t.Name() so
one test cannot inherit a stale schema from another in the package.
This is the structural fix for the column-index-3 row-image-length-2
flake on TestBufferedMapFlushUnderLockBypassesWatermark.

block#792: migrate setup-path and dereference-after-check assertions from
assert.* to require.* so transient failures fail fast at the source
rather than cascading into confusing follow-on errors. Part of block#779.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nd table-driven loops keep assert

Per block#779 follow-up: assert.* is now allowed ONLY inside goroutine bodies
(where require.FailNow is unsafe) and inside table-driven test loops
(where seeing all case failures together helps). Everything else uses
require.*.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@morgo morgo marked this pull request as ready for review May 2, 2026 11:39
…kg-repl

# Conflicts:
#	pkg/repl/subscription_buffered_test.go
#	pkg/repl/subscription_map_test.go
@morgo morgo enabled auto-merge May 3, 2026 00:28
morgo added a commit to morgo/spirit that referenced this pull request May 3, 2026
…t prior workarounds

Composite chunker had the same IsNil-returns-TRUE bug in two places:
  - The "no chunkPtrs and no checkpoint" branch (mirror of optimistic chunker).
  - The "no chunkPtrs but above checkpointHighPtr" branch (resume path).

Both flip to returning FALSE for the same reason: the doc comment on
KeyAboveHighWatermark says ambiguity should return FALSE, and these two
branches were exactly the ambiguous "we haven't dispatched yet" state.

Tests updated to match: chunker_composite_test.go's two assertions that
encoded the buggy behavior (`KeyAboveHighWatermark` returns TRUE before
Open() / before OpenAtWatermark()) now assert FALSE.

Also reverts two prior workarounds that the issue-block#746 root cause makes
unnecessary:

  - cutover_test.go: replaces `require.Eventually(5s, 250ms, ...)` around
    the _old/_new checksum compare with a strict single-shot
    `require.Equal`. The original wrapper was added under the theory that
    "Docker MySQL has variable I/O latency that delays final event
    delivery" (commits 9ab22fe, f0f44fb), but a 1-row deficit cannot be
    masked by 5 s of polling once cutover has completed. With the IsNil
    fix the deficit doesn't occur in the first place, so the polling is
    pure noise.

  - cutover.go: removes the second back-to-back `FlushUnderTableLock`
    call introduced by 3001db2. Each call already does
    flush → BlockWait → flush internally, so the flush's own
    binlog events (REPLACE INTO _new / DELETE FROM _new) are read back
    in the same call's BlockWait. The doubled call was a speculative
    "rare case" hedge for the same divergence the IsNil bug actually
    caused. A single call is sufficient and matches BUG.md's protocol.

All four changes verified:
  - go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites
    ./pkg/migration/ — 10/10 PASS.
  - Full pkg/migration suite — PASS (~4 min).
  - Full pkg/table, pkg/repl, pkg/copier suites — PASS.

Refs block#746, block#801, block#797, block#795, block#804.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
morgo added a commit to morgo/spirit that referenced this pull request May 3, 2026
… state

`chunkerOptimistic.KeyAboveHighWatermark` returned TRUE when both `chunkPtr`
and `checkpointHighPtr` were Nil, with the comment "every key is above
because we haven't started copying." This caused the deltaMap and bufferedMap
HasChanged paths to silently drop binlog events for every key during the
window between `runner.go`'s `SetWatermarkOptimization(true)` (which enables
the filter, just before copier startup) and the first `chunker.Next()` call
(which sets `chunkPtr` non-nil).

The intent was: "the chunker's later SELECT will pick the row up from the
source table." That holds only for rows committed before the SELECT's
snapshot. A writer that commits AFTER the snapshot timestamp but BEFORE
`chunkPtr` is updated is in a dead zone — invisible to the SELECT (too late)
AND silently dropped by the applier (KeyAboveHighWatermark returns TRUE).
The row is lost.

This is the root cause of issue block#746. PR block#764 disabled FixDifferences in
TestCutoverAtomicityWithConcurrentWrites so copy-phase loss surfaces at the
checksum step instead of being masked by repair. After that, the failure
surfaced deterministically across block#801, block#797, block#795, block#804 with a consistent
signature: a contiguous run of 3-5 PKs missing from chunk `[1, 1001)`
(`pk=62,63,64,65`; `pk=102-105`; etc.) — exactly the shape of "several writer
transactions committed during the dead zone, autoinc funnels them through in
PK order, all silently dropped."

Fix: return FALSE in the IsNil branch, per this method's own contract above
("if there is any ambiguity, it's important to return FALSE"). The change is
buffered into the delta map. When the chunker advances and the relevant
chunk's `Feedback` opens up `KeyBelowLowWatermark`, the next flush applies it
via `REPLACE INTO _new ... SELECT FROM original WHERE id IN (...)`.
Idempotent with anything the chunker's own SELECT picked up.

Also updates two tests that asserted the buggy behavior:

  - `TestReplClientComplex` previously expected `GetDeltaLen() == 0` after
    a DELETE issued before the chunker's first `Next()`. Now the deltas
    are buffered and drained correctly when feedback advances the
    watermark; the post-final-flush count assertion is unchanged.

  - `TestOptimisticChunkerBasic` previously expected
    `KeyAboveHighWatermark(1) == true` immediately after `chunker.Open()`
    (before the first `Next()`). Now that returns FALSE for the same reason.

Verified locally:
  - `go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites
    ./pkg/migration/` — 10/10 PASS.
  - Full `pkg/migration` suite — PASS (~3 min).
  - Full `pkg/repl`, `pkg/copier`, `pkg/table` suites — PASS.

Refs block#746, block#801, block#797, block#795, block#804.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
morgo added a commit to morgo/spirit that referenced this pull request May 3, 2026
…t prior workarounds

Composite chunker had the same IsNil-returns-TRUE bug in two places:
  - The "no chunkPtrs and no checkpoint" branch (mirror of optimistic chunker).
  - The "no chunkPtrs but above checkpointHighPtr" branch (resume path).

Both flip to returning FALSE for the same reason: the doc comment on
KeyAboveHighWatermark says ambiguity should return FALSE, and these two
branches were exactly the ambiguous "we haven't dispatched yet" state.

Tests updated to match: chunker_composite_test.go's two assertions that
encoded the buggy behavior (`KeyAboveHighWatermark` returns TRUE before
Open() / before OpenAtWatermark()) now assert FALSE.

Also reverts two prior workarounds that the issue-block#746 root cause makes
unnecessary:

  - cutover_test.go: replaces `require.Eventually(5s, 250ms, ...)` around
    the _old/_new checksum compare with a strict single-shot
    `require.Equal`. The original wrapper was added under the theory that
    "Docker MySQL has variable I/O latency that delays final event
    delivery" (commits 9ab22fe, f0f44fb), but a 1-row deficit cannot be
    masked by 5 s of polling once cutover has completed. With the IsNil
    fix the deficit doesn't occur in the first place, so the polling is
    pure noise.

  - cutover.go: removes the second back-to-back `FlushUnderTableLock`
    call introduced by 3001db2. Each call already does
    flush → BlockWait → flush internally, so the flush's own
    binlog events (REPLACE INTO _new / DELETE FROM _new) are read back
    in the same call's BlockWait. The doubled call was a speculative
    "rare case" hedge for the same divergence the IsNil bug actually
    caused. A single call is sufficient and matches BUG.md's protocol.

All four changes verified:
  - go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites
    ./pkg/migration/ — 10/10 PASS.
  - Full pkg/migration suite — PASS (~4 min).
  - Full pkg/table, pkg/repl, pkg/copier suites — PASS.

Refs block#746, block#801, block#797, block#795, block#804.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@morgo morgo merged commit b7eaeb3 into block:main May 3, 2026
12 checks passed
@morgo morgo deleted the fix-issue-752-792-pkg-repl branch May 3, 2026 02:20
morgo added a commit to morgo/spirit that referenced this pull request May 3, 2026
… state

`chunkerOptimistic.KeyAboveHighWatermark` returned TRUE when both `chunkPtr`
and `checkpointHighPtr` were Nil, with the comment "every key is above
because we haven't started copying." This caused the deltaMap and bufferedMap
HasChanged paths to silently drop binlog events for every key during the
window between `runner.go`'s `SetWatermarkOptimization(true)` (which enables
the filter, just before copier startup) and the first `chunker.Next()` call
(which sets `chunkPtr` non-nil).

The intent was: "the chunker's later SELECT will pick the row up from the
source table." That holds only for rows committed before the SELECT's
snapshot. A writer that commits AFTER the snapshot timestamp but BEFORE
`chunkPtr` is updated is in a dead zone — invisible to the SELECT (too late)
AND silently dropped by the applier (KeyAboveHighWatermark returns TRUE).
The row is lost.

This is the root cause of issue block#746. PR block#764 disabled FixDifferences in
TestCutoverAtomicityWithConcurrentWrites so copy-phase loss surfaces at the
checksum step instead of being masked by repair. After that, the failure
surfaced deterministically across block#801, block#797, block#795, block#804 with a consistent
signature: a contiguous run of 3-5 PKs missing from chunk `[1, 1001)`
(`pk=62,63,64,65`; `pk=102-105`; etc.) — exactly the shape of "several writer
transactions committed during the dead zone, autoinc funnels them through in
PK order, all silently dropped."

Fix: return FALSE in the IsNil branch, per this method's own contract above
("if there is any ambiguity, it's important to return FALSE"). The change is
buffered into the delta map. When the chunker advances and the relevant
chunk's `Feedback` opens up `KeyBelowLowWatermark`, the next flush applies it
via `REPLACE INTO _new ... SELECT FROM original WHERE id IN (...)`.
Idempotent with anything the chunker's own SELECT picked up.

Also updates two tests that asserted the buggy behavior:

  - `TestReplClientComplex` previously expected `GetDeltaLen() == 0` after
    a DELETE issued before the chunker's first `Next()`. Now the deltas
    are buffered and drained correctly when feedback advances the
    watermark; the post-final-flush count assertion is unchanged.

  - `TestOptimisticChunkerBasic` previously expected
    `KeyAboveHighWatermark(1) == true` immediately after `chunker.Open()`
    (before the first `Next()`). Now that returns FALSE for the same reason.

Verified locally:
  - `go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites
    ./pkg/migration/` — 10/10 PASS.
  - Full `pkg/migration` suite — PASS (~3 min).
  - Full `pkg/repl`, `pkg/copier`, `pkg/table` suites — PASS.

Refs block#746, block#801, block#797, block#795, block#804.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
morgo added a commit to morgo/spirit that referenced this pull request May 3, 2026
…t prior workarounds

Composite chunker had the same IsNil-returns-TRUE bug in two places:
  - The "no chunkPtrs and no checkpoint" branch (mirror of optimistic chunker).
  - The "no chunkPtrs but above checkpointHighPtr" branch (resume path).

Both flip to returning FALSE for the same reason: the doc comment on
KeyAboveHighWatermark says ambiguity should return FALSE, and these two
branches were exactly the ambiguous "we haven't dispatched yet" state.

Tests updated to match: chunker_composite_test.go's two assertions that
encoded the buggy behavior (`KeyAboveHighWatermark` returns TRUE before
Open() / before OpenAtWatermark()) now assert FALSE.

Also reverts two prior workarounds that the issue-block#746 root cause makes
unnecessary:

  - cutover_test.go: replaces `require.Eventually(5s, 250ms, ...)` around
    the _old/_new checksum compare with a strict single-shot
    `require.Equal`. The original wrapper was added under the theory that
    "Docker MySQL has variable I/O latency that delays final event
    delivery" (commits 9ab22fe, f0f44fb), but a 1-row deficit cannot be
    masked by 5 s of polling once cutover has completed. With the IsNil
    fix the deficit doesn't occur in the first place, so the polling is
    pure noise.

  - cutover.go: removes the second back-to-back `FlushUnderTableLock`
    call introduced by 3001db2. Each call already does
    flush → BlockWait → flush internally, so the flush's own
    binlog events (REPLACE INTO _new / DELETE FROM _new) are read back
    in the same call's BlockWait. The doubled call was a speculative
    "rare case" hedge for the same divergence the IsNil bug actually
    caused. A single call is sufficient and matches BUG.md's protocol.

All four changes verified:
  - go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites
    ./pkg/migration/ — 10/10 PASS.
  - Full pkg/migration suite — PASS (~4 min).
  - Full pkg/table, pkg/repl, pkg/copier suites — PASS.

Refs block#746, block#801, block#797, block#795, block#804.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
morgo added a commit to morgo/spirit that referenced this pull request May 3, 2026
…fered+buffered

The cutover atomicity test (TestCutoverAtomicityWithConcurrentWrites) has
proven its value as a probe for issue block#746 — it surfaced the
KeyAboveHighWatermark IsNil bug across multiple unrelated PRs (block#801,
block#797, block#795, block#804) once FixDifferences was disabled for it. Until now it
only exercised the optimistic chunker in unbuffered mode. Both the
composite chunker and the buffered copier path have the same shape and
the same dead-zone window between SetWatermarkOptimization(true) and
the chunker's first Next(), so they should be probed too.

Refactor the test body into a helper runCutoverAtomicityTest(t,
tableName, schema, buffered) and wrap it in four subtests:

  - optimistic_unbuffered  (table t1concurrent_oub)
  - optimistic_buffered    (table t1concurrent_obu)
  - composite_unbuffered   (table t1concurrent_cub)
  - composite_buffered     (table t1concurrent_cbu)

The composite variants use a (id, x_token) PK so NewChunker selects the
composite chunker (optimistic requires single-column auto_increment).
Buffered variants opt in via WithBuffered(true); they Skip on the
minimal-RBR test runner since buffered requires binlog_row_image=FULL.

The four subtests run in parallel under t.Parallel() with unique table
names so they don't collide on _<name>_old / _<name>_new artifacts. The
failure-diagnostic block (max(id), missing PKs, diverged rows) now
parameterizes the table name so each subtest's logs name themselves
correctly.

migrationConcurrentWriteThread / doOneMigrationWriteLoop now take a
tableName parameter and build their SQL with fmt.Sprintf instead of
hard-coding `t1concurrent`.

Verified locally: all 4 variants pass standalone and together
(go test -count=5 -run TestCutoverAtomicityWithConcurrentWrites
./pkg/migration/), full pkg/migration suite passes (~3 min).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants