
check: require binlog_order_commits=ON #819

Merged
morgo merged 5 commits into block:main from morgo:require-binlog-order-commits
May 4, 2026
Conversation

Collaborator

@morgo morgo commented May 3, 2026

Closes #818.

Summary

Two changes:

  1. Production check (the original purpose of this PR). Both pkg/migration/check/configuration.go and pkg/move/check/configuration.go now query @@global.binlog_order_commits alongside the other binlog-related variables they already validate, and reject startup with a clear error if it isn't 1.

  2. Test cleanup (added in the second commit, broader than originally scoped). Removed all SET GLOBAL tests from both check packages: the binlog_row_image=NOBLOB test, the new binlog_order_commits=OFF tests this PR initially added, and the two TestPrivilegesWithRDSSuperuserRole variants that flipped activate_all_roles_on_login. See "Why the test cleanup (item 2)" below.

Why the production check (item 1)

With binlog_order_commits=OFF, MySQL is allowed to commit transactions to InnoDB in a different order than they appear in the binary log: a transaction T's row events can be written to the binlog before T's commit becomes visible to fresh SELECTs on other connections.

Spirit's binlog applier path assumes the opposite. When the streamer delivers a row event to HasChanged, the row is expected to be visible to deltaMap/bufferedMap.Flush's REPLACE INTO _new ... SELECT FROM original WHERE pk IN (...) running on a separate autocommit connection. With binlog_order_commits=OFF that assumption breaks and rows can be silently lost during cutover.

This matches the failure shape of the still-open second bug tracked in issue #746, observed in CI on PRs #813 and #814 (Flush stmt num_keys=3 affected_rows=2, where one PK is in the delta map but invisible to the SELECT). binlog_order_commits=ON is the MySQL default, so this check is mostly defensive, but it eliminates one hypothesis from the #746 investigation and protects future users on customised configurations from a silent data-loss mode.
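The visibility assumption can be illustrated with a toy model. This is a sketch only, not spirit's code: `flush` and the `visible` map are hypothetical stand-ins for the applier's flush step and for the set of rows a fresh SELECT can see. The keys come from the log line quoted in this PR; which of them was invisible is not stated, so 179 is an arbitrary choice.

```go
package main

import (
	"fmt"
	"sort"
)

// flush models REPLACE INTO _new ... SELECT FROM original WHERE pk IN (...):
// it can only copy keys whose commit is already visible to a fresh read.
// Both parameters are hypothetical stand-ins, not spirit's real types.
func flush(deltaKeys []int, visible map[int]bool) (copied, missed []int) {
	for _, k := range deltaKeys {
		if visible[k] {
			copied = append(copied, k)
		} else {
			missed = append(missed, k)
		}
	}
	sort.Ints(copied)
	sort.Ints(missed)
	return copied, missed
}

func main() {
	// The binlog event for every key has been delivered (all three are in
	// the delta map), but one commit is not yet visible to other connections.
	deltaKeys := []int{177, 180, 179}
	visible := map[int]bool{177: true, 180: true} // 179 chosen arbitrarily

	copied, missed := flush(deltaKeys, visible)
	fmt.Printf("num_keys=%d copied=%v missed=%v\n", len(deltaKeys), copied, missed)
}
```

In this model the missed key is never retried, because the delta map entry is consumed by the flush. That is the silent-loss mode the check is meant to rule out.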

Why the test cleanup (item 2)

Initial review of the new tests pointed out that SET GLOBAL flips a server-wide variable, so any other Go test binary running configurationCheck (or any check that reads the same global) during the flipped window sees the wrong value and fails. t.Parallel() only governs scheduling within a single test binary; go test runs the binaries of distinct packages in parallel via -p (default GOMAXPROCS), and pkg/migration/check, pkg/move/check, pkg/migration, etc. all share the same test MySQL.

That race wasn't introduced by this PR — the existing binlog_row_image=NOBLOB test in configuration_test.go and the two TestPrivilegesWithRDSSuperuserRole variants that flip activate_all_roles_on_login had it too. They're a plausible source of cross-package flakes elsewhere in the suite.

Rather than adding more SET GLOBAL tests to a pattern we already wanted to clean up, this PR drops them all. Each removed test leaves a short comment explaining the rationale and pointing at the path back: refactor the affected check to take its variable values via an injectable struct, then test it as a pure unit without touching the server. That refactor is out of scope for this PR; tracked for follow-up.
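A minimal sketch of what that follow-up refactor might look like, assuming the check is split into a query step and a pure validation step. The names `serverVars`, `checkVars` and `errBinlogOrderCommits` are hypothetical and do not exist in the repository; only the "reject unless the value is 1" rule comes from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// serverVars is a hypothetical value struct filled in by the caller, for
// example from the existing combined SELECT. The real check validates more
// variables; only the one added by this PR is shown.
type serverVars struct {
	BinlogOrderCommits string // "1" when the variable is ON
}

// errBinlogOrderCommits is a hypothetical error value for illustration.
var errBinlogOrderCommits = errors.New("binlog_order_commits must be ON")

// checkVars is pure: it never touches the server, so both the accept and
// reject branches can be tested without SET GLOBAL.
func checkVars(v serverVars) error {
	if v.BinlogOrderCommits != "1" {
		return errBinlogOrderCommits
	}
	return nil
}

func main() {
	cases := []struct {
		name string
		vars serverVars
	}{
		{"on", serverVars{BinlogOrderCommits: "1"}},
		{"off", serverVars{BinlogOrderCommits: "0"}},
	}
	for _, c := range cases {
		fmt.Printf("%s: rejected=%t\n", c.name, checkVars(c.vars) != nil)
	}
}
```

With this shape, the rejection paths dropped here could come back as table-driven unit tests that run safely in parallel across packages.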

The negative branches (wrong binlog_row_image, OFF binlog_order_commits, wrong activate_all_roles_on_login) are now exercised only at startup against a real misconfigured server. In exchange, a significant source of cross-package flake risk is gone.

Will this fix #746's second bug?

binlog_order_commits=ON is the MySQL default and the docker-compose images don't override it, so unless the CI MySQL is somehow running with it OFF, this check won't fire on the failing lanes. Worth a quick SELECT @@global.binlog_order_commits on the docker-compose lane to verify — if it's OFF, this PR is the actual fix; if it's ON, the second bug is something else and this PR still hardens spirit against future users running with it OFF.

Changes

  • pkg/migration/check/configuration.go — adds binlog_order_commits to the existing combined SELECT and a hard-fail check after the other variable validations.
  • pkg/move/check/configuration.go — same, with the per-source error wrapping that file uses.
  • pkg/migration/check/configuration_test.go, pkg/move/check/configuration_test.go, pkg/migration/check/privileges_test.go, pkg/move/check/privileges_test.go — drop all SET GLOBAL tests; happy-path coverage retained.

Test plan

  • go build ./... clean.
  • go vet ./... clean.
  • go test ./pkg/migration/check ./pkg/move/check — pass.

🤖 Generated with Claude Code

morgo and others added 3 commits May 3, 2026 09:59
Both pkg/migration/check/configuration.go and pkg/move/check/configuration.go
already validate the binlog-related server settings spirit relies on
(binlog_format, binlog_row_image, binlog_row_value_options, log_bin,
log_slave_updates), but neither checks binlog_order_commits.

With binlog_order_commits=OFF, MySQL is allowed to commit transactions to
InnoDB in a different order than they appear in the binary log: a
transaction T's row events can be written to the binlog *before* T's
commit becomes visible to fresh SELECTs on other connections. Spirit's
binlog applier path assumes the opposite — when the streamer delivers a
row event to HasChanged, that row is expected to be visible to
deltaMap/bufferedMap.Flush's `REPLACE INTO _new ... SELECT FROM original
WHERE pk IN (...)` (running on a separate autocommit connection).

If the assumption is violated, spirit silently drops rows during cutover.
This is exactly the failure shape on the still-open second issue-block#746
bug observed in CI on PRs block#813 and block#814:

  Flush stmt executed table=t1concurrent_oub
      num_keys=3 affected_rows=2
      stmt="REPLACE INTO _t1concurrent_oub_new (...)
            SELECT ... FROM t1concurrent_oub WHERE id IN (177,180,179)"

A key was in the delta map (so HasChanged saw the binlog event) but the
SELECT couldn't find it in `original` at the statement's snapshot —
binlog_order_commits=OFF is one of the few server configurations that
allows that.

binlog_order_commits=ON is the MySQL default, so this check is mostly
defensive — it eliminates one whole hypothesis branch from the issue
block#746 investigation, and protects future users on customised configs
from a silent data-loss mode regardless of how that investigation
shakes out.

Both check files now SELECT @@global.binlog_order_commits alongside the
existing variables and return a configuration error if it is not "1".
Tests in both packages flip the variable to OFF and assert the check
fails, then restore via SET GLOBAL = ON in a defer (boolean variables
don't accept the parameterised string form that binlog_row_image does).

Refs block#818, block#746.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

These tests flipped server-wide globals (`binlog_row_image`,
`binlog_order_commits`, `activate_all_roles_on_login`) to exercise
configurationCheck/privilegesCheck rejection paths. Within a single Go
test binary, omitting t.Parallel() keeps these tests sequential, but that
only governs within-binary scheduling: go test runs distinct package
binaries in parallel via -p (default GOMAXPROCS), and pkg/migration/check,
pkg/move/check, pkg/migration, etc. all share the same test MySQL.

Any other binary running `configurationCheck` or `privilegesCheck` (which
fires from every full migration's preflight) during a flipped window saw
the wrong value and failed with a misleading error. That's a real source
of cross-binary test flakes that has been present since these tests
landed; PR block#819's new binlog_order_commits=OFF tests would have just
added more instances of the same pattern.

Removed:
  - pkg/migration/check/configuration_test.go: the binlog_row_image=NOBLOB
    section + the binlog_order_commits=OFF section added by PR block#819's
    initial revision.
  - pkg/move/check/configuration_test.go: TestConfigurationCheckBinlogOrderCommits
    added by PR block#819's initial revision.
  - pkg/migration/check/privileges_test.go: TestPrivilegesWithRDSSuperuserRole.
  - pkg/move/check/privileges_test.go: TestMovePrivilegesWithRDSSuperuserRole.

The production-code change in PR block#819 — requiring `binlog_order_commits=ON`
in both configuration checks — is unchanged. Negative-path coverage for
all of these branches (wrong binlog_row_image, OFF binlog_order_commits,
wrong activate_all_roles_on_login) is now exercised only at startup
against a real misconfigured server, until configurationCheck and
privilegesCheck are refactored to be unit-testable without touching
server globals.

Each removed test left a comment block explaining the rationale and
flagging the unit-testability refactor as the path back to coverage.

Refs block#818, block#746.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…no SET GLOBAL)

The previous commit dropped both TestPrivilegesWithRDSSuperuserRole tests
because their assertions required `SET GLOBAL activate_all_roles_on_login`,
which races with concurrent Go test binaries. That removed more coverage
than necessary — the *acceptance* path can be tested without flipping the
global, by:

  1. Reading @@global.activate_all_roles_on_login at test start.
  2. t.Skip if not ON.
  3. Otherwise, set up the RDS-style role + user and verify
     privilegesCheck passes (the role-tolerance path).

The rejection path (activate_all_roles_on_login=OFF still rejects the
role) remains uncovered until privilegesCheck is unit-testable without
touching server globals.

Both restored tests:
  - migration/check/privileges_test.go::TestPrivilegesWithRDSSuperuserRole
  - move/check/privileges_test.go::TestMovePrivilegesWithRDSSuperuserRole

No behaviour change for callers; only the negative-path assertion is
gone.

Refs block#818.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@morgo morgo merged commit 3f2931a into block:main May 4, 2026
12 checks passed
@morgo morgo deleted the require-binlog-order-commits branch May 4, 2026 02:29

Development

Successfully merging this pull request may close these issues.

configuration check: require binlog_order_commits=ON in pkg/migration/check and pkg/move/check
flaky test: TestCutoverAtomicityWithConcurrentWrites

2 participants