add cmd/binlog-visibility-repro: stand-alone MySQL contract repro #820
Merged
Probe whether MySQL honours the binlog_order_commits=ON contract under concurrent commit load: every row event the binlog streamer observes should already be visible to a fresh autocommit SELECT on a separate connection.
Producers loop BEGIN/INSERT/UPDATE/COMMIT on a fresh table; a streamer goroutine reads WriteRowsEvents directly via go-mysql-org/go-mysql and, for each new PK, runs a SELECT on a distinct connection pool. Failures are logged with binlog_pos, retry-after-50ms outcome, and a producer cross-check.
The package deliberately uses no spirit/pkg/repl machinery, so any failure attributes to MySQL (or to the test producer) rather than to spirit's wrappers.
Picked up automatically by every existing CI matrix lane via go test ./...; tunable via SPIRIT_VISIBILITY_STRESS_DURATION, _PRODUCERS, and _VERBOSE env vars.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the 3-case event-type switch with an if so the exhaustive linter doesn't demand cases for every replication.EventType constant.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
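A minimal sketch of the switch-to-if change this commit describes. The `eventType` type and its constants are hypothetical stand-ins for an enum-like type such as replication.EventType, and the three values matched here are an assumption for illustration only; the commit does not say which cases the original switch handled.

```go
package main

import "fmt"

// eventType is a hypothetical stand-in for an enum-like type that has many
// constants (the real code uses replication.EventType).
type eventType byte

const (
	evWriteRows eventType = iota
	evUpdateRows
	evDeleteRows
	evQuery
	evXID
	// A real enum would define many more constants here.
)

// handle uses a plain if instead of a switch over eventType. Because no
// switch statement on the enum exists, an exhaustiveness linter has no
// missing cases to report.
func handle(t eventType) string {
	if t == evWriteRows || t == evUpdateRows || t == evDeleteRows {
		return "row event"
	}
	return "ignored"
}

func main() {
	fmt.Println(handle(evWriteRows), "/", handle(evQuery))
}
```

The trade-off is that the compiler and linter no longer flag newly added constants, which is acceptable here because every other event type is deliberately ignored.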
Increase stress duration so the lower-load lanes (8.4 GA via the replication-tls compose, 8.0.45-with-replicas) get enough commits to surface the binlog_order_commits=ON visibility race that the 5s run already reproduced on 8.0.28 / 8.0.42 / 8.0.45-rbr-min / 9.7. This branch is not intended to merge; it exists purely to gather repro data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strips the visibility-stress test down to a single main.go that uses only database/sql + go-mysql-org/go-mysql/replication. No spirit dependencies, no testing framework, so it is suitable for attaching to an upstream MySQL bug report.
Bisect on Docker MySQL 8.0.45 (no -race, ~16k commits/5s):
- autocommit INSERT only: 5/5 PASS
- BEGIN; INSERT; COMMIT: 5/5 PASS
- BEGIN; INSERT; UPDATE; COMMIT: ~2/3 FAIL (kept as the minimum)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes that turn the artifact into something committable:
1. Drop pkg/replstress/visibility_stress_test.go. The test form would fail in CI (which is the entire point of the probe), so it can't live in the test matrix. The standalone CLI is the durable form.
2. Switch the CLI from "run for a fixed duration, count misses" to "run until the first miss, re-check after 100ms, exit". This is how a reproducer should behave: it terminates as soon as it has evidence, with a clear exit code (1 = miss seen, 0 = clean within the budget, 2 = infrastructure issue).
New flags: -max-duration (cap, default 5m) and -recheck-delay (default 100ms). The 100ms recheck classifies the miss as delayed (visible after the wait) or permanent (still not visible), which together capture the soft and hard forms of the same contract violation.
Also clears the lint suite (errcheck / noctx / exitAfterDefer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new GitHub Actions workflow that runs cmd/binlog-visibility-repro
against every MySQL version in the existing test matrix:
- 8.0.28 (Aurora), 8.0.42 (Aurora LTS), 8.0.45 (default),
8.0.45 RBR-minimal, 8.4 GA, 9.7
Each lane brings up that lane's MySQL container and runs the repro
binary until either the first miss is detected (lane fails, exit 1) or
-max-duration elapses (lane passes, exit 0). The lane outcomes are
independent (fail-fast: false) so a single failure doesn't mask the
behaviour on the other versions.
Triggered via workflow_dispatch and on push to this branch only — never
on pull_request, since the workflow is intentionally flaky-by-design
(reproducing a contract violation is the whole point) and would
otherwise gate unrelated PRs.
Implementation: a `repro` service is added to compose/compose.yml and
compose/replication-tls/replication-ci.yml that builds the binary and
runs it against the local mysql service. The workflow just brings up
`mysql repro` per lane with `--exit-code-from repro` so the repro's
exit code propagates to the workflow step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Workflow name now "Binlog visibility stress /w docker-compose" so it reads alongside the existing "MySQL X.Y /w docker-compose" workflows.
- Matrix lane names prefixed with "Binlog visibility stress /" so the six lanes sort together in the GH check list and identify themselves immediately.
- Triggers match every other matrix workflow: push to main and pull_request to main.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
We've already proven the contract violation reproduces on every MySQL version we test; pinning a CI workflow on it adds noise without new information. The reproducer lives as a stand-alone artefact in cmd/binlog-visibility-repro, runnable on demand against any MySQL via the docker-compose `repro` service. Anyone can run it locally with the commands documented in the PR description.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aparajon
approved these changes
May 3, 2026
Two new docker-compose overlays for binlog-visibility-repro testing:
- 8.0.43.yml — mysql:8.0.43 (the OS contributor's local
version), Oracle Linux 9 base. Same userspace
layer as the default tag.
- 8.0.43-debian.yml — mysql:8.0.43-debian (linux/amd64 only). Same
MySQL binary version, different container
userspace. On Apple Silicon Docker emulates
via Rosetta — useful as both an "OS variation"
and "amd64 emulation" data point.
Both reproduce the binlog_order_commits=ON visibility race in seconds
on Apple Silicon. Combined with the existing matrix data, this rules
out version-specificity AND container-userspace-specificity as the
cause: the only constant across reproducing cases is "running on a
Linux kernel."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds `cmd/binlog-visibility-repro`, a stand-alone reproducer for what looks like an upstream MySQL contract violation:
The tool is a single `main.go` that uses only `database/sql` + `go-mysql-org/go-mysql/replication`. No spirit components. That's the point: when this fails, it can't be attributed to anything in `pkg/repl`, the delta map, the chunker, the applier, or any other spirit wrapper. The failure attributes to MySQL itself.
This is the smoking gun behind the second bug in #746: spirit's `deltaMap.Flush` periodically sees `affected_rows < num_keys` from `REPLACE INTO _new ... SELECT FROM original WHERE id IN (...)`, meaning a key was in the delta map (so the binlog streamer had already delivered an event for it) but the row wasn't visible to a separate-connection SELECT moments later.
How it works
- Producers loop `BEGIN; INSERT; UPDATE; COMMIT;` on the table.
- The streamer, for each `WriteRowsEvent` on the test table, immediately runs `SELECT id FROM <table> WHERE id = ?` on a separate connection.
- On a miss, it waits `-recheck-delay` (default 100 ms), re-runs the SELECT, and classifies the miss as either permanent (still not visible) or delayed (now visible). Both are violations of the contract; "delayed" is the soft form, "permanent" the hard form.
- Exits `1` if a miss was seen, `0` if `-max-duration` elapses cleanly, `2` for an infrastructure problem.
Bisect: what triggers the race
Local Docker MySQL 8.0.45 (no `-race`), short runs (~16 k commits / 5 s):
- `INSERT … VALUES (…)` (autocommit): 5/5 PASS
- `BEGIN; INSERT; COMMIT;` (explicit txn, single stmt): 5/5 PASS
- `BEGIN; INSERT; UPDATE; COMMIT;`: ~2/3 FAIL
The minimum producer shape that surfaces the race is a multi-statement transaction that INSERTs and then UPDATEs the same row before COMMIT. Single-statement transactions, with or without explicit BEGIN/COMMIT, do not reproduce in the same time budget.
How to run it
A `repro` service has been added to compose/compose.yml and compose/replication-tls/replication-ci.yml. It builds the binary and runs it against the local `mysql` service. Use the exact same `docker compose` commands the existing matrix uses, just targeting `mysql repro` instead of `mysql test`:
`--exit-code-from repro` propagates the binary's exit code to the shell, so a failure shows up as exit 1.
For a faster iteration loop (shorter max-duration, more producers, etc.), run the binary directly:
Flags:
- `-dsn`: default `tsandbox:msandbox@tcp(127.0.0.1:8033)/`
- `-max-duration`: default `5m`
- `-recheck-delay`: default `100ms`
- `-producers`: default `8`
Findings so far
- `mysql:8.0.43` (Oracle Linux 9 base) and `mysql:8.0.43-debian` (Debian base) both reproduce on the same host, with the same MySQL version. Same kernel, different userspace: same bug.
Why this isn't in CI
The matrix workflow that ran this on every push is intentionally not included. The test is designed to fail under the bug we're hunting; gating PRs on that would be silly. The repro lives as a stand-alone artefact: anyone can run it on demand against any MySQL with the commands above.
Related
- `binlog_order_commits=ON` preflight check (this PR's hypothesis depends on that being honoured)
🤖 Generated with Claude Code