feat: merge-DML batching optimization for binlog apply #1687
Open
dnovitski wants to merge 1 commit into
Add --is-merge-dml-event flag that batches and deduplicates binlog DML events before applying them to the ghost table, significantly reducing SQL round-trips during high-write migrations.

When enabled and the unique key is memory-comparable (numeric columns):
- Deduplicates DML events by unique key (latest event wins)
- Reduces INSERT+DELETE sequences to DELETE (safe against row-copy races)
- Batches INSERTs/UPDATEs as multi-row REPLACE INTO
- Batches DELETEs as DELETE WHERE (pk) IN (...)
- Skips events beyond migration range (not yet copied by row-copy)
- Disables merge for tables with secondary unique indexes

Safety: strict numeric type validation in formatNumericValue prevents SQL injection. Type detection uses exact base-type parsing (not substring). Uses BuildColumnsPreparedValues for proper per-column conversion tokens.

Original implementation by shaohoukun in PR github#1378, adapted to current master's builder-pattern API with correctness and security hardening.

Co-authored-by: shaohk <shaohoukun@meituan.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
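To make the dedup and batching rules in the commit message concrete, here is a minimal sketch. It is not the code from this PR: the types dmlKind and dmlEvent, the helpers mergeEvents and buildDelete, and the table name ghost_tbl are hypothetical, it assumes a single-column integer unique key, and formatNumericValue here only borrows the name from the commit message.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// dmlKind and dmlEvent are hypothetical stand-ins, not gh-ost's own types.
type dmlKind int

const (
	insertDML dmlKind = iota
	updateDML
	deleteDML
)

type dmlEvent struct {
	kind dmlKind
	key  interface{} // value of a single-column unique key (assumption)
}

// formatNumericValue renders integer values as SQL literals and rejects
// every other type, so a string can never be interpolated into a statement.
func formatNumericValue(v interface{}) (string, error) {
	switch n := v.(type) {
	case int:
		return strconv.Itoa(n), nil
	case int64:
		return strconv.FormatInt(n, 10), nil
	case uint64:
		return strconv.FormatUint(n, 10), nil
	default:
		return "", fmt.Errorf("non-numeric key value of type %T", v)
	}
}

// mergeEvents keeps only the latest event per key. An INSERT followed by a
// DELETE therefore ends up as a DELETE rather than cancelling out.
// Keys are returned in first-seen order so the output is deterministic.
func mergeEvents(events []dmlEvent) (replaceKeys, deleteKeys []interface{}) {
	latest := make(map[interface{}]dmlKind)
	var order []interface{}
	for _, ev := range events {
		if _, seen := latest[ev.key]; !seen {
			order = append(order, ev.key)
		}
		latest[ev.key] = ev.kind
	}
	for _, k := range order {
		if latest[k] == deleteDML {
			deleteKeys = append(deleteKeys, k)
		} else {
			replaceKeys = append(replaceKeys, k) // would feed one multi-row REPLACE INTO
		}
	}
	return replaceKeys, deleteKeys
}

// buildDelete builds one batched DELETE for all deleted keys.
func buildDelete(table, keyColumn string, keys []interface{}) (string, error) {
	literals := make([]string, 0, len(keys))
	for _, k := range keys {
		lit, err := formatNumericValue(k)
		if err != nil {
			return "", err
		}
		literals = append(literals, lit)
	}
	return fmt.Sprintf("DELETE FROM %s WHERE (%s) IN (%s)",
		table, keyColumn, strings.Join(literals, ",")), nil
}

func main() {
	events := []dmlEvent{
		{insertDML, 1}, {updateDML, 1}, {insertDML, 2}, {deleteDML, 2}, {deleteDML, 3},
	}
	replaceKeys, deleteKeys := mergeEvents(events)
	stmt, err := buildDelete("ghost_tbl", "id", deleteKeys)
	fmt.Println(replaceKeys, stmt, err)
}
```

In this sketch the five input events collapse to one key for the REPLACE batch and one DELETE statement covering two keys. Passing a string key to buildDelete returns an error instead of producing SQL.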
Related issue: #1378
script/cibuild returns with no formatting errors, build errors or unit test errors.
Summary
Port of #1378 (by @shaohk) rebased onto current master with API adaptation, correctness hardening, and comprehensive tests.
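One piece of the correctness hardening is type detection: a substring test such as strings.Contains("int") also matches "point". The sketch below contrasts the two approaches. The helper names naiveIsInteger, baseType and isIntegerType are hypothetical, and the list of accepted MySQL integer types is an assumption rather than the exact set this PR accepts.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveIsInteger mimics a substring check; it wrongly accepts "point".
func naiveIsInteger(columnType string) bool {
	return strings.Contains(strings.ToLower(columnType), "int")
}

// baseType extracts the leading type name from a MySQL column type string,
// for example "bigint(20) unsigned" becomes "bigint". Hypothetical helper.
func baseType(columnType string) string {
	t := strings.ToLower(strings.TrimSpace(columnType))
	if i := strings.IndexAny(t, "( "); i >= 0 {
		t = t[:i]
	}
	return t
}

// isIntegerType switches on the exact base type instead of a substring.
// The accepted list is an assumption made for this illustration.
func isIntegerType(columnType string) bool {
	switch baseType(columnType) {
	case "tinyint", "smallint", "mediumint", "int", "integer", "bigint":
		return true
	default:
		return false
	}
}

func main() {
	for _, ct := range []string{"bigint(20) unsigned", "INT", "point", "multipoint"} {
		fmt.Printf("%-20s naive=%v exact=%v\n", ct, naiveIsInteger(ct), isIntegerType(ct))
	}
}
```

The exact switch costs a small parsing step but fails closed: an unknown or spatial type is treated as non-numeric, so merging would simply stay disabled for it.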
What it does
When --is-merge-dml-event is enabled, gh-ost batches binlog DML events instead of applying them one-by-one:
- Collects up to --dml-batch-size events, then emits at most 3 SQL statements (one batched REPLACE for inserts/updates, one batched DELETE, one for key-changing updates)
Changes from original PR
- Adapted to current master's builder-pattern API (NewDMLDeleteQueryBuilder etc.)
- formatNumericValue type-switch (rejects non-numeric types, prevents SQL injection via interpolated DELETE values)
- Type detection: strings.Contains("int") matching "point" → exact base-type switch
- [...] dmlEvent.DML to avoid mutation
Benchmark Results
Setup: MySQL 8.0 (Docker), 100K-row table, 8 concurrent writer threads generating ~2900 events/s (60% UPDATE, 25% INSERT, 15% DELETE), --dml-batch-size=100.
Wall-clock is identical because this workload is row-copy bound (backlog stays at 0–1/1000). The optimization delivers:
The wall-clock improvement manifests on large tables (millions of rows, hours-long migrations) where DML apply becomes the bottleneck and backlog grows. The query reduction directly translates to less lock contention, less binlog volume, and lower replication lag on the target.
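As a rough back-of-the-envelope check on the query reduction (not a measured result), the "at most 3 SQL statements" per batch gives an upper bound on statements issued while merging is active. The function name maxMergedStatements is hypothetical, and the comparison assumes unmerged apply issues one statement per event.

```go
package main

import "fmt"

// maxMergedStatements returns an upper bound on the statements issued for
// `events` events when every batch of `batchSize` events emits at most
// `perBatch` statements.
func maxMergedStatements(events, batchSize, perBatch int) int {
	batches := (events + batchSize - 1) / batchSize // ceiling division
	return batches * perBatch
}

func main() {
	// 1000 events, --dml-batch-size=100, at most 3 statements per batch.
	fmt.Println(maxMergedStatements(1000, 100, 3)) // prints 30
}
```

Under these assumptions, 1000 events need at most 30 statements with merging, against 1000 when each event is applied with its own statement.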
Go-level microbenchmark (query count per 1000 events):
Usage
Auto-disabled (with log warning) when:
Credit
Based on the work by @shaohk in #1378. Co-authored-by trailer preserved in commit.