[Umbrella] CDC Connector Design #3175
Background
Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in real-time to a downstream process or system.
CDC is mainly implemented in two ways: query-based and binlog-based.
We know that MySQL has the binlog (binary log) to record users' changes to the database, so one of the simplest and most efficient CDC implementations can be built on the binlog. Of course, there are already many open-source MySQL CDC implementations that work out of the box. The binlog is not the only way to implement CDC (at least for MySQL); even database triggers can provide similar functionality, but they fall short in efficiency and in their impact on the database.
Typically, after a CDC system captures changes to a database, it publishes the change events to a message queue for consumers. Debezium, for example, persists MySQL changes (it also supports PostgreSQL, MongoDB, etc.) to Kafka; by subscribing to the events in Kafka, we can get the change content and implement the functionality we need.
As a data-synchronization tool, I think SeaTunnel needs to support CDC as a feature, and I want to hear from you all how it could be implemented.
Motivation
This design does not include
Overall Design
Basic flow
The CDC base process contains:
Snapshot phase
- The enumerator generates multiple `SnapshotSplit`s of a table and assigns them to the readers.
- When a `SnapshotSplit` read is completed, the reader reports the split's high watermark to the enumerator.
- When all `SnapshotSplit`s have reported their high watermarks, the enumerator enters the incremental phase.

Snapshot phase - SnapshotSplit read flow
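The snapshot-phase coordination above can be sketched as follows. This is a hypothetical illustration; `SnapshotSplit`, `Enumerator`, and all method names here are illustrative stand-ins, not SeaTunnel's actual API.

```python
# Illustrative sketch of the snapshot-phase flow (not SeaTunnel's real API).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SnapshotSplit:
    table: str
    start_key: int   # primary-key range [start_key, end_key) owned by this split
    end_key: int

@dataclass
class Enumerator:
    pending: list = field(default_factory=list)     # splits awaiting assignment
    watermarks: dict = field(default_factory=dict)  # split -> reported high watermark
    total: int = 0

    def generate_splits(self, table, max_key, chunk_size):
        # chunk a table's key space into multiple SnapshotSplits
        for lo in range(0, max_key, chunk_size):
            self.pending.append(SnapshotSplit(table, lo, min(lo + chunk_size, max_key)))
        self.total = len(self.pending)

    def assign(self):
        # hand the next split to a requesting reader
        return self.pending.pop(0) if self.pending else None

    def report_high_watermark(self, split, offset):
        self.watermarks[split] = offset

    def snapshot_finished(self):
        # all splits reported -> the enumerator enters the incremental phase
        return self.total > 0 and len(self.watermarks) == self.total

enum = Enumerator()
enum.generate_splits("orders", max_key=100, chunk_size=40)  # splits: [0,40), [40,80), [80,100)
while (split := enum.assign()) is not None:
    enum.report_high_watermark(split, offset=7)  # readers would report real log offsets
# enum.snapshot_finished() is now True
```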
There are 4 steps:
1. Log low watermark: get the current log offset before reading the snapshot data.
2. Read `SnapshotSplit` data: read the range of data belonging to the split.
   - Case 1 (steps 1 & 2 are separate):
     - Exactly-once: use a memory table to hold the history data and filter the log data between the low and high watermarks.
     - At-least-once: output the data directly and use the low watermark instead of the high watermark.
   - Case 2: steps 1 & 2 can be made atomic (e.g. Oracle).
3. Log high watermark:
   - Step 2 case 1 & exactly-once: get the current log offset after reading the snapshot data.
   - Otherwise: use the low watermark instead of the high watermark.
4. If the high watermark > the low watermark, read the range log data.
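A minimal sketch of the four steps, assuming a fake log client and a query callback in place of a real binlog reader and a JDBC range query (all names here are illustrative):

```python
# Hedged sketch of the 4-step SnapshotSplit read; FakeLog and the query
# callback are stand-ins for a real log client and database range query.
class FakeLog:
    def __init__(self):
        self.events = []                       # (offset, key, value) change events
    def current_offset(self):
        return len(self.events)
    def append(self, key, value):
        self.events.append((len(self.events), key, value))
    def read_range(self, low, high):
        return [(k, v) for off, k, v in self.events if low <= off < high]

def read_snapshot_split(query, log, exactly_once=True):
    low = log.current_offset()                 # step 1: log low watermark
    rows = query()                             # step 2: read the split's range data
    high = log.current_offset() if exactly_once else low  # step 3: log high watermark
    if high > low:                             # step 4: replay the range log data
        for key, value in log.read_range(low, high):
            rows[key] = value                  # change events win over snapshot rows
    return rows, high

log = FakeLog()
def query():
    rows = {1: "a", 2: "b"}
    log.append(2, "b2")        # simulate a write landing mid-snapshot
    return rows

rows, high = read_snapshot_split(query, log)
# rows == {1: "a", 2: "b2"}: the concurrent change was replayed onto the snapshot
```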
Snapshot phase - MySQL Snapshot Read & Exactly-once
Because we cannot determine where, between the low and high watermarks, each query statement was executed, in order to guarantee exactly-once semantics we need to use a memory table to temporarily hold the data.
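The memory-table idea can be illustrated as below: snapshot rows are buffered by primary key, then the change events filtered to the low-to-high watermark window are applied on top before anything is emitted. The `op`/`id` event shape is an assumption for illustration, not the connector's actual event format.

```python
# Illustrative memory table for exactly-once snapshot reads (event shape assumed).
def emit_exactly_once(snapshot_rows, change_events):
    memory_table = {row["id"]: row for row in snapshot_rows}  # buffer keyed by PK
    for event in change_events:  # events already filtered to the watermark window
        if event["op"] == "delete":
            memory_table.pop(event["id"], None)   # delete removes the buffered row
        else:
            memory_table[event["id"]] = event["row"]  # insert/update both upsert
    return list(memory_table.values())            # emit the reconciled rows

rows = emit_exactly_once(
    [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    [{"op": "update", "id": 2, "row": {"id": 2, "v": "b2"}},
     {"op": "delete", "id": 1}],
)
# rows == [{"id": 2, "v": "b2"}]
```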
Incremental phase
When all snapshot splits have reported their watermarks, the incremental phase starts.
Combine all snapshot splits and their watermark information to get `LogSplit`s. We want to minimize the number of log connections:
- Exactly-Once: filter the log data using the watermarks recorded in `completedSnapshotSplitInfos`.
- At-Least-Once: no data filtering, and `completedSnapshotSplitInfos` doesn't need to hold any data.
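One way to picture this combination: start the single log connection from the minimum high watermark, and for exactly-once, skip events that a split's snapshot read already covered. `completedSnapshotSplitInfos` is the name used in this issue; the tuple layout and the `should_emit` helper are assumptions for illustration.

```python
# Sketch: merge per-split watermarks into one LogSplit (one log connection),
# filtering already-emitted events in the exactly-once case. Data layout assumed.
def build_log_split(completed_snapshot_split_infos):
    # completed_snapshot_split_infos: list of ((start_key, end_key), high_watermark)
    start = min(wm for _, wm in completed_snapshot_split_infos)  # earliest needed offset
    return {"start_offset": start, "completed_infos": completed_snapshot_split_infos}

def should_emit(event_offset, event_key, log_split, exactly_once=True):
    if not exactly_once:
        return True  # at-least-once: no filtering, completed infos hold no data
    for (lo, hi), watermark in log_split["completed_infos"]:
        if lo <= event_key < hi:
            # events below the split's high watermark were already emitted
            # during that split's snapshot read
            return event_offset >= watermark
    return True

split = build_log_split([((0, 50), 10), ((50, 100), 14)])
# split["start_offset"] == 10; an event at offset 12 for key 60 is filtered out
```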
Dynamic discovery of new tables
Case 1: When a new table is discovered while the enumerator is in the snapshot phase, it directly assigns a new split.
Case 2: When a new table is discovered while the enumerator is in the incremental phase (dynamic discovery of new tables in the incremental phase):
Advantage: new tables can be added without restarting the job.
Disadvantage: it may delay data from existing tables. (How can we minimize the impact?)
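A hypothetical dispatch for the two cases might look like this. Pausing the shared log reader while the new table is snapshotted is one possible design that explains the delay the disadvantage mentions; the issue does not commit to this mechanism, and all names here are illustrative.

```python
# Hypothetical new-table dispatch; the pause-the-log-reader strategy in
# case 2 is an assumed design, not one this issue specifies.
class CdcEnumerator:
    def __init__(self):
        self.phase = "snapshot"
        self.pending_splits = []
        self.log_reader_paused = False

    def on_table_discovered(self, table):
        if self.phase == "snapshot":
            # case 1: still in the snapshot phase, just assign a new split
            self.pending_splits.append(("snapshot", table))
        else:
            # case 2: incremental phase; snapshot the new table first, which
            # may delay log consumption for existing tables
            self.log_reader_paused = True
            self.pending_splits.append(("snapshot", table))

enum = CdcEnumerator()
enum.on_table_discovered("orders")   # case 1: enqueued without pausing anything
enum.phase = "incremental"
enum.on_table_discovered("users")    # case 2: pauses the shared log reader
```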
Multiple structured tables
Advantages: takes up fewer database connections, reducing pressure on the database.
Disadvantage: in the SeaTunnel engine, multiple tables will share one pipeline, so the granularity of fault tolerance becomes larger.
This feature expects the source to support reading multiple tables with different structures, then use side-stream output to stay consistent with the single-table stream.
Also, since this will involve changes to the DAG and translation modules, I also expect support for defining partitioners (hash and forward).
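The side-stream idea can be sketched as splitting one multi-table source stream by table name, so each table behaves like a single-table stream downstream, with the two requested partitioners alongside. All names here are illustrative assumptions.

```python
# Sketch of side-stream output for a multi-table source, plus the hash and
# forward partitioners the text asks for (names are illustrative).
from collections import defaultdict

def side_outputs(mixed_stream):
    # route each (table, key, payload) record to its table's own stream
    per_table = defaultdict(list)
    for table, key, payload in mixed_stream:
        per_table[table].append((key, payload))
    return per_table

def hash_partition(key, parallelism):
    return hash(key) % parallelism  # same key -> same subtask, keeps per-key order

def forward_partition(current_subtask, parallelism):
    return current_subtask          # forward: stay on the producing subtask

streams = side_outputs([("orders", 1, "a"), ("users", 9, "x"), ("orders", 2, "b")])
# streams["orders"] == [(1, "a"), (2, "b")]; streams["users"] == [(9, "x")]
```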
Some features have already been implemented; see #2490.
Task list