[Umbrella] CDC Connector Design #3175

Closed
9 tasks done
ashulin opened this issue Oct 24, 2022 · 7 comments

Comments

@ashulin
Member

ashulin commented Oct 24, 2022


Background

Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in real time to a downstream process or system.

CDC implementations mainly fall into two categories: query-based and log-based (e.g., binlog-based).
We know that MySQL records the user's changes to the database in its binlog (binary log), so one of the simplest and most efficient CDC implementations reads the binlog; indeed, many open-source MySQL CDC implementations already work this way out of the box. The binlog is not the only way to implement CDC (at least for MySQL): even database triggers can provide similar functionality, but they fall short in terms of efficiency and impact on the database.

Typically, after a CDC system captures changes from a database, it publishes the change events to a message queue for consumers. Debezium, for example, persists MySQL changes (it also supports PostgreSQL, MongoDB, etc.) to Kafka; by subscribing to the events in Kafka, we can obtain the change content and build the functionality we need.

As a data synchronization tool, I think SeaTunnel needs to support CDC as a feature, and I would like to hear from you all how it could be implemented.

Motivation

  • Support parallel reading of historical data (fast synchronization of billion-row tables)
  • Support reading incremental data (CDC)
  • Support heartbeat detection (metrics, low-traffic tables)
  • Support dynamically adding new tables (easier to operate and maintain)
  • Support multiple tables and sharding (easy configuration)

This design does not include

  • Schema evolution (DDL)

Overall Design

Basic flow

[figure: all-phase]

The basic CDC flow contains two phases (a sketch follows the list):

  1. Snapshot phase: read the historical data of the table
    • Minimum split granularity: a primary-key range of a table
  2. Incremental phase: read the incremental change-log data of the table
    • Minimum split granularity: a table
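
For illustration, the two phases could be modeled as an explicit state in the enumerator; this is a hypothetical sketch, not part of the proposed API:

//  pseudo-code (hypothetical sketch).
public enum CdcPhase {
    SNAPSHOT,    // assign SnapshotSplits: primary-key ranges of individual tables
    INCREMENTAL  // assign LogSplits: whole tables read from the change log
}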

Snapshot phase

[figure: snapshot-phase]

The enumerator generates multiple SnapshotSplits for a table and assigns them to the readers.

//  pseudo-code.
public class SnapshotSplit implements SourceSplit {
    private final String splitId;
    private final TableId tableId;               // table this split belongs to
    private final SeaTunnelRowType splitKeyType; // type of the split key (typically the primary key)
    private final Object splitStart;             // lower bound of the split's key range
    private final Object splitEnd;               // upper bound of the split's key range
}

When a SnapshotSplit has been fully read, the reader reports the split's high watermark to the enumerator.
When all SnapshotSplits have reported their high watermarks, the enumerator enters the incremental phase.

//  pseudo-code.
public class CompletedSnapshotSplitReportEvent implements SourceEvent {
    private final String splitId;
    private final Offset highWatermark; // log offset recorded after the split's snapshot read finished
}

Snapshot phase - SnapshotSplit read flow

[figure: snapshot-read]

There are four steps (a sketch of the flow follows the list):

  1. Log the low watermark: get the current log offset before reading the snapshot data.

  2. Read the SnapshotSplit data: read the range data belonging to the split.

    • Case 1: steps 1 & 2 cannot be made atomic (MySQL)

    Because we cannot take a table lock, and cannot lock the interval based on the low watermark, steps 1 & 2 are not atomic.

    • Exactly-once: use a memory table to hold the historical data and filter the log data between the low and high watermarks.

    • At-least-once: output the data directly and use the low watermark instead of the high watermark.

    • Case 2: steps 1 & 2 can be made atomic (Oracle)

    We can use a flashback query (AS OF SCN) to ensure the atomicity of the two steps.

    • Exactly-once: output the data directly and use the low watermark instead of the high watermark.
  3. Log the high watermark:

    • Step 2 case 1 & exactly-once: get the current log offset after reading the snapshot data.
    • Otherwise: use the low watermark as the high watermark.
  4. If the high watermark > low watermark, read the log data in that range.
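
A minimal sketch of this flow for the non-atomic, exactly-once case; the helpers currentLogOffset, queryRange, and mergeLogRange, and the Offset.isAfter comparison, are hypothetical:

//  pseudo-code (hypothetical helpers).
public List<SeaTunnelRow> readSnapshotSplit(SnapshotSplit split) {
    Offset low = currentLogOffset();              // step 1: log the low watermark
    List<SeaTunnelRow> rows = queryRange(split);  // step 2: read the split's key range
    Offset high = currentLogOffset();             // step 3: log the high watermark
    if (high.isAfter(low)) {                      // step 4: changes happened during the read
        mergeLogRange(rows, split, low, high);    // replay the log between the watermarks
    }
    return rows;
}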

Snapshot phase - MySQL Snapshot Read & Exactly-once

[figure: mysql-snapshot-read]

Because we cannot determine at which point between the low and high watermarks the query statement was executed, we need a memory table to temporarily buffer the data in order to guarantee exactly-once semantics (a sketch of the merge follows the list below).

  1. Log the low watermark: get the current log offset before reading the snapshot data.
  2. Read the SnapshotSplit data: read the range data belonging to the split and write it to the memory table.
  3. Log the high watermark: get the current log offset after reading the snapshot data.
  4. Read the range log data: read the log data between the watermarks and merge it into the memory table.
  5. Output the contents of the memory table and release the memory.
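
A sketch of the memory-table merge; the ChangeEvent type and the helper methods are hypothetical:

//  pseudo-code (hypothetical types and helpers).
Map<Object, SeaTunnelRow> memoryTable = new LinkedHashMap<>();
for (SeaTunnelRow row : snapshotRows) {                       // step 2
    memoryTable.put(primaryKeyOf(row), row);
}
for (ChangeEvent event : readLogRange(low, high)) {           // step 4
    if (!withinSplitRange(event, split)) {
        continue;                                             // keep only keys inside this split
    }
    if (event.isDelete()) {
        memoryTable.remove(primaryKeyOf(event));
    } else {
        memoryTable.put(primaryKeyOf(event), event.toRow());  // latest insert/update wins
    }
}
memoryTable.values().forEach(output::collect);                // step 5: emit, then free the memory
memoryTable.clear();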

Incremental phase

[figure: incremental-phase]

When all snapshot splits have reported their watermarks, the incremental phase starts.

All snapshot splits and their watermark information are combined to form LogSplits.

We want to minimize the number of log connections:

  • In the incremental phase, only one reader works by default; the user can configure an option to specify the number (which cannot exceed the number of readers).
  • A reader holds at most one log connection.
//  pseudo-code.
public class LogSplit implements SourceSplit {
    private final String splitId;
    /** All the tables that this log split needs to capture. */
    private final List<TableId> tableIds;
    /** The minimum watermark across the SnapshotSplits of all tables in this LogSplit. */
    private final Offset startingOffset;
    /** Obtained from configuration; may be unbounded (the split never ends). */
    private final Offset endingOffset;
    /**
     * SnapshotSplit information for all tables in this LogSplit.
     * <br/> Used to support exactly-once.
     */
    private final List<CompletedSnapshotSplitInfo> completedSnapshotSplitInfos;
    /**
     * The maximum watermark among each table's SnapshotSplits.
     * <br/> Used to delete entries from completedSnapshotSplitInfos, reducing the state size.
     * <br/> Used to support exactly-once.
     */
    private final Map<TableId, Offset> tableWatermarks;
}
//  pseudo-code.
public class CompletedSnapshotSplitInfo implements Serializable {
    private final String splitId;
    private final TableId tableId;
    private final SeaTunnelRowType splitKeyType;
    private final Object splitStart;
    private final Object splitEnd;
    private final Offset watermark; // high watermark reported when the split's snapshot read finished
}

[figure: incremental-read]

Exactly-once:

  • Phase 1: use completedSnapshotSplitInfos to filter out log data that precedes each split's watermark (see the sketch below).
  • Phase 2: once a table no longer needs filtering, delete that table's entries from completedSnapshotSplitInfos, because all of its subsequent data needs to be processed.

At-least-once: do not filter the data; completedSnapshotSplitInfos does not need to hold any data.
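
A sketch of the phase-1 filter; the getters and the withinRange helper are hypothetical:

//  pseudo-code (hypothetical helpers).
private boolean shouldEmit(ChangeEvent event) {
    for (CompletedSnapshotSplitInfo info : completedSnapshotSplitInfos) {
        if (info.getTableId().equals(event.getTableId())
                && withinRange(event.getKey(), info.getSplitStart(), info.getSplitEnd())) {
            // the snapshot already covered this key up to the split's watermark
            return event.getOffset().isAfter(info.getWatermark());
        }
    }
    // phase 2: the table's entries were deleted from completedSnapshotSplitInfos,
    // so its records pass through unfiltered
    return true;
}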

Dynamic discovery of new tables

add-table

Case 1: a new table is discovered while the enumerator is in the snapshot phase; the enumerator simply assigns new splits.

Case 2: a new table is discovered while the enumerator is in the incremental phase.

Dynamic discovery of new tables in the incremental phase:

Advantage: new tables can be added without restarting the job.
Disadvantage: data of existing tables may be delayed. (How do we minimize the impact?) A sketch of the enumerator's side of this flow follows the list below.

  1. Suspend the LogSplit reader. (Do we still need to suspend the reader if there is an idle reader?)
  2. The reader performs the suspend operation.
  3. The reader reports its current log offset. (If it is not reported, the reader needs to support merging LogSplits.)
  4. Assign SnapshotSplits to the reader.
  5. The reader performs the snapshot-phase read.
  6. The reader reports the watermarks of all SnapshotSplits.
  7. Assign a new LogSplit to the reader.
  8. The reader resumes incremental reading and ACKs to the enumerator.
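
A sketch of the enumerator's side; the event and method names are hypothetical:

//  pseudo-code (hypothetical names).
private void onNewTableDiscovered(TableId newTable) {
    if (inSnapshotPhase) {
        assignSnapshotSplits(newTable);                          // case 1: just assign new splits
        return;
    }
    sendEventToReader(logReader, new SuspendLogReaderEvent());   // step 1
    // steps 2-3: the reader suspends itself and reports its current log offset
    assignSnapshotSplits(newTable);                              // step 4
    // steps 5-6: the reader reads the snapshot and reports the split watermarks
    assignLogSplit(buildLogSplitIncluding(newTable));            // step 7
    // step 8: the reader resumes incremental reading and ACKs to the enumerator
}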

Multiple structured tables

Advantages: occupies fewer database connections, reducing pressure on the database.
Disadvantage: in the SeaTunnel engine, multiple tables will share one pipeline, so the granularity of fault tolerance becomes larger.

[figure: multi-table DAG]

This feature expects the source to support reading multiple tables with different structures, and then use side-stream outputs so that each output stream behaves like a single-table stream (see the sketch below).

Also, since this involves changes to the DAG and translation modules, I also expect support for defining partitioners (hash and forward).
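
As a rough illustration of the side-stream idea (this is not SeaTunnel's actual API, just a conceptual sketch): rows from a multi-table source carry their table id, and a router dispatches each row to its per-table downstream, so that each downstream looks like a single-table stream.

//  conceptual sketch, not SeaTunnel's actual API.
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class TableRouter<R> {
    private final Map<String, Consumer<R>> sideOutputs = new HashMap<>();

    // register the downstream (single-table) stream for a table id
    public void register(String tableId, Consumer<R> downstream) {
        sideOutputs.put(tableId, downstream);
    }

    // dispatch a row of the multi-table stream to its table's side output
    public void route(String tableId, R row) {
        Consumer<R> out = sideOutputs.get(tableId);
        if (out != null) {
            out.accept(row);
        }
    }
}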

Some features have already been implemented; see #2490.

Task list

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@ashulin ashulin changed the title [Umbrella] CDC Connector Design [WIP][Umbrella] CDC Connector Design Oct 24, 2022
@hailin0
Member

hailin0 commented Oct 24, 2022

+1
great

@ashulin ashulin changed the title [WIP][Umbrella] CDC Connector Design [Umbrella] CDC Connector Design Oct 25, 2022
@yuangjiang

+1
great

@laglangyue
Contributor

+1 nice

@baisui1981

+1
great

@github-actions

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in the next 7 days if no further activity occurs.

@github-actions github-actions bot added the stale label Dec 25, 2022
@github-actions

github-actions bot commented Jan 3, 2023

This issue has been closed because it has not received a response for a long time. You can reopen it if you encounter similar problems in the future.

@calipeng

calipeng commented Jun 2, 2023

I want to ask a question:
CDC synchronizes data through a data log, so what is the data log of ClickHouse?
