
DBZ-55 Corrected filtering of DDL statements based upon affected database #49

Merged
merged 1 commit into debezium:master from dbz-55 on May 24, 2016

Conversation

@rhauch (Member) commented May 23, 2016

Previously, DDL statements were filtered and recorded based upon the name of the database that appeared in the binlog. That database name, however, is actually the database to which the client submitting the operation was connected, and is not necessarily the database affected by the operation (e.g., when a statement uses a fully-qualified table name that refers to a different database).

With these changes, the table/database affected by each DDL statement is now used to filter the recording of the statements. The order of the DDL statements in the binlog is still maintained, but since each DDL statement can apply to a different database, the DDL statements are batched (in the same original order) by affected database. For example, two statements affecting db1 will get batched together into one schema change record, followed by one statement affecting db2 as a second schema change record, followed by another statement affecting db1 as a third schema change record. Of course, if db2 is excluded by the connector's configuration, then that second schema change record would not be written.
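For illustration only, here is a minimal sketch of this batching and filtering, assuming a simple in-memory representation (the class and method names are made up, not the connector's actual code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Sketch only: group consecutive DDL statements by the database they affect, preserving
// binlog order, and keep only batches whose database passes the connector's filter.
public class SchemaChangeBatchingSketch {

    public static final class SchemaChangeRecord {
        public final String database;
        public final List<String> ddl;
        SchemaChangeRecord(String database, List<String> ddl) {
            this.database = database;
            this.ddl = ddl;
        }
        @Override
        public String toString() {
            return database + ": " + ddl;
        }
    }

    public static List<SchemaChangeRecord> batch(List<String> affectedDatabases,
                                                 List<String> statements,
                                                 Predicate<String> databaseFilter) {
        List<SchemaChangeRecord> records = new ArrayList<>();
        String currentDb = null;
        List<String> current = new ArrayList<>();
        for (int i = 0; i < statements.size(); i++) {
            String db = affectedDatabases.get(i);
            if (!db.equals(currentDb)) {
                // Database changed: emit the previous batch (if its database is included).
                if (currentDb != null && databaseFilter.test(currentDb)) {
                    records.add(new SchemaChangeRecord(currentDb, new ArrayList<>(current)));
                }
                currentDb = db;
                current.clear();
            }
            current.add(statements.get(i));
        }
        if (currentDb != null && databaseFilter.test(currentDb)) {
            records.add(new SchemaChangeRecord(currentDb, new ArrayList<>(current)));
        }
        return records;
    }

    public static void main(String[] args) {
        // The example from the description: statements affecting db1, db1, db2, db1 in binlog order.
        List<String> dbs = Arrays.asList("db1", "db1", "db2", "db1");
        List<String> ddl = Arrays.asList(
                "CREATE TABLE db1.a (id INT)",
                "CREATE TABLE db1.b (id INT)",
                "CREATE TABLE db2.c (id INT)",
                "ALTER TABLE db1.a ADD COLUMN name VARCHAR(255)");
        // Without filtering this yields three records ([db1 x2], [db2 x1], [db1 x1]);
        // with the filter below, the db2 record is dropped entirely.
        System.out.println(batch(dbs, ddl, db -> !db.equals("db2")));
    }
}
```

Running the `main` method prints two records (db1 with the first two statements, then db1 with the last statement); without the filter there would be three, matching the example above.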

Determining the affected database for each DDL statement required changes to the DDL parsing framework. Although a listener mechanism was recently added, this PR adds a reusable listener implementation that accumulates one or more DDL statements and allows the caller (in this case the MySQL connector) to consume the sequences of statements along with the database names to which they apply. Consecutive statements that apply to the same database are grouped/batched together. The MySQL connector uses this to process each QUERY event in the binlog, which may contain one or more DDL statements. The MySQL DDL parser was also enhanced to properly parse and handle CREATE DATABASE, ALTER DATABASE, and DROP DATABASE statements, since the parser needs to identify the affected database for these statements so they can be properly filtered.
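As a rough illustration of the resolution rule only (the real work happens in the DDL parser, not in regular expressions; everything below is a simplified, hypothetical sketch), the affected database can come from the statement itself rather than from the database the client was connected to:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch: resolve the database a DDL statement affects. CREATE/ALTER/DROP
// DATABASE statements name the database directly; a fully-qualified table name overrides
// the connected database; otherwise fall back to the connected database.
// (Multi-table statements, quoting edge cases, etc. are ignored here.)
public class AffectedDatabaseSketch {

    private static final Pattern DATABASE_DDL = Pattern.compile(
            "(?i)^\\s*(?:CREATE|ALTER|DROP)\\s+(?:DATABASE|SCHEMA)\\s+(?:IF\\s+(?:NOT\\s+)?EXISTS\\s+)?`?([^`\\s;]+)`?");
    private static final Pattern QUALIFIED_TABLE = Pattern.compile(
            "(?i)\\bTABLE\\s+(?:IF\\s+(?:NOT\\s+)?EXISTS\\s+)?`?([^`\\s.]+)`?\\.`?[^`\\s.(;]+`?");

    public static String affectedDatabase(String ddl, String connectedDatabase) {
        Matcher db = DATABASE_DDL.matcher(ddl);
        if (db.find()) {
            return db.group(1); // CREATE/ALTER/DROP DATABASE names the database explicitly
        }
        Matcher table = QUALIFIED_TABLE.matcher(ddl);
        if (table.find()) {
            return table.group(1); // fully-qualified table name overrides the connected database
        }
        return connectedDatabase; // unqualified statements apply to the connected database
    }

    public static void main(String[] args) {
        // Issued while connected to db1, but affects db2:
        System.out.println(affectedDatabase("CREATE TABLE db2.orders (id INT)", "db1")); // db2
        // Database-level DDL names its database explicitly:
        System.out.println(affectedDatabase("DROP DATABASE IF EXISTS db3", "db1"));      // db3
        // Unqualified statements still apply to the connected database:
        System.out.println(affectedDatabase("ALTER TABLE orders ADD COLUMN x INT", "db1")); // db1
    }
}
```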

Meanwhile, this change does not affect how the database history records the statements: it still records them exactly as submitted, without regard to filtering, using a single record for each separate binlog QUERY event. In other words, the database history continues to record every DDL statement in the same order as found in the binlog, and all DDL statements found in a single binlog event are written atomically to the history stream. However, this commit does change the order in which the schema change records and the database history are written: the schema change records are now written first and the database history second. Under nominal operation each is written exactly once, but because the database history records are now written after any schema change records, no schema change records are lost upon recovery after a failure (they instead have at-least-once delivery guarantees).
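A minimal sketch of this write ordering for a single binlog QUERY event (the interfaces and method names here are hypothetical, not Debezium's actual API):

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical interfaces, for illustration only.
interface SchemaChangeWriter {
    void write(String database, List<String> ddlStatements); // one record per database batch
}

interface DatabaseHistory {
    void record(String binlogPosition, String allDdlInEvent); // one record per QUERY event
}

// Sketch of the write ordering for a single binlog QUERY event.
public class QueryEventHandler {

    private final SchemaChangeWriter schemaChanges;
    private final DatabaseHistory history;
    private final Predicate<String> databaseFilter;

    public QueryEventHandler(SchemaChangeWriter schemaChanges, DatabaseHistory history,
                             Predicate<String> databaseFilter) {
        this.schemaChanges = schemaChanges;
        this.history = history;
        this.databaseFilter = databaseFilter;
    }

    public void handle(String binlogPosition, String ddlInEvent,
                       List<String> batchDatabases, List<List<String>> batchedStatements) {
        // 1. Emit one schema change record per (filtered) database batch, in binlog order.
        for (int i = 0; i < batchDatabases.size(); i++) {
            if (databaseFilter.test(batchDatabases.get(i))) {
                schemaChanges.write(batchDatabases.get(i), batchedStatements.get(i));
            }
        }
        // 2. Only then record the unfiltered DDL for this event in the database history.
        //    If the connector fails between steps 1 and 2, recovery restarts from the last
        //    recorded history position and the schema change records are re-emitted
        //    (at-least-once) rather than lost.
        history.record(binlogPosition, ddlInEvent);
    }
}
```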

DBZ-55 Corrected filtering of DDL statements based upon affected database

Previously, the DDL statements were being filtered and recorded based upon the name of the database that appeared in the binlog. However, that database name is actually the name of the database to which the client submitting the operation is connected, and is not necessarily the database _affected_ by the operation (e.g., when an operation includes a fully-qualified table name not in the connected-to database).

With these changes, the table/database affected by the DDL statements is now being used to filter the recording of the statements. The order of the DDL statements is still maintained, but since each DDL statement can apply to a separate database, the DDL statements are batched (in the same original order) based upon the affected database. For example, two statements affecting "db1" will get batched together into one schema change record, followed by one statement affecting "db2" as a second schema change record, followed by another statement affecting "db1" as a third schema change record.

Meanwhile, this change does not affect how the database history records the changes: it still records them as submitted using a single record for each separate binlog event/position. This is much safer as each binlog event (with specific position) is written atomically to the history stream. Also, since the database history stream is what the connector uses upon recovery, the database history records are now written _after_ any schema change records to ensure that, upon recovery after failure, no schema change records are lost (and instead have at-least-once delivery guarantees).
@rhauch merged commit 57e6c73 into debezium:master May 24, 2016
@rhauch deleted the dbz-55 branch May 24, 2016 00:46
mikekamornikov pushed a commit to mikekamornikov/debezium that referenced this pull request Apr 30, 2021
DBZ-3452: source.timestamp.mode=commit imposes a significant performance penalty
bdbene pushed a commit to bdbene/debezium that referenced this pull request Jun 23, 2023
* kafka connect 2.6 & dbz 1.3.1

* update jars
@xinbinhuang mentioned this pull request Jun 27, 2023
methodmissing pushed a commit to methodmissing/debezium that referenced this pull request Apr 6, 2024
* kafka connect 2.6 & dbz 1.3.1

* update jars