Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mysql] Fix issue #1944: Fix GTIDs on startup to correctly recover from checkpoint #2220

Merged
merged 2 commits into from
Jun 20, 2023

Conversation

wallkop
Copy link
Contributor

@wallkop wallkop commented Jun 17, 2023

Resolve #1944

Problem analysis

Issue #1944 reports a bug where recovery from checkpoint fails when starting from offset-binlogfile&pos. After conducting further tests, I found that this bug can occur in Earliest, Timestamp, BinlogFile&Pos startup modes, as well as when incomplete Gtids are incorrectly set. The root cause of this bug is that when MySQL has GTID enabled and the aforementioned startup modes are used, the complete GTIDs cannot be recorded in the offsetContext. Instead, an empty GtidSet is initialized and subsequently appended by received Gtid Events. This leads to recovery failure when restoring from checkpoint due to incomplete GTIDs being recorded.

Test premise: MySQL needs to have GTID enabled, and due to settings such as master-slave switching and multiple masters, there are multiple sourceIDs for GTIDs

startupMode How to initialize GTID Restore from a checkpoint
initial Obtain the GTID value through SHOW MASTER STATUS before starting the snapshot
latest Obtain the GTID value through SHOW MASTER STATUS before starting to consume
offset-gtid Use the user-set GTID before starting to consume.Note: The user needs to set the complete GTID, otherwise it will not pass the startup verification
offset-binlogfile&pos Set the consumption anchor of the binlog through binlog_file='xxx', offset=yyy before starting to consume, but the GTID is null after initialization X
earliest Set the consumption anchor of the binlog through binlog_file='', offset=0 before starting to consume, but the GTID is null after initialization X
timestamp Set the consumption anchor of the binlog through binlog_file='', offset=0 before starting to consume, but the GTID is null after initialization. After starting the consumption, all binlog events with timestamps less than the specified timestamp will be skipped until an event with a timestamp greater than or equal to timestamp is found, and the GTID will be recorded at that point X

Solution approach

PoC From @PatrickRen: PatrickRen@2189da3

The solution comes from the POC of @PatrickRen. I have carefully reviewed this POC and conducted many tests. In the end, I believe this solution is viable.

The POC implementation is very elegant, and I only supplemented some comments and added a few test cases. The modified code has been validated using the company's production environment, and it has proven to perform very well under the classic master-slave architecture, correctly recovering from various start-mode checkpoints.

However, there are still two potential risks:

  1. When MySQL is under a dual-master architecture, GTIDs may have gaps, similar to A:1-102, 105-150. Such gaps are temporary but will eventually be consistent. But if the CDC happens to recover from the checkpoint when there are gaps in MySQL's GTID, it may access non-existent transactions, leading to recovery failure. This is because our GtidUtils' fixRestoredGtidSet method does not fix the gaps in the server GTID. I initially wanted to optimize this issue, but after weighing it, I believe that in most cases, the occurrence of GTID gaps is a problem with MySQL itself, which should be fixed by the DBA in MySQL, not to be compatible with CDC, to avoid triggering other unpredictable issues.

  2. The current implementation will correct all GTIDs, including those set by the user during the initial startup of CDC via StartupOptions. This may cause a problem: the user manually sets the GTID-offset to A:300-500, the original intention might be to expect CDC to consume data A:1-299:501~xxx, but CDC corrects it to A:1-500, only starting consumption from 500. Solving this problem is not difficult, the key is whether it is necessary to solve it. I finally convinced myself to form a unified specification and inform the users: CDC only cares about the maximum GTID position and starts from it. This standard will make our program more user-friendly, but to some extent, it will violate MySQL's replication protocol.

It's worth mentioning that I'm just raising the above two risks for discussion, and I don't suggest immediately solving and fixing them, as the current implementation is already very well done.

Copy link
Contributor

@PatrickRen PatrickRen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wallkop Thanks for the patch! LGTM.

Could you rebase the latest master to resolve the conflict, and squash all commits into one? Also it looks like the author and email of your commits don't match your profile on GitHub, and you may want to modify them before pushing.

@wallkop
Copy link
Contributor Author

wallkop commented Jun 19, 2023

hi @PatrickRen It looks like the master branch of debezium was upgraded to 1.9.7 a few hours ago. I may need to do some retesting after resolving code conflicts to make sure this change works fine with debezium 1.9.7.

I will solve the all problem you describe later.

@wallkop wallkop force-pushed the fix_1944 branch 6 times, most recently from 0445b8d to b4ce0dc Compare June 19, 2023 12:10
…p mode

Author:    wallkop <120839632@qq.com>
Date:      Sat Jun 17 22:36:54 2023 +0800
Copy link
Contributor

@PatrickRen PatrickRen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch! LGTM.

@PatrickRen PatrickRen merged commit debd6ef into apache:master Jun 20, 2023
13 of 15 checks passed
yuxiqian added a commit to yuxiqian/flink-cdc that referenced this pull request Jun 30, 2023
…-2062

* origin/feat/issue-2062:
  [docs] Update connector link to 2.5-SNAPSHOT in docs
  [build] Bump version to 2.5-SNAPSHOT
  [hotfix][mysql] remove unused code (apache#2231)
  [hotfix] Add vitess connector to the release profile (apache#2233)
  [docs][hotfix] Update debezium reference links to 1.9 version
  [build] Update the copyright year to 2023 (apache#2205)
  [postgres] Fix postgres e2e test
  [postgres] scan.incremental.snapshot.enabled is closed by default
  [postgres] Backfill task will be able to end when there is not new change data but read the ending lsn
  [postgres] Create slot for backfill task before snapshot reading
  [postgres] Prepare a slot for the unique global stream split
  [mysql] Fix GTID issues to recover from checkpoint normally in specifying startup mode (apache#2220)
@SML0127
Copy link
Contributor

SML0127 commented Aug 5, 2023

@wallkop
Thx to fix this issue! (I had a similar issue)

I have one question in your description.
what is complete GTID?
as i understand, it means below one of them

1. GTIDs like instance:111-2222 (not start from 1)
2. { kind: "SPECIFIC", binlogfile: "binlog.000123", pos: "3834747", gtids: "instance:123-45667"}, { kind: "NON_STOPPING", binlogfile: "", pos: "-982387918273", gtids: ""} (in checkpoint)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants