[mysql] Fix issue #1944: Fix GTIDs on startup to correctly recover from checkpoint #2220

wallkop · 2023-06-17T15:00:16Z

Resolve #1944

Problem analysis

Issue #1944 reports a bug where recovery from checkpoint fails when starting from offset-binlogfile&pos. After conducting further tests, I found that this bug can occur in Earliest, Timestamp, BinlogFile&Pos startup modes, as well as when incomplete Gtids are incorrectly set. The root cause of this bug is that when MySQL has GTID enabled and the aforementioned startup modes are used, the complete GTIDs cannot be recorded in the offsetContext. Instead, an empty GtidSet is initialized and subsequently appended by received Gtid Events. This leads to recovery failure when restoring from checkpoint due to incomplete GTIDs being recorded.

Test premise: MySQL needs to have GTID enabled, and due to settings such as master-slave switching and multiple masters, there are multiple sourceIDs for GTIDs

startupMode	How to initialize GTID	Restore from a checkpoint
initial	Obtain the GTID value through SHOW MASTER STATUS before starting the snapshot	√
latest	Obtain the GTID value through SHOW MASTER STATUS before starting to consume	√
offset-gtid	Use the user-set GTID before starting to consume.Note: The user needs to set the complete GTID, otherwise it will not pass the startup verification	√
offset-binlogfile&pos	Set the consumption anchor of the binlog through binlog_file='xxx', offset=yyy before starting to consume, but the GTID is null after initialization	X
earliest	Set the consumption anchor of the binlog through binlog_file='', offset=0 before starting to consume, but the GTID is null after initialization	X
timestamp	Set the consumption anchor of the binlog through binlog_file='', offset=0 before starting to consume, but the GTID is null after initialization. After starting the consumption, all binlog events with timestamps less than the specified timestamp will be skipped until an event with a timestamp greater than or equal to timestamp is found, and the GTID will be recorded at that point	X

Solution approach

PoC From @PatrickRen: PatrickRen@2189da3

The solution comes from the POC of @PatrickRen. I have carefully reviewed this POC and conducted many tests. In the end, I believe this solution is viable.

The POC implementation is very elegant, and I only supplemented some comments and added a few test cases. The modified code has been validated using the company's production environment, and it has proven to perform very well under the classic master-slave architecture, correctly recovering from various start-mode checkpoints.

However, there are still two potential risks:

When MySQL is under a dual-master architecture, GTIDs may have gaps, similar to A:1-102, 105-150. Such gaps are temporary but will eventually be consistent. But if the CDC happens to recover from the checkpoint when there are gaps in MySQL's GTID, it may access non-existent transactions, leading to recovery failure. This is because our GtidUtils' fixRestoredGtidSet method does not fix the gaps in the server GTID. I initially wanted to optimize this issue, but after weighing it, I believe that in most cases, the occurrence of GTID gaps is a problem with MySQL itself, which should be fixed by the DBA in MySQL, not to be compatible with CDC, to avoid triggering other unpredictable issues.
The current implementation will correct all GTIDs, including those set by the user during the initial startup of CDC via StartupOptions. This may cause a problem: the user manually sets the GTID-offset to A:300-500, the original intention might be to expect CDC to consume data A:1-299:501~xxx, but CDC corrects it to A:1-500, only starting consumption from 500. Solving this problem is not difficult, the key is whether it is necessary to solve it. I finally convinced myself to form a unified specification and inform the users: CDC only cares about the maximum GTID position and starts from it. This standard will make our program more user-friendly, but to some extent, it will violate MySQL's replication protocol.

It's worth mentioning that I'm just raising the above two risks for discussion, and I don't suggest immediately solving and fixing them, as the current implementation is already very well done.

PatrickRen

@wallkop Thanks for the patch! LGTM.

Could you rebase the latest master to resolve the conflict, and squash all commits into one? Also it looks like the author and email of your commits don't match your profile on GitHub, and you may want to modify them before pushing.

wallkop · 2023-06-19T08:31:27Z

hi @PatrickRen It looks like the master branch of debezium was upgraded to 1.9.7 a few hours ago. I may need to do some retesting after resolving code conflicts to make sure this change works fine with debezium 1.9.7.

I will solve the all problem you describe later.

…p mode Author: wallkop <120839632@qq.com> Date: Sat Jun 17 22:36:54 2023 +0800

PatrickRen

Thanks for the patch! LGTM.

…-2062 * origin/feat/issue-2062: [docs] Update connector link to 2.5-SNAPSHOT in docs [build] Bump version to 2.5-SNAPSHOT [hotfix][mysql] remove unused code (apache#2231) [hotfix] Add vitess connector to the release profile (apache#2233) [docs][hotfix] Update debezium reference links to 1.9 version [build] Update the copyright year to 2023 (apache#2205) [postgres] Fix postgres e2e test [postgres] scan.incremental.snapshot.enabled is closed by default [postgres] Backfill task will be able to end when there is not new change data but read the ending lsn [postgres] Create slot for backfill task before snapshot reading [postgres] Prepare a slot for the unique global stream split [mysql] Fix GTID issues to recover from checkpoint normally in specifying startup mode (apache#2220)

SML0127 · 2023-08-05T07:55:34Z

@wallkop
Thx to fix this issue! (I had a similar issue)

I have one question in your description.
what is complete GTID?
as i understand, it means below one of them
~~1. GTIDs like instance:111-2222 (not start from 1)~~
~~2. { kind: "SPECIFIC", binlogfile: "binlog.000123", pos: "3834747", gtids: "instance:123-45667"}, { kind: "NON_STOPPING", binlogfile: "", pos: "-982387918273", gtids: ""} (in checkpoint)~~

This was referenced Jun 17, 2023

[Bug] Flink CDC 2.3.0 set startupOptions = specificOffset set specificOffsetFile and specificOffsetPos then can not start from checkpoint #1944

Closed

[mysql] Fix issue #1944: Initialize complete GTIDs to ensure subsequent recovery from checkpoint #2063

Closed

PatrickRen self-assigned this Jun 19, 2023

PatrickRen reviewed Jun 19, 2023

View reviewed changes

wallkop force-pushed the fix_1944 branch 6 times, most recently from 0445b8d to b4ce0dc Compare June 19, 2023 12:10

Fix issue#1944, recover from checkpoint normally in specifying startu…

8e46aaa

…p mode Author: wallkop <120839632@qq.com> Date: Sat Jun 17 22:36:54 2023 +0800

wallkop force-pushed the fix_1944 branch from 2fcdbd1 to 8e46aaa Compare June 19, 2023 12:31

Merge branch 'master' into fix_1944

718d844

PatrickRen approved these changes Jun 20, 2023

View reviewed changes

PatrickRen merged commit debd6ef into apache:master Jun 20, 2023
13 of 15 checks passed

wallkop mentioned this pull request Jun 21, 2023

[hotfix] remove unused code in PR#2220 #2231

Merged

This was referenced Jun 22, 2023

[Bug][MySql] Job restart failed from savepoint When set 'scan.startup.mode' = 'timestamp' #2230

Closed

[INLONG-8307][Sort] Fix job restart failed from savepoint When set 'scan.startup.mode' = 'timestamp|earliest-offset|specific-offset' apache/inlong#8308

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[mysql] Fix issue #1944: Fix GTIDs on startup to correctly recover from checkpoint #2220

[mysql] Fix issue #1944: Fix GTIDs on startup to correctly recover from checkpoint #2220

wallkop commented Jun 17, 2023

PatrickRen left a comment

wallkop commented Jun 19, 2023

PatrickRen left a comment

SML0127 commented Aug 5, 2023 •

edited

Loading

[mysql] Fix issue #1944: Fix GTIDs on startup to correctly recover from checkpoint #2220

[mysql] Fix issue #1944: Fix GTIDs on startup to correctly recover from checkpoint #2220

Conversation

wallkop commented Jun 17, 2023

PatrickRen left a comment

Choose a reason for hiding this comment

wallkop commented Jun 19, 2023

PatrickRen left a comment

Choose a reason for hiding this comment

SML0127 commented Aug 5, 2023 • edited Loading

SML0127 commented Aug 5, 2023 •

edited

Loading