-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[mysql] Fix issue #1944: Fix GTIDs on startup to correctly recover from checkpoint #2220
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wallkop Thanks for the patch! LGTM.
Could you rebase the latest master to resolve the conflict, and squash all commits into one? Also it looks like the author and email of your commits don't match your profile on GitHub, and you may want to modify them before pushing.
hi @PatrickRen It looks like the master branch of debezium was upgraded to I will solve the all problem you describe later. |
0445b8d
to
b4ce0dc
Compare
…p mode Author: wallkop <120839632@qq.com> Date: Sat Jun 17 22:36:54 2023 +0800
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch! LGTM.
…-2062 * origin/feat/issue-2062: [docs] Update connector link to 2.5-SNAPSHOT in docs [build] Bump version to 2.5-SNAPSHOT [hotfix][mysql] remove unused code (apache#2231) [hotfix] Add vitess connector to the release profile (apache#2233) [docs][hotfix] Update debezium reference links to 1.9 version [build] Update the copyright year to 2023 (apache#2205) [postgres] Fix postgres e2e test [postgres] scan.incremental.snapshot.enabled is closed by default [postgres] Backfill task will be able to end when there is not new change data but read the ending lsn [postgres] Create slot for backfill task before snapshot reading [postgres] Prepare a slot for the unique global stream split [mysql] Fix GTID issues to recover from checkpoint normally in specifying startup mode (apache#2220)
@wallkop
|
Resolve #1944
Issue #1944 reports a bug where recovery from checkpoint fails when starting from offset-binlogfile&pos. After conducting further tests, I found that this bug can occur in Earliest, Timestamp, BinlogFile&Pos startup modes, as well as when incomplete Gtids are incorrectly set. The root cause of this bug is that when MySQL has GTID enabled and the aforementioned startup modes are used, the complete GTIDs cannot be recorded in the offsetContext. Instead, an empty GtidSet is initialized and subsequently appended by received Gtid Events. This leads to recovery failure when restoring from checkpoint due to incomplete GTIDs being recorded.
Test premise: MySQL needs to have GTID enabled, and due to settings such as master-slave switching and multiple masters, there are multiple sourceIDs for GTIDs
PoC From @PatrickRen: PatrickRen@2189da3
The solution comes from the POC of @PatrickRen. I have carefully reviewed this POC and conducted many tests. In the end, I believe this solution is viable.
The POC implementation is very elegant, and I only supplemented some comments and added a few test cases. The modified code has been validated using the company's production environment, and it has proven to perform very well under the classic master-slave architecture, correctly recovering from various start-mode checkpoints.
However, there are still two potential risks:
When MySQL is under a dual-master architecture, GTIDs may have gaps, similar to
A:1-102, 105-150
. Such gaps are temporary but will eventually be consistent. But if the CDC happens to recover from the checkpoint when there are gaps in MySQL's GTID, it may access non-existent transactions, leading to recovery failure. This is because our GtidUtils' fixRestoredGtidSet method does not fix the gaps in the server GTID. I initially wanted to optimize this issue, but after weighing it, I believe that in most cases, the occurrence of GTID gaps is a problem with MySQL itself, which should be fixed by the DBA in MySQL, not to be compatible with CDC, to avoid triggering other unpredictable issues.The current implementation will correct all GTIDs, including those set by the user during the initial startup of CDC via StartupOptions. This may cause a problem: the user manually sets the GTID-offset to
A:300-500
, the original intention might be to expect CDC to consume dataA:1-299:501~xxx
, but CDC corrects it toA:1-500
, only starting consumption from500
. Solving this problem is not difficult, the key is whether it is necessary to solve it. I finally convinced myself to form a unified specification and inform the users: CDC only cares about the maximum GTID position and starts from it. This standard will make our program more user-friendly, but to some extent, it will violate MySQL's replication protocol.It's worth mentioning that I'm just raising the above two risks for discussion, and I don't suggest immediately solving and fixing them, as the current implementation is already very well done.