[Mysql] Fix duplicate split which cause duplicate data when open scanNewlyAddedTableEnabled #2096
ruanhang1993 merged 3 commits into apache:master
@leonardBang @lzshlzsh PTAL
@EMsnap Thanks.
…NewlyAddedTableEnabled
Sure, done.
Thanks for the update.
@EMsnap I think this PR should be modified. Actually, the set of all captured tables is not equal to remainingTables + alreadyProcessedTables. The method
    remainingSplits.addAll(schemaLessSnapshotSplits);
    if (!chunkSplitter.hasNextChunk()) {
        remainingTables.remove(nextTable);
        addAlreadyProcessedTablesIfNotExists(nextTable);
We should change the code in captureNewlyAddedTables instead of changing the location of addAlreadyProcessedTablesIfNotExists.
Noted, I'll change the implementation asap
Thanks for the reply, I'll take a look.
Sorry, the
…pen scanNewlyAddedTableEnabled" This reverts commit 02e3318.
@EMsnap Thanks for the work. We are trying to fix this in version 2.4.1, so I have provided new changes to fix it.
Thanks for your reply and your new changes, great job!
…y added tables (apache#2096) Co-authored-by: Hang Ruan <ruanhang1993@hotmail.com>



Fix #2095
I debugged this carefully; here is the analysis:
1. In MySqlSnapshotSplitAssigner, tables are split asynchronously.
2. Once a table has been completely split, it is removed from remainingTables.
3. After the TM calls getNext() on MySqlSnapshotSplitAssigner, the table is added to alreadyProcessedTables.
4. However, when the job restarts with scanNewlyAddedTableEnabled turned on and discovers newly added tables, it enters the following logic:
alreadyProcessedTables + remainingTables is not always equal to the set of tables that were split in the last run.
This happens when a table is so large that the TM cannot process its splits in time.
For example:
1. The task reads from tables A and B, and each of them is split into 1000 chunks.
2. remainingTables is empty because both tables have already been split, but alreadyProcessedTables contains only A, because B's splits have not been fetched by the TM yet.
3. The job restarts and B is split again.
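
The example above can be reproduced with a toy model. This is only a sketch under stated assumptions: `DuplicateSplitSketch` and `tablesTreatedAsNew` are hypothetical names, table IDs are plain strings, and the set difference is my reading of how a restart decides which tables are "new". It is not the actual code of MySqlSnapshotSplitAssigner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the assigner state; NOT the real MySqlSnapshotSplitAssigner. */
public class DuplicateSplitSketch {

    /**
     * Hypothetical helper: returns the captured tables that are in neither
     * alreadyProcessedTables nor remainingTables, i.e. the tables a restart
     * would treat as newly added and split from scratch.
     */
    public static Set<String> tablesTreatedAsNew(
            List<String> capturedTables,
            List<String> alreadyProcessedTables,
            List<String> remainingTables) {
        Set<String> result = new LinkedHashSet<>(capturedTables);
        result.removeAll(alreadyProcessedTables);
        result.removeAll(remainingTables);
        return result;
    }

    public static void main(String[] args) {
        // State at the checkpoint: A and B are both fully split, so
        // remainingTables is empty, but only A's splits have been fetched.
        List<String> captured = Arrays.asList("A", "B");
        List<String> alreadyProcessed = Arrays.asList("A");
        List<String> remaining = new ArrayList<>();

        // B was already split in the last run, yet it shows up as "new".
        System.out.println(tablesTreatedAsNew(captured, alreadyProcessed, remaining));
        // prints [B]
    }
}
```

In this model B falls into the gap between "fully split" and "fetched by the TM", so a restart cannot tell it apart from a table that was never seen.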