[cdc-connector][mongodb] Avoid mongodb source to read data after high_watermark in backfill phase #2893

Merged 1 commit on Dec 20, 2023


loserwang1024
Contributor

Fix #2892

@loserwang1024
Contributor Author

CC @Jiabao-Sun, @Shawn-Hx

Comment on lines 381 to 383
private boolean shouldEmit(ChangeStreamOffset currentOffset) {
    return !isBoundedRead() || currentOffset.isAtOrBefore(streamSplit.getEndingOffset());
}
Contributor

This method isn't called anywhere else. Maybe it should be deleted.

Contributor Author

Thanks a lot. I first used it to decide whether to emit, then realized that reading currentOffset for an unbounded stream is a waste, but I forgot to remove it.
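One reading of the comment above is that the offset only needs to be looked up when the read is bounded (the backfill phase). The following is a minimal sketch of that idea, not the connector's actual code: `EmitCheckSketch`, `lookupOffset`, and the use of plain `long` offsets in place of `ChangeStreamOffset` are all hypothetical names introduced for illustration.

```java
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the reader's emit decision; not the connector's actual code. */
final class EmitCheckSketch {
    /** Counts how often the (assumed costly) offset lookup runs. */
    static int offsetLookups = 0;

    /**
     * Returns true when the event should be emitted.
     * The offset supplier is only invoked for bounded (backfill) reads.
     */
    static boolean shouldEmit(boolean boundedRead, LongSupplier currentOffset, long endingOffset) {
        if (!boundedRead) {
            return true; // unbounded stream: the offset is never computed
        }
        // Bounded read: emit only events at or before the ending offset.
        return currentOffset.getAsLong() <= endingOffset;
    }

    /** Hypothetical offset lookup that records each invocation. */
    static long lookupOffset(long value) {
        offsetLookups++;
        return value;
    }

    public static void main(String[] args) {
        System.out.println(shouldEmit(false, () -> lookupOffset(99), 10)); // true
        System.out.println(shouldEmit(true, () -> lookupOffset(10), 10));  // true
        System.out.println(shouldEmit(true, () -> lookupOffset(11), 10));  // false
        System.out.println(offsetLookups);                                 // 2
    }
}
```

Passing the offset as a supplier keeps the unbounded path free of the lookup, which is the cost the comment describes as wasteful.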

if (currentOffset.isAtOrBefore(streamSplit.getEndingOffset())) {
    queue.enqueue(new DataChangeEvent(changeRecord));
Contributor

Maybe currentOffset.isAtOrBefore should be currentOffset.isBefore?

Contributor Author

Maybe in another PR? The other CDC sources currently also include high_watermark in their reads. This PR just fixes the bug and does not change the common logic.

Contributor

> Maybe in another PR? The other CDC sources currently also include high_watermark in their reads. This PR just fixes the bug and does not change the common logic.

Could including high_watermark cause the same data to be read twice?

Contributor Author

This will be fixed in #2885 for all connectors. But it has a big impact and still needs more discussion, for example whether to skip high_watermark in the backfill phase or skip low_watermark in the streaming phase.

Contributor

> This will be fixed in #2885 for all connectors. But it has a big impact and still needs more discussion, for example whether to skip high_watermark when reading, or have the streaming phase skip low_watermark.

ok
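To make the boundary question in this thread concrete, here is a toy sketch of the two phases. It assumes offsets are plain `long` values and that the streaming phase starts at the high watermark inclusively. Both are simplifications for illustration, and `WatermarkBoundarySketch` and its methods are hypothetical names, not part of the connector. Under those assumptions, an inclusive backfill (`isAtOrBefore`) overlaps the stream at the watermark itself, while an exclusive one (`isBefore`) does not.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the backfill/streaming hand-off; offsets are plain longs (an assumption). */
final class WatermarkBoundarySketch {
    /** Backfill phase: keep events up to the high watermark, inclusively or exclusively. */
    static List<Long> backfill(List<Long> offsets, long highWatermark, boolean inclusive) {
        List<Long> kept = new ArrayList<>();
        for (long offset : offsets) {
            boolean keep = inclusive ? offset <= highWatermark : offset < highWatermark;
            if (keep) {
                kept.add(offset);
            }
        }
        return kept;
    }

    /** Streaming phase: assumed here to start at startOffset, inclusive. */
    static List<Long> stream(List<Long> offsets, long startOffset) {
        List<Long> kept = new ArrayList<>();
        for (long offset : offsets) {
            if (offset >= startOffset) {
                kept.add(offset);
            }
        }
        return kept;
    }

    /** Offsets that both phases would emit, i.e. events read twice. */
    static List<Long> readTwice(List<Long> backfilled, List<Long> streamed) {
        List<Long> both = new ArrayList<>(backfilled);
        both.retainAll(streamed);
        return both;
    }

    public static void main(String[] args) {
        List<Long> log = List.of(1L, 2L, 3L, 4L, 5L);
        long highWatermark = 3L;
        List<Long> streamed = stream(log, highWatermark);

        // Inclusive backfill (isAtOrBefore): offset 3 is emitted by both phases.
        System.out.println(readTwice(backfill(log, highWatermark, true), streamed));  // [3]
        // Exclusive backfill (isBefore): no overlap with the stream.
        System.out.println(readTwice(backfill(log, highWatermark, false), streamed)); // []
    }
}
```

The alternative raised in the thread, keeping the backfill inclusive and having the streaming phase skip its starting watermark, would remove the overlap from the other side.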

Contributor

@Jiabao-Sun Jiabao-Sun left a comment

Thanks @loserwang1024 for this contribution and @gong for the review.
LGTM.

@Jiabao-Sun Jiabao-Sun merged commit 6b6d9fc into apache:master Dec 20, 2023
15 of 17 checks passed
lvyanquan pushed a commit that referenced this pull request Jan 18, 2024
…_watermark in backfill phase (#2893)

(cherry picked from commit 6b6d9fc)
joyCurry30 pushed a commit to joyCurry30/flink-cdc-connectors that referenced this pull request Mar 22, 2024

Successfully merging this pull request may close these issues.

[Bug][mongodb] Mongo backfill may read data after high_watermark
3 participants