Flink: use correct scan mode when in TABLE_SCAN_THEN_INCREMENTAL mode #7338

Merged 2 commits on Apr 16, 2023

Conversation

chenjunjiedada
Collaborator

When consuming a table in TABLE_SCAN_THEN_INCREMENTAL mode and its snapshot history has expired, data can be lost. This is because checkScanMode returns incremental mode when the scan context is streaming. To address this issue, we have added a case to handle the TABLE_SCAN_THEN_INCREMENTAL mode.
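
The failure mode described above can be sketched in isolation. This is a minimal illustration, not the actual Iceberg code: the `ScanContext` class below is a hypothetical stand-in with only three fields, and the `startSnapshotId`/`endSnapshotId` conditions are an assumption about what the original `if` clause checked besides `isStreaming()`. It shows why a streaming context with no snapshot range was treated as incremental, and how dropping the streaming check (the simplification proposed later in the review) lets the initial table scan run as a batch scan.

```java
// Hypothetical, simplified stand-ins; not the real Iceberg Flink classes.
class ScanModeSketch {
  enum ScanMode { BATCH, INCREMENTAL_APPEND_SCAN }

  // Only the fields this sketch needs; the real ScanContext has many more.
  static final class ScanContext {
    private final boolean streaming;
    private final Long startSnapshotId; // null means "not set"
    private final Long endSnapshotId;   // null means "not set"

    ScanContext(boolean streaming, Long startSnapshotId, Long endSnapshotId) {
      this.streaming = streaming;
      this.startSnapshotId = startSnapshotId;
      this.endSnapshotId = endSnapshotId;
    }

    boolean isStreaming() { return streaming; }
    Long startSnapshotId() { return startSnapshotId; }
    Long endSnapshotId() { return endSnapshotId; }
  }

  // Behavior described in the PR text: any streaming context maps to incremental,
  // even when the planner wants a full table scan first.
  static ScanMode checkScanModeBefore(ScanContext context) {
    if (context.isStreaming()
        || context.startSnapshotId() != null
        || context.endSnapshotId() != null) {
      return ScanMode.INCREMENTAL_APPEND_SCAN;
    }
    return ScanMode.BATCH;
  }

  // Simplified check without the isStreaming() condition: only an explicit
  // snapshot range selects the incremental scan.
  static ScanMode checkScanModeAfter(ScanContext context) {
    if (context.startSnapshotId() != null || context.endSnapshotId() != null) {
      return ScanMode.INCREMENTAL_APPEND_SCAN;
    }
    return ScanMode.BATCH;
  }

  public static void main(String[] args) {
    // Streaming job doing its initial table scan: no snapshot range is set yet.
    ScanContext initialScan = new ScanContext(true, null, null);
    System.out.println("before: " + checkScanModeBefore(initialScan));
    System.out.println("after:  " + checkScanModeAfter(initialScan));
  }
}
```

With the old check, the initial scan of a streaming job is planned as an incremental append scan, which is how rows can go missing once older snapshots have expired.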

@github-actions github-actions bot added the flink label Apr 13, 2023
@stevenzwu stevenzwu self-requested a review April 13, 2023 13:41
        BATCH,
        INCREMENTAL_APPEND_SCAN
      }

    - private static ScanMode checkScanMode(ScanContext context) {
    + @VisibleForTesting
    + static ScanMode checkScanMode(ScanContext context) {
Contributor

@chenjunjiedada thx for catching the bug and creating the PR fix.

For the conditions here, is there any other simpler logic? E.g., is it enough to just remove the context.isStreaming() condition in the original if clause?

Also, I think it is safer and clearer to construct a new ScanContext object and set the useSnapshotId.

    if (scanContext.streamingStartingStrategy()
        == StreamingStartingStrategy.TABLE_SCAN_THEN_INCREMENTAL) {
      // do a batch table scan first
      splits = FlinkSplitPlanner.planIcebergSourceSplits(table, scanContext, workerPool);
      LOG.info(
          "Discovered {} splits from initial batch table scan with snapshot Id {}",
          splits.size(),
          startSnapshot.snapshotId());
      // For TABLE_SCAN_THEN_INCREMENTAL, incremental mode starts exclusive from the startSnapshot
      toPosition =
          IcebergEnumeratorPosition.of(startSnapshot.snapshotId(), startSnapshot.timestampMillis());

Collaborator Author

> For the conditions here, is there any other simpler logic? E.g., is it enough to just remove the context.isStreaming() condition in the original if clause?

Yes, that looks simpler and more direct.

> Also, I think it is safer and clearer to construct a new ScanContext object and set the useSnapshotId.

Agreed, we can use scanContext.copyWithSnapshotId to achieve that.
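
The copy-instead-of-mutate idea discussed here can be sketched as follows. This is an illustration under stated assumptions, not the real implementation: the `ScanContext` class is a hypothetical two-field stand-in, and this `copyWithSnapshotId` only mirrors the name mentioned above; what the real method copies or resets is not shown in this conversation. The point is that the initial batch scan gets its own context pinned to the start snapshot, while the shared streaming context stays unchanged.

```java
// Hypothetical, simplified stand-in; not the real Iceberg Flink ScanContext.
class SnapshotPinSketch {
  static final class ScanContext {
    private final boolean streaming;
    private final Long useSnapshotId; // null means "no snapshot pinned"

    ScanContext(boolean streaming, Long useSnapshotId) {
      this.streaming = streaming;
      this.useSnapshotId = useSnapshotId;
    }

    boolean isStreaming() { return streaming; }
    Long useSnapshotId() { return useSnapshotId; }

    // Returns a new context pinned to the given snapshot; the receiver is not modified.
    ScanContext copyWithSnapshotId(long snapshotId) {
      return new ScanContext(streaming, snapshotId);
    }
  }

  public static void main(String[] args) {
    ScanContext streamingContext = new ScanContext(true, null);
    long startSnapshotId = 42L; // illustrative value only

    // Context used only for the initial batch table scan.
    ScanContext initialScanContext = streamingContext.copyWithSnapshotId(startSnapshotId);

    System.out.println("shared context snapshot: " + streamingContext.useSnapshotId());
    System.out.println("initial scan snapshot:   " + initialScanContext.useSnapshotId());
  }
}
```

Keeping the context immutable means later incremental planning cannot accidentally inherit the snapshot pinned for the initial scan.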

@stevenzwu stevenzwu merged commit b78d336 into apache:master Apr 16, 2023
12 checks passed
@stevenzwu
Contributor

@chenjunjiedada thx for finding and fixing this bug

@stevenzwu
Contributor

@chenjunjiedada can you create a backport PR too?

@chenjunjiedada chenjunjiedada deleted the fix-incr-start branch April 19, 2023 01:45
chenjunjiedada added a commit to chenjunjiedada/incubator-iceberg that referenced this pull request Apr 19, 2023
stevenzwu pushed a commit that referenced this pull request Apr 19, 2023
manisin pushed a commit to Snowflake-Labs/iceberg that referenced this pull request May 9, 2023
manisin pushed a commit to Snowflake-Labs/iceberg that referenced this pull request May 9, 2023