Skip to content

[HUDI-7235] Fix checkpoint bug for S3/GCS Incremental Source#10336

Merged
bvaradar merged 8 commits intoapache:masterfrom
vinishjail97:HUDI-7235-Fix-Checkpoint-Bug
Apr 25, 2024
Merged

[HUDI-7235] Fix checkpoint bug for S3/GCS Incremental Source#10336
bvaradar merged 8 commits intoapache:masterfrom
vinishjail97:HUDI-7235-Fix-Checkpoint-Bug

Conversation

@vinishjail97
Copy link
Contributor

@vinishjail97 vinishjail97 commented Dec 15, 2023

Change Logs

Fix bug in checkpointing logic for S3/GCS in empty dataset use-case. The reason for the bug was following.

1st delta commit's checkpoint, processed 3 files.

23/12/06 16:55:26 INFO S3EventsHoodieIncrSource  : Querying S3 with:00000000000000, queryInfo:Query information for Incremental Source queryType: snapshot, previousInstant: 00000000000000, startInstant: 00000000000000, endInstant: 20231206150423946, orderColumn: _hoodie_commit_time, keyColumn: s3.object.key, limitColumn: s3.object.size, orderByColumns: [_hoodie_commit_time, s3.object.key]
	

23/12/06 16:55:42 INFO S3EventsHoodieIncrSource  : Adjusting end checkpoint:20231206150423946 based on sourceLimit :300000000
	
23/12/06 16:55:46 INFO S3EventsHoodieIncrSource  : Adjusted end checkpoint :20231206150423946#ee-facts/0012_part_00.parquet
	

23/12/06 16:55:49 INFO S3EventsHoodieIncrSource  : Total number of files to process :3

2nd delta commit was an empty one and the checkpoint returned was 20231206150423946 which is not a valid checkpoint progression because it should either be equal or increase monotonically (based on lexicographical order)

23/12/06 16:59:52 INFO S3EventsHoodieIncrSource  : Querying S3 with:20231206150423946#ee-facts/0012_part_00.parquet, queryInfo:Query information for Incremental Source queryType: incremental, previousInstant: 00000000000000, startInstant: 20231206150423946, endInstant: 20231206150423946, orderColumn: _hoodie_commit_time, keyColumn: s3.object.key, limitColumn: s3.object.size, orderByColumns: [_hoodie_commit_time, s3.object.key]
	
23/12/06 16:59:53 INFO S3EventsHoodieIncrSource  : Adjusting end checkpoint:20231206150423946 based on sourceLimit :300000000
	
23/12/06 16:59:55 INFO S3EventsHoodieIncrSource  : Empty source, returning endpoint:20231206150423946

As the previous commits' checkpoint was a faulty one, the 3rd commit read the same set of files again and wrote duplicate data.

Impact

Bug Fix

Risk level (write none, low medium or high below)

Medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

None, this is a bug fix for an existing feature.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@bvaradar bvaradar self-assigned this Dec 15, 2023
Copy link
Contributor

@bvaradar bvaradar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinishjail97 : Very good catch. Few questions and nit comments. Please take a look at the failing tests including the validate-PR

if (!checkPointAndDataset.getRight().isPresent()) {
LOG.info("Empty source, returning endpoint:" + queryInfo.getEndInstant());
return Pair.of(Option.empty(), queryInfo.getEndInstant());
LOG.info("Empty source, returning endpoint:" + checkPointAndDataset.getLeft().toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: .toString() is not needed in log message

if (sourceData.isEmpty()) {
return Pair.of(cloudObjectIncrCheckpoint, Option.empty());
// There is no file matching the prefix.
return Pair.of(new CloudObjectIncrCheckpoint(queryInfo.getEndInstant(), null), Option.empty());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we use the cloudObjectIncrCheckpoint that is being passed to this function ?

Copy link
Contributor

@bvaradar bvaradar Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the code to pick the right checkpoint

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Feb 26, 2024
@bvaradar
Copy link
Contributor

bvaradar commented Apr 1, 2024

@vinishjail97 : Can you address these comments and land it.

@bvaradar bvaradar force-pushed the HUDI-7235-Fix-Checkpoint-Bug branch from 7f685a6 to de49a9d Compare April 3, 2024 06:10
@bvaradar
Copy link
Contributor

bvaradar commented Apr 3, 2024

@vinishjail97 : Can you address these comments and land it.

Updated the diff after addressing review comments.

@bvaradar bvaradar force-pushed the HUDI-7235-Fix-Checkpoint-Bug branch from 0656fa6 to 1bb5e22 Compare April 5, 2024 20:48
@bvaradar bvaradar force-pushed the HUDI-7235-Fix-Checkpoint-Bug branch from 1bb5e22 to 5edeb0f Compare April 17, 2024 04:32
Copy link
Contributor

@bvaradar bvaradar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have fixed the issue in additional place when there is no file matching the prefix.

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@bvaradar bvaradar merged commit fb02249 into apache:master Apr 25, 2024
yihua pushed a commit that referenced this pull request May 15, 2024
Co-authored-by: Balaji Varadarajan <balaji.varadarajan@robinhood.com>
Co-authored-by: Balaji Varadarajan <vbalaji@apache.org>
yihua pushed a commit that referenced this pull request May 15, 2024
Co-authored-by: Balaji Varadarajan <balaji.varadarajan@robinhood.com>
Co-authored-by: Balaji Varadarajan <vbalaji@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:S PR with lines of changes in (10, 100]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants