[HUDI-6686] - Handling empty commits after S3 applyFilter API #9433

Merged
nsivabalan merged 1 commit into apache:master from lokesh-lingarajan-0310:emptycommit on Aug 15, 2023

lokesh-lingarajan-0310 commented on Aug 11, 2023

Change Logs

Handle empty commits by returning the current batch's endpoint as the checkpoint. This covers the scenario where a customer configures filters that match only specific objects in S3, so a batch can contain no matching objects after filtering.
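As a rough illustration of the behavior described above, here is a minimal sketch rather than the actual Hudi code. `BatchResult`, `EmptyBatchSketch`, and `filterBatch` are hypothetical names, `java.util.Optional` stands in for Hudi's `Option`, and the checkpoint is simplified to a single string. It assumes the batch endpoint is already known before filtering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical stand-in for the (checkpoint, data) pair; not Hudi's actual types.
final class BatchResult {
    final String checkpoint;               // where the next incremental read resumes
    final Optional<List<String>> objects;  // empty when the filter matched nothing

    BatchResult(String checkpoint, Optional<List<String>> objects) {
        this.checkpoint = checkpoint;
        this.objects = objects;
    }
}

final class EmptyBatchSketch {
    // Applies the filter to one batch of object keys. The batch endpoint is
    // returned as the checkpoint even when no key survives the filter, so the
    // next read does not rescan the same batch.
    static BatchResult filterBatch(List<String> keys, String batchEndpoint, Predicate<String> filter) {
        List<String> kept = new ArrayList<>();
        for (String key : keys) {
            if (filter.test(key)) {
                kept.add(key);
            }
        }
        if (kept.isEmpty()) {
            return new BatchResult(batchEndpoint, Optional.<List<String>>empty());
        }
        return new BatchResult(batchEndpoint, Optional.of(kept));
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("logs/a.json", "logs/b.json");
        BatchResult result = filterBatch(keys, "commit-002", k -> k.endsWith(".parquet"));
        // prints "commit-002 hasData=false"
        System.out.println(result.checkpoint + " hasData=" + result.objects.isPresent());
    }
}
```

The point of the sketch is that the checkpoint advances independently of whether any object matched. Without that, a filter that matches nothing would leave the reader stuck on the same batch.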

Impact

Medium

Risk level (write none, low medium or high below)

Medium

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

  LOG.info("Processed batch size: " + row.get(row.fieldIndex(CUMULATIVE_COLUMN_NAME)) + " bytes");
  sourceData.unpersist();
- return Pair.of(new CloudObjectIncrCheckpoint(row.getString(0), row.getString(1)), collectedRows);
+ return Pair.of(new CloudObjectIncrCheckpoint(row.getString(0), row.getString(1)), Option.of(collectedRows));

Minor: I would suggest using row.fieldIndex here as well, just as you do for CUMULATIVE_COLUMN_NAME.
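To make the suggestion concrete, here is a minimal sketch of name-based versus positional column access. `NamedRow` is a hypothetical, stripped-down stand-in that only mimics the `fieldIndex` and `getString` methods of Spark's `Row`. The column names `commit_time` and `object_key` are assumptions for illustration, not the names used in the PR.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Spark Row with a schema; it only mimics
// fieldIndex(String) and getString(int).
final class NamedRow {
    private final List<String> fieldNames;
    private final Object[] values;

    NamedRow(List<String> fieldNames, Object... values) {
        this.fieldNames = fieldNames;
        this.values = values;
    }

    // Resolves a column name to its position, failing loudly on unknown names.
    int fieldIndex(String name) {
        int idx = fieldNames.indexOf(name);
        if (idx < 0) {
            throw new IllegalArgumentException("Unknown field: " + name);
        }
        return idx;
    }

    String getString(int i) {
        return (String) values[i];
    }
}

final class FieldIndexSketch {
    public static void main(String[] args) {
        // Assumed column names; a column was added in front, shifting positions.
        NamedRow row = new NamedRow(
            Arrays.asList("cumulative_size", "commit_time", "object_key"),
            "1024", "20230811", "data/a.parquet");

        // Positional access silently picks up the wrong column after the shift.
        System.out.println(row.getString(0));                               // prints "1024"
        // Name-based access still returns the intended column.
        System.out.println(row.getString(row.fieldIndex("commit_time")));   // prints "20230811"
    }
}
```

Name-based lookup costs an extra index resolution per access. In exchange, the checkpoint construction keeps working if the projection order changes.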

Also, please open a ticket for the optimizations we discussed offline (use of applyOrdering in line 182 and the commit_key temporary column).

@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

  // Create S3 paths
  SerializableConfiguration serializableHadoopConf = new SerializableConfiguration(sparkContext.hadoopConfiguration());
- List<CloudObjectMetadata> cloudObjectMetadata = checkPointAndDataset.getRight()
+ List<CloudObjectMetadata> cloudObjectMetadata = checkPointAndDataset.getRight().get()

Can the Option be empty or nullable? Should we check before calling get() on Option?

We are doing that in line 166.
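For readers without the full file at hand, the kind of guard being discussed might look like the sketch below. It is an assumption about the general shape, not the code at line 166. `OptionGuardSketch` and `toPaths` are hypothetical names, `java.util.Optional` stands in for Hudi's `Option`, and the bucket name is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

final class OptionGuardSketch {
    // Builds S3 paths only when the batch carries data. An empty Optional
    // short-circuits before get() is ever called.
    static List<String> toPaths(Optional<List<String>> maybeKeys, String bucket) {
        if (!maybeKeys.isPresent()) {
            return Collections.emptyList();
        }
        List<String> paths = new ArrayList<>();
        for (String key : maybeKeys.get()) {
            paths.add("s3://" + bucket + "/" + key);
        }
        return paths;
    }

    public static void main(String[] args) {
        Optional<List<String>> none = Optional.empty();
        Optional<List<String>> some = Optional.of(Arrays.asList("data/a.parquet"));
        System.out.println(toPaths(none, "example-bucket")); // prints "[]"
        System.out.println(toPaths(some, "example-bucket")); // prints "[s3://example-bucket/data/a.parquet]"
    }
}
```

Checking once, early, keeps the later `get()` call safe, which appears to be the arrangement the author is referring to.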

nsivabalan added the release-0.14.0 and priority:blocker (Production down; release blocker) labels on Aug 15, 2023
nsivabalan merged commit 7eef7d9 into apache:master on Aug 15, 2023
prashantwason pushed a commit that referenced this pull request Aug 18, 2023
Handling empty commit and returning current batch's endpoint to handle scenarios of customer configuring filters for specific objects in s3 among other objects.

Co-authored-by: Lokesh Lingarajan <lokeshlingarajan@Lokeshs-MacBook-Pro.local>
leosanqing pushed a commit to leosanqing/hudi that referenced this pull request Sep 13, 2023
Handling empty commit and returning current batch's endpoint to handle scenarios of customer configuring filters for specific objects in s3 among other objects.

Co-authored-by: Lokesh Lingarajan <lokeshlingarajan@Lokeshs-MacBook-Pro.local>

5 participants