[HUDI-7523] Add HOODIE_SPARK_DATASOURCE_OPTIONS to be used in HoodieIncrSource #10900

Merged 5 commits into apache:master on May 13, 2024

Conversation

@vinishjail97 (Contributor) commented Mar 21, 2024

Change Logs

Add a new config, HOODIE_SPARK_DATASOURCE_OPTIONS, which is passed to the Spark DataFrame reader used by HoodieIncrSource. Options defined in DataSourceOptions.scala, such as enabling the metadata table or data skipping, can be passed through it for more efficient file pruning.

https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

Impact

Files can be pruned using column stats and the other mechanisms these options enable, making HoodieIncrSource more efficient.

Risk level (write none, low, medium or high below)

Low

Documentation Update

HOODIE_SPARK_DATASOURCE_OPTIONS is the new config being added.
It is a comma-separated list of options that can be passed to the Spark DataFrame reader of a Hudi table, e.g. hoodie.metadata.enable=true,hoodie.enable.data.skipping=true.
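
To make the expected value format concrete, here is a minimal sketch of how a comma-separated `key=value` string like the one above could be turned into an options map. The class `DatasourceOptionsSketch` and its `parse` method are hypothetical names used only for illustration; this is not the parsing code from this PR, and the actual handling in HoodieIncrSource may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Hudi.
final class DatasourceOptionsSketch {

  // Splits "k1=v1,k2=v2" into an insertion-ordered map.
  // Assumption: neither keys nor values contain a comma.
  static Map<String, String> parse(String raw) {
    Map<String, String> options = new LinkedHashMap<>();
    if (raw == null || raw.trim().isEmpty()) {
      return options; // nothing configured, so no extra reader options
    }
    for (String pair : raw.split(",")) {
      String trimmed = pair.trim();
      if (trimmed.isEmpty()) {
        continue; // tolerate a trailing or doubled comma
      }
      int idx = trimmed.indexOf('=');
      if (idx <= 0) {
        throw new IllegalArgumentException("Expected key=value but got: " + trimmed);
      }
      // Split on the first '=' only, so values may themselves contain '='.
      options.put(trimmed.substring(0, idx).trim(), trimmed.substring(idx + 1).trim());
    }
    return options;
  }

  public static void main(String[] args) {
    Map<String, String> options =
        parse("hoodie.metadata.enable=true,hoodie.enable.data.skipping=true");
    options.forEach((k, v) -> System.out.println(k + " -> " + v));
  }
}
```

A map built this way could then be handed to a Spark reader, for example through `DataFrameReader.options(java.util.Map)`. Note that this simple split cannot represent option values that contain commas.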

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@vinishjail97 vinishjail97 changed the title Add HOODIE_SPARK_DATASOURCE_OPTIONS used in HoodieIncrSource [HUDI-7523] Add HOODIE_SPARK_DATASOURCE_OPTIONS used in HoodieIncrSource Mar 21, 2024
@vinishjail97 vinishjail97 changed the title [HUDI-7523] Add HOODIE_SPARK_DATASOURCE_OPTIONS used in HoodieIncrSource [HUDI-7523] Add HOODIE_SPARK_DATASOURCE_OPTIONS to be used in HoodieIncrSource Mar 21, 2024
@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Mar 21, 2024
.build();

TypedProperties extraProps = new TypedProperties();
extraProps.setProperty(HoodieIncrSourceConfig.HOODIE_SPARK_DATASOURCE_OPTIONS.key(), "hoodie.metadata.enable=true,hoodie.enable.data.skipping=true");

Is there also a test which actually confirms that metadata table is used and data skipping happens when reading incrementally?

I think it might be hard to check the Spark reader contains the passed configs in the tests.

@nsivabalan

Hey @vinishjail97, can you address the review comments from Sagar?

@nsivabalan nsivabalan self-assigned this May 9, 2024
@yihua yihua force-pushed the HUDI-7523-Spark-DataSource branch from 5fefa9e to d0d786d Compare May 12, 2024 23:25
yihua and others added 3 commits May 12, 2024 16:29
…HoodieIncrSourceConfig.java

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
…/HoodieIncrSource.java

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels May 12, 2024
@yihua yihua merged commit ce08875 into apache:master May 13, 2024
46 checks passed
yihua added a commit that referenced this pull request May 15, 2024
…ncrSource (#10900)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Labels
release-0.15.0 size:M PR with lines of changes in (100, 300]
5 participants