[HUDI-7523] Add HOODIE_SPARK_DATASOURCE_OPTIONS to be used in HoodieIncrSource #10900
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/HoodieIncrSourceConfig.java
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
.build();

TypedProperties extraProps = new TypedProperties();
extraProps.setProperty(HoodieIncrSourceConfig.HOODIE_SPARK_DATASOURCE_OPTIONS.key(), "hoodie.metadata.enable=true,hoodie.enable.data.skipping=true");
Is there also a test that actually confirms the metadata table is used and data skipping happens when reading incrementally?
I think it might be hard to check in the tests that the Spark reader actually contains the passed configs.
6800d00 to 5fefa9e
Hey @vinishjail97, can you address the reviews from Sagar?
5fefa9e to d0d786d
…HoodieIncrSourceConfig.java Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
…/HoodieIncrSource.java Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
…ncrSource (#10900) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Change Logs
Add a new config, HOODIE_SPARK_DATASOURCE_OPTIONS, which is used by the Spark DataFrame reader for HoodieIncrSource. Options present in DataSourceOptions.scala, such as enabling the metadata table and data skipping, can be passed for efficient pruning of files.
https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
Impact
The files will be pruned using column stats and other available mechanisms, making HoodieIncrSource more efficient.
Risk level (write none, low, medium or high below)
Low
Documentation Update
HOODIE_SPARK_DATASOURCE_OPTIONS is the new config being added.
A comma-separated list of options that can be passed to the Spark DataFrame reader of a Hudi table, e.g. hoodie.metadata.enable=true,hoodie.enable.data.skipping=true.
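To make the value format concrete, here is a minimal sketch of how a comma-separated key=value string like the one above could be turned into an options map. The class name SparkOptionsSketch and the method parse are hypothetical and used only for illustration. This is not the parsing code used by HoodieIncrSource, and it assumes that keys and values contain no commas.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not the parser used by Hudi.
class SparkOptionsSketch {

  // Splits "k1=v1,k2=v2" into an insertion-ordered map.
  // Assumption: keys and values never contain a comma.
  static Map<String, String> parse(String csv) {
    Map<String, String> options = new LinkedHashMap<>();
    if (csv == null || csv.trim().isEmpty()) {
      return options;
    }
    for (String entry : csv.split(",")) {
      String trimmed = entry.trim();
      if (trimmed.isEmpty()) {
        continue; // tolerate stray or trailing commas
      }
      int eq = trimmed.indexOf('=');
      if (eq <= 0) {
        throw new IllegalArgumentException("Expected key=value but got: " + trimmed);
      }
      // Split on the first '=' only, so values may themselves contain '='.
      options.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
    }
    return options;
  }

  public static void main(String[] args) {
    Map<String, String> options =
        parse("hoodie.metadata.enable=true,hoodie.enable.data.skipping=true");
    options.forEach((key, value) -> System.out.println(key + " -> " + value));
  }
}
```

A map built this way could then be handed to Spark's DataFrameReader.options(Map) before loading the table. Whether the source does exactly that is an assumption here.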
Contributor's checklist