[HUDI-5499] Fixing Spark SQL configs not being properly propagated for CTAS and other commands #7607
Conversation
LGTM. pls check the UT failures.
@@ -81,10 +80,8 @@ trait ProvidesHoodieConfig extends Logging {
HoodieSyncConfig.META_SYNC_PARTITION_FIELDS.key -> tableConfig.getPartitionFieldProp,
HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key -> hiveSyncConfig.getStringOrDefault(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS),
HiveSyncConfigHolder.HIVE_SUPPORT_TIMESTAMP_TYPE.key -> hiveSyncConfig.getBoolean(HiveSyncConfigHolder.HIVE_SUPPORT_TIMESTAMP_TYPE).toString,
HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key -> hoodieProps.getString(HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key, "200"),
Does this mean that the upsert parallelism can no longer be tuned from the SQL statement? Generally, are the key-value pairs in `Map.apply()` just overrides? Say upsert parallelism is not set in the combined options: is the default parallelism then picked up from the write config?
This is a good catch. I took a closer look at `org.apache.hudi.config.HoodieWriteConfig#getUpsertShuffleParallelism`, which does not retrieve the default value. There is no reason we should not use `getIntOrDefault()` in those `getXXXParallelism()` methods.
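The difference between the two lookup styles can be sketched as follows. This is a minimal standalone illustration, not Hudi's actual classes: `ConfigProperty` and the two lookup helpers here are simplified stand-ins for the config-object pattern being discussed.

```java
import java.util.Properties;

// Simplified stand-in (not Hudi's actual class): a config property
// that carries its own declared default value.
class ConfigProperty {
    final String key;
    final String defaultValue;
    ConfigProperty(String key, String defaultValue) {
        this.key = key;
        this.defaultValue = defaultValue;
    }
}

public class ParallelismLookup {
    static final ConfigProperty UPSERT_PARALLELISM =
        new ConfigProperty("hoodie.upsert.shuffle.parallelism", "200");

    // Mirrors a plain getString(key): ignores the declared default,
    // so an unset key yields null.
    static String getString(Properties props, ConfigProperty cfg) {
        return props.getProperty(cfg.key);
    }

    // Mirrors getStringOrDefault(cfg): falls back to the default
    // declared on the config property itself.
    static String getStringOrDefault(Properties props, ConfigProperty cfg) {
        return props.getProperty(cfg.key, cfg.defaultValue);
    }

    public static void main(String[] args) {
        Properties props = new Properties(); // user did not set the key
        System.out.println(getString(props, UPSERT_PARALLELISM));          // null
        System.out.println(getStringOrDefault(props, UPSERT_PARALLELISM)); // 200
    }
}
```

With the `OrDefault` variant, the default lives in one place (the config object), so call sites never need to repeat it.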
These default values in SQL were probably added as a bad band-aid to make things pass. We should fix the root cause at the config-object level.
> Does this mean that the upsert parallelism cannot be tuned anymore from the SQL statement? Generally, are the key-value pairs in `Map.apply()` just overrides?

The `combineOptions` method adds them from `SQLConf`, and the property-priority logic differs from the old behavior: the `Map.apply()` entries have the highest priority.
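The priority order described here can be sketched as a simple layered merge, lowest-priority source first so that later puts win. This is a hypothetical illustration of the fusion logic, not the actual `combineOptions` implementation, and the exact set and order of layers is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of layered option fusion: sources are merged
// lowest-priority first, so each putAll overrides earlier layers,
// and the explicit Map.apply(...) entries end up with the highest priority.
public class CombineOptions {
    static Map<String, String> combine(Map<String, String> tableConfig,
                                       Map<String, String> sqlConf,
                                       Map<String, String> explicitOverrides) {
        Map<String, String> fused = new LinkedHashMap<>();
        fused.putAll(tableConfig);       // lowest priority: table properties
        fused.putAll(sqlConf);           // then: Spark SQL session conf
        fused.putAll(explicitOverrides); // highest priority: explicit pairs
        return fused;
    }

    public static void main(String[] args) {
        Map<String, String> fused = combine(
            Map.of("hoodie.upsert.shuffle.parallelism", "50"),
            Map.of("hoodie.upsert.shuffle.parallelism", "100"),
            Map.of("hoodie.upsert.shuffle.parallelism", "200"));
        System.out.println(fused.get("hoodie.upsert.shuffle.parallelism")); // 200
    }
}
```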
Definitely, we should not hardcode any defaults here. As long as the configs take effect, I'm good.
Generally LGTM. I think we need to revisit the config passing in the Spark SQL code path. While reviewing the PR, I'm wondering: why not use `HoodieWriteConfig` with additional properties instead of merging in different places like `ProvidesHoodieConfig` and `HoodieSparkSqlWriter`?
@alexeykudinkin could you check the CI failure?
Change Logs
While following up and adding support for BrooklynData Benchmarks, we've discovered that CTAS isn't properly propagating configs due to a recent change in #5178.

Unfortunately, the logic for handling the configuration in `ProvidesHoodieConfig` has become overly complicated and fragmented. This PR takes a stab at unifying and streamlining the fusing of options from the different sources (Spark Catalog props, table properties, Spark SQL conf, overrides, etc.), making sure the different Spark SQL operations (for ex, `MERGE INTO`, CTAS, `INSERT INTO`) handle it in much the same way.

Changes

- Simplify and unify `ProvidesHoodieConfig` configuration fusion from the different sources
- Fix CTAS to override "hoodie.combine.before.insert" as "false"

Impact
Fixes discrepancies in the ways configuration is handled across different Spark SQL commands addressing some of the issues stemming from this (for ex, CTAS using "insert" write operation instead of "bulk_insert")
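The CTAS-specific fix in the change list can be sketched as a command-level override applied on top of the fused options. The key name comes from the change log above; the method and class names here are hypothetical, not the PR's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: after options are fused from all sources, CTAS applies
// its own command-specific overrides last, forcing
// "hoodie.combine.before.insert" to "false" regardless of earlier layers.
public class CtasOverrides {
    static Map<String, String> withCtasOverrides(Map<String, String> fused) {
        Map<String, String> result = new HashMap<>(fused);
        result.put("hoodie.combine.before.insert", "false");
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> fused = new HashMap<>();
        fused.put("hoodie.combine.before.insert", "true"); // e.g. from table props
        System.out.println(
            withCtasOverrides(fused).get("hoodie.combine.before.insert")); // false
    }
}
```

Applying the command-specific entries last keeps them authoritative no matter what the table properties or session conf say.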
Risk level
Medium
Documentation Update
N/A
Contributor's checklist