
[HUDI-5499] Fixing Spark SQL configs not being properly propagated for CTAS and other commands #7607

Merged: 11 commits, Jan 20, 2023

Conversation

@alexeykudinkin (Contributor) commented Jan 5, 2023

Change Logs

While following up on adding support for the BrooklynData benchmarks, we discovered that CTAS isn't properly propagating configs, due to a recent change in [#5178](https://github.com/apache/hudi/pull/5178/files#diff-560283e494c8ba8da102fc217a2201220dd4db731ec23d80884e0f001a7cc0bcR117).

Unfortunately, the configuration-handling logic in `ProvidesHoodieConfig` has become overly complicated and fragmented.

This PR takes a stab at unifying and streamlining how options from different sources (Spark catalog properties, table properties, Spark SQL conf, overrides, etc.) are fused, making sure different Spark SQL operations (for example `MERGE INTO`, CTAS, `INSERT INTO`) handle configuration in much the same way.

Changes

  • Simplify and unify `ProvidesHoodieConfig`'s fusion of configuration from different sources
  • Fix CTAS to override `hoodie.combine.before.insert` to `false`
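The fusion described above can be sketched as follows. This is an illustrative simplification, not Hudi's actual `combineOptions` implementation: the parameter names and the keys shown are assumptions for the sketch; the precedence (explicit overrides winning over everything else) follows the discussion in this thread.

```scala
// Hypothetical sketch of fusing write options from several sources,
// with later sources taking precedence on key collisions.
object ConfigFusionSketch {
  def combineOptions(catalogProps: Map[String, String],
                     tableProps: Map[String, String],
                     sqlConf: Map[String, String],
                     overrides: Map[String, String]): Map[String, String] =
    // Scala's ++ keeps the right-hand value on key collisions, so
    // precedence grows left to right: overrides win over everything.
    catalogProps ++ tableProps ++ sqlConf ++ overrides

  def main(args: Array[String]): Unit = {
    val fused = combineOptions(
      Map("hoodie.datasource.write.operation" -> "upsert"),
      Map.empty,
      Map("hoodie.datasource.write.operation" -> "insert"),
      Map("hoodie.combine.before.insert" -> "false")) // CTAS override
    assert(fused("hoodie.datasource.write.operation") == "insert")
    assert(fused("hoodie.combine.before.insert") == "false")
  }
}
```

The point of funneling every command through one such method is that MERGE INTO, CTAS, and INSERT INTO can no longer diverge in which source wins for a given key.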

Impact

Fixes discrepancies in how configuration is handled across different Spark SQL commands, addressing some of the issues stemming from this (for example, CTAS using the "insert" write operation instead of "bulk_insert").

Risk level (write none, low, medium or high below)

Medium

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@xushiyan (Member) left a comment

LGTM. Please check the UT failures.

```
@@ -81,10 +80,8 @@ trait ProvidesHoodieConfig extends Logging {
HoodieSyncConfig.META_SYNC_PARTITION_FIELDS.key -> tableConfig.getPartitionFieldProp,
HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key -> hiveSyncConfig.getStringOrDefault(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS),
HiveSyncConfigHolder.HIVE_SUPPORT_TIMESTAMP_TYPE.key -> hiveSyncConfig.getBoolean(HiveSyncConfigHolder.HIVE_SUPPORT_TIMESTAMP_TYPE).toString,
HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key -> hoodieProps.getString(HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key, "200"),
```
A Contributor commented:
Does this mean that the upsert parallelism cannot be tuned anymore from the SQL statement? Generally, are the key-value pairs in Map.apply() just overrides?

Say upsert parallelism is not set in the combined options; is the default parallelism then picked up via the write config?

A Member replied:

This is a good catch. I took a closer look at `org.apache.hudi.config.HoodieWriteConfig#getUpsertShuffleParallelism`, which does not retrieve the default value. There is no reason we should not use `getIntOrDefault()` in those `getXXXParallelism()` methods.
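The suggested fix could look roughly like this. It is a minimal sketch, not Hudi's actual config machinery: `ConfigProperty` and `HoodieConfigSketch` here are simplified stand-ins, and the default of 200 is taken from the hardcoded fallback in the diff above (Hudi's real declared default may differ).

```scala
import java.util.Properties

// Simplified stand-in for Hudi's ConfigProperty: a key plus its declared default.
case class ConfigProperty[T](key: String, defaultValue: T)

class HoodieConfigSketch(props: Properties) {
  // Mirrors the proposed getIntOrDefault(): fall back to the config's own
  // declared default when the key is absent, instead of hardcoding one at call sites.
  def getIntOrDefault(cfg: ConfigProperty[Int]): Int =
    Option(props.getProperty(cfg.key)).map(_.trim.toInt).getOrElse(cfg.defaultValue)
}

object HoodieConfigSketch {
  // The key name is Hudi's; the default value 200 is illustrative.
  val UpsertParallelism = ConfigProperty("hoodie.upsert.shuffle.parallelism", 200)

  def main(args: Array[String]): Unit = {
    val empty = new HoodieConfigSketch(new Properties())
    assert(empty.getIntOrDefault(UpsertParallelism) == 200) // default picked up

    val tuned = new Properties()
    tuned.setProperty(UpsertParallelism.key, "64")
    assert(new HoodieConfigSketch(tuned).getIntOrDefault(UpsertParallelism) == 64)
  }
}
```

With the default resolved inside the config object, the SQL layer no longer needs the `"200"` literal at all, and users can still tune the parallelism from the SQL statement.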

The Member added:

These default values in SQL were probably added as a band-aid to make things pass. We should fix the root cause at the config-object level.

Another Contributor replied:

> Does this mean that the upsert parallelism cannot be tuned anymore from the SQL statement? Generally, are the key-value pairs in Map.apply() just overrides?

The `combineOptions` method adds it from `SQLConf`, and the property-priority logic differs from the old behavior: the key-value pairs passed to `Map.apply()` have the highest priority.

The author replied:

@yihua @xushiyan

  • Setting this to 200 here is clearly a band-aid
  • As Raymond pointed out, we should fix this at the root, where we access these configs (we actually need to do it for all configs with defaults)

A Contributor replied:

Definitely, we should not hardcode any defaults here. As long as the configs take effect, I'm good.

@yihua (Contributor) left a comment

Generally LGTM. I think we need to revisit config passing in the Spark SQL code path. While reviewing the PR, I wondered: why not use `HoodieWriteConfig` with additional properties instead of merging in different places like `ProvidesHoodieConfig` and `HoodieSparkSqlWriter`?

@yihua (Contributor) commented Jan 13, 2023

@alexeykudinkin could you check the CI failure?

@hudi-bot
CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@alexeykudinkin merged commit f0f8d61 into apache:master on Jan 20, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Jan 31, 2023: …r CTAS and other commands (apache#7607)

nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 22, 2023: …r CTAS and other commands (apache#7607)

fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023: …r CTAS and other commands (apache#7607)
5 participants