
[HUDI-5499] Fixing Spark SQL configs not being properly propagated for CTAS and other commands #7607

Merged: 11 commits, Jan 20, 2023

Conversation

@alexeykudinkin (Contributor) commented Jan 5, 2023

Change Logs

While following up on adding support for the BrooklynData benchmarks, we discovered that CTAS isn't properly propagating configs, due to a recent change in [#5178](https://github.com/apache/hudi/pull/5178/files#diff-560283e494c8ba8da102fc217a2201220dd4db731ec23d80884e0f001a7cc0bcR117).

Unfortunately, the configuration-handling logic in `ProvidesHoodieConfig` has become overly complicated and fragmented.

This PR takes a stab at unifying and streamlining how options from different sources (Spark catalog properties, table properties, Spark SQL conf, overrides, etc.) are fused, making sure different Spark SQL operations (for example `MERGE INTO`, CTAS, `INSERT INTO`) handle configuration in much the same way.

Changes

  • Simplify and unify `ProvidesHoodieConfig`'s fusion of configuration from different sources
  • Fix CTAS to override `hoodie.combine.before.insert` to `false`
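The fusion described above can be sketched as follows. This is an illustrative simplification, not Hudi's actual `combineOptions` implementation: the parameter names and the keys shown are assumptions for the sketch; the precedence (explicit overrides winning over everything else) follows the discussion in this thread.

```scala
// Hypothetical sketch of fusing write options from several sources,
// with later sources taking precedence on key collisions.
object ConfigFusionSketch {
  def combineOptions(catalogProps: Map[String, String],
                     tableProps: Map[String, String],
                     sqlConf: Map[String, String],
                     overrides: Map[String, String]): Map[String, String] =
    // Scala's ++ keeps the right-hand value on key collisions, so
    // precedence grows left to right: overrides win over everything.
    catalogProps ++ tableProps ++ sqlConf ++ overrides

  def main(args: Array[String]): Unit = {
    val fused = combineOptions(
      Map("hoodie.datasource.write.operation" -> "upsert"),
      Map.empty,
      Map("hoodie.datasource.write.operation" -> "insert"),
      Map("hoodie.combine.before.insert" -> "false")) // CTAS override
    assert(fused("hoodie.datasource.write.operation") == "insert")
    assert(fused("hoodie.combine.before.insert") == "false")
  }
}
```

The point of funneling every command through one such method is that MERGE INTO, CTAS, and INSERT INTO can no longer diverge in which source wins for a given key.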

Impact

Fixes discrepancies in how configuration is handled across different Spark SQL commands, addressing some of the issues stemming from this (for example, CTAS using the "insert" write operation instead of "bulk_insert").

Risk level (write none, low, medium or high below)

Medium

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@xushiyan (Member) left a comment

LGTM. Please check the UT failures.

```
@@ -81,10 +80,8 @@ trait ProvidesHoodieConfig extends Logging {
HoodieSyncConfig.META_SYNC_PARTITION_FIELDS.key -> tableConfig.getPartitionFieldProp,
HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key -> hiveSyncConfig.getStringOrDefault(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS),
HiveSyncConfigHolder.HIVE_SUPPORT_TIMESTAMP_TYPE.key -> hiveSyncConfig.getBoolean(HiveSyncConfigHolder.HIVE_SUPPORT_TIMESTAMP_TYPE).toString,
HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key -> hoodieProps.getString(HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key, "200"),
```
A Contributor commented:
Does this mean that the upsert parallelism cannot be tuned anymore from the SQL statement? Generally, are the key-value pairs in Map.apply() just overrides?

Say upsert parallelism is not set in the combined options; is the default parallelism then picked up via the write config?

A Member replied:

This is a good catch. I took a closer look at `org.apache.hudi.config.HoodieWriteConfig#getUpsertShuffleParallelism`, which does not retrieve the default value. There is no reason we should not use `getIntOrDefault()` in those `getXXXParallelism()` methods.
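The suggested fix could look roughly like this. It is a minimal sketch, not Hudi's actual config machinery: `ConfigProperty` and `HoodieConfigSketch` here are simplified stand-ins, and the default of 200 is taken from the hardcoded fallback in the diff above (Hudi's real declared default may differ).

```scala
import java.util.Properties

// Simplified stand-in for Hudi's ConfigProperty: a key plus its declared default.
case class ConfigProperty[T](key: String, defaultValue: T)

class HoodieConfigSketch(props: Properties) {
  // Mirrors the proposed getIntOrDefault(): fall back to the config's own
  // declared default when the key is absent, instead of hardcoding one at call sites.
  def getIntOrDefault(cfg: ConfigProperty[Int]): Int =
    Option(props.getProperty(cfg.key)).map(_.trim.toInt).getOrElse(cfg.defaultValue)
}

object HoodieConfigSketch {
  // The key name is Hudi's; the default value 200 is illustrative.
  val UpsertParallelism = ConfigProperty("hoodie.upsert.shuffle.parallelism", 200)

  def main(args: Array[String]): Unit = {
    val empty = new HoodieConfigSketch(new Properties())
    assert(empty.getIntOrDefault(UpsertParallelism) == 200) // default picked up

    val tuned = new Properties()
    tuned.setProperty(UpsertParallelism.key, "64")
    assert(new HoodieConfigSketch(tuned).getIntOrDefault(UpsertParallelism) == 64)
  }
}
```

With the default resolved inside the config object, the SQL layer no longer needs the `"200"` literal at all, and users can still tune the parallelism from the SQL statement.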

The Member added:

These default values in SQL were probably added as a band-aid to make things pass. We should fix the root cause at the config-object level.

Another Contributor replied:

> Does this mean that the upsert parallelism cannot be tuned anymore from the SQL statement? Generally, are the key-value pairs in Map.apply() just overrides?

The `combineOptions` method adds it from `SQLConf`, and the property-priority logic differs from the old behavior: the key-value pairs passed to `Map.apply()` have the highest priority.

The author replied:

@yihua @xushiyan

  • Setting this to 200 here is clearly a band-aid
  • As Raymond pointed out, we should fix this at the root, where we access these configs (we actually need to do it for all configs with defaults)

A Contributor replied:

Definitely, we should not hardcode any defaults here. As long as the configs take effect, I'm good.

@yihua (Contributor) left a comment

Generally LGTM. I think we need to revisit config passing in the Spark SQL code path. While reviewing the PR, I wondered: why not use `HoodieWriteConfig` with additional properties instead of merging in different places like `ProvidesHoodieConfig` and `HoodieSparkSqlWriter`?

@yihua (Contributor) commented Jan 13, 2023

@alexeykudinkin could you check the CI failure?

@hudi-bot
CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@alexeykudinkin merged commit f0f8d61 into apache:master on Jan 20, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Jan 31, 2023: …r CTAS and other commands (apache#7607)

nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 22, 2023: …r CTAS and other commands (apache#7607)

fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023: …r CTAS and other commands (apache#7607)
5 participants