[HUDI-5796] Adding auto inferring partition from incoming df by nsivabalan · Pull Request #7951 · apache/hudi

nsivabalan · 2023-02-15T00:00:09Z

Change Logs

If a user writes to Hudi with the following syntax, we should infer the partition path automatically when Hudi's partition path field is not explicitly set.

df.write.partitionBy("col1").format("hudi").options(...).save()

Impact

Improves usability of hudi.

Risk level (write none, low medium or high below)

low.

Documentation Update

We might need to enhance our quick start to call this out.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

nsivabalan · 2023-02-22T20:49:20Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

    // translate the api partitionBy of spark DataFrameWriter to PARTITIONPATH_FIELD
-    if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)) {
+    // we should set hoodie's partition path only if its not set by the user.
+    if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)


this might be a backwards-incompatible change, but I'm not sure whether the previous behavior was supported by mistake.
for example, if someone sets hoodie's partition path field to col1 but the incoming df was written with partitionBy("col2"), then prior to this patch col2 would be used as the partitioning column for hudi. after this patch, it will be col1.
col2 is used only if the user did not explicitly set the hoodie partition path config.

let's track this behavior change for next release.

maybe we should fail the write if these 2 options are not the same? at least we should avoid unintended writes

synced up directly. this should be ok behavior. maybe we can add an entry to our FAQ on how we deduce the partitioning columns.

the reasoning is: people who call partitionBy() and also set PARTITIONPATH_FIELD_NAME today most likely pass the same columns to both. so giving PARTITIONPATH_FIELD_NAME higher precedence stays compatible for them.
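
The precedence described in this thread can be sketched in plain Scala without a Spark dependency. This is a minimal illustration, not the code from the patch: `PartitionPrecedenceSketch`, `resolvePartitionPath`, `PartitionByKey` and `PartitionPathKey` are hypothetical names, the value used for the Spark-side key is an assumption, and the partitionBy() value is modelled as a plain string rather than whatever encoding Spark actually uses.

```scala
object PartitionPrecedenceSketch {
  // Stand-in for SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY (value assumed for illustration).
  val PartitionByKey = "__partition_columns"
  // Hudi's partition path field config key.
  val PartitionPathKey = "hoodie.datasource.write.partitionpath.field"

  /** Copies the partitionBy() columns into the Hudi partition path option
    * only when the user has not set that option explicitly. */
  def resolvePartitionPath(optParams: Map[String, String]): Map[String, String] =
    optParams.get(PartitionByKey) match {
      case Some(cols) if !optParams.contains(PartitionPathKey) =>
        optParams + (PartitionPathKey -> cols)
      case _ =>
        // Either partitionBy() was not used or the user already chose a partition path.
        optParams
    }
}
```

With this rule, a write that only calls partitionBy("col2") ends up partitioned by col2, while a write that also sets the partition path field to col1 keeps col1.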

nsivabalan · 2023-02-22T20:54:49Z

@xushiyan : ready for review.

xushiyan · 2023-02-27T02:36:11Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

let's track this behavior change for next release.

xushiyan · 2023-02-27T02:37:36Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

maybe we should fail the write if these 2 options are not the same? at least we should avoid unintended writes

nsivabalan · 2023-02-28T15:44:51Z

CI is green

xushiyan · 2023-03-01T00:14:20Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

+    val keyGeneratorClass = optParams.getOrElse(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key(),
+      DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.defaultValue)


@nsivabalan how does hoodie.datasource.write.keygenerator.type compare with the key gen class in terms of precedence? should validation fail if they don't match, or do we ignore key gen type?

I don't think key gen type is recommended. there are some flows where it's not honored. So, we fixed our quick start to call this out and ask users to use key gen class only.

don't think we standardized or supported key gen type on all code paths.

NOTE: Please use hoodie.datasource.write.keygenerator.class instead of hoodie.datasource.write.keygenerator.type. The second config was introduced more recently. and will internally instantiate the correct KeyGenerator class based on the type name. The second one is intended for ease of use and is being actively worked on. We still recommend using the first config until it is marked as deprecated.

https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#key-generators
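
As a small illustration of the recommendation above, a write-options map can name the key generator through hoodie.datasource.write.keygenerator.class and leave hoodie.datasource.write.keygenerator.type unset. This is only a sketch: the object name, table name and record key column are made-up values, and `SimpleKeyGenerator` is picked as one example of a key generator class.

```scala
object KeyGenOptionsSketch {
  // Hypothetical option set; only the key generator class is configured,
  // the keygenerator.type option is deliberately left out.
  val hudiWriteOptions: Map[String, String] = Map(
    "hoodie.table.name" -> "example_table",          // made-up table name
    "hoodie.datasource.write.recordkey.field" -> "id", // made-up record key column
    "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.SimpleKeyGenerator"
  )
}
```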

nsivabalan

changes look ok to me

hudi-bot · 2023-03-03T21:53:26Z

CI report:

9ae7b06 Azure: SUCCESS

nsivabalan assigned xushiyan Feb 16, 2023

xushiyan reviewed Feb 21, 2023

nsivabalan commented Feb 22, 2023

nsivabalan force-pushed the inferPartitionFields branch from ffc4e3d to cbb0a8c February 23, 2023 23:35

xushiyan reviewed Feb 27, 2023

nsivabalan added the priority:critical label (Production degraded; pipelines stalled) Feb 27, 2023

xushiyan reviewed Feb 28, 2023

xushiyan reviewed Mar 1, 2023

xushiyan reviewed Mar 1, 2023

nsivabalan commented Mar 1, 2023

xushiyan approved these changes Mar 2, 2023

Fixing auto inferring partition from incoming df

b2d8fd7

nsivabalan force-pushed the inferPartitionFields branch from e5ed02b to b2d8fd7 March 3, 2023 15:17

nsivabalan added 2 commits March 3, 2023 07:19

Fixing method name

bbf05d3

Fixing build failures

9ae7b06

nsivabalan merged commit d40a621 into apache:master Mar 4, 2023

nsivabalan added a commit to nsivabalan/hudi that referenced this pull request Mar 18, 2023

[HUDI-5796] Adding auto inferring partition from incoming df (apache#…

b42eecc

…7951) - Fixing auto inferring partition from incoming df

fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023 (11ff63b, same commit message)

stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 20, 2023 (30bb4df, same commit message)

KnightChess pushed a commit to KnightChess/hudi that referenced this pull request Jan 2, 2024 (0470079, same commit message)
