Skip to content

[HUDI-5796] Adding auto inferring partition from incoming df#7951

Merged
nsivabalan merged 3 commits intoapache:masterfrom
nsivabalan:inferPartitionFields
Mar 4, 2023
Merged

[HUDI-5796] Adding auto inferring partition from incoming df#7951
nsivabalan merged 3 commits intoapache:masterfrom
nsivabalan:inferPartitionFields

Conversation

@nsivabalan
Copy link
Contributor

Change Logs

If someone tries to write to hudi in following syntax, we should infer the partition automatically if hoodie's partition path field is not explicitly set.

df.write.partitionBy("col1").format("hudi").options(...).save()

Impact

Improves usability of hudi.

Risk level (write none, low medium or high below)

low.

Documentation Update

We might need to enhance our quick start to call it out.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

// translate the api partitionBy of spark DataFrameWriter to PARTITIONPATH_FIELD
if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)) {
// we should set hoodie's partition path only if its not set by the user.
if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be backwards incompatible change. but not sure if previous behavior was supported by mistake.
for eg, if some sets hoodie partiiton path field to col1, but incoming df had col2, prior to this patch, col2 will be considered as partitioning col for hudi. but after this patch, it will be col1.
only if user did not explicitly set hoodie partition path config, we will use col2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's track this behavior change for next release.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should fail the write if these 2 options are not the same? at least we should avoid unintended writes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

synced up directly. this should be ok behavior. may be we can add to our faq on how we deduce the partitioning columns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reasoning is: if people use partitionBy() and set PARTITIONPATH_FIELD_NAME right now, they are likely to be matched. so now we make PARTITIONPATH_FIELD_NAME higher precedence, it'll be compatible.

@nsivabalan
Copy link
Contributor Author

@xushiyan : ready for review.

// translate the api partitionBy of spark DataFrameWriter to PARTITIONPATH_FIELD
if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)) {
// we should set hoodie's partition path only if its not set by the user.
if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's track this behavior change for next release.

// translate the api partitionBy of spark DataFrameWriter to PARTITIONPATH_FIELD
if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)) {
// we should set hoodie's partition path only if its not set by the user.
if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should fail the write if these 2 options are not the same? at least we should avoid unintended writes

@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Feb 27, 2023
@nsivabalan
Copy link
Contributor Author

CI is green
image

Comment on lines 302 to 303
val keyGeneratorClass = optParams.getOrElse(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key(),
DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.defaultValue)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsivabalan how do we consider hoodie.datasource.write.keygenerator.type compare to key gen class in terms of precedence? it should fail a validation if unmatched or we ignore key gen type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think key gen type is recommended. there are some flows where its not honored. So, we fixed out quick start to call out and ask users to use key gen class only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think we standardized or supported key gen type on all code paths.

NOTE: Please use hoodie.datasource.write.keygenerator.class instead of hoodie.datasource.write.keygenerator.type. The second config was introduced more recently. and will internally instantiate the correct KeyGenerator class based on the type name. The second one is intended for ease of use and is being actively worked on. We still recommend using the first config until it is marked as deprecated.

https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#key-generators

Copy link
Contributor Author

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes looks ok to me

@nsivabalan nsivabalan force-pushed the inferPartitionFields branch from e5ed02b to b2d8fd7 Compare March 3, 2023 15:17
@hudi-bot
Copy link
Collaborator

hudi-bot commented Mar 3, 2023

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan merged commit d40a621 into apache:master Mar 4, 2023
nsivabalan added a commit to nsivabalan/hudi that referenced this pull request Mar 18, 2023
…7951)

- Fixing auto inferring partition from incoming df
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…7951)

- Fixing auto inferring partition from incoming df
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 20, 2023
KnightChess pushed a commit to KnightChess/hudi that referenced this pull request Jan 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:critical Production degraded; pipelines stalled

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

3 participants

Comments