[HUDI-5796] Adding auto inferring partition from incoming df #7951
nsivabalan merged 3 commits into apache:master
| // translate the api partitionBy of spark DataFrameWriter to PARTITIONPATH_FIELD | ||
| if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)) { | ||
| // we should set hoodie's partition path only if its not set by the user. | ||
| if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY) |
This might be a backwards-incompatible change, but I'm not sure whether the previous behavior was supported by mistake.
For example, if someone sets the hoodie partition path field to col1 but the incoming df had col2 (via partitionBy), then prior to this patch col2 was used as the partitioning column for hudi; after this patch it will be col1.
We will use col2 only if the user did not explicitly set the hoodie partition path config.
let's track this behavior change for next release.
maybe we should fail the write if these 2 options are not the same? at least we should avoid unintended writes
Synced up directly. This should be OK behavior. Maybe we can add a note to our FAQ on how we deduce the partitioning columns.
The reasoning is: if people currently use partitionBy() and also set PARTITIONPATH_FIELD_NAME, the two values are likely to match. So giving PARTITIONPATH_FIELD_NAME higher precedence stays compatible for them.
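The precedence agreed on in this thread can be sketched in plain Scala, without Spark or Hudi on the classpath. This is only an illustration, not the code in HoodieSparkSqlWriter: the object and method names are hypothetical, the two key strings are assumed stand-ins for PARTITIONPATH_FIELD and PARTITIONING_COLUMNS_KEY, and the partitionBy() columns are treated as an already-decoded string.

```scala
object PartitionPathPrecedenceSketch {
  // Assumed key strings, for illustration only.
  val PartitionPathFieldKey = "hoodie.datasource.write.partitionpath.field"
  val PartitionByColumnsKey = "__partition_columns"

  /** Fills in the partition path field from the partitionBy() columns,
    * but only when the user has not set the partition path field explicitly. */
  def resolve(optParams: Map[String, String]): Map[String, String] =
    (optParams.get(PartitionPathFieldKey), optParams.get(PartitionByColumnsKey)) match {
      // Not set by the user, partitionBy() present: infer it.
      case (None, Some(cols)) => optParams + (PartitionPathFieldKey -> cols)
      // Explicit user config wins, or there is nothing to infer from.
      case _ => optParams
    }
}
```

With this sketch, a write that sets the partition path field to col1 and calls partitionBy("col2") resolves to col1, while a write that only calls partitionBy("col2") resolves to col2.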
@xushiyan : ready for review.
ffc4e3d to
cbb0a8c
Compare
| val keyGeneratorClass = optParams.getOrElse(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key(), | ||
| DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.defaultValue) |
@nsivabalan how does hoodie.datasource.write.keygenerator.type compare to the key gen class in terms of precedence? Should validation fail if they don't match, or do we ignore key gen type?
I don't think key gen type is recommended. There are some flows where it's not honored. So we fixed our quick start to call this out and ask users to use key gen class only.
don't think we standardized or supported key gen type on all code paths.
NOTE: Please use hoodie.datasource.write.keygenerator.class instead of hoodie.datasource.write.keygenerator.type. The second config was introduced more recently and will internally instantiate the correct KeyGenerator class based on the type name. The second one is intended for ease of use and is being actively worked on. We still recommend using the first config until it is marked as deprecated.
https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#key-generators
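Following the recommendation above, a write would carry only the key generator class config. The sketch below mirrors the getOrElse lookup from the snippet under review using a plain Map. The object name is hypothetical, and the default class name is an assumption for illustration; the real default lives in DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.

```scala
object KeyGenOptionsSketch {
  // Config key named in the note above.
  val KeyGenClassKey = "hoodie.datasource.write.keygenerator.class"
  // Assumed default, for illustration only.
  val AssumedDefaultKeyGen = "org.apache.hudi.keygen.SimpleKeyGenerator"

  /** Reads the key generator class from the write options, falling back to
    * the assumed default. The keygenerator.type config is not consulted here. */
  def keyGeneratorClass(optParams: Map[String, String]): String =
    optParams.getOrElse(KeyGenClassKey, AssumedDefaultKeyGen)
}
```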
nsivabalan left a comment
changes look ok to me
e5ed02b to
b2d8fd7
Compare
…7951) - Fixing auto inferring partition from incoming df

Change Logs
If someone writes to hudi using the Spark DataFrameWriter partitionBy() API, we should infer the partition automatically if hoodie's partition path field is not explicitly set.
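The original example did not survive in this description, so the following is only a sketch of the kind of write being discussed. It assumes Spark and the Hudi Spark bundle are on the classpath. The object name, table name, column names, and base path are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object PartitionByWriteSketch {
  // Writes df to a Hudi table without setting the partition path field;
  // the partition column is expected to be inferred from partitionBy().
  def write(df: DataFrame, basePath: String): Unit = {
    df.write
      .format("hudi")
      .option("hoodie.table.name", "example_table")              // hypothetical table name
      .option("hoodie.datasource.write.recordkey.field", "id")   // hypothetical column
      .option("hoodie.datasource.write.precombine.field", "ts")  // hypothetical column
      .partitionBy("col2")                                       // hypothetical partition column
      .mode(SaveMode.Append)
      .save(basePath)
  }
}
```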
Impact
Improves usability of hudi.
Risk level (write none, low, medium or high below)
low.
Documentation Update
We might need to enhance our quick start to call it out.