[HUDI-7378] Fix Spark SQL DML with custom key generator #10615
Conversation
- Is there any change to partitions in `hoodie.properties`? Do we now write it as `field1:type1,field2:type2` when using CustomKeyGenerator?
- Thanks for adding extensive tests. Can you please look into the failures? They seem related to the patch.
Mistakenly approved earlier.
There is no change to the table configs in `hoodie.properties`.

Failures for Spark 3.2 and above are fixed. I'm looking into failures for older Spark versions.
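To make the two formats in the exchange above concrete: the write config `hoodie.datasource.write.partitionpath.field` uses `field:type` pairs for CustomKeyGenerator, while the table config `hoodie.table.partition.fields` keeps only the bare field names. A minimal, illustrative Python sketch (not Hudi's actual code; the function name is made up) of deriving the table-config form from the write-config form:

```python
# Illustrative only: strip the ":type" suffixes that CustomKeyGenerator
# expects in hoodie.datasource.write.partitionpath.field to get the bare
# field names stored in hoodie.table.partition.fields.
def partition_field_names(write_config_value: str) -> str:
    names = [part.split(":", 1)[0] for part in write_config_value.split(",")]
    return ",".join(names)

print(partition_field_names("ts:timestamp,segment:simple"))  # prints "ts,segment"
```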
I like that this has the benefit of not breaking tables with their existing `hoodie.table.recordkey.fields`, but I am curious about any other approaches you thought about. From your test code, it looks like we can't use `partitioned by (dt:int,idk:string)` when creating the table. I don't think that should block this PR from landing, but I think we should add an example to the SQL documentation: https://hudi.apache.org/docs/sql_ddl#create-partitioned-table

Also, I think this change will help us fix partition pruning, which currently does not work with timestamp keygen: https://issues.apache.org/jira/browse/HUDI-6614
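For the documentation example suggested above, a hedged sketch of what creating such a table could look like under this PR. The table name, schema, and property values follow the examples elsewhere in this thread; treat the exact property keys as assumptions to verify against the Hudi SQL DDL docs:

```sql
-- Sketch only: partition columns are declared with bare names; the
-- field:type mapping for CustomKeyGenerator goes into the write config
-- via table properties.
CREATE TABLE h0 (
  id INT,
  name STRING,
  price DECIMAL(5, 1),
  ts INT,
  segment STRING
) USING hudi
PARTITIONED BY (ts, segment)
TBLPROPERTIES (
  primaryKey = 'id',
  preCombineField = 'name',
  type = 'cow',
  'hoodie.datasource.write.keygenerator.class' = 'org.apache.hudi.keygen.CustomKeyGenerator',
  'hoodie.datasource.write.partitionpath.field' = 'ts:timestamp,segment:simple'
);
```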
```scala
// in hoodie.properties stores "col,ts".
// The "params" here may only contain the write config of partition path field,
// so we need to pass in the validated key generator class name.
val validatedKeyGenClassName = if (tableConfigKeyGen != null) {
```
So when `hoodie.datasource.write.partitionpath.field` is set, we don't set `hoodie.table.partition.fields`?
Only the `hoodie.datasource.write.partitionpath.field` takes effect in the writer path. Before the fix, the write config is automatically set by the SQL writer based on the value of the table config `hoodie.table.partition.fields`.
```scala
} else {
  val writeConfigPartitionField = catalogTable.catalogProperties.get(PARTITIONPATH_FIELD.key())
  val keyGenClass = ReflectionUtils.getClass(tableConfigKeyGeneratorClassName)
  if (classOf[CustomKeyGenerator].equals(keyGenClass)
```
Do we want to make this cover any classes that extend CustomKeyGenerator as well?
The assumption is that these key generators should not be extended. We should keep it this way for now.
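The exchange above hinges on exact class matching. A small illustrative Python sketch (class names here are stand-ins, not Hudi's real classes) contrasting an exact-class check, as the patch does with `classOf[CustomKeyGenerator].equals(keyGenClass)`, against a subclass-aware check:

```python
# Stand-in classes for illustration only.
class CustomKeyGenerator:
    pass

class ExtendedCustomKeyGenerator(CustomKeyGenerator):
    pass

def is_custom_exact(key_gen_class: type) -> bool:
    # Mirrors classOf[CustomKeyGenerator].equals(keyGenClass): subclasses do NOT match.
    return key_gen_class is CustomKeyGenerator

def is_custom_or_subclass(key_gen_class: type) -> bool:
    # The alternative raised in the review: subclasses also match.
    return issubclass(key_gen_class, CustomKeyGenerator)

print(is_custom_exact(ExtendedCustomKeyGenerator))        # prints False
print(is_custom_or_subclass(ExtendedCustomKeyGenerator))  # prints True
```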
```scala
  }
}

private def testInsertInto1(tableName: String,
```
This should be named better
Or can you at least add some more comments on what the helper functions are doing?
I renamed the methods. Let me know if it looks good.
```scala
if (StringUtils.isNullOrEmpty(tableConfigKeyGeneratorClassName)) {
  partitionFieldNamesWithoutKeyGenType
} else {
  val writeConfigPartitionField = catalogTable.catalogProperties.get(PARTITIONPATH_FIELD.key())
```
Is `HoodieCatalogTable` persisted between sessions? When you add an existing Hudi table in Spark SQL, you usually only need the column names, right?
Yes, the table properties associated with `HoodieCatalogTable` are persisted across Spark sessions. The persisted partition field write config `hoodie.datasource.write.partitionpath.field` is a custom config outside Spark, which is used by Hudi logic only.
As an example, the table looks like this in the Spark catalog:

```
spark-sql (default)> DESCRIBE TABLE formatted h0;
24/04/12 13:59:53 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
_hoodie_commit_time     string
_hoodie_commit_seqno    string
_hoodie_record_key      string
_hoodie_partition_path  string
_hoodie_file_name       string
id                      int
name                    string
price                   decimal(5,1)
ts                      int
segment                 string
# Partition Information
# col_name              data_type   comment
ts                      int
segment                 string

# Detailed Table Information
Catalog                 spark_catalog
Database                default
Table                   h0
Owner                   ethan
Created Time            Fri Apr 12 13:58:05 PDT 2024
Last Access             UNKNOWN
Created By              Spark 3.5.1
Type                    EXTERNAL
Provider                hudi
Table Properties        [hoodie.datasource.write.partitionpath.field=ts:timestamp,segment:simple, preCombineField=name, primaryKey=id, provider=hudi, type=cow]
Location                file:/private/var/folders/60/wk8qzx310fd32b2dp7mhzvdc0000gn/T/spark-4ac6fb47-e20b-4679-a668-e28238ec3e05/h0
Serde Library           org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat             org.apache.hudi.hadoop.HoodieParquetInputFormat
OutputFormat            org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Time taken: 1.694 seconds, Fetched 30 row(s)
```
Good point. I tried
Right.
Change Logs
Before this PR, Spark SQL DML does not work for a partitioned table with CustomKeyGenerator. The CustomKeyGenerator requires the write config of the partition fields (`hoodie.datasource.write.partitionpath.field`) to be in the format of `field:type` (e.g., `ts:timestamp,segment:simple`). However, the table config `hoodie.table.partition.fields` in `hoodie.properties` only saves the field names (e.g., `ts,segment`). The Spark SQL writer automatically picks up the partition field names from `hoodie.table.partition.fields` in `hoodie.properties` to set `hoodie.datasource.write.partitionpath.field`, which does not work with the CustomKeyGenerator (errors attached). And `hoodie.datasource.write.partitionpath.field` cannot be overwritten from either the SQL statement or the catalog table properties (stored in `catalogProperties` of `HoodieCatalogTable`).

This PR fixes Spark SQL DML with CustomKeyGenerator by allowing `ALTER TABLE` or `CREATE TABLE` to set `hoodie.datasource.write.partitionpath.field`, overwriting the write config of the partition fields so that the CustomKeyGenerator functions properly. The table config `hoodie.table.partition.fields` in `hoodie.properties` stays in the same format. The syntax to overwrite the write config of the partition fields looks like:

After adding the write config of partition path fields to the catalog table, it looks like this in the Spark catalog:
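The original syntax example is not reproduced here. As a hedged sketch based on the description above (the property value follows the `ts:timestamp,segment:simple` example used in this thread), setting the write config at the catalog table level could look like:

```sql
-- Sketch only: overwrite the partition-path write config for an
-- existing table via catalog table properties.
ALTER TABLE h0 SET TBLPROPERTIES (
  'hoodie.datasource.write.partitionpath.field' = 'ts:timestamp,segment:simple'
);
```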
`hoodie.properties` of the table:

Conflict of the partition path field write config (`ts:timestamp,segment:simple`) with the field names (`ts,segment`) for the custom key generator:

Exception of the CustomKeyGenerator failure:
Impact

Fixes Spark SQL DML with CustomKeyGenerator. A new test class, `TestSparkSqlWithCustomKeyGenerator`, is added to test all DML with CustomKeyGenerator, and the case of setting `hoodie.datasource.write.partitionpath.field` with `ALTER TABLE .. SET TBLPROPERTIES (..)` or `CREATE TABLE .. TBLPROPERTIES (..)` at the Spark catalog table level.

Risk level
low
Documentation Update
N/A
Contributor's checklist