[HUDI-7609] Support array field type whose element type can be nullable #11006
Conversation
@@ -140,7 +141,7 @@ private static String convertGroupField(GroupType field) {
     ValidationUtils.checkArgument(field.getFieldCount() == 1, "Illegal List type: " + field);
     Type repeatedType = field.getType(0);
     if (isElementType(repeatedType, field.getName())) {
-      return arrayType(repeatedType, false);
+      return arrayType(repeatedType, true);
Can we write a simple test for it?
Okay, a UT has been added.
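For reference, a minimal sketch of what such a unit test might look like; the class name, the sample Parquet schema, the convertToSparkSchemaJson signature, and the exact JSON formatting in the assertion are assumptions, not the actual UT added in this PR:

// Parquet2SparkSchemaUtils import omitted; its package depends on where the class lives in the repo.
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertTrue;

public class TestParquet2SparkSchemaUtilsSketch {

  @Test
  void arrayElementShouldBeNullable() {
    // Legacy two-level LIST layout: the repeated field itself is the element,
    // which is the branch touched by this change.
    MessageType parquetSchema = MessageTypeParser.parseMessageType(
        "message spark_schema {\n"
            + "  optional group f_array (LIST) {\n"
            + "    repeated int32 element;\n"
            + "  }\n"
            + "}");

    String sparkSchemaJson = Parquet2SparkSchemaUtils.convertToSparkSchemaJson(parquetSchema);

    // After this patch the array element should be reported as nullable.
    assertTrue(sparkSchemaJson.contains("\"containsNull\":true"));
  }
}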
The Flink Hive catalog does not really use the code in Parquet2SparkSchemaUtils.java; should we add a UT with Spark SQL?
Let me introduce the background of this issue. Currently, when Flink creates a Hudi table containing array-type fields, the array elements default to non-nullable. However, when Spark reads data from the Hive table and writes it to the Hudi table, the Spark SQL engine assumes that array elements can be nullable, which leads to inconsistencies during field and type validation. Since the Spark SQL engine treats all fields as nullable by default, my understanding is that when creating the table in Flink we can directly mark array-type field elements as nullable.
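To make the mismatch concrete, this is roughly what the two views of the same array field look like in Spark's schema JSON; the field name and element type here are made up for illustration:

Schema derived from the Flink-created table before this patch (element not nullable):
{"name":"tags","type":{"type":"array","elementType":"string","containsNull":false},"nullable":true,"metadata":{}}

Schema the Spark SQL engine assumes when writing the same data (element nullable):
{"name":"tags","type":{"type":"array","elementType":"string","containsNull":true},"nullable":true,"metadata":{}}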
When Flink uses HoodieHiveCatalog#createTable() to create a table, it retrieves the structural information of the current table. The table properties are obtained through the SparkDataSourceTableUtils.getSparkTableProperties() method, which calls Parquet2SparkSchemaUtils.convertToSparkSchemaJson(reOrderedType) to produce the table structure information, i.e. the schema JSON that ends up stored under the spark.sql.sources.schema.numParts field and its related parts.
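For context, here is a rough sketch of the mechanism described above, i.e. how the serialized Spark schema JSON gets split across the Hive table properties. This is not Hudi's actual code; the helper name and the 4000-character chunk size (Spark's usual default) are assumptions:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits the schema JSON produced by
// Parquet2SparkSchemaUtils.convertToSparkSchemaJson(reOrderedType) into
// spark.sql.sources.schema.numParts / spark.sql.sources.schema.part.N properties.
static Map<String, String> toSparkSchemaProperties(String schemaJson) {
  int threshold = 4000; // assumed maximum length of each schema part
  List<String> parts = new ArrayList<>();
  for (int i = 0; i < schemaJson.length(); i += threshold) {
    parts.add(schemaJson.substring(i, Math.min(schemaJson.length(), i + threshold)));
  }
  Map<String, String> props = new HashMap<>();
  props.put("spark.sql.sources.schema.numParts", String.valueOf(parts.size()));
  for (int i = 0; i < parts.size(); i++) {
    props.put("spark.sql.sources.schema.part." + i, parts.get(i));
  }
  return props;
}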
So you might need to validate the spark.sql.sources.schema.numParts option set up within Hive, I guess. And since this option only affects the Spark engine, should we instead fix the table schema stored within hoodie.properties?
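For what it's worth, one way to inspect what actually landed in the Hive metastore for a given table; this is illustration only: the database and table names are made up, and a Spark SQL SHOW TBLPROPERTIES statement would work just as well:

// Given an org.apache.hadoop.hive.metastore.IMetaStoreClient instance named metaStoreClient:
org.apache.hadoop.hive.metastore.api.Table hiveTable =
    metaStoreClient.getTable("default", "hudi_array_table");
Map<String, String> params = hiveTable.getParameters();
String numParts = params.get("spark.sql.sources.schema.numParts");
String schemaPart0 = params.get("spark.sql.sources.schema.part.0");
// After this patch, the array field in the schema parts should carry "containsNull":true.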
@danny0405 could you take another look?
Change Logs
Support array field type whose element type can be nullable.
Impact
none.
Risk level (write none, low, medium or high below)
none.