
[HUDI-7609] Support array field type whose element type can be nullable #11006

Merged (3 commits) Sep 19, 2024

Conversation

@empcl (Contributor) commented Apr 12, 2024:

Change Logs

Support array field type whose element type can be nullable.

Impact

none.

Risk level (write none, low, medium, or high below)

none.

@github-actions github-actions bot added the size:XS PR with lines of changes in <= 10 label Apr 12, 2024
@empcl empcl changed the title Support array field type whose element type can be nullable [HUDI-7609] Support array field type whose element type can be nullable Apr 12, 2024
```diff
@@ -140,7 +141,7 @@ private static String convertGroupField(GroupType field) {
     ValidationUtils.checkArgument(field.getFieldCount() == 1, "Illegal List type: " + field);
     Type repeatedType = field.getType(0);
     if (isElementType(repeatedType, field.getName())) {
-      return arrayType(repeatedType, false);
+      return arrayType(repeatedType, true);
```
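For context, a hedged sketch of what flipping the second argument changes. The helper below is hypothetical and only mirrors the JSON shape of a Spark array type (which carries a `containsNull` flag); it is not the real `Parquet2SparkSchemaUtils` implementation.

```java
// Minimal sketch of the Spark-style array schema JSON affected by this change.
// arrayType() here is a hypothetical stand-in mirroring the JSON shape only.
public class ArrayTypeJsonSketch {
  static String arrayType(String elementTypeJson, boolean containsNull) {
    return "{\"type\":\"array\",\"elementType\":" + elementTypeJson
        + ",\"containsNull\":" + containsNull + "}";
  }

  public static void main(String[] args) {
    // Before the PR the flag was false; after the PR it is true.
    System.out.println(arrayType("\"integer\"", true));
    // prints {"type":"array","elementType":"integer","containsNull":true}
  }
}
```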
Contributor:
Can we write a simple test for it?

Contributor Author:

Okay, a UT has been added.

Contributor:

The Flink Hive catalog does not actually use Parquet2SparkSchemaUtils.java; should we add a UT with Spark SQL?

Contributor Author:

Let me give some background on this question. Currently, when Flink creates a Hudi table containing array-type fields, the array elements default to non-nullable. However, when Spark reads data from the Hive table and writes it to the Hudi table, the Spark SQL engine assumes array elements can be null, which leads to inconsistencies during field and type validation. Since Spark SQL treats all fields as nullable by default, my understanding is that when creating a table in Flink we can directly mark array-type field elements as nullable.
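The mismatch described above can be sketched as two schema JSON strings that differ only in the `containsNull` flag. This is an illustrative sketch, not Hudi's actual validation code; the helper name is hypothetical.

```java
// Hypothetical sketch of the validation mismatch: Flink used to write
// containsNull=false into the Spark schema JSON, while Spark SQL treats array
// elements as nullable (containsNull=true) by default, so the schemas disagree.
public class NullabilityMismatchSketch {
  static String arraySchemaJson(boolean containsNull) {
    return "{\"type\":\"array\",\"elementType\":\"string\",\"containsNull\":" + containsNull + "}";
  }

  public static void main(String[] args) {
    String writtenByFlink = arraySchemaJson(false); // before this PR
    String expectedBySpark = arraySchemaJson(true); // Spark SQL default
    System.out.println(writtenByFlink.equals(expectedBySpark)); // prints false
  }
}
```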

Contributor Author:

When Flink creates a table via HoodieHiveCatalog#createTable(), it collects the table's structural information. The table properties are obtained through the SparkDataSourceTableUtils.getSparkTableProperties() method, which calls Parquet2SparkSchemaUtils.convertToSparkSchemaJson(reOrderedType) to produce the schema JSON that is stored under the spark.sql.sources.schema.* table properties (spark.sql.sources.schema.numParts plus the schema.part.N chunks).
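The spark.sql.sources.schema.* properties mentioned here follow Spark's convention of splitting the serialized schema JSON into fixed-size chunks. A hedged sketch of that splitting (the helper name and chunk size are illustrative, not Hudi's code; the property keys are the real Spark ones):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of how a schema JSON is split into numParts chunks
// stored as spark.sql.sources.schema.part.N table properties.
public class SchemaPartsSketch {
  static Map<String, String> toTableProperties(String schemaJson, int chunkSize) {
    Map<String, String> props = new LinkedHashMap<>();
    int numParts = (schemaJson.length() + chunkSize - 1) / chunkSize;
    props.put("spark.sql.sources.schema.numParts", String.valueOf(numParts));
    for (int i = 0; i < numParts; i++) {
      int start = i * chunkSize;
      props.put("spark.sql.sources.schema.part." + i,
          schemaJson.substring(start, Math.min(start + chunkSize, schemaJson.length())));
    }
    return props;
  }

  public static void main(String[] args) {
    // A 17-character JSON split into chunks of 8 needs 3 parts.
    Map<String, String> p = toTableProperties("{\"type\":\"struct\"}", 8);
    System.out.println(p.get("spark.sql.sources.schema.numParts")); // prints 3
  }
}
```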

@danny0405 (Contributor) commented Apr 16, 2024:
So you might need to validate the spark.sql.sources.schema.numParts option set up within Hive, I guess. And since this option only affects the Spark engine, should we instead fix the table schema stored within hoodie.properties?

@github-actions github-actions bot added size:S PR with lines of changes in (10, 100] and removed size:XS PR with lines of changes in <= 10 labels Apr 15, 2024
@hudi-bot commented:
CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua (Contributor) left a comment:

@danny0405 could you take another look?

@danny0405 danny0405 merged commit ca568be into apache:master Sep 19, 2024
39 of 40 checks passed
Labels
size:S PR with lines of changes in (10, 100]
4 participants