
[HUDI-7609] Support array field type whose element type can be nullable #11006

Merged (3 commits) Sep 19, 2024

Conversation

@empcl (Contributor) commented Apr 12, 2024:

Change Logs

Support array field type whose element type can be nullable.

Impact

none.

Risk level (write none, low, medium, or high below)

none.

@github-actions github-actions bot added the size:XS PR with lines of changes in <= 10 label Apr 12, 2024
@empcl empcl changed the title Support array field type whose element type can be nullable [HUDI-7609] Support array field type whose element type can be nullable Apr 12, 2024
```diff
@@ -140,7 +141,7 @@ private static String convertGroupField(GroupType field) {
     ValidationUtils.checkArgument(field.getFieldCount() == 1, "Illegal List type: " + field);
     Type repeatedType = field.getType(0);
     if (isElementType(repeatedType, field.getName())) {
-      return arrayType(repeatedType, false);
+      return arrayType(repeatedType, true);
```
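For context, a hedged sketch of what flipping the second argument changes. The helper below is hypothetical and only mirrors the JSON shape of a Spark array type (which carries a `containsNull` flag); it is not the real `Parquet2SparkSchemaUtils` implementation.

```java
// Minimal sketch of the Spark-style array schema JSON affected by this change.
// arrayType() here is a hypothetical stand-in mirroring the JSON shape only.
public class ArrayTypeJsonSketch {
  static String arrayType(String elementTypeJson, boolean containsNull) {
    return "{\"type\":\"array\",\"elementType\":" + elementTypeJson
        + ",\"containsNull\":" + containsNull + "}";
  }

  public static void main(String[] args) {
    // Before the PR the flag was false; after the PR it is true.
    System.out.println(arrayType("\"integer\"", true));
    // prints {"type":"array","elementType":"integer","containsNull":true}
  }
}
```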
Contributor:
Can we write a simple test for it?

Contributor Author:

Okay, a UT has been added.

Contributor:

The Flink Hive catalog does not actually use Parquet2SparkSchemaUtils.java; should we add a UT with Spark SQL?

Contributor Author:

Let me give some background on this question. Currently, when Flink creates a Hudi table containing array-type fields, the array elements default to non-nullable. However, when Spark reads data from the Hive table and writes it to the Hudi table, the Spark SQL engine assumes array elements can be null, which leads to inconsistencies during field and type validation. Since Spark SQL treats all fields as nullable by default, my understanding is that when creating a table in Flink we can directly mark array-type field elements as nullable.
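The mismatch described above can be sketched as two schema JSON strings that differ only in the `containsNull` flag. This is an illustrative sketch, not Hudi's actual validation code; the helper name is hypothetical.

```java
// Hypothetical sketch of the validation mismatch: Flink used to write
// containsNull=false into the Spark schema JSON, while Spark SQL treats array
// elements as nullable (containsNull=true) by default, so the schemas disagree.
public class NullabilityMismatchSketch {
  static String arraySchemaJson(boolean containsNull) {
    return "{\"type\":\"array\",\"elementType\":\"string\",\"containsNull\":" + containsNull + "}";
  }

  public static void main(String[] args) {
    String writtenByFlink = arraySchemaJson(false); // before this PR
    String expectedBySpark = arraySchemaJson(true); // Spark SQL default
    System.out.println(writtenByFlink.equals(expectedBySpark)); // prints false
  }
}
```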

Contributor Author:

When Flink creates a table via HoodieHiveCatalog#createTable(), it collects the table's structural information. The table properties are obtained through the SparkDataSourceTableUtils.getSparkTableProperties() method, which calls Parquet2SparkSchemaUtils.convertToSparkSchemaJson(reOrderedType) to produce the schema JSON that is stored under the spark.sql.sources.schema.* table properties (spark.sql.sources.schema.numParts plus the schema.part.N chunks).
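The spark.sql.sources.schema.* properties mentioned here follow Spark's convention of splitting the serialized schema JSON into fixed-size chunks. A hedged sketch of that splitting (the helper name and chunk size are illustrative, not Hudi's code; the property keys are the real Spark ones):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of how a schema JSON is split into numParts chunks
// stored as spark.sql.sources.schema.part.N table properties.
public class SchemaPartsSketch {
  static Map<String, String> toTableProperties(String schemaJson, int chunkSize) {
    Map<String, String> props = new LinkedHashMap<>();
    int numParts = (schemaJson.length() + chunkSize - 1) / chunkSize;
    props.put("spark.sql.sources.schema.numParts", String.valueOf(numParts));
    for (int i = 0; i < numParts; i++) {
      int start = i * chunkSize;
      props.put("spark.sql.sources.schema.part." + i,
          schemaJson.substring(start, Math.min(start + chunkSize, schemaJson.length())));
    }
    return props;
  }

  public static void main(String[] args) {
    // A 17-character JSON split into chunks of 8 needs 3 parts.
    Map<String, String> p = toTableProperties("{\"type\":\"struct\"}", 8);
    System.out.println(p.get("spark.sql.sources.schema.numParts")); // prints 3
  }
}
```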

@danny0405 (Contributor) commented Apr 16, 2024:
So you might need to validate the spark.sql.sources.schema.numParts option set up within Hive, I guess. And since this option only affects the Spark engine, should we instead fix the table schema stored within hoodie.properties?

@github-actions github-actions bot added size:S PR with lines of changes in (10, 100] and removed size:XS PR with lines of changes in <= 10 labels Apr 15, 2024
@hudi-bot commented:
CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua (Contributor) left a comment:

@danny0405 could you take another look?

@danny0405 danny0405 merged commit ca568be into apache:master Sep 19, 2024
39 of 40 checks passed
Labels
size:S PR with lines of changes in (10, 100]
4 participants