[HUDI-5400] Fix read issues when Hudi-FULL schema evolution is not enabled #7480
Conversation
Hmmm, CMIIW, Hudi has been relying on ASR for schema resolution. Nonetheless, a configuration key can be introduced wherein this behaviour is enabled by default. However, validation will need to be performed such that the choice between ASR/HFSE is mutually exclusive, i.e. if ASR is enabled, HFSE should be disabled and vice versa. WDYT?
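The mutual-exclusion check proposed above could be sketched roughly as follows. This is a hypothetical illustration, not the actual Hudi configuration API; the config keys and the assumed defaults are illustrative (only `hoodie.schema.on.read.enable` resembles a real Hudi key).

```python
def validate_schema_evolution_configs(conf: dict) -> None:
    """Raise if ASR and HFSE are both enabled at once (hypothetical sketch)."""
    # Assumption: ASR is on by default, HFSE off by default; key names are illustrative.
    asr_enabled = conf.get("hoodie.avro.schema.resolution.enable", True)
    hfse_enabled = conf.get("hoodie.schema.on.read.enable", False)
    if asr_enabled and hfse_enabled:
        raise ValueError(
            "Avro schema resolution (ASR) and Hudi FULL schema evolution "
            "(HFSE) are mutually exclusive; enable only one."
        )
```

The point of the sketch is the cross-key check: neither option is invalid on its own, so the validation has to look at both values together rather than validating each key independently.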
```diff
@@ -228,7 +228,24 @@ class Spark32PlusHoodieParquetFileFormat(private val shouldAppendPartitionValues
       SparkInternalSchemaConverter.collectTypeChangedCols(querySchemaOption.get(), mergedInternalSchema)
     } else {
-      new java.util.HashMap()
+      val implicitTypeChangeInfo: java.util.Map[Integer, Pair[DataType, DataType]] = new java.util.HashMap()
```
pls extract a function
Fixed with the latest commit
Looks good.
@xiarixiaoyao I looked at the code and realised that there is no way to validate configuration values based on other configuration values. I wanted to add a
Given that this is the intended behaviour, and given the lack of configuration validation, I see no benefit in introducing it. WDYT?
@voonhous
Done!
@xiarixiaoyao I have added support for Hudi tables that are schema-evolved via ASR for Spark2.4. Can you please help to review the PR again? Thank you!
@voonhous
Done!
@hudi-bot run azure
@voonhous
Change Logs
Prior to Hudi's FULL schema evolution (HFSE) support, Hudi relied on Avro's schema resolution (ASR) to perform schema evolution.
The exhaustive list of permitted schema-changes that Avro's schema-resolution allows for can be found here:
https://avro.apache.org/docs/1.10.2/spec.html#Schema+Resolution
A summary of the permitted type promotions: int → long, float or double; long → float or double; float → double; string ↔ bytes.
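The promotion rules from the linked Avro spec can be restated compactly as a lookup table. This is an illustrative sketch (the table and function below are not part of Hudi or Avro's API); the keys are Avro primitive type names, and each maps to the set of reader types that data written with that type can be resolved to.

```python
# Writer type -> reader types it may be promoted to, per Avro schema resolution.
AVRO_PROMOTIONS = {
    "int": {"int", "long", "float", "double"},
    "long": {"long", "float", "double"},
    "float": {"float", "double"},
    "double": {"double"},
    "string": {"string", "bytes"},
    "bytes": {"bytes", "string"},
}

def is_promotable(writer_type: str, reader_type: str) -> bool:
    """True if data written as writer_type can be read as reader_type."""
    # Types not in the table only resolve to themselves.
    return reader_type in AVRO_PROMOTIONS.get(writer_type, {writer_type})
```

Note that the promotions are not symmetric for numerics: an int column can be read as long, but a long column cannot be narrowed back to int, which is exactly the kind of implicit change this PR has to account for on the read path.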
The current write execution flow is as such:
When reading:
This PR fixes the Spark-read issues when implicit schema changes are made without enabling HFSE.
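The core of the read-path fix (visible in the diff above) is building a map from column ordinal to the pair of differing types, so the reader can cast file values to the query type on the fly. A simplified sketch of that idea, with illustrative names (the real code is Scala, in `SparkInternalSchemaConverter.collectTypeChangedCols`):

```python
def collect_type_changed_cols(file_schema: list, query_schema: list) -> dict:
    """Map column index -> (file_type, query_type) for columns whose type changed.

    file_schema / query_schema: lists of (name, type) tuples in column order.
    Hypothetical simplification of the Hudi/Spark logic, for illustration only.
    """
    query_types = {name: typ for name, typ in query_schema}
    changes = {}
    for idx, (name, file_type) in enumerate(file_schema):
        query_type = query_types.get(name)
        # Record only columns present in both schemas whose types differ.
        if query_type is not None and query_type != file_type:
            changes[idx] = (file_type, query_type)
    return changes
```

With an ASR-style promotion such as `a: int` in an old file but `a: long` in the query schema, the map would contain `{0: ("int", "long")}`, telling the reader which columns need an implicit cast.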
The scope of this fix is limited to Spark-Read + Spark-Write.
TODO: Check if issue exists in Flink reader/writer [WIP; WILL CREATE ANOTHER PR]
TODO: Implement the same changes to Spark2.4 [DONE; INCLUDED IN THIS PR]
Impact
None; no public APIs changed.
Risk level (write none, low, medium or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
For any user-facing change, create a Jira ticket, attach the ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist