
[HUDI-5400] Fix read issues when Hudi-FULL schema evolution is not enabled #7480

Merged
merged 7 commits into from
Dec 24, 2022

Conversation

voonhous
Member

@voonhous voonhous commented Dec 16, 2022

Change Logs

Prior to Hudi's FULL Schema Evolution (HFSE) support, Hudi relied on Avro's schema resolution to perform schema evolution.

The exhaustive list of permitted schema-changes that Avro's schema-resolution allows for can be found here:
https://avro.apache.org/docs/1.10.2/spec.html#Schema+Resolution

A summary of the permitted type changes is listed below:

Supported cast conversions:
 - Integer => Long, Float, Double, Decimal*, String*
 - Long => Float, Double, Decimal*, String*
 - Float => Double, Decimal*, String*
 - Double => Decimal*, String*
 - Decimal => Decimal*, String*
 - String => Bytes, Decimal*, Date*
 - Bytes => String
 - Date => String*

* Type conversions that are supported by HFSE, but not by native Avro schema resolution
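For illustration, the table above can be expressed as a small lookup (a toy sketch, not Hudi's actual code; the `cast_allowed` helper and lowercase type names are assumptions made for this example):

```python
# Illustrative sketch of the cast-conversion table above (not Hudi's code).
# Native Avro promotions vs. conversions only available under HFSE.
AVRO_PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}
HFSE_ONLY_PROMOTIONS = {
    "int": {"decimal", "string"},
    "long": {"decimal", "string"},
    "float": {"decimal", "string"},
    "double": {"decimal", "string"},
    "decimal": {"decimal", "string"},
    "string": {"decimal", "date"},
    "date": {"string"},
}

def cast_allowed(src: str, dst: str, hfse_enabled: bool) -> bool:
    """Return True if src -> dst is a permitted type change."""
    if src == dst:
        return True
    if dst in AVRO_PROMOTIONS.get(src, set()):
        return True
    return hfse_enabled and dst in HFSE_ONLY_PROMOTIONS.get(src, set())
```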

The current write execution flow is as follows:

  1. deduceWriterSchema checks whether the incoming schema is compatible with the table's schema (via org.apache.hudi.avro.AvroSchemaCompatibility.ReaderWriterCompatibilityChecker#calculateCompatibility)
  2. deduceWriterSchema's validation is an adaptation of Avro's schema compatibility check (org.apache.avro.SchemaCompatibility.ReaderWriterCompatibilityChecker#calculateCompatibility); if Avro permits the operation, execution is allowed to proceed
  3. As such, if an implicit schema change is compatible with Avro's schema-resolution feature, HFSE does not need to be enabled
  4. If the writer writes to a different filegroup, that filegroup is written with the new schema, while existing filegroups that are not written to retain the old schema

When reading:

  1. The same schema is used for all filegroups, since nothing is written to .schema.
  2. The Parquet reader therefore throws errors due to type mismatches when reading
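The read-path problem can be modelled with a toy example (file names, schema representation, and the `find_mismatches` helper are all hypothetical; this only illustrates why a single query schema fails against mixed filegroups):

```python
# Toy model: each filegroup keeps the schema it was written with, but the
# reader applies one query schema to every filegroup.
filegroups = [
    {"file": "fg1.parquet", "schema": {"price": "int"}},   # old schema
    {"file": "fg2.parquet", "schema": {"price": "long"}},  # evolved schema
]
query_schema = {"price": "long"}

def find_mismatches(filegroups, query_schema):
    """Return (file, column, file_type, query_type) for every column whose
    on-disk type differs from the query schema; without an implicit-cast
    step, these are exactly the reads that fail."""
    mismatches = []
    for fg in filegroups:
        for col, qtype in query_schema.items():
            ftype = fg["schema"].get(col)
            if ftype is not None and ftype != qtype:
                mismatches.append((fg["file"], col, ftype, qtype))
    return mismatches
```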

This PR fixes the Spark-read issues when implicit schema changes are made without enabling HFSE.

The scope of this fix is limited to Spark-Read + Spark-Write.

TODO: Check if issue exists in Flink reader/writer [WIP; WILL CREATE ANOTHER PR]
TODO: Implement the same changes to Spark2.4 [DONE; INCLUDED IN THIS PR]

Impact

None; no public APIs changed.

Risk level (write none, low, medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@voonhous voonhous changed the title Fix read issues when Hudi-FULL schema evolution is not enabled [HUDI-5400] Fix read issues when Hudi-FULL schema evolution is not enabled Dec 16, 2022
@xiarixiaoyao xiarixiaoyao self-assigned this Dec 16, 2022
@xiarixiaoyao
Contributor

@voonhous
Nice work, will take a look over the weekend

@xiarixiaoyao
Contributor

@voonhous
Maybe we need a parameter to control this feature; not all tables need to follow this logic

@voonhous
Member Author

voonhous commented Dec 20, 2022

> @voonhous Maybe we need a parameter to control this feature, not all tables need to follow this logic

Hmmm, CMIIW, Hudi has been relying on ASR for schema resolution since hudi-0.7. As such, I was under the impression that this should be a default behaviour.

Nonetheless, a configuration key can be introduced wherein this behaviour is enabled by default.

However, validation will need to be performed such that the choice between ASR and HFSE is mutually exclusive, i.e. if ASR is enabled, HFSE should be disabled and vice-versa. WDYT?

@@ -228,7 +228,24 @@ class Spark32PlusHoodieParquetFileFormat(private val shouldAppendPartitionValues

       SparkInternalSchemaConverter.collectTypeChangedCols(querySchemaOption.get(), mergedInternalSchema)
     } else {
-      new java.util.HashMap()
+      val implicitTypeChangeInfo: java.util.Map[Integer, Pair[DataType, DataType]] = new java.util.HashMap()
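The map being built here, ordinal to (type on file, type required by the query), can be sketched language-neutrally (a hypothetical Python analogue of the helper that was later extracted; the function name and schema representation are assumptions):

```python
def build_implicit_schema_change_info(file_schema, required_schema):
    """Sketch only: map column ordinal -> (type in the file, type required
    by the query) for every column whose type changed implicitly.
    Schemas are represented as ordered (name, type) pairs."""
    file_types = dict(file_schema)
    changes = {}
    for i, (name, req_type) in enumerate(required_schema):
        file_type = file_types.get(name)
        if file_type is not None and file_type != req_type:
            changes[i] = (file_type, req_type)
    return changes
```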
Contributor

pls extract a function

Member Author

Fixed with the latest commit

@xiarixiaoyao
Contributor

Looks good.
Maybe we also need a PR for Flink.

@voonhous
Member Author

> @voonhous Maybe we need a parameter to control this feature, not all tables need to follow this logic
>
> Hmmm, CMIIW, Hudi has been relying on ASR for schema resolution since hudi-0.7. As such, I was under the impression that this should be a default behaviour.
>
> Nonetheless, a configuration key can be introduced wherein this behaviour is enabled by default.
>
> However, validation will need to be performed such that the choice between ASR and HFSE is mutually exclusive, i.e. if ASR is enabled, HFSE should be disabled and vice-versa. WDYT?

@xiarixiaoyao I looked at the code and realised that there is no way to validate configuration values based on other configuration values.

I wanted to add an AVRO_SCHEMA_RESOLUTION_ENABLE configuration key with the description:

> Enable support for schema evolution using Avro's Schema Resolution (ASR). This configuration is mutually exclusive with Hudi's Full/Comprehensive Schema Evolution (HFSE) feature, enabled via the configuration key hoodie.schema.on.read.enable.
>
> The choice between ASR and HFSE is mutually exclusive, i.e. if ASR is enabled, HFSE should be disabled and vice-versa.
>
> HFSE will take precedence over ASR, i.e. enabling both HFSE and ASR will cause Hudi to default to HFSE for schema evolution.

Given that this is the intended behaviour, and given the lack of configuration validation, I see no benefit in introducing AVRO_SCHEMA_RESOLUTION_ENABLE.

Since SCHEMA_EVOLUTION_ENABLE will take precedence over AVRO_SCHEMA_RESOLUTION_ENABLE, I think we can rely on the former (SCHEMA_EVOLUTION_ENABLE) to determine if ASR should be used.

If SCHEMA_EVOLUTION_ENABLE is enabled, use HFSE; else, fall back to ASR.
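The fallback can be sketched as (illustrative only; the function name is hypothetical, only the config keys come from the discussion above):

```python
def pick_evolution_mode(schema_evolution_enable: bool) -> str:
    # HFSE (hoodie.schema.on.read.enable / SCHEMA_EVOLUTION_ENABLE) takes
    # precedence; otherwise fall back to Avro schema resolution (ASR).
    return "HFSE" if schema_evolution_enable else "ASR"
```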

WDYT?

@xiarixiaoyao
Contributor

> If SCHEMA_EVOLUTION_ENABLE is enabled, use HFSE, else, fallback to ASR. WDYT?

agree

@xiarixiaoyao
Contributor

@voonhous
Pls rebase the code;
once CI passes, we can merge it.

@voonhous
Member Author

> @voonhous pls rebase code, once ci pass, we can merge it.

Done!

@voonhous
Member Author

@xiarixiaoyao I have added support for Hudi tables that are schema-evolved via ASR for Spark2.4.

Can you please help to review the PR again?

Thank you!

@xiarixiaoyao
Contributor

@voonhous
Thank you for your support for Spark 2.4, although I personally think we don't need to support 2.4.
Let's extract buildImplicitSchemaChangeInfo and isDataTypeEqual to a helper class for reuse.

@voonhous
Member Author

> @voonhous Thank you for your support for spark2.4, although I personally think we don't need to support 2.4. Let's extract buildImplicitSchemaChangeInfo and isDataTypeEqual to a helper class to reuse.

Done!

@yihua yihua added schema-and-data-types priority:critical production down; pipelines stalled; Need help asap. writer-core Issues relating to core transactions/write actions labels Dec 22, 2022
@xiarixiaoyao
Contributor

@hudi-bot run azure

@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@xiarixiaoyao
Contributor

@voonhous
Thanks for your contribution

@xiarixiaoyao xiarixiaoyao merged commit 64b814e into apache:master Dec 24, 2022
@voonhous voonhous deleted the HUDI-5400 branch January 4, 2023 04:04
nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 22, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023