[HUDI-7073] Fix schema projection in file group reader-based parquet file format#10047

Merged
yihua merged 1 commit into apache:master from linliu-code:defult-configs-100-beta
Nov 10, 2023

Conversation

@linliu-code
Collaborator

@linliu-code linliu-code commented Nov 10, 2023

Change Logs

This PR fixes schema projection in the file group reader-based parquet file format. When that file format is turned on for MOR snapshot queries, a few tests throw the following exception even though ts exists in the schema:

2023-11-09T21:08:42.7345689Z - Test Call show_logfile_metadata Procedure *** FAILED ***
2023-11-09T21:08:42.7347213Z   java.lang.IllegalArgumentException: Field ts does not exist in the data schema
2023-11-09T21:08:42.7349324Z   at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$generateRequiredSchemaWithMandatory$4(HoodieFileGroupReaderBasedParquetFileFormat.scala:277)
2023-11-09T21:08:42.7407685Z   at scala.Option.getOrElse(Option.scala:189)
2023-11-09T21:08:42.7409315Z   at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$generateRequiredSchemaWithMandatory$3(HoodieFileGroupReaderBasedParquetFileFormat.scala:277)

  • To fix the schema projection, we now search for fields in the table schema. Previously we searched only the data
    schema for a field, but the field can also be contained in the partition schema. This PR adds that lookup.
  • Update the manually created schema to match the one created through Spark SQL.
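
The lookup described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `Field`, `Schema`, `SchemaProjectionSketch`, and `findField` are hypothetical stand-ins (the real code works with Spark schema types), and the error message wording is an assumption. It only shows the idea of checking the data schema first and falling back to the partition schema before failing.

```scala
// Hypothetical, simplified stand-ins for Spark's StructField / StructType,
// used only to keep this sketch self-contained.
final case class Field(name: String, dataType: String)

final case class Schema(fields: Seq[Field]) {
  // Return the first field with the given name, if any.
  def find(name: String): Option[Field] = fields.find(_.name == name)
}

object SchemaProjectionSketch {
  // Look for a mandatory field in the data schema first, then fall back to
  // the partition schema. Fail only if neither schema contains the field.
  def findField(name: String, dataSchema: Schema, partitionSchema: Schema): Field =
    dataSchema.find(name)
      .orElse(partitionSchema.find(name))
      .getOrElse(throw new IllegalArgumentException(
        s"Field $name does not exist in the table schema")) // assumed wording
}
```

With a lookup of this shape, a field such as ts that lives only in the partition schema would be resolved instead of triggering the IllegalArgumentException shown in the log above.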

Impact

Bug fix.

Risk level

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@linliu-code linliu-code changed the title from "Defult configs 100 beta - Lin" to "[Duplicate] Defult configs 100 beta - Lin" Nov 10, 2023
@linliu-code linliu-code force-pushed the defult-configs-100-beta branch from edfa00f to a2e58e5 November 10, 2023 05:20
@linliu-code linliu-code changed the title from "[Duplicate] Defult configs 100 beta - Lin" to "[Minor] Defult configs 100 beta - Lin" Nov 10, 2023
@linliu-code linliu-code changed the title from "[Minor] Defult configs 100 beta - Lin" to "[MINOR] Defult configs 100 beta - Lin" Nov 10, 2023
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua yihua changed the title from "[MINOR] Defult configs 100 beta - Lin" to "[HUDI-7073] Fix schema projection in file group reader-based parquet file format" Nov 10, 2023
@yihua yihua force-pushed the defult-configs-100-beta branch 2 times, most recently from 7e88851 to 5a2e080 November 10, 2023 13:56
…file format

1. Previously we only searched the data schema for a field, but
it can also be contained in the partition schema. We add
the logic to search for fields in the table schema.

2. Update the manually created schema to match the one created
through spark sql.
@yihua yihua force-pushed the defult-configs-100-beta branch from 5a2e080 to c3f6c29 November 10, 2023 13:56
@yihua yihua added the priority:blocker (Production down; release blocker) and release-1.0.0 labels Nov 10, 2023
@yihua
Contributor

yihua commented Nov 10, 2023

Tests pass based on #10044 which cherry-picks the change in this PR.

@yihua yihua merged commit 7708270 into apache:master Nov 10, 2023

Labels

priority:blocker (Production down; release blocker), release-1.0.0

4 participants