[HUDI-7567] Add schema evolution to the filegroup reader #10957

Merged on Jun 7, 2024 (111 commits)

@jonvex (Contributor) commented Apr 2, 2024

Change Logs

Subtask of https://issues.apache.org/jira/browse/HUDI-7045
Extracts from #10278

This PR adds schema evolution support to the file group reader, covering both schema-on-read and schema-on-write evolution.

Impact

Schema evolution is now supported in the file group reader.

Risk level (write none, low medium or high below)

High. Performance testing still needs to be done.

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Apr 2, 2024
@jonvex jonvex changed the title [HUDI-7567] Add schema evolution to fg reader [HUDI-7567] Add schema evolution to the filegroup reader Apr 2, 2024
@jonvex jonvex changed the title [HUDI-7567] Add schema evolution to the filegroup reader [DO NOT MERGE][HUDI-7567] Add schema evolution to the filegroup reader Apr 2, 2024
@jonvex jonvex marked this pull request as ready for review April 2, 2024 21:05
@jonvex jonvex requested a review from yihua April 2, 2024 21:05
@apache apache deleted a comment from hudi-bot Apr 3, 2024
@jonvex jonvex requested a review from codope June 4, 2024 22:12
@codope (Member) left a comment

Approving subject to clarification on #10957 (comment)

@yihua (Contributor) left a comment

I'd really encourage you to break down such a large PR into smaller pieces, so that each one is independent in scope (e.g., adding new APIs or util methods) and the pieces can be stacked properly. Then each set of changes can be reviewed closely.

@@ -275,6 +311,9 @@ protected Option<T> merge(Option<T> older, Map<String, Object> olderInfoMap,

      if (mergedRecord.isPresent()
          && !mergedRecord.get().getLeft().isDelete(mergedRecord.get().getRight(), payloadProps)) {
        if (!mergedRecord.get().getRight().equals(readerSchema)) {
          return Option.ofNullable((T) mergedRecord.get().getLeft().rewriteRecordWithNewSchema(mergedRecord.get().getRight(), null, readerSchema).getData());

Do partial updates need schema evolution handling like this?
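
For readers following the hunk, the control flow it adds can be sketched roughly as follows. This is a minimal illustration and not Hudi code: records are modeled as plain maps, a schema as an ordered list of field names, and `SchemaRewriteSketch`, `finalizeMerged`, and `rewriteToReaderSchema` are hypothetical names. Returning an empty result for deleted or absent records is also an assumption, since the hunk does not show that branch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative only: none of these names are Hudi APIs.
public class SchemaRewriteSketch {

    // Hypothetical stand-in for a schema rewrite: project the record onto the
    // reader's field list, filling fields the record lacks with null and
    // dropping fields the reader does not ask for.
    static Map<String, Object> rewriteToReaderSchema(Map<String, Object> record,
                                                     List<String> readerFields) {
        Map<String, Object> rewritten = new LinkedHashMap<>();
        for (String field : readerFields) {
            rewritten.put(field, record.get(field));
        }
        return rewritten;
    }

    // Mirrors the shape of the hunk: only a present, non-deleted merged record
    // is considered, and it is rewritten only when its schema differs from the
    // reader schema. The empty result for the other cases is an assumption.
    static Optional<Map<String, Object>> finalizeMerged(Optional<Map<String, Object>> merged,
                                                        List<String> mergedFields,
                                                        List<String> readerFields,
                                                        boolean isDelete) {
        if (!merged.isPresent() || isDelete) {
            return Optional.empty();
        }
        if (!mergedFields.equals(readerFields)) {
            return Optional.of(rewriteToReaderSchema(merged.get(), readerFields));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", 1);
        record.put("name", "a");
        List<String> writerFields = List.of("id", "name");
        List<String> readerFields = List.of("id", "name", "age");
        // The reader schema has an extra "age" field, so the record is rewritten.
        System.out.println(finalizeMerged(Optional.of(record), writerFields, readerFields, false));
    }
}
```

In this sketch the schema comparison acts as a fast path: when the merged record already matches the reader schema, the record is returned as-is and no copy is made.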

@yihua (Contributor) left a comment

Do you have any performance numbers based on manual benchmarking to make sure there is no regression?

@jonvex jonvex requested a review from yihua June 6, 2024 23:11
@yihua (Contributor) left a comment

Overall looks good as a first cut. We should keep testing and improving the schema evolution logic in the new file group reader.

import org.junit.jupiter.api.{BeforeEach, Test}
import org.junit.jupiter.api.Assertions.{assertArrayEquals, assertEquals, assertFalse}

class TestSpark35RecordPositionMetadataColumn extends SparkClientFunctionalTestHarness {

We'll add this back in #11413?

@hudi-bot commented Jun 7, 2024

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codope codope merged commit f4be74c into apache:master Jun 7, 2024
46 checks passed
5 participants