[HUDI-7567] Add schema evolution to the filegroup reader #10957

jonvex · 2024-04-02T20:57:16Z

Change Logs

Subtask of https://issues.apache.org/jira/browse/HUDI-7045
Extracts from #10278

This PR adds schema evolution to the filegroup reader, covering both schema.on.read and schema.on.write.
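
To illustrate the kind of case a schema-on-read path has to handle, here is a minimal sketch. It is not code from this PR. It assumes columns carry stable ids so that a renamed column can still be located in an older file; `ColumnIdResolutionSketch` and `readWithEvolvedSchema` are hypothetical names, and plain maps stand in for real schemas and rows.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Hudi.
final class ColumnIdResolutionSketch {

  /**
   * Projects a row written with an older schema onto the reader's current schema.
   *
   * @param fileSchema   column id -> column name at the time the file was written
   * @param readerSchema column id -> column name the reader expects now
   * @param row          values keyed by the file's column names
   */
  static Map<String, Object> readWithEvolvedSchema(Map<Integer, String> fileSchema,
                                                   Map<Integer, String> readerSchema,
                                                   Map<String, Object> row) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (Map.Entry<Integer, String> column : readerSchema.entrySet()) {
      // Look the column up by id, so a rename does not lose the data.
      String nameInFile = fileSchema.get(column.getKey());
      // A column added after the file was written has no id in the file: read it as null.
      Object value = nameInFile == null ? null : row.get(nameInFile);
      out.put(column.getValue(), value);
    }
    return out;
  }

  public static void main(String[] args) {
    Map<Integer, String> fileSchema = new LinkedHashMap<>();
    fileSchema.put(1, "id");
    fileSchema.put(2, "name");

    Map<Integer, String> readerSchema = new LinkedHashMap<>();
    readerSchema.put(1, "id");
    readerSchema.put(2, "full_name"); // column 2 was renamed
    readerSchema.put(3, "city");      // column 3 was added later

    Map<String, Object> row = new LinkedHashMap<>();
    row.put("id", 1);
    row.put("name", "Ann");

    System.out.println(readWithEvolvedSchema(fileSchema, readerSchema, row));
  }
}
```

In this sketch a dropped column simply never appears in the output, because the loop is driven by the reader schema.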

Impact

Schema evolution is now supported in the filegroup reader.

Risk level (write none, low medium or high below)

high
Performance testing still needs to be done.

Documentation Update

N/A

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

codope

Approving subject to clarification on #10957 (comment)

yihua

I'd really encourage you to break such a large PR into smaller pieces, so that each one is independent in scope (e.g., adding new APIs or util methods) and the pieces are stacked properly. Then each set of changes can be reviewed closely.

...di-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala

hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordReader.java

...-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java

yihua · 2024-06-06T00:46:32Z

...-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java

@@ -275,6 +311,9 @@ protected Option<T> merge(Option<T> older, Map<String, Object> olderInfoMap,

    if (mergedRecord.isPresent()
        && !mergedRecord.get().getLeft().isDelete(mergedRecord.get().getRight(), payloadProps)) {
+      if (!mergedRecord.get().getRight().equals(readerSchema)) {
+        return Option.ofNullable((T) mergedRecord.get().getLeft().rewriteRecordWithNewSchema(mergedRecord.get().getRight(), null, readerSchema).getData());


Do partial updates need schema evolution handling like this?
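
As context for the hunk and the question above, here is a minimal sketch of the guard the diff adds: after merging, the record is rewritten only when its schema differs from the reader schema. This is an illustration under stated assumptions, not the PR's implementation. Schemas are modeled as ordered field-name lists and records as maps; `MergedRecordSchemaSketch`, `finishMerge`, and `rewriteToReaderSchema` are hypothetical names, with `rewriteToReaderSchema` standing in for the `rewriteRecordWithNewSchema` call in the diff.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; a simplified model of the guard in the diff.
final class MergedRecordSchemaSketch {

  /** Projects a record onto the reader's fields; fields the record lacks become null. */
  static Map<String, Object> rewriteToReaderSchema(Map<String, Object> record,
                                                   List<String> readerFields) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (String field : readerFields) {
      out.put(field, record.get(field));
    }
    return out;
  }

  /** Returns the merged record unchanged when schemas match, otherwise a rewritten copy. */
  static Map<String, Object> finishMerge(Map<String, Object> merged,
                                         List<String> mergedFields,
                                         List<String> readerFields) {
    if (mergedFields.equals(readerFields)) {
      return merged; // fast path: no per-record rewrite cost
    }
    return rewriteToReaderSchema(merged, readerFields);
  }

  public static void main(String[] args) {
    Map<String, Object> merged = new LinkedHashMap<>();
    merged.put("id", 1);
    merged.put("name", "a");

    List<String> mergedFields = List.of("id", "name");
    List<String> readerFields = List.of("id", "name", "city");

    System.out.println(finishMerge(merged, mergedFields, readerFields));
  }
}
```

The equality check matters for the performance concern raised in this review: in the sketch, the rewrite cost is paid only for records whose schema actually differs from the reader schema.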

...ommon/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java

yihua

Do you have any performance numbers based on manual benchmarking to make sure there is no regression?

...asource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala

...k-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestAvroSchemaResolutionSupport.scala

...c/test/scala/org/apache/hudi/common/table/read/TestSpark35RecordPositionMetadataColumn.scala

...di-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala

yihua

Overall looks good as a first cut. We should keep testing and improving the schema evolution logic in the new file group reader.

yihua · 2024-06-07T08:32:18Z

...c/test/scala/org/apache/hudi/common/table/read/TestSpark35RecordPositionMetadataColumn.scala

-import org.junit.jupiter.api.{BeforeEach, Test}
-import org.junit.jupiter.api.Assertions.{assertArrayEquals, assertEquals, assertFalse}
-
-class TestSpark35RecordPositionMetadataColumn extends SparkClientFunctionalTestHarness {


We'll add this back in #11413?

hudi-bot · 2024-06-07T09:19:00Z

CI report:

04f6991 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

Jonathan Vexler added 7 commits April 2, 2024 11:57

add spark 3.3 reader

283d7c3

add spark3.4

ef65428

add spark 3.5

8168147

add spark 3.2

1a53f1e

add spark 3.1

97d9920

add spark 3.0

b9d7ce4

add spark 2.4

a20e9d4

github-actions bot added the size:XL PR with lines of changes > 1000 label Apr 2, 2024

jonvex changed the title ~~[HUDI-7567] Add schema evolution to fg reader~~ [HUDI-7567] Add schema evolution to the filegroup reader Apr 2, 2024

jonvex changed the title ~~[HUDI-7567] Add schema evolution to the filegroup reader~~ [DO NOT MERGE][HUDI-7567] Add schema evolution to the filegroup reader Apr 2, 2024

jonvex marked this pull request as ready for review April 2, 2024 21:05

jonvex requested a review from yihua April 2, 2024 21:05

apache deleted a comment from hudi-bot Apr 3, 2024

Jonathan Vexler added 17 commits April 3, 2024 17:24

spark 3.3 use properties class

abe7839

spark 3.2 add props class

865526e

spark 3.4 add properties

bab974a

add spark 3.5 properties

0eb2185

add properties spark 3.1

10a577f

add props spark 3.0

3c7ecf1

add properties spark 2.4

700013b

fix 3.0

37f52eb

refactor to get rid of properties, spark 3.1

b9c1592

remove props spark 3.0

e3957c5

use class model for spark 3.3

7345f6b

remove props spark 3.3

2942a6c

remove props spark 3.4

2012131

remove props spark 3.5

e40072e

remove props spark 2.4

5813cbf

remove change

0f00822

remove bad import

867593d

Jonathan Vexler added 3 commits June 4, 2024 14:41

add testing back/ add new testing

4045388

add spark test

575b206

fix build errors

36d0b15

jonvex requested a review from codope June 4, 2024 22:12

Merge branch 'master' into add_schema_evolution_to_fg_reader

11862a3

codope approved these changes Jun 5, 2024

make default value -1 for position column

e710020

yihua reviewed Jun 5, 2024

...di-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala Outdated

...di-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala Outdated

yihua reviewed Jun 6, 2024

fix name requested by reviewer

1e677e9

yihua reviewed Jun 6, 2024

...asource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala Outdated

yihua reviewed Jun 6, 2024

...k-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestAvroSchemaResolutionSupport.scala Outdated

...c/test/scala/org/apache/hudi/common/table/read/TestSpark35RecordPositionMetadataColumn.scala

yihua reviewed Jun 6, 2024

...di-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala Outdated

Jonathan Vexler added 4 commits June 6, 2024 12:06

fixing position logic, switching to 3.5 to test

76836bc

fix position merging and test

c4c5e10

address review comments

de51812

fix issue with schema handler

b0f17cf

jonvex requested a review from yihua June 6, 2024 23:11

Jonathan Vexler added 7 commits June 6, 2024 19:12

optimize imports

c3c2bd3

fix bug with test

b0e45dc

fix issue with filtering recordkey col

3e9489c

remove comment line

d14bbee

add new filter rules

166151a

fix failing test

1a1ca64

get rid of position stuff

04f6991

jonvex mentioned this pull request Jun 7, 2024

[HUDI-7840] Add position merging to the new file group reader #11413

Merged

4 tasks

apache deleted a comment from hudi-bot Jun 7, 2024

yihua approved these changes Jun 7, 2024

codope merged commit f4be74c into apache:master Jun 7, 2024
46 checks passed
