[HUDI-7565] Create spark file readers to read a single file instead of an entire partition #10954

jonvex · 2024-04-02T17:14:01Z

Change Logs

Subtask of https://issues.apache.org/jira/browse/HUDI-7045
Extracted from #10278

Spark Parquet readers are currently created per partition, but we want to create a reader for each file. This PR ports the Spark Parquet reader code for each supported Spark version and removes the partition iterator.

To make the ported code easier to verify, the Javadoc for readParquetFile lists the Spark version it was ported from.
Use the following link and switch between tags to see the code for that Spark version:
https://github.com/apache/spark/blob/v2.4.8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
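
As a rough illustration of the per-partition versus per-file distinction described above, here is a minimal sketch in plain Scala. It is not code from this PR: `FakeFile`, `readPartition`, and `readFile` are hypothetical names, and rows are modeled as strings rather than Spark's `InternalRow`.

```scala
// Hypothetical stand-in for a data file; not a Spark or Hudi type.
final case class FakeFile(path: String, rows: Seq[String])

object ReaderSketch {
  // Partition-scoped style: one call walks every file in the partition and
  // hides the file boundaries behind a single iterator.
  def readPartition(files: Seq[FakeFile]): Iterator[String] =
    files.iterator.flatMap(file => file.rows.iterator)

  // File-scoped style: one call reads exactly one file, so the caller decides
  // which files to open and in what order.
  def readFile(file: FakeFile): Iterator[String] =
    file.rows.iterator
}

object ReaderSketchDemo extends App {
  val files = Seq(
    FakeFile("part-0001.parquet", Seq("a", "b")),
    FakeFile("part-0002.parquet", Seq("c"))
  )
  // Reading file by file yields the same rows as the partition-wide iterator.
  val perFile = files.flatMap(f => ReaderSketch.readFile(f).toList)
  val perPartition = ReaderSketch.readPartition(files).toList
  println(perFile == perPartition)
}
```

With a file-scoped function, a caller that needs to treat each file differently can do so without unpacking a partition-wide iterator.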

Impact

Subtask for schema evolution support in the new file group (fg) reader.

Risk level (write none, low, medium or high below)

low

Documentation Update

N/A

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

...in/scala/org/apache/spark/sql/execution/datasources/parquet/Spark24HoodieParquetReader.scala

...in/scala/org/apache/spark/sql/execution/datasources/parquet/Spark31HoodieParquetReader.scala

yihua · 2024-04-11T19:25:12Z

...main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkHoodieParquetReader.scala

+   * @param sharedConf      the hadoop conf
+   * @return iterator of rows read from the file output type says [[InternalRow]] but could be [[ColumnarBatch]]
+   */
+  def read(file: PartitionedFile,


I was thinking that SparkHoodieParquetReader.read can be unit-tested by passing in parameters and validating the output iterator of InternalRows. For now, the functional test serves a similar purpose.

...main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkHoodieParquetReader.scala

hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala

...source/hudi-spark/src/test/java/org/apache/hudi/functional/TestSparkHoodieParquetReader.java

...in/scala/org/apache/spark/sql/execution/datasources/parquet/Spark24HoodieParquetReader.scala

...in/scala/org/apache/spark/sql/execution/datasources/parquet/Spark30HoodieParquetReader.scala

...in/scala/org/apache/spark/sql/execution/datasources/parquet/Spark31HoodieParquetReader.scala

...in/scala/org/apache/spark/sql/execution/datasources/parquet/Spark32HoodieParquetReader.scala

...in/scala/org/apache/spark/sql/execution/datasources/parquet/Spark33HoodieParquetReader.scala

...in/scala/org/apache/spark/sql/execution/datasources/parquet/Spark34HoodieParquetReader.scala

...in/scala/org/apache/spark/sql/execution/datasources/parquet/Spark35HoodieParquetReader.scala

hudi-bot · 2024-04-12T15:49:30Z

CI report:

8f1ba6d Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

yihua

LGTM

Jonathan Vexler added 7 commits April 2, 2024 11:57

add spark 3.3 reader

283d7c3

add spark3.4

ef65428

add spark 3.5

8168147

add spark 3.2

1a53f1e

add spark 3.1

97d9920

add spark 3.0

b9d7ce4

add spark 2.4

a20e9d4

github-actions bot added the size:XL (PR with lines of changes > 1000) label Apr 2, 2024

jonvex mentioned this pull request Apr 2, 2024

[HUDI-7566] Add schema evolution to spark file readers #10956

Merged

jonvex requested a review from yihua April 2, 2024 19:30

apache deleted a comment from hudi-bot Apr 2, 2024

yihua added the reader-core and release-1.0.0 labels Apr 2, 2024

yihua reviewed Apr 2, 2024

...in/scala/org/apache/spark/sql/execution/datasources/parquet/Spark24HoodieParquetReader.scala (outdated, resolved)

yihua reviewed Apr 2, 2024

...in/scala/org/apache/spark/sql/execution/datasources/parquet/Spark24HoodieParquetReader.scala (outdated, resolved)

...in/scala/org/apache/spark/sql/execution/datasources/parquet/Spark24HoodieParquetReader.scala (outdated, resolved)

Jonathan Vexler added 15 commits April 3, 2024 17:24

spark 3.3 use properties class

abe7839

spark 3.2 add props class

865526e

spark 3.4 add properties

bab974a

add spark 3.5 properties

0eb2185

add properties spark 3.1

10a577f

add props spark 3.0

3c7ecf1

add properties spark 2.4

700013b

fix 3.0

37f52eb

refactor to get rid of properties, spark 3.1

b9c1592

remove props spark 3.0

e3957c5

use class model for spark 3.3

7345f6b

remove props spark 3.3

2942a6c

remove props spark 3.4

2012131

remove props spark 3.5

e40072e

remove props spark 2.4

5813cbf

Jonathan Vexler added 2 commits April 4, 2024 14:14

remove change

0f00822

remove bad import

867593d

jonvex requested a review from yihua April 4, 2024 20:12

apache deleted a comment from hudi-bot Apr 4, 2024

create a copy of the conf when reading

8ca12f2

yihua reviewed Apr 11, 2024

...in/scala/org/apache/spark/sql/execution/datasources/parquet/Spark31HoodieParquetReader.scala (outdated, resolved)

add test

8205971

jonvex requested a review from yihua April 11, 2024 01:34

apache deleted a comment from hudi-bot Apr 11, 2024

allow vectorized read and comment better

815b6fd

apache deleted a comment from hudi-bot Apr 11, 2024

yihua reviewed Apr 11, 2024

Jonathan Vexler added 8 commits April 11, 2024 17:43

address review comments 3.5

120226a

rename spark 3.4

dbdefad

rename for spark3.3

f950835

rename for spark 3.2

75da5dd

rename spark 3.1

e7e4b51

rename spark 30

81da1a7

rename for spark 2

1c68439

remove empty line

f6c5beb

jonvex requested a review from yihua April 11, 2024 22:24

address hidden review comments

8f1ba6d

apache deleted a comment from hudi-bot Apr 12, 2024

yihua approved these changes Apr 12, 2024

yihua merged commit f715e8a into apache:master Apr 12, 2024
40 checks passed
