[HUDI-7565] Create spark file readers to read a single file instead of an entire partition #10954

Merged: 36 commits merged into apache:master on Apr 12, 2024

@jonvex jonvex commented Apr 2, 2024

Change Logs

Subtask of https://issues.apache.org/jira/browse/HUDI-7045
Extracts from #10278

Spark Parquet readers are currently created per partition. We want to create a reader for each file instead. This PR ports over the Spark readers for each supported Spark version and removes the partition iterator.
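As a rough illustration of the difference between the two reading styles, here is a minimal, Spark-free sketch. All names in it (`FileSlice`, `Partition`, `readPartition`, `readFile`) are hypothetical stand-ins and are not taken from Hudi or Spark; it only shows that a per-partition reader walks every file it is handed, while a per-file reader lets the caller pick exactly one file.

```scala
// Hypothetical stand-ins for illustration only; none of these names come from Hudi or Spark.
object PerFileReaderSketch {
  // A "file" is modeled as a path plus the rows it contains.
  final case class FileSlice(path: String, rows: Seq[String])
  // A "partition" is modeled as a group of files.
  final case class Partition(files: Seq[FileSlice])

  // Per-partition style: one iterator that walks every file in the partition.
  def readPartition(partition: Partition): Iterator[String] =
    partition.files.iterator.flatMap(readFile)

  // Per-file style: the caller chooses exactly one file to read.
  def readFile(file: FileSlice): Iterator[String] =
    file.rows.iterator

  def main(args: Array[String]): Unit = {
    val a = FileSlice("a.parquet", Seq("a1", "a2"))
    val b = FileSlice("b.parquet", Seq("b1"))
    val partition = Partition(Seq(a, b))

    println(readPartition(partition).toList) // prints List(a1, a2, b1)
    println(readFile(b).toList)              // prints List(b1)
  }
}
```

With the per-file style, a caller that only needs one file does not have to iterate over, or filter out, the rest of the partition.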

To verify the ported code, I have listed the Spark version each reader was ported from in the javadoc for readParquetFile. You can use the following link and switch between tags to see the code for that Spark version:
https://github.com/apache/spark/blob/v2.4.8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala

Impact

Subtask for schema evolution support in the new file group (fg) reader.

Risk level (write none, low, medium or high below)

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions bot added the size:XL label (PR with lines of changes > 1000) on Apr 2, 2024
@jonvex jonvex requested a review from yihua April 2, 2024 19:30
@apache apache deleted a comment from hudi-bot Apr 2, 2024
@jonvex jonvex requested a review from yihua April 4, 2024 20:12
@apache apache deleted a comment from hudi-bot Apr 4, 2024
@jonvex jonvex requested a review from yihua April 11, 2024 01:34
@apache apache deleted a comment from hudi-bot Apr 11, 2024
@apache apache deleted a comment from hudi-bot Apr 11, 2024
* @param sharedConf the hadoop conf
* @return iterator of rows read from the file; the output type says [[InternalRow]] but could be [[ColumnarBatch]]
*/
def read(file: PartitionedFile,

I was thinking that SparkHoodieParquetReader.read could be unit-tested by passing in parameters and validating the output iterator of InternalRows. For now, the functional test serves a similar purpose.
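A unit test along those lines might look roughly like the following sketch. It does not call the real SparkHoodieParquetReader.read, whose full signature is not shown in this conversation; `Row`, `Batch`, `read`, and `collectRows` are hypothetical stand-ins. The sketch assumes a reader whose declared element type is a row but which may actually yield batches, mirroring the note in the doc comment about InternalRow and ColumnarBatch, so the check normalizes both shapes before comparing.

```scala
// Hypothetical stand-ins for illustration only; this is not the Hudi or Spark API.
object ReadIteratorCheckSketch {
  final case class Row(values: Seq[Any])
  final case class Batch(rows: Seq[Row])

  // Stand-in reader: the declared element type is Row, but when returnBatch is
  // set the iterator really holds Batch objects (the cast succeeds due to type erasure).
  def read(data: Seq[Row], returnBatch: Boolean): Iterator[Row] =
    if (returnBatch) Iterator(Batch(data)).asInstanceOf[Iterator[Row]]
    else data.iterator

  // Normalize the iterator to plain rows regardless of what it actually holds.
  def collectRows(iter: Iterator[Row]): List[Row] =
    iter.asInstanceOf[Iterator[Any]].flatMap {
      case b: Batch => b.rows.iterator
      case r: Row   => Iterator(r)
      case other    => throw new IllegalStateException(s"Unexpected element: $other")
    }.toList

  def main(args: Array[String]): Unit = {
    val expected = List(Row(Seq(1, "a")), Row(Seq(2, "b")))
    // Pass in parameters, then validate the output iterator in both modes.
    assert(collectRows(read(expected, returnBatch = false)) == expected)
    assert(collectRows(read(expected, returnBatch = true)) == expected)
    println("ok")
  }
}
```

Viewing the iterator as `Iterator[Any]` before matching avoids a ClassCastException that would otherwise occur when a batch is pulled out as a row.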

@jonvex jonvex requested a review from yihua April 11, 2024 22:24
@apache apache deleted a comment from hudi-bot Apr 12, 2024
@hudi-bot

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua yihua left a comment

LGTM

@yihua yihua merged commit f715e8a into apache:master Apr 12, 2024
40 checks passed
Labels
reader-core release-1.0.0 size:XL PR with lines of changes > 1000

3 participants