[HUDI-7565] Create spark file readers to read a single file instead of an entire partition #10954
 * @param sharedConf the hadoop conf
 * @return iterator of rows read from the file output type says [[InternalRow]] but could be [[ColumnarBatch]]
 */
def read(file: PartitionedFile,
I was thinking that `SparkHoodieParquetReader.read` can be unit-tested by passing in parameters and validating the output iterator of `InternalRow`s. For now, the functional test serves a similar purpose.
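A rough sketch of the shape such a unit test could take, assuming stand-in types so it runs without Spark on the classpath. `FakeFile`, `FakeRow`, `SingleFileReader`, and `InMemoryReader` are hypothetical names used only for illustration. They are not Spark's `PartitionedFile` or `InternalRow`, and the two-argument `read` here is not the signature in this PR.

```scala
// Stand-in types for illustration only; NOT Spark's PartitionedFile / InternalRow.
final case class FakeFile(path: String, rows: Seq[Seq[Any]])
final case class FakeRow(values: Seq[Any])

// Hypothetical single-file reader that projects columns by index.
trait SingleFileReader {
  def read(file: FakeFile, requiredColumns: Seq[Int]): Iterator[FakeRow]
}

// In-memory implementation so the test needs no file system or Spark session.
object InMemoryReader extends SingleFileReader {
  override def read(file: FakeFile, requiredColumns: Seq[Int]): Iterator[FakeRow] =
    file.rows.iterator.map(row => FakeRow(requiredColumns.map(i => row(i))))
}

object ReaderUnitTestSketch {
  // Pass parameters in, drain the returned iterator, and hand the rows back for validation.
  def readAll(): List[FakeRow] = {
    val file = FakeFile("/tmp/part-0001.parquet", Seq(Seq(1, "a", 10.0), Seq(2, "b", 20.0)))
    InMemoryReader.read(file, requiredColumns = Seq(0, 2)).toList
  }

  def main(args: Array[String]): Unit = {
    val rows = readAll()
    // Validate both the row count and the projected values.
    assert(rows == List(FakeRow(Seq(1, 10.0)), FakeRow(Seq(2, 20.0))))
    println(s"read ${rows.size} rows")
  }
}
```

A real test would construct an actual `PartitionedFile` over a small parquet file and compare the returned rows against known contents, which is what the functional test covers end to end.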
LGTM
Change Logs
Subtask of https://issues.apache.org/jira/browse/HUDI-7045
Extracts from #10278
Spark parquet readers are created per partition. We want to create a reader for each file. This PR ports over the Spark parquet readers for each supported Spark version and removes the partition iterator.
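To illustrate the difference in granularity, here is a minimal sketch with stand-in types. `DataFile`, `readPartition`, and `readFile` are hypothetical names for illustration. They are not the Spark or Hudi APIs, and the real readers return rows read from parquet files rather than strings.

```scala
// Stand-in for a data file; NOT Spark's PartitionedFile.
final case class DataFile(path: String, rows: Seq[String])

object ReaderGranularitySketch {
  // Per-partition shape: one call walks every file in the partition and
  // hides the file boundaries behind a single iterator.
  def readPartition(files: Seq[DataFile]): Iterator[String] =
    files.iterator.flatMap(_.rows.iterator)

  // Per-file shape: the caller decides exactly which file is read.
  def readFile(file: DataFile): Iterator[String] =
    file.rows.iterator

  def main(args: Array[String]): Unit = {
    val files = Seq(
      DataFile("f1.parquet", Seq("r1", "r2")),
      DataFile("f2.parquet", Seq("r3"))
    )
    println(readPartition(files).toList) // List(r1, r2, r3)
    println(readFile(files(1)).toList)   // List(r3)
  }
}
```

With the per-file shape, a caller can read one file at a time without going through a partition-level iterator.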
To verify the ported code, I have listed the Spark version each reader was ported from in the javadoc for readParquetFile.
You can use the following link and switch between tags to see the code for that Spark version:
https://github.com/apache/spark/blob/v2.4.8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
Impact
Subtask for schema evolution support in the new file group (fg) reader.
Risk level (write none, low, medium or high below)
low
Documentation Update
N/A