[SPARK-23817][SQL] Create file source V2 framework and migrate ORC read path #23383
Conversation
There was a PR for this: #20933. This one migrates ORC to the latest data source V2 API.
Test build #100446 has finished for PR 23383 at commit

retest this please

Test build #100454 has finished for PR 23383 at commit

Test build #100639 has finished for PR 23383 at commit

Test build #100686 has finished for PR 23383 at commit

retest this please.
pending on #23387 to be merged.
Force-pushed from f925e8f to e1242a4.
Test build #100693 has finished for PR 23383 at commit

retest this please.

Test build #100697 has finished for PR 23383 at commit

Test build #100701 has finished for PR 23383 at commit

retest this please.

Test build #100718 has finished for PR 23383 at commit

retest this please
Review threads were resolved on the following files:

- sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
Test build #101325 has finished for PR 23383 at commit
Force-pushed from 30e0481 to 6e87532.
Test build #101347 has finished for PR 23383 at commit

retest this please.

Test build #101359 has finished for PR 23383 at commit
```scala
val hadoopConf =
  sparkSession.sessionState.newHadoopConfWithOptions(options.asMap().asScala.toMap)
val rootPathsSpecified = DataSource.checkAndGlobPathIfNecessary(filePaths, hadoopConf,
  checkEmptyGlobPath = true, checkFilesExist = options.checkFilesExist())
```
We should revisit this later. It doesn't make sense to have different file-listing behaviors between the read and write paths.

I'm fine with it as a workaround for now. In a follow-up we can remove it, fix the tests, and accept the behavior changes for DS v2.
Yes, we can always set `checkFilesExist` to `false` here.
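The behavior under discussion — skipping the eager existence check by passing `checkFilesExist = false` — can be sketched in isolation. This is a simplified illustration, not Spark's actual implementation; the object and method names here are hypothetical:

```scala
// Simplified sketch of a checkFilesExist-style flag: when true, validate
// input paths eagerly; when false (useful on the write path, where the
// output directory may not exist yet), accept the paths as-is.
import java.nio.file.{Files, Paths}

object PathCheckSketch {
  def checkPaths(paths: Seq[String], checkFilesExist: Boolean): Seq[String] = {
    if (checkFilesExist) {
      paths.foreach { p =>
        require(Files.exists(Paths.get(p)), s"Path does not exist: $p")
      }
    }
    paths
  }
}
```

With `checkFilesExist = false`, a not-yet-created output directory passes through without error, which is the write-path behavior the comment refers to.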
LGTM, merging to master!
@cloud-fan @dongjoon-hyun @gatorsmile Thanks for the review. I will come up with the file write path migration very soon.
…ion columns

## What changes were proposed in this pull request?

Currently OrcColumnarBatchReader returns all the partition column values in the batch read. In data source V2, we can improve it by returning only the required partition column values.

This PR is part of apache#23383. As cloud-fan suggested, a separate PR makes review easier.

Also, this PR doesn't improve `OrcFileFormat`, since in the method `buildReaderWithPartitionValues` the `requiredSchema` filters out all the partition columns, so we can't know which partition column is required.

## How was this patch tested?

Unit test

Closes apache#23387 from gengliangwang/refactorOrcColumnarBatch.

Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
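The core idea of that commit — handing back only the partition column values a query actually requires, instead of all of them — can be sketched with plain collections. The names below are illustrative stand-ins, not Spark's API:

```scala
// Illustrative sketch: given all partition column values for a file and the
// columns the query requires, keep only the required partition values.
object PartitionPruneSketch {
  def requiredPartitionValues(
      partitionValues: Map[String, String],
      requiredColumns: Seq[String]): Map[String, String] = {
    // Drop any partition column the query did not ask for.
    partitionValues.filter { case (name, _) => requiredColumns.contains(name) }
  }
}
```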
…ad path

## What changes were proposed in this pull request?

Create a framework for file source V2 based on the data source V2 API.

As a good example demonstrating the framework, this PR also migrates the ORC source. This is because the ORC file source supports both row scan and columnar scan, and the implementation is simpler compared with Parquet.

Note: currently only the read path of the V2 API is done, so this framework and migration cover only the read path.

Supports the following scans:

- Scan ColumnarBatch
- Scan UnsafeRow
- Push down filters
- Push down required columns

Not supported (due to the limitations of the data source V2 API):

- Stats metrics
- Catalog table
- Writes

## How was this patch tested?

Unit test

Closes apache#23383 from gengliangwang/latest_orcV2.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ex in the write path

## What changes were proposed in this pull request?

In apache#23383, the file source V2 framework is implemented. In that PR, `FileIndex` is created as a member of `FileTable`, so that we can implement partition pruning like apache@0f9fcab in the future. (As the data source V2 catalog is under development, partition pruning was removed from the PR.)

However, after the write path of file source V2 was implemented, I found that a simple write creates an unnecessary `FileIndex`, which is required by `FileTable`. This is a sort of regression, and we can see a warning message when writing to ORC files:

```
WARN InMemoryFileIndex: The directory file:/tmp/foo was not found. Was it deleted very recently?
```

This PR makes `FileIndex` a lazy value in `FileTable`, so that we can avoid creating an unnecessary `FileIndex` in the write path.

## How was this patch tested?

Existing unit test

Closes apache#23774 from gengliangwang/moveFileIndexInV2.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
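The fix described in that commit rests on Scala's `lazy val` semantics: a lazy member is constructed on first access, so a pure write never triggers the (possibly warning-producing) file listing. A minimal sketch with illustrative class names (not Spark's real `FileTable`/`FileIndex`):

```scala
// Toy stand-in for an expensive-to-build file index.
class FileIndexSketch(path: String) {
  var listed = false
  def listFiles(): Seq[String] = { listed = true; Seq.empty } // pretend listing
}

class FileTableSketch(path: String) {
  // `lazy` defers construction until a read actually needs the index,
  // so the write path below never builds it.
  lazy val fileIndex: FileIndexSketch = new FileIndexSketch(path)

  def write(data: Seq[String]): Unit = ()          // write path: fileIndex untouched
  def read(): Seq[String] = fileIndex.listFiles()  // read path: forces fileIndex
}
```

A write against a not-yet-existing directory therefore no longer constructs the index at all, avoiding the spurious `InMemoryFileIndex` warning.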
What changes were proposed in this pull request?

Create a framework for file source V2 based on the data source V2 API.

As a good example demonstrating the framework, this PR also migrates the ORC source. This is because the ORC file source supports both row scan and columnar scan, and the implementation is simpler compared with Parquet.

Note: currently only the read path of the V2 API is done, so this framework and migration cover only the read path.

Supports the following scans:

- Scan ColumnarBatch
- Scan UnsafeRow
- Push down filters
- Push down required columns

Not supported (due to the limitations of the data source V2 API):

- Stats metrics
- Catalog table
- Writes

How was this patch tested?

Unit test
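The layering the framework introduces for the read path can be sketched in miniature: a table hands out a scan builder, pushdowns narrow the scan, and the built scan produces a partition reader. All names below are illustrative stand-ins for the actual data source V2 interfaces, and the "file source" is an in-memory toy:

```scala
// Toy sketch of the Table -> ScanBuilder -> Scan -> PartitionReader layering.
trait PartitionReader[T] { def next(): Boolean; def get(): T; def close(): Unit }

trait Scan { def createReader(): PartitionReader[String] }

trait ScanBuilder {
  def pushDownRequiredColumns(cols: Seq[String]): ScanBuilder
  def build(): Scan
}

// In-memory "table" wiring the pieces together; rows are column -> value maps.
class ToyTable(rows: Seq[Map[String, String]]) {
  def newScanBuilder(): ScanBuilder = new ScanBuilder {
    // Default to all columns; pushdown narrows this set.
    private var required: Seq[String] =
      rows.headOption.map(_.keys.toSeq).getOrElse(Nil)

    def pushDownRequiredColumns(cols: Seq[String]): ScanBuilder = {
      required = cols; this
    }

    def build(): Scan = new Scan {
      def createReader(): PartitionReader[String] = new PartitionReader[String] {
        private val it = rows.iterator
        private var current: Map[String, String] = null
        def next(): Boolean = { val has = it.hasNext; if (has) current = it.next(); has }
        def get(): String = required.map(current(_)).mkString(",")
        def close(): Unit = ()
      }
    }
  }
}
```

The real framework plugs file formats (like ORC) in at the `Scan`/`PartitionReader` layer, which is what lets one format offer both row and columnar readers behind the same builder chain.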