[SPARK-47157][SQL] Refactor file listing with ScanFileListing interface
### What changes were proposed in this pull request?

This pull request introduces the `ScanFileListing` trait and its implementation, the `GenericScanFileListing` class, to encapsulate the handling of file listing results. The new abstraction improves modularity and allows more flexible management of file listings within the system.

### Why are the changes needed?

These constructs define a standardized API for file listing operations, regardless of the underlying representation used for files and partitions. The improved modularity enables future improvements to both runtime and memory usage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This is a refactoring that introduces no new behavior, so existing tests suffice.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45224 from costas-db/refactorFileListing.

Lead-authored-by: Costas Zarifis <costas.zarifis@databricks.com>
Co-authored-by: Shoumik Palkar <shoumik.palkar@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
1 parent 18b8606, commit e0facc3
Showing 5 changed files with 223 additions and 61 deletions.
86 changes: 86 additions & 0 deletions
sql/core/src/main/scala/org/apache/spark/sql/execution/ScanFileListing.scala
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,86 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
package org.apache.spark.sql.execution | ||
|
||
import org.apache.spark.sql.catalyst.InternalRow | ||
import org.apache.spark.sql.catalyst.expressions.{BasePredicate, Expression} | ||
import org.apache.spark.sql.execution.datasources.{FileStatusWithMetadata, PartitionedFile} | ||
|
||
case class ListingPartition( | ||
values: InternalRow, | ||
numFiles: Long, | ||
files: Iterator[FileStatusWithMetadata]) | ||
|
||
/** | ||
* Trait used to represent the selected partitions and dynamically selected partitions | ||
* during file listing. | ||
* | ||
* The `ScanFileListing` trait defines the core API for interacting with selected partitions, | ||
* establishing a contract for subclasses. It sits at the root of this package so that other | ||
* packages and classes that need to represent the selected partitions and dynamically | ||
* selected partitions can access it. | ||
*/ | ||
trait ScanFileListing { | ||
|
||
/** | ||
* Returns the number of partitions for the current partition representation. | ||
*/ | ||
def partitionCount: Int | ||
|
||
/** | ||
* Calculates the total size in bytes of all files across the current file listing representation. | ||
*/ | ||
def totalFileSize: Long | ||
|
||
/** | ||
* Returns the total number of files across the current file listing representation. | ||
*/ | ||
def totalNumberOfFiles: Long | ||
|
||
/** | ||
* Filters and prunes files from the current scan file listing representation based on the given | ||
* predicate and dynamic file filters. Initially, it filters partitions based on a static | ||
* predicate. For partitions that pass this filter, it further prunes files using dynamic file | ||
* filters, if any are provided. This method assumes that dynamic file filters are applicable | ||
* only to files within partitions that have already passed the static predicate filter. | ||
*/ | ||
def filterAndPruneFiles( | ||
boundPredicate: BasePredicate, dynamicFileFilters: Seq[Expression]): ScanFileListing | ||
|
||
/** | ||
* Returns an `Array[PartitionedFile]` from the current ScanFileListing representation. | ||
*/ | ||
def toPartitionArray: Array[PartitionedFile] | ||
|
||
/** | ||
* Returns the total partition size in bytes for the current ScanFileListing representation. | ||
*/ | ||
def calculateTotalPartitionBytes: Long | ||
|
||
/** | ||
* Returns an iterator over the partitions and their files for the file listing representation. | ||
* This allows us to iterate over the partitions without the additional overhead of materializing | ||
* the whole collection. | ||
*/ | ||
def filePartitionIterator: Iterator[ListingPartition] | ||
|
||
/** | ||
* Determines if each bucket in the current file listing representation contains at most one file. | ||
* This function returns true if it does, or false otherwise. | ||
*/ | ||
def bucketsContainSingleFile: Boolean | ||
} |
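
The two-stage behavior documented for `filterAndPruneFiles` (a static partition predicate first, then dynamic file filters on the surviving partitions) can be illustrated with a small, Spark-free sketch. Everything below is hypothetical and is not the `GenericScanFileListing` implementation from this commit: `SimpleFile`, `SimplePartition`, and `SimpleListing` are simplified stand-ins for `FileStatusWithMetadata`, `InternalRow`, and `ScanFileListing`, and plain Scala functions take the place of `BasePredicate` and `Expression`. Keeping a partition whose files were all pruned is also an assumption of this sketch.

```scala
// Hypothetical, simplified stand-ins for Spark's FileStatusWithMetadata and InternalRow.
final case class SimpleFile(path: String, length: Long)
final case class SimplePartition(values: Map[String, String], files: Seq[SimpleFile])

// A toy listing that mirrors the shape of ScanFileListing (it is NOT the real trait).
final case class SimpleListing(partitions: Seq[SimplePartition]) {

  def partitionCount: Int = partitions.size

  def totalNumberOfFiles: Long = partitions.map(_.files.size.toLong).sum

  def totalFileSize: Long = partitions.flatMap(_.files).map(_.length).sum

  // Stage 1: keep only partitions that satisfy the static predicate.
  // Stage 2: inside each surviving partition, keep files that pass every dynamic filter.
  // Assumption: a partition left with zero files is still kept in the result.
  def filterAndPruneFiles(
      partitionPredicate: SimplePartition => Boolean,
      dynamicFileFilters: Seq[SimpleFile => Boolean]): SimpleListing = {
    val kept = partitions.filter(partitionPredicate).map { p =>
      p.copy(files = p.files.filter(f => dynamicFileFilters.forall(filter => filter(f))))
    }
    SimpleListing(kept)
  }

  // Walks partitions one at a time, handing out (values, numFiles, files) per partition
  // without first flattening all files into a single collection.
  def filePartitionIterator: Iterator[(Map[String, String], Long, Iterator[SimpleFile])] =
    partitions.iterator.map(p => (p.values, p.files.size.toLong, p.files.iterator))
}

object ScanFileListingSketch {
  // Made-up sample data for illustration only.
  val listing: SimpleListing = SimpleListing(Seq(
    SimplePartition(
      Map("year" -> "2023"),
      Seq(SimpleFile("a.parquet", 100L), SimpleFile("b.parquet", 5L))),
    SimplePartition(
      Map("year" -> "2024"),
      Seq(SimpleFile("c.parquet", 200L), SimpleFile("d.parquet", 10L)))
  ))

  // Static predicate: year = 2024. Dynamic filter: files of at least 50 bytes.
  val pruned: SimpleListing = listing.filterAndPruneFiles(
    (p: SimplePartition) => p.values.get("year").contains("2024"),
    Seq((f: SimpleFile) => f.length >= 50L))

  def main(args: Array[String]): Unit = {
    println(s"before: ${listing.partitionCount} partitions, " +
      s"${listing.totalNumberOfFiles} files, ${listing.totalFileSize} bytes")
    println(s"after: ${pruned.partitionCount} partitions, " +
      s"${pruned.totalNumberOfFiles} files, ${pruned.totalFileSize} bytes")
  }
}
```

Because callers only touch the listing through methods such as `partitionCount` and `filterAndPruneFiles`, the backing collection could in principle be swapped for another representation without changing them, which is the modularity argument the commit message makes.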