[SPARK-17974] Refactor FileCatalog classes to simplify the inheritance tree #15518
 * single root path from which partitions are discovered, or individual partitions may be
 * specified by each path.
 */
def rootPaths: Seq[Path]
what's "root" about this?
I would say "pretty much nothing" anymore.
In an earlier version, it was the "root" path of the table, excluding any partition dirs. The PR drifted away from that definition.
Now I'd say it could be reverted to `paths`.
I think they are the roots of the recursive traversal to discover leaf files. I am inclined to keep `rootPaths` since it's easy to grep for, vs `paths`, which is too common.
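To make the "roots of the recursive traversal" reading concrete, here is a minimal sketch of leaf-file discovery starting from a set of root paths. It uses `java.io.File` on the local filesystem rather than Hadoop's `FileSystem` and `Path`, and the names `LeafFileDiscovery`, `listLeafFiles`, and `allLeafFiles` are hypothetical, not taken from the PR.

```scala
import java.io.File

// Hypothetical illustration only: the real catalog works with Hadoop paths
// and file statuses, not java.io.File.
object LeafFileDiscovery {

  // Recursively collect every regular file reachable from `path`.
  // A root may be a directory (traversed) or a single file (returned as-is).
  def listLeafFiles(path: File): Seq[File] =
    if (path.isDirectory) {
      // listFiles() returns null on I/O error, so wrap it in Option.
      Option(path.listFiles()).toSeq.flatMap(_.toSeq).flatMap(listLeafFiles)
    } else if (path.isFile) {
      Seq(path)
    } else {
      Seq.empty
    }

  // Each entry in `rootPaths` is the starting point of one traversal.
  def allLeafFiles(rootPaths: Seq[File]): Seq[File] =
    rootPaths.flatMap(listLeafFiles)
}
```

Under this reading, a root can be either the table directory or an individual partition directory, which matches the doc comment on `rootPaths`.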
Test build #67084 has finished for PR 15518 at commit
 * A collection of data files from a partitioned relation, along with the partition values in the
 * form of an [[InternalRow]].
 */
case class Partition(values: InternalRow, files: Seq[FileStatus])
while you are doing it, perhaps we can rename this TaskPartition?
I always find "Partition" very confusing because it can mean 3 different things:
- A block in HDFS.
- A Hive partition.
- A Spark task partition (or split).
I don't think these are the partitions tasks see. Confusingly, those seem to be named `FilePartition`. How about `PartitionFiles`?
assert(location.partitionSpec === PartitionSpec.emptySpec)
case LogicalRelation(
HadoopFsRelation(location: PartitioningAwareFileCatalog, _, _, _, _, _), _, _) =>
assert(location.partitionSpec() === PartitionSpec.emptySpec)
}.getOrElse {
fail(s"Expecting a ParquetRelation2, but got:\n$queryExecution")
We're not expecting a `ParquetRelation2` anymore; more like a `HadoopFsRelation` with a `PartitioningAwareFileCatalog`.
Changed
Alright, renamed `PartitionFiles` => `PartitionDirectory`, and `PartitionDirectory` => `PartitionPath`.
Test build #67095 has finished for PR 15518 at commit
Test build #67092 has finished for PR 15518 at commit
LGTM - merging in master.
Oops - just realized the tests for the latest commit failed. I will revert the patch.
Seems this fails the scala style check.
…eritance tree

## What changes were proposed in this pull request?

This renames `BasicFileCatalog => FileCatalog`, combines `SessionFileCatalog` with `PartitioningAwareFileCatalog`, and removes the old `FileCatalog` trait.

In summary,
```
MetadataLogFileCatalog extends PartitioningAwareFileCatalog
ListingFileCatalog extends PartitioningAwareFileCatalog
PartitioningAwareFileCatalog extends FileCatalog
TableFileCatalog extends FileCatalog
```
(note that this is a re-submission of apache#15518 which got reverted)

## How was this patch tested?

Existing tests

Author: Eric Liang <ekl@databricks.com>

Closes apache#15533 from ericl/fix-scalastyle-revert.

…e tree

## What changes were proposed in this pull request?

This renames `BasicFileCatalog => FileCatalog`, combines `SessionFileCatalog` with `PartitioningAwareFileCatalog`, and removes the old `FileCatalog` trait.

In summary,
```
MetadataLogFileCatalog extends PartitioningAwareFileCatalog
ListingFileCatalog extends PartitioningAwareFileCatalog
PartitioningAwareFileCatalog extends FileCatalog
TableFileCatalog extends FileCatalog
```
cc cloud-fan mallman

## How was this patch tested?

Existing tests

Author: Eric Liang <ekl@databricks.com>

Closes apache#15518 from ericl/refactor-session-file-catalog.
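What changes were proposed in this pull request?
This renames `BasicFileCatalog => FileCatalog`, combines `SessionFileCatalog` with `PartitioningAwareFileCatalog`, and removes the old `FileCatalog` trait.
In summary,
- `MetadataLogFileCatalog extends PartitioningAwareFileCatalog`
- `ListingFileCatalog extends PartitioningAwareFileCatalog`
- `PartitioningAwareFileCatalog extends FileCatalog`
- `TableFileCatalog extends FileCatalog`
cc @cloud-fan @mallman
How was this patch tested?
Existing tests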
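
For readers who want to see the shape of the resulting tree, the following is a rough sketch of the four relationships listed in the PR description. Only the class names and `rootPaths` come from this thread. The `Path` alias, the constructor parameters, and the empty bodies are assumptions made to keep the example dependency-free, so it should not be read as the actual Spark source.

```scala
object FileCatalogHierarchySketch {
  // Assumption: stand-in for org.apache.hadoop.fs.Path so the sketch compiles alone.
  type Path = String

  // Base interface after the rename BasicFileCatalog => FileCatalog.
  trait FileCatalog {
    def rootPaths: Seq[Path]
  }

  // After the PR this class also absorbs what SessionFileCatalog used to hold.
  // Members are omitted here; only the inheritance relationship is shown.
  abstract class PartitioningAwareFileCatalog extends FileCatalog

  // Hypothetical constructors: the real classes take different arguments.
  class ListingFileCatalog(val rootPaths: Seq[Path])
    extends PartitioningAwareFileCatalog

  class MetadataLogFileCatalog(val rootPaths: Seq[Path])
    extends PartitioningAwareFileCatalog

  // TableFileCatalog extends FileCatalog directly, not the partitioning-aware class.
  class TableFileCatalog(val rootPaths: Seq[Path]) extends FileCatalog
}
```

Code that only needs the listing interface can accept a `FileCatalog`, while code that needs partition discovery (such as the `location.partitionSpec()` check in the test above) has to match on `PartitioningAwareFileCatalog`.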