Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-17974] Refactor FileCatalog classes to simplify the inheritance tree #15518

Closed
wants to merge 6 commits into from

Conversation

ericl
Copy link
Contributor

@ericl ericl commented Oct 17, 2016

What changes were proposed in this pull request?

This renames BasicFileCatalog => FileCatalog, combines SessionFileCatalog with PartitioningAwareFileCatalog, and removes the old FileCatalog trait.

In summary,

MetadataLogFileCatalog extends PartitioningAwareFileCatalog
ListingFileCatalog extends PartitioningAwareFileCatalog
PartitioningAwareFileCatalog extends FileCatalog
TableFileCatalog extends FileCatalog

cc @cloud-fan @mallman

How was this patch tested?

Existing tests

* single root path from which partitions are discovered, or individual partitions may be
* specified by each path.
*/
def rootPaths: Seq[Path]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's "root" about this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say "pretty much nothing" anymore.

In an earlier version, it was the "root" path of the table, excluding any partition dirs. The PR drifted away from that definition.

Now I'd say it could be reverted to paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they are the roots of the recursive traversal to discover leaf files. I am inclined to keep rootPaths since it's easy to grep for, vs paths which is too common.

@SparkQA
Copy link

SparkQA commented Oct 17, 2016

Test build #67084 has finished for PR 15518 at commit 5d390c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* A collection of data files from a partitioned relation, along with the partition values in the
* form of an [[InternalRow]].
*/
case class Partition(values: InternalRow, files: Seq[FileStatus])
Copy link
Contributor

@rxin rxin Oct 17, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while you are doing it, perhaps we can rename this TaskPartition?

I always find "Partition" very confusing because it can mean 3 different things:

  1. A block in HDFS.
  2. A Hive partition.
  3. A Spark task partition (or split).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these are the partitions tasks see. Confusingly, those seem to be named FilePartition. How about PartitionFiles?

assert(location.partitionSpec === PartitionSpec.emptySpec)
case LogicalRelation(
HadoopFsRelation(location: PartitioningAwareFileCatalog, _, _, _, _, _), _, _) =>
assert(location.partitionSpec() === PartitionSpec.emptySpec)
}.getOrElse {
fail(s"Expecting a ParquetRelation2, but got:\n$queryExecution")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not expecting a ParquetRelation2 anymore—more like a HadoopFsRelation with a PartitioningAwareFileCatalog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

@ericl
Copy link
Contributor Author

ericl commented Oct 17, 2016

Alright renamed PartitionFiles => PartitionDirectory, and PartitionDirectory => PartitionPath.

@SparkQA
Copy link

SparkQA commented Oct 17, 2016

Test build #67095 has finished for PR 15518 at commit 0d0954d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class PartitionDirectory(values: InternalRow, files: Seq[FileStatus])
    • case class PartitionPath(values: InternalRow, path: Path)

@SparkQA
Copy link

SparkQA commented Oct 18, 2016

Test build #67092 has finished for PR 15518 at commit 43c4374.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class PartitionFiles(values: InternalRow, files: Seq[FileStatus])

@rxin
Copy link
Contributor

rxin commented Oct 18, 2016

LGTM - merging in master.

@asfgit asfgit closed this in 8daa1a2 Oct 18, 2016
@rxin
Copy link
Contributor

rxin commented Oct 18, 2016

Oops - just realized the tests for the latest commit failed. I will revert the patch.

@yhuai
Copy link
Contributor

yhuai commented Oct 18, 2016

Seems this fails the scala style check.

ghost pushed a commit to dbtsai/spark that referenced this pull request Oct 18, 2016
…eritance tree

## What changes were proposed in this pull request?

This renames `BasicFileCatalog => FileCatalog`, combines  `SessionFileCatalog` with `PartitioningAwareFileCatalog`, and removes the old `FileCatalog` trait.

In summary,
```
MetadataLogFileCatalog extends PartitioningAwareFileCatalog
ListingFileCatalog extends PartitioningAwareFileCatalog
PartitioningAwareFileCatalog extends FileCatalog
TableFileCatalog extends FileCatalog
```

(note that this is a re-submission of apache#15518 which got reverted)

## How was this patch tested?

Existing tests

Author: Eric Liang <ekl@databricks.com>

Closes apache#15533 from ericl/fix-scalastyle-revert.
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
…e tree

## What changes were proposed in this pull request?

This renames `BasicFileCatalog => FileCatalog`, combines  `SessionFileCatalog` with `PartitioningAwareFileCatalog`, and removes the old `FileCatalog` trait.

In summary,
```
MetadataLogFileCatalog extends PartitioningAwareFileCatalog
ListingFileCatalog extends PartitioningAwareFileCatalog
PartitioningAwareFileCatalog extends FileCatalog
TableFileCatalog extends FileCatalog
```

cc cloud-fan mallman

## How was this patch tested?

Existing tests

Author: Eric Liang <ekl@databricks.com>

Closes apache#15518 from ericl/refactor-session-file-catalog.
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
…eritance tree

## What changes were proposed in this pull request?

This renames `BasicFileCatalog => FileCatalog`, combines  `SessionFileCatalog` with `PartitioningAwareFileCatalog`, and removes the old `FileCatalog` trait.

In summary,
```
MetadataLogFileCatalog extends PartitioningAwareFileCatalog
ListingFileCatalog extends PartitioningAwareFileCatalog
PartitioningAwareFileCatalog extends FileCatalog
TableFileCatalog extends FileCatalog
```

(note that this is a re-submission of apache#15518 which got reverted)

## How was this patch tested?

Existing tests

Author: Eric Liang <ekl@databricks.com>

Closes apache#15533 from ericl/fix-scalastyle-revert.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…e tree

## What changes were proposed in this pull request?

This renames `BasicFileCatalog => FileCatalog`, combines  `SessionFileCatalog` with `PartitioningAwareFileCatalog`, and removes the old `FileCatalog` trait.

In summary,
```
MetadataLogFileCatalog extends PartitioningAwareFileCatalog
ListingFileCatalog extends PartitioningAwareFileCatalog
PartitioningAwareFileCatalog extends FileCatalog
TableFileCatalog extends FileCatalog
```

cc cloud-fan mallman

## How was this patch tested?

Existing tests

Author: Eric Liang <ekl@databricks.com>

Closes apache#15518 from ericl/refactor-session-file-catalog.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…eritance tree

## What changes were proposed in this pull request?

This renames `BasicFileCatalog => FileCatalog`, combines  `SessionFileCatalog` with `PartitioningAwareFileCatalog`, and removes the old `FileCatalog` trait.

In summary,
```
MetadataLogFileCatalog extends PartitioningAwareFileCatalog
ListingFileCatalog extends PartitioningAwareFileCatalog
PartitioningAwareFileCatalog extends FileCatalog
TableFileCatalog extends FileCatalog
```

(note that this is a re-submission of apache#15518 which got reverted)

## How was this patch tested?

Existing tests

Author: Eric Liang <ekl@databricks.com>

Closes apache#15533 from ericl/fix-scalastyle-revert.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants