Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18028][SQL] simplify TableFileCatalog #15568

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Oct 20, 2016

What changes were proposed in this pull request?

Simplify/cleanup TableFileCatalog:

  1. pass a CatalogTable instead of databaseName and tableName into TableFileCatalog, so that we don't need to fetch table metadata from metastore again
  2. In TableFileCatalog.filterPartitions0, DO NOT set PartitioningAwareFileCatalog.BASE_PATH_PARAM. According to the classdoc, the default value of basePath already satisfies our need. What's more, if we set this parameter, we may break the case 2 which is metioned in the classdoc.
  3. add equals and hashCode to TableFileCatalog
  4. add SessionCatalog.listPartitionsByFilter which handles case sensitivity.

How was this patch tested?

existing tests.

@cloud-fan
Copy link
Contributor Author

cc @ericl @mallman @yhuai

@SparkQA
Copy link

SparkQA commented Oct 20, 2016

Test build #67265 has finished for PR 15568 at commit bd808af.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 20, 2016

Test build #67266 has finished for PR 15568 at commit c98c4f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mallman
Copy link
Contributor

mallman commented Oct 20, 2016

@cloud-fan, I don't know what you mean in item 4 about SessionCatalog.listPartitionsByFilter handling case-sensitivity. What case-sensitivity issue are you referring to, and does this PR handle it differently?

*/
def listPartitionsByFilter(
tableName: TableIdentifier,
predicates: Seq[Expression]): Seq[CatalogTablePartition] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this, @cloud-fan ! I think I can use this in #15302 .

@@ -102,6 +95,13 @@ class TableFileCatalog(
}

override def inputFiles: Array[String] = allPartitions.inputFiles

override def equals(o: Any): Boolean = o match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this needed for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under hive context, we will cache the LogicalRelation for every data source table(including converted from hive), which means every table will always have a TableFileCatalog of same instance.

However, it's not true in sql core. We will re-construct the TableFileCatalog and LogicalRelation everytime we look up a table. Thus we may encounter cache miss even if the table is cached, because TableFileCatalog of difference instances never equal to each other.

Although it's not a real problem now, I think it's reasonable to follow ListFileCatalong and add the equals and hashCode

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Could we add a unit test for this and a comment here?

@ericl
Copy link
Contributor

ericl commented Oct 20, 2016

LGTM, but had a question on why we need to override equals().

@cloud-fan
Copy link
Contributor Author

cloud-fan commented Oct 21, 2016

@mallman , item 4 is a potential problem in the future. The current workflow is, we get the MetastoreRelation via HiveMetastoreCatalog.lookupRelation, which always lower case the database and table name. Then we construct TableFileCatalog and call ExternalCatalog.listPartitionsByFilter with the database and table name. So we won't have case senstivity problem here.

However, we may have a workflow which call listPartitionsByFilter directly with database and table name given by users. e.g. #15302. Then we need to care about case sensitivity.

BTW, we also have SessionCatalog.listParttions, I think it's reasonable to also put listPartitionsByFilter there

@SparkQA
Copy link

SparkQA commented Oct 24, 2016

Test build #67444 has finished for PR 15568 at commit c4a906f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl
Copy link
Contributor

ericl commented Oct 24, 2016

lgtm, thanks for adding the test!

@cloud-fan
Copy link
Contributor Author

thanks for the review, merging to master!

@asfgit asfgit closed this in 84a3399 Oct 25, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
## What changes were proposed in this pull request?

Simplify/cleanup TableFileCatalog:

1. pass a `CatalogTable` instead of `databaseName` and `tableName` into `TableFileCatalog`, so that we don't need to fetch table metadata from metastore again
2. In `TableFileCatalog.filterPartitions0`, DO NOT set `PartitioningAwareFileCatalog.BASE_PATH_PARAM`. According to the [classdoc](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L189-L209), the default value of `basePath` already satisfies our need. What's more, if we set this parameter, we may break the case 2 which is metioned in the classdoc.
3. add `equals` and `hashCode` to `TableFileCatalog`
4. add `SessionCatalog.listPartitionsByFilter` which handles case sensitivity.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#15568 from cloud-fan/table-file-catalog.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

Simplify/cleanup TableFileCatalog:

1. pass a `CatalogTable` instead of `databaseName` and `tableName` into `TableFileCatalog`, so that we don't need to fetch table metadata from metastore again
2. In `TableFileCatalog.filterPartitions0`, DO NOT set `PartitioningAwareFileCatalog.BASE_PATH_PARAM`. According to the [classdoc](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L189-L209), the default value of `basePath` already satisfies our need. What's more, if we set this parameter, we may break the case 2 which is metioned in the classdoc.
3. add `equals` and `hashCode` to `TableFileCatalog`
4. add `SessionCatalog.listPartitionsByFilter` which handles case sensitivity.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#15568 from cloud-fan/table-file-catalog.
new PrunedTableFileCatalog(
sparkSession, new Path(baseLocation.get), fileStatusCache, partitionSpec)
} else {
new ListingFileCatalog(sparkSession, rootPaths, table.storage.properties, None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cloud-fan, here I have a question why should we remove the fileStatusCache for ListingFileCatalog? Do we need to add it back?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems it was a mistake. Can you send a PR to add it? thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sure. Thanks for your answer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants