[SPARK-18028][SQL] simplify TableFileCatalog #15568

cloud-fan · 2016-10-20T13:16:05Z

What changes were proposed in this pull request?

Simplify/cleanup TableFileCatalog:

1. Pass a CatalogTable instead of databaseName and tableName into TableFileCatalog, so that we don't need to fetch the table metadata from the metastore again.
2. In TableFileCatalog.filterPartitions0, DO NOT set PartitioningAwareFileCatalog.BASE_PATH_PARAM. According to the classdoc, the default value of basePath already satisfies our need. What's more, if we set this parameter, we may break case 2, which is mentioned in the classdoc.
3. Add equals and hashCode to TableFileCatalog.
4. Add SessionCatalog.listPartitionsByFilter, which handles case sensitivity.

How was this patch tested?

existing tests.

cloud-fan · 2016-10-20T13:18:15Z

cc @ericl @mallman @yhuai

SparkQA · 2016-10-20T13:22:18Z

Test build #67265 has finished for PR 15568 at commit bd808af.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-20T15:49:02Z

Test build #67266 has finished for PR 15568 at commit c98c4f6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mallman · 2016-10-20T18:19:56Z

@cloud-fan, I don't know what you mean in item 4 about SessionCatalog.listPartitionsByFilter handling case-sensitivity. What case-sensitivity issue are you referring to, and does this PR handle it differently?

dongjoon-hyun · 2016-10-20T19:32:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala

+   */
+  def listPartitionsByFilter(
+      tableName: TableIdentifier,
+      predicates: Seq[Expression]): Seq[CatalogTablePartition] = {


Thank you for adding this, @cloud-fan ! I think I can use this in #15302 .

ericl · 2016-10-20T21:54:15Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/TableFileCatalog.scala

@@ -102,6 +95,13 @@ class TableFileCatalog(
  }

  override def inputFiles: Array[String] = allPartitions.inputFiles
+
+  override def equals(o: Any): Boolean = o match {


What is this needed for?

Under the Hive context, we cache the LogicalRelation for every data source table (including those converted from Hive), which means every table always has the same TableFileCatalog instance.

However, that's not true in sql core. We reconstruct the TableFileCatalog and LogicalRelation every time we look up a table. Thus we may hit a cache miss even if the table is cached, because different TableFileCatalog instances are never equal to each other.

Although it's not a real problem now, I think it's reasonable to follow ListingFileCatalog and add equals and hashCode.

I see. Could we add a unit test for this and a comment here?
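
The cache-miss concern in this thread can be illustrated with a minimal, Spark-free sketch. Everything here is hypothetical and for illustration only: `TableId`, `CatalogByReference`, `CatalogByValue`, and the `HashMap` standing in for a cache are not Spark classes, and the fields the actual patch compares in `TableFileCatalog.equals` may differ.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a table identity; not Spark's TableIdentifier.
final case class TableId(database: String, table: String)

// No equals/hashCode override: instances compare by reference only.
class CatalogByReference(val table: TableId)

// equals/hashCode defined over the table identity (an assumption made for this sketch).
class CatalogByValue(val table: TableId) {
  override def equals(o: Any): Boolean = o match {
    case other: CatalogByValue => other.table == table
    case _ => false
  }
  override def hashCode(): Int = table.hashCode
}

object EqualsSketch {
  val id = TableId("default", "t")

  // Maps keyed by catalog instance, loosely mimicking a cache lookup by equality.
  val refCache = mutable.HashMap[CatalogByReference, String](new CatalogByReference(id) -> "cached")
  val valCache = mutable.HashMap[CatalogByValue, String](new CatalogByValue(id) -> "cached")

  // Look up with a freshly constructed instance for the same table.
  val refHit: Boolean = refCache.contains(new CatalogByReference(id))
  val valHit: Boolean = valCache.contains(new CatalogByValue(id))
}
```

In this sketch `refHit` is false and `valHit` is true: a rebuilt instance only finds the cached entry when equality is defined over the table rather than the object reference.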

ericl · 2016-10-20T21:54:45Z

LGTM, but had a question on why we need to override equals().

cloud-fan · 2016-10-21T05:54:00Z

@mallman, item 4 is a potential problem in the future. The current workflow is: we get the MetastoreRelation via HiveMetastoreCatalog.lookupRelation, which always lowercases the database and table name. Then we construct TableFileCatalog and call ExternalCatalog.listPartitionsByFilter with the database and table name. So we won't have a case sensitivity problem here.

However, we may have a workflow that calls listPartitionsByFilter directly with a database and table name given by users, e.g. #15302. Then we need to care about case sensitivity.

BTW, we also have SessionCatalog.listPartitions, so I think it's reasonable to also put listPartitionsByFilter there.
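
As a rough illustration of why a session-level wrapper helps, here is a minimal sketch that normalizes user-supplied names before delegating to a catalog that matches names exactly. All types here (`TableName`, `ToyExternalCatalog`, `ToySessionCatalog`) and the string-based partition filter are hypothetical simplifications, not Spark's actual classes or signatures.

```scala
import java.util.Locale

// Hypothetical simplified identifier; Spark's TableIdentifier is richer.
final case class TableName(database: Option[String], table: String)

// Hypothetical external catalog that matches database and table names exactly.
class ToyExternalCatalog(partitions: Map[(String, String), Seq[String]]) {
  def listPartitionsByFilter(db: String, table: String, keep: String => Boolean): Seq[String] =
    partitions.getOrElse((db, table), Seq.empty).filter(keep)
}

// Hypothetical session-level wrapper that resolves and normalizes names first.
class ToySessionCatalog(external: ToyExternalCatalog, currentDb: String, caseSensitive: Boolean) {
  private def normalize(name: String): String =
    if (caseSensitive) name else name.toLowerCase(Locale.ROOT)

  def listPartitionsByFilter(name: TableName, keep: String => Boolean): Seq[String] = {
    val db = normalize(name.database.getOrElse(currentDb))
    val table = normalize(name.table)
    external.listPartitionsByFilter(db, table, keep)
  }
}

object CaseSketch {
  val external = new ToyExternalCatalog(
    Map(("sales", "orders") -> Seq("ds=2016-10-20", "ds=2016-10-21")))
  val session = new ToySessionCatalog(external, currentDb = "default", caseSensitive = false)

  // Calling the external catalog directly with user-supplied casing finds nothing.
  val direct: Seq[String] = external.listPartitionsByFilter("Sales", "Orders", _ => true)

  // Going through the session wrapper normalizes the names before the lookup.
  val viaSession: Seq[String] =
    session.listPartitionsByFilter(TableName(Some("Sales"), "Orders"), _.endsWith("21"))
}
```

The point of the design is that callers passing user input do not each have to remember to normalize names; the wrapper does it in one place.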

SparkQA · 2016-10-24T10:30:32Z

Test build #67444 has finished for PR 15568 at commit c4a906f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ericl · 2016-10-24T22:58:16Z

lgtm, thanks for adding the test!

cloud-fan · 2016-10-25T00:42:50Z

thanks for the review, merging to master!

## What changes were proposed in this pull request?

Simplify/cleanup TableFileCatalog:

1. pass a `CatalogTable` instead of `databaseName` and `tableName` into `TableFileCatalog`, so that we don't need to fetch table metadata from metastore again
2. In `TableFileCatalog.filterPartitions0`, DO NOT set `PartitioningAwareFileCatalog.BASE_PATH_PARAM`. According to the [classdoc](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L189-L209), the default value of `basePath` already satisfies our need. What's more, if we set this parameter, we may break case 2, which is mentioned in the classdoc.
3. add `equals` and `hashCode` to `TableFileCatalog`
4. add `SessionCatalog.listPartitionsByFilter` which handles case sensitivity.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#15568 from cloud-fan/table-file-catalog.

ConeyLiu · 2019-01-29T06:08:59Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/TableFileCatalog.scala

+      new PrunedTableFileCatalog(
+        sparkSession, new Path(baseLocation.get), fileStatusCache, partitionSpec)
+    } else {
+      new ListingFileCatalog(sparkSession, rootPaths, table.storage.properties, None)


Hi @cloud-fan, I have a question here: why was the fileStatusCache removed for ListingFileCatalog? Do we need to add it back?

Seems it was a mistake. Can you send a PR to add it? Thanks!

OK, sure. Thanks for your answer.

cloud-fan force-pushed the table-file-catalog branch from bd808af to c98c4f6 Compare October 20, 2016 13:36

dongjoon-hyun reviewed Oct 20, 2016

ericl reviewed Oct 20, 2016

cloud-fan added 2 commits October 24, 2016 14:47

simplify TableFileCatalog

0bd59fb

address comments

c4a906f

cloud-fan force-pushed the table-file-catalog branch from c98c4f6 to c4a906f Compare October 24, 2016 08:13

asfgit closed this in 84a3399 Oct 25, 2016

ConeyLiu reviewed Jan 29, 2019
