[SPARK-20194] Add support for partition pruning to in-memory catalog #17510

adrian-ionescu · 2017-04-02T14:43:15Z

What changes were proposed in this pull request?

This patch implements listPartitionsByFilter() for InMemoryCatalog and thus resolves an outstanding TODO causing the PruneFileSourcePartitions optimizer rule not to apply when "spark.sql.catalogImplementation" is set to "in-memory" (which is the default).

The change is straightforward: it extracts the code for further filtering of the list of partitions returned by the metastore's getPartitionsByFilter() out from HiveExternalCatalog into ExternalCatalogUtils and calls this new function from InMemoryCatalog on the whole list of partitions.

Now that this method is implemented we can always pass the CatalogTable to the DataSource in FindDataSourceTable, so that the latter is resolved to a relation with a CatalogFileIndex, which is what the PruneFileSourcePartitions rule matches for.
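
As a rough illustration of the refactoring described above, the sketch below models a shared pruning helper that an in-memory catalog could call on its full list of partitions. The types `Partition` and `Predicate` and the object `PruningSketch` are simplified stand-ins invented for this example. They are not Spark's `CatalogTablePartition` or `Expression` classes, and the code does not reproduce what the PR actually adds to ExternalCatalogUtils.

```scala
// Simplified stand-ins (assumptions for this sketch, not Spark classes).
final case class Partition(spec: Map[String, String])

// A predicate lists the columns it references and can be evaluated
// against a partition spec.
final case class Predicate(
    references: Set[String],
    eval: Map[String, String] => Boolean)

object PruningSketch {
  // Shared helper: keeps only the partitions that satisfy every predicate.
  // A catalog without server-side filtering can pass in all its partitions.
  def prunePartitionsByFilter(
      partitionColumnNames: Set[String],
      partitions: Seq[Partition],
      predicates: Seq[Predicate]): Seq[Partition] = {
    if (predicates.isEmpty) {
      partitions
    } else {
      // Reject predicates that touch non-partition columns. `require` is
      // used here only to keep the sketch free of Spark dependencies.
      val nonPartitionPredicates =
        predicates.filterNot(_.references.subsetOf(partitionColumnNames))
      require(
        nonPartitionPredicates.isEmpty,
        "Expected only partition pruning predicates")
      partitions.filter(p => predicates.forall(_.eval(p.spec)))
    }
  }
}
```

With partitions `a=1` and `a=2` and a predicate that checks `a == "1"`, the helper would return only the first partition. With no predicates, it returns the input unchanged.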

How was this patch tested?

Ran existing tests and added a new test for listPartitionsByFilter in ExternalCatalogSuite, which is subclassed by both InMemoryCatalogSuite and HiveExternalCatalogSuite.

gatorsmile · 2017-04-02T15:23:06Z

ok to test

gatorsmile · 2017-04-02T17:00:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala

+        _.references.map(_.name).toSet.subsetOf(partitionColumnNames)
+      }
+      if (nonPartitionPruningPredicates.nonEmpty) {
+        sys.error("Expected only partition pruning predicates: " + nonPartitionPruningPredicates)


Nit: Throwing an AnalysisException is preferred.

Could you add the negative test cases in your newly added test cases?
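
A minimal sketch of what this review comment asks for: raise a dedicated exception type instead of calling `sys.error`, and exercise the failure path in a negative test. `PruningAnalysisException` and `PredicateCheck` are hypothetical names introduced so that the example has no Spark dependency. The actual change would throw Spark's `AnalysisException`, and the real test lives in ExternalCatalogSuite.

```scala
// Hypothetical stand-in for Spark's AnalysisException (assumption).
final class PruningAnalysisException(message: String)
  extends Exception(message)

object PredicateCheck {
  // Each predicate is represented only by the column names it references.
  def requirePartitionPredicates(
      partitionColumnNames: Set[String],
      predicateReferences: Seq[Set[String]]): Unit = {
    val offending =
      predicateReferences.filterNot(_.subsetOf(partitionColumnNames))
    if (offending.nonEmpty) {
      throw new PruningAnalysisException(
        s"Expected only partition pruning predicates: $offending")
    }
  }

  // Negative case: column "c" is not a partition column, so the check
  // is expected to fail with the dedicated exception type.
  def negativeCaseFails(): Boolean = {
    val result = scala.util.Try(
      requirePartitionPredicates(Set("a", "b"), Seq(Set("a"), Set("c"))))
    result.failed.toOption.exists(_.isInstanceOf[PruningAnalysisException])
  }
}
```

A typed exception lets the test assert on the failure mode itself rather than on a generic `RuntimeException`.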

gatorsmile · 2017-04-02T17:01:51Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala

@@ -436,6 +438,37 @@ abstract class ExternalCatalogSuite extends SparkFunSuite with BeforeAndAfterEac
    assert(catalog.listPartitions("db2", "tbl2", Some(Map("a" -> "unknown"))).isEmpty)
  }

+  test("list partitions by filter") {
+    val tz = TimeZone.getDefault().getID()


Nit: val tz = TimeZone.getDefault.getID

gatorsmile · 2017-04-02T17:17:16Z

Not related to this PR, but it looks like we have a bug in HiveTableScans: the order of predicates matters. We should not prune partitions if any non-deterministic predicates are placed before the partition-pruning predicates. cc @cloud-fan @hvanhovell

SparkQA · 2017-04-02T17:52:58Z

Test build #75461 has finished for PR 17510 at commit 3b031c7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-04-02T21:09:50Z

Test build #75464 has finished for PR 17510 at commit 3ad0327.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-04-03T00:44:07Z

LGTM cc @cloud-fan @hvanhovell @ericl

cloud-fan · 2017-04-03T09:37:45Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

-    if (predicates.nonEmpty) {
-      val clientPrunedPartitions = client.getPartitionsByFilter(rawTable, predicates).map { part =>
+    val clientPrunedPartitions =
+      client.getPartitionsByFilter(rawTable, predicates).map { part =>


If predicates.isEmpty, the previous code would run client.getPartitions. Can you double-check that there is no performance regression?

A similar optimization is done in the function itself: client.getPartitionsByFilter(), while Hive Shim_v0_12 delegates to getAllPartitions anyway.

I've now made the only piece of non-trivial code along that path lazy, so I think we're good.
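
The reply above does not say which piece of code was made lazy, so the following is only a generic sketch of the idea: a Scala `lazy val` defers a setup step until it is first used, so the path with no predicates never pays for it. `LazySketch`, `prune`, `normalizedSpecs`, and the counter `setupRuns` are all hypothetical names for this illustration.

```scala
object LazySketch {
  // Counts how often the (hypothetical) expensive setup actually runs.
  var setupRuns: Int = 0

  def prune(
      partitions: Seq[Map[String, String]],
      wanted: Option[(String, String)]): Seq[Map[String, String]] = {
    // Evaluated at most once per call, and only if a filter is present.
    lazy val normalizedSpecs: Seq[Map[String, String]] = {
      setupRuns += 1
      partitions.map(_.map { case (k, v) => k.toLowerCase -> v })
    }
    wanted match {
      case None => partitions // no predicate: setup is skipped entirely
      case Some((column, value)) =>
        normalizedSpecs.filter(_.get(column.toLowerCase).contains(value))
    }
  }
}
```

Calling `prune` with `None` leaves `setupRuns` untouched, while a call with a filter increments it once.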

cloud-fan · 2017-04-03T09:38:35Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala

+    val catalog = newBasicCatalog()
+
+    def checkAnswer(table: CatalogTable, filters: Seq[Expression],
+        expected: Set[CatalogTablePartition]): Unit = {


nit: code style

def checkAnswer(
    param1: XX,
    param2: XX,
    ...

cloud-fan · 2017-04-03T09:38:59Z

LGTM except some minor comments

SparkQA · 2017-04-03T14:30:03Z

Test build #75479 has finished for PR 17510 at commit d6bdcf4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-04-03T15:49:19Z

Thanks! Merging to master.

adrian-ionescu · 2017-04-03T16:00:13Z

Thanks for moving so fast!

Implement listPartitionsByFilter() for InMemoryCatalog

3b031c7

gatorsmile reviewed Apr 2, 2017

Address review remarks: AnalysisException & negative test

3ad0327

cloud-fan reviewed Apr 3, 2017

Address more small review remarks.

d6bdcf4

asfgit closed this in 703c42c Apr 3, 2017
