[SPARK-19093][SQL] Cached tables are not used in SubqueryExpression #16493

dilipbiswal · 2017-01-07T00:33:40Z

What changes were proposed in this pull request?

Consider the plans inside subquery expressions when looking up the cache manager, so
that cached data is used there as well. Currently, CacheManager.useCachedData does not
consider the subquery expressions in the plan.

SQL

select * from rows where not exists (select * from rows)

Before the fix

== Optimized Logical Plan ==
Join LeftAnti
:- InMemoryRelation [_1#3775, _2#3776], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
:     +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
+- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
   +- Relation[_1#3775,_2#3776] parquet

After

== Optimized Logical Plan ==
Join LeftAnti
:- InMemoryRelation [_1#256, _2#257], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
:     +- *FileScan parquet [_1#256,_2#257] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
+- Project [_1#256 AS _1#256#298, _2#257 AS _2#257#299]
   +- InMemoryRelation [_1#256, _2#257], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *FileScan parquet [_1#256,_2#257] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>

Query 2

 SELECT * FROM t1
 WHERE
 c1 IN (SELECT c1 FROM t2 WHERE c1 IN (SELECT c1 FROM t3 WHERE c1 = 1))

Before

== Analyzed Logical Plan ==
c1: int
Project [c1#3]
+- Filter predicate-subquery#47 [(c1#3 = c1#10)]
   :  +- Project [c1#10]
   :     +- Filter predicate-subquery#46 [(c1#10 = c1#17)]
   :        :  +- Project [c1#17]
   :        :     +- Filter (c1#17 = 1)
   :        :        +- SubqueryAlias t3, `t3`
   :        :           +- Project [value#15 AS c1#17]
   :        :              +- LocalRelation [value#15]
   :        +- SubqueryAlias t2, `t2`
   :           +- Project [value#8 AS c1#10]
   :              +- LocalRelation [value#8]
   +- SubqueryAlias t1, `t1`
      +- Project [value#1 AS c1#3]
         +- LocalRelation [value#1]

== Optimized Logical Plan ==
Join LeftSemi, (c1#3 = c1#10)
:- InMemoryRelation [c1#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1
:     +- LocalTableScan [c1#3]
+- Project [value#8 AS c1#10]
   +- Join LeftSemi, (value#8 = c1#17)
      :- LocalRelation [value#8]
      +- Project [value#15 AS c1#17]
         +- Filter (value#15 = 1)
            +- LocalRelation [value#15]

After

== Analyzed Logical Plan ==
c1: int
Project [c1#3]
+- Filter predicate-subquery#47 [(c1#3 = c1#10)]
   :  +- Project [c1#10]
   :     +- Filter predicate-subquery#46 [(c1#10 = c1#17)]
   :        :  +- Project [c1#17]
   :        :     +- Filter (c1#17 = 1)
   :        :        +- SubqueryAlias t3, `t3`
   :        :           +- Project [value#15 AS c1#17]
   :        :              +- LocalRelation [value#15]
   :        +- SubqueryAlias t2, `t2`
   :           +- Project [value#8 AS c1#10]
   :              +- LocalRelation [value#8]
   +- SubqueryAlias t1, `t1`
      +- Project [value#1 AS c1#3]
         +- LocalRelation [value#1]

== Optimized Logical Plan ==
Join LeftSemi, (c1#3 = c1#10)
:- InMemoryRelation [c1#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1
:     +- LocalTableScan [c1#3]
+- Join LeftSemi, (c1#10 = c1#17)
   :- InMemoryRelation [c1#10], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t2
   :     +- LocalTableScan [c1#10]
   +- Filter (c1#17 = 1)
      +- InMemoryRelation [c1#17], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1
            +- LocalTableScan [c1#3]

How was this patch tested?

Added new tests in CachedTableSuite.

gatorsmile · 2017-01-07T02:06:51Z

Although the test cases can be improved, the code fix looks good to me. cc @JoshRosen @hvanhovell

gatorsmile · 2017-01-07T02:08:05Z

@dilipbiswal Could you post the nested subquery and the plan in the PR description? It can help the other reviewers understand the fix. Thanks!

gatorsmile · 2017-01-07T02:10:52Z

sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala

+      spark.catalog.uncacheTable("t1")
+      spark.catalog.uncacheTable("t2")
+      spark.catalog.uncacheTable("t3")
+      spark.catalog.uncacheTable("t4")


You can call clearCache() once instead of uncaching each table individually.

  override def afterEach(): Unit = {
    try {
      clearCache()
    } finally {
      super.afterEach()
    }
  }
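The try/finally shape of this suggestion can be sketched in plain Scala. This is a toy model only: `FakeCache`, `BaseSuite` and `CachingSuite` are hypothetical stand-ins invented for illustration, not Spark or ScalaTest classes. It shows that one `clearCache()` call empties every cached table and that the parent cleanup still runs.

```scala
import scala.collection.mutable

object AfterEachSketch {
  // Hypothetical stand-in for the session cache; not Spark's API.
  object FakeCache {
    val tables: mutable.Set[String] = mutable.Set.empty
    def cacheTable(name: String): Unit = { tables += name }
    def clearCache(): Unit = tables.clear()
  }

  // Hypothetical stand-in for the parent test suite.
  trait BaseSuite {
    var baseCleanedUp = false
    def afterEach(): Unit = { baseCleanedUp = true }
  }

  class CachingSuite extends BaseSuite {
    override def afterEach(): Unit = {
      try FakeCache.clearCache() // drops every cached table in one call
      finally super.afterEach()  // parent cleanup runs even if clearCache throws
    }
  }

  def main(args: Array[String]): Unit = {
    Seq("t1", "t2", "t3", "t4").foreach(FakeCache.cacheTable)
    val suite = new CachingSuite
    suite.afterEach()
    println(s"cache empty: ${FakeCache.tables.isEmpty}, base cleaned: ${suite.baseCleanedUp}")
  }
}
```

With cleanup centralized like this, individual tests would not need their own uncacheTable calls.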

How about this? @dilipbiswal

@gatorsmile Sorry, I missed this one. Will make the change.

SparkQA · 2017-01-07T02:48:34Z

Test build #70999 has finished for PR 16493 at commit f733f90.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-01-07T03:05:05Z

In the test suite, we can add a helper function like this to count the InMemoryRelation nodes, including those inside subquery plans:

  private def getNumInMemoryRelations(plan: LogicalPlan): Int = {
    var sum = plan.collect { case _: InMemoryRelation => 1 }.sum
    plan.transformAllExpressions {
      case e: SubqueryExpression =>
        sum += getNumInMemoryRelations(e.plan)
        e
    }
    sum
  }
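The counting idea in this helper can be tried without Spark on a toy plan model. Everything below (`Plan`, `Expr`, `InSubquery`, `countInMemory`) is a hypothetical simplification invented for illustration; it only mirrors the idea of counting cached relations in the operator tree and then recursing into plans held by subquery expressions.

```scala
object CountCachedSketch {
  // Toy expression and plan types; assumptions for illustration, not Catalyst classes.
  sealed trait Expr
  case class Lit(value: Int) extends Expr
  case class InSubquery(plan: Plan) extends Expr // expression that holds its own plan

  sealed trait Plan
  case class Scan(table: String) extends Plan
  case class InMemory(table: String) extends Plan // stands in for InMemoryRelation
  case class Filter(cond: Expr, child: Plan) extends Plan

  // Counts cached relations in the operator tree and inside subquery plans.
  def countInMemory(plan: Plan): Int = plan match {
    case InMemory(_) => 1
    case Scan(_)     => 0
    case Filter(cond, child) =>
      val inSubquery = cond match {
        case InSubquery(sub) => countInMemory(sub) // recurse into the nested plan
        case _               => 0
      }
      inSubquery + countInMemory(child)
  }

  def main(args: Array[String]): Unit = {
    // t1 filtered by a subquery over t2, which is itself filtered by a subquery over t3.
    val nested = Filter(
      InSubquery(Filter(InSubquery(InMemory("t3")), InMemory("t2"))),
      InMemory("t1"))
    println(countInMemory(nested))
  }
}
```

A count that walked only the operator tree would see just the outermost relation in this example, which is why the helper has to descend into the expressions.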

gatorsmile · 2017-01-07T03:05:50Z

sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala

+        case e: SubqueryExpression => cachedRelations += getCachedPlans(e.plan)
+          e
+      }
+      assert(cachedRelations.flatten.size == 4)


Then, this can be simplified to

assert(getNumInMemoryRelations(cachedPlan2) == 4)

@gatorsmile Thanks... I will make the change

gatorsmile · 2017-01-07T03:06:03Z

sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala

+      assert(
+        cachedPlan.collect {
+          case i: InMemoryRelation => i
+        }.size == 3)


Then, this can be simplified to

assert(getNumInMemoryRelations(cachedPlan) == 3)

gatorsmile · 2017-01-07T03:08:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala

+    }
+  }
+
+  private def useCachedDataInternal(plan: LogicalPlan): LogicalPlan = {


On second thought, we do not need to add a new function. We can combine them into a single function, like:

  /** Replaces segments of the given logical plan with cached versions where possible. */
  def useCachedData(plan: LogicalPlan): LogicalPlan = {
    val newPlan = plan transformDown {
      case currentFragment =>
        lookupCachedData(currentFragment)
          .map(_.cachedRepresentation.withOutput(currentFragment.output))
          .getOrElse(currentFragment)
    }

    newPlan transformAllExpressions {
      case s: SubqueryExpression => s.withNewPlan(useCachedData(s.plan))
    }
  }
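To see why the recursion into subquery expressions matters, here is a minimal sketch on a toy plan model. The types `Plan`, `Expr` and `Exists`, and the `cached` set of table names, are assumptions invented for illustration and are not Catalyst or CacheManager APIs. Unlike the two-pass version above, the sketch uses a single recursion, which is enough for this small model.

```scala
object UseCachedDataSketch {
  // Toy expression and plan types; assumptions for illustration, not Catalyst classes.
  sealed trait Expr
  case class Lit(value: Int) extends Expr
  case class Exists(plan: Plan) extends Expr // subquery expression holding a plan

  sealed trait Plan
  case class Scan(table: String) extends Plan
  case class InMemory(table: String) extends Plan // stands in for InMemoryRelation
  case class Filter(cond: Expr, child: Plan) extends Plan

  // Replaces scans of cached tables, both in the operator tree and in subquery plans.
  def useCachedData(plan: Plan, cached: Set[String]): Plan = plan match {
    case Scan(t) if cached(t) => InMemory(t)
    case Filter(cond, child) =>
      val newCond = cond match {
        case Exists(sub) => Exists(useCachedData(sub, cached)) // the step the fix adds
        case other       => other
      }
      Filter(newCond, useCachedData(child, cached))
    case other => other
  }

  def main(args: Array[String]): Unit = {
    // Rough shape of: select * from rows where exists (select * from rows)
    val plan = Filter(Exists(Scan("rows")), Scan("rows"))
    println(useCachedData(plan, Set("rows")))
  }
}
```

Without the `Exists` case, only the outer scan would be swapped for the cached relation, which corresponds to the "Before" plans in the PR description.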

@gatorsmile Sure

@gatorsmile Thank you very much. I have addressed your comments.

gatorsmile · 2017-01-07T04:00:03Z

sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala

+      assert(
+        cachedPlan.collect {
+          case i: InMemoryRelation => i
+        }.size == 2)


The same here.

gatorsmile · 2017-01-07T04:00:29Z

sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala

+
+  test("SPARK-19093 scalar and nested predicate query") {
+
+


Nit: remove these two lines

SparkQA · 2017-01-07T06:01:58Z

Test build #71004 has finished for PR 16493 at commit f9f0b01.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-07T06:54:49Z

Test build #71005 has finished for PR 16493 at commit 3c779d5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2017-01-07T07:07:34Z

retest this please

SparkQA · 2017-01-07T07:08:36Z

Test build #71006 has started for PR 16493 at commit 3c779d5.

gatorsmile · 2017-01-08T02:59:32Z

test this please

SparkQA · 2017-01-08T05:29:09Z

Test build #71027 has finished for PR 16493 at commit 3c779d5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-01-08T05:30:32Z

LGTM

gatorsmile · 2017-01-08T05:31:15Z

Also cc @rxin @cloud-fan

hvanhovell · 2017-01-08T22:08:06Z

LGTM

hvanhovell · 2017-01-08T22:08:22Z

Merging to master. Thanks!

dilipbiswal · 2017-01-09T00:39:25Z

Thank you very much @gatorsmile @hvanhovell

Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes apache#16493 from dilipbiswal/SPARK-19093.

…L] Backport Three Cache-related PRs to Spark 2.1

### What changes were proposed in this pull request?

Backport a few cache-related PRs:

---
[[SPARK-19093][SQL] Cached tables are not used in SubqueryExpression](#16493)

Consider the plans inside subquery expressions while looking up cache manager to make use of cached data. Currently CacheManager.useCachedData does not consider the subquery expressions in the plan.

---
[[SPARK-19736][SQL] refreshByPath should clear all cached plans with the specified path](#17064)

Catalog.refreshByPath can refresh the cache entry and the associated metadata for all dataframes (if any) that contain the given data source path. However, CacheManager.invalidateCachedPath doesn't clear all cached plans with the specified path. This causes some strange behaviors reported in SPARK-15678.

---
[[SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-cache all cached plans that refer to this table](#17097)

When un-caching a table, we should not only remove the cache entry for this table, but also un-cache any other cached plans that refer to this table. The following commands trigger the table uncache: `DropTableCommand`, `TruncateTableCommand`, `AlterTableRenameCommand`, `UncacheTableCommand`, `RefreshTable` and `InsertIntoHiveTable`

This PR also includes some refactors:
- use java.util.LinkedList to store the cache entries, so that it's safer to remove elements while iterating
- rename invalidateCache to recacheByPlan, which is more obvious about what it does.

### How was this patch tested?

N/A

Author: Xiao Li <gatorsmile@gmail.com>

Closes #17319 from gatorsmile/backport-17097.
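The LinkedList refactor mentioned in the backport can be illustrated with a small sketch of removing entries while iterating. `Entry` and `uncacheReferring` are hypothetical names invented here, and whether CacheManager removes entries through `Iterator.remove` exactly like this is an assumption; the sketch only shows the general technique on `java.util.LinkedList`.

```scala
import java.util.LinkedList

object LinkedListRemovalSketch {
  // Hypothetical stand-in for a cache entry; not Spark's CachedData class.
  final case class Entry(planDescription: String)

  // Removes every entry whose plan description mentions `table`; returns how many were removed.
  def uncacheReferring(entries: LinkedList[Entry], table: String): Int = {
    var removed = 0
    val it = entries.iterator()
    while (it.hasNext) {
      if (it.next().planDescription.contains(table)) {
        it.remove() // removing through the iterator is safe during traversal
        removed += 1
      }
    }
    removed
  }

  def main(args: Array[String]): Unit = {
    val entries = new LinkedList[Entry]()
    entries.add(Entry("scan t1"))
    entries.add(Entry("join t1 t2"))
    entries.add(Entry("scan t3"))
    val removed = uncacheReferring(entries, "t1")
    println(s"removed $removed, remaining ${entries.size}")
  }
}
```

Removing through the iterator avoids the ConcurrentModificationException that calling `entries.remove(...)` inside the loop could trigger.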

[SPARK-19093] Cached tables are not used in SubqueryExpression

f733f90

gatorsmile reviewed Jan 7, 2017

f9f0b01

gatorsmile reviewed Jan 7, 2017

3c779d5

asfgit closed this in 4351e62 Jan 8, 2017

nsyca mentioned this pull request Mar 6, 2017

[SPARK-18389][SQL] Disallow cyclic view reference #17152

Closed

gatorsmile mentioned this pull request Mar 16, 2017

[SPARK-19765][SPARK-18549][SPARK-19093][SPARK-19736][BACKPORT-2.1][SQL] Backport Three Cache-related PRs to Spark 2.1 #17319

Closed
