[SPARK-16174][SQL] Improve `OptimizeIn` optimizer to remove literal repetitions #13876

dongjoon-hyun · 2016-06-23T20:35:36Z

What changes were proposed in this pull request?

This PR improves the OptimizeIn optimizer rule to remove repeated literals from SQL IN predicates. The rule guards against user mistakes and can also optimize some queries, such as TPCDS-36.

Before

scala> sql("select state from (select explode(array('CA','TN')) state) where state in ('TN','TN','TN','TN','TN','TN','TN')").explain
== Physical Plan ==
*Filter state#6 IN (TN,TN,TN,TN,TN,TN,TN)
+- Generate explode([CA,TN]), false, false, [state#6]
   +- Scan OneRowRelation[]

After

scala> sql("select state from (select explode(array('CA','TN')) state) where state in ('TN','TN','TN','TN','TN','TN','TN')").explain
== Physical Plan ==
*Filter state#6 IN (TN)
+- Generate explode([CA,TN]), false, false, [state#6]
   +- Scan OneRowRelation[]
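
The Before/After plans above can be approximated outside Spark with a small toy model. This is only a sketch under stated assumptions: `Attr`, `Lit`, `In`, and `dedupLiterals` are hypothetical stand-ins written for illustration, not Catalyst's classes or the code in this PR.

```scala
object DedupInSketch {
  // Toy stand-ins for Catalyst expressions (hypothetical, not Spark's classes).
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Any) extends Expr
  final case class In(value: Expr, list: Seq[Expr]) extends Expr

  // Drop repeated literals from an IN list, keeping first-occurrence order.
  // A list containing anything other than literals is returned unchanged.
  def dedupLiterals(in: In): In =
    if (in.list.forall(_.isInstanceOf[Lit])) in.copy(list = in.list.distinct)
    else in

  def main(args: Array[String]): Unit = {
    // Mirrors: state IN ('TN','TN','TN','TN','TN','TN','TN')
    val before = In(Attr("state"), Seq.fill(7)(Lit("TN")))
    println(dedupLiterals(before))
  }
}
```

`distinct` relies on case-class equality here, which is enough for the toy model; the real rule works on Catalyst expressions and may differ in how it compares them.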

How was this patch tested?

Pass the Jenkins tests (including a new testcase).

SparkQA · 2016-06-23T22:21:11Z

Test build #61126 has finished for PR 13876 at commit e33b7f4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-06-23T22:32:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -793,6 +794,20 @@ object ConstantFolding extends Rule[LogicalPlan] {
 }

 /**
+ * Removes literal repetitions from IN predicate
+ */
+object RemoveLiteralRepetitionFromIn extends Rule[LogicalPlan] {


can this just go into OptimizeIn

also does this need to be literal specific

Thank you for review, @rxin .

Sure, I can merge this into OptimizeIn.

Also, it can be applied to deterministic expressions.
Here I'm focusing only on literals. May I handle both cases?

yea why don't we handle both

Sure! No problem.

SparkQA · 2016-06-24T01:05:59Z

Test build #61136 has finished for PR 13876 at commit e0239a0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-24T01:10:04Z

Test build #61135 has finished for PR 13876 at commit 6180daf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-24T01:11:31Z

Test build #61137 has finished for PR 13876 at commit 5a9f4ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-06-24T16:19:14Z

Hi, @rxin .
Could you review this PR again when you have some time?

dongjoon-hyun · 2016-06-25T11:16:46Z

Hi, @rxin .
Now, variable l is replaced with list.

SparkQA · 2016-06-25T13:10:20Z

Test build #61232 has finished for PR 13876 at commit cf7b869.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-06-25T23:32:59Z

Hi, @rxin .
For this OptimizeIn PR, please let me know if we need further optimization.
Thank you always.

SparkQA · 2016-06-27T07:01:35Z

Test build #61289 has finished for PR 13876 at commit 61552f4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-06-28T17:04:45Z

Hi, @rxin .
Could you review this OptimizeIn PR?

dongjoon-hyun · 2016-06-28T21:07:42Z

Hi, @rxin .
Do you want me to split this OptimizeIn into another file, too?

SparkQA · 2016-06-30T21:59:37Z

Test build #61563 has finished for PR 13876 at commit 53363e5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-06-30T23:40:30Z

Hi, @rxin .
Could you review this OptimizeIn again when you have some time?

dongjoon-hyun · 2016-07-02T19:39:57Z

Rebased to the master.

SparkQA · 2016-07-02T21:26:28Z

Test build #61662 has finished for PR 13876 at commit 63b3ecd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-05T18:27:49Z

Test build #61760 has finished for PR 13876 at commit 6628c7b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-07-06T17:18:59Z

Hi, @rxin .
Could you review this PR again?

rxin · 2016-07-06T17:35:19Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

-        val hSet = list.map(e => e.eval(EmptyRow))
-        InSet(v, HashSet() ++ hSet)
+      case i @ In(v, list) =>
+        val (deterministics, others) = list.partition(_.deterministic)


one question i have is how often do we see an in expression with some expressions being deterministic and some nondeterministic? if not, i'd just simplify this so we only do it if everything is deterministic.

I agree. In real-world queries, `case i @ In(v, list) if list.forall(_.deterministic)` will cover most cases.

I'll update like that. Thank you for review again!
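
A minimal sketch of the simplified guard discussed above, assuming a toy expression model: the rewrite fires only when every element of the list is deterministic, instead of partitioning the list into deterministic and non-deterministic parts. `Lit`, `Rand`, `Attr`, `In`, and `rewrite` are hypothetical names for illustration, not Catalyst's API.

```scala
object DeterministicGuardSketch {
  // Toy expression model; each node reports whether it is deterministic.
  sealed trait Expr { def deterministic: Boolean }
  final case class Lit(value: Any) extends Expr { val deterministic = true }
  final case class Attr(name: String) extends Expr { val deterministic = true }
  final case class Rand(seed: Long) extends Expr { val deterministic = false }
  final case class In(value: Expr, list: Seq[Expr]) extends Expr {
    def deterministic: Boolean = value.deterministic && list.forall(_.deterministic)
  }

  // Deduplicate only when the whole list is deterministic; otherwise leave
  // the expression untouched rather than handling a mixed list.
  def rewrite(e: Expr): Expr = e match {
    case i @ In(_, list) if list.forall(_.deterministic) => i.copy(list = list.distinct)
    case other => other
  }

  def main(args: Array[String]): Unit = {
    println(rewrite(In(Attr("a"), Seq(Lit(1), Lit(1), Lit(2)))))   // rewritten
    println(rewrite(In(Attr("a"), Seq(Rand(1L), Rand(1L)))))       // left as-is
  }
}
```

The all-or-nothing guard trades a little coverage for a single, easy-to-read case, which seems to be the trade-off the review comment is asking for.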

SparkQA · 2016-07-06T20:08:14Z

Test build #61860 has finished for PR 13876 at commit 314bc74.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-07-06T20:12:56Z

Hi, @rxin .
Now it's simplified to handle only the all-deterministic case, and it passed Jenkins again.
Thank you for the advice.

rxin · 2016-07-06T20:46:48Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+        } else if (newList.length < list.length) {
+          i.copy(v, newList)
+        } else { // newList.length == list.length
+          i


I think this can bring some performance concerns because we are doing a lot of work in order to return the original query, and given the optimizer is iterative, it would spend a lot of cycles just doing this.

Can we introduce a flag (lazy val) to the In expression to check whether it is optimizable? If it is not, then we shouldn't even go into the case. Something like

case class In(...) { lazy val inSetConvertable: Boolean = list.forall(_.deterministic) }

Sure. That sounds great. I'll fix soon.
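
To illustrate why a `lazy val` flag helps with an iterative optimizer, here is a small sketch under toy assumptions. The `scans` counter, the `Lit`/`Rand`/`In` classes, and `rule` are hypothetical; only the idea of caching the check on the expression instance comes from the suggestion above.

```scala
object LazyFlagSketch {
  // Counts how many times the IN list is actually scanned (for illustration).
  var scans = 0

  sealed trait Expr { def deterministic: Boolean }
  final case class Lit(value: Any) extends Expr { val deterministic = true }
  final case class Rand(seed: Long) extends Expr { val deterministic = false }

  final case class In(value: String, list: Seq[Expr]) {
    // Computed on first access, then cached for this instance.
    lazy val inSetConvertible: Boolean = { scans += 1; list.forall(_.deterministic) }
  }

  // The rule checks the cached flag first, so an expression that cannot be
  // optimized is rejected cheaply on every later pass.
  def rule(in: In): In =
    if (in.inSetConvertible && in.list.distinct.size < in.list.size)
      in.copy(list = in.list.distinct)
    else in

  def main(args: Array[String]): Unit = {
    val e = In("x", Seq(Rand(1L), Lit(1)))
    (1 to 5).foreach(_ => rule(e)) // simulate repeated optimizer passes
    println(s"list scanned $scans time(s)")
  }
}
```

The cache belongs to the instance, so it only pays off when the rule returns the original expression, which is exactly the case the review comment is worried about.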

rxin · 2016-07-07T00:21:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

@@ -132,6 +132,7 @@ case class In(value: Expression, list: Seq[Expression]) extends Predicate
  }

  override def children: Seq[Expression] = value +: list
+  lazy val inSetConvertible = children.forall(_.deterministic)


my bad - we should put newList.forall(_.isInstanceOf[Literal]) here too

mark the type explicitly since this is a public function

Oh, then the semantics are different. What you mean is just improving InSet.
My original PR was about removing all deterministic duplicates.

But, if that is your intention, Okay.

Ah, we need to update the PR/JIRA descriptions, too.

SparkQA · 2016-07-07T00:33:22Z

Test build #61875 has finished for PR 13876 at commit 23d6e30.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-07-07T00:47:24Z

Now the scope of this PR is much smaller, but I hope it still covers the majority of real queries.
Thank you for all the advice.

dongjoon-hyun · 2016-07-07T00:50:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

 */
 case class OptimizeIn(conf: CatalystConf) extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsDown {
-      case In(v, list) if !list.exists(!_.isInstanceOf[Literal]) &&


Oh, there is a regression here. Originally, v could be non-deterministic.
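
The difference between the two guards can be sketched with a toy model: a check over `children` (value plus list) rejects a non-deterministic value, while a check over the list alone does not care about the value. All names below (`Lit`, `Rand`, `allChildrenDeterministic`, `listAllLiterals`) are hypothetical and only illustrate the point of this comment.

```scala
object GuardScopeSketch {
  sealed trait Expr { def deterministic: Boolean }
  final case class Lit(value: Any) extends Expr { val deterministic = true }
  final case class Rand(seed: Long) extends Expr { val deterministic = false }

  final case class In(value: Expr, list: Seq[Expr]) {
    def children: Seq[Expr] = value +: list
    // Guard over value AND list: fails when the value is non-deterministic.
    def allChildrenDeterministic: Boolean = children.forall(_.deterministic)
    // Guard over the list only: the value may be any expression.
    def listAllLiterals: Boolean = list.forall(_.isInstanceOf[Lit])
  }

  def main(args: Array[String]): Unit = {
    val e = In(Rand(1L), Seq(Lit(1), Lit(1)))
    println(e.allChildrenDeterministic) // the children-based guard skips this expression
    println(e.listAllLiterals)          // the list-only guard still accepts it
  }
}
```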

SparkQA · 2016-07-07T01:02:29Z

Test build #61878 has finished for PR 13876 at commit 125036a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-07T02:35:09Z

Test build #61887 has finished for PR 13876 at commit 63a4a79.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-07T02:48:14Z

Test build #61888 has finished for PR 13876 at commit ccf972d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-07-07T05:54:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+          val hSet = newList.map(e => e.eval(EmptyRow))
+          InSet(v, HashSet() ++ hSet)
+        } else if (newList.size < list.size) {
+          expr.copy(value = v, list = newList)


you don't need to copy value here, do you?

Right, it's a whole value. We had better create the expression In.

sorry what i meant was ... we only need to do

expr.copy(list = newList)

Oops. My bad!
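
For reference, a case class `copy` call only needs the fields that change; the remaining fields default to the current values. A minimal sketch with a hypothetical two-field `In` (not Catalyst's class):

```scala
object CopySketch {
  // Hypothetical simplified shape of an IN expression.
  final case class In(value: String, list: Seq[Int])

  def main(args: Array[String]): Unit = {
    val expr = In("state", Seq(1, 1, 2))
    val newList = expr.list.distinct

    // Passing value explicitly is redundant: copy defaults it to expr.value.
    val verbose = expr.copy(value = expr.value, list = newList)
    val concise = expr.copy(list = newList)

    assert(verbose == concise)
    println(concise)
  }
}
```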

rxin · 2016-07-07T05:55:37Z

Looks pretty good.

cc @cloud-fan for another look.

cloud-fan · 2016-07-07T06:28:25Z

LGTM

dongjoon-hyun · 2016-07-07T06:30:58Z

Thank you for review, @cloud-fan .

SparkQA · 2016-07-07T07:23:03Z

Test build #61898 has finished for PR 13876 at commit eead3db.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-07-07T07:27:51Z

There is one failure in HiveSparkSubmitSuite. It seems to be unrelated to this change. I'll retry the test.

- SPARK-8020: set sql conf in spark conf *** FAILED *** (31 seconds, 981 milliseconds)

dongjoon-hyun · 2016-07-07T07:27:58Z

Retest this please.

SparkQA · 2016-07-07T07:57:01Z

Test build #61897 has finished for PR 13876 at commit f068e4b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-07T09:28:11Z

Test build #61901 has finished for PR 13876 at commit eead3db.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-07-07T09:34:57Z

This time, it passed as expected.

cloud-fan · 2016-07-07T11:48:15Z

thanks, merging to master!

dongjoon-hyun · 2016-07-07T19:54:48Z

Thank you for review and merging, @cloud-fan and @rxin .

rxin reviewed Jun 23, 2016

dongjoon-hyun changed the title ~~[SPARK-16174][SQL] Add RemoveLiteralRepetitionFromIn optimizer~~ [SPARK-16174][SQL] Improve OptimizeIn optimizer to remove deterministic repetitions Jun 23, 2016

dongjoon-hyun changed the title ~~[SPARK-16174][SQL] Improve OptimizeIn optimizer to remove deterministic repetitions~~ [SPARK-16174][SQL] Improve OptimizeIn optimizer to remove deterministic repetitions Jun 28, 2016

dongjoon-hyun added 5 commits July 5, 2016 09:46

[SPARK-16174][SQL] Add RemoveLiteralRepetitionFromIn optimizer

74d107d

Merge into OptimizeIn optimizer and handle all deterministic cases.

779ade8

Change testcase name and fix typo.

e99fa4a

Update optimizer descriptions.

4f54c64

According to SPARK-16081, replace l with list.

6628c7b

rxin reviewed Jul 6, 2016

Simplify the logic to handle only all-deterministic cases.

314bc74

rxin reviewed Jul 6, 2016

rxin reviewed Jul 7, 2016

Optimize only literals.

63a4a79

dongjoon-hyun changed the title ~~[SPARK-16174][SQL] Improve OptimizeIn optimizer to remove deterministic repetitions~~ [SPARK-16174][SQL] Improve OptimizeIn optimizer to remove literal repetitions Jul 7, 2016

dongjoon-hyun reviewed Jul 7, 2016

Fix regression.

ccf972d

rxin reviewed Jul 7, 2016

dongjoon-hyun added 2 commits July 6, 2016 23:01

Replace copy.

f068e4b

Fix again.

eead3db

asfgit closed this in a04cab8 Jul 7, 2016

dongjoon-hyun deleted the SPARK-16174 branch July 20, 2016 07:43
