Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-13658][SQL] BooleanSimplification rule is slow with large boolean expressions #11647

Closed
wants to merge 4 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Mar 11, 2016

JIRA: https://issues.apache.org/jira/browse/SPARK-13658

What changes were proposed in this pull request?

Quoted from JIRA description: When run TPCDS Q3 [1] with lots predicates to filter out the partitions, the optimizer rule BooleanSimplification take about 2 seconds (it use lots of sematicsEqual, which require copy the whole tree).

It will great if we could speedup it.

[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql

How to speed up it:

When we ask the canonicalized expression in Expression, it calls Canonicalize.execute on itself. Canonicalize.execute basically transforms up all expressions included in this expression. However, we don't keep the canonicalized versions for these children expressions. So in next time we ask the canonicalized expressions for the children expressions (e.g., BooleanSimplification), we will rerun Canonicalize.execute on each of them. It wastes much time.

By forcing the children expressions to get and keep their canonicalized versions first, we can avoid re-canonicalize these expressions.

I simply benchmark it with an expression which is part of the where clause in TPCDS Q3:

val testRelation = LocalRelation('ss_sold_date_sk.int, 'd_moy.int, 'i_manufact_id.int, 'ss_item_sk.string, 'i_item_sk.string, 'd_date_sk.int)

val input = ('d_date_sk === 'ss_sold_date_sk) && ('ss_item_sk === 'i_item_sk) && ('i_manufact_id === 436) && ('d_moy === 12) && (('ss_sold_date_sk > 2415355 && 'ss_sold_date_sk < 2415385) || ('ss_sold_date_sk > 2415720 && 'ss_sold_date_sk < 2415750) || ('ss_sold_date_sk > 2416085 && 'ss_sold_date_sk < 2416115) || ('ss_sold_date_sk > 2416450 && 'ss_sold_date_sk < 2416480) || ('ss_sold_date_sk > 2416816 && 'ss_sold_date_sk < 2416846) || ('ss_sold_date_sk > 2417181 && 'ss_sold_date_sk < 2417211) || ('ss_sold_date_sk > 2417546 && 'ss_sold_date_sk < 2417576) || ('ss_sold_date_sk > 2417911 && 'ss_sold_date_sk < 2417941) || ('ss_sold_date_sk > 2418277 && 'ss_sold_date_sk < 2418307) || ('ss_sold_date_sk > 2418642 && 'ss_sold_date_sk < 2418672) || ('ss_sold_date_sk > 2419007 && 'ss_sold_date_sk < 2419037) || ('ss_sold_date_sk > 2419372 && 'ss_sold_date_sk < 2419402) || ('ss_sold_date_sk > 2419738 && 'ss_sold_date_sk < 2419768) || ('ss_sold_date_sk > 2420103 && 'ss_sold_date_sk < 2420133) || ('ss_sold_date_sk > 2420468 && 'ss_sold_date_sk < 2420498) || ('ss_sold_date_sk > 2420833 && 'ss_sold_date_sk < 2420863) || ('ss_sold_date_sk > 2421199 && 'ss_sold_date_sk < 2421229) || ('ss_sold_date_sk > 2421564 && 'ss_sold_date_sk < 2421594) || ('ss_sold_date_sk > 2421929 && 'ss_sold_date_sk < 2421959) || ('ss_sold_date_sk > 2422294 && 'ss_sold_date_sk < 2422324) || ('ss_sold_date_sk > 2422660 && 'ss_sold_date_sk < 2422690) || ('ss_sold_date_sk > 2423025 && 'ss_sold_date_sk < 2423055) || ('ss_sold_date_sk > 2423390 && 'ss_sold_date_sk < 2423420) || ('ss_sold_date_sk > 2423755 && 'ss_sold_date_sk < 2423785) || ('ss_sold_date_sk > 2424121 && 'ss_sold_date_sk < 2424151) || ('ss_sold_date_sk > 2424486 && 'ss_sold_date_sk < 2424516) || ('ss_sold_date_sk > 2424851 && 'ss_sold_date_sk < 2424881) || ('ss_sold_date_sk > 2425216 && 'ss_sold_date_sk < 2425246) || ('ss_sold_date_sk > 2425582 && 'ss_sold_date_sk < 2425612) || ('ss_sold_date_sk > 2425947 && 'ss_sold_date_sk < 2425977) || ('ss_sold_date_sk > 2426312 && 'ss_sold_date_sk < 2426342) || ('ss_sold_date_sk > 2426677 && 'ss_sold_date_sk < 2426707) || ('ss_sold_date_sk > 2427043 && 'ss_sold_date_sk < 2427073) || ('ss_sold_date_sk > 2427408 && 'ss_sold_date_sk < 2427438) || ('ss_sold_date_sk > 2427773 && 'ss_sold_date_sk < 2427803) || ('ss_sold_date_sk > 2428138 && 'ss_sold_date_sk < 2428168) || ('ss_sold_date_sk > 2428504 && 'ss_sold_date_sk < 2428534) || ('ss_sold_date_sk > 2428869 && 'ss_sold_date_sk < 2428899) || ('ss_sold_date_sk > 2429234 && 'ss_sold_date_sk < 2429264) || ('ss_sold_date_sk > 2429599 && 'ss_sold_date_sk < 2429629) || ('ss_sold_date_sk > 2429965 && 'ss_sold_date_sk < 2429995) || ('ss_sold_date_sk > 2430330 && 'ss_sold_date_sk < 2430360) || ('ss_sold_date_sk > 2430695 && 'ss_sold_date_sk < 2430725) || ('ss_sold_date_sk > 2431060 && 'ss_sold_date_sk < 2431090) || ('ss_sold_date_sk > 2431426 && 'ss_sold_date_sk < 2431456) || ('ss_sold_date_sk > 2431791 && 'ss_sold_date_sk < 2431821) || ('ss_sold_date_sk > 2432156 && 'ss_sold_date_sk < 2432186) || ('ss_sold_date_sk > 2432521 && 'ss_sold_date_sk < 2432551) || ('ss_sold_date_sk > 2432887 && 'ss_sold_date_sk < 2432917) || ('ss_sold_date_sk > 2433252 && 'ss_sold_date_sk < 2433282) || ('ss_sold_date_sk > 2433617 && 'ss_sold_date_sk < 2433647) || ('ss_sold_date_sk > 2433982 && 'ss_sold_date_sk < 2434012) || ('ss_sold_date_sk > 2434348 && 'ss_sold_date_sk < 2434378) || ('ss_sold_date_sk > 2434713 && 'ss_sold_date_sk < 2434743)))

val plan = testRelation.where(input).analyze
val actual = Optimize.execute(plan)

With this patch:

352 milliseconds
346 milliseconds
340 milliseconds

Without this patch:

585 milliseconds
880 milliseconds
677 milliseconds

How was this patch tested?

Existing tests should pass.

@viirya
Copy link
Member Author

viirya commented Mar 11, 2016

cc @davies @marmbrus

@SparkQA
Copy link

SparkQA commented Mar 11, 2016

Test build #52902 has finished for PR 11647 at commit c6e8415.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 11, 2016

retest this please.

@rxin
Copy link
Contributor

rxin commented Mar 11, 2016

It still seem pretty slow. Any other way to speed this up by an order of magnitude?

cc @marmbrus

@viirya
Copy link
Member Author

viirya commented Mar 11, 2016

With latest change:

233 milliseconds
207 milliseconds
220 milliseconds

BTW, by completely removing the comparison of canonicalized expression in Expression.semanticEquals, it can't be faster anymore.

In other words, the logic in Canonicalize after this change should be unrelated to the time spent on BooleanSimplification now. If I am wrong about this, please let me know.

@rxin
Copy link
Contributor

rxin commented Mar 11, 2016

What are the time spent beyond that?

@viirya
Copy link
Member Author

viirya commented Mar 11, 2016

I think it is the time needed to go through BooleanSimplification itself? It is a big rule. A few pattern cases in it is complicated as I see.

@viirya
Copy link
Member Author

viirya commented Mar 11, 2016

For example, this case:

// Common factor elimination for conjunction
    case and @ (left And right) =>
      val lhs = splitDisjunctivePredicates(left)
      val rhs = splitDisjunctivePredicates(right)
      ....

Because the where clause in TPCDS Q3 has a long pattern like this: (...) || (...) || (...) ..........., the recursively running splitDisjunctivePredicates should spend significant time.

@viirya
Copy link
Member Author

viirya commented Mar 11, 2016

Should I update the title of this PR, as I simplify Canonicalize instead of BooleanSimplification?

@rxin
Copy link
Contributor

rxin commented Mar 11, 2016

One way I can think of which is a bit ugly is to special case a sequence of ors and optimize it without recursion.

@SparkQA
Copy link

SparkQA commented Mar 11, 2016

Test build #52905 has finished for PR 11647 at commit c6e8415.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 11, 2016

Hmm, like this?

protected def splitDisjunctivePredicates(condition: Expression): Seq[Expression] = {
  condition match {
    case Or(Or(Or(Or(Or(Or(Or(Or(l, r))))))), cond2) =>
      splitDisjunctivePredicates(l) ++ splitDisjunctivePredicates(r) ++
        splitDisjunctivePredicates(cond2)
    case Or(cond1, Or(Or(Or(Or(Or(Or(Or(l, r)))))))) =>
      splitDisjunctivePredicates(cond1) ++ splitDisjunctivePredicates(l) ++
        splitDisjunctivePredicates(r)
    case Or(Or(Or(Or(Or(Or(Or(l, r)))))), cond2) =>
      splitDisjunctivePredicates(l) ++ splitDisjunctivePredicates(r) ++
        splitDisjunctivePredicates(cond2)
    case Or(cond1, Or(Or(Or(Or(Or(Or(l, r))))))) =>
      splitDisjunctivePredicates(cond1) ++ splitDisjunctivePredicates(l) ++
        splitDisjunctivePredicates(r)
    case Or(Or(Or(Or(Or(Or(l, r))))), cond2) =>
      splitDisjunctivePredicates(l) ++ splitDisjunctivePredicates(r) ++
        splitDisjunctivePredicates(cond2)
    case Or(cond1, Or(Or(Or(Or(Or(l, r)))))) =>
      splitDisjunctivePredicates(cond1) ++ splitDisjunctivePredicates(l) ++
        splitDisjunctivePredicates(r)
    case Or(Or(Or(Or(Or(l, r)))), cond2) =>
      splitDisjunctivePredicates(l) ++ splitDisjunctivePredicates(r) ++
        splitDisjunctivePredicates(cond2)
    case Or(cond1, Or(Or(Or(Or(l, r))))) =>
      splitDisjunctivePredicates(cond1) ++ splitDisjunctivePredicates(l) ++
        splitDisjunctivePredicates(r)
    case Or(Or(Or(Or(l, r))), cond2) =>
      splitDisjunctivePredicates(l) ++ splitDisjunctivePredicates(r) ++
        splitDisjunctivePredicates(cond2)
    case Or(cond1, Or(Or(Or(l, r)))) =>
      splitDisjunctivePredicates(cond1) ++ splitDisjunctivePredicates(l) ++
        splitDisjunctivePredicates(r)
    case Or(Or(Or(l, r)), cond2) =>
      splitDisjunctivePredicates(l) ++ splitDisjunctivePredicates(r) ++
        splitDisjunctivePredicates(cond2)
    case Or(cond1, Or(Or(l, r))) =>
      splitDisjunctivePredicates(cond1) ++ splitDisjunctivePredicates(l) ++
        splitDisjunctivePredicates(r)
    case Or(Or(l, r), cond2) =>
      splitDisjunctivePredicates(l) ++ splitDisjunctivePredicates(r) ++
        splitDisjunctivePredicates(cond2)
    case Or(cond1, Or(l, r)) =>
      splitDisjunctivePredicates(cond1) ++ splitDisjunctivePredicates(l) ++
        splitDisjunctivePredicates(r)
    case Or(cond1, cond2) =>
      splitDisjunctivePredicates(cond1) ++ splitDisjunctivePredicates(cond2)

    case other => other :: Nil
  }
}

@viirya
Copy link
Member Author

viirya commented Mar 11, 2016

Ah, the above is wrong. I forget to consider other side of each Or. But it is already ugly enough...

@SparkQA
Copy link

SparkQA commented Mar 11, 2016

Test build #52909 has finished for PR 11647 at commit c75e602.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 11, 2016

I was measuring the time to finish the code snippet shown in pr description. But I think I should only measure the time spent on val actual = Optimize.execute(plan).

Let me update it:

With this patch:

84362 microseconds
84075 microseconds
86192 microseconds

Without this patch:

492139 microseconds
503386 microseconds
463162 microseconds

@SparkQA
Copy link

SparkQA commented Mar 11, 2016

Test build #52918 has finished for PR 11647 at commit cbab017.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -36,15 +36,16 @@ import org.apache.spark.sql.catalyst.rules._
object Canonicalize extends RuleExecutor[Expression] {
override protected def batches: Seq[Batch] =
Batch(
"Expression Canonicalization", FixedPoint(100),
"Expression Canonicalization", Once,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its kind of confusing for this to be a RuleExecutor if we are only going to run it once and its only doing matches. Should we just move this into Expression?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we should do it. Let me update this.

@SparkQA
Copy link

SparkQA commented Mar 12, 2016

Test build #52986 has finished for PR 11647 at commit 04b3ce4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 14, 2016

@rxin @marmbrus I've addressed comments. Any other change I should do for this? Thanks!

* - [[EqualTo]] and [[EqualNullSafe]] are reordered by `hashCode`.
* - Other comparisons ([[GreaterThan]], [[LessThan]]) are reversed by `hashCode`.
*/
object Canonicalize extends {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, if you are keeping this class, i think we should just have it in its own file. The Expression.scala file is getting pretty long.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. done.

@rxin
Copy link
Contributor

rxin commented Mar 14, 2016

Thanks - I will let @marmbrus do the final review & merge.

@viirya
Copy link
Member Author

viirya commented Mar 14, 2016

Thanks for reviewing this.

@SparkQA
Copy link

SparkQA commented Mar 14, 2016

Test build #53050 has finished for PR 11647 at commit 4d3a310.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

LGTM, merging to master.

@asfgit asfgit closed this in 6a4bfcd Mar 14, 2016
@@ -152,7 +152,10 @@ abstract class Expression extends TreeNode[Expression] {
* `deterministic` expressions where `this.canonicalized == other.canonicalized` will always
* evaluate to the same result.
*/
lazy val canonicalized: Expression = Canonicalize.execute(this)
lazy val canonicalized: Expression = {
val canonicalizedChildred = children.map(_.canonicalized)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: canonicalizedChildred -> canonicalizedChildren

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks. Since this is merged and it is small, I won't send a PR just for this, but I will include this fixing in other PR.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Mar 17, 2016
…ean expressions

JIRA: https://issues.apache.org/jira/browse/SPARK-13658

## What changes were proposed in this pull request?

Quoted from JIRA description: When run TPCDS Q3 [1] with lots predicates to filter out the partitions, the optimizer rule BooleanSimplification take about 2 seconds (it use lots of sematicsEqual, which require copy the whole tree).

It will great if we could speedup it.

[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql

How to speed up it:

When we ask the canonicalized expression in `Expression`, it calls `Canonicalize.execute` on itself. `Canonicalize.execute` basically transforms up all expressions included in this expression. However, we don't keep the canonicalized versions for these children expressions. So in next time we ask the canonicalized expressions for the children expressions (e.g., `BooleanSimplification`), we will rerun `Canonicalize.execute` on each of them. It wastes much time.

By forcing the children expressions to get and keep their canonicalized versions first, we can avoid re-canonicalize these expressions.

I simply benchmark it with an expression which is part of the where clause in TPCDS Q3:

    val testRelation = LocalRelation('ss_sold_date_sk.int, 'd_moy.int, 'i_manufact_id.int, 'ss_item_sk.string, 'i_item_sk.string, 'd_date_sk.int)

    val input = ('d_date_sk === 'ss_sold_date_sk) && ('ss_item_sk === 'i_item_sk) && ('i_manufact_id === 436) && ('d_moy === 12) && (('ss_sold_date_sk > 2415355 && 'ss_sold_date_sk < 2415385) || ('ss_sold_date_sk > 2415720 && 'ss_sold_date_sk < 2415750) || ('ss_sold_date_sk > 2416085 && 'ss_sold_date_sk < 2416115) || ('ss_sold_date_sk > 2416450 && 'ss_sold_date_sk < 2416480) || ('ss_sold_date_sk > 2416816 && 'ss_sold_date_sk < 2416846) || ('ss_sold_date_sk > 2417181 && 'ss_sold_date_sk < 2417211) || ('ss_sold_date_sk > 2417546 && 'ss_sold_date_sk < 2417576) || ('ss_sold_date_sk > 2417911 && 'ss_sold_date_sk < 2417941) || ('ss_sold_date_sk > 2418277 && 'ss_sold_date_sk < 2418307) || ('ss_sold_date_sk > 2418642 && 'ss_sold_date_sk < 2418672) || ('ss_sold_date_sk > 2419007 && 'ss_sold_date_sk < 2419037) || ('ss_sold_date_sk > 2419372 && 'ss_sold_date_sk < 2419402) || ('ss_sold_date_sk > 2419738 && 'ss_sold_date_sk < 2419768) || ('ss_sold_date_sk > 2420103 && 'ss_sold_date_sk < 2420133) || ('ss_sold_date_sk > 2420468 && 'ss_sold_date_sk < 2420498) || ('ss_sold_date_sk > 2420833 && 'ss_sold_date_sk < 2420863) || ('ss_sold_date_sk > 2421199 && 'ss_sold_date_sk < 2421229) || ('ss_sold_date_sk > 2421564 && 'ss_sold_date_sk < 2421594) || ('ss_sold_date_sk > 2421929 && 'ss_sold_date_sk < 2421959) || ('ss_sold_date_sk > 2422294 && 'ss_sold_date_sk < 2422324) || ('ss_sold_date_sk > 2422660 && 'ss_sold_date_sk < 2422690) || ('ss_sold_date_sk > 2423025 && 'ss_sold_date_sk < 2423055) || ('ss_sold_date_sk > 2423390 && 'ss_sold_date_sk < 2423420) || ('ss_sold_date_sk > 2423755 && 'ss_sold_date_sk < 2423785) || ('ss_sold_date_sk > 2424121 && 'ss_sold_date_sk < 2424151) || ('ss_sold_date_sk > 2424486 && 'ss_sold_date_sk < 2424516) || ('ss_sold_date_sk > 2424851 && 'ss_sold_date_sk < 2424881) || ('ss_sold_date_sk > 2425216 && 'ss_sold_date_sk < 2425246) || ('ss_sold_date_sk > 2425582 && 'ss_sold_date_sk < 2425612) || ('ss_sold_date_sk > 2425947 && 'ss_sold_date_sk < 2425977) || ('ss_sold_date_sk > 2426312 && 'ss_sold_date_sk < 2426342) || ('ss_sold_date_sk > 2426677 && 'ss_sold_date_sk < 2426707) || ('ss_sold_date_sk > 2427043 && 'ss_sold_date_sk < 2427073) || ('ss_sold_date_sk > 2427408 && 'ss_sold_date_sk < 2427438) || ('ss_sold_date_sk > 2427773 && 'ss_sold_date_sk < 2427803) || ('ss_sold_date_sk > 2428138 && 'ss_sold_date_sk < 2428168) || ('ss_sold_date_sk > 2428504 && 'ss_sold_date_sk < 2428534) || ('ss_sold_date_sk > 2428869 && 'ss_sold_date_sk < 2428899) || ('ss_sold_date_sk > 2429234 && 'ss_sold_date_sk < 2429264) || ('ss_sold_date_sk > 2429599 && 'ss_sold_date_sk < 2429629) || ('ss_sold_date_sk > 2429965 && 'ss_sold_date_sk < 2429995) || ('ss_sold_date_sk > 2430330 && 'ss_sold_date_sk < 2430360) || ('ss_sold_date_sk > 2430695 && 'ss_sold_date_sk < 2430725) || ('ss_sold_date_sk > 2431060 && 'ss_sold_date_sk < 2431090) || ('ss_sold_date_sk > 2431426 && 'ss_sold_date_sk < 2431456) || ('ss_sold_date_sk > 2431791 && 'ss_sold_date_sk < 2431821) || ('ss_sold_date_sk > 2432156 && 'ss_sold_date_sk < 2432186) || ('ss_sold_date_sk > 2432521 && 'ss_sold_date_sk < 2432551) || ('ss_sold_date_sk > 2432887 && 'ss_sold_date_sk < 2432917) || ('ss_sold_date_sk > 2433252 && 'ss_sold_date_sk < 2433282) || ('ss_sold_date_sk > 2433617 && 'ss_sold_date_sk < 2433647) || ('ss_sold_date_sk > 2433982 && 'ss_sold_date_sk < 2434012) || ('ss_sold_date_sk > 2434348 && 'ss_sold_date_sk < 2434378) || ('ss_sold_date_sk > 2434713 && 'ss_sold_date_sk < 2434743)))

    val plan = testRelation.where(input).analyze
    val actual = Optimize.execute(plan)

With this patch:

    352 milliseconds
    346 milliseconds
    340 milliseconds

Without this patch:

    585 milliseconds
    880 milliseconds
    677 milliseconds

## How was this patch tested?

Existing tests should pass.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#11647 from viirya/improve-expr-canonicalize.
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
…ean expressions

JIRA: https://issues.apache.org/jira/browse/SPARK-13658

## What changes were proposed in this pull request?

Quoted from JIRA description: When run TPCDS Q3 [1] with lots predicates to filter out the partitions, the optimizer rule BooleanSimplification take about 2 seconds (it use lots of sematicsEqual, which require copy the whole tree).

It will great if we could speedup it.

[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql

How to speed up it:

When we ask the canonicalized expression in `Expression`, it calls `Canonicalize.execute` on itself. `Canonicalize.execute` basically transforms up all expressions included in this expression. However, we don't keep the canonicalized versions for these children expressions. So in next time we ask the canonicalized expressions for the children expressions (e.g., `BooleanSimplification`), we will rerun `Canonicalize.execute` on each of them. It wastes much time.

By forcing the children expressions to get and keep their canonicalized versions first, we can avoid re-canonicalize these expressions.

I simply benchmark it with an expression which is part of the where clause in TPCDS Q3:

    val testRelation = LocalRelation('ss_sold_date_sk.int, 'd_moy.int, 'i_manufact_id.int, 'ss_item_sk.string, 'i_item_sk.string, 'd_date_sk.int)

    val input = ('d_date_sk === 'ss_sold_date_sk) && ('ss_item_sk === 'i_item_sk) && ('i_manufact_id === 436) && ('d_moy === 12) && (('ss_sold_date_sk > 2415355 && 'ss_sold_date_sk < 2415385) || ('ss_sold_date_sk > 2415720 && 'ss_sold_date_sk < 2415750) || ('ss_sold_date_sk > 2416085 && 'ss_sold_date_sk < 2416115) || ('ss_sold_date_sk > 2416450 && 'ss_sold_date_sk < 2416480) || ('ss_sold_date_sk > 2416816 && 'ss_sold_date_sk < 2416846) || ('ss_sold_date_sk > 2417181 && 'ss_sold_date_sk < 2417211) || ('ss_sold_date_sk > 2417546 && 'ss_sold_date_sk < 2417576) || ('ss_sold_date_sk > 2417911 && 'ss_sold_date_sk < 2417941) || ('ss_sold_date_sk > 2418277 && 'ss_sold_date_sk < 2418307) || ('ss_sold_date_sk > 2418642 && 'ss_sold_date_sk < 2418672) || ('ss_sold_date_sk > 2419007 && 'ss_sold_date_sk < 2419037) || ('ss_sold_date_sk > 2419372 && 'ss_sold_date_sk < 2419402) || ('ss_sold_date_sk > 2419738 && 'ss_sold_date_sk < 2419768) || ('ss_sold_date_sk > 2420103 && 'ss_sold_date_sk < 2420133) || ('ss_sold_date_sk > 2420468 && 'ss_sold_date_sk < 2420498) || ('ss_sold_date_sk > 2420833 && 'ss_sold_date_sk < 2420863) || ('ss_sold_date_sk > 2421199 && 'ss_sold_date_sk < 2421229) || ('ss_sold_date_sk > 2421564 && 'ss_sold_date_sk < 2421594) || ('ss_sold_date_sk > 2421929 && 'ss_sold_date_sk < 2421959) || ('ss_sold_date_sk > 2422294 && 'ss_sold_date_sk < 2422324) || ('ss_sold_date_sk > 2422660 && 'ss_sold_date_sk < 2422690) || ('ss_sold_date_sk > 2423025 && 'ss_sold_date_sk < 2423055) || ('ss_sold_date_sk > 2423390 && 'ss_sold_date_sk < 2423420) || ('ss_sold_date_sk > 2423755 && 'ss_sold_date_sk < 2423785) || ('ss_sold_date_sk > 2424121 && 'ss_sold_date_sk < 2424151) || ('ss_sold_date_sk > 2424486 && 'ss_sold_date_sk < 2424516) || ('ss_sold_date_sk > 2424851 && 'ss_sold_date_sk < 2424881) || ('ss_sold_date_sk > 2425216 && 'ss_sold_date_sk < 2425246) || ('ss_sold_date_sk > 2425582 && 'ss_sold_date_sk < 2425612) || ('ss_sold_date_sk > 2425947 && 'ss_sold_date_sk < 2425977) || ('ss_sold_date_sk > 2426312 && 'ss_sold_date_sk < 2426342) || ('ss_sold_date_sk > 2426677 && 'ss_sold_date_sk < 2426707) || ('ss_sold_date_sk > 2427043 && 'ss_sold_date_sk < 2427073) || ('ss_sold_date_sk > 2427408 && 'ss_sold_date_sk < 2427438) || ('ss_sold_date_sk > 2427773 && 'ss_sold_date_sk < 2427803) || ('ss_sold_date_sk > 2428138 && 'ss_sold_date_sk < 2428168) || ('ss_sold_date_sk > 2428504 && 'ss_sold_date_sk < 2428534) || ('ss_sold_date_sk > 2428869 && 'ss_sold_date_sk < 2428899) || ('ss_sold_date_sk > 2429234 && 'ss_sold_date_sk < 2429264) || ('ss_sold_date_sk > 2429599 && 'ss_sold_date_sk < 2429629) || ('ss_sold_date_sk > 2429965 && 'ss_sold_date_sk < 2429995) || ('ss_sold_date_sk > 2430330 && 'ss_sold_date_sk < 2430360) || ('ss_sold_date_sk > 2430695 && 'ss_sold_date_sk < 2430725) || ('ss_sold_date_sk > 2431060 && 'ss_sold_date_sk < 2431090) || ('ss_sold_date_sk > 2431426 && 'ss_sold_date_sk < 2431456) || ('ss_sold_date_sk > 2431791 && 'ss_sold_date_sk < 2431821) || ('ss_sold_date_sk > 2432156 && 'ss_sold_date_sk < 2432186) || ('ss_sold_date_sk > 2432521 && 'ss_sold_date_sk < 2432551) || ('ss_sold_date_sk > 2432887 && 'ss_sold_date_sk < 2432917) || ('ss_sold_date_sk > 2433252 && 'ss_sold_date_sk < 2433282) || ('ss_sold_date_sk > 2433617 && 'ss_sold_date_sk < 2433647) || ('ss_sold_date_sk > 2433982 && 'ss_sold_date_sk < 2434012) || ('ss_sold_date_sk > 2434348 && 'ss_sold_date_sk < 2434378) || ('ss_sold_date_sk > 2434713 && 'ss_sold_date_sk < 2434743)))

    val plan = testRelation.where(input).analyze
    val actual = Optimize.execute(plan)

With this patch:

    352 milliseconds
    346 milliseconds
    340 milliseconds

Without this patch:

    585 milliseconds
    880 milliseconds
    677 milliseconds

## How was this patch tested?

Existing tests should pass.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#11647 from viirya/improve-expr-canonicalize.
@viirya viirya deleted the improve-expr-canonicalize branch December 27, 2023 18:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants