
[SPARK-24994][SQL] Add UnwrapCastInBinaryComparison optimizer to simplify integral literals #29565

Closed
wants to merge 17 commits into apache:master from sunchao:SPARK-24994

Conversation

@sunchao (Member) commented Aug 27, 2020

What changes were proposed in this pull request?

Currently, in cases like the following:

SELECT * FROM t WHERE age < 40

where age is of short type, Spark won't be able to simplify this and can only generate the filter cast(age, int) < 40. This won't get pushed down to data sources and therefore is not optimized.

This PR proposes an optimizer rule to improve this when the following constraints are satisfied:

  • the input expression is a binary comparison where one side is a cast operation and the other is a literal;
  • both the cast's child expression and the literal are of integral type (i.e., byte, short, int, or long).

When this is true, the rule tries several optimizations to either simplify the expression or move the cast to the literal side, so the resulting filter for the above case becomes age < cast(40 as smallint). This is better since the cast can be optimized away later and the filter can be pushed down to data sources.

This PR follows a similar effort in Presto (https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html). Here we only handle integral types, but plan to extend to other types in follow-ups.
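
For illustration only, here is a minimal, self-contained sketch of the unwrapping idea for the `<` case (plain Scala with made-up names, not the actual Catalyst rule; null handling and the other comparison operators are omitted):

```scala
// Sketch: unwrap `cast(shortCol as int) < intLiteral` by checking whether the
// int literal fits in the short range. If it does, compare directly in short;
// otherwise the predicate is a constant for all non-null inputs.
object UnwrapCastSketch {
  sealed trait Result
  case class LessThanShort(lit: Short) extends Result // rewrite to: shortCol < lit
  case object AlwaysTrue extends Result               // every non-null short satisfies it
  case object AlwaysFalse extends Result              // no short satisfies it

  def unwrapLessThan(intLiteral: Int): Result =
    if (intLiteral > Short.MaxValue.toInt) AlwaysTrue
    else if (intLiteral <= Short.MinValue.toInt) AlwaysFalse
    else LessThanShort(intLiteral.toShort)

  def main(args: Array[String]): Unit = {
    println(unwrapLessThan(40))      // LessThanShort(40): becomes `age < cast(40 as smallint)`
    println(unwrapLessThan(100000))  // AlwaysTrue: literal exceeds Short.MaxValue
    println(unwrapLessThan(-100000)) // AlwaysFalse: literal is below Short.MinValue
  }
}
```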

Why are the changes needed?

As mentioned in the previous section, when the cast is not optimized away, the filter cannot be pushed down to data sources, which can lead to unnecessary IO, longer job times, and wasted resources. This PR helps to improve that.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests for both the optimizer rule and filter pushdown on datasource level for both Orc and Parquet.

@sunchao (Member, Author) commented Aug 28, 2020

@SparkQA commented Aug 28, 2020

Test build #127969 has finished for PR 29565 at commit a39b99a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member) commented Aug 28, 2020

Thanks for the PR, @sunchao. This optimization rule will be very useful since, as you mentioned, many simple predicates with casts currently cannot be pushed down.

Comment on lines 42 to 46
override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
  case l: LogicalPlan => l transformExpressionsUp {
    case e @ BinaryComparison(_, _) => unwrapCast(e)
  }
}
Member:

If this rule targets predicate pushdown, shall we only capture the pattern PhysicalOperation?

Member Author:

I think predicate pushdown is the main use case but this can also help to simplify the binary comparisons.

@sunchao changed the title from "[WIP][SPARK-24994][SQL] Simplify casts for literal types" to "[SPARK-24994][SQL] Simplify casts for literal types" on Aug 28, 2020

// In case both sides have integral type, optimize the comparison by removing casts or
// moving cast to the literal side.
simplifyIntegral(exp, fromExp, toType.asInstanceOf[IntegralType], value)
Member:

Is it safe to do the type casting here? We might add more types in canImplicitlyCast.

Member:

If we match case BinaryComparison(Cast(fromExp, _, _), Literal(value, toType: IntegralType)), it's safer and there's no need for casting.

Member Author:

Yes, currently it's safe since we only allow integral types. Later, when we add more types, we'll add a safeguard here.

We can do the matching on IntegralType, but I think it might be slightly better to encapsulate the logic in canImplicitlyCast so that we can expand it later, e.g., casting from long to double with digit checking, etc. But yeah, I don't have a strong preference either way at this moment.

Member Author:

Or we can keep the existing match and add the above toType: IntegralType; then we can remove the casting.
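
As a side note, a small self-contained illustration (plain Scala with mock types, not the PR's actual classes) of why binding the literal's type in the pattern, as suggested above, removes the need for an asInstanceOf:

```scala
// Mock types only; the point is the typed pattern `t: IntegralType`, which lets
// the compiler narrow the type so no asInstanceOf is needed afterwards.
object TypedPatternSketch {
  sealed trait DataType
  sealed trait IntegralType extends DataType
  case object IntType extends IntegralType
  case object StringType extends DataType
  case class Literal(value: Any, dataType: DataType)

  def describe(lit: Literal): String = lit match {
    case Literal(v, t: IntegralType) => s"integral literal $v of type $t" // t: IntegralType here
    case _                           => "not an integral literal, skip"
  }

  def main(args: Array[String]): Unit = {
    println(describe(Literal(40, IntType)))     // integral literal 40 of type IntType
    println(describe(Literal("x", StringType))) // not an integral literal, skip
  }
}
```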

@SparkQA commented Aug 29, 2020

Test build #128001 has finished for PR 29565 at commit fc4311b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tanelk (Contributor) commented Aug 29, 2020

I wondered how this handles a null literal; you might want to add a test case like this:

  test("unwrap casts when literal is null") {
    val intLit = Literal.create(null, IntegerType)
    val shortLit = Literal.create(null, ShortType)
    assertEquivalent('a > intLit, 'a > shortLit)
    assertEquivalent('a >= intLit, 'a >= shortLit)
    assertEquivalent('a === intLit, 'a === shortLit)
    assertEquivalent('a <=> intLit, 'a <=> shortLit)
    assertEquivalent('a <= intLit, 'a <= shortLit)
    assertEquivalent('a < intLit, 'a < shortLit)
  }

@sunchao (Member, Author) commented Aug 29, 2020

> I wondered how it handles null literal, might want to add a test case like this: […]

Sure, I can add tests for this. But note that nulls should already have been handled by NullPropagation, so the binary comparisons should have been transformed to null or something else by then.
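
For context, a rough plain-Scala sketch (mock expression types of my own, not the actual NullPropagation rule) of how a comparison against a null literal is typically folded away before this rule would see it:

```scala
// Sketch: a null-intolerant comparison with a NULL literal input can itself be
// replaced by a NULL literal, so the unwrap rule rarely sees such expressions.
object NullPropagationSketch {
  sealed trait Expr
  case class NullLit(dataType: String) extends Expr
  case class ColRef(name: String) extends Expr
  case class GreaterThan(left: Expr, right: Expr) extends Expr

  def propagateNulls(e: Expr): Expr = e match {
    case GreaterThan(_, NullLit(_)) | GreaterThan(NullLit(_), _) => NullLit("boolean")
    case other => other
  }

  def main(args: Array[String]): Unit = {
    println(propagateNulls(GreaterThan(ColRef("a"), NullLit("int")))) // NullLit(boolean)
  }
}
```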

}

Seq("orc", "parquet").foreach { format =>
withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "") {
Member:

@sunchao. BTW, why do we test DataSourceV2 only? Your PR is neutral to V1 or V2, isn't it?

Member:

We recommend not using irrelevant SQLConfs in a test case if the PR is neutral to them. If this is due to the test case limitation at lines 3601 ~ 3604, we had better change it to test with the default configuration.

Member Author:

Yes, it applies to both v1 and v2. The reason I chose v2 is that v1's FileSourceScanExec doesn't expose an API to get the pushed filters.

@dongjoon-hyun (Member) left a comment:

Thank you for working on this, @sunchao. This will be helpful. I left a few comments.

* - `cast(fromExp, toType) > value` ==> if(isnull(fromExp), null, false)
* - `cast(fromExp, toType) >= value` ==> fromExp == max
* - `cast(fromExp, toType) === value` ==> fromExp == max
* - `cast(fromExp, toType) <=> value` ==> fromExp == max
Member:

fromExp <=> max?

For cast(fromExp, toType) <=> value, if fromExp is null, the original binary expression evaluates to false. But fromExp == max evaluates to null, right?
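
To make the point concrete, a plain-Scala sketch of the difference (NULL modeled as None; this is not Spark code):

```scala
// Three-valued-logic sketch: SQL `=` returns NULL when either side is NULL,
// while the null-safe `<=>` always returns true or false.
object NullSafeEqSketch {
  def eq(a: Option[Short], b: Option[Short]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  def nullSafeEq(a: Option[Short], b: Option[Short]): Boolean = (a, b) match {
    case (None, None)       => true
    case (Some(x), Some(y)) => x == y
    case _                  => false
  }

  def main(args: Array[String]): Unit = {
    val max: Option[Short] = Some(Short.MaxValue)
    val nullValue: Option[Short] = None
    // The original predicate `cast(fromExp, toType) <=> value` is false when fromExp is NULL...
    println(nullSafeEq(nullValue, max)) // false
    // ...but rewriting it to `fromExp == max` would yield NULL instead, changing semantics.
    println(eq(nullValue, max))         // None
  }
}
```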

Member Author:

I think you are right. Good catch! Let me see how to add a test for this.

Member Author:

Fixed this and added tests to cover it. Please take a look.

- Add comments on type coercion constraint
- Add comment for non-deterministic case in `EqualNullSafe`
- Rename `simplifyIntegral` to `simplifyIntegralComparison`
- Refactor test
- Fix nit in pattern matching
@dbtsai self-requested a review on September 10, 2020 23:29
@dbtsai (Member) left a comment:

LGTM.

@SparkQA commented Sep 11, 2020

Test build #128538 has finished for PR 29565 at commit 1f87c37.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 11, 2020

Test build #128539 has finished for PR 29565 at commit ec88961.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
private def canImplicitlyCast(fromExp: Expression, toType: DataType,
    literalType: DataType): Boolean = {
  toType.sameType(literalType) &&
Member:

This additional check looks good and robust. BTW, do we have test coverage for this?

Member Author:

I tried to come up with a test for this, but it seems the query compiler always wraps a cast to make sure the types on both sides are the same.

@dongjoon-hyun (Member) left a comment.

@SparkQA commented Sep 11, 2020

Test build #128577 has finished for PR 29565 at commit de911da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 12, 2020

Test build #128581 has finished for PR 29565 at commit 5f32ee5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment:

Thank you, @sunchao and all.
Merged to master for Apache Spark 3.1.0.

@sunchao (Member, Author) commented Sep 13, 2020

Thank you all for the review and commit!

* i.e., the conversion is injective. Note this only handles the case when both sides are of
* integral type.
*/
private def canImplicitlyCast(fromExp: Expression, toType: DataType,
Contributor:

indentation nit:

def func(
    para1: T,
    para2: T,
    para3: T): R = ...

Contributor:

BTW, I'd also check !fromExp.foldable; otherwise it's simpler not to run this rule.

Member:

Ya. +1 for foldable checking.

Member Author:

Thanks. I'll add the foldable check in the follow-up PR that handles more types, and also fix the indentation.
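
A rough, self-contained sketch (mock expression types, not the exact code added in the follow-up) of the foldable guard being discussed:

```scala
// If the cast's child is itself a constant ("foldable"), constant folding will
// evaluate the whole comparison anyway, so the unwrap rule can simply skip it.
object FoldableGuardSketch {
  sealed trait Expr { def foldable: Boolean }
  case class IntLit(value: Int) extends Expr { val foldable: Boolean = true }
  case class ColRef(name: String) extends Expr { val foldable: Boolean = false }

  def shouldUnwrap(castChild: Expr): Boolean = !castChild.foldable

  def main(args: Array[String]): Unit = {
    println(shouldUnwrap(ColRef("age"))) // true: a real column, worth unwrapping the cast
    println(shouldUnwrap(IntLit(35)))    // false: leave it to constant folding
  }
}
```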

@cloud-fan (Contributor):
Late LGTM with one minor comment: https://github.com/apache/spark/pull/29565/files#r487646973

@gatorsmile changed the title from "[SPARK-24994][SQL] Add UnwrapCastInBinaryComparison optimizer to simplify literal types" to "[SPARK-24994][SQL] Add UnwrapCastInBinaryComparison optimizer to simplify integral types" on Sep 16, 2020
@gatorsmile changed the title from "[SPARK-24994][SQL] Add UnwrapCastInBinaryComparison optimizer to simplify integral types" to "[SPARK-24994][SQL] Add UnwrapCastInBinaryComparison optimizer to simplify integral literals" on Sep 16, 2020
dongjoon-hyun pushed a commit that referenced this pull request Sep 17, 2020
### What changes were proposed in this pull request?

This is a follow-up on #29565, and addresses a few issues in the last PR:
- style issue pointed out by [this comment](#29565 (comment))
- skip the optimization when `fromExp` is foldable (per [this comment](#29565 (comment))), as there could be a more efficient rule to apply in this case
- pass timezone info to the generated cast on the literal value
- a bunch of cleanups and test improvements

Originally I planned to handle this when implementing [SPARK-32858](https://issues.apache.org/jira/browse/SPARK-32858), but now I think it's better to isolate these changes from that.

### Why are the changes needed?

To fix a few leftover issues in the above PR.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a test for the foldable case. Otherwise relying on existing tests.

Closes #29775 from sunchao/SPARK-24994-followup.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
a0x8o added a commit to a0x8o/spark that referenced this pull request Sep 17, 2020
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020