
[SPARK-24994][SQL] Add UnwrapCastInBinaryComparison optimizer to simplify integral literals #29565

Closed
wants to merge 17 commits into apache:master from sunchao:SPARK-24994

Conversation

@sunchao (Member) commented Aug 27, 2020

What changes were proposed in this pull request?

Currently, in cases like the following:

SELECT * FROM t WHERE age < 40

where age is of short type, Spark won't be able to simplify this and can only generate the filter cast(age, int) < 40. This won't get pushed down to data sources and therefore is not optimized.

This PR proposes an optimizer rule to improve this when the following constraints are satisfied:

  • the input expression is a binary comparison where one side is a cast operation and the other is a literal;
  • both the cast's child expression and the literal are of integral type (i.e., byte, short, int, or long).

When this is true, the rule tries several optimizations to either simplify the expression or move the cast to the literal side, so the resulting filter for the above case becomes age < cast(40 as smallint). This is better since the cast can be optimized away later and the filter can be pushed down to data sources.

This PR follows a similar effort in Presto (https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html). Here we only handle integral types, but plan to extend to other types in follow-ups.
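
For illustration only, here is a minimal, self-contained sketch of the unwrapping idea for the `<` case (plain Scala with made-up names, not the actual Catalyst rule; null handling and the other comparison operators are omitted):

```scala
// Sketch: unwrap `cast(shortCol as int) < intLiteral` by checking whether the
// int literal fits in the short range. If it does, compare directly in short;
// otherwise the predicate is a constant for all non-null inputs.
object UnwrapCastSketch {
  sealed trait Result
  case class LessThanShort(lit: Short) extends Result // rewrite to: shortCol < lit
  case object AlwaysTrue extends Result               // every non-null short satisfies it
  case object AlwaysFalse extends Result              // no short satisfies it

  def unwrapLessThan(intLiteral: Int): Result =
    if (intLiteral > Short.MaxValue.toInt) AlwaysTrue
    else if (intLiteral <= Short.MinValue.toInt) AlwaysFalse
    else LessThanShort(intLiteral.toShort)

  def main(args: Array[String]): Unit = {
    println(unwrapLessThan(40))      // LessThanShort(40): becomes `age < cast(40 as smallint)`
    println(unwrapLessThan(100000))  // AlwaysTrue: literal exceeds Short.MaxValue
    println(unwrapLessThan(-100000)) // AlwaysFalse: literal is below Short.MinValue
  }
}
```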

Why are the changes needed?

As mentioned in the previous section, when the cast is not optimized away, the filter cannot be pushed down to data sources, which can lead to unnecessary IO, longer job times, and wasted resources. This PR helps to improve that.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests for both the optimizer rule and filter pushdown on datasource level for both Orc and Parquet.

@sunchao (Member, Author) commented Aug 28, 2020

@SparkQA commented Aug 28, 2020

Test build #127969 has finished for PR 29565 at commit a39b99a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member) commented Aug 28, 2020

Thanks for the PR, @sunchao. This optimization rule will be very useful since, as you mentioned, many simple predicates with casts currently cannot be pushed down.

Comment on lines 42 to 46
override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
  case l: LogicalPlan => l transformExpressionsUp {
    case e @ BinaryComparison(_, _) => unwrapCast(e)
  }
}
Member:

If this rule targets predicate pushdown, shall we only capture the pattern PhysicalOperation?

Member Author:

I think predicate pushdown is the main use case but this can also help to simplify the binary comparisons.

@sunchao changed the title from "[WIP][SPARK-24994][SQL] Simplify casts for literal types" to "[SPARK-24994][SQL] Simplify casts for literal types" on Aug 28, 2020

// In case both sides have integral type, optimize the comparison by removing casts or
// moving cast to the literal side.
simplifyIntegral(exp, fromExp, toType.asInstanceOf[IntegralType], value)
Member:

Is it safe to do the type casting here? We might add more types in canImplicitlyCast.

Member:

If we match case BinaryComparison(Cast(fromExp, _, _), Literal(value, toType: IntegralType)), it's safer and there's no need for casting.

Member Author:

Yes, currently it's safe since we only allow integral types. Later, when we add more types, we'll add a safeguard here.

We can do the matching on IntegralType, but I think it might be slightly better to encapsulate the logic in canImplicitlyCast so that we can expand it later, e.g., casting from long to double with digit checking, etc. But yeah, I don't have a strong preference either way at this moment.

Member Author:

Or we can keep the existing match and add the above toType: IntegralType; then we can remove the casting.
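
As a side note, a small self-contained illustration (plain Scala with mock types, not the PR's actual classes) of why binding the literal's type in the pattern, as suggested above, removes the need for an asInstanceOf:

```scala
// Mock types only; the point is the typed pattern `t: IntegralType`, which lets
// the compiler narrow the type so no asInstanceOf is needed afterwards.
object TypedPatternSketch {
  sealed trait DataType
  sealed trait IntegralType extends DataType
  case object IntType extends IntegralType
  case object StringType extends DataType
  case class Literal(value: Any, dataType: DataType)

  def describe(lit: Literal): String = lit match {
    case Literal(v, t: IntegralType) => s"integral literal $v of type $t" // t: IntegralType here
    case _                           => "not an integral literal, skip"
  }

  def main(args: Array[String]): Unit = {
    println(describe(Literal(40, IntType)))     // integral literal 40 of type IntType
    println(describe(Literal("x", StringType))) // not an integral literal, skip
  }
}
```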

@SparkQA commented Aug 29, 2020

Test build #128001 has finished for PR 29565 at commit fc4311b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tanelk (Contributor) commented Aug 29, 2020

I wondered how this handles a null literal; you might want to add a test case like this:

  test("unwrap casts when literal is null") {
    val intLit = Literal.create(null, IntegerType)
    val shortLit = Literal.create(null, ShortType)
    assertEquivalent('a > intLit, 'a > shortLit)
    assertEquivalent('a >= intLit, 'a >= shortLit)
    assertEquivalent('a === intLit, 'a === shortLit)
    assertEquivalent('a <=> intLit, 'a <=> shortLit)
    assertEquivalent('a <= intLit, 'a <= shortLit)
    assertEquivalent('a < intLit, 'a < shortLit)
  }

@sunchao (Member, Author) commented Aug 29, 2020

> I wondered how it handles null literal, might want to add a test case like this: […]

Sure, I can add tests for this. But note that nulls should already have been handled by NullPropagation, so the binary comparisons should have been transformed to null or something else by then.
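
For context, a rough plain-Scala sketch (mock expression types of my own, not the actual NullPropagation rule) of how a comparison against a null literal is typically folded away before this rule would see it:

```scala
// Sketch: a null-intolerant comparison with a NULL literal input can itself be
// replaced by a NULL literal, so the unwrap rule rarely sees such expressions.
object NullPropagationSketch {
  sealed trait Expr
  case class NullLit(dataType: String) extends Expr
  case class ColRef(name: String) extends Expr
  case class GreaterThan(left: Expr, right: Expr) extends Expr

  def propagateNulls(e: Expr): Expr = e match {
    case GreaterThan(_, NullLit(_)) | GreaterThan(NullLit(_), _) => NullLit("boolean")
    case other => other
  }

  def main(args: Array[String]): Unit = {
    println(propagateNulls(GreaterThan(ColRef("a"), NullLit("int")))) // NullLit(boolean)
  }
}
```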

}

Seq("orc", "parquet").foreach { format =>
withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "") {
Member:

@sunchao. BTW, why do we test DataSourceV2 only? Your PR is neutral to V1 or V2, isn't it?

Member:

We recommend not using irrelevant SQLConfs in a test case if the PR is neutral to them. If this is due to the test case limitation at lines 3601 ~ 3604, we had better change it to test with the default configuration.

Member Author:

Yes, it applies to both v1 and v2. The reason I chose v2 is that v1's FileSourceScanExec doesn't expose an API to get the pushed filters.

@dongjoon-hyun (Member) left a comment:

Thank you for working on this, @sunchao. This will be helpful. I left a few comments.

* - `cast(fromExp, toType) > value` ==> if(isnull(fromExp), null, false)
* - `cast(fromExp, toType) >= value` ==> fromExp == max
* - `cast(fromExp, toType) === value` ==> fromExp == max
* - `cast(fromExp, toType) <=> value` ==> fromExp == max
Member:

fromExp <=> max?

For cast(fromExp, toType) <=> value, if fromExp is null, the original binary expression evaluates to false. But fromExp == max evaluates to null, right?
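
To make the point concrete, a plain-Scala sketch of the difference (NULL modeled as None; this is not Spark code):

```scala
// Three-valued-logic sketch: SQL `=` returns NULL when either side is NULL,
// while the null-safe `<=>` always returns true or false.
object NullSafeEqSketch {
  def eq(a: Option[Short], b: Option[Short]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  def nullSafeEq(a: Option[Short], b: Option[Short]): Boolean = (a, b) match {
    case (None, None)       => true
    case (Some(x), Some(y)) => x == y
    case _                  => false
  }

  def main(args: Array[String]): Unit = {
    val max: Option[Short] = Some(Short.MaxValue)
    val nullValue: Option[Short] = None
    // The original predicate `cast(fromExp, toType) <=> value` is false when fromExp is NULL...
    println(nullSafeEq(nullValue, max)) // false
    // ...but rewriting it to `fromExp == max` would yield NULL instead, changing semantics.
    println(eq(nullValue, max))         // None
  }
}
```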

Member Author:

I think you are right. Good catch! Let me see how to add a test for this.

Member Author:

Fixed this and added tests to cover it. Please take a look.

- Add comments on type coercion constraint
- Add comment for non-deterministic case in `EqualNullSafe`
- Rename `simplifyIntegral` to `simplifyIntegralComparison`
- Refactor test
- Fix nit in pattern matching
@dbtsai self-requested a review on September 10, 2020 23:29
@dbtsai (Member) left a comment:

LGTM.

@SparkQA commented Sep 11, 2020

Test build #128538 has finished for PR 29565 at commit 1f87c37.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 11, 2020

Test build #128539 has finished for PR 29565 at commit ec88961.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
private def canImplicitlyCast(fromExp: Expression, toType: DataType,
    literalType: DataType): Boolean = {
  toType.sameType(literalType) &&
Member:

This additional check looks good and robust. BTW, do we have test coverage for this?

Member Author:

I tried to come up with a test for this, but it seems the query compiler always wraps a cast to make sure the types on both sides are the same.

@dongjoon-hyun (Member) left a comment.

@SparkQA commented Sep 11, 2020

Test build #128577 has finished for PR 29565 at commit de911da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 12, 2020

Test build #128581 has finished for PR 29565 at commit 5f32ee5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment:

Thank you, @sunchao and all.
Merged to master for Apache Spark 3.1.0.

@sunchao (Member, Author) commented Sep 13, 2020

Thank you all for the review and commit!

* i.e., the conversion is injective. Note this only handles the case when both sides are of
* integral type.
*/
private def canImplicitlyCast(fromExp: Expression, toType: DataType,
Contributor:

indentation nit:

def func(
    para1: T,
    para2: T,
    para3: T): R = ...

Contributor:

BTW, I'd also check !fromExp.foldable; otherwise it's simpler not to run this rule.

Member:

Ya. +1 for foldable checking.

Member Author:

Thanks. I'll add the foldable check in the follow-up PR that handles more types, and also fix the indentation.
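
A rough, self-contained sketch (mock expression types, not the exact code added in the follow-up) of the foldable guard being discussed:

```scala
// If the cast's child is itself a constant ("foldable"), constant folding will
// evaluate the whole comparison anyway, so the unwrap rule can simply skip it.
object FoldableGuardSketch {
  sealed trait Expr { def foldable: Boolean }
  case class IntLit(value: Int) extends Expr { val foldable: Boolean = true }
  case class ColRef(name: String) extends Expr { val foldable: Boolean = false }

  def shouldUnwrap(castChild: Expr): Boolean = !castChild.foldable

  def main(args: Array[String]): Unit = {
    println(shouldUnwrap(ColRef("age"))) // true: a real column, worth unwrapping the cast
    println(shouldUnwrap(IntLit(35)))    // false: leave it to constant folding
  }
}
```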

@cloud-fan (Contributor):
Late LGTM with one minor comment: https://github.com/apache/spark/pull/29565/files#r487646973

@gatorsmile changed the title from "[SPARK-24994][SQL] Add UnwrapCastInBinaryComparison optimizer to simplify literal types" to "[SPARK-24994][SQL] Add UnwrapCastInBinaryComparison optimizer to simplify integral types" on Sep 16, 2020
@gatorsmile changed the title from "[SPARK-24994][SQL] Add UnwrapCastInBinaryComparison optimizer to simplify integral types" to "[SPARK-24994][SQL] Add UnwrapCastInBinaryComparison optimizer to simplify integral literals" on Sep 16, 2020
dongjoon-hyun pushed a commit that referenced this pull request Sep 17, 2020
### What changes were proposed in this pull request?

This is a follow-up on #29565, and addresses a few issues in the last PR:
- style issue pointed out by [this comment](#29565 (comment))
- skip the optimization when `fromExp` is foldable (per [this comment](#29565 (comment))), as there could be a more efficient rule to apply in this case
- pass timezone info to the generated cast on the literal value
- a bunch of cleanups and test improvements

Originally I planned to handle this when implementing [SPARK-32858](https://issues.apache.org/jira/browse/SPARK-32858), but now I think it's better to isolate these changes from that.

### Why are the changes needed?

To fix a few leftover issues in the above PR.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a test for the foldable case. Otherwise relying on existing tests.

Closes #29775 from sunchao/SPARK-24994-followup.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
a0x8o added a commit to a0x8o/spark that referenced this pull request Sep 17, 2020
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020