[SPARK-24994][SQL] Add UnwrapCastInBinaryComparison optimizer to simplify integral literals #29565
Conversation
Test build #127969 has finished for PR 29565 at commit
Thanks for the PR, @sunchao. This optimization rule will be very useful, since many simple predicates with casts, as you mentioned, cannot currently be pushed down.
```scala
override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
  case l: LogicalPlan => l transformExpressionsUp {
    case e @ BinaryComparison(_, _) => unwrapCast(e)
  }
}
```
If this rule targets predicate pushdown, shall we only capture the pattern `PhysicalOperation`?
I think predicate pushdown is the main use case, but this can also help simplify binary comparisons in general.
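To make that non-pushdown benefit concrete, here is a minimal sketch (the `spark` session, table `t`, and column `age` are hypothetical; the rewrite follows the rule's transformation table quoted later in this thread):

```scala
// Hypothetical example: `age` is a SMALLINT (short) column of table `t`.
// 32767 is Short.MaxValue, so per the rule's rewrite table the predicate
//   cast(age AS INT) > 32767
// can be rewritten to
//   if(isnull(age), null, false)
// which later simplification can fold further -- the per-row cast disappears
// even when nothing is pushed down to the data source.
val df = spark.sql("SELECT * FROM t WHERE CAST(age AS INT) > 32767")
df.explain(true)
```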
```scala
// In case both sides have integral type, optimize the comparison by removing casts or
// moving cast to the literal side.
simplifyIntegral(exp, fromExp, toType.asInstanceOf[IntegralType], value)
```
Is it safe to do the type casting here? We might add more types in `canImplicitlyCast`.
If we match `case BinaryComparison(Cast(fromExp, _, _), Literal(value, toType: IntegralType))`, it's safer and there is no need for casting.
Yes, currently it's safe since we only allow integral types. Later, when we add more types, we'll add a safeguard here.

We can do the matching on `IntegralType`, but I think it might be slightly better to encapsulate the logic in `canImplicitlyCast` so that we can expand it later, such as casting from long to double with digit checking, etc. But yeah, I don't have a strong preference for either at this moment.
Or we can keep the existing match and add the above `toType: IntegralType`; then we can remove the casting.
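For reference, a rough sketch of the suggested shape, written as if it lived inside the optimizer rule (the `simplifyIntegral` helper from the PR is only stubbed here, so treat this as illustrative rather than the final code):

```scala
package org.apache.spark.sql.catalyst.optimizer

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.IntegralType

object UnwrapCastSketch {
  // Stub standing in for the PR's simplifyIntegral helper.
  private def simplifyIntegral(
      exp: BinaryComparison,
      fromExp: Expression,
      toType: IntegralType,
      value: Any): Expression = exp

  // With the literal's type restricted in the pattern itself, `toType` is
  // statically an IntegralType and no asInstanceOf is needed downstream.
  def unwrapCast(e: Expression): Expression = e match {
    case be @ BinaryComparison(Cast(fromExp, _, _), Literal(value, toType: IntegralType)) =>
      simplifyIntegral(be, fromExp, toType, value)
    case _ => e
  }
}
```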
Test build #128001 has finished for PR 29565 at commit
I wondered how it handles a null literal; might want to add a test case like this: …
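(The reviewer's actual snippet isn't preserved above. The following is only a hypothetical sketch of such a test, using standard `QueryTest`/`SQLTestUtils` helpers, with a made-up table and column.)

```scala
// Hypothetical test sketch: comparing against a null literal must keep its
// null semantics after the rewrite -- the predicate is null for every row,
// so no rows should be returned, with or without the new rule.
withTable("t") {
  sql("CREATE TABLE t (a SMALLINT) USING parquet")
  sql("INSERT INTO t VALUES (1), (NULL)")
  checkAnswer(sql("SELECT * FROM t WHERE a > CAST(NULL AS INT)"), Nil)
}
```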
Sure, I can add tests for this. But note that nulls should have already been handled by …
```scala
Seq("orc", "parquet").foreach { format =>
  withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "") {
```
@sunchao, BTW, why do we test `DataSourceV2` only? Your PR is neutral to V1 and V2, isn't it?
We recommend not using irrelevant SQLConfs in a test case if the PR is neutral. If this is a test-case limitation at lines 3601 ~ 3604, we had better change it to test with the default configuration.
Yes, it applies to both v1 and v2. The reason I chose v2 is that the v1 `FileSourceScanExec` doesn't expose an API to get the pushed filters.
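For context, a rough sketch of the kind of check the v2 path enables (the class and field names are my assumption based on Spark 3.x internals, not quoted from the PR; `t` and `age` are hypothetical):

```scala
import org.apache.spark.sql.execution.datasources.v2.BatchScanExec
import org.apache.spark.sql.execution.datasources.v2.parquet.ParquetScan

// With the v2 read path, the pushed-down data source filters are reachable
// directly from the scan node in the executed plan.
val df = spark.sql("SELECT * FROM t WHERE age < 40")
val pushed = df.queryExecution.executedPlan.collectFirst {
  case b: BatchScanExec => b.scan
} match {
  case Some(p: ParquetScan) => p.pushedFilters.toSeq
  case _ => Seq.empty
}
```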
Thank you for working on this, @sunchao. This will be helpful. I left a few comments.
```scala
 * - `cast(fromExp, toType) > value`   ==> if(isnull(fromExp), null, false)
 * - `cast(fromExp, toType) >= value`  ==> fromExp == max
 * - `cast(fromExp, toType) === value` ==> fromExp == max
 * - `cast(fromExp, toType) <=> value` ==> fromExp == max
```
Should this be `fromExp <=> max`? For `cast(fromExp, toType) <=> value`, if `fromExp` is null, the original binary expression evaluates to false. But `fromExp == max` evaluates to null, right?
I think you are right. Good catch! Let me see how to add a test for this.
Fixed this and added tests to cover it. Please take a look.
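To make the semantics gap discussed above concrete, a small illustration (the column `s` and the inline table are made up; 32767 stands in for the max of the short type):

```scala
// For a null SMALLINT value `s`:
//   cast(s AS INT) <=> 32767  -> false  (null-safe equality never returns null)
//   s = 32767                 -> null   (so rewriting to `fromExp == max` changes the result)
//   s <=> 32767               -> false  (matches the original, hence the fix)
spark.sql(
  """SELECT CAST(s AS INT) <=> 32767, s = 32767, s <=> 32767
    |FROM VALUES (CAST(NULL AS SMALLINT)) AS t(s)""".stripMargin).show()
// Expected row, roughly: false, null, false
```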
- Add comments on type coercion constraint
- Add comment for non-deterministic case in `EqualNullSafe`
- Rename `simplifyIntegral` to `simplifyIntegralComparison`
- Refactor test
- Fix nit in pattern matching
LGTM.
Test build #128538 has finished for PR 29565 at commit
Test build #128539 has finished for PR 29565 at commit
```scala
 */
private def canImplicitlyCast(fromExp: Expression, toType: DataType,
    literalType: DataType): Boolean = {
  toType.sameType(literalType) &&
```
This additional check looks good and robust. BTW, do we have test coverage for this?
I tried to come up with a test for this, but it seems the query compiler always wraps a cast to make sure the types on both sides are the same.
I guess we have one remaining request from @cloud-fan.
This reverts commit 265c169.
Test build #128577 has finished for PR 29565 at commit
Test build #128581 has finished for PR 29565 at commit
Thank you, @sunchao and all.
Merged to master for Apache Spark 3.1.0.
Thank you all for the review and commit!
```scala
 * i.e., the conversion is injective. Note this only handles the case when both sides are of
 * integral type.
 */
private def canImplicitlyCast(fromExp: Expression, toType: DataType,
```
indentation nit:

```scala
def func(
    para1: T,
    para2: T,
    para3: T): R = ...
```
BTW, I'll also check `!from.foldable`; otherwise it's simpler to not run this rule.
Ya. +1 for `foldable` checking.
Thanks. I'll check `foldable` in the follow-up PR to handle more types, and also fix the indentation.
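A sketch of how the guard and the indentation fix could look together inside the rule (the remaining type checks are omitted; this is not the final follow-up code):

```scala
// Sketch only: skip foldable children, since constant folding will simplify
// the whole comparison anyway, and use the 4-space continuation indent for
// the parameter list.
private def canImplicitlyCast(
    fromExp: Expression,
    toType: DataType,
    literalType: DataType): Boolean = {
  !fromExp.foldable &&
    toType.sameType(literalType)  // ... plus the integral-type checks from the PR
}
```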
Late LGTM with one minor comment: https://github.com/apache/spark/pull/29565/files#r487646973
### What changes were proposed in this pull request?
This is a follow-up on #29565, and addresses a few issues in the last PR:
- style issue pointed out by [this comment](#29565 (comment))
- skip the optimization when `fromExp` is foldable (per [this comment](#29565 (comment))), as there could be a more efficient rule to apply for that case
- pass timezone info to the generated cast on the literal value
- a bunch of cleanups and test improvements

Originally I planned to handle this when implementing [SPARK-32858](https://issues.apache.org/jira/browse/SPARK-32858), but now think it's better to isolate these changes from that.

### Why are the changes needed?
To fix a few leftover issues in the above PR.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a test for the foldable case. Otherwise relying on existing tests.

Closes #29775 from sunchao/SPARK-24994-followup.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Currently, in cases like `SELECT * FROM t WHERE age < 40`, where `age` is of short type, Spark won't be able to simplify this and can only generate the filter `cast(age, int) < 40`. This won't get pushed down to data sources and therefore is not optimized.

This PR proposes an optimizer rule to improve this when the following constraints are satisfied:

- the input expression is a binary comparison where one side is a cast operation and the other is a literal
- both the cast child expression and the literal are of integral type (i.e., byte, short, int or long)
When these hold, the rule tries several optimizations to either simplify the expression or move the cast to the literal side, so the resulting filter for the above case becomes `age < cast(40 as smallint)`. This is better since the cast can be optimized away later and the filter can be pushed down to data sources.

This PR follows a similar effort in Presto (https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html). Here we only handle integral types but plan to extend to other types as follow-ups.
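As a rough before/after illustration of the description above (the table and column are hypothetical, and the plan strings are paraphrased rather than copied from a real run):

```scala
// `age` is a SMALLINT column of table `t`.
spark.sql("SELECT * FROM t WHERE age < 40").explain(true)
// Without the rule, the optimized plan keeps a per-row cast on the column:
//   Filter (cast(age#0 as int) < 40)
// With the rule, the cast moves to the literal and is then folded away:
//   Filter (age#0 < 40)
// which is the form that Parquet/ORC filter pushdown can consume.
```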
Why are the changes needed?
As mentioned in the previous section, when the cast is not optimized, the filter cannot be pushed down to data sources, which can lead to unnecessary IO and therefore longer job times and wasted resources. This helps to improve that.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit tests for both the optimizer rule and filter pushdown on datasource level for both Orc and Parquet.