-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19145][SQL] Timestamp to String casting is slowing the query s… #17174
Conversation
…ignificantly If BinaryComparison has expression with timestamp and string datatype then cast string to timestamp if string type expression is foldable. This results in order of magnitude performance improvement in query execution Added new unit tests to conver the functionality
// TimeStamp(2013-01-01 00:00 ...) < Cast( "2014" as timestamp) = true | ||
case p @ BinaryComparison(left @ StringType(), right) if dateOrTimestampType(right) => | ||
if (left.foldable) { | ||
p.makeCopy(Array(Cast(left, right.dataType), right)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are changing the semantics of the comparison here. For example '10' < current_timestamp
currently returns true
, if we apply your PR it will return null
. This is a breaking change, and I don't think we should do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing that out.
After applying the PR '10' < current_timestamp
does return null
and its breaking the semantics.
The fix to this issue is very important. We have seen order of magnitude performance difference if string is correctly casted to timestamp for Literals in SQL and most of these SQLs are generated using some BI tool.
Any other suggestion to fix this issue? Shall i make it more restrictive so that null cases are better covered. i.e
if( expr.foldable && expressed.eval( null ) !=null ) cast to timestamp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this change, I think you can still explicitly cast the string to time/date in order to speed up the comparison, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.. You can explicitly cast the string to timestamp and then speed up will be much faster. By default without casting query just runs fine silently , pick up a very bad plan, with no indication to user whatsoever and about order of magnitude slower
Some of the other issue related to comparison such astime < 'abc'
will also run just fine which i think should be fail fast and let user know about the issue with casting
Other problem is with BI tools which generate these SQLs where user do not have direct control on the SQL.
We came across this issue when the same query in Impala was running 10 times faster than in Spark and investigation of the that resulted in this bug and therefore fix
Thanks for investigation. This is definitely very useful to the other Spark users. Maybe you can make a change the existing document http://spark.apache.org/docs/latest/sql-programming-guide.html and explain our implicit type casting? So far, we do not have a perfect solution to avoid external behavior changes, if we use your changes. |
ping @tanejagagan |
@tanejagagan Can you update? |
I believe another performance impact related to this may be attributed to the
Note the |
@tanejagagan Are you still working on? |
I looked into the code and I thought out another solution for this issue; it tries to detect specific binary comparisons in quick benchmarks
WDYT? cc: @hvanhovell @gatorsmile @tanejagagan |
@maropu . +1 for your idea. |
@dongjoon-hyun Thanks for your checks!! I still wait for other developer's feedbacks. If the approach is positive, I'll make a pr. Also, welcome another idea to solve this. cc: @gatorsmile @hvanhovell @viirya |
Can one of the admins verify this patch? |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
…ignificantly
If BinaryComparison has expression with timestamp and string datatype then cast string to timestamp if string type expression is foldable. This results in order of magnitude performance improvement in query execution
What changes were proposed in this pull request?
If BinaryComparison has expression with timestamp and string datatype then cast string to timestamp if string type expression is foldable. This results in order of magnitude performance improvement in query execution
How was this patch tested?
Added new unit tests to conver the functionality
Please review http://spark.apache.org/contributing.html before opening a pull request.