
[SPARK-19145][SQL] Timestamp to String casting is slowing the query significantly #17174

Closed
wants to merge 1 commit into from

Conversation

tanejagagan

[SPARK-19145][SQL] Timestamp to String casting is slowing the query significantly

If a BinaryComparison has expressions of timestamp and string data types, cast the string to timestamp when the string-typed expression is foldable. This results in an order-of-magnitude performance improvement in query execution.

What changes were proposed in this pull request?

If a BinaryComparison has expressions of timestamp and string data types, cast the string to timestamp when the string-typed expression is foldable. This results in an order-of-magnitude performance improvement in query execution.

How was this patch tested?

Added new unit tests to cover the functionality.

Please review http://spark.apache.org/contributing.html before opening a pull request.
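
A quick way to see the implicit cast this targets, in spark-shell (a rough sketch; the column name is illustrative and the exact plan text depends on the Spark version):

scala> val df = spark.range(10).selectExpr("CAST(id AS TIMESTAMP) AS time")
scala> df.where("time >= '2017-01-02 19:53:51'").explain()
// The Filter shows cast(time as string) >= 2017-01-02 19:53:51: the timestamp column
// is implicitly cast to string, which is the slow path this change avoids.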

// TimeStamp(2013-01-01 00:00 ...) < Cast("2014" AS timestamp) = true
case p @ BinaryComparison(left @ StringType(), right) if dateOrTimestampType(right) =>
  if (left.foldable) {
    p.makeCopy(Array(Cast(left, right.dataType), right))
  } else {
    p
  }
Contributor

You are changing the semantics of the comparison here. For example, '10' < current_timestamp currently returns true; if we apply your PR it will return null. This is a breaking change, and I don't think we should do this.
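
Concretely, the example can be checked in spark-shell (a minimal illustration of the same point):

scala> spark.sql("SELECT '10' < current_timestamp").show()
// Currently this returns true (the comparison is done on strings);
// with the proposed rewrite, casting '10' to timestamp yields null, so the result becomes null.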

Author

Thanks for pointing that out.
After applying the PR, '10' < current_timestamp does return null, which breaks the semantics.
A fix for this issue is still very important: we have seen an order-of-magnitude performance difference when the string is correctly cast to timestamp for literals in SQL, and most of these SQLs are generated by BI tools.
Any other suggestion to fix this issue? Shall I make it more restrictive so that the null cases are better covered, i.e.
if (expr.foldable && expr.eval(null) != null) cast to timestamp
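
A minimal sketch of that more restrictive guard, following the diff above (illustrative only; the SQLConf.get.sessionLocalTimeZone plumbing is an assumption, not part of the actual patch):

// Rewrite only when the foldable string literal actually parses as the target
// date/timestamp type, so that e.g. '10' < current_timestamp keeps its current semantics.
case p @ BinaryComparison(left @ StringType(), right)
    if dateOrTimestampType(right) && left.foldable =>
  // Assumed: resolve the session time zone so the Cast can be evaluated here.
  val castedLeft = Cast(left, right.dataType, Some(SQLConf.get.sessionLocalTimeZone))
  if (castedLeft.eval() != null) {
    p.makeCopy(Array(castedLeft, right))
  } else {
    p  // '10', 'abc', ...: leave the comparison unchanged
  }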

Member

Without this change, I think you can still explicitly cast the string to time/date in order to speed up the comparison, right?
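
For example (a sketch against the cstat table from the original ticket; the exact SQL is illustrative):

scala> // Relies on the implicit cast: the timestamp column ends up compared as a string.
scala> spark.sql("SELECT count(*) FROM cstat WHERE time >= '2017-01-02 19:53:51'").explain()
scala> // Casting the literal explicitly keeps the comparison in the timestamp domain,
scala> // so the predicate can also be pushed down to the Parquet reader.
scala> spark.sql("SELECT count(*) FROM cstat WHERE time >= CAST('2017-01-02 19:53:51' AS TIMESTAMP)").explain()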

Author

Yes, you can explicitly cast the string to timestamp and the query will be much faster. By default, without the cast, the query silently runs fine, picks up a very bad plan, and is about an order of magnitude slower, with no indication to the user whatsoever.
Some of the other issues related to comparison, such as time < 'abc', also run just fine; I think these should fail fast and let the user know about the casting issue.
The other problem is with BI tools that generate these SQLs, where the user has no direct control over the SQL.
We came across this issue when the same query was running 10 times faster in Impala than in Spark; investigating that led to this bug and hence the fix.

@gatorsmile
Member

Thanks for the investigation. This is definitely very useful to other Spark users. Maybe you can make a change to the existing documentation http://spark.apache.org/docs/latest/sql-programming-guide.html and explain our implicit type casting?

So far, we do not have a perfect solution that avoids external behavior changes if we adopt your change.

@gatorsmile
Member

ping @tanejagagan

@maropu
Member

maropu commented Jul 23, 2018

@tanejagagan Can you update?

@hindog

hindog commented Sep 5, 2018

I believe another performance impact related to this may be attributed to the cast operator failing to match during filter-pushdown, meaning that the filter on the timestamp will NOT get pushed down to the reader when there's a cast involved. You can also see this in the original ticket's query plan output for both versions:

== Physical Plan ==
CollectLimit 50000
+- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
   +- Exchange SinglePartition
      +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3339L])
         +- *Project
            +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
               +- *FileScan parquet default.cstat[time#3314] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: struct<time:timestamp>

== Physical Plan ==
CollectLimit 50000
+- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
   +- Exchange SinglePartition
      +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3287L])
         +- *Project
            +- *Filter ((isnotnull(time#3262) && (time#3262 >= 1483404831000000)) && (time#3262 <= 1484009631000000))
               +- *FileScan parquet default.cstat[time#3262] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time), GreaterThanOrEqual(time,2017-01-02 19:53:51.0), LessThanOrEqual(time,2017-01-09..., ReadSchema: struct<time:timestamp>

Note that PushedFilters is missing the GreaterThanOrEqual and LessThanOrEqual predicates in the former query but includes them in the latter. I've narrowed down where the PushedFilters get removed to DataSourceStrategy.translateFilter, where there's no match for the Cast expression, so the cast+filter gets removed from the list to be pushed down.
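
Roughly, translateFilter only matches comparisons whose child is a bare column attribute, so a Cast-wrapped column falls through. A simplified, hypothetical sketch (translateFilterSketch is a made-up name; the real DataSourceStrategy code covers many more cases):

import org.apache.spark.sql.catalyst.CatalystTypeConverters.convertToScala
import org.apache.spark.sql.catalyst.expressions
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, Literal}
import org.apache.spark.sql.sources

def translateFilterSketch(predicate: Expression): Option[sources.Filter] = predicate match {
  // A bare attribute compared with a literal becomes a pushable data source Filter.
  case expressions.GreaterThanOrEqual(a: Attribute, Literal(v, t)) =>
    Some(sources.GreaterThanOrEqual(a.name, convertToScala(v, t)))
  case expressions.LessThanOrEqual(a: Attribute, Literal(v, t)) =>
    Some(sources.LessThanOrEqual(a.name, convertToScala(v, t)))
  // cast(time#3314 as string) >= 2017-01-02 19:53:51 has a Cast, not an Attribute,
  // on the left-hand side, so it hits the default case and is dropped from
  // PushedFilters instead of being pushed to the Parquet reader.
  case _ => None
}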

@wangyum
Member

wangyum commented Sep 5, 2018

@tanejagagan Are you still working on this?

@maropu
Member

maropu commented Sep 5, 2018

I looked into the code and thought of another solution for this issue: detect specific binary comparisons in the Optimizer and replace them with specialized ones;
POC: master...maropu:SPARK-19145

quick benchmarks

scala> spark.range(10000000L).selectExpr("CAST(current_timestamp AS string) c0", "current_timestamp + interval 1 day AS c1").repartition(1).write.parquet("/tmp/t")
scala> :paste
def timer[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  println("Elapsed time: " + ((t1 - t0 + 0.0) / 1000000000.0)+ "s")
  result
}

// base number
scala> timer { spark.read.parquet("/tmp/t").repartition(1).selectExpr("CAST(c0 AS TIMESTAMP) < c1").queryExecution.executedPlan.execute().foreach(x => Unit) }
Elapsed time: 21.861871939s    

// without this patch
scala> timer { spark.read.parquet("/tmp/t").repartition(1).selectExpr("CAST(c0 AS TIMESTAMP) < c1").queryExecution.executedPlan.execute().foreach(x => Unit) }
Elapsed time: 48.424095598s   

// with this patch
scala> timer { spark.read.parquet("/tmp/t").repartition(1).selectExpr("c0 < c1").queryExecution.executedPlan.execute().foreach(x => Unit) }
Elapsed time: 22.459832991s                                                     

WDYT? cc: @hvanhovell @gatorsmile @tanejagagan

@dongjoon-hyun
Member

@maropu +1 for your idea.

@maropu
Member

maropu commented Oct 24, 2018

@dongjoon-hyun Thanks for checking! I'm still waiting for feedback from other developers. If the approach looks positive, I'll make a PR. Other ideas to solve this are also welcome. cc: @gatorsmile @hvanhovell @viirya

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 16, 2020
@github-actions github-actions bot closed this Jan 17, 2020