
[SPARK-19145][SQL] Timestamp to String casting is slowing the query significantly #17174

Closed
wants to merge 1 commit into from

Conversation

tanejagagan

[SPARK-19145][SQL] Timestamp to String casting is slowing the query significantly

If a BinaryComparison has expressions of timestamp and string data types, cast the string to timestamp when the string-typed expression is foldable. This results in an order-of-magnitude performance improvement in query execution.

What changes were proposed in this pull request?

If a BinaryComparison has expressions of timestamp and string data types, cast the string to timestamp when the string-typed expression is foldable. This results in an order-of-magnitude performance improvement in query execution.

How was this patch tested?

Added new unit tests to cover the functionality.

Please review http://spark.apache.org/contributing.html before opening a pull request.
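
A quick way to see the implicit cast this targets, in spark-shell (a rough sketch; the column name is illustrative and the exact plan text depends on the Spark version):

scala> val df = spark.range(10).selectExpr("CAST(id AS TIMESTAMP) AS time")
scala> df.where("time >= '2017-01-02 19:53:51'").explain()
// The Filter shows cast(time as string) >= 2017-01-02 19:53:51: the timestamp column
// is implicitly cast to string, which is the slow path this change avoids.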

// TimeStamp(2013-01-01 00:00 ...) < Cast("2014" AS timestamp) = true
case p @ BinaryComparison(left @ StringType(), right) if dateOrTimestampType(right) =>
  if (left.foldable) {
    p.makeCopy(Array(Cast(left, right.dataType), right))
  } else {
    p
  }
Contributor

You are changing the semantics of the comparison here. For example, '10' < current_timestamp currently returns true; if we apply your PR it will return null. This is a breaking change, and I don't think we should do this.
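
Concretely, the example can be checked in spark-shell (a minimal illustration of the same point):

scala> spark.sql("SELECT '10' < current_timestamp").show()
// Currently this returns true (the comparison is done on strings);
// with the proposed rewrite, casting '10' to timestamp yields null, so the result becomes null.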

Author

Thanks for pointing that out.
After applying the PR, '10' < current_timestamp does return null, which breaks the semantics.
A fix for this issue is still very important: we have seen an order-of-magnitude performance difference when the string is correctly cast to timestamp for literals in SQL, and most of these SQLs are generated by BI tools.
Any other suggestion to fix this issue? Shall I make it more restrictive so that the null cases are better covered, i.e.
if (expr.foldable && expr.eval(null) != null) cast to timestamp
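
A minimal sketch of that more restrictive guard, following the diff above (illustrative only; the SQLConf.get.sessionLocalTimeZone plumbing is an assumption, not part of the actual patch):

// Rewrite only when the foldable string literal actually parses as the target
// date/timestamp type, so that e.g. '10' < current_timestamp keeps its current semantics.
case p @ BinaryComparison(left @ StringType(), right)
    if dateOrTimestampType(right) && left.foldable =>
  // Assumed: resolve the session time zone so the Cast can be evaluated here.
  val castedLeft = Cast(left, right.dataType, Some(SQLConf.get.sessionLocalTimeZone))
  if (castedLeft.eval() != null) {
    p.makeCopy(Array(castedLeft, right))
  } else {
    p  // '10', 'abc', ...: leave the comparison unchanged
  }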

Member

Without this change, I think you can still explicitly cast the string to time/date in order to speed up the comparison, right?
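
For example (a sketch against the cstat table from the original ticket; the exact SQL is illustrative):

scala> // Relies on the implicit cast: the timestamp column ends up compared as a string.
scala> spark.sql("SELECT count(*) FROM cstat WHERE time >= '2017-01-02 19:53:51'").explain()
scala> // Casting the literal explicitly keeps the comparison in the timestamp domain,
scala> // so the predicate can also be pushed down to the Parquet reader.
scala> spark.sql("SELECT count(*) FROM cstat WHERE time >= CAST('2017-01-02 19:53:51' AS TIMESTAMP)").explain()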

Author

Yes, you can explicitly cast the string to timestamp and the query will be much faster. By default, without the cast, the query silently runs fine, picks up a very bad plan, and is about an order of magnitude slower, with no indication to the user whatsoever.
Some of the other issues related to comparison, such as time < 'abc', also run just fine; I think these should fail fast and let the user know about the casting issue.
The other problem is with BI tools that generate these SQLs, where the user has no direct control over the SQL.
We came across this issue when the same query was running 10 times faster in Impala than in Spark; investigating that led to this bug and hence the fix.

@gatorsmile
Member

Thanks for the investigation. This is definitely very useful to other Spark users. Maybe you can make a change to the existing documentation http://spark.apache.org/docs/latest/sql-programming-guide.html and explain our implicit type casting?

So far, we do not have a perfect solution that avoids external behavior changes if we adopt your change.

@gatorsmile
Member

ping @tanejagagan

@maropu
Member

maropu commented Jul 23, 2018

@tanejagagan Can you update?

@hindog

hindog commented Sep 5, 2018

I believe another performance impact related to this may be attributed to the cast operator failing to match during filter-pushdown, meaning that the filter on the timestamp will NOT get pushed down to the reader when there's a cast involved. You can also see this in the original ticket's query plan output for both versions:

== Physical Plan ==
CollectLimit 50000
+- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
   +- Exchange SinglePartition
      +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3339L])
         +- *Project
            +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
               +- *FileScan parquet default.cstat[time#3314] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: struct<time:timestamp>

== Physical Plan ==
CollectLimit 50000
+- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
   +- Exchange SinglePartition
      +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3287L])
         +- *Project
            +- *Filter ((isnotnull(time#3262) && (time#3262 >= 1483404831000000)) && (time#3262 <= 1484009631000000))
               +- *FileScan parquet default.cstat[time#3262] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time), GreaterThanOrEqual(time,2017-01-02 19:53:51.0), LessThanOrEqual(time,2017-01-09..., ReadSchema: struct<time:timestamp>

Note that PushedFilters is missing the GreaterThanOrEqual and LessThanOrEqual predicates in the former query but includes them in the latter. I've narrowed down where the PushedFilters get removed to DataSourceStrategy.translateFilter, where there's no match for the Cast expression, so the cast+filter gets removed from the list to be pushed down.
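
Roughly, translateFilter only matches comparisons whose child is a bare column attribute, so a Cast-wrapped column falls through. A simplified, hypothetical sketch (translateFilterSketch is a made-up name; the real DataSourceStrategy code covers many more cases):

import org.apache.spark.sql.catalyst.CatalystTypeConverters.convertToScala
import org.apache.spark.sql.catalyst.expressions
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, Literal}
import org.apache.spark.sql.sources

def translateFilterSketch(predicate: Expression): Option[sources.Filter] = predicate match {
  // A bare attribute compared with a literal becomes a pushable data source Filter.
  case expressions.GreaterThanOrEqual(a: Attribute, Literal(v, t)) =>
    Some(sources.GreaterThanOrEqual(a.name, convertToScala(v, t)))
  case expressions.LessThanOrEqual(a: Attribute, Literal(v, t)) =>
    Some(sources.LessThanOrEqual(a.name, convertToScala(v, t)))
  // cast(time#3314 as string) >= 2017-01-02 19:53:51 has a Cast, not an Attribute,
  // on the left-hand side, so it hits the default case and is dropped from
  // PushedFilters instead of being pushed to the Parquet reader.
  case _ => None
}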

@wangyum
Member

wangyum commented Sep 5, 2018

@tanejagagan Are you still working on this?

@maropu
Member

maropu commented Sep 5, 2018

I looked into the code and thought of another solution for this issue: detect specific binary comparisons in the Optimizer and replace them with specialized ones;
POC: master...maropu:SPARK-19145

quick benchmarks

scala> spark.range(10000000L).selectExpr("CAST(current_timestamp AS string) c0", "current_timestamp + interval 1 day AS c1").repartition(1).write.parquet("/tmp/t")
scala> :paste
def timer[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  println("Elapsed time: " + ((t1 - t0 + 0.0) / 1000000000.0)+ "s")
  result
}

// base number
scala> timer { spark.read.parquet("/tmp/t").repartition(1).selectExpr("CAST(c0 AS TIMESTAMP) < c1").queryExecution.executedPlan.execute().foreach(x => Unit) }
Elapsed time: 21.861871939s    

// without this patch
scala> timer { spark.read.parquet("/tmp/t").repartition(1).selectExpr("CAST(c0 AS TIMESTAMP) < c1").queryExecution.executedPlan.execute().foreach(x => Unit) }
Elapsed time: 48.424095598s   

// with this patch
scala> timer { spark.read.parquet("/tmp/t").repartition(1).selectExpr("c0 < c1").queryExecution.executedPlan.execute().foreach(x => Unit) }
Elapsed time: 22.459832991s                                                     

WDYT? cc: @hvanhovell @gatorsmile @tanejagagan

@dongjoon-hyun
Member

@maropu +1 for your idea.

@maropu
Member

maropu commented Oct 24, 2018

@dongjoon-hyun Thanks for checking! I'm still waiting for feedback from other developers. If the approach looks positive, I'll make a PR. Other ideas to solve this are also welcome. cc: @gatorsmile @hvanhovell @viirya

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 16, 2020
@github-actions github-actions bot closed this Jan 17, 2020