[SPARK-31982][SQL] Function sequence doesn't handle date increments that cross DST #28856
Conversation
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -2623,8 +2623,16 @@ object Sequence {
      // about a month length in days and a day length in microseconds
      val intervalStepInMicros =
It looks like the step should not be physical seconds, but a logical interval. cc @MaxGekk
The current implementation seems to be a strange mix. There are the following options:
- The step is an interval of (months, days, micros):
  - If the start point is TimestampType, we should convert it to a local date-time in the session time zone and add the interval by time components. The intermediate local timestamps should be converted back to micros using the session time zone, but we should keep adding the interval to the local timestamp "accumulator".
  - The same for dates: convert start to a local date. The time zone shouldn't be involved here.
- The step is a duration in micros or days (this is not our case):
  - If start is TimestampType, we shouldn't convert it to a local timestamp; we just add micros to instants, so the time zone is not involved here.
  - If start is DateType, we just add the number of days. The same as for timestamps: the time zone is not involved here.
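The first option above can be sketched roughly as follows (a hedged Java illustration using java.time; the class and method names are hypothetical, not Spark's actual implementation). The interval is added on local time components via an "accumulator", and each step is converted back to an instant through the session time zone:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

public class IntervalStepSketch {
    // Hypothetical helper: step a local date-time "accumulator" by a
    // (months, days, micros) interval. The addition happens on local time
    // components; only the conversion back to an instant uses the zone.
    static Instant[] sequence(Instant start, int months, int days, long micros,
                              int steps, ZoneId zone) {
        Instant[] out = new Instant[steps];
        LocalDateTime acc = LocalDateTime.ofInstant(start, zone);
        for (int i = 0; i < steps; i++) {
            out[i] = acc.atZone(zone).toInstant();
            acc = acc.plusMonths(months).plusDays(days).plusNanos(micros * 1000L);
        }
        return out;
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("America/Los_Angeles");
        // 2011-03-01 00:00 local time in Los_Angeles; stepping by 1 month
        // crosses the US spring-forward change but still lands on the 1st
        Instant start = Instant.parse("2011-03-01T08:00:00Z");
        for (Instant t : sequence(start, 1, 0, 0, 3, zone)) {
            System.out.println(LocalDateTime.ofInstant(t, zone).toLocalDate());
        }
        // prints 2011-03-01, 2011-04-01, 2011-05-01
    }
}
```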
Hi @cloud-fan, as @MaxGekk explained here, I am not sure this patch looks OK. I am willing to add more documentation to TemporalSequenceImpl,
but I am not sure whether we should follow this approach or refactor a little.
          Option(Literal(stringToInterval("interval 1 month"))),
          Option(tz)),
        Seq(
          Date.valueOf("2011-03-01"), Date.valueOf("2011-04-01")))
The guys are still in America/Los_Angeles, right?
Yes, America/Los_Angeles can pass the test.
I mean Date.valueOf always uses America/Los_Angeles, independently of what you test.
Sure, the result could be tz-independent.
Please wrap the code with withDefaultTimeZone, otherwise your expected dates are wrong.
@@ -2698,7 +2717,7 @@ object Sequence {
   | int $i = 0;
   |
   | while ($t < $exclusiveItem ^ $stepSign < 0) {
-  |   $arr[$i] = ($elemType) ($t / ${scale}L);
+  |   $arr[$i] = ($elemType) (Math.round($t / (float)${scale}L));
Floating-point ops look dangerous here. Can you avoid them?
@MaxGekk We may need the Math.round; without the float ops it seems hard to avoid the gap of about one day between the output and the expected result.
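For reference, the rounding can also be done in pure integer arithmetic (a hedged sketch in Java; roundDiv is a hypothetical helper, not Spark code). Across a spring-forward transition a "day" step covers only 23 physical hours, so a plain truncating division lands on the previous day; rounding to the nearest multiple fixes that without touching floating point:

```java
public class RoundDivSketch {
    // Round t to the nearest multiple of scale using integer math only.
    // Math.floorDiv keeps the rounding direction consistent for negative t,
    // unlike plain integer '/' which truncates toward zero.
    static long roundDiv(long t, long scale) {
        return Math.floorDiv(t + scale / 2, scale);
    }

    public static void main(String[] args) {
        long MICROS_PER_DAY = 86_400_000_000L;
        long dstDay = MICROS_PER_DAY - 3_600_000_000L; // a 23-hour "day" across spring-forward
        System.out.println(dstDay / MICROS_PER_DAY);          // truncation: 0 (off by a day)
        System.out.println(roundDiv(dstDay, MICROS_PER_DAY)); // rounds to 1
    }
}
```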
How do you get the result?
It seems to me that both
ok to test
Test build #124246 has finished for PR 28856 at commit
Test build #124254 has finished for PR 28856 at commit
Test build #124273 has finished for PR 28856 at commit
082f9c0 to 871867b
Test build #124275 has finished for PR 28856 at commit
Test build #124281 has finished for PR 28856 at commit
The function
Test build #124285 has finished for PR 28856 at commit
Test build #124286 has finished for PR 28856 at commit
347fa9d to 19d8c48
Test build #124321 has finished for PR 28856 at commit
Test build #124322 has finished for PR 28856 at commit
      }
      else {
        (daysToMicros(num.toInt(start), zoneId),
          daysToMicros(num.toInt(stop), zoneId))
Can you explain a bit more? It's hard to understand this change without any comment.
Yeah, please explain why, when scale != 1, start and stop contain days.
Because when scale != 1 the value is a day count, so we need to use the zone info to translate it into microseconds to get a correct result, rather than just multiplying by MICROS_PER_DAY, which ignores the time zone.
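To illustrate the point above (a sketch in Java; daysToMicros here is a hypothetical stand-in for Spark's helper of the same name, not its actual code): once the zone's offset changes across DST, interpreting a day count as local midnight in the session zone gives a different microsecond value than a plain multiply by MICROS_PER_DAY:

```java
import java.time.LocalDate;
import java.time.ZoneId;

public class DaysToMicrosSketch {
    static final long MICROS_PER_DAY = 86_400_000_000L;

    // Hypothetical stand-in: interpret a day count (days since the epoch)
    // as local midnight in the given zone, returned as micros since epoch.
    static long daysToMicros(long days, ZoneId zone) {
        return LocalDate.ofEpochDay(days)
            .atStartOfDay(zone)
            .toInstant()
            .getEpochSecond() * 1_000_000L;
    }

    public static void main(String[] args) {
        ZoneId chicago = ZoneId.of("America/Chicago");
        long before = LocalDate.of(2011, 3, 1).toEpochDay(); // before US spring-forward
        long after = LocalDate.of(2011, 4, 1).toEpochDay();  // after it
        // the gap between zone-aware and naive conversion shrinks by one hour
        // once DST starts: 6h (CST, -6) before vs 5h (CDT, -5) after
        System.out.println(daysToMicros(before, chicago) - before * MICROS_PER_DAY);
        System.out.println(daysToMicros(after, chicago) - after * MICROS_PER_DAY);
    }
}
```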
"Because when scale != 1, it is converted to a day count"
How can we tell?
Maybe we should add more documentation to TemporalSequenceImpl first, to understand what it is doing.
Done. Maybe we can pass the scale through the constructor:
private class TemporalSequenceImpl[T: ClassTag]
    (dt: IntegralType, scale: Long, fromLong: Long => T, zoneId: ZoneId)
      // Date to timestamp is not equal from GMT and Chicago timezones
      val (startMicros, stopMicros) = if (scale == 1) {
        (num.toLong(start), num.toLong(stop))
I don't think this is correct, see my comment https://github.com/apache/spark/pull/28856/files#r442366706, but it is at least backward compatible?
Maybe we could separate this into different methods?
Test build #124366 has finished for PR 28856 at commit
@@ -2589,6 +2589,8 @@ object Sequence {
     }
   }

+  // To generate time sequences, we use scale 1 in TemporalSequenceImpl
+  // for `TimestampType`, while MICROS_PER_DAY for `DateType`
If start/end is a date, can the step be seconds/minutes/hours?
Yes, it seems we can; the results are as follows:

scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 hour))").count
res19: Long = 1465
scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 minute))").count
res20: Long = 87841
scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 second))").count
res21: Long = 5270401
scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 minute))").head(3)
res25: Array[org.apache.spark.sql.Row] = Array([2011-03-01], [2011-03-01], [2011-03-01])
scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 second))").head(3)
res26: Array[org.apache.spark.sql.Row] = Array([2011-03-01], [2011-03-01], [2011-03-01])
scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 minute))").head(3)
res27: Array[org.apache.spark.sql.Row] = Array([2011-03-01], [2011-03-01], [2011-03-01])
scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 hour))").head(3)
res28: Array[org.apache.spark.sql.Row] = Array([2011-03-01], [2011-03-01], [2011-03-01])
does pgsql support it?
Seems pgsql can only support int, as follows:
postgres= create sequence seq_test;
CREATE SEQUENCE
postgres= select nextval('seq_test');
1
(1 row)
postgres= select nextval('seq_test');
2
(1 row)
Actually this function is from presto: https://prestodb.io/docs/current/functions/array.html
Can you check the behavior of presto? It looks confusing to use time fields as the step for date start/stop.
Based on presto-server-0.236:
presto> select sequence(date('2011-03-01'),date('2011-03-02'),interval '1' hour);
Query 20200624_122744_00002_pehix failed: sequence step must be a day interval if start and end values are dates
presto> select sequence(date('2011-03-01'),date('2011-03-02'),interval '1' day);
_col0
[2011-03-01, 2011-03-02]
(1 row)
Query 20200624_122757_00003_pehix, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
presto> select sequence(date('2011-03-01'),date('2011-03-02'),interval '1' month);
_col0
[2011-03-01]
(1 row)
Query 20200624_122806_00004_pehix, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
presto> select sequence(date('2011-03-01'),date('2011-03-02'),interval '1' year);
_col0
[2011-03-01]
(1 row)
Query 20200624_122810_00005_pehix, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
@cloud-fan Done. It seems the step can be day, month, or year when start and end are DateType in Presto.
I think the presto behavior makes sense. Can we send a PR to follow it first? e.g. throw an exception if the step is time fields while start/end is date. This can also simplify the implementation.
Ok, I will do it tomorrow.
      val startMicros: Long = num.toLong(start) * scale
      val stopMicros: Long = num.toLong(stop) * scale

      // Date to timestamp is not equal from GMT and Chicago timezones
Why does this code depend on a few specific time zones? Also, the other comments look valid here. We should take the time zone into account for timestamps too.
The date seems to differ from west to east. When the input is a date, we may need to consider the zone info to convert it to a timestamp; if it is already a timestamp rather than a date, we can ignore the zone because the timestamp already accounts for it.
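The timestamp case can be sketched as follows (a hedged Java illustration using java.time, not Spark code): adding a physical duration directly to an instant needs no zone at all, and the DST shift only shows up when the result is rendered as local time:

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

public class InstantStepSketch {
    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("America/Los_Angeles");
        // 2011-03-13 00:00 local time, the day US DST starts
        Instant start = Instant.parse("2011-03-13T08:00:00Z");
        // adding 24 physical hours involves no time zone
        Instant next = start.plus(Duration.ofHours(24));
        // but the local wall clock has jumped an hour: 00:00 -> 01:00
        System.out.println(LocalDateTime.ofInstant(start, zone)); // 2011-03-13T00:00
        System.out.println(LocalDateTime.ofInstant(next, zone));  // 2011-03-14T01:00
    }
}
```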
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Add a unit test.
Logical bug fix in org.apache.spark.sql.catalyst.expressions.Sequence.TemporalSequenceImpl.

Why are the changes needed?
Spark's sequence function doesn't handle date increments that cross DST.

Does this PR introduce any user-facing change?
Before the PR, people would not get a correct result: with spark.sql.session.timeZone set to Asia/Shanghai, America/Chicago, or GMT, executing
sql("select sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 month)").show(false)
would return [2011-03-01, 2011-04-01, 2011-05-01], [2011-03-01, 2011-03-28, 2011-04-28], and [2011-03-01, 2011-04-01, 2011-05-01] respectively. After the PR, the sequence date conversion is corrected: all three settings yield [2011-03-01, 2011-04-01, 2011-05-01].

How was this patch tested?
Unit test.