Conversation

@Fokko (Contributor) commented Jun 13, 2020

Currently, when you take an average of a Timestamp, you'll end up with a Double, representing the seconds since epoch. This is because of old Hive behavior.

I strongly believe it is better to return a Timestamp, making the behavior consistent with, for example, the max function:

```
scala> df.agg(max("ts"), avg("ts")).show()
+-------------------+----------+
|            max(ts)|   avg(ts)|
+-------------------+----------+
|2018-03-04 06:11:40|7.126971E8|
+-------------------+----------+
```

This might also improve performance because we get rid of the implicit cast.
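In the meantime, a minimal workaround sketch for recovering a timestamp from the Double (reusing the `df` above; this relies on Spark interpreting a numeric-to-timestamp cast as seconds since epoch, and `avg_ts` is just an illustrative alias):

```
import org.apache.spark.sql.functions.avg

// Cast the Double average (seconds since epoch) back to a timestamp by hand.
df.agg(avg("ts").cast("timestamp").as("avg_ts")).show()
```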

Behaviour in Postgres:

```
root@8c4241b617ec:/# psql postgres postgres
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

postgres=# CREATE TABLE timestamp_demo (ts TIMESTAMP);
CREATE TABLE
postgres=# INSERT INTO timestamp_demo VALUES('2019-01-01 18:22:11');
INSERT 0 1
postgres=# INSERT INTO timestamp_demo VALUES('2018-01-01 18:22:11');
INSERT 0 1
postgres=# INSERT INTO timestamp_demo VALUES('2017-01-01 18:22:11');
INSERT 0 1
postgres=# SELECT AVG(ts) FROM timestamp_demo;
ERROR:  function avg(timestamp without time zone) does not exist
LINE 1: SELECT AVG(ts) FROM timestamp_demo;
```

Behaviour in MySQL:

```
root@bab43a5731e8:/# mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 8.0.20 MySQL Community Server - GPL

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE TABLE timestamp_demo (ts TIMESTAMP);
Query OK, 0 rows affected (0.05 sec)

mysql> INSERT INTO timestamp_demo VALUES('2019-01-01 18:22:11');
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO timestamp_demo VALUES('2018-01-01 18:22:11');
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO timestamp_demo VALUES('2017-01-01 18:22:11');
Query OK, 1 row affected (0.01 sec)

mysql> SELECT AVG(ts) FROM timestamp_demo;
+---------------------+
| AVG(ts)             |
+---------------------+
| 20180101182211.0000 |
+---------------------+
1 row in set (0.00 sec)
```

That is, MySQL implicitly casts the timestamp to a number in YYYYMMDDHHMMSS format and averages those numbers; in this example the three encodings differ only in the year digits, so their mean happens to be exactly the middle row.

What changes were proposed in this pull request?

Return a Timestamp when taking the average of a Timestamp column.

Why are the changes needed?

I believe it is awkward to implicitly change the type when taking an average.

Does this PR introduce any user-facing change?

Yes, it will return a Timestamp instead of a Double when taking an average of a Timestamp column.
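Users who depend on the old Double result can keep it with an explicit cast; a minimal sketch (casting a timestamp to a double in Spark yields seconds since epoch):

```
import org.apache.spark.sql.functions.{avg, col}

// Reproduces the pre-change behaviour explicitly: a Double holding the
// average number of seconds since epoch.
df.agg(avg(col("ts").cast("double"))).show()
```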

How was this patch tested?

Using unit tests and existing tests.

@HyukjinKwon (Member):

ok to test

@SparkQA commented Jun 14, 2020

Test build #123996 has finished for PR 28821 at commit 707b0cf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines +79 to +82
```
TestDataTimestamp(new Timestamp(1420140300000L)) :: // 2015-01-01 20:25:00
TestDataTimestamp(new Timestamp(1320140300000L)) :: // 2011-11-01 10:38:20
TestDataTimestamp(new Timestamp(1520140300000L)) :: // 2018-03-04 06:11:40
TestDataTimestamp(new Timestamp(-1409632500000L)) :: // 1925-05-01 19:44:32
```
Member:

Do you set the fractional part of the seconds to zero intentionally?

@Fokko (Contributor Author):

No, this wasn't intentional. I can add some fractions if you like.

@dongjoon-hyun (Member):

Retest this please.

@SparkQA commented Jun 15, 2020

Test build #124024 has finished for PR 28821 at commit 707b0cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented Jun 15, 2020

Hi, @Fokko. The last failure means that we need to regenerate the output file of ThriftServerQueryTestSuite.sql. To proceed, please update it.

@SparkQA commented Jun 15, 2020

Test build #124049 has finished for PR 28821 at commit 83be227.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member):

Retest this please.

@SparkQA commented Jun 16, 2020

Test build #124130 has finished for PR 28821 at commit 83be227.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Fokko (Contributor Author) commented Jun 16, 2020

Thanks for the restart @dongjoon-hyun. Another error now; let me dive into it and get back to y'all.

@SparkQA commented Jun 23, 2020

Test build #124369 has finished for PR 28821 at commit bd99e28.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class MapOutputCommitMessage
  • sealed trait LogisticRegressionSummary extends ClassificationSummary
  • class _ClassificationSummary(JavaWrapper):
  • class _TrainingSummary(JavaWrapper):
  • class _BinaryClassificationSummary(_ClassificationSummary):
  • class LogisticRegressionSummary(_ClassificationSummary):
  • class LogisticRegressionTrainingSummary(LogisticRegressionSummary, _TrainingSummary):
  • class BinaryLogisticRegressionSummary(_BinaryClassificationSummary,
  • trait TimestampFormatterHelper extends TimeZoneAwareExpression
  • case class WidthBucket(
  • trait PredicateHelper extends Logging
  • case class TimeFormatters(date: DateFormatter, timestamp: TimestampFormatter)
  • case class CoalesceBucketsInSortMergeJoin(conf: SQLConf) extends Rule[SparkPlan]
  • case class ProcessingTimeTrigger(intervalMs: Long) extends Trigger
  • case class ContinuousTrigger(intervalMs: Long) extends Trigger
  • class StateStoreConf(

```
2020-12-30 16:00:00 a 2020-12-30 16:00:00
2017-07-31 17:00:00 b 2017-08-09 03:00:00
2017-08-17 13:00:00 b 2017-08-17 13:00:00
2020-12-30 16:00:00 b 2020-12-30 16:00:00
```
Member:

Hm.. so we allow timestamp types whereas other DBMSes disallow them?

@Fokko (Contributor Author):

Well, there isn't a real consensus among other DBMSes. Keeping it a Timestamp seems like what you would expect; MySQL's behavior is much more awkward, in my opinion. Spark needs to pave the path on this one :)

Member:

It seems Oracle cannot accept timestamp for average either. Any other system supporting this behaviour?

@SparkQA commented Jun 30, 2020

Test build #124669 has finished for PR 28821 at commit 14dcf36.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Fokko (Contributor Author) commented Jul 2, 2020

Found the issue.

```
(sum / count).cast(resultType)
```

Needs to be:

```
(sum / count.cast(DecimalType.LongDecimal)).cast(resultType)
```

Similar to the line above: for the division, both the dividend and the divisor need to be a fractional type. Great to have such extensive tests :)
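For context, a sketch of where that fix sits in Average.evaluateExpression; this is an illustrative reconstruction of the timestamp branch, not the exact diff from this PR:

```
// Hypothetical TimestampType branch of Average.evaluateExpression.
// sum is accumulated as a Decimal, but count is a Long, so the divisor
// must be cast to a decimal as well before dividing; the outer cast turns
// the fractional result back into the declared resultType (TimestampType).
case _: TimestampType =>
  (sum / count.cast(DecimalType.LongDecimal)).cast(resultType)
```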

@SparkQA commented Jul 2, 2020

Test build #124895 has finished for PR 28821 at commit d2c49f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Fokko added 3 commits July 2, 2020 13:58
…estamp

Otherwise the division would be (Double / Long), which is incompatible.
@SparkQA commented Jul 2, 2020

Test build #124898 has finished for PR 28821 at commit 3772374.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


```
// Hive lets you do aggregation of timestamps... for some reason
case Sum(e @ TimestampType()) => Sum(Cast(e, DoubleType))
case Average(e @ TimestampType()) => Average(Cast(e, DoubleType))
```
Member:

When changing the existing behaviour, we need to update the migration guide, and we might need to add a legacy config to keep the current behaviour.
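For illustration, such a legacy config could look roughly like the sketch below; the config name and wording are hypothetical, as no config was added before the PR went stale:

```
// Hypothetical SQLConf entry restoring the old Hive-compatible behaviour.
val LEGACY_AVG_TIMESTAMP_AS_DOUBLE =
  buildConf("spark.sql.legacy.averageOfTimestampReturnsDouble")
    .internal()
    .doc("When true, AVG over a TIMESTAMP column returns a Double holding " +
      "the average number of seconds since epoch, instead of a Timestamp.")
    .booleanConf
    .createWithDefault(false)
```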

@github-actions (bot):

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label Oct 13, 2020
github-actions bot closed this Oct 14, 2020