Conversation

@Fokko (Contributor) commented Jun 13, 2020

Currently, when you take an average of a Timestamp, you'll end up with a Double, representing the seconds since epoch. This is because of old Hive behavior.

I strongly believe it is better to return a Timestamp, making the behavior consistent with, for example, the max function:

```
scala> df.agg(max("ts"), avg("ts")).show()
+-------------------+----------+
|            max(ts)|   avg(ts)|
+-------------------+----------+
|2018-03-04 06:11:40|7.126971E8|
+-------------------+----------+
```

This might also improve performance because we get rid of the implicit cast.
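In the meantime, a minimal workaround sketch for recovering a timestamp from the Double (reusing the `df` above; this relies on Spark interpreting a numeric-to-timestamp cast as seconds since epoch, and `avg_ts` is just an illustrative alias):

```
import org.apache.spark.sql.functions.avg

// Cast the Double average (seconds since epoch) back to a timestamp by hand.
df.agg(avg("ts").cast("timestamp").as("avg_ts")).show()
```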

Behaviour in Postgres:

```
root@8c4241b617ec:/# psql postgres postgres
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

postgres=# CREATE TABLE timestamp_demo (ts TIMESTAMP);
CREATE TABLE
postgres=# INSERT INTO timestamp_demo VALUES('2019-01-01 18:22:11');
INSERT 0 1
postgres=# INSERT INTO timestamp_demo VALUES('2018-01-01 18:22:11');
INSERT 0 1
postgres=# INSERT INTO timestamp_demo VALUES('2017-01-01 18:22:11');
INSERT 0 1
postgres=# SELECT AVG(ts) FROM timestamp_demo;
ERROR:  function avg(timestamp without time zone) does not exist
LINE 1: SELECT AVG(ts) FROM timestamp_demo;
```

Behaviour in MySQL:

```
root@bab43a5731e8:/# mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 8.0.20 MySQL Community Server - GPL

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE TABLE timestamp_demo (ts TIMESTAMP);
Query OK, 0 rows affected (0.05 sec)

mysql> INSERT INTO timestamp_demo VALUES('2019-01-01 18:22:11');
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO timestamp_demo VALUES('2018-01-01 18:22:11');
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO timestamp_demo VALUES('2017-01-01 18:22:11');
Query OK, 1 row affected (0.01 sec)

mysql> SELECT AVG(ts) FROM timestamp_demo;
+---------------------+
| AVG(ts)             |
+---------------------+
| 20180101182211.0000 |
+---------------------+
1 row in set (0.00 sec)
```

That is, MySQL implicitly casts the timestamp to a number in YYYYMMDDHHMMSS format and averages those numbers; in this example the three encodings differ only in the year digits, so their mean happens to be exactly the middle row.

What changes were proposed in this pull request?

Return a Timestamp when taking the average of a Timestamp column.

Why are the changes needed?

I believe it is awkward to implicitly change the type when taking an average.

Does this PR introduce any user-facing change?

Yes, it will return a Timestamp instead of a Double when taking an average of a Timestamp column.
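Users who depend on the old Double result can keep it with an explicit cast; a minimal sketch (casting a timestamp to a double in Spark yields seconds since epoch):

```
import org.apache.spark.sql.functions.{avg, col}

// Reproduces the pre-change behaviour explicitly: a Double holding the
// average number of seconds since epoch.
df.agg(avg(col("ts").cast("double"))).show()
```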

How was this patch tested?

Using unit tests and existing tests.

@HyukjinKwon (Member):

ok to test

@SparkQA commented Jun 14, 2020

Test build #123996 has finished for PR 28821 at commit 707b0cf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines +79 to +82
```
TestDataTimestamp(new Timestamp(1420140300000L)) :: // 2015-01-01 20:25:00
TestDataTimestamp(new Timestamp(1320140300000L)) :: // 2011-11-01 10:38:20
TestDataTimestamp(new Timestamp(1520140300000L)) :: // 2018-03-04 06:11:40
TestDataTimestamp(new Timestamp(-1409632500000L)) :: // 1925-05-01 19:44:32
```
Member:

Do you set the fractional part of the seconds to zero intentionally?

@Fokko (Contributor Author):

No, this wasn't intentional. I can add some fractions if you like.

@dongjoon-hyun (Member):

Retest this please.

@SparkQA commented Jun 15, 2020

Test build #124024 has finished for PR 28821 at commit 707b0cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented Jun 15, 2020

Hi, @Fokko. The last failure means that we need to regenerate the output file of ThriftServerQueryTestSuite.sql. To proceed, please update it.

@SparkQA commented Jun 15, 2020

Test build #124049 has finished for PR 28821 at commit 83be227.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member):

Retest this please.

@SparkQA commented Jun 16, 2020

Test build #124130 has finished for PR 28821 at commit 83be227.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Fokko (Contributor Author) commented Jun 16, 2020

Thanks for the restart @dongjoon-hyun. Another error now; let me dive into it and get back to y'all.

@SparkQA commented Jun 23, 2020

Test build #124369 has finished for PR 28821 at commit bd99e28.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class MapOutputCommitMessage
  • sealed trait LogisticRegressionSummary extends ClassificationSummary
  • class _ClassificationSummary(JavaWrapper):
  • class _TrainingSummary(JavaWrapper):
  • class _BinaryClassificationSummary(_ClassificationSummary):
  • class LogisticRegressionSummary(_ClassificationSummary):
  • class LogisticRegressionTrainingSummary(LogisticRegressionSummary, _TrainingSummary):
  • class BinaryLogisticRegressionSummary(_BinaryClassificationSummary,
  • trait TimestampFormatterHelper extends TimeZoneAwareExpression
  • case class WidthBucket(
  • trait PredicateHelper extends Logging
  • case class TimeFormatters(date: DateFormatter, timestamp: TimestampFormatter)
  • case class CoalesceBucketsInSortMergeJoin(conf: SQLConf) extends Rule[SparkPlan]
  • case class ProcessingTimeTrigger(intervalMs: Long) extends Trigger
  • case class ContinuousTrigger(intervalMs: Long) extends Trigger
  • class StateStoreConf(

```
2020-12-30 16:00:00 a 2020-12-30 16:00:00
2017-07-31 17:00:00 b 2017-08-09 03:00:00
2017-08-17 13:00:00 b 2017-08-17 13:00:00
2020-12-30 16:00:00 b 2020-12-30 16:00:00
```
Member:

Hm.. so we allow timestamp types whereas other DBMSes disallow them?

@Fokko (Contributor Author):

Well, there isn't a real consensus among other DBMSes. Keeping it a Timestamp seems like what you would expect; MySQL's behavior is much more awkward, in my opinion. Spark needs to pave the path on this one :)

Member:

It seems Oracle cannot accept timestamp for average either. Any other system supporting this behaviour?

@SparkQA commented Jun 30, 2020

Test build #124669 has finished for PR 28821 at commit 14dcf36.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Fokko (Contributor Author) commented Jul 2, 2020

Found the issue.

```
(sum / count).cast(resultType)
```

Needs to be:

```
(sum / count.cast(DecimalType.LongDecimal)).cast(resultType)
```

Similar to the line above: for the division, both the dividend and the divisor need to be a fractional type. Great to have such extensive tests :)
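For context, a sketch of where that fix sits in Average.evaluateExpression; this is an illustrative reconstruction of the timestamp branch, not the exact diff from this PR:

```
// Hypothetical TimestampType branch of Average.evaluateExpression.
// sum is accumulated as a Decimal, but count is a Long, so the divisor
// must be cast to a decimal as well before dividing; the outer cast turns
// the fractional result back into the declared resultType (TimestampType).
case _: TimestampType =>
  (sum / count.cast(DecimalType.LongDecimal)).cast(resultType)
```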

@SparkQA commented Jul 2, 2020

Test build #124895 has finished for PR 28821 at commit d2c49f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Fokko added 3 commits July 2, 2020 13:58
…estamp

Otherwise the division would be (Double / Long), which is incompatible.
@SparkQA commented Jul 2, 2020

Test build #124898 has finished for PR 28821 at commit 3772374.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


```
// Hive lets you do aggregation of timestamps... for some reason
case Sum(e @ TimestampType()) => Sum(Cast(e, DoubleType))
case Average(e @ TimestampType()) => Average(Cast(e, DoubleType))
```
Member:

When changing the existing behaviour, we need to update the migration guide, and we might need to add a legacy config to keep the current behaviour.
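For illustration, such a legacy config could look roughly like the sketch below; the config name and wording are hypothetical, as no config was added before the PR went stale:

```
// Hypothetical SQLConf entry restoring the old Hive-compatible behaviour.
val LEGACY_AVG_TIMESTAMP_AS_DOUBLE =
  buildConf("spark.sql.legacy.averageOfTimestampReturnsDouble")
    .internal()
    .doc("When true, AVG over a TIMESTAMP column returns a Double holding " +
      "the average number of seconds since epoch, instead of a Timestamp.")
    .booleanConf
    .createWithDefault(false)
```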

@github-actions (bot):

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label Oct 13, 2020
github-actions bot closed this Oct 14, 2020