[SPARK-31981][SQL] Keep TimestampType when taking an average of a Timestamp #28821
ok to test
Test build #123996 has finished for PR 28821 at commit
TestDataTimestamp(new Timestamp(1420140300000L)) :: // 2015-01-01 20:25:00
TestDataTimestamp(new Timestamp(1320140300000L)) :: // 2011-11-01 10:38:20
TestDataTimestamp(new Timestamp(1520140300000L)) :: // 2018-03-04 06:11:40
TestDataTimestamp(new Timestamp(-1409632500000L)) :: // 1925-05-01 19:44:32
Do you set the fractional part of seconds to zeros intentionally?
No, this wasn't intentional. I can add some fractions if you like.
Retest this please.
Test build #124024 has finished for PR 28821 at commit
Hi, @Fokko. The last failure means that we need to regenerate the output file of
Test build #124049 has finished for PR 28821 at commit
Retest this please.
Test build #124130 has finished for PR 28821 at commit
Thanks for the restart @dongjoon-hyun. Another error now; let me dive into it and get back to y'all.
Test build #124369 has finished for PR 28821 at commit
2020-12-30 16:00:00 a 2020-12-30 16:00:00
2017-07-31 17:00:00 b 2017-08-09 03:00:00
2017-08-17 13:00:00 b 2017-08-17 13:00:00
2020-12-30 16:00:00 b 2020-12-30 16:00:00
Hm .. so we allow timestamp types whereas other DBMSes disallow?
Well, there isn't a real consensus among other DBMSes. Keeping it a Timestamp seems like what you would expect. MySQL's behavior is much more awkward in my opinion. Spark needs to pave the path on this one :)
It seems Oracle cannot accept a timestamp for average, either. Does any other system support this behaviour?
Test build #124669 has finished for PR 28821 at commit
Found the issue. Needs to be: Similar to the line above. For the division operation, both the dividend and the divisor need to be a Fractional type. Great to have such extensive tests :)
Test build #124895 has finished for PR 28821 at commit
…estamp
Currently, when you take an average of a Timestamp,
you'll end up with a Double, representing the seconds
since epoch. This is because of old Hive behavior.
I strongly believe that it is better to return a Timestamp.
Behaviour in Postgres:
```
root@8c4241b617ec:/# psql postgres postgres
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.
postgres=# CREATE TABLE timestamp_demo (ts TIMESTAMP);
CREATE TABLE
postgres=# INSERT INTO timestamp_demo VALUES('2019-01-01 18:22:11');
INSERT 0 1
postgres=# INSERT INTO timestamp_demo VALUES('2018-01-01 18:22:11');
INSERT 0 1
postgres=# INSERT INTO timestamp_demo VALUES('2017-01-01 18:22:11');
INSERT 0 1
postgres=# SELECT AVG(ts) FROM timestamp_demo;
ERROR: function avg(timestamp without time zone) does not exist
LINE 1: SELECT AVG(ts) FROM timestamp_demo;
```
Behaviour in MySQL:
```
root@bab43a5731e8:/# mysql
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 8.0.20 MySQL Community Server - GPL
Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE TABLE timestamp_demo (ts TIMESTAMP);
Query OK, 0 rows affected (0.05 sec)
mysql> INSERT INTO timestamp_demo VALUES('2019-01-01 18:22:11');
Query OK, 1 row affected (0.01 sec)
mysql> INSERT INTO timestamp_demo VALUES('2018-01-01 18:22:11');
Query OK, 1 row affected (0.01 sec)
mysql> INSERT INTO timestamp_demo VALUES('2017-01-01 18:22:11');
Query OK, 1 row affected (0.01 sec)
mysql> SELECT AVG(ts) FROM timestamp_demo;
+---------------------+
| AVG(ts) |
+---------------------+
| 20180101182211.0000 |
+---------------------+
1 row in set (0.00 sec)
```
The result is the timestamp written as a number in YYYYMMDDHHMMSS format, returned as a double.
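To illustrate why averaging that numeric form is awkward, here is a minimal sketch in plain Scala. It is not Spark or MySQL code: the object and method names are made up for illustration, and it assumes MySQL averages the YYYYMMDDHHMMSS numbers, which is what the output above suggests. For the three demo rows both approaches happen to agree, but for two timestamps that straddle a year boundary the number-based average is no longer a valid timestamp.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper object, for illustration only.
object MysqlStyleAverage {
  private val packed = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // Write a timestamp as the YYYYMMDDHHMMSS number seen in the MySQL output.
  def toPackedNumber(ts: LocalDateTime): Long = ts.format(packed).toLong

  // Assumption: average the packed numbers directly.
  def packedAverage(values: Seq[LocalDateTime]): Double =
    values.map(toPackedNumber(_).toDouble).sum / values.size

  // Average on the time line instead: mean of epoch seconds, read back in UTC.
  // Integer division drops any fractional second.
  def instantAverage(values: Seq[LocalDateTime]): LocalDateTime = {
    val meanSeconds = values.map(_.toEpochSecond(ZoneOffset.UTC)).sum / values.size
    LocalDateTime.ofEpochSecond(meanSeconds, 0, ZoneOffset.UTC)
  }

  def main(args: Array[String]): Unit = {
    val aroundNewYear = Seq(
      LocalDateTime.of(2019, 12, 31, 23, 59, 59),
      LocalDateTime.of(2020, 1, 1, 0, 0, 1))
    // The packed average has "month 56, day 66": not a real timestamp.
    println(f"${packedAverage(aroundNewYear)}%.0f")
    // The time-line average is exactly midnight on 2020-01-01.
    println(instantAverage(aroundNewYear))
  }
}
```

For the pair around New Year, the packed average is 20195666117980, while the time-line average is 2020-01-01 00:00:00.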
Otherwise the division would be (Double / Long) which is incompatible.
Test build #124898 has finished for PR 28821 at commit
// Hive lets you do aggregation of timestamps... for some reason
case Sum(e @ TimestampType()) => Sum(Cast(e, DoubleType))
case Average(e @ TimestampType()) => Average(Cast(e, DoubleType))
When changing the existing behaviour, we need to update the migration guide and might need to add a legacy config to keep the current behaviour.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Currently, when you take an average of a Timestamp, you'll end up with a Double, representing the seconds since epoch. This is because of old Hive behavior.
I strongly believe that it is better to return a Timestamp, to make the behavior more congruent with the max function, for example:
This might also improve performance because we get rid of the implicit cast.
What changes were proposed in this pull request?
Return a Timestamp when taking the average of a Timestamp column.
Why are the changes needed?
I believe that it is awkward to implicitly change the type when doing an average.
Does this PR introduce any user-facing change?
Yes, it will return a Timestamp instead of a Double when taking an average of a Timestamp column.
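As a rough illustration of the difference in result type, here is a sketch in plain Scala rather than Spark code. The object and method names are hypothetical, and it works on `java.sql.Timestamp` at millisecond precision, which is an assumption made to keep the example small and not a claim about how Spark stores timestamps internally.

```scala
import java.sql.Timestamp

// Hypothetical helper object, for illustration only.
object AverageTimestampSketch {
  // Behaviour described as current: the mean of seconds since epoch, as a Double.
  def averageAsDouble(values: Seq[Timestamp]): Double =
    values.map(_.getTime / 1000.0).sum / values.size

  // Behaviour described as proposed: the same mean, but kept as a Timestamp.
  // Integer division drops any fractional millisecond.
  def averageAsTimestamp(values: Seq[Timestamp]): Timestamp =
    new Timestamp(values.map(_.getTime).sum / values.size)

  def main(args: Array[String]): Unit = {
    // Three of the values used in the test data of this PR.
    val sample = Seq(
      new Timestamp(1420140300000L),
      new Timestamp(1320140300000L),
      new Timestamp(1520140300000L))
    println(averageAsDouble(sample))    // a number of seconds since epoch
    println(averageAsTimestamp(sample)) // a timestamp, printed in the JVM's time zone
  }
}
```

With these three inputs the mean falls exactly on the first value, so the Double result is 1420140300 seconds and the Timestamp result carries 1420140300000 milliseconds.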
How was this patch tested?
Using unit tests and existing tests.