Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add time bucket function #5665

Merged
merged 15 commits into from
Jan 21, 2023
Merged

Conversation

papparapa
Copy link
Contributor

@papparapa papparapa commented Dec 12, 2022

Closes #5589

Edit: PR of non-ICU time_bucket function has been split into #5835.

Implemented four overloads of ICU time_bucket function.
They behave almost same as TimescaleDB's time_bucket.

TIME_BUCKET(bucket_width: INTERVAL, ts: TIMESTAMP)
TIME_BUCKET(bucket_width: INTERVAL, ts: TIMESTAMPTZ)
TIME_BUCKET(bucket_width: INTERVAL, ts: DATE)

TIME_BUCKET(bucket_width: INTERVAL, ts: TIMESTAMP, offset: INTERVAL)
TIME_BUCKET(bucket_width: INTERVAL, ts: TIMESTAMPTZ, offset: INTERVAL)
TIME_BUCKET(bucket_width: INTERVAL, ts: DATE, offset: INTERVAL)

TIME_BUCKET(bucket_width: INTERVAL, ts: TIMESTAMP, origin: TIMESTAMP)
TIME_BUCKET(bucket_width: INTERVAL, ts: TIMESTAMPTZ, origin: TIMESTAMPTZ)
TIME_BUCKET(bucket_width: INTERVAL, ts: DATE, origin: DATE)

TIME_BUCKET(bucket_width: INTERVAL, ts: TIMESTAMPTZ, timezone: TEXT)

To ease the tests, I set the default origin to '2000-01-03 00:00:00+00' (when bucket_width is small) or '2000-01-01 00:00:00+00' (when bucket_width is large) as the same as TimescaleDB.

Copy link
Contributor

@hawkfish hawkfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a good idea to remove the TIMESTAMPTZ versions from this PR and implement them separately. The implementations you have now can be accessed by just casting the TSTZ values to TS.

src/common/types/date.cpp Outdated Show resolved Hide resolved
src/include/duckdb/common/types/date.hpp Outdated Show resolved Hide resolved
test/sql/function/date/test_time_bucket_date.test Outdated Show resolved Hide resolved
test/sql/function/date/test_time_bucket_date.test Outdated Show resolved Hide resolved
test/sql/function/date/test_time_bucket_timestamptz.test Outdated Show resolved Hide resolved
test/sql/function/date/test_time_bucket_timestamp.test Outdated Show resolved Hide resolved
src/function/scalar/date/time_bucket.cpp Outdated Show resolved Hide resolved
src/function/scalar/date/time_bucket.cpp Outdated Show resolved Hide resolved
src/function/scalar/date/time_bucket.cpp Outdated Show resolved Hide resolved
src/function/scalar/date/time_bucket.cpp Outdated Show resolved Hide resolved
@papparapa
Copy link
Contributor Author

Thanks @hawkfish ! I will fix them.

@papparapa
Copy link
Contributor Author

Added one more overload

TIME_BUCKET(bucket_width: INTERVAL, ts: TIMESTAMPTZ, timezone: TEXT)

Copy link
Contributor

@hawkfish hawkfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this - especially the TimebaseDB compatibility. And I'm sorry someone posted that this is a "Good First Issue"! It's easy enough for instants (and you did a great job with that), but time zones are a black art.

In particular, "local timestamps" are a dangerous hack and should be avoided whenever possible. We only have support for them because some people have dirty data that uses them and this lets them (mostly) clean them up, and also for compatibility with Postgres and the SQL standard.

I may have mentioned my blog post on time binning, but not you might want to read it if I am not making sense.

extension/icu/icu-timebucket.cpp Outdated Show resolved Hide resolved
extension/icu/icu-timebucket.cpp Outdated Show resolved Hide resolved
extension/icu/icu-timebucket.cpp Outdated Show resolved Hide resolved
extension/icu/icu-timebucket.cpp Outdated Show resolved Hide resolved
extension/icu/icu-timebucket.cpp Outdated Show resolved Hide resolved
src/include/duckdb/common/types/interval.hpp Outdated Show resolved Hide resolved
@hawkfish
Copy link
Contributor

Also I can't resolve the conversations that you have fixed, so can you close them so we can focus on the remaining issues? Thx.

Copy link
Contributor

@hawkfish hawkfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry wanted to ask some more questions!

extension/icu/icu-timebucket.cpp Outdated Show resolved Hide resolved
extension/icu/icu-timebucket.cpp Outdated Show resolved Hide resolved
src/function/scalar/date/time_bucket.cpp Outdated Show resolved Hide resolved
@papparapa
Copy link
Contributor Author

Thanks @hawkfish for careful review!
I'll start with easy fixes.

extension/icu/icu-dateadd.cpp Outdated Show resolved Hide resolved
extension/icu/icu-dateadd.cpp Outdated Show resolved Hide resolved
src/function/scalar/date/time_bucket.cpp Outdated Show resolved Hide resolved
src/function/scalar/date/time_bucket.cpp Outdated Show resolved Hide resolved
src/function/scalar/date/time_bucket.cpp Outdated Show resolved Hide resolved
extension/icu/icu-timebucket.cpp Outdated Show resolved Hide resolved
extension/icu/icu-timebucket.cpp Outdated Show resolved Hide resolved
extension/icu/icu-timebucket.cpp Outdated Show resolved Hide resolved
extension/icu/icu-timebucket.cpp Outdated Show resolved Hide resolved
extension/icu/icu-timebucket.cpp Outdated Show resolved Hide resolved
@hawkfish
Copy link
Contributor

Thanks for all the hard work! This is coming along nicely, so please don't be discouraged. I'd highly recommend merging from master before your next push - there have been some broken CI bits. Oh and thanks for fixing the formatting - I swear I thought it was included!

Copy link
Contributor

@hawkfish hawkfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting close! I found some small bugs in the naïve timestamp handling but I don't think we will collide for now.

extension/icu/include/icu-datefunc.hpp Outdated Show resolved Hide resolved
extension/icu/icu-timezone.cpp Outdated Show resolved Hide resolved
@papparapa
Copy link
Contributor Author

papparapa commented Jan 3, 2023

The current implementation still has bug-like behavior.

time_bucket result of timestamp on the day DST ended gets greater than the original timestamp.
This is unexpected.

In Europe/London timezone, DST in 2000 started at '2000-03-26 01:00:00+00' and ended at '2000-10-29 02:00:00+01'.

D set timezone = 'Europe/London';
D with t2 as (select t1.ts::timestamptz as ts, time_bucket('10 seconds'::interval, t1.ts::timestamptz, 'Europe/London') as tb from generate_series('2000-10-28 00:00:00'::timestamp, '2000-10-31 00:00:00'::timestamp, '3 hours'::interval) t1 (ts)) select ts, tb, ts = tb as always_should_be_true from t2;
┌──────────────────────────┬──────────────────────────┬───────────────────────┐
│            ts            │            tb            │ always_should_be_true │
│ timestamp with time zonetimestamp with time zoneboolean        │
├──────────────────────────┼──────────────────────────┼───────────────────────┤
│ 2000-10-28 00:00:00+012000-10-28 00:00:00+01   │ true                  │
│ 2000-10-28 03:00:00+012000-10-28 03:00:00+01   │ true                  │
│ 2000-10-28 06:00:00+012000-10-28 06:00:00+01   │ true                  │
│ 2000-10-28 09:00:00+012000-10-28 09:00:00+01   │ true                  │
│ 2000-10-28 12:00:00+012000-10-28 12:00:00+01   │ true                  │
│ 2000-10-28 15:00:00+012000-10-28 15:00:00+01   │ true                  │
│ 2000-10-28 18:00:00+012000-10-28 18:00:00+01   │ true                  │
│ 2000-10-28 21:00:00+012000-10-28 21:00:00+01   │ true                  │
│ 2000-10-29 00:00:00+012000-10-29 00:00:00+01   │ true                  │
│ 2000-10-29 03:00:00+002000-10-29 04:00:00+00   │ false                 │
│ 2000-10-29 06:00:00+002000-10-29 07:00:00+00   │ false                 │
│ 2000-10-29 09:00:00+002000-10-29 10:00:00+00   │ false                 │
│ 2000-10-29 12:00:00+002000-10-29 13:00:00+00   │ false                 │
│ 2000-10-29 15:00:00+002000-10-29 16:00:00+00   │ false                 │
│ 2000-10-29 18:00:00+002000-10-29 19:00:00+00   │ false                 │
│ 2000-10-29 21:00:00+002000-10-29 22:00:00+00   │ false                 │
│ 2000-10-30 00:00:00+002000-10-30 00:00:00+00   │ true                  │
│ 2000-10-30 03:00:00+002000-10-30 03:00:00+00   │ true                  │
│ 2000-10-30 06:00:00+002000-10-30 06:00:00+00   │ true                  │
│ 2000-10-30 09:00:00+002000-10-30 09:00:00+00   │ true                  │
│ 2000-10-30 12:00:00+002000-10-30 12:00:00+00   │ true                  │
│ 2000-10-30 15:00:00+002000-10-30 15:00:00+00   │ true                  │
│ 2000-10-30 18:00:00+002000-10-30 18:00:00+00   │ true                  │
│ 2000-10-30 21:00:00+002000-10-30 21:00:00+00   │ true                  │
│ 2000-10-31 00:00:00+002000-10-31 00:00:00+00   │ true                  │
├──────────────────────────┴──────────────────────────┴───────────────────────┤
│ 25 rows                                                           3 columns │
└─────────────────────────────────────────────────────────────────────────────┘

On TimescaleDB, this query returns expected result.

postgres=# set timezone = 'Europe/London';
SET
postgres=# with t2 as (select t1.ts::timestamptz as ts, time_bucket('10 seconds'::interval, t1.ts::timestamptz, 'Europe/London') as tb from generate_series('2000-10-28 00:00:00'::timestamp, '2000-10-31 00:00:00'::timestamp, '3 hours'::interval) t1 (ts)) select ts, tb, ts = tb as always_should_be_true from t2;
           ts           |           tb           | always_should_be_true
------------------------+------------------------+-----------------------
 2000-10-28 00:00:00+01 | 2000-10-28 00:00:00+01 | t
 2000-10-28 03:00:00+01 | 2000-10-28 03:00:00+01 | t
 2000-10-28 06:00:00+01 | 2000-10-28 06:00:00+01 | t
 2000-10-28 09:00:00+01 | 2000-10-28 09:00:00+01 | t
 2000-10-28 12:00:00+01 | 2000-10-28 12:00:00+01 | t
 2000-10-28 15:00:00+01 | 2000-10-28 15:00:00+01 | t
 2000-10-28 18:00:00+01 | 2000-10-28 18:00:00+01 | t
 2000-10-28 21:00:00+01 | 2000-10-28 21:00:00+01 | t
 2000-10-29 00:00:00+01 | 2000-10-29 00:00:00+01 | t
 2000-10-29 03:00:00+00 | 2000-10-29 03:00:00+00 | t
 2000-10-29 06:00:00+00 | 2000-10-29 06:00:00+00 | t
 2000-10-29 09:00:00+00 | 2000-10-29 09:00:00+00 | t
 2000-10-29 12:00:00+00 | 2000-10-29 12:00:00+00 | t
 2000-10-29 15:00:00+00 | 2000-10-29 15:00:00+00 | t
 2000-10-29 18:00:00+00 | 2000-10-29 18:00:00+00 | t
 2000-10-29 21:00:00+00 | 2000-10-29 21:00:00+00 | t
 2000-10-30 00:00:00+00 | 2000-10-30 00:00:00+00 | t
 2000-10-30 03:00:00+00 | 2000-10-30 03:00:00+00 | t
 2000-10-30 06:00:00+00 | 2000-10-30 06:00:00+00 | t
 2000-10-30 09:00:00+00 | 2000-10-30 09:00:00+00 | t
 2000-10-30 12:00:00+00 | 2000-10-30 12:00:00+00 | t
 2000-10-30 15:00:00+00 | 2000-10-30 15:00:00+00 | t
 2000-10-30 18:00:00+00 | 2000-10-30 18:00:00+00 | t
 2000-10-30 21:00:00+00 | 2000-10-30 21:00:00+00 | t
 2000-10-31 00:00:00+00 | 2000-10-31 00:00:00+00 | t
(25 rows)

About the DST beginning day, the current implementation returns expected result.
About the DST beginning day, the current implementation returns wrong results too.

D set timezone = 'Europe/London';
D with t2 as (select t1.ts::timestamptz as ts, time_bucket('10 seconds'::interval, t1.ts::timestamptz, 'Europe/London') as tb from generate_series('2000-03-25 00:00:00'::timestamp, '2000-03-28 00:00:00'::timestamp, '3 hours'::interval) t1 (ts)) select ts, tb, ts = tb as always_should_be_true from t2;
┌──────────────────────────┬──────────────────────────┬───────────────────────┐
│            ts            │            tb            │ always_should_be_true │
│ timestamp with time zonetimestamp with time zoneboolean        │
├──────────────────────────┼──────────────────────────┼───────────────────────┤
│ 2000-03-25 00:00:00+002000-03-25 00:00:00+00   │ true                  │
│ 2000-03-25 03:00:00+002000-03-25 03:00:00+00   │ true                  │
│ 2000-03-25 06:00:00+002000-03-25 06:00:00+00   │ true                  │
│ 2000-03-25 09:00:00+002000-03-25 09:00:00+00   │ true                  │
│ 2000-03-25 12:00:00+002000-03-25 12:00:00+00   │ true                  │
│ 2000-03-25 15:00:00+002000-03-25 15:00:00+00   │ true                  │
│ 2000-03-25 18:00:00+002000-03-25 18:00:00+00   │ true                  │
│ 2000-03-25 21:00:00+002000-03-25 21:00:00+00   │ true                  │
│ 2000-03-26 00:00:00+002000-03-26 00:00:00+00   │ true                  │
│ 2000-03-26 03:00:00+012000-03-26 02:00:00+01   │ false                 │
│ 2000-03-26 06:00:00+012000-03-26 05:00:00+01   │ false                 │
│ 2000-03-26 09:00:00+012000-03-26 08:00:00+01   │ false                 │
│ 2000-03-26 12:00:00+012000-03-26 11:00:00+01   │ false                 │
│ 2000-03-26 15:00:00+012000-03-26 14:00:00+01   │ false                 │
│ 2000-03-26 18:00:00+012000-03-26 17:00:00+01   │ false                 │
│ 2000-03-26 21:00:00+012000-03-26 20:00:00+01   │ false                 │
│ 2000-03-27 00:00:00+012000-03-27 00:00:00+01   │ true                  │
│ 2000-03-27 03:00:00+012000-03-27 03:00:00+01   │ true                  │
│ 2000-03-27 06:00:00+012000-03-27 06:00:00+01   │ true                  │
│ 2000-03-27 09:00:00+012000-03-27 09:00:00+01   │ true                  │
│ 2000-03-27 12:00:00+012000-03-27 12:00:00+01   │ true                  │
│ 2000-03-27 15:00:00+012000-03-27 15:00:00+01   │ true                  │
│ 2000-03-27 18:00:00+012000-03-27 18:00:00+01   │ true                  │
│ 2000-03-27 21:00:00+012000-03-27 21:00:00+01   │ true                  │
│ 2000-03-28 00:00:00+012000-03-28 00:00:00+01   │ true                  │
├──────────────────────────┴──────────────────────────┴───────────────────────┤
│ 25 rows                                                           3 columns │
└─────────────────────────────────────────────────────────────────────────────┘

@hawkfish
Copy link
Contributor

hawkfish commented Jan 3, 2023

time_bucket result of timestamp on the day DST ended gets greater than the original timestamp.

That's odd - I'm going to pull your code and step into it.

@hawkfish
Copy link
Contributor

hawkfish commented Jan 3, 2023

It looks like the problem is that there is a difference between the number of hours between the TB epoch and the non-DST times on the 29th (7203 == 300d 3h) and the interval that DATE_SUB computes using ICU's date subtraction logic (300h 4h).

@hawkfish
Copy link
Contributor

hawkfish commented Jan 3, 2023

There is a bug in the ICU TSTZ difference code. It is moving the start date forward instead of moving the end date backwards. So when it moves the epoch date forward by 300 days, it gains an hour crossing the spring DST boundary and never gives it back, resulting in an extra hour of difference at the next step. This is not symmetrical with the interval addition code. I'll have a closer look tomorrow and hopefully have a fix.

@papparapa
Copy link
Contributor Author

papparapa commented Jan 4, 2023

Thanks for your investigation!
I would like to ask for my future study, did you use lldb or gdb to step into the code?

@hawkfish
Copy link
Contributor

hawkfish commented Jan 4, 2023

Thanks for your investigation!
I would like to ask for my future study, did you use lldb or gdb to step into the code?

Xcode, actually (which wraps lldb)

@papparapa
Copy link
Contributor Author

Xcode, actually (which wraps lldb)

Thanks!

extension/icu/include/icu-datefunc.hpp Outdated Show resolved Hide resolved
extension/icu/icu-timebucket.cpp Show resolved Hide resolved
@hawkfish
Copy link
Contributor

hawkfish commented Jan 4, 2023

With the bug fix your examples now work, except the last one above, which is wrong (all the dates should be the same).

Copy link
Contributor

@hawkfish hawkfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the non-ICU changes are ready to go. Can we split them out into a separate PR?

Copy link
Contributor

@hawkfish hawkfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are down to the µs computation and the DST test conversation.

Copy link
Contributor

@hawkfish hawkfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just resolve the conversations.

@papparapa
Copy link
Contributor Author

Thanks @hawkfish for the long-term review!

@Mytherin Mytherin merged commit c567e87 into duckdb:master Jan 21, 2023
@Mytherin
Copy link
Collaborator

Thanks!

@Swoorup
Copy link

Swoorup commented Jan 21, 2023

Woah 😍

@papparapa papparapa deleted the add-time-bucket-function branch January 22, 2023 04:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement TIME_BUCKET Function
4 participants