Fix timezoned timestamp arithmetic #4546

Merged on Jul 26, 2023 (15 commits)

alexandreyc
Contributor

Which issue does this PR close?

Closes #4457.

Rationale for this change

Interval arithmetic on timezoned timestamps ignored the timezone and therefore produced wrong results.

Are there any user-facing changes?

Yes.

Methods add_year_months, add_day_time, add_month_day_nano, subtract_year_months, subtract_day_time and subtract_month_day_nano on TimestampSecondType, TimestampMillisecondType, TimestampMicrosecondType and TimestampNanosecondType now take an additional parameter tz: Tz, as these operations are inherently timezone-dependent. There may be a way to do this without breaking the API; feel free to suggest ideas.
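
To illustrate why these operations depend on the timezone, here is a minimal, standard-library-only sketch. It does not use arrow or chrono: the helper names and the hard-coded Europe/Paris offsets around the March 2023 DST change are assumptions made purely for illustration. It adds one day to 2023-03-25 14:00:00+01, once naively in UTC and once on the local wall clock.

```rust
// Illustrative sketch only: a hard-coded model of Europe/Paris around the
// 2023 spring DST change. All names here are hypothetical, not arrow APIs.

const SECS_PER_DAY: i64 = 86_400;
/// 2023-03-26 01:00:00 UTC, when Paris switches from UTC+1 to UTC+2.
const DST_START_UTC: i64 = 1_679_792_400;

/// UTC offset in seconds that applies at the given UTC instant.
fn offset_at_utc(utc: i64) -> i64 {
    if utc < DST_START_UTC { 3_600 } else { 7_200 }
}

/// Map a local wall-clock value back to UTC (the DST gap itself is ignored).
fn local_to_utc(local: i64) -> i64 {
    let candidate = local - 7_200;
    if offset_at_utc(candidate) == 7_200 { candidate } else { local - 3_600 }
}

/// Timezone-unaware: a day is always 86,400 seconds.
fn add_days_naive(utc: i64, days: i64) -> i64 {
    utc + days * SECS_PER_DAY
}

/// Timezone-aware: shift the local wall clock, then convert back to UTC.
fn add_days_in_zone(utc: i64, days: i64) -> i64 {
    let local = utc + offset_at_utc(utc);
    local_to_utc(local + days * SECS_PER_DAY)
}

fn main() {
    let start = 1_679_749_200; // 2023-03-25 14:00:00+01 (13:00:00 UTC)
    let naive = add_days_naive(start, 1); // 2023-03-26 15:00:00+02
    let aware = add_days_in_zone(start, 1); // 2023-03-26 14:00:00+02
    println!("naive = {naive}, aware = {aware}, diff = {}s", naive - aware);
}
```

The timezone-aware result corresponds to the 2023-03-26 14:00:00+02 row that PostgreSQL returns for datetime + interval_day_time in the transcript below; the naive version is off by one hour.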

Tests

I tested the results against PostgreSQL 14. Feel free to propose new test cases that you find interesting.

Here is the script to reproduce the results:

postgres=# create temporary table tests as (

    select
        (timestamp '1970-01-28 23:00:00' at time zone 'Europe/Paris') as datetime,
        (interval '0 year 1 month') as interval_year_month,
        (interval '0 day 0 millisecond') as interval_day_time,
        (interval '1 month 0 day 0 microsecond') as interval_month_day_nano

    union all

    select
        (timestamp '1970-01-01 00:00:00' at time zone 'Europe/Paris') as datetime,
        (interval '5 year 34 month') as interval_year_month,
        (interval '5 day 454000 millisecond') as interval_day_time,
        (interval '344 month 34 day -43000000 microsecond') as interval_month_day_nano

    union all

    select
        (timestamp '2010-04-01 04:00:20' at time zone 'Europe/Paris') as datetime,
        (interval '-2 year 4 month') as interval_year_month,
        (interval '-34 day 0 millisecond') as interval_day_time,
        (interval '-593 month -33 day 13000000 microsecond') as interval_month_day_nano

    union all

    select
        (timestamp '1960-01-30 04:23:20' at time zone 'Europe/Paris') as datetime,
        (interval '7 year -4 month') as interval_year_month,
        (interval '7 day -4000 millisecond') as interval_day_time,
        (interval '5 month 2 day 493000000 microsecond') as interval_month_day_nano

    union all

    select
        (timestamp '2023-03-25 14:00:00' at time zone 'Europe/Paris') as datetime,
        (interval '0 year 1 month') as interval_year_month,
        (interval '1 day 0 millisecond') as interval_day_time,
        (interval '1 month 0 day 0 microsecond') as interval_month_day_nano

);
SELECT 5
postgres=# select * from tests;
        datetime        | interval_year_month | interval_day_time |       interval_month_day_nano
------------------------+---------------------+-------------------+--------------------------------------
 1970-01-28 23:00:00+01 | 1 mon               | 00:00:00          | 1 mon
 1970-01-01 00:00:00+01 | 7 years 10 mons     | 5 days 00:07:34   | 28 years 8 mons 34 days -00:00:43
 2010-04-01 04:00:20+02 | -1 years -8 mons    | -34 days          | -49 years -5 mons -33 days +00:00:13
 1960-01-30 04:23:20+01 | 6 years 8 mons      | 7 days -00:00:04  | 5 mons 2 days 00:08:13
 2023-03-25 14:00:00+01 | 1 mon               | 1 day             | 1 mon
(5 rows)

postgres=# select datetime + interval_year_month from tests;
        ?column?
------------------------
 1970-02-28 23:00:00+01
 1977-11-01 00:00:00+01
 2008-08-01 04:00:20+02
 1966-09-30 04:23:20+01
 2023-04-25 14:00:00+02
(5 rows)

postgres=# select (datetime + interval_year_month) - interval_year_month from tests;
        ?column?
------------------------
 1970-01-28 23:00:00+01
 1970-01-01 00:00:00+01
 2010-04-01 04:00:20+02
 1960-01-30 04:23:20+01
 2023-03-25 14:00:00+01
(5 rows)

postgres=# select datetime + interval_day_time from tests;
        ?column?
------------------------
 1970-01-28 23:00:00+01
 1970-01-06 00:07:34+01
 2010-02-26 04:00:20+01
 1960-02-06 04:23:16+01
 2023-03-26 14:00:00+02
(5 rows)

postgres=# select (datetime + interval_day_time) - interval_day_time from tests;
        ?column?
------------------------
 1970-01-28 23:00:00+01
 1970-01-01 00:00:00+01
 2010-04-01 04:00:20+02
 1960-01-30 04:23:20+01
 2023-03-25 14:00:00+01
(5 rows)

postgres=# select datetime + interval_month_day_nano from tests;
        ?column?
------------------------
 1970-02-28 23:00:00+01
 1998-10-04 23:59:17+02
 1960-09-29 04:00:33+01
 1960-07-02 04:31:33+01
 2023-04-25 14:00:00+02
(5 rows)

postgres=# select (datetime + interval_month_day_nano) - interval_month_day_nano from tests;
        ?column?
------------------------
 1970-01-28 23:00:00+01
 1970-01-02 00:00:00+01
 2010-04-02 04:00:20+02
 1960-01-31 04:23:20+01
 2023-03-25 14:00:00+01
(5 rows)

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 20, 2023
Contributor

@tustvold left a comment

This looks really nice. I left some initial comments and will do a more thorough review as soon as I find time.

@@ -46,3 +46,4 @@ num = { version = "0.4", default-features = false, features = ["std"] }

[features]
simd = ["arrow-array/simd"]
chrono-tz = ["arrow-array/chrono-tz"]
Contributor

Typically the way we have handled this is to put these integration-style tests in the top-level arrow crate, as opposed to introducing feature flags on child crates.

Contributor Author

I will make this change once we've settled on the rest.

.ok_or_else(|| ArrowError::ComputeError("Timestamp out of range".to_string()))?;
let res = res.naive_utc();
T::make_value(res)
.ok_or_else(|| ArrowError::ComputeError("Timestamp out of range".to_string()))
Contributor

Given the error is always the same, perhaps this method could just return an Option, and this error mapping be handled in the caller?

Contributor Author

After the last commits, the error can now be of two kinds: timestamp or interval out of range.

I'm not sure if it's really important to be able to distinguish between those errors... If not we can definitely return an Option here (though we change the API even more).

Contributor

I don't think distinguishing them is important

Contributor Author

Done.
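
The pattern agreed on in this thread can be sketched roughly as follows. This is not the arrow-rs code: the error type, function names and the seconds-only arithmetic are hypothetical stand-ins, chosen so the example compiles with the standard library alone. The low-level helper returns an Option, and the caller is the single place that turns None into an error.

```rust
// Hypothetical names throughout; this is not the arrow-rs implementation.

#[derive(Debug, PartialEq)]
enum ComputeError {
    OutOfRange(String),
}

/// Low-level helper: `None` signals overflow, whatever its cause.
fn add_seconds(timestamp: i64, delta: i64) -> Option<i64> {
    timestamp.checked_add(delta)
}

/// Caller: the only place where `None` is mapped to a descriptive error.
fn add_seconds_checked(timestamp: i64, delta: i64) -> Result<i64, ComputeError> {
    add_seconds(timestamp, delta)
        .ok_or_else(|| ComputeError::OutOfRange("Timestamp out of range".to_string()))
}

fn main() {
    println!("{:?}", add_seconds_checked(10, 5));
    println!("{:?}", add_seconds_checked(i64::MAX, 1));
}
```

The trade-off is the one discussed above: the helper can no longer say whether the timestamp or the interval was out of range, which was judged unimportant here.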

})?;
TimestampSecondType::make_value(res)
.ok_or_else(|| ArrowError::ComputeError("Timestamp out of range".to_string()))
let delta = IntervalDayTimeType::make_value(-days, -ms);
Contributor

I'm not sure if it matters, but potentially this negation could overflow. In practice I think this will always result in timestamp overflow anyway, so perhaps this doesn't matter

Contributor Author

The only way to overflow an x: i32 with negation is when x = i32::MIN = -2_147_483_648.

It should not be a problem for years, months or days because they would overflow the timestamp anyway. However it can be problematic for milliseconds and nanoseconds.

We can use checked negation for those operations and return an "Interval out of range" error. The only downside I see is performance, since this check must be done for every array element.

What's your opinion?
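
As a rough illustration of the checked negation proposed above (the function name is an assumption; only i32::checked_neg is a real standard-library call):

```rust
/// Negate an interval component, reporting overflow instead of wrapping.
/// The function name and error text are illustrative assumptions.
fn negate_component(value: i32) -> Result<i32, String> {
    value
        .checked_neg()
        .ok_or_else(|| "Interval out of range".to_string())
}

fn main() {
    println!("{:?}", negate_component(4_000));
    // i32::MIN has no positive counterpart in i32, so this is an error.
    println!("{:?}", negate_component(i32::MIN));
}
```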

Contributor

Perhaps we could use subtraction instead of negation followed by addition?

Contributor Author

I think that ultimately we cannot avoid a checked negation because chrono's Months::new accepts only a u32 and arrow's months are i32. Same thing for days.

I've added checked negation everywhere it's needed.

I may have missed a simple solution without checked ops... Feel free to suggest any idea!

Contributor

You should be able to use a combination of unsigned_abs and using the sign to select between addition and subtraction to avoid overflow?

Contributor Author

Right, I didn't know about unsigned_abs... Should be good now!
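
A minimal sketch of that idea, assuming a plain i64 value rather than a real timestamp type (the helper shift is hypothetical and is not a function from arrow or chrono): the magnitude comes from i32::unsigned_abs, which cannot overflow, and the sign only selects between addition and subtraction.

```rust
/// Shift `value` by a signed `delta` without ever negating `delta`.
/// `shift` is a hypothetical helper, not a function from arrow or chrono.
fn shift(value: i64, delta: i32) -> Option<i64> {
    // unsigned_abs is defined for every i32, including i32::MIN,
    // which maps to 2_147_483_648_u32.
    let magnitude = i64::from(delta.unsigned_abs());
    if delta >= 0 {
        value.checked_add(magnitude)
    } else {
        value.checked_sub(magnitude)
    }
}

fn main() {
    println!("{:?}", shift(100, -30));
    println!("{:?}", shift(0, i32::MIN));
    println!("{:?}", shift(i64::MAX, 1));
}
```

The same shape would apply when the magnitude has to be handed to an API that only accepts unsigned values, such as the u32 taken by chrono's Months::new mentioned earlier in this thread.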

@tustvold tustvold added the api-change Changes to the arrow API label Jul 20, 2023
Contributor Author

alexandreyc commented Jul 25, 2023

I just pushed a commit that makes the code simpler and shorter (IMHO).

In summary, I moved all arithmetic methods to the TimestampOp trait and made this trait public. This introduces API changes because users now need to have the trait in scope in order to call these methods. I also slightly renamed some methods.

What's your opinion on this change?
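
For readers unfamiliar with why moving methods onto a trait is an API change, here is a small stand-alone sketch. The trait below only borrows the TimestampOp name; its single method and the i64 implementation are hypothetical and unrelated to the real arrow definitions.

```rust
mod arith {
    /// Stand-in for the trait discussed above; the method is hypothetical.
    pub trait TimestampOp: Sized {
        fn add_seconds(self, delta: i64) -> Option<Self>;
    }

    impl TimestampOp for i64 {
        fn add_seconds(self, delta: i64) -> Option<Self> {
            self.checked_add(delta)
        }
    }
}

// Removing this import makes the method call below fail to compile,
// because trait methods are only callable while the trait is in scope.
use arith::TimestampOp;

fn shifted(timestamp: i64) -> Option<i64> {
    timestamp.add_seconds(60)
}

fn main() {
    println!("{:?}", shifted(0));
}
```

Inherent methods, by contrast, need no import at the call site, which is why existing downstream code would have to be touched.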

Self::add_year_months(left, right)
}
/// Arithmetic trait for timestamp arrays
pub trait TimestampOp: ArrowTimestampType + Sized {
Contributor

@tustvold Jul 25, 2023

I'm not a fan of making this trait public, as it is intended as an internal implementation detail of how the kernel chooses to implement arithmetic. I could definitely see it evolving in future to support vectorisation or some other extensions.

I think I would prefer to just expose the dyn kernels

@@ -350,650 +350,6 @@ impl ArrowTimestampType for TimestampNanosecondType {
}
}

impl TimestampSecondType {
Contributor

@tustvold Jul 25, 2023

I think removing these methods may cause non-trivial downstream friction. Perhaps we could deprecate them as a first step instead of removing them?

Contributor Author

I reverted the last commit to undo this change because I'm not sure where to put these methods if we deprecate them and don't make TimestampOp public... Maybe we can create another trait? I'm not sure it's worth it...

@tustvold tustvold merged commit 0b75e8f into apache:master Jul 26, 2023
25 checks passed
@tustvold
Contributor

Thank you, this looks good to me. I will file a follow-on PR to remove the chrono-tz feature.

Labels
api-change: Changes to the arrow API
arrow: Changes to the arrow crate
Development

Successfully merging this pull request may close these issues.

Timestamp Interval Arithmetic Ignores Timezone
2 participants