-
Notifications
You must be signed in to change notification settings - Fork 662
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix timezoned timestamp arithmetic #4546
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really nice, left some initial comments, will do a more thorough review as soon as I find time
@@ -46,3 +46,4 @@ num = { version = "0.4", default-features = false, features = ["std"] } | |||
|
|||
[features] | |||
simd = ["arrow-array/simd"] | |||
chrono-tz = ["arrow-array/chrono-tz"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically the way we have handled this is to put these integration style tests in the top-level arrow, as opposed to introducing feature flags on child crates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will do this change once we've settled on the rest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arrow-array/src/types.rs
Outdated
.ok_or_else(|| ArrowError::ComputeError("Timestamp out of range".to_string()))?; | ||
let res = res.naive_utc(); | ||
T::make_value(res) | ||
.ok_or_else(|| ArrowError::ComputeError("Timestamp out of range".to_string())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the error is always the same, perhaps this method could just return an Option, and this error mapping be handled in the caller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the last commits, the error can now be of two kinds: timestamp or interval out of range.
I'm not sure if it's really important to be able to distinguish between those errors... If not we can definitely return an Option here (though we change the API even more).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think distinguishing them is important
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
arrow-array/src/types.rs
Outdated
})?; | ||
TimestampSecondType::make_value(res) | ||
.ok_or_else(|| ArrowError::ComputeError("Timestamp out of range".to_string())) | ||
let delta = IntervalDayTimeType::make_value(-days, -ms); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it matters, but potentially this negation could overflow. In practice I think this will always result in timestamp overflow anyway, so perhaps this doesn't matter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only way to overflow an x: i32
with negation is when x = i32::MIN = -2_147_483_648
.
It should not be a problem for years, months or days because they would overflow the timestamp anyway. However it can be problematic for milliseconds and nanoseconds.
We can use checked negation for those operations and return an "Interval out of range" error. The only downside I see is performance wise since this check must be done for every array element.
What's your opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could use subtraction instead of negation followed by addition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that ultimately we cannot avoid a checked negation because chrono's Months::new
accepts only an u32
and arrow's months are i32
. Same thing for days.
I've added checked negation everywhere it's needed.
I may have missed a simple solution without checked ops... Feel free to suggest any idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to use a combination of abs_unsigned and using the sign to select between addition and subtraction to avoid overflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I didn't know about unsigned_abs
... Should be good now!
I just pushed a commit that makes the code simpler and shorter (IMHO). In summary, I moved all arithmetic methods to the What's your opinion on this change? |
arrow-arith/src/numeric.rs
Outdated
Self::add_year_months(left, right) | ||
} | ||
/// Arithmetic trait for timestamp arrays | ||
pub trait TimestampOp: ArrowTimestampType + Sized { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of making this trait public, as it is intended as an internal implementation detail of how the kernel chooses to implement arithmetic. I could definitely see it evolving in future to support vectorisation or some other extensions.
I think I would prefer to just expose the dyn kernels
@@ -350,650 +350,6 @@ impl ArrowTimestampType for TimestampNanosecondType { | |||
} | |||
} | |||
|
|||
impl TimestampSecondType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think removing these methods may cause non-trivial downstream friction. Perhaps we could deprecate them as a first step instead of removing them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted the last commit to undo this change because I'm not sure where to put these methods if we deprecate them and don't make TimestampOp
public... Maybe we can create another trait? I'm not sure it's worth it...
This reverts commit 8ec4be4.
Thank you this looks good to me, I will file a follow on PR to remove the chrono-tz feature |
Which issue does this PR close?
Closes #4457.
Rationale for this change
Arithmetic on timezoned timestamps was wrong.
Are there any user-facing changes?
Yes.
Methods
add_year_months
,add_day_time
,add_month_day_nano
,subtract_year_months
,subtract_day_time
andsubtract_month_day_nano
onTimestampSecondType
,TimestampMillisecondType
,TimestampMicrosecondType
andTimestampNanosecondType
now take an additonal parametertz: Tz
as these operations are inherently timezone-dependent. Maybe there is a way to do things differently and not break the API. Feel free to suggest any idea.Tests
I tested the results against PostgreSQL 14. Feel free to propose new test cases that you find interesting.
Here is the script to reproduce the results: