-
Notifications
You must be signed in to change notification settings - Fork 786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set adjusted to UTC if UTC timezone (#1932) #1937
Conversation
@@ -300,10 +300,15 @@ fn arrow_to_parquet_type(field: &Field) -> Result<Type> { | |||
.with_repetition(repetition) | |||
.build() | |||
} | |||
DataType::Timestamp(time_unit, _) => { | |||
DataType::Timestamp(time_unit, tz) => { | |||
let is_utc = tz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially played around with using chrono-tz, but this is a non-trivial additional dependency, and does not appear to be being actively maintained. I think this should be good enough, until such a time as we potentially add coerce types functionality as part of #1666
DataType::Timestamp(time_unit, tz) => { | ||
let is_utc = tz | ||
.as_ref() | ||
.map(|tz| tz == "UTC" || tz == "+00:00" || tz == "-00:00") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically GMT is the timezone - https://www.timeanddate.com/time/gmt-utc-time.html
Should GMT
be added in the checks too ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would rather keep this small and simple, rather than cover all timezones that happen to be equivalent. Longer term this may be handled as part of #1938
While applying this fix in arrow2, I broke some of our integration tests against pyarrow. Coming to the specs, when the tz string is set in Arrow, it means that
In this PR's notation, the values are by definition normalized to be UTC, it is the user of the logical type that needs to de-normalize them if they need to apply offsets (e.g. to represent). AFAIK this matches with what "is adjusted to UTC" in parquet means:
In other words, I think that the previous behavior was correct (but I may be wrong) |
I think you might be right, will re-read tomorrow and potentially file a PR. Thank you |
Which issue does this PR close?
Closes #1932
Rationale for this change
Prior to https://github.com/apache/arrow-rs/pull/1682/files#diff-103f6c92559ee00d93eff806b411e0c8c8d249564fb3f8674da1da7693f86cb1L393 the writer would set
is_adjusted_to_u_t_c
to true if any timezone was present in the arrow datatype. This was incorrect as normalization to UTC would never actually occur.The fix was to always set
is_adjusted_to_u_t_c
to be false, which whilst correct, is overly conservative and represents a regression for the case where the timezone actually is UTC.What changes are included in this PR?
If the source array is already normalized to UTC, set
is_adjusted_to_u_t_c
to be trueAre there any user-facing changes?
Yes, not sure if counts as breaking though