Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Python][R] PrettyPrint ignores timezone #30117

Closed
asfimport opened this issue Nov 3, 2021 · 14 comments · Fixed by #39272
Closed

[C++][Python][R] PrettyPrint ignores timezone #30117

asfimport opened this issue Nov 3, 2021 · 14 comments · Fixed by #39272

Comments

@asfimport
Copy link
Collaborator

asfimport commented Nov 3, 2021

When printing TimestampArray in pyarrow the timezone information is ignored by PrettyPrint (str  calls to_string() in array.pxi).

import pyarrow as pa

a = pa.array([0], pa.timestamp('s', tz='+02:00'))

print(a) # representation not correct?
# <pyarrow.lib.TimestampArray object at 0x7f834c7cb9a8>
# [
#  1970-01-01 00:00:00
# ]

Reporter: Alenka Frim / @AlenkaF
Watchers: Rok Mihevc / @rok

Related issues:

Note: This issue was originally created as ARROW-14567. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
The PrettyPrint for Timestamp type is implemented in StringFormatter<TimestampType> at formatting.h (

class StringFormatter<TimestampType> {
public:
using value_type = int64_t;
explicit StringFormatter(const std::shared_ptr<DataType>& type)
: unit_(checked_cast<const TimestampType&>(*type).unit()) {}
template <typename Duration, typename Appender>
Return<Appender> operator()(Duration, value_type value, Appender&& append) {
using arrow_vendored::date::days;
const Duration since_epoch{value};
if (!ARROW_PREDICT_TRUE(detail::IsDateTimeInRange(since_epoch))) {
return detail::FormatOutOfRange(value, append);
}
const auto timepoint = detail::kEpoch + since_epoch;
// Round days towards zero
// (the naive approach of using arrow_vendored::date::floor() would
// result in UB for very large negative timestamps, similarly as
// https://github.com/HowardHinnant/date/issues/696)
auto timepoint_days = std::chrono::time_point_cast<days>(timepoint);
Duration since_midnight;
if (timepoint_days <= timepoint) {
// Year >= 1970
since_midnight = timepoint - timepoint_days;
} else {
// Year < 1970
since_midnight = days(1) - (timepoint_days - timepoint);
timepoint_days -= days(1);
}
constexpr size_t buffer_size =
detail::BufferSizeYYYY_MM_DD() + 1 + detail::BufferSizeHH_MM_SS<Duration>();
std::array<char, buffer_size> buffer;
char* cursor = buffer.data() + buffer_size;
detail::FormatHH_MM_SS(arrow_vendored::date::make_time(since_midnight), &cursor);
detail::FormatOneChar(' ', &cursor);
detail::FormatYYYY_MM_DD(timepoint_days, &cursor);
return append(detail::ViewDigitBuffer(buffer, cursor));
}
).

Currently this simply does not take into account any timezone information of the type, and formats the stored UTC epoch. Personally, I find this confusing as the current repr for the array has no indication whatsoever how to interpret the printed time values (and without any indication, my first expectation would be local time in the timezone of the type, which isn't the case).

The formatting code already uses the date.h utilities (and which we have vendored anyway), so in principle we could use date.h to first localize the epoch value. However, that makes printing dependent on finding a timezone database (eg not yet supported on Windows at the moment).

An alternative could be to keep the printed value in UTC but add a +00:00 indication to make it it clear this are the UTC times that are printed (and not the wall clock time in the timezone of the type).

cc @rok @pitrou @lidavidm

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
I think it should be a Z at the end of the string?

@asfimport
Copy link
Collaborator Author

David Li / @lidavidm:
Agreed with Antoine. It may confuse end users slightly but represents what the actual stored value is.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
Indeed, Z is more correct than +00:00 (which could still indicate any timezone that just happens to have no offset to UTC at that point in time)

Personally I would still prefer actual localized times, but adding the UTC indication would already help a lot.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
If we want to allow outputting localized timestamps, it should probably be an option, to avoid breaking existing uses.

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
Z and an option to localize sounds good.

What would the final (once timezone database is always present) default be? Z or localized?

@asfimport
Copy link
Collaborator Author

Jonathan Keane / @jonkeane:
Agreed that at the very least falling back to a timestamp with Z is good (e.g. when we can't find the timezone database). But IMO the least confusing to users would be to default to localized.

Separately from the default, we should also indicate in some way that though that the type does include a (non-UTC) timezone. If I only saw a string of timestamps like below (+ Z), I would assume the array was set to UTC (and try casting it to a different TZ, and be confused/annoyed that it 'didn't seem to work')

arr <- Array$create(ts)
arr
#> Array
#> <timestamp[us]>                     <------ maybe add timezone information here?
#> [
#>   2020-01-01 09:00:00.000000,
#>   2020-01-01 10:00:00.000000,
#>   2020-01-01 11:00:00.000000,
#>   2020-01-01 12:00:00.000000,
#>   2020-01-01 13:00:00.000000,
#>   2020-01-01 14:00:00.000000,
#>   2020-01-01 15:00:00.000000,
#>   2020-01-01 16:00:00.000000,
#>   2020-01-01 17:00:00.000000,
#>   2020-01-01 18:00:00.000000
#> ]

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Yes, I think adding the timezone information in the type output is the primary fix here. It would be nice to fix the values to add a "Z" as well (only if the timezone is set on the type).

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
I agree with @jonkeane that IMO the least confusing display would be to use localized strings (with timezone offset indication, so like "1970-01-01 02:00:00+02:00"). Adding "Z" is certainly better than the current situation, but it still doesn't give a quick idea about the local time that the timestamp actually represents.

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
Agreed! Offset ("1970-01-01 02:00:00+02:00") per value plus timezone string (<timestamp[us, "Pacific/Marquesas"]>) in the header would be great.

@asfimport
Copy link
Collaborator Author

@AlenkaF
Copy link
Member

AlenkaF commented Dec 18, 2023

I created a PR for the simple version of the fix which is adding a "Z" to the end of the string in case timezone is defined.

>>> import pyarrow as pa

>>> # tz defined
>>> pa.array([0], pa.timestamp('s', tz='+02:00'))
<pyarrow.lib.TimestampArray object at 0x131399fc0>
[
  1970-01-01 00:00:00Z
]

>>> # tz not defined
>>> pa.array([0], pa.timestamp('s'))
<pyarrow.lib.TimestampArray object at 0x13139a020>
[
  1970-01-01 00:00:00
]

I haven't added any change in the type output as the timezone information is already included:

>>> a = pa.array([0], pa.timestamp('s', tz='+02:00'))
>>> a.type
TimestampType(timestamp[s, tz=+02:00])

or do we want info about the timezone in array object print?

@rok
Copy link
Member

rok commented Dec 18, 2023

Thanks for doing this @AlenkaF !
Adding Z is definitely an improvement and removes current ambiguity.

Did you consider adding an option to print in local time as per ISO8601 convention displaying local time + offset from UTC?
This could be done with logic from local_timestamp, what's missing is logic to add timezone offset information. This is currently won't work for cases where timezone is given as a fixed offset (see #12865) but those cases could be handled as exceptions. Given that we could also make a separate issue.

@AlenkaF
Copy link
Member

AlenkaF commented Dec 18, 2023

Printing in local time would definitely be good to have. What I was thinking is to have the simple fix asap and then do a follow-up with localized strings when possible.

jorisvandenbossche pushed a commit that referenced this issue Jan 8, 2024
…when tz defined (#39272)

### What changes are included in this PR?

This PR updates the PrettyPrint for Timestamp type so that "Z" is printed at the end of the output string if the timezone has been defined. This way we add minimum information about the values being stored in UTC.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

There is a change in how `TimestampArray` prints out the data. With this change "Z" would be added to the end of the string if the timezone is defined.
* Closes: #30117

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
…tring when tz defined (apache#39272)

### What changes are included in this PR?

This PR updates the PrettyPrint for Timestamp type so that "Z" is printed at the end of the output string if the timezone has been defined. This way we add minimum information about the values being stored in UTC.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

There is a change in how `TimestampArray` prints out the data. With this change "Z" would be added to the end of the string if the timezone is defined.
* Closes: apache#30117

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…tring when tz defined (apache#39272)

### What changes are included in this PR?

This PR updates the PrettyPrint for Timestamp type so that "Z" is printed at the end of the output string if the timezone has been defined. This way we add minimum information about the values being stored in UTC.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

There is a change in how `TimestampArray` prints out the data. With this change "Z" would be added to the end of the string if the timezone is defined.
* Closes: apache#30117

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…tring when tz defined (apache#39272)

### What changes are included in this PR?

This PR updates the PrettyPrint for Timestamp type so that "Z" is printed at the end of the output string if the timezone has been defined. This way we add minimum information about the values being stored in UTC.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

There is a change in how `TimestampArray` prints out the data. With this change "Z" would be added to the end of the string if the timezone is defined.
* Closes: apache#30117

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment