[Python] Table / RecordBatch repr displays the wrong timezone for non-UTC timestamps #38629

Closed
nph opened this issue Nov 7, 2023 · 4 comments

Contributor

nph commented Nov 7, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Printing a Table or a RecordBatch containing timezone-aware timestamps displays the time values in UTC but shows the original (possibly non-UTC) timezone in the schema header.

import datetime as dt

import pyarrow as pa
from pytz import timezone


# Create a table from a non-UTC timestamp
pacific_tz = timezone('US/Pacific')
mapping = [{'datetime': pacific_tz.localize(dt.datetime(2023, 11, 7, 22, 0, 0))}]
print(mapping)
# [{'datetime': datetime.datetime(2023, 11, 7, 22, 0, tzinfo=<DstTzInfo 'US/Pacific' PST-1 day, 16:00:00 STD>)}]
table = pa.Table.from_pylist(mapping)

# Table repr displays the timestamp values as UTC but shows the original timezone in the schema header
print(table)
# pyarrow.Table
# datetime: timestamp[us, tz=US/Pacific]   <-- incorrect timezone
# ----
# datetime: [[2023-11-08 06:00:00.000000]]   <-- timestamps in UTC

# Confirm the underlying table data is correct
print(table.to_pylist())
# [{'datetime': datetime.datetime(2023, 11, 7, 22, 0, tzinfo=<DstTzInfo 'US/Pacific' PST-1 day, 16:00:00 STD>)}]

# Create a table from a UTC timestamp
mapping = [{'datetime': dt.datetime(2023, 11, 7, 22, 0, 0, tzinfo=dt.timezone.utc)}]
table = pa.Table.from_pylist(mapping)

# Printing the table shows the correct timezone in the schema header
print(table)
# pyarrow.Table
# datetime: timestamp[us, tz=UTC]
# ----
# datetime: [[2023-11-07 22:00:00.000000]]

See also this related DuckDB issue

Pyarrow Version:
13.0.0

Platform:
macOS 12.7

Component(s)

Python

Member

AlenkaF commented Dec 13, 2023

Thank you for reporting the issue @nph!

Yes, this is a known issue with timezone-aware timestamps. The problem is the dependency on finding a timezone database, which, for example, is not yet supported on Windows.

I will be adding an alternative solution that keeps the printed value in UTC but adds an indication at the end of the string to make it clearer that UTC times are printed.

See #30117.

I will close this issue as a duplicate; please follow the linked one for the future fix.
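
For readers following along, the alternative described above (keep printing UTC values, but append an indication that they are UTC) could be sketched roughly as follows. This is a stdlib-only illustration, not pyarrow's implementation: the helper name `format_utc_with_marker` and the choice of a trailing `Z` as the marker are assumptions made for this example.

```python
import datetime as dt


def format_utc_with_marker(value: dt.datetime, marker: str = "Z") -> str:
    """Render a tz-aware datetime as UTC wall time plus an explicit UTC marker.

    Hypothetical helper: the "Z" marker is an assumption for illustration.
    """
    if value.tzinfo is None or value.utcoffset() is None:
        raise ValueError("expected a tz-aware datetime")
    utc_value = value.astimezone(dt.timezone.utc)
    return utc_value.strftime("%Y-%m-%d %H:%M:%S.%f") + marker


# Fixed UTC-8 offset standing in for US/Pacific on this date,
# so the sketch does not need a timezone database.
pst = dt.timezone(dt.timedelta(hours=-8))
print(format_utc_with_marker(dt.datetime(2023, 11, 7, 22, 0, tzinfo=pst)))
# 2023-11-08 06:00:00.000000Z
```

With a marker like this, the printed value no longer looks like a local time in the timezone named in the schema header.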

Member

AlenkaF commented Dec 13, 2023

Duplicate of #30117

AlenkaF marked this as a duplicate of #30117 on Dec 13, 2023
AlenkaF closed this as not planned on Dec 13, 2023
Member

jorisvandenbossche commented Dec 13, 2023

(By coincidence, I was replying at exactly the same moment ;), so I am keeping a small part of my answer)

# datetime: timestamp[us, tz=US/Pacific] <-- incorrect timezone

Just one additional note on this: it doesn't show the "incorrect" timezone in the schema. Arrow does support keeping track of a timezone parameter on the timestamp type, and therefore when creating the table from tz-aware values, we preserve that information in the schema.
Regardless of which timezone you have, tz-aware Arrow data is always stored as UTC values. Of course, showing the original timezone in combination with a data repr that shows those UTC values is rather confusing.
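
To make the storage model described above concrete, here is a minimal stdlib-only sketch of the idea: the value is kept as microseconds since the UTC epoch, and the timezone is carried separately as metadata. It does not use pyarrow and is not how Arrow is implemented internally; `to_utc_micros`, `from_utc_micros`, and the fixed-offset `PST` zone are hypothetical names chosen for this example.

```python
import datetime as dt

# Fixed UTC-8 offset standing in for US/Pacific on this date (PST);
# an assumption so the sketch needs no timezone database.
PST = dt.timezone(dt.timedelta(hours=-8), "PST")
EPOCH = dt.datetime(1970, 1, 1, tzinfo=dt.timezone.utc)


def to_utc_micros(value: dt.datetime) -> int:
    """Encode a tz-aware datetime as microseconds since the UTC epoch."""
    return (value - EPOCH) // dt.timedelta(microseconds=1)


def from_utc_micros(micros: int, tz: dt.tzinfo) -> dt.datetime:
    """Decode stored UTC microseconds and express them in the given timezone."""
    return (EPOCH + dt.timedelta(microseconds=micros)).astimezone(tz)


local = dt.datetime(2023, 11, 7, 22, 0, 0, tzinfo=PST)
stored = to_utc_micros(local)  # the physical value: a UTC-based integer
tz_metadata = PST              # the timezone recorded alongside it

# Rendering the stored value without the metadata gives the UTC wall time ...
print(from_utc_micros(stored, dt.timezone.utc))
# 2023-11-08 06:00:00+00:00

# ... while applying the metadata recovers the original local time.
print(from_utc_micros(stored, tz_metadata))
# 2023-11-07 22:00:00-08:00
```

The first print corresponds to what the Table repr shows for the values, and the second to what `to_pylist()` returns once the timezone from the schema is applied.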

Contributor Author

nph commented Dec 13, 2023

Thanks @jorisvandenbossche, @AlenkaF.
