Support storing different timezone in an array #31901
Comments
Rok Mihevc / @rok: We currently don't have a variable-timezone array in Arrow, and introducing one would be a big change. Incidentally, the taxi example you're giving is really a single timezone with DST and non-DST periods. See the TZ database names for the definition of timezone that Arrow uses.
Gaurav Sheni: I don't believe they are correct even if the timezone was unified. Yes, for my application, I need to keep different timezones in a timezone array. I cited the taxi example because someone could be picked up in the CT timezone and dropped off in the ET timezone, and then the same taxi picks up a person in the ET timezone. You could then have a pickup_datetime column with two different timezones in one column. The other example I am bringing up: say you are collecting datetimes for one location. It would be important to tell the EDT datetimes from the EST datetimes (they would be in one array).
Joris Van den Bossche / @jorisvandenbossche:
Unfortunately, you are running into some confusing behaviour of pytz. Taking one of your datetime values:
In [8]: print(datetime(year=2010, month=1, day=1, hour=9, minute=0, second=0, tzinfo=pytz.timezone('US/Eastern')))
2010-01-01 09:00:00-04:56
You can see this strange "-04:56" offset string (while we would expect it to be either "-04:00" or "-05:00"). It is this offset that pyarrow applies to get the UTC value (2010-01-01 13:56:00), and then, when converting back to Python and attaching the timezone correctly, you end up with:
In [10]: print(arr[0].as_py())
2010-01-01 08:56:00-05:00
This is due to how the initial datetime was created. Using pytz's localize instead gives the expected offset:
In [11]: print(pytz.timezone('US/Eastern').localize(datetime(year=2010, month=1, day=1, hour=9, minute=0, second=0)))
2010-01-01 09:00:00-05:00
See https://bugs.launchpad.net/pytz/+bug/1746179 and https://blog.ganssle.io/articles/2018/03/pytz-fastest-footgun.html for a more detailed explanation about this.
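The offset arithmetic described in this comment can be reproduced with the standard library alone. The following is only a sketch: fixed-offset timezone objects stand in for the pytz objects, the -04:56 value is hard-coded from the printed output rather than taken from pytz, and the names LMT_LIKE and EST are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets stand in for the pytz timezone objects.
# -04:56 is the unexpected offset printed in the session above.
LMT_LIKE = timezone(timedelta(hours=-4, minutes=-56))
EST = timezone(timedelta(hours=-5))

# The wall-clock value 09:00 carrying the unexpected -04:56 offset.
wrong = datetime(2010, 1, 1, 9, 0, 0, tzinfo=LMT_LIKE)

# Converting to UTC applies that offset: 09:00 + 04:56 = 13:56 UTC.
as_utc = wrong.astimezone(timezone.utc)

# Reading the UTC instant back with the correct winter offset (-05:00)
# gives 08:56 instead of the intended 09:00.
round_tripped = as_utc.astimezone(EST)

print(as_utc.isoformat())
print(round_tripped.isoformat())
```

The four-minute shift comes entirely from the offset attached when the datetime was constructed, not from the conversion to UTC or back.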
Joris Van den Bossche / @jorisvandenbossche: For storing data in a single timezone (e.g. "US/Eastern" or "America/New_York") while still knowing whether a value is EDT or EST, there are a few ways to do this. First, there is an is_dst compute function:
In [14]: arr = pa.array([datetime(2010, 1, 1), datetime(2010, 6, 1)], pa.timestamp("us", "America/New_York"))
In [15]: pc.is_dst(arr)
Out[15]:
<pyarrow.lib.BooleanArray object at 0x7fe60da45760>
[
false,
true
]
You could also interpret each timestamp both in the local timezone and in UTC, and take the difference between the two to get the UTC offset. But I don't think we already have a compute function to get a naive time from a tz-aware timestamp.
I think the best workaround we can currently mention for actually storing multiple timezones is to store the timestamps themselves in UTC and have a separate column that keeps track of the timezone. In theory we could consider adding a timezone argument to functions that need a timezone (and currently get that from the type).
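The two-column workaround mentioned above can be sketched in plain Python. This is an illustration, not pyarrow code: lists stand in for columns, the helper names split_columns and join_columns are hypothetical, and a UTC offset in minutes stands in for the timezone column (a real table would more likely store an IANA name such as "America/New_York").

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets stand in for per-row timezones.
EST = timezone(timedelta(hours=-5), "EST")
EDT = timezone(timedelta(hours=-4), "EDT")


def split_columns(values):
    """Split tz-aware datetimes into a UTC column and an offset column."""
    utc_col = [v.astimezone(timezone.utc) for v in values]
    offset_col = [int(v.utcoffset().total_seconds() // 60) for v in values]
    return utc_col, offset_col


def join_columns(utc_col, offset_col):
    """Rebuild the local datetimes from the UTC and offset columns."""
    return [
        u.astimezone(timezone(timedelta(minutes=m)))
        for u, m in zip(utc_col, offset_col)
    ]


pickups = [
    datetime(2010, 1, 1, 9, 0, tzinfo=EST),  # winter, -05:00
    datetime(2010, 6, 1, 9, 0, tzinfo=EDT),  # summer, -04:00
]
utc_col, offset_col = split_columns(pickups)
restored = join_columns(utc_col, offset_col)
```

Because the instant and the offset live in separate columns, the UTC column can use a single timestamp type while the second column preserves the per-row information that would otherwise be lost.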
Rok Mihevc / @rok: The two-column approach @jorisvandenbossche suggests could perhaps already be done with a binary UDF. See examples here:
As a user, I wish I could use pyarrow to store a column of datetimes with different timezones. In certain datasets, it is ideal to have a column with mixed timezones (e.g. taxi pickups). Even if the data is limited to a single location (say, a business in NYC) over the span of a single year, the timezones will be EDT and EST, with offsets of -4:00 and -5:00.
Currently, it is not possible to keep a column with different timezones.
Reporter: Gaurav Sheni
Watchers: Rok Mihevc / @rok
Note: This issue was originally created as ARROW-16540. Please see the migration documentation for further details.