Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support storing different timezone in an array #31901

Open
asfimport opened this issue May 11, 2022 · 5 comments
Open

Support storing different timezone in an array #31901

asfimport opened this issue May 11, 2022 · 5 comments

Comments

@asfimport
Copy link

As a user, I wish I could use pyarrow to store a column of datetimes with different timezones. In certain datasets, it is ideal to a column with mixed timezones (ex - taxi pickups). Even if the data is limited to a single location (let's say a business in NYC for example) over the time span of a single year... then your timezones will be EDT/EST with offsets of -4:00 and -5:00.

 

Currently, it is not possible to keep a column with different timezones.

 

import pytz
import pyarrow as pa
import pytz
from datetime import datetime
arr = pa.array(
    [datetime(year=2010, month=1, day=1, hour=9, minute=0, second=0, tzinfo=pytz.timezone('US/Eastern')), 
     datetime(year=2010, month=1, day=1, hour=6, minute=0, second=0, tzinfo=pytz.timezone('US/Pacific'))
    ]
)
# value at index 0, 9AM ET
# value at index 1, 6AM PT is 9AM ET
assert arr[0].as_py().hour == 9 # fail
assert arr[1].as_py().hour == 9 # fail
assert arr[0].as_py().hour == 6 # fail
assert arr[1].as_py().hour == 6 # fail

 

Both datetime values are actually at the same time (although different timezones)

 

Reporter: Gaurav Sheni
Watchers: Rok Mihevc / @rok

Note: This issue was originally created as ARROW-16540. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Rok Mihevc / @rok:
Hey Gaurav,

we currently don't have a variable timezone array in arrow and introducing it would be a big change.
Do notice times in your output array are still correct even though their timezone was unified. Do you need to keep timezone information for your application?

Incidentally the taxi example you're giving is really a single timezone with DST/non-DST period. See TZ database name here to see definition of timezone arrow uses.

@asfimport
Copy link
Author

Gaurav Sheni:
Hi @rok ,

I don't believe they are correct even if the timezone was unified.
I updated my code example. Both should be 9AM ET // 6PM PT.

Yes, for my application, I need to keep different timezones in a timezone array.

I cited the Taxi example because imagine someone was picked up in CT timezone, and dropped off in a ET timezone. Then the same taxi picks up a person in the ET timezone. You could have a pickup_datetime column where two different timezones are in 1 column.

The other example I am bringing up is let's say you are collecting datetimes for 1 location. It would be important to know the EDT datetimes from the EST datetimes (they would be in 1 array).

 

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:

I don't believe they are correct even if the timezone was unified. I updated my code example. Both should be 9AM ET // 6PM PT.

Unfortunately, you are running into some confusing behaviour of pytz, and so the fact that the times are indeed incorrect is not pyarrow's fault.

Taking one of your datetime values:

In [8]: print(datetime(year=2010, month=1, day=1, hour=9, minute=0, second=0, tzinfo=pytz.timezone('US/Eastern')))
2010-01-01 09:00:00-04:56

You can see this strange "-04:56" offset string (while we would expect it to be either "-04:00" or "-05:00"), and it is this offset that pyarrow applies to get the UTC value (2010-01-01 13:56:00), and then when converting back to python and attaching the timezone correctly, you end up with:

In [10]: print(arr[0].as_py())
2010-01-01 08:56:00-05:00

But so this is due to the initial creating of the datetime.datetime object with a pytz timezone. The "correct" way to do this for pytz:

In [11]: print(pytz.timezone('US/Eastern').localize(datetime(year=2010, month=1, day=1, hour=9, minute=0, second=0)))
2010-01-01 09:00:00-05:00

See https://bugs.launchpad.net/pytz/+bug/1746179 and https://blog.ganssle.io/articles/2018/03/pytz-fastest-footgun.html for a more detailed explanation about this.

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
On your actual question about supporting different timezones in a single array, @rok already mentioned this is not supported, and I also don't think we currently have plans to do so.

For storing data in a single timezone (eg "US/Eastern" or "America/New_York") but wanting to know if it is EDT or EST, there are a few ways to do this. First, there is an is_dst compute function (new in 8.0.0):

In [14]: arr = pa.array([datetime(2010, 1, 1), datetime(2010, 6, 1)], pa.timestamp("us", "America/New_York"))

In [15]: pc.is_dst(arr)
Out[15]: 
<pyarrow.lib.BooleanArray object at 0x7fe60da45760>
[
  false,
  true
]

You could also calculate the time difference if you localize the times to the local timezone and to UTC, and then calculate the time difference. But I don't think we already have a compute function to get a naive time from the tz-aware timestamp.

I think the best workaround we can currently mention for actually storing multiple timezones, is to store the timestamps itself in UTC and have a separate column that keeps track of the timezone. In theory we could consider adding a timezone argument to functions that need a timezone (and currently get that from the type).

@asfimport
Copy link
Author

Rok Mihevc / @rok:
Sorry I misread timestamps in your original example.

The two column approach @jorisvandenbossche suggests could perhaps already be done with a binary UDF. See examples here:
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_udf.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant