Support storing different timezone in an array #31901

asfimport · 2022-05-11T21:13:57Z

As a user, I wish I could use pyarrow to store a column of datetimes with different timezones. In certain datasets, it is ideal to a column with mixed timezones (ex - taxi pickups). Even if the data is limited to a single location (let's say a business in NYC for example) over the time span of a single year... then your timezones will be EDT/EST with offsets of -4:00 and -5:00.

Currently, it is not possible to keep a column with different timezones.

import pytz
import pyarrow as pa
import pytz
from datetime import datetime
arr = pa.array(
    [datetime(year=2010, month=1, day=1, hour=9, minute=0, second=0, tzinfo=pytz.timezone('US/Eastern')), 
     datetime(year=2010, month=1, day=1, hour=6, minute=0, second=0, tzinfo=pytz.timezone('US/Pacific'))
    ]
)

# value at index 0, 9AM ET
# value at index 1, 6AM PT is 9AM ET

assert arr[0].as_py().hour == 9 # fail
assert arr[1].as_py().hour == 9 # fail
assert arr[0].as_py().hour == 6 # fail
assert arr[1].as_py().hour == 6 # fail

Both datetime values are actually at the same time (although different timezones)

Reporter: Gaurav Sheni
Watchers: Rok Mihevc / @rok

_{Note: This issue was originally created as ARROW-16540. Please see the migration documentation for further details.}

asfimport · 2022-05-12T00:30:28Z

Rok Mihevc / @rok:
Hey Gaurav,

we currently don't have a variable timezone array in arrow and introducing it would be a big change.
Do notice times in your output array are still correct even though their timezone was unified. Do you need to keep timezone information for your application?

Incidentally the taxi example you're giving is really a single timezone with DST/non-DST period. See TZ database name here to see definition of timezone arrow uses.

asfimport · 2022-05-12T01:58:57Z

Gaurav Sheni:
Hi @rok ,

I don't believe they are correct even if the timezone was unified.
I updated my code example. Both should be 9AM ET // 6PM PT.

Yes, for my application, I need to keep different timezones in a timezone array.

I cited the Taxi example because imagine someone was picked up in CT timezone, and dropped off in a ET timezone. Then the same taxi picks up a person in the ET timezone. You could have a pickup_datetime column where two different timezones are in 1 column.

The other example I am bringing up is let's say you are collecting datetimes for 1 location. It would be important to know the EDT datetimes from the EST datetimes (they would be in 1 array).

asfimport · 2022-05-12T06:57:24Z

Joris Van den Bossche / @jorisvandenbossche:

I don't believe they are correct even if the timezone was unified. I updated my code example. Both should be 9AM ET // 6PM PT.

Unfortunately, you are running into some confusing behaviour of pytz, and so the fact that the times are indeed incorrect is not pyarrow's fault.

Taking one of your datetime values:

In [8]: print(datetime(year=2010, month=1, day=1, hour=9, minute=0, second=0, tzinfo=pytz.timezone('US/Eastern')))
2010-01-01 09:00:00-04:56

You can see this strange "-04:56" offset string (while we would expect it to be either "-04:00" or "-05:00"), and it is this offset that pyarrow applies to get the UTC value (2010-01-01 13:56:00), and then when converting back to python and attaching the timezone correctly, you end up with:

In [10]: print(arr[0].as_py())
2010-01-01 08:56:00-05:00

But so this is due to the initial creating of the datetime.datetime object with a pytz timezone. The "correct" way to do this for pytz:

In [11]: print(pytz.timezone('US/Eastern').localize(datetime(year=2010, month=1, day=1, hour=9, minute=0, second=0)))
2010-01-01 09:00:00-05:00

See https://bugs.launchpad.net/pytz/+bug/1746179 and https://blog.ganssle.io/articles/2018/03/pytz-fastest-footgun.html for a more detailed explanation about this.

asfimport · 2022-05-12T07:18:15Z

Joris Van den Bossche / @jorisvandenbossche:
On your actual question about supporting different timezones in a single array, @rok already mentioned this is not supported, and I also don't think we currently have plans to do so.

For storing data in a single timezone (eg "US/Eastern" or "America/New_York") but wanting to know if it is EDT or EST, there are a few ways to do this. First, there is an is_dst compute function (new in 8.0.0):

In [14]: arr = pa.array([datetime(2010, 1, 1), datetime(2010, 6, 1)], pa.timestamp("us", "America/New_York"))

In [15]: pc.is_dst(arr)
Out[15]: 
<pyarrow.lib.BooleanArray object at 0x7fe60da45760>
[
  false,
  true
]

You could also calculate the time difference if you localize the times to the local timezone and to UTC, and then calculate the time difference. But I don't think we already have a compute function to get a naive time from the tz-aware timestamp.

—

I think the best workaround we can currently mention for actually storing multiple timezones, is to store the timestamps itself in UTC and have a separate column that keeps track of the timezone. In theory we could consider adding a timezone argument to functions that need a timezone (and currently get that from the type).

asfimport · 2022-05-12T09:53:51Z

Rok Mihevc / @rok:
Sorry I misread timestamps in your original example.

The two column approach @jorisvandenbossche suggests could perhaps already be done with a binary UDF. See examples here:
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_udf.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support storing different timezone in an array #31901

Support storing different timezone in an array #31901

asfimport commented May 11, 2022

asfimport commented May 12, 2022

asfimport commented May 12, 2022

asfimport commented May 12, 2022

asfimport commented May 12, 2022

asfimport commented May 12, 2022

Support storing different timezone in an array #31901

Support storing different timezone in an array #31901

Comments

asfimport commented May 11, 2022

asfimport commented May 12, 2022

asfimport commented May 12, 2022

asfimport commented May 12, 2022

asfimport commented May 12, 2022

asfimport commented May 12, 2022