Describe the bug
I have a Parquet file where the timestamps are already localised and stored as UTC.
I am unable to read the file using wrangler with a dtype_backend of pyarrow, as it returns the following error:

```
ArrowInvalid: Timestamps already have a timezone: 'UTC'. Cannot localize to 'UTC'.
```
Upon digging into the traceback, it seems the issue is in awswrangler/_arrow.py:

```python
if col_name in df.columns and c["pandas_type"] == "datetimetz":
    timezone: datetime.tzinfo = pa.lib.string_to_tzinfo(c["metadata"]["timezone"])
    _logger.debug("applying timezone (%s) on column %s", timezone, col_name)
    if hasattr(df[col_name].dtype, "tz") is False:
        df[col_name] = df[col_name].dt.tz_localize(tz="UTC")
    df[col_name] = df[col_name].dt.tz_convert(tz=timezone)
```
This block seems to always apply tz_localize whenever the column dtype lacks a `tz` attribute; with the pyarrow backend the dtype is an ArrowDtype, which does not expose a `tz` attribute, so localisation is attempted despite the Parquet timestamps already being localised as UTC.
I can use the numpy_nullable backend and this error does not occur, but I want to use pyarrow for my application.
If there are parameters in the function call that would allow this to work with pyarrow, I am not aware of them based on the latest 3.2.1 documentation.
How to Reproduce
```python
wr.s3.read_parquet(file_path, dtype_backend="pyarrow", boto3_session=session)
```
The boto3_session has the correct permissions: I can load the file fine if I change the backend, or if I point to a Parquet file that contains no timestamps.
Expected behavior
The file is read into a pandas DataFrame with the datatype `datetime64[ns, UTC]` in place of Parquet's `Timestamp[ns, UTC]`.
Your project
No response
Screenshots
OS
Mac Local/Linux AWS Node
Python version
3.9.1
AWS SDK for pandas version
3.2.1
Additional context
No response
