read_parquet with PyArrow backend tries to tz_localize(tz="UTC") on already localised UTC timestamps #2410

@BenTaylorProfusion

Describe the bug

I have a Parquet file whose timestamps are already localised and stored as UTC.
I am unable to read the file using wrangler with dtype_backend="pyarrow"; it raises the following error:
ArrowInvalid: Timestamps already have a timezone: 'UTC'. Cannot localize to 'UTC'.

Digging into the traceback, the issue seems to be in awswrangler/_arrow.py:

    if col_name in df.columns and c["pandas_type"] == "datetimetz":
        timezone: datetime.tzinfo = pa.lib.string_to_tzinfo(c["metadata"]["timezone"])
        _logger.debug("applying timezone (%s) on column %s", timezone, col_name)
        if hasattr(df[col_name].dtype, "tz") is False:
            df[col_name] = df[col_name].dt.tz_localize(tz="UTC")
        df[col_name] = df[col_name].dt.tz_convert(tz=timezone)

This block seems to always apply a timezone localisation, even though the Parquet timestamps are already localised as UTC.

I can use the numpy_nullable backend and this error does not occur, but I want to use pyarrow for my application.

If there are parameters in the function call that would allow this to work with pyarrow, I am not aware of them based on the latest (3.2.1) documentation.

How to Reproduce

wr.s3.read_parquet(file_path, dtype_backend="pyarrow", boto3_session=session)

The boto3_session has the correct permissions: I can load the file fine if I change the backend, or if I point to Parquet files that contain no timestamps.

Expected behavior

To read into a pandas DataFrame with a dtype of datetime64[ns, UTC] in place of Parquet's Timestamp[ns, UTC].

Your project

No response

Screenshots

Screenshot 2023-07-26 at 10 17 16

OS

Mac Local/Linux AWS Node

Python version

3.9.1

AWS SDK for pandas version

3.2.1

Additional context

No response
