[Python] pyarrow.parquet.read_table with filters is broken for timezone aware datetime since 13.0.0 release #37355
Comments
@kardaj thanks a lot for the report. I can reproduce this with pyarrow 13.0.0 (installed from conda-forge, on Ubuntu Linux). The bizarre thing is that it works on the main branch (and it worked on 12.0.0). |
Pinging @rok @westonpace @wjones127 in case they know where to look. |
It works for me on the main branch, but also on the git tag |
That is odd. Are we sure the timestamp (from pytz) carries the correct timezone? The relevant casting logic is here: arrow/cpp/src/arrow/compute/kernels/scalar_cast_temporal.cc Lines 143 to 169 in b9453a2
But it hasn't changed much lately. |
I just installed the wheels from our nightlies:
And it seems to work:
|
To clarify, the same test fails when installing pyarrow==13.0.0:
|
A simpler reproducer:
import pyarrow as pa
import pyarrow.compute as pc
# create table with tz-aware nanosecond resolution timestamp
table = pa.table({'timestamp': pa.array([1], pa.timestamp("ns", "UTC"))})
# comparison with microseconds works if there are microseconds
table.filter(pc.field("timestamp") <= pa.scalar(1, pa.timestamp("us", "UTC")))
# comparison fails with microseconds if there are no microseconds
table.filter(pc.field("timestamp") <= pa.scalar(0, pa.timestamp("us", "UTC")))
# ...
# ArrowNotImplementedError: Function 'less_equal' has no kernel
# matching input types (timestamp[ns, tz=UTC], timestamp[s])
# but works again if the resolution matches
table.filter(pc.field("timestamp") <= pa.scalar(0, pa.timestamp("ns", "UTC")))
It somehow completely loses the type information of the scalar (both the resolution and the timezone) somewhere inside Acero. Calling the compute kernel directly, instead of going through an expression executed with Acero, seems to work fine:
The above reproducer actually also fails with pyarrow 12.0.0, but still seems fixed on the main branch. So some other change in 13.0.0 might have changed the dataset filtering to take the code path from the reproducer above. |
Could this be related to #37135? |
Ah, indeed, that looks very much related! |
Could we close this issue? |
Describe the bug, including details regarding any error messages, version, and platform.
From what I gathered, a timezone-aware datetime.datetime is cast into a naive timestamp if its microseconds=0.
I managed to replicate the error in this snippet:
With pyarrow<13.0.0, I get the following output:
With pyarrow==13.0.0, I get the following output:
Component(s)
Parquet, Python