
[Python] pyarrow.parquet.read_table with filters is broken for timezone aware datetime since 13.0.0 release #37355

kardaj opened this issue Aug 24, 2023 · 10 comments

kardaj commented Aug 24, 2023

Describe the bug, including details regarding any error messages, version, and platform.

From what I gathered, a timezone-aware datetime.datetime is cast to a naive timestamp if its microsecond value is 0.

I managed to replicate the error in this snippet:

import io
import pytz
import datetime
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc

timezone = "Europe/Paris"
field_name = "timestamp"

# empty table with a single non-nullable, timezone-aware nanosecond timestamp column
table = pa.Table.from_pydict(
    {field_name: []},
    schema=pa.schema(
        [
            pa.field(
                field_name,
                pa.timestamp("ns", tz=timezone),
                nullable=False,
            )
        ]
    ),
)
print(table)
buffer = io.BytesIO()
pq.write_table(table, buffer)

filters = None
table = pq.read_table(buffer, filters=filters)
assert len(table.to_pylist()) == 0
print(f"filters={filters}", "ok")

# a filter built from a datetime with microsecond=1 works; microsecond=0 fails
for microsecond in [1, 0]:
    timestamp = pytz.timezone(timezone).localize(
        datetime.datetime.combine(
            datetime.date.today(),
            datetime.time(hour=12, microsecond=microsecond),
        )
    )
    filters = pc.field("timestamp") <= timestamp
    table = pq.read_table(buffer, filters=filters)
    assert len(table.to_pylist()) == 0
    print(f"filters={filters}", "ok")

With pyarrow<13.0.0, I get the following output:

pyarrow.Table
timestamp: timestamp[ns, tz=Europe/Paris] not null
----
timestamp: [[]]
filters=None ok
filters=(timestamp <= 2023-08-24 10:00:00.000001) ok
filters=(timestamp <= 2023-08-24 10:00:00.000000) ok
terminate called without an active exception
Aborted (core dumped)

With pyarrow==13.0.0, I get the following output:

pyarrow.Table
timestamp: timestamp[ns, tz=Europe/Paris] not null
----
timestamp: [[]]
filters=None ok
filters=(timestamp <= 2023-08-24 10:00:00.000001) ok
Traceback (most recent call last):
  File "/workspaces/mapping-tools/broken_pyarrow.py", line 43, in <module>
    table = pq.read_table(buffer, filters=filters)
  File "/workspaces/mapping-tools/env/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 3002, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/workspaces/mapping-tools/env/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 2630, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 547, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 393, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 3391, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 3309, in pyarrow._dataset.Scanner._make_scan_options
  File "pyarrow/_dataset.pyx", line 3243, in pyarrow._dataset._populate_builder
  File "pyarrow/_compute.pyx", line 2595, in pyarrow._compute._bind
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'less_equal' has no kernel matching input types (timestamp[ns, tz=Europe/Paris], timestamp[s])

Component(s)

Parquet, Python
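
A possible workaround, sketched from the snippet above but not verified against the 13.0.0 release: build the filter literal as an explicitly typed pa.scalar that already matches the column type, so no unit is inferred from the Python datetime.

import datetime
import io

import pytz
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

timezone = "Europe/Paris"
field_name = "timestamp"

# same empty, tz-aware nanosecond table as in the snippet above
schema = pa.schema([pa.field(field_name, pa.timestamp("ns", tz=timezone), nullable=False)])
table = pa.Table.from_pydict({field_name: []}, schema=schema)
buffer = io.BytesIO()
pq.write_table(table, buffer)

timestamp = pytz.timezone(timezone).localize(
    datetime.datetime.combine(datetime.date.today(), datetime.time(hour=12, microsecond=0))
)

# give the literal the column's exact type (ns, Europe/Paris) instead of letting
# pyarrow infer a unit from the Python datetime
literal = pa.scalar(timestamp, type=pa.timestamp("ns", tz=timezone))
filters = pc.field(field_name) <= literal

result = pq.read_table(buffer, filters=filters)
assert len(result.to_pylist()) == 0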


jorisvandenbossche commented Aug 24, 2023

@kardaj thanks a lot for the report. I can reproduce this with pyarrow 13.0.0 (installed from conda-forge, on Ubuntu Linux); the bizarre thing is that it does work on the main branch (and it worked on 12.0.0).

@pitrou pitrou changed the title pyarrow.parquet.read_table with filters is broken for timezone aware datetime since 13.0.0 release [C++][Python] pyarrow.parquet.read_table with filters is broken for timezone aware datetime since 13.0.0 release Aug 24, 2023
@jorisvandenbossche jorisvandenbossche changed the title [C++][Python] pyarrow.parquet.read_table with filters is broken for timezone aware datetime since 13.0.0 release [Python] pyarrow.parquet.read_table with filters is broken for timezone aware datetime since 13.0.0 release Aug 24, 2023

pitrou commented Aug 24, 2023

Pinging @rok @westonpace @wjones127 in case they know where to look.


pitrou commented Aug 24, 2023

It works for me on the main branch, but also on the git tag apache-arrow-13.0.0. @raulcd


rok commented Aug 24, 2023

That is odd. Are we sure timestamp (from pytz) is correctly timezoned? The relevant casting logic is here:

// <TimestampType, TimestampType> and <DurationType, DurationType>
template <typename O, typename I>
struct CastFunctor<
    O, I,
    enable_if_t<(is_timestamp_type<O>::value && is_timestamp_type<I>::value) ||
                (is_duration_type<O>::value && is_duration_type<I>::value)>> {
  static Status Exec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
    const auto& in_type = checked_cast<const I&>(*batch[0].type());
    const auto& out_type = checked_cast<const O&>(*out->type());
    if (in_type.unit() == out_type.unit()) {
      return ZeroCopyCastExec(ctx, batch, out);
    }
    ArrayData* out_arr = out->array_data().get();
    DCHECK_EQ(0, out_arr->offset);
    int value_size = batch[0].type()->byte_width();
    DCHECK_OK(ctx->Allocate(out_arr->length * value_size).Value(&out_arr->buffers[1]));
    ArraySpan output_span;
    output_span.SetMembers(*out_arr);
    const ArraySpan& input = batch[0].array;
    auto conversion = util::GetTimestampConversion(in_type.unit(), out_type.unit());
    return ShiftTime<int64_t, int64_t>(ctx, conversion.first, conversion.second, input,
                                       &output_span);
  }
};

But it hasn't changed much lately.
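
One quick check from the Python side (a sketch; output not captured in this thread) is to inspect the type pyarrow infers when converting the pytz-localized datetime to an Arrow scalar, with and without microseconds. The expression/filter path may infer the literal type differently, which is what the timestamp[s] in the error message suggests:

import datetime

import pytz
import pyarrow as pa

tz = pytz.timezone("Europe/Paris")
for microsecond in (1, 0):
    dt = tz.localize(
        datetime.datetime.combine(
            datetime.date.today(),
            datetime.time(hour=12, microsecond=microsecond),
        )
    )
    # print the timestamp unit and timezone pyarrow attaches to the scalar
    print(microsecond, pa.scalar(dt).type)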


raulcd commented Aug 24, 2023

I just installed the wheels from our nightlies:

$ pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ --prefer-binary --pre pyarrow
Looking in indexes: https://pypi.org/simple, https://pypi.fury.io/arrow-nightlies/
Collecting pyarrow
  Downloading https://pypi.fury.io/arrow-nightlies/-/ver_1bxX5l/pyarrow-14.0.0.dev10-cp311-cp311-manylinux_2_28_x86_64.whl (37.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37.4/37.4 MB 12.9 MB/s eta 0:00:00
Collecting numpy>=1.16.6 (from pyarrow)
  Obtaining dependency information for numpy>=1.16.6 from https://files.pythonhosted.org/packages/35/8b/b669836be53d7b6697bc290abcdda701fd924a22c702713bbdf2c6c5bef5/numpy-1.26.0b1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading numpy-1.26.0b1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.5/58.5 kB 1.8 MB/s eta 0:00:00
Downloading numpy-1.26.0b1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.2/19.2 MB 3.6 MB/s eta 0:00:00
Installing collected packages: numpy, pyarrow
Successfully installed numpy-1.26.0b1 pyarrow-14.0.0.dev10

And it seems to work:

$ python test.py 
pyarrow.Table
timestamp: timestamp[ns, tz=Europe/Paris] not null
----
timestamp: [[]]
filters=None ok
filters=(timestamp <= 2023-08-24 10:00:00.000001) ok
filters=(timestamp <= 2023-08-24 10:00:00.000000) ok
terminate called without an active exception
Aborted (core dumped)


raulcd commented Aug 24, 2023

As a clarification, the same test fails when installing pyarrow==13.0.0:

pyarrow.Table
timestamp: timestamp[ns, tz=Europe/Paris] not null
----
timestamp: [[]]
filters=None ok
filters=(timestamp <= 2023-08-24 10:00:00.000001) ok
Traceback (most recent call last):
  File "/home/raulcd/code/raul-test13/test.py", line 40, in <module>
    table = pq.read_table(buffer, filters=filters)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/raulcd/code/raul-test13/test/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 3002, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/raulcd/code/raul-test13/test/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 2630, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 547, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 393, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 3391, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 3309, in pyarrow._dataset.Scanner._make_scan_options
  File "pyarrow/_dataset.pyx", line 3243, in pyarrow._dataset._populate_builder
  File "pyarrow/_compute.pyx", line 2595, in pyarrow._compute._bind
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'less_equal' has no kernel matching input types (timestamp[ns, tz=Europe/Paris], timestamp[s])

@jorisvandenbossche

A simpler reproducer:

import pyarrow as pa
import pyarrow.compute as pc

# create table with tz-aware nanosecond resolution timestamp
table = pa.table({'timestamp': pa.array([1], pa.timestamp("ns", "UTC"))})

# comparison with microseconds works if there are microseconds
table.filter(pc.field("timestamp") <= pa.scalar(1, pa.timestamp("us", "UTC")))

# comparison with a microsecond-resolution scalar fails if the value has no microseconds
table.filter(pc.field("timestamp") <= pa.scalar(0, pa.timestamp("us", "UTC")))
# ...
# ArrowNotImplementedError: Function 'less_equal' has no kernel
# matching input types (timestamp[ns, tz=UTC], timestamp[s])

# but works again if the resolution matches
table.filter(pc.field("timestamp") <= pa.scalar(0, pa.timestamp("ns", "UTC")))

It somehow completely loses the type information of the scalar (both the resolution and the timezone) somewhere inside Acero.

Calling the compute kernel directly, instead of going through an expression and executing with Acero, seems to work fine:

>>> pc.less_equal(table["timestamp"], pa.scalar(0, pa.timestamp("us", "UTC")))
<pyarrow.lib.ChunkedArray object at 0x7f804bbb0c20>
[
  [
    false
  ]
]

The above reproducer actually also fails with pyarrow 12.0.0, but it still seems fixed on the main branch. So some other change in 13.0.0 might have made the dataset filtering take the code path from the reproducer above.
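
One way to probe that hypothesis, sketched under the assumption that an in-memory dataset goes through the same scan/binding path as the Parquet read: wrap the table in pyarrow.dataset and pass the same mismatched literal as a scan filter.

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

# same data as the simpler reproducer, but wrapped in an in-memory dataset so the
# filter is bound by the dataset scanner rather than Table.filter
table = pa.table({'timestamp': pa.array([1], pa.timestamp("ns", "UTC"))})
dataset = ds.dataset(table)

# if the dataset scan binding is the code path that regressed in 13.0.0, this
# should raise the same ArrowNotImplementedError as read_table with filters
dataset.to_table(filter=pc.field("timestamp") <= pa.scalar(0, pa.timestamp("us", "UTC")))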


mapleFU commented Aug 25, 2023

Could it be related to #37135?

@jorisvandenbossche

Ah, indeed, that looks very much related!


mapleFU commented Nov 4, 2023

Could we close this issue?
