
SQL query filter not working with pyarrow dataset with timestamp index #9371

Closed
1 task done
samm0ss opened this issue Oct 17, 2023 · 4 comments · Fixed by #9377

Comments


samm0ss commented Oct 17, 2023

What happens?

  • Create pandas dataframe with timestamp index (i.e. ts)
  • Write to parquet file
  • Create pyarrow dataset using this parquet file
  • Run filter query on dataset and select specific ts: "SELECT * FROM my_arrow_dataset WHERE ts = ?"
  • The filter query does not return any rows.

This query does work when you query the parquet file directly (see the reproduction below). It only fails when you build a dataset out of the file. It's possible that this is inherent to datasets over parquet files with timestamp indices; however, I could not find any documentation suggesting this.

To Reproduce

# python==3.10.9
# duckdb==0.9.1
# pandas==2.1.1
# pyarrow==13.0.0

import datetime
import pathlib
import tempfile

import duckdb
import pyarrow as pa
import pyarrow.dataset as ds

# connect to an in-memory database
con = duckdb.connect()
con.execute("SET TimeZone='UTC';")

# Example data
dt = datetime.datetime(2023, 8, 29, 1, tzinfo=datetime.timezone.utc)
my_arrow_table = pa.Table.from_pydict(
    {'ts': [dt, dt, dt],
     'value': [1, 2, 3]})
df = my_arrow_table.to_pandas()
df = df.set_index("ts")  # SET INDEX! (It all works correctly when the index is not set)

# create example parquet files and save in a folder
base_path = pathlib.Path(tempfile.gettempdir()) / "parquet_folder"
base_path.mkdir(exist_ok=True)
file_path = base_path / "test.parquet"
df.to_parquet(str(file_path))
my_arrow_dataset = ds.dataset(str(base_path))


try:
    # Fails with `duckdb.duckdb.NotImplementedException:
    # Not implemented Error: DatetimeType not recognized in ConvertTimestampUnit`
    con.execute("SELECT * FROM my_arrow_dataset WHERE ts = ?", parameters=[dt]).arrow()
except duckdb.duckdb.NotImplementedException:
    pass  # Ok let's try timestamp

# Doesn't fail but doesn't return anything
timestamp = pa.scalar(dt, type=pa.timestamp("us", tz='UTC'))
results_dataset = con.execute("SELECT * FROM my_arrow_dataset WHERE ts = ?", parameters=[timestamp.value]).arrow()

# Works correctly
result_parquet_file = con.execute(
    f"SELECT * FROM read_parquet('{file_path}') WHERE ts = ?", parameters=[dt]
).arrow()

assert results_dataset == result_parquet_file, f"{repr(results_dataset)} != {repr(result_parquet_file)}"

OS:

OSX aarch64

DuckDB Version:

0.9.0

DuckDB Client:

python

Full Name:

Sam VL

Affiliation:

source.ag

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have

Tishj commented Oct 17, 2023

Probably similar to #8856. The issue is explained there; I haven't run your example yet, though.


samm0ss commented Oct 17, 2023

Indeed it does look like it's the same issue (or very similar at least).

I tried running the reproduction script with the latest main build, duckdb-0.9.2.dev48.
It still hits the same issue: the datetime in the filter cannot be cast to the column type:

duckdb.duckdb.NotImplementedException: Not implemented Error: DatetimeType not recognized in ConvertTimestampUnit

Perhaps that is the real bug (rather than working around it by passing a TIMESTAMP_TZ(us) value as the filter argument).


Tishj commented Oct 17, 2023

Yeah, the PR I mentioned is included in 0.9.1; I didn't mean it was already fixed, sorry for the confusion.

I had a look, and it seems the ArrowDateTimeType is DAYS, which is not something we currently expect there.

An unfortunate detail about your reproduction is that it only reproduces on pyarrow 13+, which makes it almost impossible to debug properly: when I attach lldb and pyarrow 13+ gets imported, lldb crashes:

* thread #2, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0x4a03000)
    frame #0: 0x0000000171abb568 libarrow.1300.dylib`_armv8_sve_probe
libarrow.1300.dylib`:
->  0x171abb568 <+0>: eor    z0.d, z0.d, z0.d
    0x171abb56c <+4>: ret    

libarrow.1300.dylib`:
    0x171abb570 <+0>: xar    z0.d, z0.d, z0.d, #0x20
    0x171abb574 <+4>: ret    
Target 0: (Python) stopped.


Tishj commented Oct 17, 2023

Actually, got a debugger attached to it with pyarrow 12 and reproducing the issue 👍
