What happens?
Running a filter of the form
WHERE x in (long list of integers)
hangs when run on a dataset imported from Arrow. This regression appeared in DuckDB v1.3 and is also present in the latest build of v1.4.
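Condensed, the failing pattern looks like the sketch below (I have not verified whether this minimal in-memory variant hangs on its own; the confirmed reproduction further down goes through a hive-partitioned dataset):

```python
import duckdb
import pyarrow as pa

# Sketch of the pattern only: an Arrow-backed relation filtered with a
# long IN list of integers. The confirmed reproduction below uses a
# hive-partitioned pyarrow dataset instead of an in-memory table.
table = pa.table({"x": list(range(10_000))})
id_filter = tuple(range(0, 10_000, 4))
duckdb.from_arrow(table).filter(f"x in {id_filter}").fetchall()
```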
To Reproduce
The script below will:
- Write some random data with hive partitioning
- Read the data through pyarrow
- Import it into duckdb
- Run a filter of the form "id in (0, 4, 8, 12, 16, ..., 9996)"

Step four hangs in v1.3 and beyond. I am not sure whether it actually has anything to do with the hive partitioning.
```python
import time
import duckdb
import pyarrow.dataset as ds
import tempfile

IDS = 10000


class measure_execution_time:
    def __init__(self, name: str):
        self._name = name
        self._start_time = None

    def __enter__(self):
        self._start_time = time.perf_counter()

    def __exit__(self, *args):
        assert self._start_time is not None
        end_time = time.perf_counter()
        elapsed_time = end_time - self._start_time
        print(f"{self._name} took {elapsed_time:.2f} seconds")


def write_data(rootdir: str):
    duckdb.sql(
        f"""
        with ids as (
            select unnest(range({IDS})) as id
        ),
        dates as (
            select unnest(range('2020-01-01'::date, '2021-01-01'::date, '1 day'::interval)) as date
        ),
        data as (
            select
                *, random() as value, random() as sort_order,
                date_part('year', date) as year,
                date_part('month', date) as month,
                date_part('day', date) as day
            from ids cross join dates
            order by sort_order
        )
        select * from data
        """
    ).to_table("tmptable")
    duckdb.sql(f"copy tmptable to '{rootdir}' (format parquet, partition_by (year, month, day))")
    duckdb.sql("drop table tmptable")


def read_data(rootdir: str):
    dataset = ds.dataset(
        rootdir,
        partitioning="hive",
        format="parquet",
    )
    # select every fourth id
    id_filter = tuple(range(0, IDS, 4))
    (
        duckdb.from_arrow(dataset)
        # This specific line is extremely slow on DuckDB v1.3
        .filter(f"id in {id_filter}")
        .to_table("tmptable")
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as rootdir:
        with measure_execution_time("Writing data"):
            write_data(rootdir)
        with measure_execution_time("Reading data"):
            read_data(rootdir)
```
This code, when run with

```
pip install pyarrow==21.0.0 duckdb==1.2.2
```

prints

```
Writing data took 2.52 seconds
Reading data took 1.40 seconds
```

When run with

```
pip install pyarrow==21.0.0 duckdb==1.3.2
```

or

```
pip install pyarrow==21.0.0 duckdb==1.4.0.dev215
```

it prints

```
Writing data took 1.92 seconds
```

and then hangs.
Specifically, it hangs on the line marked in the script above:

```python
.filter(f"id in {id_filter}")
```

This seems specific to importing the data from pyarrow: if I read the data using DuckDB's parquet reader instead, there is no issue.
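For reference, here is a sketch of the unaffected variant using DuckDB's built-in parquet reader; the glob pattern is my assumption about the layout that write_data() produces:

```python
def read_data_native(rootdir: str):
    # Same IN filter, but through duckdb.read_parquet with hive
    # partitioning instead of a pyarrow dataset; this path does not hang.
    id_filter = tuple(range(0, IDS, 4))
    (
        duckdb.read_parquet(f"{rootdir}/**/*.parquet", hive_partitioning=True)
        .filter(f"id in {id_filter}")
        .to_table("tmptable")
    )
```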
OS:
Ubuntu
DuckDB Version:
1.3.2
DuckDB Client:
Python 3.10
Hardware:
No response
Full Name:
Jakob Stigenberg
Affiliation:
Qubos Systematic
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a nightly build
Did you include all relevant data sets for reproducing the issue?
Not applicable - the reproduction does not require a data set
Did you include all code required to reproduce the issue?
Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
Yes, I have