
Severe performance regression filtering dataset imported from pyarrow in duckdb>=1.3.0 #52

@jakkes

Description


What happens?

Running a filter of the form

WHERE x IN (long list of integers)

hangs when the query runs against a dataset imported from Arrow. This is new in DuckDB v1.3 and is also an issue in the latest build of v1.4.
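For illustration, the slow pattern boils down to a long IN-list predicate evaluated against an Arrow-backed relation. A minimal sketch is below; it uses a hypothetical in-memory Arrow table instead of the partitioned dataset from the full reproduction, and it is not verified whether this reduced form also hangs:

import duckdb
import pyarrow as pa

# Hypothetical in-memory Arrow table standing in for the hive-partitioned dataset below.
arrow_table = pa.table({"id": list(range(10_000))})

# Every fourth id, as in the full reproduction.
id_filter = tuple(range(0, 10_000, 4))

# A long IN list filtered against an Arrow-backed relation is the pattern that regresses.
duckdb.from_arrow(arrow_table).filter(f"id in {id_filter}").fetchall()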

To Reproduce

The script below will:

  1. Write some random data in hive partitioning
  2. Read the data through pyarrow
  3. Import it to duckdb
  4. Run a filter of the form "id in (0, 4, 8, 12, 16, ..., 9996)".

Step four hangs in v1.3 and beyond. I am not sure if it actually has anything to do with the hive partitioning.

import time
import duckdb
import pyarrow.dataset as ds
import tempfile


IDS = 10000


class measure_execution_time:
    def __init__(self, name: str):
        self._name = name
        self._start_time = None

    def __enter__(self):
        self._start_time = time.perf_counter()

    def __exit__(self, *args):
        assert self._start_time is not None
        end_time = time.perf_counter()
        elapsed_time = end_time - self._start_time
        print(f"{self._name} took {elapsed_time:.2f} seconds")



def write_data(rootdir: str):
    duckdb.sql(
        f"""
        with ids as (
            select unnest(range({IDS})) as id
        ),
        dates as (
            select unnest(range('2020-01-01'::date, '2021-01-01'::date, '1 day'::interval)) as date
        ),
        data as (
            select
                *, random() as value, random() as sort_order,
                date_part('year', date) as year,
                date_part('month', date) as month,
                date_part('day', date) as day
            from ids cross join dates
            order by sort_order
        )

        select * from data
        """
    ).to_table("tmptable")

    duckdb.sql(f"copy tmptable to '{rootdir}' (format parquet, partition_by (year, month, day))")
    duckdb.sql("drop table tmptable")


def read_data(rootdir: str):

    dataset = ds.dataset(
        rootdir,
        partitioning="hive",
        format="parquet",
    )

    # select every fourth id
    id_filter = tuple(range(0, IDS, 4))

    (
        duckdb.from_arrow(dataset)
        # This specific line is extremely slow on DuckDB v1.3
        .filter(f"id in {id_filter}")
        .to_table("tmptable")
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as rootdir:
        with measure_execution_time("Writing data"):
            write_data(rootdir)

        with measure_execution_time("Reading data"):
            read_data(rootdir)

This code, when run with

pip install pyarrow==21.0.0 duckdb==1.2.2

prints

Writing data took 2.52 seconds
Reading data took 1.40 seconds

When run with

pip install pyarrow==21.0.0 duckdb==1.3.2

or

pip install pyarrow==21.0.0 duckdb==1.4.0.dev215

prints

Writing data took 1.92 seconds

and then hangs.

Specifically, it hangs on the filter call:

.filter(f"id in {id_filter}")

This seems specific to data imported from pyarrow. If I read the data using DuckDB's own parquet reader, the issue does not occur.
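For reference, a minimal sketch of that workaround, assuming the same partitioned directory produced by write_data() above; the rootdir placeholder and the read_parquet/hive_partitioning call are my own, not part of the original report:

import duckdb

rootdir = "/path/to/partitioned/data"  # hypothetical: the directory written by write_data() above
id_filter = tuple(range(0, 10_000, 4))  # every fourth id, as in read_data()

# Read the hive-partitioned files with DuckDB's native parquet reader instead of
# going through pyarrow.dataset; the same IN filter completes normally this way.
duckdb.sql(
    f"""
    select *
    from read_parquet('{rootdir}/**/*.parquet', hive_partitioning = true)
    where id in {id_filter}
    """
).to_table("tmptable")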

OS:

Ubuntu

DuckDB Version:

1.3.2

DuckDB Client:

Python 3.10

Hardware:

No response

Full Name:

Jakob Stigenberg

Affiliation:

Qubos Systematic

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
