
Severe performance regression filtering dataset imported from pyarrow in duckdb>=1.3.0 #52

@jakkes

Description


What happens?

Running a filter of the form

WHERE x IN (long list of integers)

hangs when the query runs against a dataset imported from Arrow. This is new in DuckDB v1.3 and is also an issue in the latest build of v1.4.
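For illustration, the slow pattern boils down to a long IN-list predicate evaluated against an Arrow-backed relation. A minimal sketch is below; it uses a hypothetical in-memory Arrow table instead of the partitioned dataset from the full reproduction, and it is not verified whether this reduced form also hangs:

import duckdb
import pyarrow as pa

# Hypothetical in-memory Arrow table standing in for the hive-partitioned dataset below.
arrow_table = pa.table({"id": list(range(10_000))})

# Every fourth id, as in the full reproduction.
id_filter = tuple(range(0, 10_000, 4))

# A long IN list filtered against an Arrow-backed relation is the pattern that regresses.
duckdb.from_arrow(arrow_table).filter(f"id in {id_filter}").fetchall()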

To Reproduce

The script below will:

  1. Write some random data in hive partitioning
  2. Read the data through pyarrow
  3. Import it to duckdb
  4. Run a filter of the form "id in (0, 4, 8, 12, 16, ..., 9996)".

Step four hangs in v1.3 and beyond. I am not sure if it actually has anything to do with the hive partitioning.

import time
import duckdb
import pyarrow.dataset as ds
import tempfile


IDS = 10000


class measure_execution_time:
    def __init__(self, name: str):
        self._name = name
        self._start_time = None

    def __enter__(self):
        self._start_time = time.perf_counter()

    def __exit__(self, *args):
        assert self._start_time is not None
        end_time = time.perf_counter()
        elapsed_time = end_time - self._start_time
        print(f"{self._name} took {elapsed_time:.2f} seconds")



def write_data(rootdir: str):
    duckdb.sql(
        f"""
        with ids as (
            select unnest(range({IDS})) as id
        ),
        dates as (
            select unnest(range('2020-01-01'::date, '2021-01-01'::date, '1 day'::interval)) as date
        ),
        data as (
            select
                *, random() as value, random() as sort_order,
                date_part('year', date) as year,
                date_part('month', date) as month,
                date_part('day', date) as day
            from ids cross join dates
            order by sort_order
        )

        select * from data
        """
    ).to_table("tmptable")

    duckdb.sql(f"copy tmptable to '{rootdir}' (format parquet, partition_by (year, month, day))")
    duckdb.sql("drop table tmptable")


def read_data(rootdir: str):

    dataset = ds.dataset(
        rootdir,
        partitioning="hive",
        format="parquet",
    )

    # select every fourth id
    id_filter = tuple(range(0, IDS, 4))

    (
        duckdb.from_arrow(dataset)
        # This specific line is extremely slow on DuckDB v1.3
        .filter(f"id in {id_filter}")
        .to_table("tmptable")
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as rootdir:
        with measure_execution_time("Writing data"):
            write_data(rootdir)

        with measure_execution_time("Reading data"):
            read_data(rootdir)

This code, when run with

pip install pyarrow==21.0.0 duckdb==1.2.2

prints

Writing data took 2.52 seconds
Reading data took 1.40 seconds

When run with

pip install pyarrow==21.0.0 duckdb==1.3.2

or

pip install pyarrow==21.0.0 duckdb==1.4.0.dev215

prints

Writing data took 1.92 seconds

and then hangs.

Specifically, it hangs on the filter call:

.filter(f"id in {id_filter}")

This seems specific to data imported from pyarrow. If I read the data using DuckDB's own parquet reader, the issue does not occur.
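For reference, a minimal sketch of that workaround, assuming the same partitioned directory produced by write_data() above; the rootdir placeholder and the read_parquet/hive_partitioning call are my own, not part of the original report:

import duckdb

rootdir = "/path/to/partitioned/data"  # hypothetical: the directory written by write_data() above
id_filter = tuple(range(0, 10_000, 4))  # every fourth id, as in read_data()

# Read the hive-partitioned files with DuckDB's native parquet reader instead of
# going through pyarrow.dataset; the same IN filter completes normally this way.
duckdb.sql(
    f"""
    select *
    from read_parquet('{rootdir}/**/*.parquet', hive_partitioning = true)
    where id in {id_filter}
    """
).to_table("tmptable")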

OS:

Ubuntu

DuckDB Version:

1.3.2

DuckDB Client:

Python 3.10

Hardware:

No response

Full Name:

Jakob Stigenberg

Affiliation:

Qubos Systematic

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
