window performance #7809

@dxe4

What happens?

Window functions seem to be slow. There was a previous issue about this, #1367; I thought I should re-raise it because the comparison below may help fix the problem.
For the query below, pandas takes roughly 1 second and DuckDB takes roughly 30. Maybe it is worth checking the logic in pandas to see whether DuckDB can do the same? I think it is here: https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/window/aggregations.pyx
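
To make the idea concrete, here is a minimal sketch of a single-pass trailing sum over a time range. It is an illustration of the general add-then-evict technique, not code copied from pandas or DuckDB, and the name `rolling_sum` and the sample values are made up for the example. It assumes the timestamps are already sorted and uses the half-open window (t - width, t], which is my understanding of the default for pandas offset windows.

```python
from datetime import datetime, timedelta


def rolling_sum(times, values, width):
    """Trailing sum over the window (t - width, t] for every row.

    Hypothetical helper for illustration. Assumes `times` is sorted
    ascending and `width` is positive. Each value is added once and
    removed at most once, so the whole pass is O(n).
    """
    out = []
    total = 0.0
    start = 0  # index of the oldest row still inside the window
    for end, t in enumerate(times):
        total += values[end]  # the new row enters the window
        # Evict rows that are `width` or more behind the current row.
        while times[start] <= t - width:
            total -= values[start]
            start += 1
        out.append(total)
    return out


# Made-up sample data: one row per day, with a gap on April 5.
times = [datetime(2021, 4, d) for d in (1, 2, 3, 4, 6)]
pay = [10.0, 20.0, 30.0, 40.0, 50.0]
result = rolling_sum(times, pay, timedelta(days=3))
print(result)  # [10.0, 30.0, 60.0, 90.0, 90.0]
```

The point is only that the running total is updated incrementally instead of being recomputed per row. Repeated subtraction can accumulate floating-point error on real data, so a production version would need some form of compensation.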

To Reproduce

file:
https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2021-04.parquet

duckdb:

select
    sum(driver_pay) over _win,
    dropoff_datetime
from read_parquet('~/dev/data/taxi3/fhvhv_tripdata_2021-04.parquet')
window _win as (
    order by dropoff_datetime asc
    range between
        interval 3 days preceding and
        interval 0 days following
)

pandas:

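    import pandas as pd

    # file1 is assumed to be a local copy of the parquet file linked above
    file1 = "fhvhv_tripdata_2021-04.parquet"

    # Load only the two columns the query needs
    df = pd.read_parquet(
        file1,
        columns=["driver_pay", "dropoff_datetime"],
    )
    # Index by drop-off time and sort so a time-based rolling window can be used
    df.index = pd.DatetimeIndex(df["dropoff_datetime"])
    df = df.sort_index()
    df = df.drop("dropoff_datetime", axis=1)
    # Trailing 3-day sum of driver_pay
    df = df.rolling("3d").sum()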

OS:

linux

DuckDB Version:

0.8

DuckDB Client:

latest

Full Name:

harry

Affiliation:

harry

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
