Closed
Description
What happens?
Window functions seem to be slow. There was a previous issue about this, #1367.
I thought I should re-raise it because this may help fix the problem.
On the query below, pandas takes 1 second and DuckDB takes 30 seconds. Maybe it is worth checking the logic in pandas to see whether DuckDB can do the same? I think it is here: https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/window/aggregations.pyx
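To illustrate the kind of logic meant here, below is a minimal sketch of an incremental rolling sum over a sorted time column. It is not taken from pandas or DuckDB; `rolling_time_sum` is a hypothetical helper, and the window convention `(t - window, t]` is an assumption chosen for the example. Rows with identical timestamps are not treated as peers. Each row is added to the running total once and removed once, so the cost does not grow with the width of the window.

```python
from datetime import datetime, timedelta


def rolling_time_sum(times, values, window):
    """For each row, sum the values whose timestamp lies in (t - window, t].

    Assumes `times` is sorted ascending and `window` is a positive timedelta.
    Hypothetical helper for illustration only.
    """
    out = []
    running = 0.0
    start = 0  # index of the oldest row still inside the window
    for end, t in enumerate(times):
        # The current row enters the window.
        running += values[end]
        # Rows at or before t - window leave the window.
        while times[start] <= t - window:
            running -= values[start]
            start += 1
        out.append(running)
    return out


# Small usage example: one row per day, 3-day window.
base = datetime(2021, 4, 1)
times = [base + timedelta(days=d) for d in range(5)]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
result = rolling_time_sum(times, values, timedelta(days=3))
print(result)  # [1.0, 3.0, 6.0, 9.0, 12.0]
```

A plain running total like this can accumulate floating-point error on long inputs, so a real implementation would likely need some form of compensation.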
To Reproduce
file:
https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2021-04.parquet
duckdb:
select
    sum(driver_pay) over _win,
    dropoff_datetime
from read_parquet('~/dev/data/taxi3/fhvhv_tripdata_2021-04.parquet')
window _win as (
    order by dropoff_datetime asc
    range between
        interval 3 days preceding and
        interval 0 days following
);
pandas:
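import pandas as pd

# Assumed path: a local copy of the parquet file linked above
file1 = "fhvhv_tripdata_2021-04.parquet"

df = pd.read_parquet(
    file1,
    columns=["driver_pay", "dropoff_datetime"],
)
# Index by drop-off time so that rolling() can use a time-based window
df.index = pd.DatetimeIndex(df["dropoff_datetime"])
df = df.sort_index()
df = df.drop("dropoff_datetime", axis=1)
df = df.rolling("3d").sum()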
OS:
linux
DuckDB Version:
0.8
DuckDB Client:
latest
Full Name:
harry
Affiliation:
harry
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree