
Limits of aggregate functions when used with windows #2614

Closed
the42 opened this issue Nov 17, 2021 · 10 comments

@the42 commented Nov 17, 2021

Using v0.3.1 (88aa81c), 64-bit, on Windows 10.

I have a table like:

    Timestamp TIMESTAMP
    level VARCHAR
    thread INTEGER
    id BIGINT
    message VARCHAR
    User VARCHAR
    Klasse VARCHAR
    Action VARCHAR
    Laufzeit INTEGER

It contains ~26,000 rows.

When I run select median(laufzeit) from flog, the query succeeds.

When I run

    select timestamp
    , median(laufzeit) over w
    , count(*) over w
    from flog
    window w as (order by timestamp asc range between interval 55 seconds preceding and interval 58 seconds following)
    order by 3 desc
    limit 1;

the query succeeds:

+----------------------------+--------+-------+
|         timestamp          | median | count |
+----------------------------+--------+-------+
| 2020-10-15 16:45:00.657551 | 17.0   | 993   |
+----------------------------+--------+-------+

When I change the interval so that the number of rows within a frame would exceed 1000 (my guess), I get the following error:

Error: INTERNAL Error: Cannot compute window aggregation: bounds are too large

This error occurs for the median and quantile_cont aggregates, but e.g. avg works.

Issue
1000 seems like an arbitrary and low limit. Can it be raised?
And if I am doing something wrong, how can I calculate the running median of Laufzeit (= duration)?

@the42 the42 changed the title Limits of aggreagte functions when used with windows Limits of aggregate functions when used with windows Nov 17, 2021
@hannes (Member) commented Nov 17, 2021

maybe @hawkfish can have a look?

@hawkfish (Contributor) commented:

There is a limit of 1024 (the standard vector size) right now for windowed aggregation that does not use segment trees. We can remove this (and I've been meaning to!), but it would require some work to avoid excess copying.
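
For reference, a minimal sketch of the behavior (made-up data, not the reporter's): aggregates with segment-tree support, such as avg, handle wide frames, while median hits the limit once a frame spans more than 1024 rows.

    -- Hypothetical repro: 5000 rows of small integers.
    create table t as select range as i, range % 100 as v from range(5000);

    -- avg uses segment trees, so a frame of 2001 rows works.
    select avg(v) over (order by i rows between 1000 preceding and 1000 following)
    from t;

    -- median has no segment-tree path; the same frame exceeds the
    -- 1024-row limit and raised the "bounds are too large" internal
    -- error in v0.3.1.
    select median(v) over (order by i rows between 1000 preceding and 1000 following)
    from t;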

@hawkfish (Contributor) commented:

@the42 I think I have a way to do this, but it would help if I could have a rough idea of the data. Nothing important, just the statistics of the timestamp and laufzeit columns: min, max, maybe count distinct?
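
Something like this would do (a sketch; table and column names taken from the report above):

    -- Rough statistics of the two relevant columns.
    select min(timestamp), max(timestamp), count(distinct timestamp),
           min(laufzeit),  max(laufzeit),  count(distinct laufzeit)
    from flog;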

@hawkfish (Contributor) commented:

I've made a change that removes this limitation for non-segment-tree aggregation, and it seems to improve performance in the existing benchmarks for fixed frames. Variable frames seem about the same.

@the42 (Author) commented Nov 19, 2021

@hawkfish that sounds great. Will this change be generally available in the next public release? As your performance test doesn't show any adverse side effects, I guess you no longer need the stats about the data I want to analyze.

@hawkfish (Contributor) commented:

I expect it will be in the next release. I have generated some data that I think looks like yours, but I don't think it matters much. It looks like you have about 15 minutes worth of timestamp data and want the moving median of a column with small integer values?
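
A hypothetical generator along those lines (the constants are guesses chosen to match that shape, not the actual data):

    -- ~26,000 rows spread over ~15 minutes (26000 * 35 ms ≈ 910 s),
    -- with a small-integer measurement column.
    create table flog as
    select timestamp '2020-10-15 16:45:00' + to_milliseconds(i * 35) as "timestamp",
           (random() * 100)::integer as laufzeit
    from range(26000) t(i);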

@hawkfish (Contributor) commented:

    select timestamp
    , median(laufzeit) over w
    , count(*) over w
    from flog
    window w as (order by timestamp asc range between interval 55 seconds preceding and interval 58 seconds following)
    order by 3 desc
    limit 1;
    ----
    2020-10-15 16:53:22	51.000000	3403

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 19, 2021
Make the bias argument explicit so the caller can optimise data usage.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 19, 2021
Switch to using and appending chunks.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 19, 2021
Add wide frame test and fix reference/copy transition.
@the42 (Author) commented Nov 19, 2021

Yeah, I abbreviated my report somewhat; originally I ran min(laufzeit), quantile_cont(laufzeit, [0.25, 0.5, 0.75]), max(laufzeit) to mimic a running boxplot. The window should dampen isolated server spikes, which show up as longer runtimes (laufzeit = the duration to fulfil a request), while prolonged excessive server load still needs inspection.
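
Roughly, the full query looked like this (reconstructed from that description; the window frame is the one from the original report):

    -- Running boxplot over the window: min, quartiles, max of laufzeit.
    select timestamp
    , min(laufzeit) over w
    , quantile_cont(laufzeit, [0.25, 0.5, 0.75]) over w
    , max(laufzeit) over w
    from flog
    window w as (order by timestamp asc range between interval 55 seconds preceding and interval 58 seconds following);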

I have yet to figure out which plotting library will give me a nice visualization: Vega, Shiny (R), ...

BTW, the performance is fantastic! And my dataset actually spans a longer period of time, with bigger gaps in the data, so the range between interval frame comes in handy.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Nov 24, 2021
Move coverage test to coverage suite.
Mytherin added a commit that referenced this issue Nov 24, 2021
@Mytherin (Collaborator) commented:

This should be fixed now, correct? Can this be closed?

@hawkfish (Contributor) commented:

Yes.
