Optimize "per partition" top-k: ROW_NUMBER < 5 / TopK #6899
Comments
I think we need an optimization step that transforms the plan you gave into one that uses a fetching sort and does away with the filter. It seems to me the window operator would still be used as is.
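To make the proposed rewrite concrete, here is a minimal sketch of such an optimization step over a toy plan tree. The `Plan` enum, its variants, and `push_limit_into_sort` are hypothetical names invented for illustration; they are not DataFusion's plan types or optimizer API. The sketch assumes the filter has already been normalized to an inclusive bound (so `ROW_NUMBER < 5` would become `max_row_number: 4`).

```rust
/// Toy plan tree (hypothetical, NOT DataFusion's real plan types).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan,
    /// Sort by (partition key, order key); `fetch_per_partition` is the
    /// assumed "fetching sort" limit applied within each partition.
    Sort { fetch_per_partition: Option<usize>, input: Box<Plan> },
    /// Window operator computing ROW_NUMBER(); left untouched by the rewrite.
    Window { input: Box<Plan> },
    /// Filter keeping rows with ROW_NUMBER() <= max_row_number.
    RowNumberFilter { max_row_number: usize, input: Box<Plan> },
}

/// Rewrite Filter -> Window -> Sort into Window -> Sort(fetch), dropping the filter.
/// Any other plan shape is returned unchanged.
fn push_limit_into_sort(plan: Plan) -> Plan {
    match plan {
        Plan::RowNumberFilter { max_row_number, input } => match *input {
            Plan::Window { input: window_input } => match *window_input {
                Plan::Sort { fetch_per_partition: None, input: sort_input } => Plan::Window {
                    input: Box::new(Plan::Sort {
                        fetch_per_partition: Some(max_row_number),
                        input: sort_input,
                    }),
                },
                // Not a plain sort below the window: rebuild the original plan.
                other => Plan::RowNumberFilter {
                    max_row_number,
                    input: Box::new(Plan::Window { input: Box::new(other) }),
                },
            },
            // Filter is not directly above a window: rebuild the original plan.
            other => Plan::RowNumberFilter { max_row_number, input: Box::new(other) },
        },
        other => other,
    }
}
```

A real rule would also have to check that the sort keys match the window's partition and order expressions, which this sketch leaves out.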
I agree the window operator should probably remain as is. Maybe we could use a specialized sort operator like
Where the 🤔
Spark takes a similar approach: it sorts and limits the data per partition, then sends the output to a single partition where the final sort/limit is performed. Spark encapsulates this logic in a separate operator; see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L311. It also contains optimizations: for example, if the tuples are already ordered it skips the redundant sort, and the same applies to the projection.
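The two-phase scheme described above can be sketched roughly as follows. This is only an illustration using plain `Vec<i64>` data, not Spark's or DataFusion's operator code; `local_top_k` and `global_top_k` are hypothetical names. Note that "partition" here means an input data partition, not a window `PARTITION BY` group.

```rust
/// Phase 1: sort one input partition and keep only its first `k` rows.
fn local_top_k(mut rows: Vec<i64>, k: usize) -> Vec<i64> {
    rows.sort_unstable();
    rows.truncate(k);
    rows
}

/// Phase 2: gather the truncated partitions into a single partition
/// and apply the final sort/limit there.
fn global_top_k(partitions: Vec<Vec<i64>>, k: usize) -> Vec<i64> {
    let merged: Vec<i64> = partitions
        .into_iter()
        .flat_map(|p| local_top_k(p, k))
        .collect();
    // At most `k * number_of_partitions` rows reach this final sort.
    local_top_k(merged, k)
}
```

For example, `global_top_k(vec![vec![9, 1, 7], vec![4, 8, 2]], 2)` only moves four rows to the final step instead of six.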
I'd like to help with this. (It doesn't look too difficult, I think.)
A question is whether the
This is one where I would recommend you try hacking up a prototype that works well enough to show some performance results, and then get feedback on it before spending too much time polishing. I think this one could easily turn into a large project.
@alamb Thanks for your reply! I'd like to work on it once I've gained enough knowledge. 😎
Is your feature request related to a problem or challenge?
DataFusion optimizes queries like
... ORDER BY value LIMIT 10
by only keeping the top 10 ("limit") rows when sorting, which is great!
Another common pattern (that we also have in IOx, see https://github.com/influxdata/influxdb_iox/pull/8187/files#r1257834347) is queries like the following, which select the top N values "per partition"
Currently the plan will be something like:
The problem with this plan is that it will sort (and copy) the ENTIRE input even when the query only needs the first 10 rows of each partition.
Describe the solution you'd like
It would be awesome to optimize this case so that it does not need to sort the entire input and instead keeps only the top N values per partition. I am not sure how easy this would be to do for sorting.
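One way to keep only the top N values per partition without a full sort is a bounded heap per partition key. The following is a minimal sketch under that assumption, using string keys and `i64` values; `top_n_per_partition` is a hypothetical helper, not an existing DataFusion function, and it treats "top" as "smallest" to match an ascending `ORDER BY`.

```rust
use std::collections::{BinaryHeap, HashMap};

/// Keep the `n` smallest values for each partition key in a single pass.
/// Each key owns a max-heap capped at `n` entries, so memory is
/// O(number_of_keys * n) rather than O(input size).
fn top_n_per_partition<'a>(rows: &[(&'a str, i64)], n: usize) -> HashMap<&'a str, Vec<i64>> {
    let mut heaps: HashMap<&'a str, BinaryHeap<i64>> = HashMap::new();
    for (key, value) in rows {
        let heap = heaps.entry(*key).or_default();
        if heap.len() < n {
            heap.push(*value);
        } else if let Some(mut largest) = heap.peek_mut() {
            // Heap is full: replace the current largest if the new value is smaller.
            if *value < *largest {
                *largest = *value;
            }
        }
    }
    // Emit each partition's surviving values in ascending order.
    heaps
        .into_iter()
        .map(|(key, heap)| (key, heap.into_sorted_vec()))
        .collect()
}
```

The trade-off is that state grows with the number of distinct partition keys, so this suits inputs with a moderate number of partitions.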
Describe alternatives you've considered
Maybe we could at least teach the window operator to emit only the top N values per partition when there is a row number predicate, to save some of that work. The sort would still be required, but the window operator would do less work.
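A rough sketch of that alternative, assuming the input is already sorted by partition key and then by the ordering column: the operator numbers rows within each partition and stops emitting once the row number exceeds the limit. `row_number_with_limit` and its tuple-based row layout are hypothetical and only illustrate the idea.

```rust
/// Assign ROW_NUMBER() within each partition of an already-sorted input and
/// emit only rows whose row number is <= `limit`.
/// Input rows are (partition_key, value); output rows add the row number.
fn row_number_with_limit(sorted: &[(u32, i64)], limit: usize) -> Vec<(u32, i64, usize)> {
    let mut out = Vec::new();
    let mut current: Option<u32> = None;
    let mut row_number = 0usize;
    for &(key, value) in sorted {
        // A change in key marks the start of a new partition.
        if current != Some(key) {
            current = Some(key);
            row_number = 0;
        }
        row_number += 1;
        // Rows past the limit are still scanned but never copied to the output.
        if row_number <= limit {
            out.push((key, value, row_number));
        }
    }
    out
}
```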
Additional context
No response