-
Notifications
You must be signed in to change notification settings - Fork 4.1k
Open
Labels
Component: C++Status: stale-warningIssues and PRs flagged as stale which are due to be closed if no indication otherwiseIssues and PRs flagged as stale which are due to be closed if no indication otherwiseType: enhancement
Description
Right now some early runs with Arrowbench and the OT PR (#12100) shows that we spend a fair amount of time in TPC-H queries on filter nodes. There are a few improvements we know could be made to our filtering approach at the moment. I'm creating this parent issue to help categorize and track those:
We can use a selection vector in our filters to reduce the amount of materialization needed. While long term we may want to support a selection vector throughout the exec plan a good start would be to use it when we encounter a chain of filters to avoid excess materialization (e.g. x < 10 && x > 5 && y < 20)- If a filter if very selective then we may end up outputting a lot of very small batches. We could probably hold onto the data at the filter node until we've accumulated enough rows for a decent sized batch.
- The filter node is currently creating new thread tasks instead of appending its work onto an existing thread task.
- If we have a chain of filters we could potentially use runtime selectivity statistics / estimates to reorder our filters so that the most selective filters are evaluated first.
Reporter: Weston Pace / @westonpace
Subtasks:
- [C++] Investigate batching filter node output
- [C++] Investigate reporting filter selectivity for filter order optimization
Related issues:
Note: This issue was originally created as ARROW-15519. Please see the migration documentation for further details.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
Component: C++Status: stale-warningIssues and PRs flagged as stale which are due to be closed if no indication otherwiseIssues and PRs flagged as stale which are due to be closed if no indication otherwiseType: enhancement