-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Add field to DynamicPhysicalExpr to indicate when the filter is complete or updated #18799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add field to DynamicPhysicalExpr to indicate when the filter is complete or updated #18799
Conversation
5528b6e to
dffd77b
Compare
|
Will take a look at this one soon |
|
This makes sense and is a relatively simple change, but could you share an example use case? Would the scan node care if the filter is complete/in progress? |
28c5406 to
278c3bb
Compare
adriangb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! I do think it would be good to have some tests:
- Unit tests for
DynamicFilterPhysicalExpr - Unit tests for
TopK/SortExecandHashJoinExecto ensure they mark the filter as completed.
3c40331 to
c6a91b6
Compare
c10ef72 to
3115cfb
Compare
gabotechs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
|
Will give it until tomorrow in case anyone else want to chime in, and otherwise merge it then. |
|
Thanks for the reviews @adriangb and @gabotechs 🙇 |
2010YOUY01
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you!
| let array: ArrayRef = Arc::new(Int32Array::from(vec![Some(3), Some(1), Some(2)])); | ||
| let batch = RecordBatch::try_new(Arc::clone(&schema), vec![array])?; | ||
| topk.insert_batch(batch)?; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also do an assertion for 'in progress' here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the issue is that its hard to set up in the tests a wait_update before the update happens without doing something like tokio::spawn (which I think is not allowed to use in tests) or setting a timeout which would introduce some indeterminism to the test
|
I think this PR might slightly conflict logically with #18451 -- I have to update the branch before merging |
|
Thanks @LiaCastaneda for the PR and @adriangb and @2010YOUY01 for the reviews! |
…ete or updated (apache#18799) <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes #. Dynamic filter pushdown in DataFusion currently lacks an API to determine when filters are "complete" (all contributing partitions have reported), this creates an ambiguity issue where it's impossible to differentiate between: 1. **Complete filter with no data**: Build side produced 0 rows, filter remains as placeholder `lit(true)`, no more updates coming 2. **Incomplete filter**: Filter is still being computed, updates are pending I think this could be especially useful when we want to make the filter updates progressively in the future. - Calls `mark_complete()` after barrier completes, regardless of whether bounds exist. - Exposes` is_complete() f`unction on the `DynamicFilterPhysicalExpr`. I didn't add any tests because the change is minimal , and comprehensive testing would require making the `DynamicFilterPhysicalExpr` public or running through the full optimizer pipeline. Exposing is_complete() function. (cherry picked from commit 7fa2a69)
…ete or updated (apache#18799) <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes #. Dynamic filter pushdown in DataFusion currently lacks an API to determine when filters are "complete" (all contributing partitions have reported), this creates an ambiguity issue where it's impossible to differentiate between: 1. **Complete filter with no data**: Build side produced 0 rows, filter remains as placeholder `lit(true)`, no more updates coming 2. **Incomplete filter**: Filter is still being computed, updates are pending I think this could be especially useful when we want to make the filter updates progressively in the future. - Calls `mark_complete()` after barrier completes, regardless of whether bounds exist. - Exposes` is_complete() f`unction on the `DynamicFilterPhysicalExpr`. I didn't add any tests because the change is minimal , and comprehensive testing would require making the `DynamicFilterPhysicalExpr` public or running through the full optimizer pipeline. Exposing is_complete() function. (cherry picked from commit 7fa2a69)
…ete or updated (apache#18799) (#60) <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes #. Dynamic filter pushdown in DataFusion currently lacks an API to determine when filters are "complete" (all contributing partitions have reported), this creates an ambiguity issue where it's impossible to differentiate between: 1. **Complete filter with no data**: Build side produced 0 rows, filter remains as placeholder `lit(true)`, no more updates coming 2. **Incomplete filter**: Filter is still being computed, updates are pending I think this could be especially useful when we want to make the filter updates progressively in the future. - Calls `mark_complete()` after barrier completes, regardless of whether bounds exist. - Exposes` is_complete() f`unction on the `DynamicFilterPhysicalExpr`. I didn't add any tests because the change is minimal , and comprehensive testing would require making the `DynamicFilterPhysicalExpr` public or running through the full optimizer pipeline. Exposing is_complete() function. (cherry picked from commit 7fa2a69)
Which issue does this PR close?
Rationale for this change
Dynamic filter pushdown in DataFusion currently lacks an API to determine when filters are "complete" (all contributing partitions have reported), this creates an ambiguity issue where it's impossible to differentiate between:
lit(true), no more updates comingI think this could be especially useful when we want to make the filter updates progressively in the future.
What changes are included in this PR?
mark_complete()after barrier completes, regardless of whether bounds exist.is_complete() function on theDynamicFilterPhysicalExpr.Are these changes tested?
I didn't add any tests because the change is minimal , and comprehensive testing would require making the
DynamicFilterPhysicalExprpublic or running through the full optimizer pipeline.Are there any user-facing changes?
Exposing is_complete() function.