fix: add yield point to RepartitionExec
#5299
This prevents endless spinning and locked up tokio tasks if the inputs never yield `pending`. Fixes apache#5278.
Thanks for the contribution. LGTM. However, this test has many duplicates with `UnboundedExec` in `datafusion/core/src/test/exec.rs`. Maybe we can reduce code size by refactoring `UnboundedExec` to be used here.
Done.
LGTM. Have you resolved the `thread::sleep` issue related to the `CoalescePartitionsExec` usage? The test does not seem to suggest that. Alternatively, do you think we should be cautious when using `thread::sleep`? I would like to hear your opinion on the matter.
/// See <https://github.com/apache/arrow-datafusion/issues/5278>
#[tokio::test]
async fn unbounded_repartition_sa() -> Result<()> {
I think we can insert a timeout here in case of an error. You can check `unbounded_file_with_swapped_join` for an example.
I cannot get that timeout macro to work reliably. It only seems to work w/ multi-thread tokio executors that have more than 1 thread. However, the test itself then becomes a bit of a lucky shot as well, because the main test method will no longer block just because the repartition task spins forever. So I guess we have to live w/o a timeout.
Not sure what "the [...]". The test blocks forever without the yield point, so IMHO this fixes the original issue.
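For readers wondering how a hang could still be bounded in time: one option that does not depend on the async executor is a watchdog on a separate OS thread. The sketch below is not what this PR does and uses only the standard library. `run_with_timeout` is a hypothetical helper name, and the sleeping closure merely stands in for work that takes too long.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: runs `work` on its own OS thread and waits at most
/// `limit` for the result. Returns `None` on timeout or if `work` panics.
/// The worker thread is not cancelled; it is simply left detached.
fn run_with_timeout<T, F>(limit: Duration, work: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // After a timeout the receiver is gone, so a send error is expected
        // and deliberately ignored.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).ok()
}

fn main() {
    // Finishes well within the limit.
    let fast = run_with_timeout(Duration::from_secs(5), || 21 * 2);

    // Stand-in for slow or stuck work: sleeps longer than the limit.
    let slow = run_with_timeout(Duration::from_millis(50), || {
        thread::sleep(Duration::from_millis(500));
        0
    });

    println!("fast = {:?}, slow = {:?}", fast, slow);
}
```

Because the waiting happens on a thread that the stuck work cannot starve, the limit is enforced even if the work never yields. The trade-off is that a truly endless worker keeps running until the process exits.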
I see. Thanks for the contribution!
Thank you @crepererum and @metesynnada
I verified that the test covers this code: I removed the yield, and the new test blocks:
cargo test --test repartition_exec_blocks
...
running 1 test
test unbounded_repartition_sa has been running for over 60 seconds
(killed)
And with the yield it completes quickly:
running 1 test
test unbounded_repartition_sa ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
timer.done();
}

// If the input stream is endless, we may spin forever and never yield back to tokio. Hence let us yield.
👍
LGTM too, thank you!
Thanks everyone!
Benchmark runs are scheduled for baseline = 22b974f and contender = ded897e. ded897e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
* fix: add yield point to `RepartitionExec`
  This prevents endless spinning and locked up tokio tasks if the inputs never yield `pending`. Fixes apache#5278.
* refactor: use a single `UnboundedExec` for testing
* refactor: rename test
Which issue does this PR close?
Fixes #5278.
Rationale for this change
This prevents endless spinning and locked up tokio tasks if the inputs never yield `pending`.
What changes are included in this PR?
One yield point.
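To illustrate why a single yield point matters, here is a minimal, standard-library-only sketch of cooperative scheduling. It is not the DataFusion or tokio code: `YieldNow`, `run_all`, and `items_before_observer_ran` are hypothetical names, and the tiny round-robin executor only stands in for a single-threaded runtime. tokio offers a comparable primitive as `tokio::task::yield_now()`.

```rust
use std::cell::Cell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Future that returns `Pending` exactly once and `Ready` afterwards.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            return Poll::Ready(());
        }
        self.yielded = true;
        // Request another poll right away; we only want to give up our turn.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

fn yield_now() -> YieldNow {
    YieldNow { yielded: false }
}

/// Waker that does nothing; the executor below re-polls pending tasks anyway.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

type Task = Pin<Box<dyn Future<Output = ()>>>;

/// Toy single-threaded executor: polls tasks round-robin until all are done.
fn run_all(tasks: Vec<Task>) {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut queue: VecDeque<Task> = VecDeque::from(tasks);
    while let Some(mut task) = queue.pop_front() {
        if task.as_mut().poll(&mut cx).is_pending() {
            queue.push_back(task);
        }
    }
}

/// Returns how many items the producer had processed when the observer task
/// first got a chance to run.
fn items_before_observer_ran(total: usize, with_yield: bool) -> usize {
    let processed = Rc::new(Cell::new(0usize));
    let seen = Rc::new(Cell::new(0usize));

    // Producer: its input is always ready, so it never returns `Pending`
    // unless it yields explicitly.
    let producer: Task = {
        let processed = Rc::clone(&processed);
        Box::pin(async move {
            for _ in 0..total {
                processed.set(processed.get() + 1);
                if with_yield {
                    yield_now().await;
                }
            }
        })
    };

    // Observer: records the producer's progress as soon as it is polled.
    let observer: Task = {
        let processed = Rc::clone(&processed);
        let seen = Rc::clone(&seen);
        Box::pin(async move {
            seen.set(processed.get());
        })
    };

    run_all(vec![producer, observer]);
    seen.get()
}

fn main() {
    println!("with yield: {}", items_before_observer_ran(1000, true));
    println!("without yield: {}", items_before_observer_ran(1000, false));
}
```

With the yield, the observer runs after the first item. Without it, the observer only runs once the producer has consumed all of its input, which for an endless input means never.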
Are these changes tested?
Regression test taken from #5278 (with some clean-ups to make it a real test).
Are there any user-facing changes?
Probably a more stable DF execution.