Fix dask flaky tests #2471
Conversation
Codecov Report
@@           Coverage Diff            @@
##            main    #2471     +/-   ##
=========================================
- Coverage    99.7%    99.7%    -0.0%
=========================================
  Files         283      283
  Lines       25575    25574       -1
=========================================
- Hits        25473    25472       -1
  Misses        102      102
Continue to review full report at Codecov.
Looks interesting! Any idea why 8 threads fails but 12 doesn't?
dask_cluster = LocalCluster(
    n_workers=1, threads_per_worker=2, dashboard_address=None
)
dask_cluster = LocalCluster(n_workers=1, dashboard_address=None)
lol but why
Great find! I guess the thinking is that the test relies on TestPipelineFast finishing before TestPipelineWithFitError, but since we set threads_per_worker to 2 and there are three pipelines, we can't guarantee that all three pipelines start evaluating at the same time.
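As a standalone illustration (not code from the evalml tests; the task names and durations are invented), a single worker with only two threads has to queue a third submitted task, so the start and completion order of three "pipelines" is not guaranteed:

import time
from dask.distributed import Client, LocalCluster

def fake_pipeline(name, seconds):
    # Stand-in for a pipeline evaluation of varying length.
    time.sleep(seconds)
    return name

if __name__ == "__main__":
    # One worker, two threads: only two of the three tasks can run at once,
    # so the third waits for a free thread and may finish after the others.
    cluster = LocalCluster(n_workers=1, threads_per_worker=2, dashboard_address=None)
    client = Client(cluster)
    futures = [
        client.submit(fake_pipeline, "fast", 0.1),
        client.submit(fake_pipeline, "slow", 2.0),
        client.submit(fake_pipeline, "errors", 0.5),
    ]
    print(client.gather(futures))
    client.close()
    cluster.close()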
To be extra safe, can you verify what the value of CPU_COUNT (referenced in the dask source code) is on our GH workers?
One way to do that may be to add a test that fails like this:
from dask.system import CPU_COUNT
assert CPU_COUNT == None
The benefit of doing this is verifying that threads_per_worker gets set to 16 as opposed to something else. Like you said, even going with 8 can cause failures.
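If it helps, a slightly more verbose version of that throwaway check (the test name is hypothetical; the assertion message puts the value straight into the CI log):

from dask.system import CPU_COUNT

def test_show_runner_cpu_count():
    # Deliberately failing assertion so the GH runner's CPU_COUNT shows up in the log.
    assert False, f"CPU_COUNT on this runner is {CPU_COUNT}"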
FWIW, I think I'm good with merging this in and seeing how it goes. I just want to know what CPU_COUNT is on GH so we know what to expect with regards to future flakes. If this fails, we can try to rewrite the test to not rely on a specific pipeline evaluation order?
It's 16!
That was on local. I can see what it is on GH.
Word, let's wait to merge until we can verify. If it's <= 8 then maybe we have to go back to the drawing board? I'd be good with merging even if it's <= 8 just to see what happens lol.
😨 😂 🙈
In that case, how about we change the test? The test is supposed to verify that all computations stop after we hit an exception with the raise_error_callback. I think this line that we already have in the test (assert len(automl.full_rankings) < len(pipelines)) is sufficient. Whether or not TestPipelineFast actually finishes is not really relevant. So we can delete
assert TestPipelineFast.custom_name in set(
    automl.full_rankings["pipeline_name"]
)
assert TestPipelineSlow.custom_name not in set(
    automl.full_rankings["pipeline_name"]
)
assert TestPipelineWithFitError.custom_name not in set(
    automl.full_rankings["pipeline_name"]
)
What do you think? FYI @chukarsten
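For reference, a rough sketch of the check that would remain after that deletion; automl and pipelines are the names quoted in the thread, and the rest of the test setup is omitted:

# With the per-pipeline membership asserts removed, the test only verifies that
# the search aborted early: fewer pipelines ranked than were submitted, without
# depending on which specific pipelines happened to finish first.
assert len(automl.full_rankings) < len(pipelines)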
I think that sounds good to me. I agree with not requiring that the fast pipeline be in the results.
@jeremyliweishih @chukarsten Unfortunately, I wasn't sure why 8 fails. I initially thought that having 4 threads would be enough (since there were 3 pipelines), but it seems like dask's distributed might not work like that, or there's additional stuff happening. I got some performance results here: But I'm not sure how to further evaluate what is happening. Would appreciate any advice or tips if y'all have any!
Thanks for the good work digging into this, @bchen1116!
Fix #2341
Previously, running this test 10 times resulted in more than one test failure:
[screenshot of failing test runs]
This seemed to be an issue related to the parallelization of the search. By removing the threads_per_worker parameter, we let the number of threads default to 16, which is more than enough to handle AutoMLSearch.
50 runs of this test:
[screenshot of results]
100 runs of this test:
[screenshot of results]
Interestingly, the test fails with 8 threads but passes with 12.
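A quick way to confirm what a LocalCluster picks when threads_per_worker is left unset (dask splits the detected CPU count across the workers, so the value depends on the machine; this snippet is just for inspection and is not part of the test suite):

from dask.distributed import Client, LocalCluster
from dask.system import CPU_COUNT

if __name__ == "__main__":
    # Same construction as the updated fixture: no explicit threads_per_worker.
    cluster = LocalCluster(n_workers=1, dashboard_address=None)
    client = Client(cluster)
    print("CPU_COUNT:", CPU_COUNT)
    # scheduler_info() reports the thread count each worker actually got.
    workers = client.scheduler_info()["workers"]
    print("threads per worker:", {addr: w["nthreads"] for addr, w in workers.items()})
    client.close()
    cluster.close()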