
Fix dask flaky tests #2471

Merged — 10 commits merged into main from bc_2341_flaky on Jul 7, 2021

Conversation

@bchen1116 (Contributor) commented Jul 6, 2021:

Fix #2341

Previously, running this test 10 times resulted in more than one failure:
[screenshot of test results]

This seemed to be an issue with how the search is parallelized. By removing the threads_per_worker parameter, the thread count defaults to the machine's CPU count (16 locally), which is more than enough to handle AutoMLSearch.

50 runs of this test: [screenshot]

100 runs of this test: [screenshot]

Interestingly, the test fails with 8 threads but passes with 12.
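
For reference, the change amounts to dropping the explicit thread cap from the cluster fixture. A sketch is below; the fixture name and teardown are illustrative assumptions, not the exact conftest code:

    # Sketch of the cluster fixture change; the real fixture lives in the dask test
    # conftest, and its name/teardown here are assumptions for illustration.
    import pytest
    from distributed import LocalCluster

    @pytest.fixture
    def dask_cluster():
        # Before: threads_per_worker=2 limited the worker to two concurrent pipeline evaluations.
        # cluster = LocalCluster(n_workers=1, threads_per_worker=2, dashboard_address=None)
        # After: omit threads_per_worker so dask defaults it to the machine's CPU count.
        cluster = LocalCluster(n_workers=1, dashboard_address=None)
        yield cluster
        cluster.close()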

@bchen1116 self-assigned this Jul 6, 2021
codecov bot commented Jul 6, 2021:

Codecov Report

Merging #2471 (d2216cf) into main (7145f2f) will decrease coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #2471     +/-   ##
=======================================
- Coverage   99.7%   99.7%   -0.0%     
=======================================
  Files        283     283             
  Lines      25575   25574      -1     
=======================================
- Hits       25473   25472      -1     
  Misses       102     102             
Impacted Files Coverage Δ
.../tests/automl_tests/dask_tests/test_automl_dask.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7145f2f...d2216cf.

@jeremyliweishih (Contributor) left a comment:

Looks interesting! Any idea why 8 fails but 12 doesn't?

    - dask_cluster = LocalCluster(
    -     n_workers=1, threads_per_worker=2, dashboard_address=None
    - )
    + dask_cluster = LocalCluster(n_workers=1, dashboard_address=None)
Contributor:
lol but why

    - dask_cluster = LocalCluster(
    -     n_workers=1, threads_per_worker=2, dashboard_address=None
    - )
    + dask_cluster = LocalCluster(n_workers=1, dashboard_address=None)
Contributor:

Great find! I guess the thinking is that the test relies on TestPipelineFast finishing before TestPipelineWithFitError but since we set threads_per_worker to 2 and there are three pipelines, we can't guarantee that all three pipelines start evaluating at the same time.
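
(To illustrate the concurrency cap outside of evalml: with one worker and two threads, a third task only starts once a slot frees up. A minimal sketch; the work function and timings are made up for illustration.)

    # Minimal sketch: one worker, two threads => at most two tasks run concurrently.
    import time
    from distributed import Client, LocalCluster

    def work(seconds):
        time.sleep(seconds)
        return seconds

    if __name__ == "__main__":
        with LocalCluster(n_workers=1, threads_per_worker=2, dashboard_address=None) as cluster:
            with Client(cluster) as client:
                # Distinct arguments so dask does not deduplicate the tasks by key.
                futures = [client.submit(work, s) for s in (1, 4, 5)]
                # With only two threads, the third task typically waits for the
                # 1-second task to finish before it can start.
                print(client.gather(futures))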

To be extra safe, can you verify what the value of CPU_COUNT (referenced in the dask source code) is on our GH workers?

One way to do that may be to add a test that fails like this:

from dask.system import CPU_COUNT
assert CPU_COUNT == None

The benefit of doing this is verifying that threads_per_worker gets set to 16 as opposed to something else. Like you said, even going with 8 can cause failures.
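
(A non-failing variant is also possible if pytest is run so that captured output is reported; a sketch, not part of this PR:)

    # Sketch: report the CI worker's CPU count without turning the suite red.
    # Output is visible with `pytest -s` (or `-rP` to show output of passing tests).
    from dask.system import CPU_COUNT

    def test_report_cpu_count():
        print(f"dask.system.CPU_COUNT = {CPU_COUNT}")
        assert CPU_COUNT >= 1  # always true; the printed value is the point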

Contributor:

FWIW, I think I'm good with merging this in and seeing how it goes. I just want to know what CPU_COUNT is on GH so we know what to expect with regard to future flakes. If this fails, we can try to rewrite the test to not rely on a specific pipeline evaluation order?

@bchen1116 (author) commented Jul 7, 2021:

It's 16!

That was on local. I can see what it is on GH.

Contributor:

Word, let's wait to merge until we can verify. If it's <= 8 then maybe we have to go back to the drawing board? I'd be good with merging even if it's <= 8 just to see what happens lol.

@bchen1116 (author):

It's 2 lol
[screenshot]

Contributor:

😨 😂 🙈

Contributor:

In that case, how about we change the test? The test is supposed to verify that all computations stop after we hit an exception with the raise_error_callback.

I think this line that we already have in the test (assert len(automl.full_rankings) < len(pipelines)) is sufficient. Whether or not TestPipelineFast actually finishes is not really relevant. So we can delete:

        assert TestPipelineFast.custom_name in set(
            automl.full_rankings["pipeline_name"]
        )
        assert TestPipelineSlow.custom_name not in set(
            automl.full_rankings["pipeline_name"]
        )
        assert TestPipelineWithFitError.custom_name not in set(
            automl.full_rankings["pipeline_name"]
        )
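
That would leave essentially just the existing check (a sketch, using the names already in the test):

    # The surviving assertion: the error callback cancelled remaining work,
    # so fewer pipelines appear in the rankings than were submitted.
    assert len(automl.full_rankings) < len(pipelines)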

What do you think? FYI @chukarsten

@bchen1116 (author):

That sounds good to me. I agree with not requiring that the fast pipeline be in the results.

@bchen1116 (author) commented:

@jeremyliweishih @chukarsten Unfortunately, I'm not sure why 8 fails. I initially thought 4 threads would be enough (since there are 3 pipelines), but it seems dask's distributed may not work that way, or there's something else going on. I got some performance results here:
[screenshot of performance results]

But I'm not sure how to evaluate this further. I'd appreciate any advice or tips if y'all have any!

@freddyaboulton (Contributor) left a comment:

Thanks for the good work digging into this @bchen1116 !

@bchen1116 merged commit cda2c2e into main on Jul 7, 2021
@chukarsten mentioned this pull request on Jul 22, 2021
@freddyaboulton deleted the bc_2341_flaky branch on May 13, 2022