Unit Test Timeouts (Dask Instability) #2354
Comments
Just adding some data from this 3.8 core deps run's series of checks. I'm attaching the logs from that run. One thing I've noticed is that they're all pausing around the 91-93% completed mark. I doubt there's much value in figuring out which tests those are, but that might be a route to pursue.
Here's another for 3.9 non-core deps.
Thanks for filing, @chukarsten! Thankfully we can rule out conda as a cause, since this happens for our normal unit test builds and not just for build_conda_pkg. Is there any other info we should collect which could help us figure this out? A few ideas below.
(@freddyaboulton I added you on here since this connects to #2298 and #1815)
Changing the Makefile to do verbose logging with pytest, we get the following log:
After adding the timeout, I've seen the same timeout on
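For reference, here's a minimal sketch of what a per-test timeout could look like, assuming the pytest-timeout plugin is installed (the test name below is hypothetical, not one of ours):

```python
import pytest

# Hypothetical example: cap a dask-heavy test at 300 seconds so a hang fails
# fast instead of running until the GH Actions 6-hour job limit.
@pytest.mark.timeout(300)
def test_automl_with_dask_engine():
    ...
```

The same plugin also supports a suite-wide cap via `--timeout` on the pytest command line.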
I think @freddyaboulton is certainly onto something here, and we're pointing firmly at Dask. I made this PR to separate out the dask unit tests; I think we have the option of not preventing merge when they fail. That PR failed on test_automl_immediate_quit, which is still in the set of dask tests.

Looking into the root cause of the dask unit test failures is puzzling. The logs generate a lot of this:
Why does this happen? It seems that whatever holds the data being acted upon is losing its reference to that data. Additionally, the 'workers: []' suggests that perhaps the nanny process is killing the workers. I suspect there's something going on with how the data is scattered, but I'm also suspicious of what's happening under the covers with these four jobs running together in pseudo parallel/series. This dask distributed issue suggests disabling adaptive scaling for the cluster; unfortunately we don't use adaptive clusters, just regular local, static clusters, so that's not the issue. This issue points at scattering of the data as a potential cause, where workers are being abandoned, but we're not getting the same connection errors.
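For anyone following along, here's a rough sketch (not the actual evalml test code) of the kind of setup being discussed: a static LocalCluster with no adaptive scaling, and data scattered to the workers before the computation runs.

```python
import numpy as np
from dask.distributed import Client, LocalCluster

# Static local cluster: fixed worker count, worker processes rather than
# threads, no adaptive scaling.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=True)
client = Client(cluster)

X = np.arange(1000).reshape(-1, 10)

# scatter() copies the data into worker memory and hands back a future. If
# every reference to that future is dropped, the workers are free to release
# the data, which is one way a task could later find its inputs missing.
X_future = client.scatter(X)
result = client.submit(lambda data: data.sum(), X_future).result()

client.close()
cluster.close()
```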
I am now seeing the following stack trace in build_conda_pkg:
This seems to be a known issue in dask: dask/distributed#4612
Deleted my old post, but here's a red one: https://github.com/alteryx/evalml/actions/runs/939673304. It seems to be the same stack trace @freddyaboulton posted above.
I believe this issue no longer blocks, per:

- this PR to separate out the dask jobs (#2376),
- this PR to refactor the dask jobs to cut down on flakes,
- this PR to make the separate dask jobs non-blocking for merge to main, and
- this PR to add a timeout so pathological dask tests fail fast instead of running 6 hours until GH Actions cancels them.

Going to move this to closed, because the dask-related timeouts are no longer an issue and shouldn't be for the foreseeable future. However, the underlying cause is still unknown.
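For context, a hypothetical sketch of the marker-based split described above (not the actual contents of #2376): register a `dask` marker so the dask-backed tests can be selected into their own non-blocking CI job with `pytest -m dask`, while the main suite runs `pytest -m "not dask"`.

```python
# conftest.py (sketch)
def pytest_configure(config):
    # Register the custom marker so pytest doesn't warn about unknown marks.
    config.addinivalue_line(
        "markers", "dask: tests that start a dask cluster and may be flaky"
    )
```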
We're currently seeing unit tests go to the GH Actions limit of 6hrs. This is not good for obvious reasons.
- 3.8 core deps: 6 hr timeout (in progress)
- build_conda_pkg, 3.8 core deps, 3.7 non-core deps: 6 hr timeout (in progress)
- 3.7 non-core deps: 6 hr timeout
- 3.8 non-core deps: 6 hr timeout
- 3.7 non-core deps: 1.5 hrs
- build_conda_pkg
- 3.7 non-core deps
- 3.8