
test_spilling flaky #280

Open
ian-r-rose opened this issue Aug 25, 2022 · 9 comments

Comments

@ian-r-rose
Contributor

tests/stability/test_spill.py::test_spilling, introduced in #229, has been quite flaky, failing in ~25% of its test runs (example). Since it runs across a wide test matrix, most workflows end up failing on test_spill.

For the most part, the failures happen in the client setup fixture during the wait_for_workers() call; the high memory usage from that test seems to make cluster restarts pretty unreliable.

cc @hendrikmakait
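
For context, the pattern in question looks roughly like the sketch below. This is an illustrative stand-in, not the actual fixture code in this repo; the fixture name, worker count, and the `small_cluster` dependency are assumptions.

```python
import pytest
from distributed import Client


@pytest.fixture
def client(small_cluster):
    # `small_cluster` stands in for a (possibly reused) coiled.Cluster fixture.
    with Client(small_cluster) as client:
        # Reset the cluster to a clean state between tests. When workers are
        # still busy spilling or cleaning up from the previous test, this is
        # where the flaky timeouts show up.
        small_cluster.scale(10)
        client.wait_for_workers(10)
        client.restart()
        yield client
```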

@hendrikmakait
Member

From what I have seen so far, those flakes seem to occur in #235, not on main. Is that correct?

@ian-r-rose
Contributor Author

Hmm, that's a good point. I'll dig a bit deeper.

@hendrikmakait
Member

hendrikmakait commented Aug 26, 2022

My guess is that the reused cluster fixture is either busy cleaning up after running the first spilling test or getting somehow corrupted.

@ian-r-rose
Contributor Author

> My guess is that the cluster is either busy cleaning up after running the first spilling test or getting somehow corrupted.

I agree. I wonder if dask/distributed#6944 would help here as well.

@ian-r-rose
Contributor Author

This test is sensitive to disk space, no? I wonder if package_sync uses more disk and changes the performance characteristics here.

@hendrikmakait
Member

It is; the disk size has been set through trial and error. I guess it won't hurt to just bump it by another couple of GiB.
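
If we go that route, the change is presumably a one-liner along these lines. A sketch only, assuming the benchmark clusters are created via coiled.Cluster and that `worker_disk_size` (in GiB) and `worker_vm_types` are the relevant knobs; the actual fixture may name these differently.

```python
import coiled

cluster = coiled.Cluster(
    name="test_spill",
    n_workers=5,                   # illustrative worker count
    worker_vm_types=["t3.large"],  # 2 vCPU / 8 GiB RAM, vs. 4 GiB on a t3.medium
    worker_disk_size=64,           # GiB; bump by a couple of GiB over the current value
)
```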

@ian-r-rose
Contributor Author

I believe the platform team recommends using at least a t3.large for package_sync. I'd imagine that if we did this, we'd want to change the size of the allocated data, correct?

@hendrikmakait
Member

Yes and no: increasing the data size would preserve the original "goal" of writing 10x the memory size to disk, but I'm wondering whether we would lose much if we kept the same data size, or only increased it to account for the additional 4 GiB of memory per worker. After all, writing to disk is so slow.
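
To make the trade-off concrete, here is a back-of-the-envelope sketch (the worker count and memory sizes are illustrative, not the test's actual parameters):

```python
n_workers = 5
old_memory = 4 * n_workers   # GiB of RAM with t3.medium workers
new_memory = 8 * n_workers   # GiB of RAM with t3.large workers

# Option 1: keep the original 10x-memory goal, so the data grows with the RAM.
data_10x = 10 * new_memory                          # 400 GiB

# Option 2: keep the current data size; the spill ratio drops from 10x to 5x.
data_same = 10 * old_memory                         # 200 GiB

# Option 3: only add the extra 4 GiB of memory per worker to the data size.
data_delta = data_same + (new_memory - old_memory)  # 220 GiB
```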

@ian-r-rose
Contributor Author

I'm trying a t3.large over here. If that improves things, would you have any objections, or would that harm the integrity of the test? :)
