
Kubernetes Operator is blindly killing workers #807

Closed
BitTheByte opened this issue Aug 27, 2023 · 7 comments

Comments

@BitTheByte

Currently, Dask's operator (2023.8.1) uses Kubernetes replicas to scale the workers up and down, as seen in:

```python
wg = await DaskWorkerGroup(
    f"{self.name}-{worker_group}", namespace=self.namespace
)
await wg.scale(n)
```

This results in cases such as #659. Since Kubernetes doesn't know the state of, or the data stored on, each worker, it kills workers arbitrarily when scaling as requested by the operator, causing instability issues or partial data loss if a data-moving operation is interrupted.

Scaling up doesn't cause much trouble, since it only adds new workers; the problems occur during scale-down.
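For contrast, a graceful scale-down goes through the scheduler's `retire_workers` machinery, which migrates a worker's in-memory data to surviving workers before shutting it down. This isn't the operator's code path verbatim, but `retire_workers` is the `distributed` API involved; a minimal local sketch:

```python
from distributed import Client, LocalCluster

# Start a small local cluster with two in-process workers.
cluster = LocalCluster(n_workers=2, processes=False)
client = Client(cluster)

# Gracefully retire one worker: the scheduler first migrates that
# worker's in-memory data to the remaining worker, then shuts it down.
# A bare pod deletion, by contrast, just kills the process.
victim = sorted(client.scheduler_info()["workers"])[0]
client.retire_workers(workers=[victim])

remaining = len(client.scheduler_info()["workers"])

client.close()
cluster.close()
```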

@BitTheByte closed this as not planned on Aug 27, 2023
@BitTheByte
Author

I feel I should add more context, so here it is. After some digging around I found that I was wrong: Dask does try to shut down workers gracefully, as it should, using retire_workers, but that function has multiple layers of fallback. My cluster was suffering from unexpectedly killed workers, and to narrow down which fallback case was being hit I had to enable distributed.http.scheduler.api so that the default case would match; after that the scaling process worked effortlessly. My feeling is that I had problems with operator-to-scheduler communication, so the operator always fell back to the last retirement strategy, which blindly kills the last n workers.
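For reference, `distributed.http.scheduler.api` is a route module rather than a standalone on/off switch: it takes effect when added to the scheduler's HTTP routes list in the distributed configuration. A sketch of a `distributed.yaml` fragment (the surrounding routes mirror the usual defaults and may differ between versions):

```yaml
distributed:
  scheduler:
    http:
      routes:
        - distributed.http.scheduler.prometheus
        - distributed.http.scheduler.info
        - distributed.http.scheduler.json
        - distributed.http.health
        - distributed.http.proxy
        - distributed.http.statics
        - distributed.http.scheduler.api  # enables the scheduler HTTP API
```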

@jacobtomlinson
Member

Thanks for sharing your findings here. I'm glad enabling the HTTP API resolved this for you. Our goal is to enable the API by default in future versions of distributed but there are ongoing discussions on how we should authenticate it.

The last fallback strategy is a worst-case scenario, but I wonder if we can do more to highlight to the user that it is happening.

@BitTheByte
Author

BitTheByte commented Sep 4, 2023

Thanks @jacobtomlinson for taking the time to look into closed issues. A warning suggesting a problem with dask RPC would have helped a lot in this case.

For the authentication, I implemented my own authentication gateway, so that's not a concern.

@jacobtomlinson
Member

> A warning suggesting a problem with dask RPC would have helped a lot in this case.

Where would it be useful to surface this warning? The problem happens within the controller, so would a warning log line be enough? Or do you mean passing a warning back to the KubeCluster end to surface in the user code?

> For the authentication I implemented my own authentication gateway so that's not a concern

Yeah, I would expect this to be the case for many users. However, some folks expose their dashboard to the internet, so we treat it as a read-only resource. Enabling the API turns it into a read/write resource, and we should probably implement some kind of default authentication.

@BitTheByte
Author

> Where would it be useful to surface this warning? The problem happens within the controller, so would a warning log line be enough? Or do you mean passing a warning back to the KubeCluster end to surface in the user code?

I believe a warning within the controller would be enough.

@tasansal
Contributor

tasansal commented Sep 19, 2023

Sorry for writing on a closed issue, but what is the current way to remedy this? It makes auto-scaling very problematic: we end up losing a lot of active workers, which slows down the whole system, and the worker count often drops below our "min" workers.

P.S. Is activating the HTTP API the only way to go?

I couldn't find much information about how to turn it on in the distributed documentation. Is it just a matter of setting distributed.http.scheduler.api to True in the configuration?
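For what it's worth, `distributed.http.scheduler.api` is the name of a route module rather than a boolean flag: it only takes effect when added to the `distributed.scheduler.http.routes` list. In a Kubernetes pod spec this can also be expressed through Dask's environment-variable configuration mechanism; a sketch (the list-valued string assumes Dask's usual env-var parsing, and the field names around it are the standard pod-spec `env` entries):

```yaml
# Fragment of a scheduler container spec: enable the scheduler HTTP API
# by overriding the routes list via Dask's env-var config mechanism.
env:
  - name: DASK_DISTRIBUTED__SCHEDULER__HTTP__ROUTES
    value: '["distributed.http.scheduler.api", "distributed.http.health", "distributed.http.scheduler.prometheus", "distributed.http.statics"]'
```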

@jacobtomlinson
Member

@tasansal it would be interesting if you could check your logs and see why it is falling back to LIFO scaling. It should fall back to the RPC if you don't have the HTTP API enabled.
