
Kubernetes Operator is blindly killing workers #807

Closed
BitTheByte opened this issue Aug 27, 2023 · 7 comments

Comments

@BitTheByte

Currently, Dask's operator (2023.8.1) uses Kubernetes replicas to scale the workers up and down, as seen in:

```python
wg = await DaskWorkerGroup(
    f"{self.name}-{worker_group}", namespace=self.namespace
)
await wg.scale(n)
```

This results in cases such as #659. Since Kubernetes doesn't know the state of, or the data stored on, each worker, it kills workers arbitrarily when scaling as requested by the operator, causing instability issues or partial data loss if a data-moving operation is interrupted.

Scaling up doesn't cause much trouble, since it only adds new workers; the problems occur during scale-down.
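For contrast, a graceful scale-down goes through the scheduler's `retire_workers` machinery, which migrates a worker's in-memory data to surviving workers before shutting it down. This isn't the operator's code path verbatim, but `retire_workers` is the `distributed` API involved; a minimal local sketch:

```python
from distributed import Client, LocalCluster

# Start a small local cluster with two in-process workers.
cluster = LocalCluster(n_workers=2, processes=False)
client = Client(cluster)

# Gracefully retire one worker: the scheduler first migrates that
# worker's in-memory data to the remaining worker, then shuts it down.
# A bare pod deletion, by contrast, just kills the process.
victim = sorted(client.scheduler_info()["workers"])[0]
client.retire_workers(workers=[victim])

remaining = len(client.scheduler_info()["workers"])

client.close()
cluster.close()
```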

@BitTheByte closed this as not planned on Aug 27, 2023
@BitTheByte
Author

I feel I should add more context, so here it is. After some digging around I found that I was wrong: Dask does try to shut down workers gracefully, as it should, using retire_workers, but that function has multiple layers of fallback. My cluster was suffering from unexpectedly killed workers, and to narrow down which fallback case was being hit I had to enable distributed.http.scheduler.api so that the default case would match; after that the scaling process worked effortlessly. My feeling is that I had problems with operator-to-scheduler communication, so the operator always fell back to the last retirement strategy, which blindly kills the last n workers.
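For reference, `distributed.http.scheduler.api` is a route module rather than a standalone on/off switch: it takes effect when added to the scheduler's HTTP routes list in the distributed configuration. A sketch of a `distributed.yaml` fragment (the surrounding routes mirror the usual defaults and may differ between versions):

```yaml
distributed:
  scheduler:
    http:
      routes:
        - distributed.http.scheduler.prometheus
        - distributed.http.scheduler.info
        - distributed.http.scheduler.json
        - distributed.http.health
        - distributed.http.proxy
        - distributed.http.statics
        - distributed.http.scheduler.api  # enables the scheduler HTTP API
```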

@jacobtomlinson
Member

Thanks for sharing your findings here. I'm glad enabling the HTTP API resolved this for you. Our goal is to enable the API by default in future versions of distributed but there are ongoing discussions on how we should authenticate it.

The last fallback strategy is a worst-case scenario, but I wonder if we can do more to highlight to the user that it is happening.

@BitTheByte
Author

BitTheByte commented Sep 4, 2023

Thanks @jacobtomlinson for taking the time to look into closed issues. A warning suggesting a problem with dask RPC would have helped a lot in this case.

For the authentication, I implemented my own authentication gateway, so that's not a concern.

@jacobtomlinson
Member

> A warning suggesting a problem with dask RPC would have helped a lot in this case.

Where would it be useful to surface this warning? The problem happens within the controller, so would a warning log line be enough? Or do you mean passing a warning back to the KubeCluster end to surface in the user code?

> For the authentication I implemented my own authentication gateway so that's not a concern

Yeah, I would expect this to be the case for many users. However, some folks expose their dashboard to the internet, so we treat it as a read-only resource. Enabling the API turns it into a read/write resource, and we should probably implement some kind of default authentication.

@BitTheByte
Author

> Where would it be useful to surface this warning? The problem happens within the controller, so would a warning log line be enough? Or do you mean passing a warning back to the KubeCluster end to surface in the user code?

I believe a warning within the controller would be enough.

@tasansal
Contributor

tasansal commented Sep 19, 2023

Sorry for writing on a closed issue, but what is the current way to remedy this? It makes auto-scaling very problematic: we end up losing a lot of active workers, which slows down the whole system, and the worker count often drops below our "min" workers.

P.S. Is activating the HTTP API the only way to go?

I couldn't find much information about how to turn it on in the distributed documentation. Is it just a matter of setting distributed.http.scheduler.api to True in the configuration?
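For what it's worth, `distributed.http.scheduler.api` is the name of a route module rather than a boolean flag: it only takes effect when added to the `distributed.scheduler.http.routes` list. In a Kubernetes pod spec this can also be expressed through Dask's environment-variable configuration mechanism; a sketch (the list-valued string assumes Dask's usual env-var parsing, and the field names around it are the standard pod-spec `env` entries):

```yaml
# Fragment of a scheduler container spec: enable the scheduler HTTP API
# by overriding the routes list via Dask's env-var config mechanism.
env:
  - name: DASK_DISTRIBUTED__SCHEDULER__HTTP__ROUTES
    value: '["distributed.http.scheduler.api", "distributed.http.health", "distributed.http.scheduler.prometheus", "distributed.http.statics"]'
```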

@jacobtomlinson
Member

@tasansal it would be interesting if you could check your logs and see why it is falling back to LIFO scaling. It should fall back to the RPC if you don't have the HTTP API enabled.
