
Spawn Dask scheduler on separate Kubernetes pod #84

Closed
chrish42 opened this issue Jul 9, 2018 · 30 comments

@chrish42
Contributor

chrish42 commented Jul 9, 2018

This feature request is a followup to #82. Having at least the option to spawn the Dask scheduler on a remote Kubernetes pod would enable more use cases for Dask-kubernetes, including at least the "run Jupyter on my notebook, but do the Dask computation on a remote cluster" one.

@mrocklin
Member

mrocklin commented Jul 9, 2018 via email

@jacobtomlinson
Member

This is definitely something we would use in anger.

Running Jupyter on a laptop and then connecting to a cloud based cluster for computation would be really useful.

@mrocklin
Member

mrocklin commented Jul 10, 2018 via email

@jacobtomlinson
Member

@mrocklin I will have to check my schedule and see if I can dedicate time to this.

@chrish42
Contributor Author

@mrocklin From my perspective, everyone is 5 minutes (and a credit card) away from having access to a Kubernetes cluster when they outgrow their laptop for computations. Especially if #85 is also implemented. Then they don't need to install anything else but dask-kubernetes; they just instantiate a dask_kubernetes.Cluster object, passing it the cluster credentials given when they created the cluster in the GCP (or other) console. It's getting to a JupyterHub-like setup on Kubernetes, where people can launch notebooks, etc., that requires more work (and/or money) and has a potentially smaller audience.

@jgerardsimcock
Contributor

jgerardsimcock commented Aug 3, 2018

We would really like to explore the idea of having a central scheduler that users could log into via a Jupyter notebook/terminal and run large batch jobs. I'd like to understand how to do something like the financial modeling team in the Dask use cases.

What is the best place to start on getting this moving?

  1. putting the scheduler in another pod (I have no idea how to do this)
  2. ?

@mturok

mturok commented Sep 21, 2018

+1

@arpit1997

Is it still open? Can I try this?

@jgerardsimcock
if we spawn the Dask scheduler in another pod and let the scheduler manage jobs, wouldn't that be enough? Am I missing something here?

@chrish42 Certainly it would have a small audience, but it would really help data scientists to just connect to a cluster and do processing on huge datasets.

@rmccorm4

rmccorm4 commented Feb 7, 2019

@mrocklin @jacobtomlinson Am I wrong in thinking that you could create and start the KubeCluster client/scheduler on a master in the Kubernetes cluster, serialize/send the cluster/client object to your laptop, and issue client.submit(func, x) commands from your laptop? (Which will just work because of the existing dask.distributed functionality, since this client object will have the IP and port of the remote master you started it on?)

Or is this issue saying a whole new RPC setup needs to be developed, where you start some listening socket on a master in the cluster that will take RPCs to initialize dask-kubernetes cluster/client/scheduler, take RPCs to submit tasks, take RPCs to scale, etc.?

Or neither? I'm just trying to better understand the problem statement here.

@mrocklin
Member

mrocklin commented Feb 7, 2019

You could do something like that, yes, but it's a bit reversed of what I suspect will be a common workflow.

The situation we care about is that someone has a laptop and they have a kubernetes cluster. From their laptop they want to ask Kubernetes to create a Dask scheduler and a bunch of Dask workers that talk to the scheduler, and then they create a client locally on their laptop and connect to that scheduler.

Currently, with the way KubeCluster is designed, the scheduler is only created locally, so the user has to be on the Kubernetes cluster. So instead of something like this:

class KubeCluster:
    def start(...):
        self.scheduler = Scheduler(...)  # actually this happens inside of the `self.cluster = LocalCluster(...)` today

We want something like this:

class KubeCluster:
    def start(...):
        self.scheduler_pod = kubernetes.create_pod(scheduler_pod_specification)
        self.scheduler_connection = connect_to(scheduler_pod)

Or at least we want the option to do that. There will be other cases where we'll want the old behavior as well.

@martindurant
Member

Doesn't dask-yarn work with a scheduler-in-a-container?

@mrocklin
Member

mrocklin commented Mar 6, 2019 via email

@dhirschfeld

I think this refactor would be fairly tightly coupled to whatever results from dask/distributed#2235?

@beberg

beberg commented Mar 8, 2019

Now that the Helm chart updates are in progress in #128, we have a better handle on the use cases and variables needed to put the scheduler in the cluster.

  • Add an optional scheduler spec to KubeCluster.
  • Add additional parameters, roughly 1:1 with the Helm chart variables.
  • Usage may be tricky to keep 100% backward compatible, but it can still default to a local scheduler.
  • The kwargs approach doesn't scale well. YAML defaults + YAML overrides may be an easier and clearer UX, matching kubectl, Helm, konfigure, etc. in the K8s ecosystem.
  • Depending on the type of cluster and the NodePort/LoadBalancer/ClusterIP choice, things get messy, so this will need some care, similar to how the chart spits out all the needed addresses.

@mrocklin
Member

mrocklin commented Mar 8, 2019

We currently ask the user for a worker pod specification template. We should probably expand this to also include a scheduler pod specification template. I agree that expanding everything out as keyword arguments would probably be difficult to maintain.

So I think that all we really need here from an API perspective is to add a scheduler_pod_template= keyword, and then we should be good to go (at least until we actually have to implement things).
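As a rough illustration of that proposal, here is a sketch of what a scheduler pod template might look like, built as a plain dict mirroring the existing worker pod template. The function name, image, and structure here are illustrative assumptions, not the final API.

```python
# Hypothetical sketch of a scheduler pod spec for the proposed
# scheduler_pod_template= keyword. Everything here is illustrative.

def make_scheduler_pod_spec(image="daskdev/dask:latest"):
    """Build a minimal Kubernetes pod spec dict that runs dask-scheduler."""
    return {
        "kind": "Pod",
        "metadata": {"labels": {"app": "dask", "component": "scheduler"}},
        "spec": {
            "containers": [
                {
                    "name": "dask-scheduler",
                    "image": image,
                    "args": ["dask-scheduler", "--port", "8786",
                             "--dashboard-address", ":8787"],
                    "ports": [
                        {"containerPort": 8786},  # scheduler comms
                        {"containerPort": 8787},  # dashboard
                    ],
                }
            ]
        },
    }

# The proposed API might then accept it alongside the worker template:
# cluster = KubeCluster(pod_template=worker_spec,
#                       scheduler_pod_template=make_scheduler_pod_spec())
```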

@SeguinBe

Maybe this will be of some help to others: we had a situation where we wanted to have the client on our own machines and just dispatch Dask workers/scheduler to Kubernetes. So I made a tiny library to do it; it is not very well tested, but it works in our university setting.

https://github.com/SeguinBe/dask_k8

@jacobtomlinson
Member

This has now been implemented in #162 and can be enabled with cluster = KubeCluster(deploy_mode="remote", ...).

This creates the scheduler in a pod on the cluster and also creates a service to enable traffic to be routed to the pod. This is a ClusterIP service by default but can be configured to use a LoadBalancer.

This is currently not the default as it is non-trivial to route traffic to the dashboard using this setup.

Further testing and development would be great!
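For anyone trying this out, a minimal usage sketch of the remote deploy mode described above. The calls that touch a cluster are left commented out because they require dask_kubernetes (>= 0.10) and access to a live Kubernetes cluster; "worker-spec.yaml" is an assumed filename for the worker pod template.

```python
# Sketch: keyword arguments for the remote scheduler mode described above.
kubecluster_kwargs = {
    "deploy_mode": "remote",   # run the scheduler in its own pod
    "n_workers": 2,            # initial worker count (assumed kwarg)
}

# from dask_kubernetes import KubeCluster
# from dask.distributed import Client
#
# cluster = KubeCluster.from_yaml("worker-spec.yaml", **kubecluster_kwargs)
# client = Client(cluster)     # connects via the scheduler's Service
# client.submit(lambda x: x + 1, 41).result()
```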

@wwoods

wwoods commented Nov 13, 2019

I may not be following this thread correctly, but running dask_kubernetes.KubeCluster.from_yaml(..., deploy_mode="remote") is still resulting in a hanging client when I run it on my laptop (though the scheduler does appear to run in its own pod). What additional routing/configuration is needed to have a local laptop installation talk to a dask_kubernetes cluster spawned via from_yaml? dask_kubernetes version 0.10.0

@jacobtomlinson
Member

@wwoods the scheduler will be placed behind a service; by default this will be of type LoadBalancer. The client will then attempt to connect to the address of the service.

So if you leave the defaults, your Kubernetes cluster must be able to assign a LoadBalancer and your client must be able to connect to it. If you have direct access to the internal Kubernetes network, then I recommend setting the service type to ClusterIP, which means the client will connect directly to the Kubernetes service, but you must be able to route to it.

Without knowing more about your Kubernetes cluster setup, I can't provide more advice.
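For reference, the service type can be set through Dask's configuration (the kubernetes.scheduler-service-type key mentioned later in this thread). A sketch, assuming the usual Dask config file location:

```yaml
# Sketch: ~/.config/dask/kubernetes.yaml
kubernetes:
  scheduler-service-type: ClusterIP   # or LoadBalancer on managed clusters
```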

@wwoods

wwoods commented Nov 13, 2019

@jacobtomlinson Just trying to get this working in a "kind" context (local cluster), from scratch. It looks like the service is configured as ClusterIP... Here's the error:

Timed out trying to connect to 'tcp://dask-waltw-bfc87757-2.default:8786' after 10 s: [Errno -2] Name or service not known

And here's the output of kubectl get service:

dask-waltw-bfc87757-2 ClusterIP 10.111.76.46 <none> 8786/TCP,8787/TCP 69s

It looks like the service.namespace which dask-kubernetes is trying to connect to is correct, so perhaps kind hasn't configured the local DNS correctly? Not knowing more about kubernetes, it's hard to say if this is because I'm misusing deploy="remote". From the original use case though, it seems like running a python script locally and doing the dask computation remotely is exactly the use case.

@jacobtomlinson
Member

It looks like your local machine can't resolve Kubernetes DNS in your local cluster. If you were using a managed cluster with a load balancer, you would be provided with an address you could resolve.

You will need to configure things correctly to enable this. What local implementation are you using? (docker desktop, k3s, minikube, etc)

@wwoods

wwoods commented Nov 14, 2019

@jacobtomlinson right - I haven't used kubernetes much, but this particular bit of documentation is proving rather hard to find. I know it's not exactly a dask-kubernetes problem, but I do appreciate you pointing me in the right direction. It's somewhat necessary information to take dask-kubernetes for a test drive.

I was using kind (https://github.com/kubernetes-sigs/kind), but I'm not attached. Also tried minikube since it explicitly mentioned DNS, but no luck.

@jacobtomlinson
Member

I'm afraid I don't have experience with kind. I personally use docker desktop on my Mac for local development. We use minikube for CI and testing.

@wwoods

wwoods commented Nov 14, 2019 via email

@jacobtomlinson
Member

If you have a look at the ci config you can see how it is done there.

@detroyejr

And here's the output of kubectl get service:

dask-waltw-bfc87757-2 ClusterIP 10.111.76.46 <none> 8786/TCP,8787/TCP 69s

It looks like the service.namespace which dask-kubernetes is trying to connect to is correct, so perhaps kind hasn't configured the local DNS correctly? Not knowing more about kubernetes, it's hard to say if this is because I'm misusing deploy="remote". From the original use case though, it seems like running a python script locally and doing the dask computation remotely is exactly the use case.

My Kubernetes cluster is provisioned with KOPS, which doesn't create a load balancer in AWS by default. I was looking for a good solution yesterday and discovered kubefwd, which adds Kubernetes services to your hosts file so that URLs like dask-waltw-bfc87757-2:8787 are accessible locally. My limited testing shows that it works, but it maybe isn't as stable as a load balancer would be. Hope this helps someone.
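A sketch of that kubefwd workflow. The forwarding command is commented out since it requires kubefwd to be installed, sudo, and a kubeconfig for the cluster; the service name below is the one from the earlier comment.

```shell
# Forward all services in the default namespace to the local machine:
#
#   sudo kubefwd svc -n default
#
# kubefwd then writes hosts-file entries so service names resolve locally,
# making a scheduler address like this reachable from a local Dask client:
echo "tcp://dask-waltw-bfc87757-2:8786"
```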

@epizut

epizut commented Aug 20, 2020

Hi,

I am in the same situation here. I am trying to run a remote scheduler and workers on a managed k8s cluster from my laptop over the internet.

  • The auth part is fine with the recent kubeconfig dev (thanks a lot)
  • kubernetes.scheduler-service-type=LoadBalancer allows me to access the Dashboard where I see my worker(s) (dask.config thanks to Nuno)
  • distributed.comm.timeouts.connect=180 helps too (dask.config)
  • scheduler_service_wait_timeout=180 idem (KubeCluster constructor)
  • But then I got both a remote scheduler exception (in the k8s logs):
[...]
distributed.scheduler - INFO -   Scheduler at:   tcp://100.64.1.223:8786
distributed.scheduler - INFO -   dashboard at:                     :8787
tornado.application - ERROR - Exception in callback functools.partial(<function TCPServer._handle_connection. at 0x7f74b62473a0>,  exception=KeyError('pickle-protocol')>)
[...]
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 222, in on_connection
    comm.handshake_options = comm.handshake_configuration(
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 138, in handshake_configuration
    "pickle-protocol": min(local["pickle-protocol"], remote["pickle-protocol"])
KeyError: 'pickle-protocol'
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP  local=tcp://100.64.1.223:8786 remote=tcp://x.x.x.x:53488>
distributed.scheduler - INFO - Register worker <Worker 'tcp://100.64.0.56:40725', name: 0, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://100.64.0.56:40725
  • and a KubeCluster exception while rendering the IPython output in my laptop kernel:
[...]/cluster.py in __repr__(self)
    390             self._cluster_class_name,
    391             self.scheduler_address,
--> 392             len(self.scheduler_info["workers"]),
    393             sum(w["nthreads"] for w in self.scheduler_info["workers"].values()),
    394         )
KeyError: 'workers'

I believe managed k8s users are more data-science people than k8s admins.
So maybe the "LoadBalancer" solution could be documented a bit more?

Also dask-gateway seems overkill for a single user.

What do you think?

@jacobtomlinson
Member

That sounds to me like a version mismatch in Dask between the scheduler, workers, or client. I'm not sure it is really related to this issue. Could you check that all your Dask versions are up to date and, if you're still having trouble, raise a new issue?

I also agree that dask-gateway is overkill for a single user. That's not really its target audience. We have dask-kubernetes and the Dask helm-chart for single users, and dask-gateway (and soon the new dask-hub helm chart) for teams.

@epizut

epizut commented Aug 20, 2020

You are right, it's working for me now.

For the record I had to face:

  • Of course, deploy_mode=remote and scheduler-service-type=LoadBalancer
  • Increase distributed.comm.timeouts.connect (my managed k8s cluster takes about 15s to start the scheduler pod)
  • Strip dots from the username while waiting for the next PyPI package (see "removed . from valid characters for pod names" #225). The doc points to kubernetes.worker-name, but I only had success with kubernetes.name or KubeCluster(name=...)
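Collected as a Dask configuration sketch, using only the keys named in this thread (values are illustrative, not recommendations):

```yaml
# Sketch of the settings above (e.g. in ~/.config/dask/config.yaml)
distributed:
  comm:
    timeouts:
      connect: 180s            # managed clusters can be slow to start the scheduler pod
kubernetes:
  scheduler-service-type: LoadBalancer
  name: dask-myuser            # avoid dots in the name; see #225
```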

Thanks again Jacob

@jacobtomlinson
Member

I'm going to close this out as this is now the default behaviour.
