
Launch Scheduler remotely or on the local machine? #2

Open
mrocklin opened this issue Oct 13, 2016 · 9 comments

Comments

@mrocklin
Member

Do we want to launch the dask-scheduler process on the cluster or do we want to keep it local?

Keeping the scheduler local simplifies things, but makes it harder for multiple users to share the same scheduler.

Launching the scheduler on the cluster is quite doable, but we'll need to learn which machine the scheduler launched on. I know how to do this through the SGE interface, but it's not clear to me that it is exposed through DRMAA. See this Stack Overflow question. Alternatively we could pass the location of the scheduler back through some other means. This could be a file on NFS (do we want to assume the presence of a shared filesystem?) or a tiny TCP server to which the scheduler connects with its information.
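
To make the shared-file option concrete, a minimal sketch (the NFS path and the JSON format here are hypothetical, not anything implemented):

    import json
    import time

    SCHEDULER_FILE = "/shared/nfs/scheduler.json"  # hypothetical NFS path

    # Cluster side, run next to the scheduler: announce where it listens.
    def publish(address):
        with open(SCHEDULER_FILE, "w") as f:
            json.dump({"address": address}, f)

    # User side: poll the shared file until the scheduler announces itself.
    def discover(timeout=60):
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                with open(SCHEDULER_FILE) as f:
                    return json.load(f)["address"]
            except (IOError, ValueError):  # file absent or half-written
                time.sleep(1)
        raise RuntimeError("scheduler never published its address")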

The current implementation just launches the scheduler on the user's machine. Barring suggestions to the contrary my intention is to move forward with this approach until there is an obvious issue.

@davidr

davidr commented Oct 17, 2016

This is a good question!

This is a personal preference, but in my mind it's entirely reasonable to keep it local, which in a traditional HPC environment would usually mean a login node. If you were to launch the scheduler as a separate task, it would have some advantages, as you note, but would be restricted in other ways (maximum walltime, etc.). Additionally, even though you could get a hostname, there's no guarantee that hostname resolves to the preferred network interface (for instance, $hostname vs. $hostname-ib0 or $hostname-10g).
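
One way around that ambiguity is to resolve a specific interface's address directly rather than trusting the hostname. A minimal sketch, assuming psutil is available and ib0 is the interface of interest:

    import socket
    import psutil

    def address_of(interface):
        """Return the first IPv4 address bound to the given interface."""
        for addr in psutil.net_if_addrs()[interface]:
            if addr.family == socket.AF_INET:
                return addr.address
        raise ValueError("no IPv4 address on %s" % interface)

    # Prefer the InfiniBand interface when present, else fall back.
    host = (address_of("ib0") if "ib0" in psutil.net_if_addrs()
            else socket.gethostname())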

@richardotis

On every HPC cluster I've ever used, users were restricted from running long-lived background tasks on login nodes. On some clusters, firewalls prevent connections to outside machines.

@basnijholt
Contributor

In our research group (~20 people) we use a PBS cluster and all do the following:

We use ipyparallel to start an ipcontroller on the head node and ipengines on the compute nodes. Then we use a JupyterHub on a different machine and connect the Client over ssh (with some custom scripts).

I am currently trying to figure out how we can use dask in our workflow. This package seems like something we could definitely use.

The ideal for us would be something like (running on the remote machine):

from dask_drmaa import DRMAACluster
cluster = DRMAACluster(ssh='sshserver')

I am very willing to help you test this :)

@mrocklin
Member Author

@basnijholt a simple way to use this would be the following:

ssh loginnode
dask-drmaa 20  # launch scheduler and 20 workers
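
Or roughly the same thing from Python on the login node (a sketch; the worker count is arbitrary):

    from dask_drmaa import DRMAACluster
    from distributed import Client

    cluster = DRMAACluster()   # scheduler starts locally on the login node
    cluster.start_workers(20)  # workers are submitted as jobs through DRMAA
    client = Client(cluster)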

@basnijholt
Contributor

I know how to start the scheduler and the workers. The issue is to connect the client to the remote cluster.

I tried to tunnel the scheduler port, but that isn't working.

@mrocklin
Member Author

Generally I think that trying to help people get around local network policies is probably out of Dask's scope. I'm open to suggestions if you think there are general solutions to this, but I suspect that the right solution is "talk to your network administrator".

@basnijholt
Contributor

basnijholt commented Jan 22, 2017

I don't think our network needs anything special. I am able to tunnel ports and connect to an IPython Parallel cluster for example.

Maybe I did something incorrectly, but I only tunneled the scheduler port to the machine on which I run a notebook. Do I need to do something more, like manually transfer files?
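
For context, what I mean by tunneling the scheduler port is roughly the following (host names are placeholders; 8786 is the default scheduler port):

    # On the local machine, forward the scheduler port over ssh:
    #   ssh -N -L 8786:localhost:8786 user@loginnode
    from distributed import Client

    # With the tunnel up, point the client at the local end of it.
    client = Client("tcp://127.0.0.1:8786")

It's possible a single tunnel isn't enough if the client also needs to reach workers directly for some transfers.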

@mrocklin
Member Author

Workers will need to be able to run dask-worker. Dask-drmaa generally assumes that your worker machines have the same software environment as your login node. This conversation is drifting from the original topic of this issue; if you have further questions I recommend opening a new issue.

@jakirkham
Member

Figured I'd leave this note in case it helps someone, even though this issue seems dormant.

FWIW, I have been successfully using dask-drmaa by doing the following:

  1. Logging in to the login node.
  2. Launching an interactive job.
  3. Starting the Jupyter notebook.
  4. Opening a two-hop ssh tunnel to the notebook.
  5. Launching dask-drmaa as one of the steps in the notebook.

Did the same thing with ipyparallel previously and that also worked quite well.
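
Concretely, the two-hop tunnel in step 4 and the notebook-side launch look roughly like this (a sketch; host names and the port are placeholders, 8888 being Jupyter's default):

    # Hop 1, from the local machine to the login node:
    #   ssh -N -L 8888:localhost:8888 user@loginnode
    # Hop 2, from the login node to the compute node running the notebook:
    #   ssh -N -L 8888:localhost:8888 user@computenode
    #
    # The notebook is then reachable at http://localhost:8888, and inside it:
    from dask_drmaa import DRMAACluster
    from distributed import Client

    cluster = DRMAACluster()
    cluster.start_workers(4)  # worker count is arbitrary here
    client = Client(cluster)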
