
DOC: Add usage example to SSHCluster docstring #3864

Open
stsievert wants to merge 14 commits into dask:main from stsievert:sshcluster-docstring-examples

Conversation

@stsievert
Member

What does this PR implement?
It adds a usage example to the SSHCluster docstring.

I am unsure if this is "best practice" – please review with that in mind.

Reference issues/PRs

@jacobtomlinson
Member

Thanks for raising this. Best practice with SSH is always to use keys, not passwords.

I think adding this example is valuable, but it shouldn't be a best practice example.

It is also possible to configure port forwards with SSHCluster in a similar way, so if your remote system only has port 22 exposed in its firewall you can still connect to the Dask cluster. I wonder if adding an example of that would be valuable too.

@stsievert
Member Author

I think adding this example is valuable, but it shouldn't be a best practice example.

I hope you mean "it should be an example that uses the best practices." Most of that example is working for me, but I'm having difficulty creating an SSHCluster object even though I can SSH to my machine with my keyfile. Can you help me out @jacobtomlinson? Here are the details:

Details

I've followed the SSH key generation guide (which is also in the docstring):

$ ssh-keygen -t rsa -b 4096 -f ~/.ssh/dask-ssh -P ""
$ ssh-copy-id -i ~/.ssh/dask-ssh user@machine

That means I can log in with this command:

ssh -i ~/.ssh/dask-ssh user@machine

However, I'm having difficulty creating the SSHCluster object; the documentation for SSHClientConnectionOptions is relevant but a little opaque.

I think I need to specify the correct connect_options in SSHCluster, so I've left that blank in the docstring (so this PR is still a WIP).
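For reference, here's my best guess at what those connect_options might look like. This is only a sketch: the username and key path are placeholders, and it assumes connect_options is forwarded to asyncssh, so SSHClientConnectionOptions keywords like username and client_keys apply.

```python
# Hypothetical sketch of key-based connect_options for SSHCluster.
# "user" and the dask-ssh key path are assumptions, matching the
# ssh-keygen / ssh-copy-id commands above.
from pathlib import Path

connect_options = {
    "username": "user",                                        # remote login name
    "client_keys": [str(Path.home() / ".ssh" / "dask-ssh")],   # private key from ssh-keygen
    "known_hosts": None,                                       # skip host-key checking (convenient, less secure)
}

# from dask.distributed import Client
# from distributed.deploy.ssh import SSHCluster
# cluster = SSHCluster(
#     ["machine", "machine"],  # first host runs the scheduler, the rest run workers
#     connect_options=connect_options,
# )
# client = Client(cluster)
```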

I wonder if adding an example of that would be valuable too.

That'd be a good example I think. I think there should also be an example with different conda environments.
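A sketch of what the conda-environment example might look like, assuming remote_python selects the interpreter that SSHCluster launches on each host (the environment name "myenv" and the ~/anaconda3 prefix are made-up paths, not Dask defaults):

```python
# Hypothetical: point remote_python at a specific conda environment's
# interpreter so the scheduler and workers start inside that environment.
remote_python = "~/anaconda3/envs/myenv/bin/python"  # assumed env name and prefix

# from distributed.deploy.ssh import SSHCluster
# cluster = SSHCluster(["machine", "machine"], remote_python=remote_python)
```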

@stsievert
Member Author

stsievert commented Jun 9, 2020

Thanks! Now I can log in to the cluster. But I'm having some issues starting the dask-worker; I'm getting errors about a timeout:

OSError: Timed out trying to connect to 'tcp://144.92.142.180:34284' after 10 s: Timed out trying to connect to 'tcp://144.92.142.180:34284' after 10 s: connect() didn't finish in time

Here's the full traceback:

Details
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - -----------------------------------------------
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - Local Directory:    /tmp/scheduler-q6jocy6z
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - -----------------------------------------------
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - Clear task state
distributed.deploy.ssh - INFO - /mnt/ws/home/ssievert/anaconda3/lib/python3.7/site-packages/distributed/dashboard/core.py:72: UserWarning:
distributed.deploy.ssh - INFO - Port 8797 is already in use.
distributed.deploy.ssh - INFO - Perhaps you already have a cluster running?
distributed.deploy.ssh - INFO - Hosting the diagnostics dashboard on a random port instead.
distributed.deploy.ssh - INFO - warnings.warn("\n" + msg)
distributed.deploy.ssh - INFO - distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.deploy.ssh - INFO - distributed.scheduler - INFO -   Scheduler at: tcp://144.92.142.180:34284

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
~/Developer/stsievert/distributed/distributed/comm/core.py in connect(addr, timeout, deserialize, **connection_args)
    233             if not comm:
--> 234                 _raise(error)
    235         except FatalCommClosedError:

~/Developer/stsievert/distributed/distributed/comm/core.py in _raise(error)
    214         )
--> 215         raise IOError(msg)
    216 

OSError: Timed out trying to connect to 'tcp://144.92.142.180:34284' after 10 s: connect() didn't finish in time

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-2-d45d5074e674> in <module>
      4     scheduler_options={"port": 0, "dashboard_address": ":8797"},
      5     connect_options={"username": "ssievert", "client_keys": "/Users/scott/.ssh/dask-ssh"},
----> 6     remote_python="~/anaconda3/bin/python"
      7 )
      8 client = Client(cluster)

~/Developer/stsievert/distributed/distributed/deploy/ssh.py in SSHCluster(hosts, connect_options, worker_options, scheduler_options, worker_module, remote_python, **kwargs)
    363         for i, host in enumerate(hosts[1:])
    364     }
--> 365     return SpecCluster(workers, scheduler, name="SSHCluster", **kwargs)

~/Developer/stsievert/distributed/distributed/deploy/spec.py in __init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name)
    254         if not self.asynchronous:
    255             self._loop_runner.start()
--> 256             self.sync(self._start)
    257             self.sync(self._correct_state)
    258 

~/Developer/stsievert/distributed/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    161             return future
    162         else:
--> 163             return sync(self.loop, func, *args, **kwargs)
    164 
    165     async def _get_logs(self, scheduler=True, workers=True):

~/Developer/stsievert/distributed/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    337     if error[0]:
    338         typ, exc, tb = error[0]
--> 339         raise exc.with_traceback(tb)
    340     else:
    341         return result[0]

~/Developer/stsievert/distributed/distributed/utils.py in f()
    321             if callback_timeout is not None:
    322                 future = asyncio.wait_for(future, callback_timeout)
--> 323             result[0] = yield future
    324         except Exception as exc:
    325             error[0] = sys.exc_info()

~/anaconda3/envs/dask-ml-test/lib/python3.7/site-packages/tornado/gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

~/Developer/stsievert/distributed/distributed/deploy/spec.py in _start(self)
    287             connection_args=self.security.get_connection_args("client"),
    288         )
--> 289         await super()._start()
    290 
    291     def _correct_state(self):

~/Developer/stsievert/distributed/distributed/deploy/cluster.py in _start(self)
     58 
     59     async def _start(self):
---> 60         comm = await self.scheduler_comm.live_comm()
     61         await comm.write({"op": "subscribe_worker_status"})
     62         self.scheduler_info = await comm.read()

~/Developer/stsievert/distributed/distributed/core.py in live_comm(self)
    694                 self.timeout,
    695                 deserialize=self.deserialize,
--> 696                 **self.connection_args,
    697             )
    698             comm.name = "rpc"

~/Developer/stsievert/distributed/distributed/comm/core.py in connect(addr, timeout, deserialize, **connection_args)
    243                 backoff = min(backoff, 1)  # wait at most one second
    244             else:
--> 245                 _raise(error)
    246         else:
    247             break

~/Developer/stsievert/distributed/distributed/comm/core.py in _raise(error)
    213             error,
    214         )
--> 215         raise IOError(msg)
    216 
    217     backoff = 0.01

OSError: Timed out trying to connect to 'tcp://144.92.142.180:34284' after 10 s: Timed out trying to connect to 'tcp://144.92.142.180:34284' after 10 s: connect() didn't finish in time

@jacobtomlinson
Member

It looks like your client cannot speak directly to the scheduler. It is looking for it at tcp://144.92.142.180:34284.

Could there be some firewall rules blocking your Python session from speaking to that IP and port?

@stsievert
Member Author

Could there be some firewall rules blocking your Python session from speaking to that host and IP?

Yes. The machines I use are owned by UW–Madison, and they have a pretty stringent firewall. I think this would be a valuable example then:

It is also possible to configure port forwards with SSHCluster in a similar way, so if your remote system only has port 22 exposed in its firewall you can still connect to the Dask cluster.

@jacobtomlinson
Member

jacobtomlinson commented Jun 10, 2020

You won't be able to run the scheduler on port 22, as SSH is already using it.

What you should do is port forward the Dask scheduler port to your local machine. However, SSHCluster will still try to connect to the scheduler address, so we need to fix that. I'll open an issue to track this (#3881).
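A minimal sketch of that workaround, assuming the default scheduler port 8786 and the dask-ssh key from earlier in the thread (note SSHCluster itself won't pick up the forwarded address until #3881 is fixed):

```python
# Hypothetical workaround: forward the scheduler port over SSH in a
# separate terminal, e.g.
#   ssh -i ~/.ssh/dask-ssh -N -L 8786:localhost:8786 user@machine
# then point a plain Client at the forwarded local port instead of the
# scheduler's public (firewalled) address. 8786 is Dask's default
# scheduler port; adjust it to whatever the scheduler actually reports.
scheduler_address = "tcp://localhost:8786"

# from dask.distributed import Client
# client = Client(scheduler_address)
```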

The other assumption that Dask makes is that all workers and the scheduler can speak on any high port number at any time. Do you think the firewall rules at UW–Madison will hinder this too?

@stsievert
Member Author

I think this PR is done, at least until #3881 is resolved. I've added what I'm using to get around the firewall restrictions.

all workers and the scheduler can speak on any high port number at any time. Do you think the firewall rules at UW–Madison will hinder this too?

I think it did interfere, but it's not an issue anymore.

Base automatically changed from master to main March 8, 2021 19:04