Differentiate between binding and advertising address for dask workers #1258

Closed
sohaibiftikhar opened this issue Jul 14, 2017 · 6 comments · Fixed by #1278

Comments

@sohaibiftikhar
Contributor

Currently the bind and the advertise IP of dask-workers are the same. If the dask-worker is running inside a Docker container in bridge mode, these addresses will not be the same: while the worker binds to some internal address, the scheduler needs to communicate with it using an external IP.

Currently we only have the option of setting the host address or the interface that the dask-worker listens on, but no option for which address is advertised to the scheduler.
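
For illustration only (this is not Dask's API), a minimal plain-socket sketch of the distinction: a process can bind to an internal or wildcard address while telling its peers a different, externally reachable one. BIND_HOST and ADVERTISE_HOST are hypothetical placeholder values.

import socket

# Hypothetical values: inside a bridge-mode container the worker binds to an
# internal/wildcard address, while the scheduler must be given the host's
# external address instead.
BIND_HOST = "0.0.0.0"           # what the worker's socket actually listens on
ADVERTISE_HOST = "203.0.113.7"  # placeholder external IP reachable by the scheduler
PORT = 7971

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((BIND_HOST, PORT))   # binding uses the internal address
server.listen(5)

# The address we *tell* the scheduler about differs from the bind address.
advertised_address = "tcp://%s:%d" % (ADVERTISE_HOST, PORT)
print("listening on", server.getsockname(), "advertising", advertised_address)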

@rbubley
Contributor

rbubley commented Jul 17, 2017

Unless I misunderstand, I have a similar but different use case:

My scheduler is running in the cloud, but my worker is behind a router/NAT. I started off thinking that port forwarding would be enough, but I seem to hit this same issue. If I give the worker the local IP (or let it discover it), the scheduler tries and fails to connect to the 192.168.x.x address. If I give dask-worker the external IP with the --host option, then I get:

$ dask-worker --nanny --worker-port 7971 --nthreads 1 <redacted scheduler IP>:8786 --no-bokeh -host <redacted external IP>

tornado.application - ERROR - Exception in callback <functools.partial object at 0x10e07e520>
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tornado/ioloop.py", line 605, in _run_callback
    ret = callback()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tornado/ioloop.py", line 626, in _discard_future_result
    future.result()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tornado/gen.py", line 307, in wrapper
    yielded = next(result)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/distributed/nanny.py", line 131, in _start
    self.listen(addr_or_port, listen_args=self.listen_args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/distributed/core.py", line 206, in listen
    self.listener.start()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/distributed/comm/tcp.py", line 350, in start
    backlog=backlog)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tornado/netutil.py", line 197, in bind_sockets
    sock.bind(sockaddr)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 49] Can't assign requested address

I guess a work-around would be to do something with SSH tunnelling, but that seems messy.
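
The Errno 49 at the end of that traceback is the OS refusing to bind a socket to an address that is not assigned to any local interface, which is exactly what a NAT'd external IP is. A standalone reproduction (the IP below is a placeholder, not taken from this thread):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # 198.51.100.23 stands in for the router's external IP; it is not assigned
    # to any local interface, so the kernel refuses the bind.
    s.bind(("198.51.100.23", 7971))
except socket.error as e:
    print(e)  # Errno 49 on macOS, Errno 99 on Linux: cannot assign requested address
finally:
    s.close()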

@sohaibiftikhar
Contributor Author

sohaibiftikhar commented Jul 18, 2017

Yes, this is very similar. I believe that while SSH tunnelling could potentially be used (I have not tried this yet) to bring all containers onto the same network, it is not just messy but may also be very costly when data is shuffled across workers and the scheduler. Correct me if I am wrong.
If you are familiar with the code base and could point me to a place to start digging, I could look into fixing this myself.

@mrocklin
Member

@sohaibiftikhar you might look for self.scheduler.register in distributed/worker.py. The self.address used in that line is the address that the worker advertises to the scheduler, which the scheduler will in turn share with other workers when they need to connect to this worker.
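
A rough sketch of the shape such a change could take (illustrative names only, not the actual distributed code): keep the listening address for binding, and let an optional contact address override what is reported in the registration call.

class Worker(object):
    """Illustrative only -- not the actual distributed.Worker class."""

    def __init__(self, listen_address, contact_address=None):
        # Address the worker actually binds/listens on.
        self.listen_address = listen_address
        # Optional externally reachable address; falls back to the listen address.
        self.contact_address = contact_address or listen_address

    @property
    def address(self):
        # What gets advertised to the scheduler (and relayed to other workers).
        return self.contact_address

    def register(self, scheduler):
        # Analogous in spirit to the self.scheduler.register(...) call in
        # distributed/worker.py: report the advertised address, not the bind one.
        scheduler.register(address=self.address)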

@rbubley
Contributor

rbubley commented Jul 19, 2017

@mrocklin a simple test where I just hard-coded the self.address value you indicated seemed to work, but for some reason the scheduler first tries the internal address before the one passed in that attribute. Scheduler output:

distributed.scheduler - INFO - Register tls://192.168.1.80:7981
distributed.scheduler - ERROR - Failed to connect to worker 'tls://192.168.1.80:7981': Timed out trying to connect to 'tls://192.168.1.80:7981' after 3.0 s: connect() didn't finish in time
distributed.scheduler - INFO - Remove worker tls://192.168.1.80:7981
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Removed worker tls://192.168.1.80:7981
distributed.scheduler - INFO - Register tls://[redacted worker ip]:7981
distributed.scheduler - INFO - Starting worker compute stream, tls://[redacted worker ip]:7981

@mrocklin
Member

Note that register is called in two places: first when the worker initially registers, and then for every heartbeat (a few seconds later). Heartbeat messages are just idempotent registration messages.
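
That may explain the ordering in the log above: the first register call can still carry the stale internal address, while a later heartbeat with the corrected address simply re-registers the worker. A toy sketch of that idempotent behaviour (names are illustrative, not the distributed internals):

class Scheduler(object):
    def __init__(self):
        self.workers = {}

    def register(self, name, address):
        # Idempotent: re-registering a known worker just refreshes its address.
        self.workers[name] = address
        print("registered %s at %s" % (name, address))

scheduler = Scheduler()
# Initial registration may still report the internal address...
scheduler.register("w1", "tls://192.168.1.80:7981")
# ...while a later heartbeat simply re-registers with the corrected address.
scheduler.register("w1", "tls://203.0.113.7:7981")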

@sohaibiftikhar
Contributor Author

sohaibiftikhar commented Jul 21, 2017

@mrocklin So I looked at the code and have a fix ready that I still need to test, but before that I have a question about the worker address binding. If we start the worker with --host <public_ip> --nanny, the nanny server listens on the public IP but the worker still listens on the local IP. However, if we run with --no-nanny instead, the worker listens on the public IP. I found the reason in distributed/nanny.py, where around line 182 we pass only the worker_port as worker_start_args and not the --host that the user supplied. Is there a specific reason I am missing for why it is done this way? To me this looks like a bug.

--UPDATE--
I found the reason; you can safely ignore the above. It is not a bug: the get_ip function does the job of fetching the correct IP. However, I still feel the behaviour is not consistent when running with vs. without a nanny.
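
For reference, the usual trick for discovering the outward-facing IP, which is roughly what I understand get_ip to rely on, is to connect a UDP socket toward a remote host and read back the local address the OS selects; no packets are actually sent:

import socket

def guess_local_ip(remote_host="8.8.8.8", remote_port=80):
    # Connecting a UDP socket sends no traffic; it only makes the OS pick the
    # local interface/IP it would use to reach remote_host.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((remote_host, remote_port))
        return s.getsockname()[0]
    finally:
        s.close()

print(guess_local_ip())  # e.g. the machine's LAN address rather than 127.0.0.1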
