
Scheduler and Client throw error on connect #3412

Open
tcholewik opened this issue Jan 27, 2020 · 8 comments

Comments

@tcholewik

I am trying to connect to a cluster. I haven't had a chance to change any configuration yet, so everything should be at its defaults. To start the scheduler I use the CLI without any arguments (dask-scheduler), and to create the client I use a Jupyter notebook that just runs:

import distributed
client = distributed.Client("scheduler.local:8786")

I'm pretty confident that the error I am seeing has nothing to do with the scheduler URL I provided, because my client manages to crash the scheduler 😆.
Both of my containers pin dask to version 2.9.2 and are built at the same time, so I am also pretty sure this issue has little to do with the client and scheduler running different versions.

One thing worth mentioning is that the client and scheduler each run in their own docker container. Those containers work fine under docker-compose; the problem starts when I run them inside Amazon's ECS.

At this point all I can think of is that tornado needs a timeout setting?
The blank assertion error is not really helping pinpoint the problem.
Looking at the trace, it appears no timeout is provided by default.
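
If a missing timeout really is the issue, one easy thing to try (just a sketch; the timeout argument is visible in the Client signature in the traceback below, and the value here is arbitrary) is passing an explicit connection timeout:

import distributed

# Pass an explicit 30-second connection timeout instead of relying on
# whatever the default resolves to.
client = distributed.Client("scheduler.local:8786", timeout=30)

I haven't confirmed this changes the behaviour; it would at least make the timeout explicit rather than leaving it unset.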

Client error:

AssertionError                            Traceback (most recent call last)
<ipython-input-11-825742b4cee0> in <module>
----> 2 client = distributed.Client("scheduler.local:8786")

/usr/local/lib/python3.8/site-packages/distributed/client.py in __init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, **kwargs)
    726             ext(self)
    727 
--> 728         self.start(timeout=timeout)
    729         Client._instances.add(self)
    730 

/usr/local/lib/python3.8/site-packages/distributed/client.py in start(self, **kwargs)
    891             self._started = asyncio.ensure_future(self._start(**kwargs))
    892         else:
--> 893             sync(self.loop, self._start, **kwargs)
    894 
    895     def __await__(self):

/usr/local/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    333     if error[0]:
    334         typ, exc, tb = error[0]
--> 335         raise exc.with_traceback(tb)
    336     else:
    337         return result[0]

/usr/local/lib/python3.8/site-packages/distributed/utils.py in f()
    317             if callback_timeout is not None:
    318                 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
--> 319             result[0] = yield future
    320         except Exception as exc:
    321             error[0] = sys.exc_info()

/usr/local/lib/python3.8/site-packages/tornado/gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

/usr/local/lib/python3.8/site-packages/distributed/client.py in _start(self, timeout, **kwargs)
    984 
    985         try:
--> 986             await self._ensure_connected(timeout=timeout)
    987         except OSError:
    988             await self._close()

/usr/local/lib/python3.8/site-packages/distributed/client.py in _ensure_connected(self, timeout)
   1065             msg = await comm.read()
   1066         assert len(msg) == 1
-> 1067         assert msg[0]["op"] == "stream-start"
   1068 
   1069         bcomm = BatchedSend(interval="10ms", loop=self.loop)

AssertionError: 

Scheduler error:

Traceback (most recent call last):
  File "/usr/local/bin/dask-scheduler", line 8, in <module>
    sys.exit(go())
  File "/usr/local/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py", line 248, in go
    main()
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py", line 237, in main
    loop.run_sync(run)
  File "/usr/local/lib/python3.8/site-packages/tornado/ioloop.py", line 531, in run_sync
    raise TimeoutError("Operation timed out after %s seconds" % timeout)
tornado.util.TimeoutError: Operation timed out after None seconds
@tcholewik
Author

I missed it before, but before throwing the error the scheduler logs:
distributed.scheduler - INFO - End scheduler at 'tcp://address:8786'

@tcholewik
Author

I have been thinking about how I can modify my setup to narrow down what is wrong, solve this, and provide more information here.
When I get a chance, I will set up the scheduler, still in Amazon's cloud, but in the same docker container as the client.
It's probably not good practice, but it may rule out the possibility that something is wrong with my networking setup. Speaking of which, I don't recall seeing much documentation on networking beyond keeping the scheduler and worker ports exposed. Is there anything else that I missed?
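
As a basic networking sanity check (a minimal sketch, independent of dask, reusing the scheduler.local:8786 address from the Client call above), raw TCP connectivity from the client container to the scheduler port can be verified first:

import socket

# Host and port taken from the Client call above; adjust to the actual setup.
HOST, PORT = "scheduler.local", 8786

try:
    # Plain TCP connect with an explicit timeout so the check fails fast
    # instead of hanging indefinitely.
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"TCP connection to {HOST}:{PORT} succeeded")
except OSError as exc:
    print(f"TCP connection to {HOST}:{PORT} failed: {exc!r}")

If this succeeds but the Client handshake still fails with the AssertionError above, the problem is more likely in the handshake than in the ECS port configuration.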

@jakirkham
Member

Would you be able to supply some more information about how you are starting your cluster up? Ideally it would be good to get an MRE.

@BitTheByte

I'm having the same problem; is there anything I can do?

@jakirkham
Member

File a new issue that contains a simple reproducer along with details about the environment it was run in.

@jkanche

jkanche commented Dec 16, 2021

Also running into the same issue, although using the latest docker images and dask-cloudprovider to launch on Fargate.

@BitTheByte

BitTheByte commented Dec 16, 2021

Hi @jakirkham

For my case at least, I really don't know what the cause is or how I can reproduce this issue, as it happens randomly. I'm just using dask-distributed to spin up a dask cluster on Kubernetes. Nothing fancy around it; it's really hard to debug or reproduce intentionally.

@jakirkham
Member

Maintainers (like myself) generally have lots of asks from many directions, which means we have a limited amount of time to spend per issue. As a result, we (maintainers) really depend on users (like yourselves) to articulate clearly, with a reproducer, what problem you are running into. If you can't reproduce it, we won't be able to reproduce it. If we can't reproduce it, we won't be able to help you debug it (let alone come up with a fix or a test to confirm it doesn't get broken again). I get it; this is probably not what you want to hear, and I have been on the other side of these problems (struggling to find my own reproducers and spending significant amounts of time constructing them). Unfortunately this is just the reality of things, and this division of work tends to result in better outcomes (fixed issues for end users).

Separately, the issue raised by the OP is nearly 2 years old and had been largely dormant until a couple of people (yourself included) saw something that looks like it. This tells me there is a very good chance that what you are seeing is entirely unrelated. The fact that Kubernetes and Cloudprovider are coming into the mix suggests it could even be a downstream issue, or at least a downstream change that unmasked an upstream one.

In any event, to avoid confusion and to aid maintainers in helping you, I would suggest filing new issues with as much information as you can provide. Hopefully this all makes sense; I greatly appreciate your help here 🙂
