Clean up usage around plain rpc #6082
Conversation
If the rpc is not closed with
@dask/gpu I am seeing some errors on gpuCI that I cannot understand. Can somebody have a look and tell me if this might be related to the changes I'm proposing?
@fjetter yes, I think they are. Locally I don't see those errors on current
What exactly are "endpoints" in this context? And what objects are we talking about?
Apologies, I should have been more specific. An endpoint is an object that provides a connection to another object of the same type (endpoint) on a remote process, such as client and server, but UCX establishes an
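In case a concrete picture helps, here is a minimal, purely illustrative sketch of two such endpoints using plain asyncio streams rather than UCX (the real UCX endpoint objects and their API are not shown); each side holds an object that reads from and writes to its peer, which in real use would live in a separate process:

```python
import asyncio


async def serve(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # Server-side endpoint: receive one message and echo it back to the peer.
    msg = await reader.readline()
    writer.write(b"echo: " + msg)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def main() -> None:
    server = await asyncio.start_server(serve, "127.0.0.1", 8765)

    # Client-side endpoint: connected to the server endpoint above.
    reader, writer = await asyncio.open_connection("127.0.0.1", 8765)
    writer.write(b"hello\n")
    await writer.drain()
    print(await reader.readline())

    writer.close()
    await writer.wait_closed()
    server.close()
    await server.wait_closed()


asyncio.run(main())
```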
distributed/deploy/spec.py
Outdated
await asyncio.gather(*self._futures)

if self.scheduler_comm:
    await self.scheduler_comm.close_rpc()
@fjetter I narrowed the UCX errors down to this line. If I simply replace it with the original `await self.scheduler_comm.close(close_workers=True)`, those errors go away. Could you explain what the difference between the two should be in this particular context?
From how I read the code, a `close` call like `self.scheduler_comm.close(close_workers=True)` can actually never work, since there is no RPC handler called `close`, but we'd never know because we suppress all OSErrors. I replaced it with the proper RPC handler, `terminate`. If we're not calling `self.scheduler_comm.close_rpc`, we need to rely on the finalizer, which is something I'd like to avoid.
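For anyone skimming, a minimal sketch of the distinction being discussed, assuming the plain `rpc` helper from `distributed.core`; the exact handler call shown is illustrative rather than authoritative:

```python
# Sketch: a plain rpc proxy turns attribute access into remote handler
# calls, so the name used must match a handler registered on the server.
from distributed.core import rpc


async def shut_down(scheduler_address: str) -> None:
    scheduler = rpc(scheduler_address)
    try:
        # Sent to the server as a request for a handler named "terminate".
        # A name with no registered handler (e.g. "close") would fail on
        # the remote side instead of closing anything.
        await scheduler.terminate()
    finally:
        # Close the rpc's own comms explicitly rather than relying on the
        # object's finalizer at garbage-collection time.
        await scheduler.close_rpc()
```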
The latest change passes locally; if you think that solution is reasonable/appropriate then I'm good with that too. The gpuCI tests already passed.
Minor typo suggestion, but otherwise looks good. Thanks @fjetter for also helping me figure out the issue with UCX and ultimately for fixing it.
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
Thanks for your help and review, @pentschev!
Failing tests are due to #6088.
While looking into #6080 I realized that plain `rpc` instances are incredibly rare. They have been mostly replaced by `ConnectionPool` and `PooledRPCCall`. The three places I found them being used can be seen below, and they all did something wrong:

- `gather_from_workers` instantiates a couple but never uses them.
- `SpecCluster._close` is using it to call a handler `close` that doesn't exist.
- `Scheduler.restart` is using a couple of `rpc` instances to connect to the nannies. This may leave dangling comms or rpcs if exceptions pop up, e.g. during teardown; an async exit stack should be preferred (a rough sketch follows below).

I think for `SpecCluster` and `Scheduler.restart` the pool would work just fine. I didn't want to change too much in one go and think this should already help.