-
Notifications
You must be signed in to change notification settings - Fork 24.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid opening outbound connections during shutdown #77539
Comments
Pinging @elastic/es-distributed (Team:Distributed) |
I think we hit exactly this problem in https://gradle-enterprise.elastic.co/s/3lq4o3iswxumo/console-log?task=:server:test which seems like an odd coincidence since this isn't a new bug and yet I've not seen it result in test failures ever before. |
Today while the `ClusterConnectionManager` is closing it will reject attempts to open _managed_ connections (i.e. using `connectToNode`), but it still permits ad-hoc connections (i.e. using `openConnection`). This commit extends the existing refcounting mechanism to cover both cases, preventing all concurrent connection attempts while shutting down. Closes elastic#86249 Relates elastic#77539
Note that #86315 means we no longer need to protect against connection attempts while closing. Should we just remove this (buggy) protection mechanism instead of trying to fix it? |
Today while the `ClusterConnectionManager` is closing it will reject attempts to open _managed_ connections (i.e. using `connectToNode`), but it still permits ad-hoc connections (i.e. using `openConnection`). This commit extends the existing refcounting mechanism to cover both cases, preventing all concurrent connection attempts while shutting down. Closes #86249 Relates #77539
To answer the question raised in this comment I started to look at whether we could block on the
closeLock
and I think the answer is no but also I now have more questions about concurrency during closing. Can we definitely not add toacceptedChannels
after the server channel is closed? Also this comment:elasticsearch/server/src/main/java/org/elasticsearch/transport/TcpTransport.java
Line 282 in e6fd459
Does it really do that? It did in the past when we blocked the thread while the connection completes, but today I think it's possible for us to open a connection and then leak it by shutting the event loop down. It'll be ok for connections that are happening via
ClusterConnectionManager#connectToNode
since we closed all the connection managers down first, but we bypass that mechanism for discovery and sniffing.Not sure there's really a bug here, I haven't tried to reproduce it or anything. If there is a bug then it's benign in production anyway since we're shutting down so everything will be cleaned up soon, but it's still untidy and might cause test issues.
The text was updated successfully, but these errors were encountered: