
Fix Leaking Listener When Closing NodeClient #55676

Merged (4 commits) on Apr 28, 2020

Conversation

original-brownbear (Member)

If a node client (or rather its underlying node) is closed, then any executions on it will just quietly fail, as happens in #55660 via closing the nodes on the test thread and asynchronously using a node client.
I also removed the onRejection implementation on the error-handling runnable since it's dead code (the generic pool never rejects).
IMO, it's reasonable to handle the callback on the current thread when shutting down, if only to fix tests and remove a dirty spot that loses listeners.
The only way you can really trigger a stack overflow here is by rerunning requests in a hot loop, which you shouldn't do in the first place. ES internally doesn't do that in its usages of the node client as far as I can see, and we shouldn't do that in tests either (and don't, as far as I can see).

Closes #55660
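The behavior the description argues for can be sketched in isolation: when the executor backing the client has been shut down, resolve the listener on the calling thread instead of dropping it. This is a minimal, self-contained sketch with invented names (`Listener`, `ShutdownSafeDispatcher`), not the actual NodeClient/ActionListener code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical stand-in for Elasticsearch's ActionListener; this is a
// self-contained sketch, not the actual NodeClient code.
interface Listener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

class ShutdownSafeDispatcher {
    private final ExecutorService executor;

    ShutdownSafeDispatcher(ExecutorService executor) {
        this.executor = executor;
    }

    // Dispatch the callback on the pool; if the pool was shut down (the node
    // is closed), resolve the listener on the calling thread instead of
    // silently dropping it, which is the behavior the PR description argues for.
    <T> void notifyResponse(Listener<T> listener, T response) {
        try {
            executor.execute(() -> listener.onResponse(response));
        } catch (RejectedExecutionException e) {
            listener.onFailure(e); // current-thread fallback: no leaked listener
        }
    }
}
```

With a pool that is already shut down, the listener still completes (with a failure) on the calling thread rather than leaking.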

@original-brownbear original-brownbear added >non-issue :Distributed/Network Http and internode communication implementations v8.0.0 v7.8.0 labels Apr 23, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Network)

@ywelsch (Contributor) left a comment

I'm undecided whether we should do this change. The shut down behavior of ES / transport service makes sense as it is, but the integration tests (with the node client) are directly reaching into the node's innards to execute stuff, with requests in particular not being routed through the transport service (relates also #55494).

@original-brownbear (Member, Author)

@ywelsch annoyingly enough, the problem here is precisely the fact that this request goes through the TransportService for the local node (because it's a broadcast request) and that then never finishes.
In other tests, we're saved from this problem because most of the requests we do are transport master node actions (TMA), which will throw a NodeClosedException when running into a closed node.
I see the point that this is a test-only issue. On the other hand, all transport master node actions will actually throw the node closed exception on the current thread as well ... so you could argue that this change at least makes the NodeClient behavior consistent in that sense I guess?

We had this issue before (service does cleanup/exception-handling on the generic pool) in #46178 where we fixed it by waiting for all actions to finish when closing the affected service. This isn't really a fix here though because we simply have to do something about local requests to a shut-down transport service IMO.
The core of the issue, IMO, is simply that a stopped transport service (if its thread pool is shut down as well) will quietly swallow all local requests' listeners. This seems wrong (even if the only practical problem is that it's very trappy in tests), and IMO my change is the only possible fix: you simply have to resolve the listener on the current thread in this case, because you can't rely on the pool?

@ywelsch (Contributor)

ywelsch commented Apr 24, 2020

I was wondering if we should instead add something to the test infra in InternalTestCluster instead, so that we track all pending requests, and throw a NodeClosedException when the node is closed.

@original-brownbear (Member, Author)

We could do that via a FilterClient I think (just have it wrap all listeners with execute once and track them), that wouldn't be too hard either. I still find it weird though, that a closed transport service would become a black hole for listeners? I'm fine fixing it in the client though since it's a test only issue in practice :)
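A sketch of that FilterClient-style idea: wrap every listener so it fires at most once, track the in-flight ones, and fail whatever is still pending when the node closes. All names below (`Listener`, `TrackingClient`) are hypothetical stand-ins invented for illustration, not the real FilterClient API.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for Elasticsearch's ActionListener.
interface Listener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

// Sketch of the FilterClient-style tracking wrapper: listeners complete at
// most once, and close() fails all still-pending ones. Names are invented.
class TrackingClient {
    private final Set<Listener<?>> pending = ConcurrentHashMap.newKeySet();

    <T> Listener<T> wrap(Listener<T> delegate) {
        Listener<T> tracked = new Listener<T>() {
            @Override
            public void onResponse(T response) {
                if (pending.remove(this)) {      // execute-once guard
                    delegate.onResponse(response);
                }
            }
            @Override
            public void onFailure(Exception e) {
                if (pending.remove(this)) {
                    delegate.onFailure(e);
                }
            }
        };
        pending.add(tracked);
        return tracked;
    }

    // Called when the node shuts down: every in-flight listener is resolved
    // with a "node closed" failure instead of being silently dropped.
    void close() {
        for (Listener<?> l : pending) {
            l.onFailure(new IllegalStateException("node closed"));
        }
    }
}
```

A late completion arriving after close() is a no-op, so the delegate never fires twice.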

@ywelsch (Contributor)

ywelsch commented Apr 24, 2020

Let's see what @tbrooks8 thinks.

I still find it weird though, that a closed transport service would become a black hole for listeners?

It's not that a closed transport service is becoming a black hole for listeners when the transport service is shut down (we handle that correctly AFAICS), but that when the threadpools are shut down, there is no guarantee that all tasks on the thread pool (queued up or not) will have their listeners invoked. That's a guarantee we don't have today, and has nothing to do with the transport.
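The thread-pool point can be demonstrated with a plain ExecutorService, independent of anything transport-related: a task still sitting in the queue when shutdownNow() is called is returned, not run, so any listener completion living inside it is simply lost. Everything below is a generic JDK illustration, not Elasticsearch code.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Demonstrates that shutting a pool down gives no guarantee that queued
// tasks (and the listener completions inside them) ever run.
class QueuedTaskDrop {
    static int droppedTaskCount() throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch workerBusy = new CountDownLatch(1);
        // Occupy the single worker so the next task has to queue.
        pool.execute(() -> {
            workerBusy.countDown();
            try { Thread.sleep(10_000); } catch (InterruptedException ignored) { }
        });
        workerBusy.await();
        AtomicBoolean listenerCompleted = new AtomicBoolean(false);
        pool.execute(() -> listenerCompleted.set(true)); // the "listener", now queued
        List<Runnable> dropped = pool.shutdownNow();     // queue drained, not run
        // listenerCompleted is never set: the completion was silently lost.
        return listenerCompleted.get() ? -1 : dropped.size();
    }
}
```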

@original-brownbear (Member, Author)

we handle that correctly AFAICS

Not for local node requests. A request to a stopped transport will always get to org.elasticsearch.transport.TransportService#sendRequestInternal, which will throw a NodeClosedException that is then handled on the generic pool (i.e. never, if the pool is shut down as well). For other, non-local connections we have some handling around closing them that is good enough (it executes the not-connected exception straight on the calling thread), but for a local node request we don't. Another alternative here would be to add special-case handling for local node requests and throw for those right when getting the connection if the transport is shut down, to make local and remote requests behave the same. That should cover all cases as well.

To be honest, the fact that we handle some exceptions (NodeNotConnected) straight on the calling thread (https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/transport/TransportService.java#L530) but pass other exceptions on to the generic pool seems pointless (if stack overflow was an issue, then why isn't it for node-not-connected?).
=> Since we never see any stack-overflow issues from node-not-connected (at least I haven't), I wonder if we shouldn't just get rid of the forking to the generic pool in the error handling here in all cases and keep things simple?
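The "special-case local requests" alternative mentioned above could look roughly like this: check the service state before dispatching and fail the listener synchronously on the calling thread, mirroring the node-not-connected path. `LocalRequestGuard` and its methods are invented for illustration; the real logic would live in the TransportService internals.

```java
// Hypothetical stand-in for Elasticsearch's ActionListener.
interface Listener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

// Sketch: fail local requests on the calling thread when the service is
// stopped, instead of forking the error to the generic pool. Names invented.
class LocalRequestGuard {
    private volatile boolean started = true;

    void stop() { started = false; }

    <T> void sendLocalRequest(Listener<T> listener, Runnable dispatch) {
        if (!started) {
            // Mirrors the node-not-connected path: synchronous failure,
            // no dependence on a pool that may already be shut down.
            listener.onFailure(new IllegalStateException("transport stopped"));
            return;
        }
        dispatch.run(); // normal path: hand the request to the transport
    }
}
```

This would make local and remote requests behave the same when the service is shut down, which is the consistency argument made earlier in the thread.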

@ywelsch (Contributor) left a comment

LGTM

@original-brownbear original-brownbear merged commit d7bc3dd into elastic:master Apr 28, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 28, 2020
original-brownbear added a commit that referenced this pull request Apr 28, 2020
@original-brownbear original-brownbear restored the 55660 branch August 6, 2020 18:22
Successfully merging this pull request may close these issues:
[CI] org.elasticsearch.discovery.DiskDisruptionIT.testGlobalCheckpointIsSafe
4 participants