
Fix TransportMasterNodeAction not Retrying NodeClosedException #51325

Conversation

original-brownbear
Member

Added node closed exception to the retryable remote exceptions.

Follow-up to the discussion in #51270 (review).
I'd still like to keep the workaround from #51270, though, just in case we missed a spot. Otherwise any exception on the failed master that arrives with a delay and is not retried will deadlock the snapshot.

I'll open a follow-up to deal with the possibility of this happening due to the circuit breaker killing the status update on master.

@original-brownbear original-brownbear added >non-issue :Distributed/Network Http and internode communication implementations v8.0.0 v7.7.0 labels Jan 22, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Network)

@@ -178,7 +178,7 @@ protected void doStart(ClusterState clusterState) {
     @Override
     public void handleException(final TransportException exp) {
         Throwable cause = exp.unwrapCause();
-        if (cause instanceof ConnectTransportException) {
+        if (cause instanceof ConnectTransportException || cause instanceof NodeClosedException) {
Contributor

Can you write an integration test that shows that this solves an issue?

Perhaps a test that restarts the current master while concurrently sending some TransportMasterNodeReadAction (e.g. cluster health).

I'm surprised that we have not seen this issue in any of our integration tests, where we sometimes restart nodes (but perhaps we don't do that while concurrently issuing master-level requests).
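
To illustrate the kind of test meant here, a rough sketch could look like the following. This is hypothetical only, written from memory of the internal test framework; it is not the test that was later committed, and the helper names and signatures are assumptions:

    import org.elasticsearch.action.ActionFuture;
    import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.test.ESIntegTestCase;
    import org.elasticsearch.test.InternalTestCluster;

    public class MasterRestartRetryIT extends ESIntegTestCase {

        public void testMasterLevelReadSurvivesMasterRestart() throws Exception {
            // Two master-eligible nodes so a new master can be elected after the restart.
            internalCluster().startMasterOnlyNodes(2);
            internalCluster().startDataOnlyNode();
            final String masterName = internalCluster().getMasterName();

            // Send a TransportMasterNodeReadAction (cluster health) concurrently with the restart.
            final ActionFuture<ClusterHealthResponse> healthFuture =
                client().admin().cluster().prepareHealth().execute();

            internalCluster().restartNode(masterName, InternalTestCluster.EMPTY_CALLBACK);

            // With NodeClosedException treated as retryable, the request should be retried
            // against the newly elected master instead of failing back to the client.
            assertNotNull(healthFuture.actionGet(TimeValue.timeValueSeconds(30)));
        }
    }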

Member Author

There we go: 3378d3d

A bit of a dirty test, but it reliably reproduces the issue without any other hacks :)

I've researched this a little and the problem seems to be isolated to 7.x+. When shutting down a master there is a short window during which:

  1. We will get the cluster state without master node from org.elasticsearch.cluster.coordination.Coordinator#clusterStateWithNoMasterBlock
  2. That then causes the master action on the current master node to fail with NotMasterException and to register a cluster state observer that retries once a new master appears. That new master never comes; instead the wait is shut down with a NodeClosedException when the cluster service is shut down (roughly the observer shape sketched below)
  3. Client receives NodeClosedException back from the master

=> this fix still seems fine: if we retry on NodeClosedException because we can interpret it as the master node going away, we're good IMO.
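
For context, the shut-down path in step 2 corresponds roughly to the observer that the retry registers. A simplified sketch from memory follows; field and method names are approximate and not quoted from the actual source:

    // How the pending retry is converted into the NodeClosedException the client sees:
    observer.waitForNextChange(new ClusterStateObserver.Listener() {
        @Override
        public void onNewClusterState(ClusterState state) {
            doStart(state); // a new master showed up, resend the request
        }

        @Override
        public void onClusterServiceClose() {
            // the node is shutting down before a new master appears, so the waiting
            // request is failed with NodeClosedException
            listener.onFailure(new NodeClosedException(clusterService.localNode()));
        }

        @Override
        public void onTimeout(TimeValue timeout) {
            listener.onFailure(new MasterNotDiscoveredException());
        }
    }, statePredicate);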

Contributor

the problem seems to be isolated to 7.x+

Does that also include 7.0, 7.1, ...?
I would like to understand why this is not an issue in earlier versions.

Member Author

For snapshots this only started showing up because I inadvertently disabled the SnapshotShardsService's own retry in #50788.

I can also reproduce this failure back to 7.1, though interestingly enough it seems to be a lot less likely in 7.1 than in 7.6 (maybe that's due to some locking we removed from the internal test cluster operations, but I can't tell right now).

I pushed 646823d to slow things down a little more and make the test reproduce more reliably. Without that change it takes a pretty large number of iterations for it to fail on 7.1.

@ywelsch
Contributor

ywelsch commented Jan 23, 2020

I've marked this as a bug, as it prevents proper failover behavior in case of node restarts.

Contributor

@henningandersen henningandersen left a comment

Thanks Armin, left a single comment, otherwise looking good.

@@ -178,7 +178,7 @@ protected void doStart(ClusterState clusterState) {
     @Override
     public void handleException(final TransportException exp) {
         Throwable cause = exp.unwrapCause();
-        if (cause instanceof ConnectTransportException) {
+        if (cause instanceof ConnectTransportException || cause instanceof NodeClosedException) {
Contributor

I think we should separate out cases where the NodeClosedException originates on the local node (for instance the case where the transport stopped, which is also relayed as a NodeClosedException). There should be no need to retry those.

Adding a check that the NodeClosedException is wrapped in a RemoteTransportException would ensure we only retry in true client cases.
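
Concretely, the narrower check being suggested would look something like the following. This is a sketch of the idea only, not the exact change that was pushed; retry, masterChangePredicate and listener are names from the surrounding TransportMasterNodeAction code as I recall them:

    @Override
    public void handleException(final TransportException exp) {
        final Throwable cause = exp.unwrapCause();
        // Only treat NodeClosedException as retryable when it came back from the remote
        // master (wrapped in a RemoteTransportException); a locally generated one means
        // this node itself is shutting down, so retrying is pointless.
        final boolean remoteMasterClosed =
            exp instanceof RemoteTransportException && cause instanceof NodeClosedException;
        if (cause instanceof ConnectTransportException || remoteMasterClosed) {
            retry(cause, masterChangePredicate);
        } else {
            listener.onFailure(exp);
        }
    }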

Member Author

Thanks Henning, I pushed 1d07a26 :)

Contributor

@ywelsch ywelsch Jan 24, 2020

AFAICS, TransportException is not an ElasticsearchWrapperException (in contrast to RemoteTransportException), and therefore unwrapCause will not yield the NodeClosedException in case of a local transport exception.

This means that the code as before would have been fine, I think.
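
For reference, the unwrap behavior being relied on here is roughly the following (a simplified sketch, not the actual ExceptionsHelper source): only exceptions marked as ElasticsearchWrapperException are unwrapped, so a plain TransportException is returned as-is and its NodeClosedException cause never surfaces.

    // Simplified sketch of the unwrap rule (the real implementation also guards against
    // cycles and excessive nesting): RemoteTransportException implements
    // ElasticsearchWrapperException and gets unwrapped; a plain TransportException does not.
    static Throwable unwrapCause(Throwable t) {
        Throwable result = t;
        while (result instanceof ElasticsearchWrapperException && result.getCause() != null) {
            result = result.getCause();
        }
        return result;
    }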

Member Author

I think we can run into a SendRequestTransportException with a NodeClosedException cause, and SendRequestTransportException is an ElasticsearchWrapperException?

Member Author

^^ seems to have happened here https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/14196/consoleText

  1> [2020-01-21T11:51:16,833][WARN ][o.e.s.SnapshotShardsService] [node_td4] [test-repo:test-snap/SQinm62PQkmhonsriaGPUw] [ShardSnapshotStatus[state=FAILED, nodeId=ULGA-lGmRfiN3aM53N3qqg, reason=[test-idx/6AougmeaSJCvxG4mFvfRZA][[test-idx][11]] org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Aborted
  1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1501)
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:341)
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:287)
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:629)
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  1> 	at java.base/java.lang.Thread.run(Thread.java:834)
  1> , generation=null]] failed to update snapshot state
  1> org.elasticsearch.transport.SendRequestTransportException: [node_td4][127.0.0.1:32974][internal:cluster/snapshot/update_snapshot_status]
  1> 	at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:644) ~[main/:?]
  1> 	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:545) ~[main/:?]
  1> 	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:520) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.lambda$sendSnapshotShardUpdate$2(SnapshotShardsService.java:492) ~[main/:?]
  1> 	at org.elasticsearch.transport.TransportRequestDeduplicator.executeOnce(TransportRequestDeduplicator.java:52) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.sendSnapshotShardUpdate(SnapshotShardsService.java:478) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.notifyFailedSnapshotShard(SnapshotShardsService.java:472) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService$1.onFailure(SnapshotShardsService.java:305) ~[main/:?]
  1> 	at org.elasticsearch.action.ActionListener$5.onFailure(ActionListener.java:256) ~[main/:?]
  1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1624) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:341) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:287) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:629) ~[main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1> 	at java.lang.Thread.run(Thread.java:834) [?:?]
  1> Caused by: org.elasticsearch.node.NodeClosedException: node closed {node_td4}{ULGA-lGmRfiN3aM53N3qqg}{MhSHeUC3SG2Qv9JTAFNk9Q}{127.0.0.1}{127.0.0.1:32974}{di}
  1> 	at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:626) ~[main/:?]
  1> 	... 15 more

Contributor

alright, the tests and you win

Contributor

@ywelsch ywelsch left a comment

LGTM

@original-brownbear
Member Author

Thanks Yannick + Henning!

@original-brownbear original-brownbear merged commit db48029 into elastic:master Jan 24, 2020
@original-brownbear original-brownbear deleted the retry-sending-snapshot-shard-status-updates branch January 24, 2020 17:20
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 24, 2020
…ic#51325)

Added node closed exception to the retryable remote exceptions as it's possible to run into this exception instead of a connect exception when the master node is just shutting down but still responding to requests.
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 24, 2020
…ic#51325)

Added node closed exception to the retryable remote exceptions as it's possible to run into this exception instead of a connect exception when the master node is just shutting down but still responding to requests.
original-brownbear added a commit that referenced this pull request Jan 24, 2020
… (#51437)

Added node closed exception to the retryable remote exceptions as it's possible to run into this exception instead of a connect exception when the master node is just shutting down but still responding to requests.
original-brownbear added a commit that referenced this pull request Jan 24, 2020
… (#51438)

Added node closed exception to the retryable remote exceptions as it's possible to run into this exception instead of a connect exception when the master node is just shutting down but still responding to requests.
Labels
>bug :Distributed/Network Http and internode communication implementations v7.6.0 v8.0.0-alpha1