
node-left ... reason: disconnected triggered by earlier node-left event rather than network issue #67873

Closed
DaveCTurner opened this issue Jan 24, 2021 · 41 comments · Fixed by #77295
Labels
>bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Meta label for distributed team

Comments

@DaveCTurner
Contributor

If a node leaves the cluster with reason: disconnected then this means something closed one of the TCP channels initiated by the master targeting the departing node. Nodes never deliberately close any incoming connections, and the master doesn't spontaneously close any outgoing connections, so we usually consider reason: disconnected to be evidence of a network issue.

However, I recently observed a node-left event with reason: disconnected on a cluster running entirely on localhost, which therefore rules out a network issue. This was the second of two node-left events for the same node within a few seconds; the first had reason followers check retry count exceeded. I have so far failed to reproduce this situation.

My best guess regarding the cause is that we hadn't closed everything by the time the node rejoined, so the node-join task executes and only then does the channel get closed and the master notified that this node disconnected. (io.netty.channel.AbstractChannelHandlerContext#close() is fire-and-forget: it just adds a task to be processed by the event loop, but that particular event loop was blocked, so the closing was delayed.)
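As an aside, here is a minimal, hypothetical Netty sketch (standalone, not Elasticsearch code) of that fire-and-forget behaviour: close() returns immediately and the real work runs as a task on the channel's event loop, so a blocked event loop delays both the close and any disconnect notification.

import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;

class FireAndForgetClose {
    // Hypothetical helper: closing a channel only schedules the close.
    static void closeAndLog(Channel channel) {
        ChannelFuture closeFuture = channel.close(); // returns immediately, work is queued on the event loop
        closeFuture.addListener((ChannelFutureListener) future ->
            // runs later, on the event loop, once the close has actually happened
            System.out.println("channel closed, success=" + future.isSuccess()));
        // if the event loop is blocked on another task, the close (and the
        // disconnect handlers that fire from it) are delayed until it frees up
    }
}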

I don't think the behaviour is particularly bad (the node already left the cluster), but the bug is that we report this as a network event when really the network is fine.

@DaveCTurner DaveCTurner added >bug :Distributed/Network Http and internode communication implementations labels Jan 24, 2021
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Jan 24, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor Author

I managed to catch another instance of this with better logging. It looks like we have a race in the coordination layer so I'm relabelling this. I suspect the network-layer issue I mentioned may also need addressing before this is fixed. Here's the breakdown:

The node is dropped from the cluster due to health check failures

[2021-01-25T12:34:12,945][INFO ][o.e.c.s.MasterService    ] [node-0] node-left[{node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true} reason: followers check retry count exceeded], term: 18, version: 112, delta: removed {{node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}}
[2021-01-25T12:34:12,948][TRACE][o.e.c.c.C.CoordinatorPublication] [node-0] publishing PublishRequest{term=18, version=112, state=cluster uuid: sXdBCtiZQnehqQYIEh93ng [committed: true]

The node starts to rejoin the cluster while the removal publication is still in flight. NB we ensure that the transport service is connected to this node as soon as we receive the join request.

[2021-01-25T12:34:15,507][TRACE][o.e.t.T.tracer           ] [node-1] [1518][internal:cluster/coordination/join] sent to [{node-0}{2HBBTGFHRg69qZq_8Yt14A}{QxsZmgTAS8amf6lIvq29Qg}{127.0.0.1}{127.0.0.1:9300}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}] (timeout: [null])
[2021-01-25T12:34:20,359][TRACE][o.e.t.T.tracer           ] [node-0] [1518][internal:cluster/coordination/join] received request
[2021-01-25T12:34:20,362][TRACE][o.e.c.c.Coordinator      ] [node-0] handleJoinRequest: as LEADER, handling JoinRequest{sourceNode={node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}, minimumTerm=18, optionalJoin=Optional[Join{term=18, lastAcceptedTerm=15, lastAcceptedVersion=109, sourceNode={node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}, targetNode={node-0}{2HBBTGFHRg69qZq_8Yt14A}{QxsZmgTAS8amf6lIvq29Qg}{127.0.0.1}{127.0.0.1:9300}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}}]}
[2021-01-25T12:34:20,365][TRACE][o.e.t.T.tracer           ] [node-0] [4023][internal:cluster/coordination/join/validate] sent to [{node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}] (timeout: [null])
[2021-01-25T12:34:20,365][TRACE][o.e.t.T.tracer           ] [node-1] [4023][internal:cluster/coordination/join/validate] received request
[2021-01-25T12:34:20,365][TRACE][o.e.t.T.tracer           ] [node-1] [4023][internal:cluster/coordination/join/validate] sent response
[2021-01-25T12:34:20,365][TRACE][o.e.t.T.tracer           ] [node-0] [4023][internal:cluster/coordination/join/validate] received response from [{node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}]
[2021-01-25T12:34:20,366][DEBUG][o.e.c.c.CoordinationState] [node-0] handleJoin: added join Join{term=18, lastAcceptedTerm=15, lastAcceptedVersion=109, sourceNode={node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}, targetNode={node-0}{2HBBTGFHRg69qZq_8Yt14A}{QxsZmgTAS8amf6lIvq29Qg}{127.0.0.1}{127.0.0.1:9300}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}} from [{node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}] for election, electionWon=true lastAcceptedTerm=18 lastAcceptedVersion=112

The removal publication completes, so the master applies its state and disconnects from the node.

[2021-01-25T12:34:20,378][TRACE][o.e.c.c.C.CoordinatorPublication] [node-0] onPossibleCompletion: [Publication{term=18, version=112}] was successful
[2021-01-25T12:34:20,378][INFO ][o.e.c.s.ClusterApplierService] [node-0] removed {{node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}}, term: 18, version: 112, reason: Publication{term=18, version=112}
[2021-01-25T12:34:20,379][TRACE][o.e.t.ClusterConnectionManager] [node-0] unregistering {node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true} after connection close and marking as disconnected
[2021-01-25T12:34:20,379][TRACE][o.e.c.c.LeaderChecker    ] [node-0] disconnect event ignored for {node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}, no check scheduler
[2021-01-25T12:34:20,379][DEBUG][o.e.t.TcpTransport       ] [node-0] closed transport connection [37] to [{node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}] with age [71630ms]
[2021-01-25T12:34:20,379][DEBUG][o.e.c.NodeConnectionsService] [node-0] disconnected from {node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}
[2021-01-25T12:34:20,385][DEBUG][o.e.c.c.C.CoordinatorPublication] [node-0] publication ended successfully: Publication{term=18, version=112}

The master now processes the join, informing the FollowersChecker to start checking this node.

[2021-01-25T12:34:20,386][INFO ][o.e.c.s.MasterService    ] [node-0] node-join[{node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true} join existing leader], term: 18, version: 113, delta: added {{node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}}
[2021-01-25T12:34:20,387][TRACE][o.e.c.c.CoordinationState] [node-0] handleClientValue: processing request for version [113] and term [18]
[2021-01-25T12:34:20,387][TRACE][o.e.c.c.LeaderChecker    ] [node-0] setCurrentNodes: nodes: 
   {node-2}{IRxdzYRsSfeeGB3pAmhJkw}{73F4OpUQQvqt3JZTxlUYUQ}{127.0.0.1}{127.0.0.1:9302}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}
   {node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}
   {node-0}{2HBBTGFHRg69qZq_8Yt14A}{QxsZmgTAS8amf6lIvq29Qg}{127.0.0.1}{127.0.0.1:9300}{cdhilmrstw}{ml.machine_memory=67337220096, xpack.installed=true, transform.node=true, ml.max_open_jobs=20, ml.max_jvm_size=1073741824}, local, master

However, we're no longer connected to the node, so the first follower check fails immediately.

[2021-01-25T12:34:20,387][TRACE][o.e.c.c.FollowersChecker ] [node-0] handleWakeUp: checking {node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true} with FollowerCheckRequest{term=18, sender={node-0}{2HBBTGFHRg69qZq_8Yt14A}{QxsZmgTAS8amf6lIvq29Qg}{127.0.0.1}{127.0.0.1:9300}{cdhilmrstw}{ml.machine_memory=67337220096, xpack.installed=true, transform.node=true, ml.max_open_jobs=20, ml.max_jvm_size=1073741824}}
[2021-01-25T12:34:20,387][INFO ][o.e.c.c.FollowersChecker ] [node-0] FollowerChecker{discoveryNode={node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=2} disconnected [1]
org.elasticsearch.transport.NodeNotConnectedException: [node-1][127.0.0.1:9301] Node not connected

The node is dropped from the cluster again, this time with reason: disconnected.

[2021-01-25T12:34:20,450][INFO ][o.e.c.s.MasterService    ] [node-0] node-left[{node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true} reason: disconnected [1]], term: 18, version: 114, delta: removed {{node-1}{HOyGmEBiSV-CzrOVID3b3w}{Gmqo4hLyS5yxjAePFwKMVA}{127.0.0.1}{127.0.0.1:9301}{cdhilmrstw}{ml.machine_memory=67337220096, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}}
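To make that interleaving easier to follow, here is a deliberately simplified, hypothetical sketch (plain Java, not the real Coordinator or transport code) of the ordering: the join leaves the master connected to the node, the older removal publication then disconnects it, and the node-join task's first follower check finds no connection.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NodeLeftRaceSketch {
    public static void main(String[] args) {
        Map<String, String> connections = new ConcurrentHashMap<>();

        // 1. the join request arrives while the removal publication is in flight:
        //    the transport service is connected to node-1 for the join
        connections.put("node-1", "transport connection [37]");

        // 2. the removal publication completes; applying the node-left state
        //    disconnects node-1, closing the connection the join relies on
        connections.remove("node-1");

        // 3. the node-join task runs and the follower checker wakes up, but the
        //    connection is gone, so the first check fails immediately and the
        //    node is removed again with "reason: disconnected"
        if (!connections.containsKey("node-1")) {
            System.out.println("node-left ... reason: disconnected");
        }
    }
}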

@DaveCTurner DaveCTurner added :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. and removed :Distributed/Network Http and internode communication implementations labels Jan 25, 2021
@plaformsre

Hi @DaveCTurner,
Is a fix planned for this bug? We came across a similar issue with the same error messages.

@DaveCTurner
Contributor Author

No, sorry, there's definitely a bug here but it's in a very tricky area and the fix is very low priority: you can only hit it if your cluster is already broken and the only effect in real terms is a bit of extra logging. A PR would be welcome, if it's important to you.

@mosespx

mosespx commented Jun 18, 2021

Hi @DaveCTurner, we have also come across this issue with the same error messages. Can you please share the ES version you saw it on? Is there any fix or workaround you have found?
Thanks.

@DaveCTurner
Contributor Author

I imagine this affects all 7.x versions, at least the non-EOL ones, since there haven't been any recent changes in this area. The only real effect is a bit of extra logging, so you can work around this by ignoring those messages.

@howardhuanghua
Contributor

Hi @DaveCTurner, we seem to have come across a similar issue on 7.5.1. Would you please help check whether it has the same root cause?

One of the data nodes was removed by the master due to a health check failure.
Master node log (some other nodes applied the cluster state update slowly, which delayed the removal task by 10s):

[2021-07-14T14:36:20,456][INFO ][o.e.c.s.MasterService    ] [1612281926004251932] node-left[{1626106122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10.10.119.90}{10.10.119.90:9300}{dil}{ml.machine
_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.134, temperature=warm, ml.max_open_jobs=20, region=4} reason: followers check retry count exceeded], term: 18, version: 84154
3, delta: removed {{1626106122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10.10.119.90}{10.10.119.90:9300}{dil}{ml.machine_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004,
 ip=9.20.97.134, temperature=warm, ml.max_open_jobs=20, region=4}}
[2021-07-14T14:36:22,292][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [1612281926004251932] Get stats for datafeed '_all'
[2021-07-14T14:36:24,650][INFO ][o.e.m.j.JvmGcMonitorService] [1612281926004251932] [gc][9452212] overhead, spent [452ms] collecting in the last [1.4s]
[2021-07-14T14:36:30,544][INFO ][o.e.c.c.C.CoordinatorPublication] [1612281926004251932] after [10s] publication of cluster state version [841543] is still waiting for {1624291717006281532}{LFjWmM2kQ5Cfx-ATZPXm
Yw}{2_hu72kCTa6VSApgz197pA}{10.10.114.80}{10.10.114.80:9300}{dil}{ml.machine_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.137, temperature=warm, ml.max_open_jobs=20, regio
n=4} [SENT_APPLY_COMMIT], {1626106122000010132}{mopIWQfMS-m6QinbamjZUQ}{y_xoDL7MQwqCzhF-saLrAQ}{10.10.119.14}{10.10.119.14:9300}{dil}{ml.machine_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=
200004, ip=9.20.100.215, temperature=warm, ml.max_open_jobs=20, region=4} [SENT_APPLY_COMMIT], {1626106122000010432}{2U4dWaFsQO-leJ3WE_8V9w}{w_UInYMkRHWdfoNJrbopkw}{10.10.120.146}{10.10.120.146:9300}{dil}{ml.ma
chine_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.100.10, temperature=warm, ml.max_open_jobs=20, region=4} [SENT_APPLY_COMMIT]
[2021-07-14T14:36:31,510][INFO ][o.e.c.s.ClusterApplierService] [1612281926004251932] removed {{1626106122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10.10.119.90}{10.10.119.90:9300}{dil}{ml.mach
ine_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.134, temperature=warm, ml.max_open_jobs=20, region=4}}, term: 18, version: 841543, reason: Publication{term=18, version=84
1543}

The removed data node log:

[2021-07-14T14:36:24,656][INFO ][o.e.c.c.Coordinator      ] [1626106122000009932] master node [{1612281926004251932}{kUvamGdrQbOD5xBHUPw_Xg}{NIjnKjjJSByvhxYGrcYvdw}{10.10.120.93}{10.10.120.93:9300}{lm}{ml.machi
ne_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.76.20, temperature=hot, ml.max_open_jobs=20, region=4}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{1612281926004251932}{kUvamGdrQbOD5xBHUPw_Xg}{NIjnKjjJSByvhxYGrcYvdw}{10.10.120.93}{10.10.120.93:9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack
.installed=true, set=200004, ip=9.20.76.20, temperature=hot, ml.max_open_jobs=20, region=4}] failed [3] consecutive checks
        at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:277) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1120) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1120) ~[elasticsearch-7.5.1.jar:7.5.1]

After the above node was removed, it could not re-join the cluster anymore; it continuously ran master discovery, and even though it discovered the right list of master-eligible nodes, it could not join:

[2021-07-14T14:36:27,029][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [1626106122000009932] no known master node, scheduling a retry
[2021-07-14T14:36:34,671][WARN ][o.e.c.c.ClusterFormationFailureHelper] [1626106122000009932] master not discovered yet: have discovered [{1626106122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10.10.119.90}{10.10.119.90:9300}{dil}{ml.machine_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.134, temperature=warm, ml.max_open_jobs=20, region=4}, {1612281926004251832}{ngVUygwVT7GSlDP9qrCQYA}{ZXkoaSvTRwGeHzFeJBGV_A}{10.10.118.143}{10.10.118.143:9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.65.142, temperature=hot, ml.max_open_jobs=20, region=4}, {1612281926004252032}{pfbf-PI1RH-BjCLtT_iLlQ}{kavU1W4JRZ-eFyCTC16Ecg}{10.10.112.84}{10.10.112.84:9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.76.89, temperature=hot, ml.max_open_jobs=20, region=4}, {1612281926004251932}{kUvamGdrQbOD5xBHUPw_Xg}{NIjnKjjJSByvhxYGrcYvdw}{10.10.120.93}{10.10.120.93:9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.76.20, temperature=hot, ml.max_open_jobs=20, region=4}]; discovery will continue using [10.10.112.84:9300, 10.10.118.143:9300, 10.10.120.93:9300] from hosts providers and [{1612281926004251832}{ngVUygwVT7GSlDP9qrCQYA}{ZXkoaSvTRwGeHzFeJBGV_A}{10.10.118.143}{10.10.118.143:9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.65.142, temperature=hot, ml.max_open_jobs=20, region=4}, {1612281926004252032}{pfbf-PI1RH-BjCLtT_iLlQ}{kavU1W4JRZ-eFyCTC16Ecg}{10.10.112.84}{10.10.112.84:9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.76.89, temperature=hot, ml.max_open_jobs=20, region=4}, {1612281926004251932}{kUvamGdrQbOD5xBHUPw_Xg}{NIjnKjjJSByvhxYGrcYvdw}{10.10.120.93}{10.10.120.93:9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.76.20, temperature=hot, ml.max_open_jobs=20, region=4}] from last-known cluster state; node term 18, last-accepted version 841542 in term 18
[2021-07-14T14:36:44,672][WARN ][o.e.c.c.ClusterFormationFailureHelper] [1626106122000009932] master not discovered yet: have discovered [{1626106122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10.10.119.90}{10.10.119.90:9300}{dil}{ml.machine_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.134, temperature=warm, ml.max_open_jobs=20, region=4}, {1612281926004251832}{ngVUygwVT7GSlDP9qrCQYA}{ZXkoaSvTRwGeHzFeJBGV_A}{10.10.118.143}{10.10.118.143:9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.65.142, temperature=hot, ml.max_open_jobs=20, region=4}, {1612281926004252032}{pfbf-PI1RH-BjCLtT_iLlQ}{kavU1W4JRZ-eFyCTC16Ecg}{10.10.112.84}{10.10.112.84:9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=200
004, ip=9.20.76.89, temperature=hot, ml.max_open_jobs=20, region=4}, {1612281926004251932}{kUvamGdrQbOD5xBHUPw_Xg}{NIjnKjjJSByvhxYGrcYvdw}{10.10.120.93}{10.10.120.93:9300}{lm}{ml.machine_memory=33389010944, rac
k=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.76.20, temperature=hot, ml.max_open_jobs=20, region=4}]; discovery will continue using [10.10.112.84:9300, 10.10.118.143:9300, 10.10.120.93:9300] from h
osts providers and [{1612281926004251832}{ngVUygwVT7GSlDP9qrCQYA}{ZXkoaSvTRwGeHzFeJBGV_A}{10.10.118.143}{10.10.118.143:9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=20000
4, ip=9.20.65.142, temperature=hot, ml.max_open_jobs=20, region=4}, {1612281926004252032}{pfbf-PI1RH-BjCLtT_iLlQ}{kavU1W4JRZ-eFyCTC16Ecg}{10.10.112.84}{10.10.112.84:9300}{lm}{ml.machine_memory=33389010944, rack
=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.76.89, temperature=hot, ml.max_open_jobs=20, region=4}, {1612281926004251932}{kUvamGdrQbOD5xBHUPw_Xg}{NIjnKjjJSByvhxYGrcYvdw}{10.10.120.93}{10.10.120.93:
9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.76.20, temperature=hot, ml.max_open_jobs=20, region=4}] from last-known cluster state; node term 18, last-ac
cepted version 841542 in term 18
[2021-07-14T14:36:54,674][WARN ][o.e.c.c.ClusterFormationFailureHelper] [1626106122000009932] master not discovered yet: have discovered [{1626106122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10
.10.119.90}{10.10.119.90:9300}{dil}{ml.machine_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.134, temperature=warm, ml.max_open_jobs=20, region=4}, {1612281926004251832}{ng
VUygwVT7GSlDP9qrCQYA}{ZXkoaSvTRwGeHzFeJBGV_A}{10.10.118.143}{10.10.118.143:9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.65.142, temperature=hot, ml.max_o
pen_jobs=20, region=4}, {1612281926004252032}{pfbf-PI1RH-BjCLtT_iLlQ}{kavU1W4JRZ-eFyCTC16Ecg}{10.10.112.84}{10.10.112.84:9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=200
004, ip=9.20.76.89, temperature=hot, ml.max_open_jobs=20, region=4}, {1612281926004251932}{kUvamGdrQbOD5xBHUPw_Xg}{NIjnKjjJSByvhxYGrcYvdw}{10.10.120.93}{10.10.120.93:9300}{lm}{ml.machine_memory=33389010944, rac
k=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.76.20, temperature=hot, ml.max_open_jobs=20, region=4}]; discovery will continue using [10.10.112.84:9300, 10.10.118.143:9300, 10.10.120.93:9300] from h
osts providers and [{1612281926004251832}{ngVUygwVT7GSlDP9qrCQYA}{ZXkoaSvTRwGeHzFeJBGV_A}{10.10.118.143}{10.10.118.143:9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=20000
4, ip=9.20.65.142, temperature=hot, ml.max_open_jobs=20, region=4}, {1612281926004252032}{pfbf-PI1RH-BjCLtT_iLlQ}{kavU1W4JRZ-eFyCTC16Ecg}{10.10.112.84}{10.10.112.84:9300}{lm}{ml.machine_memory=33389010944, rack
=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.76.89, temperature=hot, ml.max_open_jobs=20, region=4}, {1612281926004251932}{kUvamGdrQbOD5xBHUPw_Xg}{NIjnKjjJSByvhxYGrcYvdw}{10.10.120.93}{10.10.120.93:
9300}{lm}{ml.machine_memory=33389010944, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.76.20, temperature=hot, ml.max_open_jobs=20, region=4}] from last-known cluster state; node term 18, last-ac
cepted version 841542 in term 18

And the master node continuously handled node-join and node-left tasks within a few seconds:

[2021-07-14T14:37:12,674][INFO ][o.e.c.s.MasterService    ] [1612281926004251932] node-join[{1626106122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10.10.119.90}{10.10.119.90:9300}{dil}{ml.machine
_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.134, temperature=warm, ml.max_open_jobs=20, region=4} join existing leader], term: 18, version: 841551, delta: added {{162610
6122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10.10.119.90}{10.10.119.90:9300}{dil}{ml.machine_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.134, temperatu
re=warm, ml.max_open_jobs=20, region=4}}
[2021-07-14T14:37:16,330][INFO ][o.e.c.s.ClusterApplierService] [1612281926004251932] added {{1626106122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10.10.119.90}{10.10.119.90:9300}{dil}{ml.machin
e_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.134, temperature=warm, ml.max_open_jobs=20, region=4}}, term: 18, version: 841551, reason: Publication{term=18, version=8415
51}
[2021-07-14T14:37:18,141][INFO ][o.e.c.s.MasterService    ] [1612281926004251932] node-left[{1626106122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10.10.119.90}{10.10.119.90:9300}{dil}{ml.machine
_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.134, temperature=warm, ml.max_open_jobs=20, region=4} reason: disconnected], term: 18, version: 841552, delta: removed {{1626
106122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10.10.119.90}{10.10.119.90:9300}{dil}{ml.machine_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.134, tempera
ture=warm, ml.max_open_jobs=20, region=4}}
[2021-07-14T14:37:20,521][INFO ][o.e.c.s.ClusterApplierService] [1612281926004251932] removed {{1626106122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10.10.119.90}{10.10.119.90:9300}{dil}{ml.mach
ine_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.134, temperature=warm, ml.max_open_jobs=20, region=4}}, term: 18, version: 841552, reason: Publication{term=18, version=84
1552}
[2021-07-14T14:37:20,577][INFO ][o.e.c.s.MasterService    ] [1612281926004251932] node-join[{1626106122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10.10.119.90}{10.10.119.90:9300}{dil}{ml.machine
_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.134, temperature=warm, ml.max_open_jobs=20, region=4} join existing leader], term: 18, version: 841553, delta: added {{162610
6122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10.10.119.90}{10.10.119.90:9300}{dil}{ml.machine_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.134, temperatu
re=warm, ml.max_open_jobs=20, region=4}}
[2021-07-14T14:37:25,892][INFO ][o.e.c.s.ClusterApplierService] [1612281926004251932] added {{1626106122000009932}{lmFqU9EpQZ6pWQFU8QR5Jg}{Uumtj9wjSHO-Myt4oWvJAA}{10.10.119.90}{10.10.119.90:9300}{dil}{ml.machin
e_memory=33389862912, rack=cvm_4_200004, xpack.installed=true, set=200004, ip=9.20.97.134, temperature=warm, ml.max_open_jobs=20, region=4}}, term: 18, version: 841553, reason: Publication{term=18, version=8415
53}

Only after we restarted the data node manually did it re-join successfully.

@DaveCTurner
Contributor Author

No, this issue is only transient, it doesn't explain a persistent failure to rejoin the cluster.

@howardhuanghua
Contributor

howardhuanghua commented Jul 26, 2021

No, this issue is only transient, it doesn't explain a persistent failure to rejoin the cluster.

@DaveCTurner thanks. Any idea how the data node could discover all of the dedicated master nodes and still fail to join the cluster? The cluster state it received also contained the complete set of master nodes. The node had previously been removed due to high disk I/O utilization, and after the master removed it, I/O utilization returned to normal. It shouldn't be a network issue, since manually restarting the node let it join successfully.

@DaveCTurner
Contributor Author

I don't have an idea how this might happen, but I certainly would like to. It sounds like a bug somewhere, maybe in Elasticsearch. I have seen other cases where a restart fixed a recurring disconnection that was caused by a badly-behaved router on the network between nodes.

I should point out that you're using a version that's well past EOL - I don't think I have a working development environment for this version any more, so you'd need to reproduce it in a supported version first. If you can then please open a separate bug report and we'll talk about how to diagnose the problems further.

@kkewwei
Contributor

kkewwei commented Aug 18, 2021

The following process matches some of these cases:

  1. When a node is removed from the cluster because it lags applying the cluster state, the master only removes the node from followerCheckers and closes the connection, but doesn't cancel a scheduled checker that may already exist.
  2. When that scheduled checker executes on the master, it throws NodeNotConnectedException because the connection has been closed, so the master calls failNode() and removes the node from the cluster again.
  3. Meanwhile a new connection to the removed node may already be in progress, i.e. the master is connecting to the removed node at the same time as it is removing it from the cluster, so unexpected things may happen.

Could the master cancel the corresponding ongoing scheduled checker when it removes the node from followerCheckers?
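For illustration only, a minimal, hypothetical sketch of that idea (a standalone registry, not the real FollowersChecker): cancel any check still scheduled for a node before closing its connection, so a stale check cannot fire against an already-closed connection.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class CheckerRegistry {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Map<String, ScheduledFuture<?>> pendingChecks = new ConcurrentHashMap<>();

    // schedule the next follower check for a node
    void scheduleCheck(String nodeId, Runnable check) {
        pendingChecks.put(nodeId, scheduler.schedule(check, 1, TimeUnit.SECONDS));
    }

    // called when the node is removed from the cluster
    void removeNode(String nodeId) {
        ScheduledFuture<?> pending = pendingChecks.remove(nodeId);
        if (pending != null) {
            pending.cancel(false); // the stale check can no longer run against a closed connection
        }
        // ... then close the connection to the node
    }
}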

@howardhuanghua
Contributor

  1. When a node is removed from the cluster because it lags applying the cluster state, the master only removes the node from followerCheckers and closes the connection, but doesn't cancel a scheduled checker that may already exist.
  1. Shouldn't removing a node from followerCheckers avoid closing the connection, and simply remove the entry from the concurrent hash map?
  2. Would an existing checker be stopped once the node has been removed from the concurrent hash map?

@kkewwei
Contributor

kkewwei commented Aug 20, 2021

Yes, what I explained above doesn't get to the point.

In production we use v7.9.1, and the problem lasts for a long time rather than being transient, as @howardhuanghua describes.

As @DaveCTurner describes in detail, the reason the master continuously handles node-join and node-left requests is that the master is removing the data node from the cluster while that same data node is waiting to rejoin; we should avoid this situation.

The removed data node also prints logs like these:

[2021-08-17T23:01:31,567][INFO ][o.e.c.c.JoinHelper       ] [data1] failed to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{RaXCd_qkSdeGtBcsG86XTA}{10.16.146.21}{10.16.146.21:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional[Join{term=14, lastAcceptedTerm=13, lastAcceptedVersion=428180, sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{RaXCd_qkSdeGtBcsG86XTA}{10.16.146.21}{10.16.146.21:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, targetNode={master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [master1][1.2.3.4:9304][internal:cluster/coordination/join] request_id [867331376] timed out after [59899ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1074) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]

The data node fails to join the cluster because of the timeout, but the master still treats the join request as valid and sends a join/validate request to the data node. The data node receives the check request and checks as follows:

if (localState.metadata().clusterUUIDCommitted() &&

The checks are simple: whether or not the data node is actually in the cluster, it will quickly respond to the master.

That is, the data node has given up joining the cluster, but the master still accepts the join request and publishes a node-join task, which is unreasonable. We should add a check: only if the data node has sent an internal:cluster/coordination/join request and correctly received the corresponding response should it be eligible to respond to the internal:cluster/coordination/join/validate check.
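A hypothetical sketch of the proposed check on the joining node's side (the names here are made up, this is not the real JoinHelper): only answer a join/validate request while this node still has its own join attempt in flight.

import java.util.concurrent.atomic.AtomicBoolean;

class JoinValidateGuard {
    private final AtomicBoolean joinInFlight = new AtomicBoolean(false);

    void onJoinRequestSent()  { joinInFlight.set(true); }   // join request sent to the master
    void onJoinAttemptEnded() { joinInFlight.set(false); }  // response received or join abandoned

    // called when a join/validate request arrives from the master
    boolean shouldAcceptValidateRequest() {
        // if this node is no longer trying to join, reject the validate check
        // instead of blindly acknowledging it
        return joinInFlight.get();
    }
}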

@DaveCTurner
Contributor Author

This appears to be unrelated to the problem described in this issue. You seem to be saying that joins continuously encounter a timeout, but this issue is about nodes disconnecting immediately during the joining process. In addition, the join timeout has been ignored since 7.10.0 (#60872), so I don't think it's possible to see the effect you describe any more.

@kkewwei
Contributor

kkewwei commented Aug 21, 2021

More detailed logs from the master are as follows:

[2021-08-17T23:36:30,524][INFO ][o.e.c.s.MasterService    ] [master1] node-left[{data1}{1.2.3.4:9300} reason: disconnected], term: 14, version: 550366, delta: removed {{data1}{1.2.3.4:9300}}
[2021-08-17T23:36:38,286][INFO ][o.e.c.s.ClusterApplierService] [master1] removed {{data1}{1.2.3.4:9300}}, term: 14, version: 550366, reason: Publication{term=14, version=550366}
[2021-08-17T23:36:38,378][TRACE][o.e.t.ClusterConnectionManager] [master1] unregistering {data1}{1.2.3.4:9300} after connection close and marking as disconnected
[2021-08-17T23:36:38,416][DEBUG][o.e.c.c.C.CoordinatorPublication] [master1] publication ended successfully: Publication{term=14, version=550366}
[2021-08-17T23:36:38,485][INFO ][o.e.c.s.MasterService    ] [master1] node-join[{data1}{1.2.3.4:9300} join existing leader], term: 14, version: 550367, delta: added {{data1}{1.2.3.4:9300}}
[2021-08-17T23:36:38,893][DEBUG][o.e.c.c.FollowersChecker ] [master1] FollowerChecker{discoveryNode={data1}{1.2.3.4:9300}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=10} disconnected
org.elasticsearch.transport.NodeNotConnectedException: [rz-data-hdp-dn-rtyarn0073][10.16.158.18:9300] Node not connected
        at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:189) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:673) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:600) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker.handleWakeUp(FollowersChecker.java:317) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker.start(FollowersChecker.java:295) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker.lambda$setCurrentNodes$3(FollowersChecker.java:150) ~[elasticsearch-7.9.1.jar:7.9.1]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_112]
        at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[?:1.8.0_112]
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[?:1.8.0_112]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_112]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_112]
        at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:312) ~[?:1.8.0_112]
        at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:743) ~[?:1.8.0_112]
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) ~[?:1.8.0_112]
        at org.elasticsearch.cluster.coordination.FollowersChecker.setCurrentNodes(FollowersChecker.java:143) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1113) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:268) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.9.1.jar:7.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
[2021-08-17T23:36:38,894][DEBUG][o.e.c.c.FollowersChecker ] [master1] FollowerChecker{discoveryNode={data1}{1.2.3.4:9300}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=10} marking node as faulty
[2021-08-17T23:36:38,894][DEBUG][o.e.c.c.C.CoordinatorPublication] [master1] onFaultyNode: [{data1}{1.2.3.4:9300}] is faulty, failing target in publication Publication{term=14, version=550367}
[2021-08-17T23:36:39,427][INFO ][o.e.c.s.ClusterApplierService] [master1] added {{data1}{1.2.3.4:9300}}, term: 14, version: 550367, reason: Publication{term=14, version=550367}
[2021-08-17T23:36:39,429][DEBUG][o.e.t.ClusterConnectionManager] [master1] connected to node [{data1}{1.2.3.4:9300}]
[2021-08-17T23:36:39,550][DEBUG][o.e.c.c.C.CoordinatorPublication] [master1] publication ended successfully: Publication{term=14, version=550367}
[2021-08-17T23:36:39,991][INFO ][o.e.c.s.MasterService    ] [master1] node-left[{data1}{1.2.3.4:9300} reason: disconnected], term: 14, version: 550368, delta: removed {{data1}{1.2.3.4:9300}}
[2021-08-17T23:36:40,507][INFO ][o.e.c.s.ClusterApplierService] [master1] removed {{data1}{1.2.3.4:9300}}, term: 14, version: 550368, reason: Publication{term=14, version=550368}
[2021-08-17T23:36:40,639][TRACE][o.e.t.ClusterConnectionManager] [master1] unregistering {data1}{1.2.3.4:9300} after connection close and marking as disconnected
[2021-08-17T23:36:40,677][DEBUG][o.e.c.c.C.CoordinatorPublication] [master1] publication ended successfully: Publication{term=14, version=550368}
[2021-08-17T23:36:41,584][DEBUG][o.e.c.c.C.CoordinatorPublication] [master1] publication ended successfully: Publication{term=14, version=550369}
[2021-08-17T23:36:41,642][INFO ][o.e.c.s.MasterService    ] [master1] node-join[{data1}{1.2.3.4:9300} join existing leader], term: 14, version: 550370, delta: added {{data1}{1.2.3.4:9300}}
[2021-08-17T23:36:41,986][DEBUG][o.e.c.c.FollowersChecker ] [master1] FollowerChecker{discoveryNode={data1}{1.2.3.4:9300}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=10} disconnected
org.elasticsearch.transport.NodeNotConnectedException: [rz-data-hdp-dn-rtyarn0073][10.16.158.18:9300] Node not connected
        at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:189) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:673) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:600) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker.handleWakeUp(FollowersChecker.java:317) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker.start(FollowersChecker.java:295) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker.lambda$setCurrentNodes$3(FollowersChecker.java:150) ~[elasticsearch-7.9.1.jar:7.9.1]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_112]
        at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[?:1.8.0_112]
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[?:1.8.0_112]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_112]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_112]
        at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:312) ~[?:1.8.0_112]
        at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:743) ~[?:1.8.0_112]

We can clearly see that:
1. the master node continuously receives join requests ("join existing leader");
2. the connection with the data node is built in ClusterApplierService.applyChanges().

Logically, the connection with the data node should be built in handleJoinRequest(), before the master starts sending the internal:cluster/coordination/join/validate request:

transportService.connectToNode(joinRequest.getSourceNode(), ActionListener.wrap(ignore -> {

First, why does the master throw NodeNotConnectedException when processing the node-join task?

Logically the connection should exist. In fact:

  1. The master calls handleJoinRequest() and builds the connection; at the same time the node-left task is in progress, which includes closing the connection. The newly built connection is removed by the node-left task.
  2. After the node-left task, the master begins processing the node-join task and starts the follower check, but it can't find the connection, so it throws NodeNotConnectedException.

Second, why does the master continuously handle node-join and node-left requests?
We can see that the master frequently receives join requests from the data node, but all of the join requests time out.

The key point is that the master frequently receives join requests from the data node and doesn't check whether they are still valid; ignoring the join timeout doesn't seem to help here.

What bothers me is why the join requests time out.

@DaveCTurner
Contributor Author

What bothers me is why the join requests time out.

You're using 7.9, the timeout was removed in 7.10.

@kkewwei
Contributor

kkewwei commented Aug 21, 2021

You're using 7.9, the timeout was removed in 7.10.

I mean: in 7.9, why do so many join requests time out? I will continue to investigate the reason.

Deleting the timeout in v7.10 only ensures that the join request is valid. Even in v7.10 there will still be many join requests from the same data node, and this situation may still exist: the node-left task removes the connection built by handleJoinRequest on the master, NodeNotConnectedException is thrown, and then the master continuously handles node-join and node-left requests.

Should we add a reference counter to the connection to avoid mistakenly closing it?

@DaveCTurner
Contributor Author

Even in v7.10 there will still be many join requests from the same data node, and this situation may still exist

I don't think this is the case, with the timeout removed there is only ever one join request (to each master) in flight at once since we wait for each one to complete before sending another. I think this means we don't get into this strange state of repeatedly trying to join the master while the master is concurrently disconnecting from us.

I mean there's something else wrong in your system if the master can't process join requests within 60s for an extended period of time, but at least with this change it doesn't loop like you describe any more; it simply waits for the master to catch up.
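A rough, hypothetical sketch of that serialised behaviour (made-up interfaces, not the real JoinHelper): with no timeout, the node sends one join, waits for it to complete however long that takes, and only then considers sending another.

import java.util.concurrent.CompletableFuture;

class SerializedJoinSketch {
    interface JoinSender {
        boolean joinedCluster();
        CompletableFuture<Void> sendJoinRequest(); // completes when the master answers
    }

    // at most one join request is ever in flight; with no timeout the node
    // simply waits for the (possibly very slow) master before trying again
    static void joinLoop(JoinSender sender) throws Exception {
        while (!sender.joinedCluster()) {
            sender.sendJoinRequest().get();
        }
    }
}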

@DaveCTurner
Contributor Author

DaveCTurner commented Aug 22, 2021

IIRC the reason we removed the timeout was because we saw this loop happening and realised that the timeout (plus an egregiously-overloaded master) was causing it. You should be able to get pretty much the same effect in 7.9 by setting the join timeout to something very long (e.g. 1d). If you're still seeing this loop with a long timeout then (a) that would indicate that removing the timeout wasn't enough and (b) that would give you time to investigate why your master is processing joins so slowly.

Should we add a reference counter to the connection to avoid mistakenly closing it?

Yeah that was the fix I was thinking of here too, but note that it would merely hide the slow-joins problem in your cluster.
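For what it's worth, a minimal, hypothetical sketch of such reference counting (not the actual connection code): every user of the connection acquires a reference and only the last release actually closes the underlying channel, so a node-left task releasing its reference cannot tear down a connection that a concurrent join still holds.

import java.util.concurrent.atomic.AtomicInteger;

class RefCountedConnection {
    private final AtomicInteger refCount = new AtomicInteger(1); // the opener holds the first reference

    // returns false if the connection has already been fully released
    boolean tryAcquire() {
        while (true) {
            int current = refCount.get();
            if (current == 0) {
                return false;
            }
            if (refCount.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    void release() {
        if (refCount.decrementAndGet() == 0) {
            closeChannel(); // only the last release actually closes the TCP channel
        }
    }

    private void closeChannel() {
        // placeholder: close the underlying transport channel here
    }
}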

@kkewwei
Contributor

kkewwei commented Aug 23, 2021

with the timeout removed there is only ever one join request (to each master) in flight at once since we wait for each one to complete before sending another.

There is indeed only one ongoing request at a time; I got it wrong at first.

The case happened again this afternoon and lasted for 10 minutes. I had changed the log level to DEBUG in advance; here are the filtered logs of the removed data node sending join requests:

[2021-08-23T17:16:03,616][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {rmaster1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.10}{1.2.3.10:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:08,976][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {rmaster1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.10}{1.2.3.10:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:09,610][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {rmaster1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.10}{1.2.3.10:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:13,771][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {rmaster1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.10}{1.2.3.10:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:13,911][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {rmaster1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.10}{1.2.3.10:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:17,808][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {rmaster1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.10}{1.2.3.10:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:17,912][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {rmaster1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.10}{1.2.3.10:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:21,890][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {rmaster1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.10}{1.2.3.10:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:21,914][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {rmaster1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.10}{1.2.3.10:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:25,814][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {rmaster1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.10}{1.2.3.10:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:25,915][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {rmaster1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.10}{1.2.3.10:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:30,048][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {rmaster1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.10}{1.2.3.10:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:30,917][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {rmaster1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.10}{1.2.3.10:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:34,503][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {rmaster1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.4}{1.2.3.4:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.10}{1.2.3.10:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}

We can see that the data node does frequently send join requests to the master over a continuous period of time (the join requests received by the master confirm this too), and we have reason to believe that for every join request the following process succeeds:
1. the master receives the join request and sends a join/validate check request to the data node;
2. the data node quickly responds to the master's check request;
3. the master starts the node-join task.

So there really are many valid join requests within a short time. The timeout (60s) is not the main reason; it occurs with low frequency.

Removing the timeout is not enough; it seems we should add a reference counter to the connection on the master to avoid this.

If you want more detailed logs to verify this, I am very happy to provide them.

@DaveCTurner
Contributor Author

Hmm this does sound suspicious. Could you share the complete logs for the master node and the data node covering the whole outage?

@kkewwei
Contributor

kkewwei commented Aug 24, 2021

Since more than one data node was being removed from the cluster at the same time, and the corresponding log level is DEBUG, too many logs are printed on the master node in a short time; I chose one data node from the master log and filtered it with master1.log | grep data1.

[2021-08-23T17:16:00,931][DEBUG][o.e.c.c.FollowersChecker ] [master1] FollowerChecker{discoveryNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=10} marking node as faulty
[2021-08-23T17:16:00,931][DEBUG][o.e.c.c.C.CoordinatorPublication] [master1] onFaultyNode: [{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] is faulty, failing target in publication Publication{term=14, version=573091}
[2021-08-23T17:16:03,224][INFO ][o.e.c.s.ClusterApplierService] [master1] added {{data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}, term: 14, version: 573091, reason: Publication{term=14, version=573091}
[2021-08-23T17:16:03,224][DEBUG][o.e.c.NodeConnectionsService] [master1] connecting to {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
[2021-08-23T17:16:03,330][DEBUG][o.e.t.ClusterConnectionManager] [master1] connected to node [{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}]
[2021-08-23T17:16:03,330][DEBUG][o.e.c.NodeConnectionsService] [master1] connected to {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
[2021-08-23T17:16:04,578][INFO ][o.e.c.s.MasterService    ] [master1] node-left[{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected, {data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected, {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected, {data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected, {data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected], term: 14, version: 573092, delta: removed {{data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}
[2021-08-23T17:16:06,926][INFO ][o.e.c.s.ClusterApplierService] [master1] removed {{data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}, term: 14, version: 573092, reason: Publication{term=14, version=573092}
[2021-08-23T17:16:06,979][TRACE][o.e.t.ClusterConnectionManager] [master1] unregistering {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} after connection close and marking as disconnected
[2021-08-23T17:16:06,983][DEBUG][o.e.c.NodeConnectionsService] [master1] disconnected from {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
[2021-08-23T17:16:07,116][INFO ][o.e.c.s.MasterService    ] [master1] node-join[{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader, {data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader, {data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader, {data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader, {data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader], term: 14, version: 573093, delta: added {{data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}
[2021-08-23T17:16:07,540][DEBUG][o.e.c.c.FollowersChecker ] [master1] FollowerChecker{discoveryNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=10} disconnected
org.elasticsearch.transport.NodeNotConnectedException: [data1][1.2.3.4:9300] Node not connected
        at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:189) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:673) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:600) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker.handleWakeUp(FollowersChecker.java:317) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker.start(FollowersChecker.java:295) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker.lambda$setCurrentNodes$3(FollowersChecker.java:150) ~[elasticsearch-7.9.1.jar:7.9.1]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_112]
        at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[?:1.8.0_112]
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[?:1.8.0_112]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_112]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_112]
        at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:312) ~[?:1.8.0_112]
        at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:743) ~[?:1.8.0_112]
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) ~[?:1.8.0_112]
        at org.elasticsearch.cluster.coordination.FollowersChecker.setCurrentNodes(FollowersChecker.java:143) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1113) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:268) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.9.1.jar:7.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
[2021-08-23T17:16:07,544][DEBUG][o.e.c.c.FollowersChecker ] [master1] FollowerChecker{discoveryNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=10} marking node as faulty
[2021-08-23T17:16:07,545][DEBUG][o.e.c.c.C.CoordinatorPublication] [master1] onFaultyNode: [{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] is faulty, failing target in publication Publication{term=14, version=573093}
[2021-08-23T17:16:08,890][INFO ][o.e.c.s.ClusterApplierService] [master1] added {{data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}, term: 14, version: 573093, reason: Publication{term=14, version=573093}
[2021-08-23T17:16:08,891][DEBUG][o.e.c.NodeConnectionsService] [master1] connecting to {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
[2021-08-23T17:16:08,892][DEBUG][o.e.t.ClusterConnectionManager] [master1] connected to node [{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}]
[2021-08-23T17:16:08,892][DEBUG][o.e.c.NodeConnectionsService] [master1] connected to {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
[2021-08-23T17:16:10,204][INFO ][o.e.c.s.MasterService    ] [master1] node-left[{data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected, {data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected, {data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected, {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected, {data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected], term: 14, version: 573094, delta: removed {{data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}
[2021-08-23T17:16:11,521][INFO ][o.e.c.s.ClusterApplierService] [master1] removed {{data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}, term: 14, version: 573094, reason: Publication{term=14, version=573094}
[2021-08-23T17:16:11,585][TRACE][o.e.t.ClusterConnectionManager] [master1] unregistering {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} after connection close and marking as disconnected
[2021-08-23T17:16:11,588][DEBUG][o.e.c.NodeConnectionsService] [master1] disconnected from {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
[2021-08-23T17:16:11,761][INFO ][o.e.c.s.MasterService    ] [master1] node-join[{data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader, {data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader, {data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader, {data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader, {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader], term: 14, version: 573095, delta: added {{data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}
[2021-08-23T17:16:12,189][DEBUG][o.e.c.c.FollowersChecker ] [master1] FollowerChecker{discoveryNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=10} disconnected
org.elasticsearch.transport.NodeNotConnectedException: [data1][1.2.3.4:9300] Node not connected
        at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:189) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:673) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:600) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker.handleWakeUp(FollowersChecker.java:317) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker.start(FollowersChecker.java:295) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker.lambda$setCurrentNodes$3(FollowersChecker.java:150) ~[elasticsearch-7.9.1.jar:7.9.1]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_112]
        at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[?:1.8.0_112]
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[?:1.8.0_112]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_112]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_112]
        at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:312) ~[?:1.8.0_112]
        at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:743) ~[?:1.8.0_112]
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) ~[?:1.8.0_112]
        at org.elasticsearch.cluster.coordination.FollowersChecker.setCurrentNodes(FollowersChecker.java:143) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1113) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:268) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.9.1.jar:7.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
[2021-08-23T17:16:12,193][DEBUG][o.e.c.c.FollowersChecker ] [master1] FollowerChecker{discoveryNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=10} marking node as faulty
[2021-08-23T17:16:12,193][DEBUG][o.e.c.c.C.CoordinatorPublication] [master1] onFaultyNode: [{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] is faulty, failing target in publication Publication{term=14, version=573095}
[2021-08-23T17:16:13,614][INFO ][o.e.c.s.ClusterApplierService] [master1] added {{data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}, term: 14, version: 573095, reason: Publication{term=14, version=573095}
[2021-08-23T17:16:13,615][DEBUG][o.e.c.NodeConnectionsService] [master1] connecting to {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
[2021-08-23T17:16:13,697][DEBUG][o.e.t.ClusterConnectionManager] [master1] connected to node [{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}]
[2021-08-23T17:16:13,697][DEBUG][o.e.c.NodeConnectionsService] [master1] connected to {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
[2021-08-23T17:16:14,887][INFO ][o.e.c.s.MasterService    ] [master1] node-left[{data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected, {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected, {data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected, {data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected, {data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected], term: 14, version: 573096, delta: removed {{data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}
[2021-08-23T17:16:15,502][INFO ][o.e.c.s.ClusterApplierService] [master1] removed {{data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}, term: 14, version: 573096, reason: Publication{term=14, version=573096}
[2021-08-23T17:16:15,572][TRACE][o.e.t.ClusterConnectionManager] [master1] unregistering {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} after connection close and marking as disconnected
[2021-08-23T17:16:15,578][DEBUG][o.e.c.NodeConnectionsService] [master1] disconnected from {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
[2021-08-23T17:16:15,753][INFO ][o.e.c.s.MasterService    ] [master1] node-join[{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader, {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader, {data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader, {data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader, {data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} join existing leader], term: 14, version: 573097, delta: added {{data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}
[2021-08-23T17:16:16,260][DEBUG][o.e.c.c.FollowersChecker ] [master1] FollowerChecker{discoveryNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=10} disconnected
org.elasticsearch.transport.NodeNotConnectedException: [data1][1.2.3.4:9300] Node not connected
        at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:189) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:673) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:600) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker.handleWakeUp(FollowersChecker.java:317) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker$FollowerChecker.start(FollowersChecker.java:295) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.FollowersChecker.lambda$setCurrentNodes$3(FollowersChecker.java:150) ~[elasticsearch-7.9.1.jar:7.9.1]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_112]
        at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[?:1.8.0_112]
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[?:1.8.0_112]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_112]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_112]
        at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:312) ~[?:1.8.0_112]
        at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:743) ~[?:1.8.0_112]
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) ~[?:1.8.0_112]
        at org.elasticsearch.cluster.coordination.FollowersChecker.setCurrentNodes(FollowersChecker.java:143) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1113) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:268) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.9.1.jar:7.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
[2021-08-23T17:16:16,264][DEBUG][o.e.c.c.FollowersChecker ] [master1] FollowerChecker{discoveryNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=10} marking node as faulty
[2021-08-23T17:16:16,265][DEBUG][o.e.c.c.C.CoordinatorPublication] [master1] onFaultyNode: [{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] is faulty, failing target in publication Publication{term=14, version=573097}
[2021-08-23T17:16:17,691][INFO ][o.e.c.s.ClusterApplierService] [master1] added {{data4}{MCz3f3FKTaK7iJ3EFRSXcw}{DyzejZeqQaeVVLrw86FnJQ}{1.2.3.1}{1.2.3.1:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data5}{CWtkFNWjQiKRnimSCINSQg}{einEV15-RYiD0xHWVQRscQ}{1.2.3.3}{1.2.3.3:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data2}{nLJM0qzfQEe-yFrrlgX7Xg}{c9UHGjY-S-KBe8mf_tA-JQ}{1.2.3.5}{1.2.3.5:9300}{dilrt}{ml.machine_memory=268703219712, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true},{data3}{qaB0Zb-jSU6u1rVRunaBlA}{EbExhcgiQZePaEXB3zeB6Q}{1.2.3.7}{1.2.3.7:9300}{dilrt}{ml.machine_memory=268703215616, rack=yg, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}, term: 14, version: 573097, reason: Publication{term=14, version=573097}
[2021-08-23T17:16:17,691][DEBUG][o.e.c.NodeConnectionsService] [master1] connecting to {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
[2021-08-23T17:16:17,692][DEBUG][o.e.t.ClusterConnectionManager] [master1] connected to node [{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}]
[2021-08-23T17:16:17,693][DEBUG][o.e.c.NodeConnectionsService] [master1] connected to {data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}

Data node log:

[2021-08-23T17:16:03,616][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:04,266][WARN ][o.e.c.c.ClusterFormationFailureHelper] [data1] master not discovered yet: have discovered [{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, {master2}{7T_YyUDPRsquBstSmRjGcQ}{Sr6FOUfPSCmKtg_71AaPYw}{1.2.3.12}{1.2.3.12:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master3}{cLeWlwq4SbufcDa0Fw0qPQ}{t-qZSS-lRcWcTow0l21xqA}{1.2.3.13}{1.2.3.13:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}]; discovery will continue using [1.2.3.12:9304, 1.2.3.13:9304, 1.2.3.10:9304] from hosts providers and [{master2}{7T_YyUDPRsquBstSmRjGcQ}{Sr6FOUfPSCmKtg_71AaPYw}{1.2.3.12}{1.2.3.12:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master3}{cLeWlwq4SbufcDa0Fw0qPQ}{t-qZSS-lRcWcTow0l21xqA}{1.2.3.13}{1.2.3.13:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}] from last-known cluster state; node term 14, last-accepted version 573089 in term 14
[2021-08-23T17:16:04,609][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:05,961][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:06,838][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:07,610][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:08,610][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:08,976][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:09,610][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:10,806][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:11,913][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:12,911][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:13,771][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:13,911][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:14,267][WARN ][o.e.c.c.ClusterFormationFailureHelper] [data1] master not discovered yet: have discovered [{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, {master2}{7T_YyUDPRsquBstSmRjGcQ}{Sr6FOUfPSCmKtg_71AaPYw}{1.2.3.12}{1.2.3.12:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master3}{cLeWlwq4SbufcDa0Fw0qPQ}{t-qZSS-lRcWcTow0l21xqA}{1.2.3.13}{1.2.3.13:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}]; discovery will continue using [1.2.3.12:9304, 1.2.3.13:9304, 1.2.3.10:9304] from hosts providers and [{master2}{7T_YyUDPRsquBstSmRjGcQ}{Sr6FOUfPSCmKtg_71AaPYw}{1.2.3.12}{1.2.3.12:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master3}{cLeWlwq4SbufcDa0Fw0qPQ}{t-qZSS-lRcWcTow0l21xqA}{1.2.3.13}{1.2.3.13:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}] from last-known cluster state; node term 14, last-accepted version 573089 in term 14
[2021-08-23T17:16:14,911][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:15,912][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:16,912][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:17,808][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:17,912][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:18,913][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:19,914][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:20,914][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:21,890][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:21,914][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:22,914][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:23,915][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:24,497][WARN ][o.e.c.c.ClusterFormationFailureHelper] [data1] master not discovered yet: have discovered [{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, {master2}{7T_YyUDPRsquBstSmRjGcQ}{Sr6FOUfPSCmKtg_71AaPYw}{1.2.3.12}{1.2.3.12:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master3}{cLeWlwq4SbufcDa0Fw0qPQ}{t-qZSS-lRcWcTow0l21xqA}{1.2.3.13}{1.2.3.13:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}]; discovery will continue using [1.2.3.12:9304, 1.2.3.13:9304, 1.2.3.10:9304] from hosts providers and [{master2}{7T_YyUDPRsquBstSmRjGcQ}{Sr6FOUfPSCmKtg_71AaPYw}{1.2.3.12}{1.2.3.12:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master3}{cLeWlwq4SbufcDa0Fw0qPQ}{t-qZSS-lRcWcTow0l21xqA}{1.2.3.13}{1.2.3.13:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}] from last-known cluster state; node term 14, last-accepted version 573089 in term 14
[2021-08-23T17:16:24,915][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:25,814][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:25,915][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:26,916][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:27,916][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:28,916][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:29,916][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request

If the logs above don't help and you know where I can upload the log file, please tell me.

@kkewwei
Contributor

kkewwei commented Aug 24, 2021

The process looks like this:
1. The data node keeps sending join requests to the master.
2. If a new join request arrives while the master is still processing the node-left task, the join reuses the existing connection; the master then closes that connection as part of the disconnect, so the FollowersChecker throws a NodeNotConnectedException (see the toy sketch after this list).

The process keeps looping.
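
For illustration only, here is a toy sketch of that interleaving (the class and method names are hypothetical; this is not the real ClusterConnectionManager):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the interleaving above: the rejoin re-uses the connection that the
// in-flight node-left handling then closes, so the next follower check fails.
public class DisconnectRaceSketch {

    // nodeId -> "connected" flag standing in for an open transport connection
    static final Map<String, Boolean> connections = new ConcurrentHashMap<>();

    static void handleJoin(String node) {
        // join processing re-uses the existing connection if there is one
        connections.computeIfAbsent(node, n -> Boolean.TRUE);
    }

    static void completeNodeLeft(String node) {
        // node-left processing closes whatever connection is registered for the node,
        // even though a join has just started re-using it
        connections.remove(node);
    }

    static void followerCheck(String node) {
        if (connections.containsKey(node) == false) {
            throw new IllegalStateException("NodeNotConnectedException: " + node);
        }
    }

    public static void main(String[] args) {
        connections.put("data1", Boolean.TRUE); // data1 was connected before it left
        handleJoin("data1");                    // rejoin arrives while node-left is still in progress
        completeNodeLeft("data1");              // node-left finishes and closes the connection
        followerCheck("data1");                 // -> NodeNotConnectedException, and the loop repeats
    }
}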

@DaveCTurner
Contributor Author

Ahh ok I think I see now, thanks. That does look like this problem now. I am surprised to see it looping like this (without a timeout) but it looks like things are happening in quite a different order in your system from the order I thought we enforced. Turns out we aren't enforcing it, we're just relying on cluster state updates happening quicker than they do for you.

I've had a quick look at implementing refcounting to prevent this. We don't actually have anything appropriate to refcount today - we can't prevent the Transport.Connection from closing since it might be the other end that closes it, and the NodeConnectionsService.ConnectionTarget isn't created early enough and already has quite a complex lifecycle. I'm thinking about an alternative approach that involves tracking pending disconnections and rejecting joins until the disconnection is complete. Are you also working on a fix at your end?
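
A minimal sketch of that alternative, assuming hypothetical names (this is not Elasticsearch code): joins from a node are rejected while a disconnection from that node is still in flight, so a rejoining node can never re-use a connection that the node-left handling is about to close.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class PendingDisconnectionsSketch {

    private final Set<String> pendingDisconnections = ConcurrentHashMap.newKeySet();

    // called when the node-left task starts tearing down the connection
    void onDisconnectStarted(String nodeId) {
        pendingDisconnections.add(nodeId);
    }

    // called once the transport layer reports the connection fully closed
    void onDisconnectCompleted(String nodeId) {
        pendingDisconnections.remove(nodeId);
    }

    // called when a join request arrives; a rejected join is simply retried later
    boolean mayAcceptJoin(String nodeId) {
        return pendingDisconnections.contains(nodeId) == false;
    }
}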

@DaveCTurner
Contributor Author

This test seems to reproduce the loop fairly reliably:

diff --git a/server/src/internalClusterTest/java/org/elasticsearch/discovery/ClusterDisruptionIT.java b/server/src/internalClusterTest/java/org/elasticsearch/discovery/ClusterDisruptionIT.java
index 042d2e91089..df05b9f1084 100644
--- a/server/src/internalClusterTest/java/org/elasticsearch/discovery/ClusterDisruptionIT.java
+++ b/server/src/internalClusterTest/java/org/elasticsearch/discovery/ClusterDisruptionIT.java
@@ -16,19 +16,22 @@ import org.elasticsearch.action.NoShardAvailableActionException;
 import org.elasticsearch.action.get.GetResponse;
 import org.elasticsearch.action.index.IndexRequestBuilder;
 import org.elasticsearch.action.index.IndexResponse;
+import org.elasticsearch.action.support.PlainActionFuture;
 import org.elasticsearch.client.Client;
 import org.elasticsearch.cluster.ClusterState;
 import org.elasticsearch.cluster.action.shard.ShardStateAction;
 import org.elasticsearch.cluster.coordination.ClusterBootstrapService;
+import org.elasticsearch.cluster.coordination.FollowersChecker;
 import org.elasticsearch.cluster.coordination.LagDetector;
 import org.elasticsearch.cluster.metadata.IndexMetadata;
 import org.elasticsearch.cluster.routing.Murmur3HashFunction;
 import org.elasticsearch.cluster.routing.ShardRouting;
 import org.elasticsearch.cluster.routing.ShardRoutingState;
+import org.elasticsearch.cluster.service.ClusterService;
 import org.elasticsearch.common.settings.Settings;
-import org.elasticsearch.core.TimeValue;
 import org.elasticsearch.common.util.concurrent.ConcurrentCollections;
 import org.elasticsearch.common.xcontent.XContentType;
+import org.elasticsearch.core.TimeValue;
 import org.elasticsearch.index.VersionType;
 import org.elasticsearch.index.shard.IndexShard;
 import org.elasticsearch.index.shard.IndexShardTestCase;
@@ -40,6 +43,8 @@ import org.elasticsearch.test.disruption.NetworkDisruption.Bridge;
 import org.elasticsearch.test.disruption.NetworkDisruption.TwoPartitions;
 import org.elasticsearch.test.disruption.ServiceDisruptionScheme;
 import org.elasticsearch.test.junit.annotations.TestIssueLogging;
+import org.elasticsearch.test.transport.MockTransportService;
+import org.elasticsearch.transport.TransportService;
 
 import java.util.ArrayList;
 import java.util.Collections;
@@ -59,6 +64,9 @@ import java.util.stream.IntStream;
 
 import static org.elasticsearch.action.DocWriteResponse.Result.CREATED;
 import static org.elasticsearch.action.DocWriteResponse.Result.UPDATED;
+import static org.elasticsearch.cluster.coordination.LeaderChecker.LEADER_CHECK_INTERVAL_SETTING;
+import static org.elasticsearch.cluster.coordination.LeaderChecker.LEADER_CHECK_RETRY_COUNT_SETTING;
+import static org.elasticsearch.discovery.PeerFinder.DISCOVERY_FIND_PEERS_INTERVAL_SETTING;
 import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;
 import static org.hamcrest.Matchers.equalTo;
 import static org.hamcrest.Matchers.everyItem;
@@ -494,4 +502,60 @@ public class ClusterDisruptionIT extends AbstractDisruptionTestCase {
         }
     }
 
+    public void testRejoinWhileBeingRemoved() {
+        final String masterNode = internalCluster().startMasterOnlyNode();
+        final String dataNode = internalCluster().startDataOnlyNode(Settings.builder()
+            .put(DISCOVERY_FIND_PEERS_INTERVAL_SETTING.getKey(), "100ms")
+            .put(LEADER_CHECK_INTERVAL_SETTING.getKey(), "100ms")
+            .put(LEADER_CHECK_RETRY_COUNT_SETTING.getKey(), "1")
+            .build());
+
+        final ClusterService masterClusterService = internalCluster().getInstance(ClusterService.class, masterNode);
+        final PlainActionFuture<Void> removedNode = new PlainActionFuture<>();
+        masterClusterService.addListener(clusterChangedEvent -> {
+            if (removedNode.isDone() == false && clusterChangedEvent.state().nodes().getDataNodes().isEmpty()) {
+                removedNode.onResponse(null);
+            }
+        });
+
+        final ClusterService dataClusterService = internalCluster().getInstance(ClusterService.class, dataNode);
+        final PlainActionFuture<Void> failedLeader = new PlainActionFuture<>() {
+            @Override
+            protected boolean blockingAllowed() {
+                // we're deliberately blocking the cluster applier on the master until the data node starts to rejoin
+                return true;
+            }
+        };
+        final AtomicBoolean dataNodeHasMaster = new AtomicBoolean(true);
+        dataClusterService.addListener(clusterChangedEvent -> {
+            dataNodeHasMaster.set(clusterChangedEvent.state().nodes().getMasterNode() != null);
+            if (failedLeader.isDone() == false && dataNodeHasMaster.get() == false) {
+                failedLeader.onResponse(null);
+            }
+        });
+
+        masterClusterService.addHighPriorityApplier(event -> {
+            failedLeader.actionGet();
+            if (dataNodeHasMaster.get() == false) {
+                try {
+                    Thread.sleep(100);
+                } catch (InterruptedException e) {
+                    throw new AssertionError("unexpected", e);
+                }
+            }
+        });
+
+        final MockTransportService dataTransportService
+            = (MockTransportService) internalCluster().getInstance(TransportService.class, dataNode);
+        dataTransportService.addRequestHandlingBehavior(FollowersChecker.FOLLOWER_CHECK_ACTION_NAME, (handler, request, channel, task) -> {
+            if (removedNode.isDone() == false) {
+                channel.sendResponse(new ElasticsearchException("simulated check failure"));
+            } else {
+                handler.messageReceived(request, channel, task);
+            }
+        });
+
+        removedNode.actionGet(10, TimeUnit.SECONDS);
+        ensureStableCluster(2);
+    }
 }

@kkewwei
Contributor

kkewwei commented Aug 25, 2021

Yes, I'm trying to fix it. It tends to happen whenever the IO pressure is relatively high, usually once every few days; the temporary workaround is to reduce the IO pressure by adding data nodes.

I'm thinking about an alternative approach that involves tracking pending disconnections and rejecting joins until the disconnection is complete

If the join request always arrives first, rejecting joins doesn't seem to work. In that case the connection only avoids being closed by mistake if the node-join task executes before the node-left task, but we can't control that order, and even when the tasks do run in that order we end up in a loop (node-join -> node-left -> node-join) and the node never successfully joins the cluster.

It seems more reasonable to add refcounting in ClusterConnectionManager (or in Transport.Connection), since connection management is always under the control of ClusterConnectionManager.

@kkewwei
Contributor

kkewwei commented Aug 26, 2021

@DaveCTurner, have you started working on it? If not, I'd be happy to try to fix it by adding refcounting.

@DaveCTurner
Contributor Author

I have been exploring a few ideas; it is definitely possible to block joins while a disconnection is pending (either reject them or just wait for the disconnection to complete) but it's not very pretty. I think I can see a way to do it with refcounting now: it looks possible to make TransportService#connectToNode acquire a ref on the connection that the listener can release (i.e. it becomes TransportService#connectToNode(DiscoveryNode, ActionListener<Releasable>)). It looks a bit complicated to work out exactly when to release it from each of the three call sites, but that's the idea I'd pursue next.

A colleague points out that this will change the behaviour in the case that something starts dropping packets on a PING connection: we won't close the connection any more, we'll keep on trying to re-use it until the TCP retransmission timeout kicks in, which defaults to 15 minutes. That's the sort of thing we see sometimes with badly-configured firewalls, but we could solve it by verifying the PING connection while validating the join.

It's unlikely I'll have time to dig deeper before the middle of next week now, if you think you'll make progress in the next few days then please go ahead, hopefully we can collaborate on a PR.
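
To illustrate the refcounting idea, here is a minimal sketch with made-up names (not the real TransportService or Transport.Connection API): every user of a connection acquires a reference and receives a Releasable, and the underlying channel is only closed once all references have been released.

import java.util.concurrent.atomic.AtomicInteger;

class RefCountedConnectionSketch {

    interface Releasable {
        void close();
    }

    private final AtomicInteger refCount = new AtomicInteger(1); // the connection's own reference
    private volatile boolean channelClosed;

    // a hypothetical connectToNode(node, listener) would call acquireRef() and hand the
    // resulting Releasable to the listener, which releases it when the caller (join
    // handling, follower checker, ...) no longer needs the connection; a real
    // implementation would also have to reject acquisitions once the count hits zero
    Releasable acquireRef() {
        refCount.incrementAndGet();
        return this::release;
    }

    // called when the node is removed from the cluster state: drop the connection's own reference
    void dropOwnRef() {
        release();
    }

    private void release() {
        if (refCount.decrementAndGet() == 0) {
            channelClosed = true; // real code would close the underlying TCP channel here
        }
    }

    boolean isClosed() {
        return channelClosed;
    }
}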

@kkewwei
Contributor

kkewwei commented Aug 26, 2021

It's my honor to cooperate with you, and thank you very much for the ideas; I will try to implement a fix along those lines.

@kkewwei
Contributor

kkewwei commented Aug 31, 2021

@DaveCTurner, I've just put up an attempt at solving it. It takes a different approach from what you described and may not meet your requirements, so I haven't added a unit test yet.

If you can provide suggestions, I'd be happy to address them, or if there is a better way to implement it, I'd be glad to see it.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Sep 2, 2021
Today it's possible to open two connections to a node, and then we
notice when registering the second connection and close it instead.
Fixing elastic#67873 will require us to keep tighter control over the identity
and lifecycle of each connection, and opening redundant connections gets
in the way of this. This commit adds a check for an existing connection
_after_ marking the connection as pending, which guarantees that we
don't open those redundant connections.
DaveCTurner added a commit that referenced this issue Sep 3, 2021
Today it's possible to open two connections to a node, and then we
notice when registering the second connection and close it instead.
Fixing #67873 will require us to keep tighter control over the identity
and lifecycle of each connection, and opening redundant connections gets
in the way of this. This commit adds a check for an existing connection
_after_ marking the connection as pending, which guarantees that we
don't open those redundant connections.
DaveCTurner added a commit that referenced this issue Sep 3, 2021
Today it's possible to open two connections to a node, and then we
notice when registering the second connection and close it instead.
Fixing #67873 will require us to keep tighter control over the identity
and lifecycle of each connection, and opening redundant connections gets
in the way of this. This commit adds a check for an existing connection
_after_ marking the connection as pending, which guarantees that we
don't open those redundant connections.
DaveCTurner added a commit that referenced this issue Sep 14, 2021
Today we open connections to other nodes in various places and largely
assume that they remain open as needed, only closing them when applying
a cluster state that removes the remote node from the cluster. This
isn't ideal: we might preserve unnecessary connections to remote nodes
that aren't in the cluster if they never manage to join the cluster, and
we might also disconnect from a node that left the cluster while it's in
the process of re-joining too (see #67873).

With this commit we move to a model in which each user of a connection
to a remote node acquires a reference to the connection that must be
released once it's no longer needed. Connections remain open while there
are any live references, but are now actively closed when all references
are released.

Fixes #67873
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Sep 14, 2021
Today we open connections to other nodes in various places and largely
assume that they remain open as needed, only closing them when applying
a cluster state that removes the remote node from the cluster. This
isn't ideal: we might preserve unnecessary connections to remote nodes
that aren't in the cluster if they never manage to join the cluster, and
we might also disconnect from a node that left the cluster while it's in
the process of re-joining too (see elastic#67873).

With this commit we move to a model in which each user of a connection
to a remote node acquires a reference to the connection that must be
released once it's no longer needed. Connections remain open while there
are any live references, but are now actively closed when all references
are released.

Fixes elastic#67873
Backport of elastic#77295
elasticsearchmachine pushed a commit that referenced this issue Sep 14, 2021
Today we open connections to other nodes in various places and largely
assume that they remain open as needed, only closing them when applying
a cluster state that removes the remote node from the cluster. This
isn't ideal: we might preserve unnecessary connections to remote nodes
that aren't in the cluster if they never manage to join the cluster, and
we might also disconnect from a node that left the cluster while it's in
the process of re-joining too (see #67873).

With this commit we move to a model in which each user of a connection
to a remote node acquires a reference to the connection that must be
released once it's no longer needed. Connections remain open while there
are any live references, but are now actively closed when all references
are released.

Fixes #67873
Backport of #77295
@howardhuanghua
Contributor

Data node log:

[2021-08-23T17:16:03,616][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:04,266][WARN ][o.e.c.c.ClusterFormationFailureHelper] [data1] master not discovered yet: have discovered [{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, {master2}{7T_YyUDPRsquBstSmRjGcQ}{Sr6FOUfPSCmKtg_71AaPYw}{1.2.3.12}{1.2.3.12:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master3}{cLeWlwq4SbufcDa0Fw0qPQ}{t-qZSS-lRcWcTow0l21xqA}{1.2.3.13}{1.2.3.13:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}]; discovery will continue using [1.2.3.12:9304, 1.2.3.13:9304, 1.2.3.10:9304] from hosts providers and [{master2}{7T_YyUDPRsquBstSmRjGcQ}{Sr6FOUfPSCmKtg_71AaPYw}{1.2.3.12}{1.2.3.12:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master3}{cLeWlwq4SbufcDa0Fw0qPQ}{t-qZSS-lRcWcTow0l21xqA}{1.2.3.13}{1.2.3.13:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}] from last-known cluster state; node term 14, last-accepted version 573089 in term 14
[2021-08-23T17:16:04,609][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:05,961][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:06,838][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:07,610][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:08,610][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:08,976][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:09,610][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:10,806][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:11,913][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:12,911][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:13,771][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:13,911][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:14,267][WARN ][o.e.c.c.ClusterFormationFailureHelper] [data1] master not discovered yet: have discovered [{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, {master2}{7T_YyUDPRsquBstSmRjGcQ}{Sr6FOUfPSCmKtg_71AaPYw}{1.2.3.12}{1.2.3.12:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master3}{cLeWlwq4SbufcDa0Fw0qPQ}{t-qZSS-lRcWcTow0l21xqA}{1.2.3.13}{1.2.3.13:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}]; discovery will continue using [1.2.3.12:9304, 1.2.3.13:9304, 1.2.3.10:9304] from hosts providers and [{master2}{7T_YyUDPRsquBstSmRjGcQ}{Sr6FOUfPSCmKtg_71AaPYw}{1.2.3.12}{1.2.3.12:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master3}{cLeWlwq4SbufcDa0Fw0qPQ}{t-qZSS-lRcWcTow0l21xqA}{1.2.3.13}{1.2.3.13:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}] from last-known cluster state; node term 14, last-accepted version 573089 in term 14
[2021-08-23T17:16:14,911][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:15,912][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:16,912][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:17,808][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:17,912][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:18,913][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:19,914][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:20,914][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:21,890][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:21,914][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:22,914][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:23,915][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:24,497][WARN ][o.e.c.c.ClusterFormationFailureHelper] [data1] master not discovered yet: have discovered [{data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, {master2}{7T_YyUDPRsquBstSmRjGcQ}{Sr6FOUfPSCmKtg_71AaPYw}{1.2.3.12}{1.2.3.12:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master3}{cLeWlwq4SbufcDa0Fw0qPQ}{t-qZSS-lRcWcTow0l21xqA}{1.2.3.13}{1.2.3.13:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}]; discovery will continue using [1.2.3.12:9304, 1.2.3.13:9304, 1.2.3.10:9304] from hosts providers and [{master2}{7T_YyUDPRsquBstSmRjGcQ}{Sr6FOUfPSCmKtg_71AaPYw}{1.2.3.12}{1.2.3.12:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {master3}{cLeWlwq4SbufcDa0Fw0qPQ}{t-qZSS-lRcWcTow0l21xqA}{1.2.3.13}{1.2.3.13:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}] from last-known cluster state; node term 14, last-accepted version 573089 in term 14
[2021-08-23T17:16:24,915][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:25,814][DEBUG][o.e.c.c.JoinHelper       ] [data1] successfully joined {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:25,915][DEBUG][o.e.c.c.JoinHelper       ] [data1] attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}
[2021-08-23T17:16:26,916][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:27,916][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:28,916][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request
[2021-08-23T17:16:29,916][DEBUG][o.e.c.c.JoinHelper       ] [data1] already attempting to join {master1}{zfbqQmrSRuygmRRpsVOwLQ}{ad5KevFXQh6-izGcLzjC5A}{1.2.3.10}{1.2.3.10:9304}{ilmr}{ml.machine_memory=135209578496, rack=gha, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with request JoinRequest{sourceNode={data1}{nCLleDNKQ0mrULPjf-flcw}{ihftDxGVSbitAfPtuE1hJg}{1.2.3.4}{1.2.3.4:9300}{dilrt}{ml.machine_memory=135299469312, rack=gha, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=14, optionalJoin=Optional.empty}, not sending request

Hi @DaveCTurner, in the data node log that @kkewwei shared we can see the data node continuously sending join requests to the master node. It seems its PeerFinder has never been marked as inactive. This is the key point: the data node sends another join request while the master is processing the node-left task, which causes the loop.

Normally, if a node cannot find the master it becomes a candidate and its PeerFinder is marked as active, sending a join request to the discovered master once per second (duplicate join requests are skipped). Once the data node receives a publish request it becomes a follower, the PeerFinder is marked as inactive, and it never sends a join request again (a minimal sketch of this intended mode switching follows below).
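
A minimal sketch of that intended mode switching, with made-up names (this is not the real PeerFinder or Coordinator code):

class PeerFinderModeSketch {

    enum Mode { CANDIDATE, FOLLOWER }

    private Mode mode = Mode.CANDIDATE;
    private boolean peerFinderActive = true;

    // invoked roughly once per find_peers interval while the PeerFinder is active
    void onFindPeersInterval() {
        if (peerFinderActive && mode == Mode.CANDIDATE) {
            sendJoinRequestToDiscoveredMaster(); // duplicate join requests are skipped elsewhere
        }
    }

    // invoked when a publish request from the elected master is accepted
    void onPublishRequestAccepted() {
        mode = Mode.FOLLOWER;
        peerFinderActive = false; // expected: no further join requests after this point
    }

    private void sendJoinRequestToDiscoveredMaster() {
        // placeholder for the actual join request
    }
}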

Do you have any idea why the data node above keeps sending join requests?

@DaveCTurner
Contributor Author

Do you have any idea why the data node above keeps sending join requests?

Yes, it's because the master is repeatedly dropping the connection to the data node. The data node gets told "yes you have joined" but it never receives the cluster state update that adds it to the cluster, so it tries again.

@howardhuanghua
Contributor

A picture helps to understand the whole story of this issue.
[image omitted]

@howardhuanghua
Contributor

Hi @DaveCTurner, a quick question: why are most of the coordination-layer parameters static rather than dynamically updatable? For example the interval, retry count and timeout settings in LagDetector, LeaderChecker and FollowersChecker. Thank you.

@DaveCTurner
Contributor Author

DaveCTurner commented Sep 16, 2021

Making settings dynamic adds substantial complexity. Moreover we strongly recommend not adjusting the settings you mention at all, let alone adjusting them dynamically.

If the defaults don't work for you then there's something else wrong in your system, maybe a bug or maybe something environmental. We made these settings so we had a way to implement a temporary workaround for a bug if one were found, but you shouldn't have them set permanently, let alone need to adjust them dynamically.

Let's investigate the underlying problem you're trying to solve by adjusting these settings, either in a new Github issue or in a support case.

@howardhuanghua
Contributor

@DaveCTurner, thanks for the patient reply. In this case we tried to increase discovery.find_peers_interval to 5s to avoid frequent join requests while the node-left task is being processed, but we found that it's static, and a rolling restart just for this change is a bit costly. I think this is a simple temporary workaround for clusters on older versions, since #77672 is a big change that would be hard to backport. Please help confirm that increasing this interval would have no impact other than a slight delay in nodes joining.
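
For reference, this is the same setting key the reproduction test above uses, here raised to 5s. A sketch only: because the setting is static (node scope, not dynamic), it only takes effect when a node starts up with it, hence the rolling restart.

import org.elasticsearch.common.settings.Settings;

import static org.elasticsearch.discovery.PeerFinder.DISCOVERY_FIND_PEERS_INTERVAL_SETTING;

public class FindPeersIntervalExample {

    // build node settings with a slower find-peers interval
    public static Settings slowerPeerFinding() {
        return Settings.builder()
            .put(DISCOVERY_FIND_PEERS_INTERVAL_SETTING.getKey(), "5s")
            .build();
    }
}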

@DaveCTurner
Contributor Author

Please help to confirm that increase this interval would not have other impact except a little delay for the node join.

That sounds right to me, and yes we will not be backporting #77672 to versions before 7.16.0.

Just to emphasise that we are still very interested in understanding the underlying slowness in your clusters. We have been profiling some of our own high-shard-count clusters recently and have found a number of places that deserve optimisation (you may have noticed a few PRs on the subject). We're very receptive to any insights you can share from your own profiling.

@howardhuanghua
Contributor

Thanks @DaveCTurner. In this case the main issue is that large, heavy search requests caused a resource bottleneck (IO/CPU), which in turn caused some of the nodes to leave temporarily (e.g. due to follower lagging). The cluster then stays in the left/join loop above for a long time.

On the other hand, we are optimizing cluster state update performance for high-shard-count clusters, for example in #77266, which is only one piece of the whole optimization effort. Would you please list the related shard-scalability optimizations you are working on? We could cooperate further in the future.

@DaveCTurner
Contributor Author

For instance we'd like to understand why you are seeing followers leaving due to lagging at all - that means it took over 2 minutes to process a cluster state, which is a very long time even on a heavily loaded node. Master-ineligible nodes should be able to process a new cluster state without doing any IO at all, and shouldn't need much CPU either, so if you are seeing resource bottlenecks delaying these things for >2 minutes then that sounds like a bug.

There's no overarching list of tasks, we are just profiling various scenarios and addressing any unnecessary bottlenecks that we find.

@howardhuanghua
Contributor

howardhuanghua commented Sep 19, 2021

We've previously seen IO utilisation stay at 100% for several minutes on one of our public cloud ES cluster nodes; that was a cold node with HDDs running ES 7.5, where persisting the cluster state on a node still calls fsync. But recently I also saw a 7.10 cluster hit the lagging issue even though resource usage was not high at all. I can't dig into the root cause because the logs are at INFO level; I'll keep an eye on these cases in our environment and open another issue if I find any clues.

We have been profiling some of our own high-shard-count clusters recently and have found a number of places that deserve optimisation (you may have noticed a few PRs on the subject).

Would you please share these related PRs?
