Connection not re-established post unreachability resolution #31081
@patriknw you might be interested in this.
@nick-nachos Thanks for reporting. There is some additional logging that can be enabled to debug this:
Hi @patriknw, thank you very much for your reply. First things first: I forgot to mention in my original post that we're using Akka Cluster Classic, not Typed. Here is some extra info I dug up with regard to your questions:
What was interesting to me, and one of the reasons I raised the issue here, is that although node-0 and node-1 could not establish app-level communications with node-3 (and we can also see the dropped GossipStatus/GossipEnvelope messages from them towards it), there didn't seem to be any unreachable nodes detected at any point beyond the original detection. Based on the info above, I would have expected node-3 to detect nodes 0 and 1 as unreachable if they couldn't send any messages to it, unless heartbeats somehow made it to their destination when all other messages didn't (unlikely). For now we will be rolling back to Akka 2.6.14. I will also try to select some lower-volume environment out of those that were affected to enable the logging options you mentioned, and also switch the Akka logs to debug level, so that we have more info in case this hits us again. In the meantime, if you've got any thoughts on this feel free to share.
The point-to-point failure detection should be seen in the logs as "Marking node as UNREACHABLE" and "Marking node as REACHABLE". Exact logging of the heartbeat roundtrips can be seen with
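For anyone following along, the detailed heartbeat/gossip logging can be switched on with settings along these lines; the exact keys below are my assumption based on Akka's reference configuration, so double-check them against the reference.conf of your Akka version:

```hocon
akka {
  loglevel = "DEBUG"
  cluster.debug {
    # Assumed setting: log every failure-detector heartbeat roundtrip
    verbose-heartbeat-logging = on
    # Assumed setting: log cluster membership gossip in detail
    verbose-gossip-logging = on
  }
}
```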
A small update on this @patriknw, rolling back to Akka 2.6.14 seems to have done the trick for us. Our clusters seem to properly recover after the resolution of the same type of network connectivity issues: once the unreachable node becomes reachable again, the leader reports that it can properly resume its duties.
Thanks for the update. I can't think of anything we changed that could explain that (famous last words 😄). We'll have to look into the change logs. However, I think we have too few hints about what the problem could be. If you have a way to reproduce it and could enable more logs and share the raw logs with us, that could be very useful for tracking this down.
Actually @patriknw, the famous last words might have been mine (me and my big mouth), as I've just happened to troubleshoot an incident similar to the original, in a cluster of 4 nodes (node-0, node-1, node-2, node-3). This time I was able to switch the log level of the affected node (node-3) to debug at runtime, although that was done reactively after the fact, which means that we don't have any debug-level logs from the moment the issue started. That said, I'm still going to post the logs here in case you pick up anything useful. We're also going to enable this diagnostic logging by default on these problematic environments, so once this happens again we'll have even more output available. The log file attached contains around 6 minutes of Akka logs from node-3, i.e. the node that was unable to send app-level messages to node-2. Some key takeaways:
Some extra info in case it's needed:
No worries if the input is not enough this time either; as I mentioned earlier, we will now have Akka logs in debug by default, so next time this happens we'll get as much of the full picture as possible.
Thanks for sharing those logs. I agree with your conclusions.
That could also happen if you send messages in a burst without flow control. The buffer size is only 3072. If you send faster than it is able to serialize and push over the wire, it will overflow. If it's just temporary bursts you could get around it by increasing the buffer size. Config
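If I'm reading Akka's reference.conf correctly, the buffer in question should be the Artery outbound message queue; increasing it would look roughly like this (the key name is my assumption, please verify against the reference configuration of your Akka version):

```hocon
akka.remote.artery.advanced {
  # Default is 3072; raise it if short bursts overflow the outbound queue
  outbound-message-queue-size = 30720
}
```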
That might be because when the nodes are in sync they only exchange a smaller GossipStatus message. The messages from the Replicator are also a good indication that the communication works.
Thanks for your response @patriknw. The message rate in these environments is very low in general, but what you've just said made something "click": I went back to our logs, at the exact moment the incident started, and these are the exact logs from node-3 (the node that is later unable to send):
At that point the diagnostic logging was not enabled yet, so all we have is info-level logs and above, but there's something really interesting here: node-2 was detected unreachable for about 32 seconds, during which time there were no association or materializer failures reported. This suggests that the association (and so the underlying stream(s), I suppose) remained alive during that time. Could this mean that the TCP stream entered some back-pressure mode which causes the buffering?

I suspect that's the case, and if so, there's a second finding in the logs above that is very useful: you can see that after reachability has been restored, node-3 reports its first dead letters due to buffer overflow 77 minutes later (no logs at all in between). To me, this is indicative of the very low traffic volume of these particular environments, and makes me suspect that the most plausible explanation here is that the TCP stream has been stuck in back-pressure (or something similar) and is thus unable to send any of the buffered messages, even though the reachability issue has been resolved.

The alternative scenario, in which messages are being sent by Artery but there is also message loss (overflow) taking place due to the intermediate buffering, would have made more sense for our higher-volume environments (where we've never had any issues, though). If that were the case here in these lower-volume ones, what I would have expected to see is some intermediate overflowing, but at some point the buffer would have cleared up (it shouldn't take too long to send 3k messages) and the message flow would have gone back to normal. Instead, what we get is (seemingly) persistent message loss, until we eventually delete the affected pod. What do you think of this hypothesis?
In any case, as I said in my previous post, if the data we currently have at our disposal is not enough, we're rolling out with diagnostic logging enabled by default on these environments this week, so once we get hit again we should have something more tangible to work with (hopefully).
@patriknw
Hi @patriknw. We were able to capture the diagnostic cluster logs from an incident a few days ago. This time we were also able to find the cause of the connectivity loss: the Istio proxy sidecar of the affected pod died with an OOM. The sidecar took around 30 seconds to recover (this seems consistent every time), during which time the pod loses all network connectivity. After connectivity recovers, we also get hit by the Akka clustering issue, due to which one of the other nodes cannot send application messages to the recovered node. In this case it was node-2 that lost connectivity (again), and once it recovered it was node-3 that was unable to send application messages to it (again).

I can confirm from the logs that heartbeats are properly exchanged between the nodes post-recovery. We can infer the message loss from the AskTimeoutExceptions we see in our application logs (not attached) for node-3. At the same time, the very same messages are successfully sent (and responded to) from nodes 0 and 1 towards node-2 (send rates are similar across nodes, as the message sending is triggered by external HTTP requests that are dispatched in a round-robin fashion by our load balancer).

node-2 was detected as unreachable roughly between 04:52:08 and 04:52:40 (30-ish seconds). The logs start around 2 minutes before that and end around 7.5 minutes after that (for some context). What you won't see in the logs this time is the dead letters (GossipEnvelope etc.) due to buffering: the reason is that the message send rate in this environment (as mentioned in my previous post) is so low that it would have taken a while to fill the buffer up (we dealt with the issue swiftly this time). I can assure you, however, that we could definitely see all nodes (0, 1 and 3) reporting AskTimeoutExceptions for requests towards node-2 while it was unreachable, whereas node-3 was the only one that kept doing so after the node became reachable again.
I've spent some time going through these logs to see if there's anything standing out, but I couldn't find much (with my layman's eyes at least). Some little things that I picked up (which may well be nothing-burgers):
I hereby attach the logs of all four nodes ordered by date: akka.log.zip. It is formatted as
Actually, that second bullet in my comment above may not have been a nothing-burger at all: I've just remembered seeing logs of the following format:
and what I have also noticed sometimes is such logs being pretty close (time-wise) to the creation of an outbound connection after the reachability issue has resolved. I would make a bet that the outbound connection used for the large user messages is created dynamically if/when a large message is ever sent for the first time. If that's the case, then this second finding does the trick: node-0 and node-3 may have both created 2 outbound connections to node-2, but both connections made by node-0 seem unrelated to large messages, whereas the second connection made by node-3 seems related:
This would explain the 20-second difference between node-3 creating the first and second connection: the first connection was for the control messages, and the second one for the user-large messages, in response to a message sent by an actor on node-2 that required a large response to be sent back (by node-3). So effectively node-3 never established a connection for the plain user-space messages, only the ones for control and user-large. node-0 on the other hand created the ones for control and user (it didn't need to send any large response to node-2, apparently), and node-1 created all three types of connections. Does any of that make sense, or am I way off?
I had vacation last week, but will take a look at the logs now. Thanks for sharing and analyzing the problem. Yes, connections are established lazily when a message is sent over that "stream". There can be up to 3 outgoing connections to each destination node: the control stream for Akka internals, such as heartbeat messages; the ordinary message stream; and the large message stream.
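For context, the large message stream is only used for recipients explicitly configured for it. A sketch of that configuration, where the actor path /user/large-responder is a hypothetical example of mine, not something from this issue:

```hocon
akka.remote.artery {
  # Messages to these recipient paths travel over the dedicated
  # large-message stream instead of the ordinary message stream
  large-message-destinations = [
    "/user/large-responder"
  ]
}
```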
That's normal and nothing to worry about. The membership state is like a CRDT and such "conflicts" are resolved automatically and deterministically. |
Hey @patriknw, welcome back! I kind of figured that the remote/local gossip thing was probably a red herring, thanks for confirming. As discussed in my previous comment anyway, I think I can safely say, based on the logs, that it's most probably the case that the user channel (ordinary message stream) is never created from node-3 to node-2, hence the inability to send any user messages. What is not clear to me, though, is why that happens, but hopefully there might be something there that will help you figure it out.
Searching for "Received gossip status from [akka://cluster@node-3" confirms that the ordinary message stream is not working between node-3 and node-2. That message is from the ddata Replicator and is sent over the ordinary message stream. node-2 is not receiving it from node-3 (but from the others). node-3 is receiving it from all (including node-2).

That is also the reason why the "Receiving gossip from [UniqueAddress(akka://cluster@node-2" continues in node-3. Normally that should sync and then not be sent any more when there are no changes. This is the Cluster membership gossip, so not the same as the Replicator, but it's also sent over the ordinary message stream.

During the error period (until 04:52:37) we can see the errors from TcpOutgoingConnection and ArteryTcpTransport when trying to establish the connection: "Connection reset by peer" in node-3 and "UnknownHostException" in node-2.

Searching for "Outbound connection opened to" and subsequent log messages, we can see that node-2 establishes 2 new connections to all nodes, plus additional connections for the large message stream. As expected. From node-3, "Outbound connection opened to" shows only two connections being established to node-2, one of which is for the large stream. The connection for the ordinary message stream is missing. We need to figure out why it's not establishing that connection from node-3 to node-2 ... The failed attempts before that:
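The log searches described above are easy to script; a minimal sketch, where the sample lines written to the temp file are fabricated stand-ins (not taken from the attached logs) just to show the grep patterns:

```shell
# Fabricated sample lines standing in for the real akka.log contents
cat > /tmp/akka-sample.log <<'EOF'
04:51:02 DEBUG Received gossip status from [akka://cluster@node-3:25520]
04:52:41 DEBUG Outbound connection opened to [akka://cluster@node-2:25520]
04:52:41 DEBUG Outbound connection opened to [akka://cluster@node-2:25520] (large)
EOF

# How often does a node see Replicator gossip status from node-3?
grep -c 'Received gossip status from \[akka://cluster@node-3' /tmp/akka-sample.log

# How many outbound connections were opened to node-2 after recovery?
grep -c 'Outbound connection opened to \[akka://cluster@node-2' /tmp/akka-sample.log
```

Running the same two greps against each node's real log makes the missing ordinary-stream connection from node-3 stand out quickly.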
Maybe it's something with that "Broken pipe" and "Upstream finished". I'll dig further... |
Yes @patriknw, that's exactly how I remember this to be. It's interesting because I think some of the other nodes had similar errors while trying to establish communication with node-2 (before it became reachable, of course), like that "Upstream finished" thing (I think I've seen it in one of the other nodes, not 100% sure though). If this is an indicator of a potential issue, I wonder whether there could be some race condition around it (so that it doesn't always cause a problem).
I think the problem is the unexpected "Upstream finished" in some situations. I'm working on how to fix that. |
* In the logs attached to the issue we can see that an outbound connection is not re-established after "Upstream finished" (broken pipe).
* Normally that is handled by the inner RestartFlow around the connection flow, but if that has reached its maxRestarts (3) it will complete the entire stream, and attachOutboundStreamRestart would not handle that as a restart case.
Great stuff, thank you for looking into this! |
… (#31232)
* In the logs attached to the issue we can see that an outbound connection is not re-established after "Upstream finished" (broken pipe).
* Normally that is handled by the inner RestartFlow around the connection flow, but if that has reached its maxRestarts (3) it will complete the entire stream, and attachOutboundStreamRestart would not handle that as a restart case.
@nick-nachos We will probably release Akka 2.6.19 later this week, but if you want to try the fix earlier you can use snapshot version
Thank you @patriknw, I think we'll probably wait for the official release 😄
We've got an unusual clustering issue that has hit us a couple of times this week. Incidentally (or not?) this started happening a week after we upgraded from Akka 2.6.14 to 2.6.18. We had been on 2.6.14 for about 6 months, and prior to that on 2.5.x for a year or so, and we have never seen anything similar. The incident has happened 3-4 times so far in total over the course of a week, in multiple environments, which however are all similar: K8s over AWS using the Istio service mesh. Our service has a deployment with 4 replicas, all of which join an Akka Cluster (per environment) using Artery-TCP.

In all cases, the sequence of events is as follows:

The message above keeps repeating periodically forever, until the cluster is torn down. What's very interesting here is that the reachability status list is empty and the seen field of the node-3 member status is false. Keep in mind that the leader had previously marked the node as reachable.

Based on application behavior we can also deduce that application-level messages are probably not delivered to the problematic node either. Neither a rolling restart nor targeted node restarts have worked to resolve this; we pretty much had to shut the cluster down completely and start up again.

Unfortunately I haven't been able to reproduce this locally, but given the peculiarity of the cluster leader's behavior I was hoping you'd be able to deduce whether there's some particular corner case that could trigger it. Unfortunately we haven't rolled back to 2.6.14 yet, so I can't say whether that resolves the issue or not, but I will make sure to post here once we do. In the meantime, do you have any thoughts on this?