
Connection not re-established post unreachability resolution #31081

Closed
nick-nachos opened this issue Jan 21, 2022 · 21 comments

@nick-nachos
Contributor

nick-nachos commented Jan 21, 2022

We've got an unusual clustering issue that has hit us a couple of times this week. Incidentally (or not?) this started happening a week after we upgraded from Akka 2.6.14 to 2.6.18. We had been on 2.6.14 for about 6 months, and prior to that on 2.5.x for a year or so, and have never seen anything similar. The incident has happened 3-4 times in total over the course of a week, in multiple environments that are all similar: K8s over AWS using the Istio service mesh. Our service has a deployment with 4 replicas, all of which join an Akka Cluster (per environment) using Artery-TCP.
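For reference, each node is configured along these lines (a simplified sketch, not our exact config; the hostname, port and seed list are placeholders):

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object ClusterNode extends App {
  // Simplified sketch of a node's configuration; real hostnames are resolved via K8s/Istio.
  val config = ConfigFactory.parseString(
    """
    akka.actor.provider = cluster
    akka.remote.artery {
      transport = tcp                # Artery-TCP
      canonical.hostname = "node-0"  # placeholder per-pod hostname
      canonical.port = 12345
    }
    akka.cluster.seed-nodes = [
      "akka://cluster@node-0:12345",
      "akka://cluster@node-1:12345"
    ]
    """).withFallback(ConfigFactory.load())

  val system = ActorSystem("cluster", config)
}
```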

In all cases, the sequence of events is as follows:

  1. One of the nodes (node-3 in this case) temporarily loses network connectivity, thus getting detected as unreachable:
node-1 [15:24:00.366] WARN <Cluster> Cluster Node [akka://cluster@node-1:12345] - Marking node as UNREACHABLE [Member(akka://cluster@node-3:12345, Up)]. 
node-3 [15:24:00.938] WARN <Cluster> Cluster Node [akka://cluster@node-3:12345] - Marking node as UNREACHABLE [Member(akka://cluster@node-0:12345, Up)]. 
node-3 [15:24:00.939] WARN <Cluster> Cluster Node [akka://cluster@node-3:12345] - Marking node as UNREACHABLE [Member(akka://cluster@node-1:12345, Up)]. 
node-3 [15:24:00.939] WARN <Cluster> Cluster Node [akka://cluster@node-3:12345] - Marking node as UNREACHABLE [Member(akka://cluster@node-2:12345, Up)]. 
node-2 [15:24:01.187] WARN <Cluster> Cluster Node [akka://cluster@node-2:12345] - Marking node as UNREACHABLE [Member(akka://cluster@node-3:12345, Up)]. 
node-0 [15:24:01.620] WARN <Cluster> Cluster Node [akka://cluster@node-0:12345] - Marking node as UNREACHABLE [Member(akka://cluster@node-3:12345, Up)]. 
node-3 [15:24:01.850] INFO <Cluster> Cluster Node [akka://cluster@node-3:12345] - is the new leader among reachable nodes (more leaders may exist) 
node-3 [15:24:03.879] WARN <Association> Outbound message stream to [akka://cluster@node-0:12345] failed. Restarting it. akka.stream.StreamTcpException: Tcp command [Connect(node-0:12345,None,List(),Some(5000 milliseconds),true)] failed because of java.net.UnknownHostException: node-0 
node-3 [15:24:03.941] INFO <RemoteActorRefProvider$RemoteDeadLetterActorRef> Message [akka.cluster.GossipStatus] from Actor[akka://cluster/system/cluster/core/daemon#-812544556] to Actor[akka://cluster/deadLetters] was not delivered. [10] dead letters encountered, no more dead letters will be logged in next [5.000 min]. If this is not an expected behavior then Actor[akka://cluster/deadLetters] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 
node-3 [15:24:03.959] WARN <Materializer> [outbound connection to [akka://cluster@node-0:12345], message stream] Upstream failed, cause: StreamTcpException: Tcp command [Connect(node-0:12345,None,List(),Some(5000 milliseconds),true)] failed because of java.net.UnknownHostException: node-0 
node-2 [15:24:04.114] WARN <Association> Outbound message stream to [akka://cluster@node-3:12345] failed. Restarting it. akka.stream.StreamTcpException: The connection closed with error: Connection reset by peer 
node-3 [15:24:04.196] WARN <Association> Outbound message stream to [akka://cluster@node-2:12345] failed. Restarting it. akka.stream.StreamTcpException: Tcp command [Connect(node-2:12345,None,List(),Some(5000 milliseconds),true)] failed because of java.net.UnknownHostException: node-2 
node-3 [15:24:04.237] WARN <Materializer> [outbound connection to [akka://cluster@node-2:12345], message stream] Upstream failed, cause: StreamTcpException: Tcp command [Connect(node-2:12345,None,List(),Some(5000 milliseconds),true)] failed because of java.net.UnknownHostException: node-2 
node-3 [15:24:04.319] WARN <Association> Outbound message stream to [akka://cluster@node-1:12345] failed. Restarting it. akka.stream.StreamTcpException: Tcp command [Connect(node-1:12345,None,List(),Some(5000 milliseconds),true)] failed because of java.net.UnknownHostException: node-1 
node-2 [15:24:05.226] WARN <Materializer> [outbound connection to [akka://cluster@node-3:12345], message stream] Upstream failed, cause: StreamTcpException: The connection closed with error: Connection reset by peer 
node-3 [15:24:07.299] WARN <Materializer> [outbound connection to [akka://cluster@node-1:12345], message stream] Upstream failed, cause: StreamTcpException: Tcp command [Connect(node-1:12345,None,List(),Some(5000 milliseconds),true)] failed because of java.net.UnknownHostException: node-1 
node-3 [15:24:13.420] WARN <Materializer> [outbound connection to [akka://cluster@node-2:12345], message stream] Upstream failed, cause: StreamTcpException: Tcp command [Connect(node-2:12345,None,List(),Some(5000 milliseconds),true)] failed because of java.net.ConnectException: Connection refused 
  2. Eventually connectivity issues resolve, and the node (node-3) becomes reachable again:
node-1 [15:24:14.640] INFO <Cluster> Cluster Node [akka://cluster@node-1:12345] - Marking node as REACHABLE [Member(akka://cluster@node-3:12345, Up)]. 
node-0 [15:24:14.875] INFO <Cluster> Cluster Node [akka://cluster@node-0:12345] - Marking node as REACHABLE [Member(akka://cluster@node-3:12345, Up)]. 
node-3 [15:24:15.213] INFO <Cluster> Cluster Node [akka://cluster@node-3:12345] - Marking node as REACHABLE [Member(akka://cluster@node-2:12345, Up)]. 
node-3 [15:24:15.213] INFO <Cluster> Cluster Node [akka://cluster@node-3:12345] - Marking node as REACHABLE [Member(akka://cluster@node-1:12345, Up)]. 
node-3 [15:24:15.213] INFO <Cluster> Cluster Node [akka://cluster@node-3:12345] - Marking node as REACHABLE [Member(akka://cluster@node-0:12345, Up)]. 
node-2 [15:24:15.459] INFO <Cluster> Cluster Node [akka://cluster@node-2:12345] - Marking node as REACHABLE [Member(akka://cluster@node-3:12345, Up)]. 
node-3 [15:24:16.130] INFO <Cluster> Cluster Node [akka://cluster@node-3:12345] - is no longer leader 
  3. Despite reachability having been restored however, the leader (node-0) still reports that it's unable to perform its duties:
node-0 [15:24:20.727] INFO <Cluster> Cluster Node [akka://cluster@node-0:12345] - Leader can currently not perform its duties, reachability status: [], member status: [akka://cluster@node-0:12345 Up seen=true, akka://cluster@node-1:12345 Up seen=true, akka://cluster@node-2:12345 Up seen=true, akka://cluster@node-3:12345 Up seen=false] 

The message above keeps repeating periodically forever, until the cluster is torn down. What's very interesting here is that the reachability status list is empty and the seen field of the node-3 member status is false. Keep in mind that the leader had previously marked the node as reachable.

  4. Subsequently, besides this message being logged every now and then, we also start getting logs for dropped GossipEnvelope and GossipStatus messages:
node-1 [16:01:05.542] INFO <RemoteActorRef> Message [akka.cluster.GossipEnvelope] from Actor[akka://cluster/system/cluster/core/daemon#-1162634486] to Actor[akka://cluster@node-3:12345/system/cluster/core/daemon#-812544556] was dropped. Due to overflow of send queue, size [3072]. [7] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 
node-1 [16:01:05.542] INFO <RemoteActorRef> Message [akka.cluster.GossipEnvelope] from Actor[akka://cluster/system/cluster/core/daemon#-1162634486] to Actor[akka://cluster@node-3:12345/system/cluster/core/daemon#-812544556] was not delivered. [8] dead letters encountered. If this is not an expected behavior then Actor[akka://cluster@node-3:12345/system/cluster/core/daemon#-812544556] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 
[...]
node-1 [16:06:07.463] INFO <RemoteActorRef> Message [akka.cluster.GossipEnvelope] from Actor[akka://cluster/system/cluster/core/daemon#-1162634486] to Actor[akka://cluster@node-3:12345/system/cluster/core/daemon#-812544556] was dropped. Due to overflow of send queue, size [3072]. [949] dead letters encountered, of which 938 were not logged. The counter will be reset now. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 
node-1 [16:06:07.463] INFO <RemoteActorRef> Message [akka.cluster.GossipEnvelope] from Actor[akka://cluster/system/cluster/core/daemon#-1162634486] to Actor[akka://cluster@node-3:12345/system/cluster/core/daemon#-812544556] was not delivered. [1] dead letters encountered. If this is not an expected behavior then Actor[akka://cluster@node-3:12345/system/cluster/core/daemon#-812544556] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 
node-1 [16:06:08.308] INFO <RemoteActorRef> Message [akka.cluster.GossipStatus] wrapped in [akka.actor.ActorSelectionMessage] from Actor[akka://cluster/system/cluster/core/daemon#-1162634486] to Actor[akka://cluster@node-3:12345/] was dropped. Due to overflow of send queue, size [3072]. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 
node-1 [16:06:08.309] INFO <RemoteActorRefProvider$RemoteDeadLetterActorRef> Message [akka.cluster.GossipStatus] from Actor[akka://cluster/system/cluster/core/daemon#-1162634486] to Actor[akka://cluster/deadLetters] was not delivered. [3] dead letters encountered. If this is not an expected behavior then Actor[akka://cluster/deadLetters] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 
node-1 [16:06:08.482] INFO <RemoteActorRef> Message [akka.cluster.GossipEnvelope] from Actor[akka://cluster/system/cluster/core/daemon#-1162634486] to Actor[akka://cluster@node-3:12345/system/cluster/core/daemon#-812544556] was dropped. Due to overflow of send queue, size [3072]. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 

Based on application behavior we can also deduce that application-level messages are probably not delivered to the problematic node either. Neither a rolling restart nor targeted node restarts resolved this; we pretty much had to shut the cluster down completely and start it up again.

Unfortunately I haven't been able to reproduce this locally, but given the peculiarity of the cluster leader's behavior I was hoping you'd be able to deduce whether there's some particular corner case that could trigger it. We haven't rolled back to 2.6.14 yet, so I can't say whether that resolves the issue or not, but I will make sure to post here once we do. In the meantime, do you have any thoughts on this?

@nick-nachos
Contributor Author

@patriknw you might be interested in this.

@patriknw
Member

@nick-nachos Thanks for reporting. seen=false is indeed what prevents the leader from performing its duties. It means that the latest gossip state hasn't been fully propagated to/from node-3. There still seem to be some connectivity issues between the nodes. Can you see whether it's only node-2 and node-3 that can't talk, or whether the other nodes also have issues with node-3? I suspect they do.

There is some additional logging that can be enabled to debug this.
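For example, something along these lines (a sketch of standard Akka Cluster / Artery debug switches; which of them are actually needed here is left open):

```scala
import com.typesafe.config.ConfigFactory

object DiagnosticLogging {
  // Standard debug switches, all off by default. Whether these are exactly the
  // settings meant above is an assumption.
  val config = ConfigFactory.parseString(
    """
    akka.loglevel = DEBUG
    akka.cluster.debug.verbose-gossip-logging = on
    akka.remote.artery.log-received-messages = on
    akka.remote.artery.log-sent-messages = on
    """)
}
```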

@patriknw patriknw added the "0 - new" (Ticket is unclear on its purpose or if it is valid or not) and "t:cluster" labels on Jan 24, 2022
@nick-nachos
Contributor Author

nick-nachos commented Jan 24, 2022

Hi @patriknw thank you very much for your reply. First things first, I forgot to mention in my original post that we're using Akka Cluster Classic, not Typed.

Some extra info I dug up regarding your questions:

  • I can see from our application logs that node-0 (leader) and node-1 seemed unable to send application messages to node-3 whereas the reverse worked properly (i.e. node-3 seemed to send the same application level messages to node-1 successfully). node-2 seemed to have successful bidirectional communication with that node (node-3).
  • There's another use-case we have where we apply the ask pattern: in this case however, the original sender is also a large message recipient (small request, large reply). In this case:
    • node-0 and node-1 did not manage to properly send anything to node-3, thus getting ask timeouts.
    • node-3 was able to send to and get responses from node-1 (node-0 did not apply in this case; also remember, the response is received by the large-message recipient)
    • node-2 to/from node-3 comms worked properly.

What was interesting to me, and one of the reasons I raised the issue here, is that although node-0 and node-1 could not establish app-level communication with node-3, and we can also see the dropped GossipStatus/GossipEnvelope messages from them towards it, there didn't seem to be any unreachable nodes detected at any point beyond the original detection. Based on the info above, I would have expected node-3 to detect nodes 0 and 1 as unreachable if they couldn't send any messages to it, unless heartbeats somehow made it to their destination when all other messages didn't (unlikely).

For now we will be rolling back from 2.6.18 to 2.6.14 just in case any of the changes between those two versions (with regards to Artery and TCP streams) could have caused any edge-case regression. The reason for that is that we have seen our nodes temporarily lose connectivity on K8s environments in the past, but the cluster would always recover once connectivity had been restored.

I will also try to select some lower-volume environment out of those that were affected to enable the logging options you mentioned, and also switch the Akka logs to debug level, so that we have more info in case this hits us again.

In the meantime, if you've got any thoughts on this feel free to share.

@patriknw
Member

The point-to-point failure detection should be seen in logs as "Marking node as UNREACHABLE" and "Marking node as REACHABLE". Exact logging of the heartbeat roundtrips can be seen with verbose-heartbeat-logging, but it's rather verbose.
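For completeness, that setting lives under akka.cluster.debug; a minimal snippet:

```scala
import com.typesafe.config.ConfigFactory

object HeartbeatLogging {
  // Logs every heartbeat round-trip, so expect a lot of output.
  val config = ConfigFactory.parseString(
    "akka.cluster.debug.verbose-heartbeat-logging = on")
}
```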

@nick-nachos
Contributor Author

A small update on this @patriknw, rolling back to Akka 2.6.14 seems to have done the trick for us. Our clusters seem to properly recover after the resolution of the same type of network connectivity issues: once the unreachable node becomes reachable again, the leader reports that it can properly resume its duties.

@patriknw
Member

Thanks for the update. I can't think of anything we changed that could explain that (famous last words 😄). We'll have to look into the change logs. However, I think we have too few hints about what the problem could be. If you have a way to reproduce it and could enable more logs and share the raw logs with us, that would be very useful for tracking this down.

@patriknw patriknw added the "bug" label and removed the "0 - new" label on Feb 16, 2022
@nick-nachos
Contributor Author

nick-nachos commented Feb 16, 2022

Actually @patriknw the famous last words might have been mine (me and my big mouth), as I just happened to troubleshoot an incident similar to the original one: in a cluster of 4 nodes (node-0, node-1, node-2, node-3) on Akka 2.6.14, node-2 lost connectivity at some point and became unreachable. A few seconds later connectivity was restored and the node became reachable again. Although we didn't get that weird message from the leader (it was able to resume its duties), one of the nodes (node-3) seemed unable to send application-level messages and gossip envelopes to node-2.

This time I was able to switch the log level of that node (node-3) to debug at runtime, although that was done reactively after the fact, which means we don't have any debug-level logs from the moment the issue started. That said, I'm still going to post the logs here in case you pick up anything useful. We're also going to enable this diagnostic logging by default on these problematic environments, so once this happens again we'll have even more output available.

The log file attached contains around 6 minutes of Akka logs from node-3, i.e. the node that was unable to send app-level messages to node-2. Some key takeaways:

  • The dead-letters logging of the RemoteActorRef class suggests that buffering has taken place, the buffer has overflowed, and subsequent messages are discarded immediately. I wonder if that means that some underlying TCP stream may have silently failed, gotten stuck in back-pressure, or something of the sort.
  • node-3 is actually able to send and receive heartbeats to/from all nodes of the cluster (including node-2), which proves that there is no actual physical networking issue in place
  • node-3 is logging that it's receiving gossip from node-2 (ClusterGossip class), although it is itself unable to send its own GossipEnvelope to node-2 (based on the dead-letters logging)
  • Interestingly, there are no log entries about receiving gossip from nodes other than node-2 (again the ClusterGossip class). The Replicator class, however, does log received gossip statuses from all nodes.

Some extra info in case it's needed:

  • We make use of Cluster Singleton and Cluster Sharding
  • We also make use of the large message support of Artery: for the response messages we know might be large, we make sure to spin up temporary receivers under a specific path (user/service/largeMessageReceivers), in a similar fashion to the ask pattern (roughly sketched below). That might explain the messages from the Deserialization class.
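Roughly, that large-message setup looks like this (a simplified sketch with illustrative names, not our actual code):

```scala
import akka.actor.{Actor, ActorRef, Props}
import com.typesafe.config.ConfigFactory

object LargeMessageSetup {
  // Routes messages addressed to these receivers over Artery's large message stream.
  val config = ConfigFactory.parseString(
    """
    akka.remote.artery.large-message-destinations = [
      "/user/service/largeMessageReceivers/*"
    ]
    """)

  // Illustrative temporary receiver: created per request by the parent actor at
  // /user/service/largeMessageReceivers, it hands the (potentially large) reply
  // to the original requester and then stops itself, ask-style.
  class LargeReplyReceiver(requester: ActorRef) extends Actor {
    def receive: Receive = {
      case reply =>
        requester ! reply
        context.stop(self)
    }
  }

  def props(requester: ActorRef): Props = Props(new LargeReplyReceiver(requester))
}
```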

No worries if the input is not enough this time either; as I mentioned earlier, we will now have the Akka logs at debug level by default, so next time this happens we'll get as much of the full picture as possible.

node-3.log

@patriknw
Member

Thanks for sharing those logs. I agree with your conclusions.

> suggests that buffering has taken place, the buffer has overflowed and subsequent messages are discarded immediately

That could also happen if you send messages in a burst without flow control. The buffer size is only 3072. If you send faster than it can serialize and push over the wire, it will overflow. If it's just temporary bursts, you could get around it by increasing the buffer size with the config setting akka.remote.artery.advanced.outbound-message-queue-size.
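For example (3072 is the default; the larger value below is only an illustration):

```scala
import com.typesafe.config.ConfigFactory

object OutboundQueueTuning {
  // Default is 3072; a larger queue only helps if the overflow comes from short bursts.
  val config = ConfigFactory.parseString(
    "akka.remote.artery.advanced.outbound-message-queue-size = 6144")
}
```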

> there are no log entries about receiving gossip from nodes other than node-2

That might be because when the nodes are in sync they only exchange a smaller GossipStatus message. The messages from the Replicator are also a good indication that the communication works.

@nick-nachos
Contributor Author

nick-nachos commented Feb 18, 2022

Thanks for your response @patriknw. The message rate in these environments is very low in general, but what you've just said made something "click": I went back to our logs, to the exact moment the incident started, and these are the exact logs from node-3 (the node that is later unable to send):

[16-Feb-2022 06:19:20.624] WARN <Cluster> Cluster Node [akka://cluster@node-3:12345] - Marking node as UNREACHABLE [Member(akka://cluster@node-2:12345, Up)]. 
[16-Feb-2022 06:19:52.245] INFO <Cluster> Cluster Node [akka://cluster@node-3:12345] - Marking node as REACHABLE [Member(akka://cluster@node-2:12345, Up)]. 
[16-Feb-2022 07:36:59.975] INFO <RemoteActorRef> Message [akka.cluster.GossipStatus] wrapped in [akka.actor.ActorSelectionMessage] from Actor[akka://cluster/system/cluster/core/daemon#-347874725] to Actor[akka://cluster@node-2:12345/] was dropped. Due to overflow of send queue, size [3072]. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 
[16-Feb-2022 07:36:59.975] INFO <RemoteActorRefProvider$RemoteDeadLetterActorRef> Message [akka.cluster.GossipStatus] from Actor[akka://cluster/system/cluster/core/daemon#-347874725] to Actor[akka://cluster/deadLetters] was not delivered. [4] dead letters encountered. If this is not an expected behavior then Actor[akka://cluster/deadLetters] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 
[16-Feb-2022 07:37:05.075] INFO <RemoteActorRef> Message [akka.cluster.GossipStatus] wrapped in [akka.actor.ActorSelectionMessage] from Actor[akka://cluster/system/cluster/core/daemon#-347874725] to Actor[akka://cluster@node-2:12345/] was dropped. Due to overflow of send queue, size [3072]. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 

At that point the diagnostic logging was not enabled yet, so all we have is info-level logs and above, but there's something really interesting here: node-2 was detected as unreachable for about 32 seconds, during which time there were no association or materializer failures reported. This suggests that the association (and, I suppose, the underlying stream(s)) remained alive during that time. Could this mean that the TCP stream entered some back-pressure mode which causes the buffering?

I suspect that's the case, and if so, there's a second finding in the logs above that is very useful: you can see that after reachability was restored, node-3 reported its first dead letters due to buffer overflow 77 minutes later (no logs at all in between). To me, this is indicative of the very low traffic volume of these particular environments, and it makes me suspect that the most plausible explanation is that the TCP stream has been stuck in back-pressure (or something similar) and is thus unable to send any of the buffered messages, even though the reachability issue has been resolved.

The alternative scenario, in which Artery is sending messages but messages are also being lost (overflow) due to the intermediate buffering, would have made more sense for our higher-volume environments (where we've never had any issues, though). If that were the case in these lower-volume ones, I would have expected to see some intermittent overflowing, but at some point the buffer should have cleared up (it shouldn't take too long to send 3k messages) and the message flow should have gone back to normal. Instead, what we get is (seemingly) persistent message loss, until we eventually delete the affected pod.

What do you think of this hypothesis?

In any case, as I said in my previous post, if the data we currently have at our disposal is not enough, we're rolling out with diagnostic logging enabled by default on these environments this week, so once we get hit again we should have something more tangible to work with (hopefully).

@nick-nachos
Contributor Author

nick-nachos commented Feb 21, 2022

@patriknw In continuation of my previous findings, I can also confirm that all the rest of the nodes of the cluster (node-0, node-1, node-2) had their associations/materializers to/from node-2 torn down when the node became unreachable. node-3 was the only exception: its association to node-2 was never reported as broken. I think this points towards the hypothesis that node-3's association to node-2 enters some kind of "zombie" state that it never recovers from. It would also explain why the other nodes had no issue sending/receiving messages to/from node-2 after it recovered, and also why node-2 was able to send messages to node-3 (the association of node-2 towards node-3 was torn down when node-2 became unreachable).

@nick-nachos
Contributor Author

nick-nachos commented Mar 1, 2022

Hi @patriknw. We were able to capture the diagnostic cluster logs from an incident a few days ago. This time we were also able to find the cause of the connectivity loss: the Istio proxy sidecar of the affected pod dying with an OOM. The sidecar takes around 30 seconds to recover (this seems consistent every time), during which the pod loses all network connectivity. After connectivity recovers, we again get hit by the Akka clustering issue, due to which one of the other nodes cannot send application messages to the recovered node.

In this case it was node-2 that lost connectivity (again), and once recovered it was node-3 that was unable to send application messages (again). I can confirm from the logs that heartbeats are properly exchanged between the nodes post recovery. We can infer the message loss from the AskTimeoutExceptions we see in our application logs (not attached) for node-3. At the same time the very same messages are successfully sent (and responded to) from nodes 0 and 1 towards node-2 (send rates are similar across nodes, as the message sending is triggered by external HTTP requests that are dispatched in a round-robin fashion by our load balancer)

node-2 was detected as unreachable roughly between 04:52:08 and 04:52:40 (30-ish seconds). The logs start around 2 minutes before that and end around 7.5 minutes after that (for some context). What you won't see in the logs this time is the dead-letters (GossipEnvelope etc) due to buffering: the reason for that is that the message send rate in this environment (as mentioned in my previous post) is so low, that it would have taken a while to fill the buffer up (we dealt with the issue swiftly this time). I can assure you however that we could definitely see all nodes (0, 1 and 3) reporting AskTimeoutExceptions for requests towards node-2 while it was unreachable, whereas node-3 was the only one that kept doing so after the node became reachable again.

I've spent some time going through these logs to see if there's anything standing out, but I couldn't find much (with my layman's eyes at least). Some little things that I picked up (which may well be nothing-burgers):

  • node-3 "complained" a couple of times after the recovery of node-2 that it Couldn't establish a causal relationship between "remote" gossip and "local" gossip. That said, after a couple of occurrences, the log line didn't seem to reappear.
  • there seems to be a small asymmetry in how many connections each node has made towards node-2: node-1 seems to have made 3 connections, whereas node-0 and node-3 have made 2. What I'm referring to here is each of these nodes printing a message <TcpOutgoingConnection> Connection established to [node-2:12345] followed by node-2 printing a message <TcpListener> New connection accepted. node-0 and node-3 both making 2 connections to node-2, but only node-3 having an issue sending messages, would suggest that the connection count is not a factor in and of itself. That said, there is one difference: while node-0 makes both connections pretty close to one another (about 1 second apart, which is also the case for node-1), node-3 creates its second connection 21 seconds after the first one. Again, this could be a big flat nothing, as I don't know how Artery works internally, only at a very high level; for example, do you create separate TCP connections per channel (control, user and user-large), in which case I would expect to see 3 connections per node, or do you do some sort of multiplexing over a single connection?

I hereby attach the logs of all four nodes ordered by date: akka.log.zip. It is formatted as cluster-node-$i\t$payload so that you can easily grep by node if you need to do so.

@nick-nachos
Contributor Author

nick-nachos commented Mar 1, 2022

Actually, that second bullet in my comment above may not have been a nothing-burger after all: I've just remembered seeing logs of the following format:

<Association> Using large message stream for akka://cluster@node-2:12345/user/service/largeMessageReceivers/$6ss

and what I have also noticed is that such logs are sometimes pretty close (time-wise) to the creation of an outbound connection after the reachability issue has been resolved. I would bet that the outbound connection used for the large user messages is created dynamically if/when a large message is sent for the first time.

If that's the case, then this second finding does the trick: node-0 and node-3 may have both created 2 outbound connections to node-2, but both connections made by node-0 seem unrelated to large messages, whereas the second connection made by node-3 seems related:

cluster-node-3	[24-Feb-2022 04:52:59.334] DEBUG <Association> Using large message stream for akka://cluster@node-2:12345/user/service/largeMessageReceivers/$6ss 
cluster-node-3	[24-Feb-2022 04:52:59.339] DEBUG <TcpOutgoingConnection> Connection established to [node-2:12345] 
cluster-node-3	[24-Feb-2022 04:52:59.339] DEBUG <TcpOutgoingConnection> Attempting connection to [node-2/172.19.181.105:12345] 
cluster-node-3	[24-Feb-2022 04:52:59.339] DEBUG <TcpOutgoingConnection> Resolving node-2 before connecting 
cluster-node-2	[24-Feb-2022 04:52:59.341] DEBUG <TcpListener> New connection accepted 

This would explain the 20-second difference between node-3 creating the first and second connection: the first connection was for the control messages, and the second one for the user-large messages, in response to a message sent by an actor on node-2 that required a large response to be sent back (by node-3). So effectively node-3 never established a connection for the plain user-space messages, only the ones for control and user-large. node-0 on the other hand created the ones for control and user (apparently it didn't need to send any large response to node-2), and node-1 created all three types of connections.

Does any of that make sense, or am I way off?

@patriknw
Member

patriknw commented Mar 9, 2022

I was on vacation last week, but I will take a look at the logs now. Thanks for sharing and analyzing the problem.

Yes, connections are established lazily when a message is sent over that "stream". There can be up to 3 outgoing connections to each destination node: the control stream for Akka internals such as heartbeat messages, the ordinary message stream, and the large message stream.

@patriknw
Member

patriknw commented Mar 9, 2022

> Couldn't establish a causal relationship between "remote" gossip and "local" gossip.

That's normal and nothing to worry about. The membership state is like a CRDT and such "conflicts" are resolved automatically and deterministically.

@nick-nachos
Contributor Author

Hey @patriknw, welcome back! I kind of figured that the remote/local gossip thing was probably a red herring, thanks for confirming. As discussed in my previous comment, I think I can safely say, based on the logs, that the user channel (ordinary message stream) is most probably not created from node-3 to node-2, hence the inability to send any user messages. What is not clear to me, though, is why that happens, but hopefully there might be something in the logs that will help you figure it out.

@patriknw
Member

patriknw commented Mar 9, 2022

Search "Received gossip status from [akka://cluster@node-3" confirms that the ordinary message stream is not working between node-3 and node-2. That message is from the ddata Replicator and sent over ordinary message stream. node-2 is not receiving it from node-3 (but from others). node-3 is receiving it from all (including node-2).

That is also the reason why the "Receiving gossip from [UniqueAddress(akka://cluster@node-2" logging continues in node-3. Normally that should sync and then no longer be sent when there are no changes. This is the Cluster membership gossip, so not the same as the Replicator, but it's also sent over the ordinary message stream.

During the error period (until 04:52:37) we can see the errors from TcpOutgoingConnection and ArteryTcpTransport when trying to establish the connection. "Connection reset by peer" in node-3 and "UnknownHostException" in node-2.

Searching for "Outbound connection opened to" and subsequent log messages we can see that from node-2 it establishes 2 new connections to all nodes, and additional connections for the large. As expected.

From node-3 "Outbound connection opened to" it is only establishing two connections to node-2, where one is for the large. Missing the connection for the ordinary message stream. We need to figure out why it's not establishing that connection from node-2 to node-3 ...

The failed attempts before that:

cluster-node-3	[24-Feb-2022 04:52:29.791] DEBUG <ArteryTcpTransport> Outbound connection opened to [akka://cluster@node-2:12345] 
cluster-node-3	[24-Feb-2022 04:52:29.796] DEBUG <TcpOutgoingConnection> Connection established to [node-2:12345] 
cluster-node-3	[24-Feb-2022 04:52:29.796] DEBUG <TcpOutgoingConnection> Attempting connection to [node-2/172.19.181.105:12345] 
cluster-node-3	[24-Feb-2022 04:52:29.796] DEBUG <TcpOutgoingConnection> Resolving node-2 before connecting 
cluster-node-3	[24-Feb-2022 04:52:29.797] DEBUG <TcpOutgoingConnection> Closing connection due to IO error java.io.IOException: Connection reset by peer 
cluster-node-3	[24-Feb-2022 04:52:29.798] DEBUG <Materializer> [outbound connection to [akka://cluster@node-2:12345], message stream] Upstream failed, cause: StreamTcpException: The connection closed with error: Connection reset by peer 
...
cluster-node-3	[24-Feb-2022 04:52:34.091] DEBUG <ArteryTcpTransport> Outbound connection opened to [akka://cluster@node-2:12345] 
cluster-node-3	[24-Feb-2022 04:52:34.097] DEBUG <TcpOutgoingConnection> Resolving node-2 before connecting 
cluster-node-3	[24-Feb-2022 04:52:34.097] DEBUG <TcpOutgoingConnection> Attempting connection to [node-2/172.19.181.105:12345] 
cluster-node-3	[24-Feb-2022 04:52:34.097] DEBUG <TcpOutgoingConnection> Connection established to [node-2:12345] 
cluster-node-3	[24-Feb-2022 04:52:34.100] DEBUG <Materializer> [outbound connection to [akka://cluster@node-2:12345], message stream] Upstream finished. 
cluster-node-3	[24-Feb-2022 04:52:34.100] DEBUG <TcpOutgoingConnection> Closing connection due to IO error java.io.IOException: Broken pipe 

Maybe it's something with that "Broken pipe" and "Upstream finished". I'll dig further...

@nick-nachos
Contributor Author

nick-nachos commented Mar 9, 2022

Yes @patriknw, that's exactly how I remember it. It's interesting because I think some of the other nodes had similar errors while trying to establish communication with node-2 (before it became reachable again, of course), like that "Upstream finished" thing (I think I've seen it in one of the other nodes, not 100% sure though). If this is an indicator of a potential issue, I wonder whether there could be some race condition around it (so that it doesn't always cause a problem).

@patriknw patriknw changed the title from "Cluster leader unable to perform duties post unreachability resolution" to "Connection not re-established post unreachability resolution" on Mar 10, 2022
@patriknw patriknw added the "t:remoting:artery" and "3 - in progress" (Someone is working on this ticket) labels and removed the "t:cluster" label on Mar 10, 2022
@patriknw patriknw self-assigned this Mar 10, 2022
@patriknw
Member

I think the problem is the unexpected "Upstream finished" in some situations. I'm working on how to fix that.

patriknw added a commit that referenced this issue Mar 10, 2022
* In the logs attached to the issue we can see that an outbound connection is not
  re-established after "Upstream finished" (broken pipe).
* Normally that is handled by the inner RestartFlow around the connection flow, but
  if that has reached its maxRestarts (3) it will complete the entire stream and
  attachOutboundStreamRestart would not handle that as a restart case.
@nick-nachos
Contributor Author

Great stuff, thank you for looking into this!
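To make sure I understand the mechanism described in the commit message above, here is a hypothetical Akka Streams sketch (not the Artery code): once a RestartFlow.withBackoff has used up its maxRestarts budget, the next termination of the wrapped flow is propagated, so the surrounding stream simply finishes, and an outer wrapper that only restarts on failures won't treat that as a restart case.

```scala
import akka.actor.ActorSystem
import akka.stream.RestartSettings
import akka.stream.scaladsl.{Flow, RestartFlow, Sink, Source}
import scala.concurrent.duration._

object RestartBudgetSketch extends App {
  implicit val system: ActorSystem = ActorSystem("sketch")
  import system.dispatcher

  // At most 3 restarts, mirroring the maxRestarts mentioned in the commit message.
  val settings = RestartSettings(100.millis, 1.second, 0.2).withMaxRestarts(3, 10.seconds)

  // A stand-in for a connection flow that keeps terminating early.
  val flaky = RestartFlow.withBackoff(settings)(() => Flow[Int].take(1))

  Source(1 to 1000)
    .via(flaky)
    .runWith(Sink.foreach(n => println(s"got $n")))
    .onComplete { result =>
      // Once the restart budget is exhausted, the whole stream completes normally.
      println(s"stream finished with: $result")
      system.terminate()
    }
}
```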

patriknw added a commit that referenced this issue Mar 11, 2022
… (#31232)

* In the logs attached to the issue we can see that an outbound connection is not
  re-established after "Upstream finished" (broken pipe).
* Normally that is handled by the inner RestartFlow around the connection flow, but
  if that has reached its maxRestarts (3) it will complete the entire stream and
  attachOutboundStreamRestart would not handle that as a restart case.
@patriknw patriknw removed the "3 - in progress" (Someone is working on this ticket) label on Mar 11, 2022
@patriknw patriknw added this to the 2.6.19 milestone Mar 11, 2022
@patriknw
Member

@nick-nachos We will probably release Akka 2.6.19 later this week but if you want to try the fix earlier you can use snapshot version 2.6.18+134-ae05ccaf-SNAPSHOT. See https://doc.akka.io/docs/akka/current/project/links.html#snapshots-repository

@nick-nachos
Contributor Author

Thank you @patriknw, I think we'll probably wait for the official release 😄
