Connection not re-established post unreachability resolution #31081
@patriknw you might be interested in this.
@nick-nachos Thanks for reporting. There is some additional logging that can be enabled to debug this:
Hi @patriknw, thank you very much for your reply. First things first: I forgot to mention in my original post that we're using Akka Cluster Classic, not Typed. Here is some extra info I dug up with regard to your questions:
What was interesting to me, and one of the reasons I raised the issue here, is that although node-0 and node-1 could not establish app-level communications with node-3 (and we can also see the dropped GossipStatus/GossipEnvelope messages from them towards it), there didn't seem to be any unreachable nodes detected at any point beyond the original detection. Based on the info above, I would have expected node-3 to detect nodes 0 and 1 as unreachable if they couldn't send any messages to it, unless heartbeats somehow made it to their destination when all other messages didn't (unlikely). For now we will be rolling back to Akka 2.6.14. I will also try to select some lower-volume environment out of those that were affected to enable the logging options you mentioned, and also switch the Akka logs to debug level, so that we have more info in case this hits us again. In the meantime, if you've got any thoughts on this feel free to share.
The point-to-point failure detection should be seen in the logs as "Marking node as UNREACHABLE" and "Marking node as REACHABLE". Exact logging of the heartbeat roundtrips can be seen with
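For anyone following along, the detailed heartbeat/gossip logging can be switched on with settings along these lines; the exact keys below are my assumption based on Akka's reference configuration, so double-check them against the reference.conf of your Akka version:

```hocon
akka {
  loglevel = "DEBUG"
  cluster.debug {
    # Assumed setting: log every failure-detector heartbeat roundtrip
    verbose-heartbeat-logging = on
    # Assumed setting: log cluster membership gossip in detail
    verbose-gossip-logging = on
  }
}
```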
A small update on this @patriknw, rolling back to Akka 2.6.14 seems to have done the trick for us. Our clusters seem to properly recover after the resolution of the same type of network connectivity issues: once the unreachable node becomes reachable again, the leader reports that it can properly resume its duties.
Thanks for the update. I can't think of anything we changed that could explain that (famous last words 😄). We'll have to look into the change logs. However, I think we have too few hints about what the problem could be. If you have a way to reproduce it and could enable more logs and share the raw logs with us, that could be very useful for tracking this down.
Actually @patriknw, the famous last words might have been mine (me and my big mouth), as I've just happened to troubleshoot an incident similar to the original, in a cluster of 4 nodes (node-0, node-1, node-2, node-3). This time I was able to switch the log level of the affected node (node-3) to debug at runtime, although that was done reactively after the fact, which means that we don't have any debug-level logs from the moment the issue started. That said, I'm still going to post the logs here in case you pick up anything useful. We're also going to enable this diagnostic logging by default on these problematic environments, so once this happens again we'll have even more output available. The log file attached contains around 6 minutes of Akka logs from node-3, i.e. the node that was unable to send app-level messages to node-2. Some key takeaways:
Some extra info in case it's needed:
No worries if the input is not enough this time either; as I mentioned earlier, we will now have Akka logs in debug by default, so next time this happens we'll get as much of the full picture as possible.
Thanks for sharing those logs. I agree with your conclusions.
That could also happen if you send messages in a burst without flow control. The buffer size is only 3072. If you send faster than it is able to serialize and push over the wire, it will overflow. If it's just temporary bursts you could get around it by increasing the buffer size. Config
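If I'm reading Akka's reference.conf correctly, the buffer in question should be the Artery outbound message queue; increasing it would look roughly like this (the key name is my assumption, please verify against the reference configuration of your Akka version):

```hocon
akka.remote.artery.advanced {
  # Default is 3072; raise it if short bursts overflow the outbound queue
  outbound-message-queue-size = 30720
}
```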
That might be because when the nodes are in sync they only exchange a smaller GossipStatus message. The messages from the Replicator are also a good indication that the communication works.
Thanks for your response @patriknw. The message rate in these environments is very low in general, but what you've just said made something "click": I went back to our logs, at the exact moment the incident started, and these are the exact logs from node-3 (the node that is later unable to send):
At that point the diagnostic logging was not enabled yet, so all we have is info-level logs and above, but there's something really interesting here: node-2 was detected unreachable for about 32 seconds, during which time there were no association or materializer failures reported. This suggests that the association (and so the underlying stream(s), I suppose) remained alive during that time. Could this mean that the TCP stream entered some back-pressure mode which causes the buffering?

I suspect that's the case, and if so, there's a second finding in the logs above that is very useful: you can see that after reachability has been restored, node-3 reports its first dead letters due to buffer overflow 77 minutes later (no logs at all in between). To me, this is indicative of the very low traffic volume of these particular environments, and makes me suspect that the most plausible explanation here is that the TCP stream has been stuck in back-pressure (or something similar) and is thus unable to send any of the buffered messages, even though the reachability issue has been resolved.

The alternative scenario, in which messages are being sent by Artery but there is also message loss (overflow) taking place due to the intermediate buffering, would have made more sense for our higher-volume environments (where we've never had any issues, though). If that were the case here in these lower-volume ones, what I would have expected to see is some intermediate overflowing, but at some point the buffer would have cleared up (it shouldn't take too long to send 3k messages) and the message flow would have gone back to normal. Instead, what we get is (seemingly) persistent message loss, until we eventually delete the affected pod. What do you think of this hypothesis?
In any case, as I said in my previous post, if the data we currently have at our disposal is not enough, we're rolling out with diagnostic logging enabled by default on these environments this week, so once we get hit again we should have something more tangible to work with (hopefully).
@patriknw
Hi @patriknw. We were able to capture the diagnostic cluster logs from an incident a few days ago. This time we were also able to find the cause of the connectivity loss: the Istio proxy sidecar of the affected pod died with an OOM. The sidecar took around 30 seconds to recover (this seems consistent every time), during which time the pod loses all network connectivity. After connectivity recovers, we also get hit by the Akka clustering issue, due to which one of the other nodes cannot send application messages to the recovered node. In this case it was node-2 that lost connectivity (again), and once it recovered it was node-3 that was unable to send application messages to it (again).

I can confirm from the logs that heartbeats are properly exchanged between the nodes post-recovery. We can infer the message loss from the AskTimeoutExceptions we see in our application logs (not attached) for node-3. At the same time, the very same messages are successfully sent (and responded to) from nodes 0 and 1 towards node-2 (send rates are similar across nodes, as the message sending is triggered by external HTTP requests that are dispatched in a round-robin fashion by our load balancer).

node-2 was detected as unreachable roughly between 04:52:08 and 04:52:40 (30-ish seconds). The logs start around 2 minutes before that and end around 7.5 minutes after that (for some context). What you won't see in the logs this time is the dead letters (GossipEnvelope etc.) due to buffering: the reason is that the message send rate in this environment (as mentioned in my previous post) is so low that it would have taken a while to fill the buffer up (we dealt with the issue swiftly this time). I can assure you, however, that we could definitely see all nodes (0, 1 and 3) reporting AskTimeoutExceptions for requests towards node-2 while it was unreachable, whereas node-3 was the only one that kept doing so after the node became reachable again.
I've spent some time going through these logs to see if there's anything standing out, but I couldn't find much (with my layman's eyes at least). Some little things that I picked up (which may well be nothing-burgers):
I hereby attach the logs of all four nodes ordered by date: akka.log.zip. It is formatted as
Actually, that second bullet in my comment above may not have been a nothing-burger at all: I've just remembered seeing logs of the following format:
and what I have also noticed sometimes is such logs being pretty close (time-wise) to the creation of an outbound connection after the reachability issue has resolved. I would make a bet that the outbound connection used for the large user messages is created dynamically if/when a large message is ever sent for the first time. If that's the case, then this second finding does the trick: node-0 and node-3 may have both created 2 outbound connections to node-2, but both connections made by node-0 seem unrelated to large messages, whereas the second connection made by node-3 seems related:
This would explain the 20-second difference between node-3 creating the first and second connection: the first connection was for the control messages, and the second one for the user-large messages, in response to a message sent by an actor on node-2 that required a large response to be sent back (by node-3). So effectively node-3 never established a connection for the plain user-space messages, only the ones for control and user-large. node-0 on the other hand created the ones for control and user (it didn't need to send any large response to node-2, apparently), and node-1 created all three types of connections. Does any of that make sense, or am I way off?
I had vacation last week, but will take a look at the logs now. Thanks for sharing and analyzing the problem. Yes, connections are established lazily when a message is sent over that "stream". There can be up to 3 outgoing connections to each destination node: the control stream for Akka internals, such as heartbeat messages; the ordinary message stream; and the large message stream.
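For context, the large message stream is only used for recipients explicitly configured for it. A sketch of that configuration, where the actor path /user/large-responder is a hypothetical example of mine, not something from this issue:

```hocon
akka.remote.artery {
  # Messages to these recipient paths travel over the dedicated
  # large-message stream instead of the ordinary message stream
  large-message-destinations = [
    "/user/large-responder"
  ]
}
```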
That's normal and nothing to worry about. The membership state is like a CRDT and such "conflicts" are resolved automatically and deterministically. |
Hey @patriknw, welcome back! I kind of figured that the remote/local gossip thing was probably a red herring, thanks for confirming. As discussed in my previous comment anyway, I think I can safely say, based on the logs, that it's most probably the case that the user channel (ordinary message stream) is never created from node-3 to node-2, hence the inability to send any user messages. What is not clear to me, though, is why that happens, but hopefully there might be something there that will help you figure it out.
Searching for "Received gossip status from [akka://cluster@node-3" confirms that the ordinary message stream is not working between node-3 and node-2. That message is from the ddata Replicator and is sent over the ordinary message stream. node-2 is not receiving it from node-3 (but from the others). node-3 is receiving it from all (including node-2).

That is also the reason why the "Receiving gossip from [UniqueAddress(akka://cluster@node-2" continues in node-3. Normally that should sync and then not be sent any more when there are no changes. This is the Cluster membership gossip, so not the same as the Replicator, but it's also sent over the ordinary message stream.

During the error period (until 04:52:37) we can see the errors from TcpOutgoingConnection and ArteryTcpTransport when trying to establish the connection: "Connection reset by peer" in node-3 and "UnknownHostException" in node-2.

Searching for "Outbound connection opened to" and subsequent log messages, we can see that node-2 establishes 2 new connections to all nodes, plus additional connections for the large message stream. As expected. From node-3, "Outbound connection opened to" shows only two connections being established to node-2, one of which is for the large stream. The connection for the ordinary message stream is missing. We need to figure out why it's not establishing that connection from node-3 to node-2 ... The failed attempts before that:
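The log searches described above are easy to script; a minimal sketch, where the sample lines written to the temp file are fabricated stand-ins (not taken from the attached logs) just to show the grep patterns:

```shell
# Fabricated sample lines standing in for the real akka.log contents
cat > /tmp/akka-sample.log <<'EOF'
04:51:02 DEBUG Received gossip status from [akka://cluster@node-3:25520]
04:52:41 DEBUG Outbound connection opened to [akka://cluster@node-2:25520]
04:52:41 DEBUG Outbound connection opened to [akka://cluster@node-2:25520] (large)
EOF

# How often does a node see Replicator gossip status from node-3?
grep -c 'Received gossip status from \[akka://cluster@node-3' /tmp/akka-sample.log

# How many outbound connections were opened to node-2 after recovery?
grep -c 'Outbound connection opened to \[akka://cluster@node-2' /tmp/akka-sample.log
```

Running the same two greps against each node's real log makes the missing ordinary-stream connection from node-3 stand out quickly.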
Maybe it's something with that "Broken pipe" and "Upstream finished". I'll dig further... |
Yes @patriknw, that's exactly how I remember this to be. It's interesting because I think some of the other nodes had similar errors while trying to establish communication with node-2 (before it became reachable, of course), like that "Upstream finished" thing (I think I've seen it in one of the other nodes, not 100% sure though). If this is an indicator of a potential issue, I wonder whether there could be some race condition around it (so that it doesn't always cause a problem).
I think the problem is the unexpected "Upstream finished" in some situations. I'm working on how to fix that. |
* In the logs attached to the issue we can see that an outbound connection is not re-established after "Upstream finished" (broken pipe).
* Normally that is handled by the inner RestartFlow around the connection flow, but if that has reached its maxRestarts (3) it will complete the entire stream, and attachOutboundStreamRestart would not handle that as a restart case.
Great stuff, thank you for looking into this! |
… (#31232)
* In the logs attached to the issue we can see that an outbound connection is not re-established after "Upstream finished" (broken pipe).
* Normally that is handled by the inner RestartFlow around the connection flow, but if that has reached its maxRestarts (3) it will complete the entire stream, and attachOutboundStreamRestart would not handle that as a restart case.
@nick-nachos We will probably release Akka 2.6.19 later this week, but if you want to try the fix earlier you can use snapshot version
Thank you @patriknw, I think we'll probably wait for the official release 😄
We've got an unusual clustering issue that has hit us a couple of times this week. Incidentally (or not?) this started happening a week after we upgraded from Akka 2.6.14 to 2.6.18. We had been on 2.6.14 for about 6 months, and prior to that on 2.5.x for a year or so, and we have never seen anything similar. The incident has happened 3-4 times so far in total over the course of a week, in multiple environments, which however are all similar: K8s over AWS using the Istio service mesh. Our service has a deployment with 4 replicas, all of which join an Akka Cluster (per environment) using Artery-TCP.

In all cases, the sequence of events is as follows:

The message above keeps repeating periodically forever, until the cluster is torn down. What's very interesting here is that the reachability status list is empty and the seen field of the node-3 member status is false. Keep in mind that the leader had previously marked the node as reachable.

Based on application behavior we can also deduce that application-level messages are probably not delivered to the problematic node either. Neither a rolling restart nor targeted node restarts have worked to resolve this; we pretty much had to shut the cluster down completely and start up again.

Unfortunately I haven't been able to reproduce this locally, but given the peculiarity of the cluster leader's behavior I was hoping you'd be able to deduce whether there's some particular corner case that could trigger it. Unfortunately we haven't rolled back to 2.6.14 yet, so I can't say whether that resolves the issue or not, but I will make sure to post here once we do. In the meantime, do you have any thoughts on this?