Port exhaustion problem with Akka.Cluster #2575
Comments
@Blind-Striker is this an issue with Lighthouse specifically, or with other Akka.NET services as well? edit: Nevermind, you said it's the sender service. This sounds like an issue with unclean shutdowns occurring. I'd strongly recommend using the latest nightly builds (http://getakka.net/docs/akka-developers/nightly-builds) until Akka.NET 1.2 is released.
On top of that, the new DotNetty transport is going to have improved handling of this type of stuff over Helios.
Removed my previous comment. That was an error; it's not cluster's job to do that. We're going to get to the bottom of this, but in the meantime use the nightlies and let us know on this issue if that improves things. 1.2 is due out imminently; it's blocked by a third-party NuGet release that is due out any day now.
Thanks @Aaronontheweb, we will make the necessary updates and notify you of the results. In the meantime, should we open a question on Stack Overflow for other people experiencing this problem?
@Blind-Striker yep, that would be great.
Working theory on the issue is that we aren't shutting down the associations a reachable node makes when it attempts to associate with an unreachable node (which happens repeatedly on a timer). This is an Akka.Remote issue ultimately. Going to investigate.
Stack Overflow link: http://stackoverflow.com/questions/43128080/port-exhaustion-issue-with-akka-cluster
I think I'm having the same issue. I have upgraded to v1.2. About three hours after starting a service, it tries to connect to unknown ports (not configured in the HOCON). Here is an excerpt of my logs:

```
2017-04-18 03:46:53,582 [1] INFO Mayadeen.Archive.VideoExporter.ConfigureService - Setting up the service
2017-04-18 06:50:33,183 [33] WARN Akka.Event.DummyClassForStringSources - Association to [akka.tcp://ArchiveSystem@140.125.4.1:16668] having UID [1471622119] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
2017-04-18 06:50:40,293 [10] WARN Akka.Remote.EndpointWriter - AssociationError [akka.tcp://ArchiveSystem@140.125.4.2:16667] -> akka.tcp://ArchiveSystem@140.125.4.1:41762: Error [No connection could be made because the target machine actively refused it tcp://ArchiveSystem@140.125.4.1:41762] []
2017-04-18 06:50:40,293 [26] WARN Akka.Event.DummyClassForStringSources - Tried to associate with unreachable remote address [akka.tcp://ArchiveSystem@140.125.4.1:41762]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [No connection could be made because the target machine actively refused it tcp://ArchiveSystem@140.125.4.1:41762]
...
```
@jalchr unknown ports? Are you using port 0 somewhere?
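For context on the port question: in Akka.NET's remote configuration, a fixed port binds the transport to a well-known endpoint, while `port = 0` asks the OS for a random free port on every start, which peers then see as "unknown" ports. A minimal illustrative sketch (the section name matches the DotNetty transport used from v1.2 on; the hostname and port values here are hypothetical, not from this issue):

```hocon
akka.remote.dot-netty.tcp {
  hostname = "140.125.4.2"  # hypothetical bind address
  port = 16667              # fixed, well-known port
  # port = 0                # OS assigns a random free port on each start;
  #                         # peers then see connections from "unknown" ports
}
```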
@Blind-Striker I downgraded this issue. I also followed your video closely, but even there it doesn't look like the ports are staying open. PM me on Gitter and let's take a look at this over Skype or email. I need more data on how to reproduce it before I can investigate further.
Looking at this more closely, I think we may have an issue with DotNetty. Related: Azure/DotNetty#238. Working on some specs now to verify that it's not an Akka.Remote issue.
* added dotnetty shutdown specs
* added specs to verify that DotNetty transport instances are shutdown correctly
* #2575 - added DotNetty shutdown specs
* added copyright headers
Since we reduced the number of IIS applications on our Web Server, we have never encountered this problem again. But in the tests we do locally, we get the same results when we apply the method from the video I sent above. I will write a sample application and send it to you to reproduce this case if I have time. But I must say that we are still using Akka 1.1.3. We will update Akka to version 1.3.1 this week or next, and I'll let you know the results. Maybe this problem is no longer valid; I don't want to take up your time on this case.
We've had an installation report a similar issue last week. It's the first and only report like it, but it's also a VM in a cloud environment, with a lot more unrelated network activity going on. We're also using Akka 1.1.3. It's relevant to note that this version of Akka still uses Helios, not DotNetty, so the above referenced issue isn't directly related. I don't think Helios is a supported scenario anymore, so I'm not really looking for an investigation or anything, but it is worthwhile to report additional information. This installation isn't using clustering at all. It's a collection of five services that communicate directly with each other remotely via Akka.NET. The service that was causing problems is just an inbound connector. On startup, the actor system is created, and 3 instances of ... Port was set to 0 in the HOCON config (under ...). Some of the connector code does use the ...
BTW, some updates on this. I've had reports from users on versions of Akka.Cluster as recent as 1.3.7 that port exhaustion was still an ongoing issue, including one this week. The Akka.NET v1.3.8 release includes an upgrade to DotNetty v0.4.8, which has fixes for cleaning up sockets upon channel shutdown. Reports I've had from end-users affected by this issue indicate that the Akka.NET v1.3.8 / DotNetty v0.4.8 upgrades effectively address it. I'm working on a methodology to prove this definitively, but I'd strongly recommend that you upgrade anyway, since we appear to have eliminated at least some potential causes of this in the most recent release.
Hi, I'm running Akka.Cluster 1.3.8, and I just experienced port exhaustion. Followed by: Unfortunately I did not do a netstat, but I saw in the event log that other services on the computer could not create outgoing TCP connections either. When I shut down the Akka services, the problem resolved. So it still seems to be an ongoing issue.
Thanks @Ulimo - I shelved an idea for being able to definitively test whether or not this is an issue because I thought our most recent DotNetty upgrade had resolved it, but given your bug report it appears that this is not the case. We'll begin looking back into it.
@Aaronontheweb FYI I added a script to both our test and production environments to monitor open TCP ports with timestamps, so I will be able to link them to the logs. If this happens again, is there any other information I should collect that could simplify the process of finding the leak?
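The kind of monitoring described above boils down to parsing netstat-style output and counting established connections per remote endpoint, so a leaking peer stands out. A minimal hedged sketch of that parsing step; the function name and sample data are hypothetical, not from the actual script:

```python
# Count established TCP connections per remote endpoint from
# `netstat -an`-style output. A remote endpoint with thousands of
# connections is a candidate leak source.
from collections import Counter

def count_remote_endpoints(netstat_output):
    """Return a Counter mapping remote ip:port -> ESTABLISHED connection count."""
    counts = Counter()
    for line in netstat_output.splitlines():
        parts = line.split()
        # Expected columns: protocol, local address, remote address, state
        if len(parts) >= 4 and parts[0] == "TCP" and parts[3] == "ESTABLISHED":
            counts[parts[2]] += 1
    return counts

# Illustrative sample resembling the netstat results described in this issue
sample = """\
TCP 192.168.1.10:50001 192.168.1.101:17527 ESTABLISHED
TCP 192.168.1.10:50002 192.168.1.101:17527 ESTABLISHED
TCP 192.168.1.10:50003 192.168.1.20:1433 ESTABLISHED
"""
print(count_remote_endpoints(sample))
```

Run on a timer and logged with a timestamp, the per-endpoint counts can then be correlated with the Akka logs, as described above.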
Update on this, since @izavala and I have been working on it extensively for a few weeks. We were able to create a reliable reproduction case for it here: https://github.com/petabridge/SocketLeakDetection along with a set of actors that can automatically kill the affected ...

More importantly, however, we have a fix for the issue here: #3764. I'm working on cleaning this up and getting it ready for merge into 1.3.13, which we hope to release immediately.

This was a very difficult bug to track down and reproduce, but thanks to the efforts and very detailed logs provided by affected end-users we were finally able to identify the underlying cause: under some circumstances we never properly disassociated outbound associations that failed during the Akka.Remote handshake process. You can see my comment on the fix (ported from the JVM) here: #3764 (comment)

This issue happened most noticeably when a node inside Akka.Cluster was quarantined by at least one other node, which caused a perpetual loop of Akka.Cluster retrying the failed connection from the quarantined node back to the quarantiner, only to have it rejected each time due to the Akka.Remote quarantine in effect. Since we never properly closed the outbound association, the inbound socket may not have been properly closed either, which is why the port count could increase simultaneously on both nodes.

Please leave a comment here if you have any additional questions or insights about this. It was one of the nastier bugs we've ever had to track down.
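The detection idea behind a tool like SocketLeakDetection can be sketched simply: compare a short-term average of the open-connection count against a long-term one, and flag when the short-term average keeps pulling away while the absolute count is already high. This is a hedged illustration of the concept only; the function, window sizes, and thresholds below are made up and do not reflect the actual SocketLeakDetection implementation:

```python
# Flag a suspected socket leak when the recent average connection count
# grows well beyond the longer-term average and an absolute floor.
def is_leaking(samples, short=5, long=20, ratio=1.5, floor=100):
    """samples: chronological open-connection counts, newest last."""
    if len(samples) < long:
        return False  # not enough history to judge a trend
    short_avg = sum(samples[-short:]) / short
    long_avg = sum(samples[-long:]) / long
    return samples[-1] >= floor and short_avg > long_avg * ratio

steady = [50] * 30                                   # flat connection count
leaking = [50] * 20 + [100, 150, 200, 260, 330,
                       400, 480, 560, 650, 750]      # runaway growth
print(is_leaking(steady), is_leaking(leaking))       # prints: False True
```

In the real tool this kind of signal is used to kill and restart the affected process before port exhaustion takes down unrelated services on the host.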
closed via #3764 |
Hi,
We created an Akka.Cluster infrastructure for SMS, email, and push notifications. Three different kinds of nodes exist in the system: client, sender, and lighthouse. The client role is used by our Web application and API application (both hosted in IIS). The lighthouse and sender roles are hosted as Windows services. We are also running 4 more console instances of the same Windows service in the sender role.
We've been experiencing port exhaustion problems on our Web Server for about 2 weeks. Our Web Server starts consuming ports quickly, and after a while we cannot perform any SQL operations.
Sometimes we have no choice but to do an IIS reset. This problem occurs when there is more than one node in the sender role. We diagnosed it and found the source of the problem.
SRV_NOTIFICATION is the server on which the lighthouse and sender nodes run. SRV_INBOUND is our Web Server. After checking this table, we checked which ports on the Web Server were assigned.
We got results like the table below. In netstat there were more than 12,000 connections like this:
192.168.1.10 Web Server
192.168.1.10:3564 API
192.168.1.101:17527 Lighthouse
The connections are opening but not closing.
After deployments, our Web and API applications leave and rejoin the cluster, and they are configured with fixed ports. We're monitoring our cluster with the application created by @cgstevens. Even though we implemented graceful shutdown logic for the actor system, sometimes the WEB and API applications can't leave the cluster, so we have to remove the nodes manually and restart the actor system.
We have reproduced the problem in our development environment and recorded a video of it below:
https://drive.google.com/file/d/0B5ZNfLACId3jMWUyOWliMUhNWTQ/view
Our HOCON configurations for the nodes are below:
WEB and API
Lighthouse
Sender
Cluster.Monitor
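The actual configuration blocks did not survive in this transcript. For readers, this is a hedged sketch of the general shape such a node config takes; the system name, addresses, ports, and role values are illustrative, not the reporter's real settings (the `helios` transport section matches the Akka.NET 1.1.x line in use here):

```hocon
akka {
  actor.provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
  remote.helios.tcp {
    hostname = "192.168.1.10"   # this node's address (illustrative)
    port = 3564                 # fixed port, e.g. for the API node
  }
  cluster {
    # Lighthouse node acts as the seed (address illustrative)
    seed-nodes = ["akka.tcp://NotificationSystem@192.168.1.101:17527"]
    roles = ["client"]          # "sender" / "lighthouse" on the other nodes
  }
}
```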