Port exhaustion problem with Akka.Cluster #2575

Closed
Blind-Striker opened this issue Mar 30, 2017 · 19 comments

@Blind-Striker
Hi,

We created an Akka.Cluster infrastructure for SMS, email, and push notifications. Three different kinds of nodes exist in the system: client, sender, and lighthouse. The client role is used by the Web application and the API application (both hosted in IIS). The lighthouse and sender roles are hosted as Windows services. We are also running 4 more console instances of the same Windows service in the sender role.

We've been experiencing port exhaustion problems on our Web Server for about 2 weeks. The Web Server starts consuming ports quickly, and after a while we cannot perform any SQL operations. Sometimes we have no choice but to do an IIS reset. This problem occurs whenever there is more than one node in the sender role. We diagnosed it and found the source of the problem.

---------------
HOST                  OPEN    WAIT
SRV_NOTIFICATION      3429    0
SRV_LOCAL             198     0
SRV_UNDEFINED_IPV4    23      0
SRV_DATABASE          15      0
SRV_AUTH              4       0
SRV_API               6       0
SRV_UNDEFINED_IPV6    19      0
SRV_INBOUND           12347   5

TotalPortsInUse   : 17286
MaxUserPorts      : 64510
TcpTimedWaitDelay : 30
03/23/2017 09:30:10
---------------

SRV_NOTIFICATION is the server where the lighthouse and sender nodes run. SRV_INBOUND is our Web Server. After checking this table, we checked which ports on the Web Server were assigned and got results like the output below. In netstat there were more than 12000 connections like this:

TCP    192.168.1.10:65531     192.168.1.10:3564      ESTABLISHED     5716   [w3wp.exe]
TCP    192.168.1.10:65532     192.168.1.101:17527    ESTABLISHED     5716   [w3wp.exe]
TCP    192.168.1.10:65533     192.168.1.101:17527    ESTABLISHED     5716   [w3wp.exe]
TCP    192.168.1.10:65534     192.168.1.10:3564      ESTABLISHED     5716   [w3wp.exe]

192.168.1.10 Web Server
192.168.1.10:3564 API
192.168.1.101:17527 Lighthouse

The connections are opening but not closing.

After deployments, our Web and API applications leave and rejoin the cluster, and they are configured to use fixed ports. We're monitoring the cluster with the application created by @cgstevens. Even though we implemented graceful shutdown logic for the actor system, sometimes the Web and API applications can't leave the cluster, so we have to remove the nodes manually and restart the actor system. (A sketch of the leave logic we use is shown below.)
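
For reference, the graceful-leave logic we run on shutdown is roughly the sketch below. It is simplified rather than our exact code; Cluster.Leave and RegisterOnMemberRemoved are the standard Akka.Cluster calls, and the timeout handling is illustrative.

using System;
using System.Threading;
using Akka.Actor;
using Akka.Cluster;

public static class ClusterShutdown
{
    // Leave the cluster gracefully, then terminate the ActorSystem.
    public static void LeaveAndTerminate(ActorSystem system, TimeSpan timeout)
    {
        var cluster = Cluster.Get(system);
        var removed = new ManualResetEventSlim();

        // Fires once this node has been moved to Removed by the leader.
        cluster.RegisterOnMemberRemoved(() => removed.Set());

        // Ask the cluster to move this node through Leaving -> Exiting -> Removed.
        cluster.Leave(cluster.SelfAddress);

        // Wait (bounded) for removal; if it never happens we terminate anyway,
        // which is exactly the unclean-shutdown case we sometimes hit.
        removed.Wait(timeout);

        system.Terminate().Wait(timeout);
    }
}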

We reproduced the problem in our development environment and recorded a video:

https://drive.google.com/file/d/0B5ZNfLACId3jMWUyOWliMUhNWTQ/view

Our HOCON configurations for the nodes are below:

WEB and API

<akka>
	<hocon><![CDATA[
			akka{
				loglevel = DEBUG
				
				actor{
					provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
					
					deployment {
						/coordinatorRouter {
							router = round-robin-group
							routees.paths = ["/user/NotificationCoordinator"]
							cluster {
									enabled = on
									max-nr-of-instances-per-node = 1
									allow-local-routees = off
									use-role = sender
							}
						}
						
						/decidingRouter {
							router = round-robin-group
							routees.paths = ["/user/NotificationDeciding"]
							cluster {
									enabled = on
									max-nr-of-instances-per-node = 1
									allow-local-routees = off
									use-role = sender
							}
						}
					}
					
					serializers {
							wire = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
					}
					
					serialization-bindings {
					 "System.Object" = wire
					}
	
					debug{
						receive = on
						autoreceive = on
						lifecycle = on
						event-stream = on
						unhandled = on
					}
				}
				
				remote {
					helios.tcp {
							transport-class = "Akka.Remote.Transport.Helios.HeliosTcpTransport, Akka.Remote"
							applied-adapters = []
							transport-protocol = tcp
							hostname = "192.168.1.10"
							port = 3564
					}
				}
				
				cluster {
						seed-nodes = ["akka.tcp://notificationSystem@192.168.1.101:17527"]
						roles = [client]
				}
			}
		]]>
	</hocon>
</akka>

Lighthouse

<akka>
		<hocon>
			<![CDATA[
					lighthouse{
							actorsystem: "notificationSystem"
						}
			
					akka {
						actor { 
							provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
							
							serializers {
								wire = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
							}
						
							serialization-bindings {
								"System.Object" = wire
							}
						}
						
						remote {
							log-remote-lifecycle-events = DEBUG
							helios.tcp {
								transport-class = "Akka.Remote.Transport.Helios.HeliosTcpTransport, Akka.Remote"
								applied-adapters = []
								transport-protocol = tcp
								#will be populated with a dynamic host-name at runtime if left uncommented
								#public-hostname = "192.168.1.100"
								hostname = "192.168.1.101"
								port = 17527
							}
						}            
						
						loggers = ["Akka.Logger.NLog.NLogLogger,Akka.Logger.NLog"]
						
						cluster {
							seed-nodes = ["akka.tcp://notificationSystem@192.168.1.101:17527"]
							roles = [lighthouse]
						}
					}
			]]>
		</hocon>
	</akka>

Sender

<akka>
    <hocon><![CDATA[
				akka{
					# stdout-loglevel = DEBUG
					loglevel = DEBUG
					# log-config-on-start = on
				
					loggers = ["Akka.Logger.NLog.NLogLogger, Akka.Logger.NLog"]
				
					actor{
						debug {  
							# receive = on 
							# autoreceive = on
							# lifecycle = on
							# event-stream = on
							# unhandled = on
						}         
					
						provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"           
						
						serializers {
							wire = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
						}
						
						serialization-bindings {
						 "System.Object" = wire
						}
						
						deployment{							
							/NotificationCoordinator/LoggingCoordinator/DatabaseActor{
								router = round-robin-pool
								resizer{
									enabled = on
									lower-bound = 3
									upper-bound = 5
								}
							}							
							
							/NotificationDeciding/NotificationDecidingWorkerActor{
								router = round-robin-pool
								resizer{
									enabled = on
									lower-bound = 3
									upper-bound = 5
								}
							}
							
							/ScheduledNotificationCoordinator/SendToProMaster/JobToProWorker{
								router = round-robin-pool
								resizer{
									enabled = on
									lower-bound = 3
									upper-bound = 5
								}
							}
						}
					}
					
				 remote{							
							log-remote-lifecycle-events = DEBUG
							log-received-messages = on
							
							helios.tcp{
								transport-class = "Akka.Remote.Transport.Helios.HeliosTcpTransport, Akka.Remote"
								applied-adapters = []
								transport-protocol = tcp
								#will be populated with a dynamic host-name at runtime if left uncommented
								#public-hostname = "POPULATE STATIC IP HERE"
								hostname = "192.168.1.101"
								port = 0
						}
					}
					
					cluster {
						seed-nodes = ["akka.tcp://notificationSystem@192.168.1.101:17527"]
						roles = [sender]
					}
				}
			]]></hocon>
  </akka>

Cluster.Monitor

	<akka>
		<hocon>
			<![CDATA[
					akka {
						stdout-loglevel = INFO
						loglevel = INFO
						log-config-on-start = off 
						
						actor {
							provider = "Akka.Remote.RemoteActorRefProvider, Akka.Remote"				
							
							serializers {
								wire = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
							}
							serialization-bindings {
								"System.Object" = wire
							}

							deployment {								
								/clustermanager {
									dispatcher = akka.actor.synchronized-dispatcher
								}
							}
						}
						
						remote {
							log-remote-lifecycle-events = INFO
							log-received-messages = off
							log-sent-messages = off
							
							helios.tcp {                                
								transport-class = "Akka.Remote.Transport.Helios.HeliosTcpTransport, Akka.Remote"
								applied-adapters = []
								transport-protocol = tcp
								#will be populated with a dynamic host-name at runtime if left uncommented
								#public-hostname = "127.0.0.1"
								hostname = "192.168.1.101"
								port = 0
							}
						}            

						cluster {							
						seed-nodes = ["akka.tcp://notificationSystem@192.168.1.101:17527"]
							roles = [ClusterManager]
							
							client {
								initial-contacts = ["akka.tcp://notificationSystem@192.168.1.101:17527/system/receptionist"]
							}
						}
					}
			]]>
		</hocon>
	</akka>
@Aaronontheweb
Member

Aaronontheweb commented Mar 30, 2017

@Blind-Striker is this an issue with Lighthouse specifically or other Akka.NET services as well?

edit: Nevermind. You said it's the sender service.

This sounds like an issue with unclean shutdowns occurring.

I'd strongly recommend using the latest nightly builds http://getakka.net/docs/akka-developers/nightly-builds until Akka.NET 1.2 is released. The new CoordinatedShutdown feature that comes built into Akka.NET is much, much better about enforcing shutdown rules than what we had before.
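
For anyone finding this later, wiring it up amounts to running the shutdown pipeline when the host stops. A minimal sketch (the exact Run overload differs between Akka.NET versions):

using Akka.Actor;

// On service/application stop: run the full CoordinatedShutdown pipeline,
// which leaves the cluster gracefully before terminating the ActorSystem.
var system = ActorSystem.Create("notificationSystem");
// ... application runs ...
CoordinatedShutdown.Get(system).Run().Wait();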

@Aaronontheweb
Member

On top of that, the new DotNetty transport is going to have improved handling of this type of stuff over Helios.

@Aaronontheweb
Member

Removed my previous comment; that was an error. It's not the cluster's job to do that. We're going to get to the bottom of this, though; in the meantime use the nightlies and let us know on this issue if that improves things. 1.2 is due out imminently; it's blocked by a third-party NuGet release that is due out any day now.

@Blind-Striker
Author

Thanks @Aaronontheweb, we will make the necessary updates and let you know the results. In the meantime, should we open a question on Stack Overflow for other people experiencing this problem?

@Aaronontheweb
Member

@Blind-Striker yep, that would be great.

@Aaronontheweb
Member

Aaronontheweb commented Mar 30, 2017

Our working theory is that we aren't shutting down the associations a reachable node makes when it attempts to associate with an unreachable node (which happens repeatedly on a timer). This is ultimately an Akka.Remote issue. Going to investigate.

@alexvaluyskiy alexvaluyskiy modified the milestones: 1.2.0, 1.3.0 Apr 11, 2017
@jalchr

jalchr commented Apr 20, 2017

I think I'm having the same issue. I have upgraded to v1.2. About 3 hours after starting the service, it tries to connect to unknown ports (ports not configured in the HOCON).

Here is an excerpt of my logs:

2017-04-18 03:46:53,582 [1] INFO Mayadeen.Archive.VideoExporter.ConfigureService - Setting up the service
2017-04-18 03:46:53,660 [1] INFO Topshelf.HostFactory - Configuration Result:
[Success] Name ExportService
[Success] DisplayName Almayadeen video export service
[Success] Description Almaydeen archive system, video export service
[Success] ServiceName ExportService
2017-04-18 03:46:53,660 [1] INFO Topshelf.HostConfigurators.HostConfiguratorImpl - Topshelf v4.0.0.0, .NET Framework v4.0.30319.42000
2017-04-18 03:46:53,676 [1] INFO Topshelf.Runtime.Windows.WindowsServiceHost - Starting as a Windows service
2017-04-18 03:46:53,676 [6] INFO Topshelf.Runtime.Windows.WindowsServiceHost - [Topshelf] Starting
2017-04-18 03:46:53,692 [6] INFO ProcessingService - Starting ... 'Export service'
2017-04-18 03:46:57,208 [11] INFO Akka.Event.DummyClassForStringSources - Starting remoting
2017-04-18 03:46:57,379 [8] INFO Akka.Event.DummyClassForStringSources - Remoting started; listening on addresses : [akka.tcp://ArchiveSystem@140.125.4.2:16667]
2017-04-18 03:46:57,379 [9] INFO Akka.Event.DummyClassForStringSources - Remoting now listens on addresses: [akka.tcp://ArchiveSystem@140.125.4.2:16667]
2017-04-18 03:46:57,426 [10] INFO Akka.Event.DummyClassForStringSources - Cluster Node [akka.tcp://ArchiveSystem@140.125.4.2:16667] - Starting up...
2017-04-18 03:46:57,442 [8] INFO Akka.Event.DummyClassForStringSources - Cluster Node [akka.tcp://ArchiveSystem@140.125.4.2:16667] - Started up successfully
2017-04-18 03:46:57,458 [6] INFO ProcessingService - Started - 'Export service'
2017-04-18 03:46:57,458 [6] INFO Topshelf.Runtime.Windows.WindowsServiceHost - [Topshelf] Started
2017-04-18 03:46:57,458 [26] INFO Mayadeen.Archive.VideoExporter.Akka.ApiMaster - [PreStart] ApiMaster
2017-04-18 03:46:57,458 [26] INFO Mayadeen.Archive.VideoExporter.Akka.VideoMaster - [PreStart] VideoMaster
2017-04-18 03:46:57,661 [1] INFO Topshelf.Hosts.StartHost - The ExportService service was started.
2017-04-18 03:46:58,192 [24] INFO Akka.Cluster.ClusterCoreDaemon - Welcome from [akka.tcp://ArchiveSystem@140.125.4.5:4053]
2017-04-18 03:47:46,257 [30] INFO Akka.Cluster.ClusterCoreDaemon - Leader is moving node [akka.tcp://ArchiveSystem@140.125.4.2:16668] to [Up]
2017-04-18 03:51:08,332 [26] INFO Akka.Cluster.ClusterCoreDaemon - Leader is moving node [akka.tcp://ArchiveSystem@140.125.4.1:16668] to [Up]

2017-04-18 06:50:33,183 [33] WARN Akka.Event.DummyClassForStringSources - Association to [akka.tcp://ArchiveSystem@140.125.4.1:16668] having UID [1471622119] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.

2017-04-18 06:50:40,293 [10] WARN Akka.Remote.EndpointWriter - AssociationError [akka.tcp://ArchiveSystem@140.125.4.2:16667] -> akka.tcp://ArchiveSystem@140.125.4.1:41762: Error [No connection could be made because the target machine actively refused it tcp://ArchiveSystem@140.125.4.1:41762] []

2017-04-18 06:50:40,293 [26] WARN Akka.Event.DummyClassForStringSources - Tried to associate with unreachable remote address [akka.tcp://ArchiveSystem@140.125.4.1:41762]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [No connection could be made because the target machine actively refused it tcp://ArchiveSystem@140.125.4.1:41762]

...

@Aaronontheweb
Member

@jalchr unknown ports? Are you using port 0 somewhere?

@alexvaluyskiy alexvaluyskiy modified the milestones: 1.2.1, 1.3.0 Jun 22, 2017
@Aaronontheweb Aaronontheweb removed this from the 1.2.1 milestone Jun 23, 2017
@Aaronontheweb
Member

@Blind-Striker I downgraded this issue from confirmed bug to potential bug because I haven't been able to reproduce it at all locally using 1.2.0. I monitored the ports closely using SysInternals and chaos-monkey'd the cluster in several different ways, and saw all of the ports open and close as expected.

I also followed your video closely, but even there it doesn't look like the ports are staying open.

PM me on Gitter and let's take a look at this over Skype or email. I need more data on how to reproduce it before I can investigate further.

@Aaronontheweb
Member

Looking at this more closely, I think we may have an issue with DotNetty.

Related: Azure/DotNetty#238

Working on some specs now to verify that it's not an Akka.Remote issue.

Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue Jun 30, 2017
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue Jul 7, 2017
Arkatufus pushed a commit that referenced this issue Jul 8, 2017
* added dotnetty shutdown specs

* added specs to verify that DotNetty transport instances are shutdown correctly

* #2575 - added DotNetty shutdown specs

* added copyright headers
@Blind-Striker
Author

Hi @Aaronontheweb

Since we reduced the number of IIS applications on our Web Server, we have not encountered this problem again. But in the tests we run locally, we get the same results when we apply the method from the video I posted above. I will write a sample application and send it to you to reproduce this case if I have time.

But I must say that we are still using Akka 1.1.3. We will update Akka to version 1.3.1 this week or next. I'll let you know the results.

Maybe this problem is no longer valid. I don't want to take up your time on this case.

@ZigMeowNyan
Contributor

We had an installation report a similar issue last week. It's the first and only report like it, but it's also a VM in a cloud environment with a lot more unrelated network activity going on. We're also using Akka 1.1.3. It's relevant to note that this version of Akka still uses Helios and not DotNetty, so the issue referenced above isn't directly related. I don't think Helios is a supported scenario anymore, so I'm not really looking for an investigation or anything. But it is worthwhile to report additional information.

This installation isn't using clustering at all. It's a collection of five services that communicate directly with each other remotely via Akka.NET. The service that was causing problems is just an inbound connector. On startup, the actor system is created, and 3 instances of ActorSelection for the other services are created. Beyond that, it opens a listening TCP port, waits for incoming connections, handles those incoming messages, and routes them to other services via Akka. The service somehow had over 30,000 ports open when the local admin terminated it. There weren't any incoming connections at the time, but I can't say it hadn't been used at all, since the service had been running for multiple days.

The port was set to 0 in the HOCON config (under helios.tcp), mostly under the assumption that this service associates with the others and none of them will initiate a connection with it, so it should theoretically be OK for it to use a dynamic port.

Some of the connector code does use the Ask<T> extension, which dynamically allocates actors and is most definitely not a pure Actor design. I'll probably be spending some time investigating the lifetime of ports opened via future actors allocated that way, and whether port 0 affects that. And maybe checking whether any heartbeat stuff comes into play.
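
For context, the Ask usage looks roughly like the sketch below (the message and member names are hypothetical, not our real code); each Ask allocates a short-lived future actor under /temp to receive the reply, which is the lifetime I want to verify.

using System;
using System.Threading.Tasks;
using Akka.Actor;

public class RouteRequest { public string Payload { get; set; } }
public class RouteReply  { public bool Accepted { get; set; } }

public static class ConnectorExample
{
    // Forward an incoming message to a remote service and await its reply.
    public static Task<RouteReply> ForwardAsync(ActorSelection remoteService, string payload)
    {
        // Ask<T> spins up a temporary actor to await the response; the temporary
        // actor is torn down when the reply arrives or the timeout elapses.
        return remoteService.Ask<RouteReply>(
            new RouteRequest { Payload = payload },
            TimeSpan.FromSeconds(5));
    }
}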

@Aaronontheweb
Member

BTW, some updates on this.

I've had reports from users on versions of Akka.Cluster as recent as 1.3.7 that port exhaustion is still an ongoing issue, including one this week. The Akka.NET v1.3.8 release includes an upgrade to DotNetty v0.4.8, which has fixes for cleaning up sockets upon channel shutdown. Reports I've had from end users affected by this issue indicate that the Akka.NET v1.3.8 / DotNetty v0.4.8 upgrades effectively address it.

I'm working on a methodology to prove this definitively, but I'd strongly recommend that you upgrade anyway since we appear to have eliminated at least some potential causes of this in the most recent release.

@Ulimo

Ulimo commented Sep 5, 2018

Hi, I'm running Akka.Cluster 1.3.8, and I just experienced port exhaustion.
I received this event: A request to allocate an ephemeral port number from the global TCP port space has failed due to all such ports being in use.

Followed by:
TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint.

Unfortunately I did not capture a netstat, but I saw in the event log that other services on the computer could not create outgoing TCP connections either. When I shut down the Akka services, the problem resolved. So it still seems to be an ongoing issue.

@Aaronontheweb
Member

Thanks @Ulimo - I shelved an idea for being able to definitively test whether or not this is an issue because I thought our most recent DotNetty upgrade had resolved it, but given your bug report it appears that this is not the case. We'll begin looking back into it.

@Aaronontheweb Aaronontheweb added this to the 1.3.10 milestone Sep 5, 2018
@Aaronontheweb Aaronontheweb self-assigned this Sep 5, 2018
@Ulimo

Ulimo commented Sep 6, 2018

@Aaronontheweb FYI, I added a script to both our test and production environments to monitor open TCP ports with timestamps, so it will be possible for me to correlate them with the logs. If this happens again, is there any other information I should collect that would simplify the process of finding the leak?
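
For reference, the sampling amounts to something like the C# sketch below (the actual script just records the timestamped totals; the grouping is only there to show which remote endpoint is accumulating connections).

using System;
using System.Linq;
using System.Net.NetworkInformation;

public static class PortSampler
{
    // Log the current number of active TCP connections, plus the busiest remotes.
    public static void Sample()
    {
        var connections = IPGlobalProperties.GetIPGlobalProperties().GetActiveTcpConnections();

        Console.WriteLine($"{DateTime.UtcNow:O} total={connections.Length}");

        foreach (var group in connections
                     .GroupBy(c => c.RemoteEndPoint.Address.ToString())
                     .OrderByDescending(g => g.Count())
                     .Take(10))
        {
            Console.WriteLine($"  {group.Key}: {group.Count()} connections");
        }
    }
}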

@Aaronontheweb Aaronontheweb modified the milestones: 1.3.10, 1.3.11, 1.3.12 Dec 14, 2018
@Aaronontheweb Aaronontheweb modified the milestones: 1.3.12, 1.3.13 Apr 29, 2019
@Aaronontheweb
Member

Update on this, since @izavala and I have been working on it extensively for a few weeks.

We were able to create a reliable reproduction case for this here: https://github.com/petabridge/SocketLeakDetection along with a set of actors that can automatically kill the affected ActorSystem when the port count grows too rapidly or exceeds some configurable absolute threshold. You can read the README in that repository for a description of how the tool and its underlying math work.
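
To be clear, the snippet below is not the SocketLeakDetection API, just an illustration of the general idea: sample the local TCP connection count on a timer and trigger CoordinatedShutdown once it crosses an absolute threshold (the interval and threshold are arbitrary here).

using System;
using System.Net.NetworkInformation;
using Akka.Actor;

public class PortCountGuard : ReceiveActor
{
    private sealed class Sample
    {
        public static readonly Sample Instance = new Sample();
    }

    public PortCountGuard(int maxPorts)
    {
        var system = Context.System;

        Receive<Sample>(_ =>
        {
            var count = IPGlobalProperties.GetIPGlobalProperties()
                .GetActiveTcpConnections().Length;

            if (count > maxPorts)
            {
                // Runaway port growth: shut this ActorSystem down before the
                // machine exhausts its ephemeral port range.
                CoordinatedShutdown.Get(system).Run();
            }
        });

        // Re-check the port count every 10 seconds.
        Context.System.Scheduler.ScheduleTellRepeatedly(
            TimeSpan.Zero, TimeSpan.FromSeconds(10), Self, Sample.Instance, Self);
    }
}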

More importantly, however - we have a fix for the issue here: #3764 - I'm working on cleaning this up and getting it ready for merge into 1.3.13, which we hope to release immediately. This was a very difficult bug to track down and reproduce, but thanks to the efforts and very detailed logs provided to us by affected end-users we were able to finally identify the underlying cause, which is that under some circumstances we never properly disassociated outbound associations that failed during the Akka.Remote handshake process. You can see my comment on the fix (ported from the JVM) here: #3764 (comment)

This issue happened most noticeably when a node inside Akka.Cluster was quarantined by at least one other node - which caused a perpetual loop of Akka.Cluster retrying the failed connection from the quarantined node back to the quarantiner, but having it rejected each time due to the Akka.Remote quarantine in effect. Since we never properly closed the outbound association, the inbound socket may not have properly closed either - hence why the port count could increase simultaneously on both nodes.

Please leave a comment here if you have any additional questions or insights about this - one of the nastier bugs we've ever had to track down.

@Aaronontheweb
Member

closed via #3764
