Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How prevent "Scheduled sending of heartbeat was delayed" and occasionally network partitions #4432

Closed
Ralf1108 opened this issue May 25, 2020 · 5 comments

Comments

@Ralf1108
Copy link
Contributor

Akka.Net: 1.4.5
Windows Server 2012 R2

Reproduce:
Clone, compile and run ClusterTester on 2 machines. One is a seed node, the other a non seed node.
The test programm doesn't do anything and produces no load. It just creates a minimal cluster.
Let it run for some days.

Log from SeedNode:

ClusterTester.exe 10000 10000 10.112.10.181 > _log.txt

[INFO][07.05.2020 12:17:34][Thread 0001][remoting (akka://ClusterSystem)] Starting remoting
[INFO][07.05.2020 12:17:35][Thread 0001][remoting (akka://ClusterSystem)] Remoting started; listening on addresses : [akka.tcp://ClusterSystem@10.112.10.181:10000]
[INFO][07.05.2020 12:17:35][Thread 0001][remoting (akka://ClusterSystem)] Remoting now listens on addresses: [akka.tcp://ClusterSystem@10.112.10.181:10000]
[INFO][07.05.2020 12:17:35][Thread 0001][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Starting up...
[INFO][07.05.2020 12:17:35][Thread 0001][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Started up successfully
Press any key to exit
[INFO][07.05.2020 12:17:35][Thread 0003][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Node [akka.tcp://ClusterSystem@10.112.10.181:10000] is JOINING itself (with roles []) and forming a new cluster
[INFO][07.05.2020 12:17:35][Thread 0003][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Leader is moving node [akka.tcp://ClusterSystem@10.112.10.181:10000] to [Up]
[INFO][07.05.2020 12:17:35][Thread 0003][akka.tcp://ClusterSystem@10.112.10.181:10000/user/clusterListener] Member is Up: Member(address = akka.tcp://ClusterSystem@10.112.10.181:10000, Uid=1206048204 status = Up, role=[], upNumber=1)
[INFO][07.05.2020 12:17:41][Thread 0005][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Received InitJoin message from [[akka.tcp://ClusterSystem@10.112.10.171:11000/system/cluster/core/daemon/joinSeedNodeProcess-1#980317007]] to [akka.tcp://ClusterSystem@10.112.10.181:10000]
[INFO][07.05.2020 12:17:41][Thread 0005][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Sending InitJoinNack message from node [akka.tcp://ClusterSystem@10.112.10.181:10000] to [[akka.tcp://ClusterSystem@10.112.10.171:11000/system/cluster/core/daemon/joinSeedNodeProcess-1#980317007]]
[INFO][07.05.2020 12:17:41][Thread 0018][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Node [akka.tcp://ClusterSystem@10.112.10.171:11000] is JOINING, roles []
[INFO][07.05.2020 12:17:42][Thread 0018][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Leader is moving node [akka.tcp://ClusterSystem@10.112.10.171:11000] to [Up]
[INFO][07.05.2020 12:17:42][Thread 0018][akka.tcp://ClusterSystem@10.112.10.181:10000/user/clusterListener] Member is Up: Member(address = akka.tcp://ClusterSystem@10.112.10.171:11000, Uid=2052147158 status = Up, role=[], upNumber=2)
[WARNING][08.05.2020 02:31:01][Thread 0066][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [2703,15] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][17.05.2020 02:31:06][Thread 0026][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [3596,5893] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][19.05.2020 18:48:49][Thread 0032][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [2453,1283] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][19.05.2020 18:49:49][Thread 0031][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [2437,5046] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][19.05.2020 18:50:49][Thread 0005][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [2000,0024] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][19.05.2020 19:01:50][Thread 0031][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [2562,5076] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][20.05.2020 02:31:19][Thread 0039][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [2656,2562] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][23.05.2020 02:31:51][Thread 0033][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [2234,3635] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][25.05.2020 07:38:59][Thread 0022][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [33339,2789] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.

Log from non seed node:

ClusterTester.exe 11000 10000 10.112.10.181 > _log.txt

[INFO][07.05.2020 12:17:04][Thread 0001][remoting (akka://ClusterSystem)] Starting remoting
[INFO][07.05.2020 12:17:04][Thread 0001][remoting (akka://ClusterSystem)] Remoting started; listening on addresses : [akka.tcp://ClusterSystem@10.112.10.171:11000]
[INFO][07.05.2020 12:17:04][Thread 0001][remoting (akka://ClusterSystem)] Remoting now listens on addresses: [akka.tcp://ClusterSystem@10.112.10.171:11000]
[INFO][07.05.2020 12:17:04][Thread 0001][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Starting up...
Press any key to exit
[INFO][07.05.2020 12:17:04][Thread 0001][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Started up successfully
[INFO][07.05.2020 12:17:05][Thread 0003][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Welcome from [akka.tcp://ClusterSystem@10.112.10.181:10000]
[INFO][07.05.2020 12:17:05][Thread 0015][akka.tcp://ClusterSystem@10.112.10.171:11000/user/clusterListener] Member is Up: Member(address = akka.tcp://ClusterSystem@10.112.10.181:10000, Uid=1206048204 status = Up, role=[], upNumber=1)
[INFO][07.05.2020 12:17:06][Thread 0015][akka.tcp://ClusterSystem@10.112.10.171:11000/user/clusterListener] Member is Up: Member(address = akka.tcp://ClusterSystem@10.112.10.171:11000, Uid=2052147158 status = Up, role=[], upNumber=2)
[WARNING][08.05.2020 10:04:40][Thread 0015][akka.tcp://ClusterSystem@10.112.10.171:11000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [35897,9341] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][08.05.2020 10:07:50][Thread 0022][akka.tcp://ClusterSystem@10.112.10.171:11000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [34902,6] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][20.05.2020 02:30:51][Thread 0004][akka.tcp://ClusterSystem@10.112.10.171:11000/system/cluster/core/daemon] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://ClusterSystem@10.112.10.181:10000, Uid=1206048204 status = Up, role=[], upNumber=1)]. Node roles []
[INFO][20.05.2020 02:30:51][Thread 0023][akka.tcp://ClusterSystem@10.112.10.171:11000/user/clusterListener] Member detected as unreachable: Member(address = akka.tcp://ClusterSystem@10.112.10.181:10000, Uid=1206048204 status = Up, role=[], upNumber=1)
[INFO][20.05.2020 02:30:52][Thread 0023][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Ignoring received gossip status from unreachable [UniqueAddress: (akka.tcp://ClusterSystem@10.112.10.181:10000, 1206048204)]
[INFO][20.05.2020 02:30:52][Thread 0023][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Ignoring received gossip status from unreachable [UniqueAddress: (akka.tcp://ClusterSystem@10.112.10.181:10000, 1206048204)]
[INFO][20.05.2020 02:30:52][Thread 0004][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Marking node(s) as REACHABLE [Member(address = akka.tcp://ClusterSystem@10.112.10.181:10000, Uid=1206048204 status = Up, role=[], upNumber=1)]. Node roles []

Observation:
Occasionally we log missing heartbeats from the non seed node. Ranging from 2 to 3 seconds. Largest ones are more than 30 seconds. The longer the heartbeat pauses the more often we get a network partition.

How can this happen and what can we do about it ?
The real problem that in our production environment when there is more load on the system these heartbeat pauses are more often and causes network partition multiple times a day.
As workaround we implemented:

  • a restart mechanism if network partition occurs on server side
  • a retry mechanism on client side if a timeout exception oocur (mostly because of server network partition)
  • and increased the "acceptable-heartbeat-pause" for the "cluster" from default 3s to 10s
    but there a still network partitions.

To ensure that we dont have network issues or hiccups we run a ping in parallel to the ClusterTester. We collected up to today more than 1.5 million pings completing in 1ms and faster. No hiccups or errors:

ping -t 10.112.10.181 > server_ping.txt

Reply from 10.112.10.181: bytes=32 time<1ms TTL=128

What else can we do to investigate this more?

@Ralf1108
Copy link
Contributor Author

the original problem was #4419
This issue minimize the problem area to diagnose

@Ralf1108
Copy link
Contributor Author

FYI: We improved the ClusterTester program to also send pings via akka to to seed node.
We have one node running external to the seed node and one node running on the same machine as the seed.
In addition we monitor the network health again from external node to seed node via simple ping command.

Lets check how it looks after the weekend

@Ralf1108
Copy link
Contributor Author

Ralf1108 commented Jun 2, 2020

Results from our weekend test... no error/warning/missing heartbeat when running our ClusterTester.
The observed log messages in the entry post didn't reappear.

That means that the error is not in Akka.Net.

It is in our akka usage code.

@Aaronontheweb Aaronontheweb modified the milestones: 1.4.8, 1.4.9 Jun 17, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.9, 1.4.10 Jul 21, 2020
@Aaronontheweb
Copy link
Member

So I think this actually might be related to the Akka.Cluster actors running on the shared /user dispatcher in Akka.NET - this might get resolved once #4511 is merged in

@Aaronontheweb Aaronontheweb modified the milestones: 1.4.10, 1.4.11 Aug 20, 2020
@Aaronontheweb
Copy link
Member

Resolved via #4511

Oct 26 - Nov 6 Sprint automation moved this from To do to Done Nov 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

No branches or pull requests

2 participants