How prevent "Scheduled sending of heartbeat was delayed" and occasionally network partitions #4432

Ralf1108 · 2020-05-25T10:00:17Z

Akka.Net: 1.4.5
Windows Server 2012 R2

Reproduce:
Clone, compile and run ClusterTester on 2 machines. One is a seed node, the other a non seed node.
The test programm doesn't do anything and produces no load. It just creates a minimal cluster.
Let it run for some days.

Log from SeedNode:

ClusterTester.exe 10000 10000 10.112.10.181 > _log.txt

[INFO][07.05.2020 12:17:34][Thread 0001][remoting (akka://ClusterSystem)] Starting remoting
[INFO][07.05.2020 12:17:35][Thread 0001][remoting (akka://ClusterSystem)] Remoting started; listening on addresses : [akka.tcp://ClusterSystem@10.112.10.181:10000]
[INFO][07.05.2020 12:17:35][Thread 0001][remoting (akka://ClusterSystem)] Remoting now listens on addresses: [akka.tcp://ClusterSystem@10.112.10.181:10000]
[INFO][07.05.2020 12:17:35][Thread 0001][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Starting up...
[INFO][07.05.2020 12:17:35][Thread 0001][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Started up successfully
Press any key to exit
[INFO][07.05.2020 12:17:35][Thread 0003][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Node [akka.tcp://ClusterSystem@10.112.10.181:10000] is JOINING itself (with roles []) and forming a new cluster
[INFO][07.05.2020 12:17:35][Thread 0003][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Leader is moving node [akka.tcp://ClusterSystem@10.112.10.181:10000] to [Up]
[INFO][07.05.2020 12:17:35][Thread 0003][akka.tcp://ClusterSystem@10.112.10.181:10000/user/clusterListener] Member is Up: Member(address = akka.tcp://ClusterSystem@10.112.10.181:10000, Uid=1206048204 status = Up, role=[], upNumber=1)
[INFO][07.05.2020 12:17:41][Thread 0005][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Received InitJoin message from [[akka.tcp://ClusterSystem@10.112.10.171:11000/system/cluster/core/daemon/joinSeedNodeProcess-1#980317007]] to [akka.tcp://ClusterSystem@10.112.10.181:10000]
[INFO][07.05.2020 12:17:41][Thread 0005][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Sending InitJoinNack message from node [akka.tcp://ClusterSystem@10.112.10.181:10000] to [[akka.tcp://ClusterSystem@10.112.10.171:11000/system/cluster/core/daemon/joinSeedNodeProcess-1#980317007]]
[INFO][07.05.2020 12:17:41][Thread 0018][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Node [akka.tcp://ClusterSystem@10.112.10.171:11000] is JOINING, roles []
[INFO][07.05.2020 12:17:42][Thread 0018][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Leader is moving node [akka.tcp://ClusterSystem@10.112.10.171:11000] to [Up]
[INFO][07.05.2020 12:17:42][Thread 0018][akka.tcp://ClusterSystem@10.112.10.181:10000/user/clusterListener] Member is Up: Member(address = akka.tcp://ClusterSystem@10.112.10.171:11000, Uid=2052147158 status = Up, role=[], upNumber=2)
[WARNING][08.05.2020 02:31:01][Thread 0066][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [2703,15] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][17.05.2020 02:31:06][Thread 0026][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [3596,5893] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][19.05.2020 18:48:49][Thread 0032][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [2453,1283] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][19.05.2020 18:49:49][Thread 0031][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [2437,5046] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][19.05.2020 18:50:49][Thread 0005][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [2000,0024] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][19.05.2020 19:01:50][Thread 0031][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [2562,5076] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][20.05.2020 02:31:19][Thread 0039][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [2656,2562] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][23.05.2020 02:31:51][Thread 0033][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [2234,3635] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][25.05.2020 07:38:59][Thread 0022][akka.tcp://ClusterSystem@10.112.10.181:10000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.181:10000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [33339,2789] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.

Log from non seed node:

ClusterTester.exe 11000 10000 10.112.10.181 > _log.txt

[INFO][07.05.2020 12:17:04][Thread 0001][remoting (akka://ClusterSystem)] Starting remoting
[INFO][07.05.2020 12:17:04][Thread 0001][remoting (akka://ClusterSystem)] Remoting started; listening on addresses : [akka.tcp://ClusterSystem@10.112.10.171:11000]
[INFO][07.05.2020 12:17:04][Thread 0001][remoting (akka://ClusterSystem)] Remoting now listens on addresses: [akka.tcp://ClusterSystem@10.112.10.171:11000]
[INFO][07.05.2020 12:17:04][Thread 0001][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Starting up...
Press any key to exit
[INFO][07.05.2020 12:17:04][Thread 0001][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Started up successfully
[INFO][07.05.2020 12:17:05][Thread 0003][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Welcome from [akka.tcp://ClusterSystem@10.112.10.181:10000]
[INFO][07.05.2020 12:17:05][Thread 0015][akka.tcp://ClusterSystem@10.112.10.171:11000/user/clusterListener] Member is Up: Member(address = akka.tcp://ClusterSystem@10.112.10.181:10000, Uid=1206048204 status = Up, role=[], upNumber=1)
[INFO][07.05.2020 12:17:06][Thread 0015][akka.tcp://ClusterSystem@10.112.10.171:11000/user/clusterListener] Member is Up: Member(address = akka.tcp://ClusterSystem@10.112.10.171:11000, Uid=2052147158 status = Up, role=[], upNumber=2)
[WARNING][08.05.2020 10:04:40][Thread 0015][akka.tcp://ClusterSystem@10.112.10.171:11000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [35897,9341] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][08.05.2020 10:07:50][Thread 0022][akka.tcp://ClusterSystem@10.112.10.171:11000/system/cluster/core/daemon/heartbeatSender] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [34902,6] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
[WARNING][20.05.2020 02:30:51][Thread 0004][akka.tcp://ClusterSystem@10.112.10.171:11000/system/cluster/core/daemon] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://ClusterSystem@10.112.10.181:10000, Uid=1206048204 status = Up, role=[], upNumber=1)]. Node roles []
[INFO][20.05.2020 02:30:51][Thread 0023][akka.tcp://ClusterSystem@10.112.10.171:11000/user/clusterListener] Member detected as unreachable: Member(address = akka.tcp://ClusterSystem@10.112.10.181:10000, Uid=1206048204 status = Up, role=[], upNumber=1)
[INFO][20.05.2020 02:30:52][Thread 0023][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Ignoring received gossip status from unreachable [UniqueAddress: (akka.tcp://ClusterSystem@10.112.10.181:10000, 1206048204)]
[INFO][20.05.2020 02:30:52][Thread 0023][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Ignoring received gossip status from unreachable [UniqueAddress: (akka.tcp://ClusterSystem@10.112.10.181:10000, 1206048204)]
[INFO][20.05.2020 02:30:52][Thread 0004][Cluster (akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.112.10.171:11000] - Marking node(s) as REACHABLE [Member(address = akka.tcp://ClusterSystem@10.112.10.181:10000, Uid=1206048204 status = Up, role=[], upNumber=1)]. Node roles []

Observation:
Occasionally we log missing heartbeats from the non seed node. Ranging from 2 to 3 seconds. Largest ones are more than 30 seconds. The longer the heartbeat pauses the more often we get a network partition.

How can this happen and what can we do about it ?
The real problem that in our production environment when there is more load on the system these heartbeat pauses are more often and causes network partition multiple times a day.
As workaround we implemented:

a restart mechanism if network partition occurs on server side
a retry mechanism on client side if a timeout exception oocur (mostly because of server network partition)
and increased the "acceptable-heartbeat-pause" for the "cluster" from default 3s to 10s
but there a still network partitions.

To ensure that we dont have network issues or hiccups we run a ping in parallel to the ClusterTester. We collected up to today more than 1.5 million pings completing in 1ms and faster. No hiccups or errors:

ping -t 10.112.10.181 > server_ping.txt

Reply from 10.112.10.181: bytes=32 time<1ms TTL=128

What else can we do to investigate this more?

The text was updated successfully, but these errors were encountered:

Ralf1108 · 2020-05-25T10:02:03Z

the original problem was #4419
This issue minimize the problem area to diagnose

Ralf1108 · 2020-05-27T15:54:12Z

FYI: We improved the ClusterTester program to also send pings via akka to to seed node.
We have one node running external to the seed node and one node running on the same machine as the seed.
In addition we monitor the network health again from external node to seed node via simple ping command.

Lets check how it looks after the weekend

Ralf1108 · 2020-06-02T07:21:56Z

Results from our weekend test... no error/warning/missing heartbeat when running our ClusterTester.
The observed log messages in the entry post didn't reappear.

That means that the error is not in Akka.Net.

It is in our akka usage code.

Aaronontheweb · 2020-07-21T19:07:15Z

So I think this actually might be related to the Akka.Cluster actors running on the shared /user dispatcher in Akka.NET - this might get resolved once #4511 is merged in

Aaronontheweb · 2020-11-05T18:51:00Z

Resolved via #4511

Ralf1108 mentioned this issue May 25, 2020

How force akka system run on its own threadpool and mixed used of cluster client #4419

Closed

Aaronontheweb added akka-cluster potential bug labels May 25, 2020

Aaronontheweb modified the milestones: 1.4.7, 1.4.8 May 25, 2020

Aaronontheweb modified the milestones: 1.4.8, 1.4.9 Jun 17, 2020

Aaronontheweb modified the milestones: 1.4.9, 1.4.10 Jul 21, 2020

Aaronontheweb modified the milestones: 1.4.10, 1.4.11 Aug 20, 2020

Aaronontheweb added this to To do in Oct 26 - Nov 6 Sprint Oct 26, 2020

Aaronontheweb closed this as completed Nov 5, 2020

Oct 26 - Nov 6 Sprint automation moved this from To do to Done Nov 5, 2020

This was referenced Nov 5, 2020

added Akka.NET v1.4.11 release notes #4606

Merged

v1.4.11 Production Release #4607

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How prevent "Scheduled sending of heartbeat was delayed" and occasionally network partitions #4432

How prevent "Scheduled sending of heartbeat was delayed" and occasionally network partitions #4432

Ralf1108 commented May 25, 2020

Ralf1108 commented May 25, 2020

Ralf1108 commented May 27, 2020

Ralf1108 commented Jun 2, 2020

Aaronontheweb commented Jul 21, 2020

Aaronontheweb commented Nov 5, 2020

How prevent "Scheduled sending of heartbeat was delayed" and occasionally network partitions #4432

How prevent "Scheduled sending of heartbeat was delayed" and occasionally network partitions #4432

Comments

Ralf1108 commented May 25, 2020

Ralf1108 commented May 25, 2020

Ralf1108 commented May 27, 2020

Ralf1108 commented Jun 2, 2020

Aaronontheweb commented Jul 21, 2020

Aaronontheweb commented Nov 5, 2020