
improve Akka.Cluster / Akka.Remote DeadLetter logging #7149

Merged

merged 11 commits into akkadotnet:dev from the remoting-deadletters branch Apr 10, 2024

Conversation

@Aaronontheweb (Member) commented Apr 10, 2024

Changes

Various cleanup inside Akka.Cluster and Akka.Remote in order to make it easier to debug cluster gossip issues.


@Aaronontheweb (Member, Author):

Racy spec: ClusterSpec.A_cluster_must_leave_via_CoordinatedShutdownRun()

Xunit.Sdk.XunitException
Expected removed.PreviousStatus to equal MemberStatus.Exiting(3) by value, but found MemberStatus.Leaving(2).

With configuration:
- Use declared types and members
- Compare enums by value
- Match member by name (or throw)
- Without automatic conversion.
- Be strict about the order of items in byte arrays

   at FluentAssertions.Execution.XUnit2TestFramework.Throw(String message)
   at FluentAssertions.Execution.TestFrameworkProvider.Throw(String message)
   at FluentAssertions.Execution.CollectingAssertionStrategy.ThrowIfAny(IDictionary`2 context)
   at FluentAssertions.Equivalency.EquivalencyValidator.AssertEquality(EquivalencyValidationContext context)
   at FluentAssertions.Primitives.ObjectAssertions.BeEquivalentTo[TExpectation](TExpectation expectation, Func`2 config, String because, Object[] becauseArgs)
   at FluentAssertions.Primitives.ObjectAssertions.BeEquivalentTo[TExpectation](TExpectation expectation, String because, Object[] becauseArgs)
   at Akka.Cluster.Tests.ClusterSpec.A_cluster_must_leave_via_CoordinatedShutdownRun() in C:\Repositories\olympus\akka.net\src\core\Akka.Cluster.Tests\ClusterSpec.cs:line 479
   at Xunit.Sdk.TestInvoker`1.<>c__DisplayClass48_0.<<InvokeTestMethodAsync>b__1>d.MoveNext() in /_/src/xunit.execution/Sdk/Frameworks/Runners/TestInvoker.cs:line 276
--- End of stack trace from previous location ---
   at Xunit.Sdk.ExecutionTimer.AggregateAsync(Func`1 asyncAction) in /_/src/xunit.execution/Sdk/Frameworks/ExecutionTimer.cs:line 48
   at Xunit.Sdk.ExceptionAggregator.RunAsync(Func`1 code) in /_/src/xunit.core/Sdk/ExceptionAggregator.cs:line 90

Somehow, it looks like a state transition was missed between Leaving and Removed - the logs in the cluster daemon correctly record that this node is exiting:

[INFO][04/10/2024 18:19:08.362Z][Thread 0020][Cluster (akka://ClusterSpec2)] Cluster Node [2.11.1] - Node [akka.tcp://ClusterSpec2@0.0.0.0:57312] is JOINING itself (with roles [], version [2.11.1]) and forming a new cluster
[INFO][04/10/2024 18:19:08.362Z][Thread 0020][Cluster (akka://ClusterSpec2)] Cluster Node [akka.tcp://ClusterSpec2@0.0.0.0:57312] - is the new leader among reachable nodes (more leaders may exist)
[INFO][04/10/2024 18:19:08.362Z][Thread 0020][Cluster (akka://ClusterSpec2)] Cluster Node [akka.tcp://ClusterSpec2@0.0.0.0:57312] - Leader is moving node [akka.tcp://ClusterSpec2@0.0.0.0:57312] to [Up]
[INFO][04/10/2024 18:19:08.362Z][Thread 0013][akka.tcp://ClusterSpec2@0.0.0.0:57312/system/cluster/core/daemon/downingProvider] This node is now the leader responsible for taking SBR decisions among the reachable nodes (more leaders may exist).
[INFO][04/10/2024 18:19:08.362Z][Thread 0020][Cluster (akka://ClusterSpec2)] Cluster Node [akka.tcp://ClusterSpec2@0.0.0.0:57312] - Marked address [akka.tcp://ClusterSpec2@0.0.0.0:57312] as [Leaving]
[INFO][04/10/2024 18:19:09.360Z][Thread 0013][Cluster (akka://ClusterSpec2)] Cluster Node [akka.tcp://ClusterSpec2@0.0.0.0:57312] - Leader is moving node [akka.tcp://ClusterSpec2@0.0.0.0:57312] to [Exiting]
[INFO][04/10/2024 18:19:09.360Z][Thread 0013][Cluster (akka://ClusterSpec2)] Cluster Node [akka.tcp://ClusterSpec2@0.0.0.0:57312] - Exiting completed.
[INFO][04/10/2024 18:19:09.360Z][Thread 0013][Cluster (akka://ClusterSpec2)] Cluster Node [akka.tcp://ClusterSpec2@0.0.0.0:57312] - Shutting down...
[INFO][04/10/2024 18:19:09.360Z][Thread 0013][Cluster (akka://ClusterSpec2)] Cluster Node [akka.tcp://ClusterSpec2@0.0.0.0:57312] - Successfully shut down
[INFO][04/10/2024 18:19:09.360Z][Thread 0013][akka://ClusterSpec2/system/cluster/core/publisher] Message [PublishChanges] from [akka://ClusterSpec2/system/cluster/core/daemon#1459903719] to [akka://ClusterSpec2/system/cluster/core/publisher#1716656279] was not delivered. [1] dead letters encountered. If this is not an expected behavior then [akka://ClusterSpec2/system/cluster/core/publisher#1716656279] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. Message content: Akka.Cluster.InternalClusterAction+PublishChanges
[INFO][04/10/2024 18:19:09.360Z][Thread 0013][akka://ClusterSpec2/system/cluster/core/publisher] Message [PublishChanges] from [akka://ClusterSpec2/system/cluster/core/daemon#1459903719] to [akka://ClusterSpec2/system/cluster/core/publisher#1716656279] was not delivered. [2] dead letters encountered. If this is not an expected behavior then [akka://ClusterSpec2/system/cluster/core/publisher#1716656279] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. Message content: Akka.Cluster.InternalClusterAction+PublishChanges
[DEBUG][04/10/2024 18:19:09.362Z][Thread 0014][ActorSystem(ClusterSpec2)] System shutdown initiated
[INFO][04/10/2024 18:19:09.362Z][Thread 0105][remoting-terminator] Shutting down remote daemon.
[INFO][04/10/2024 18:19:09.362Z][Thread 0105][remoting-terminator] Remote daemon shut down; proceeding with flushing remote transports.
[DEBUG][04/10/2024 18:19:09.362Z][Thread 0063][ActorSystem(ClusterSpec-130)] System shutdown initiated
[INFO][04/10/2024 18:19:09.363Z][Thread 0013][remoting-terminator] Shutting down remote daemon.
[INFO][04/10/2024 18:19:09.363Z][Thread 0013][remoting-terminator] Remote daemon shut down; proceeding with flushing remote transports.
[INFO][04/10/2024 18:19:09.364Z][Thread 0014][remoting-terminator] Remoting shut down.
[DEBUG][04/10/2024 18:19:09.364Z][Thread 0013][EventStream] Shutting down: StandardOutLogger started
[DEBUG][04/10/2024 18:19:09.364Z][Thread 0013][EventStream] All default loggers stopped

That PublishChanges event that gets dead-lettered might be received AFTER this issue has occurred, so I'm not sure how significant it is.

@Aaronontheweb (Member, Author) left a comment:

Detailed my changes

@@ -44,7 +44,6 @@ public string ShardId(object message)
=> message switch
{
int i => (i % 10).ToString(),
- ShardRegion.StartEntity se => (int.Parse(se.EntityId) % numberOfShards).ToString(),

Removed per analyzer warning

@@ -476,7 +476,7 @@ public async Task A_cluster_must_leave_via_CoordinatedShutdownRun()
await probe.ExpectMsgAsync<ClusterEvent.MemberLeft>();
// MemberExited might not be published before MemberRemoved
var removed = (ClusterEvent.MemberRemoved)await probe.FishForMessageAsync(m => m is ClusterEvent.MemberRemoved);
- removed.PreviousStatus.Should().BeEquivalentTo(MemberStatus.Exiting);
+ new [] {MemberStatus.Exiting, MemberStatus.Leaving}.Should().Contain(removed.PreviousStatus);

Fix for racy spec - we'd already implemented this for one of the other "node leaving" specs.

The problem is that the PublishChanges message gets DeadLetter'd between the Exiting and Removed statuses - the Removed status is generated, in this case, from the singleton cluster shutting itself down, rather than from gossip messages. This is a timing issue that mostly affects the spec - it has no real-world impact. So we accept either of these two "terminating" statuses instead.

@@ -257,11 +257,11 @@ public void A_gossip_must_find_two_oldest_per_role_as_targets_for_Exiting_change
Member a8 = TestMember.Create(new Address("akka.tcp", "sys", "a8", 2552), MemberStatus.Exiting, ImmutableHashSet<string>.Empty.Add("role1"), upNumber: 8);
Member a9 = TestMember.Create(new Address("akka.tcp", "sys", "a9", 2552), MemberStatus.Exiting, ImmutableHashSet<string>.Empty.Add("role2"), upNumber: 9);

- IEnumerable<Member> theExiting = new Member[] { a5, a6 };
+ var theExiting = new Member[] { a5, a6 };

Cleaned this up to just use an IReadOnlyCollection<Member> - to avoid some of the multiple enumeration problems GossipTargetsForExitingMembers was susceptible to.
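To make the hazard concrete, here's a minimal sketch in plain C# (illustrative names, not the actual GossipTargetsForExitingMembers code) of why a lazily-evaluated IEnumerable<T> that gets enumerated more than once is a problem, and how the IReadOnlyCollection<Member> parameter plus the .ToArray() capture in the next hunk avoid it:

using System;
using System.Collections.Generic;
using System.Linq;

public static class MultipleEnumerationDemo
{
    // Accepting IReadOnlyCollection<T> pushes callers to materialize the
    // sequence once, so every pass below walks the same snapshot instead of
    // re-running a lazy LINQ query.
    private static void NotifyOldest(IReadOnlyCollection<string> exiting)
    {
        if (exiting.Count == 0) return; // a property read, not another enumeration

        foreach (var member in exiting)
            Console.WriteLine($"gossip exiting member {member} to oldest node");

        foreach (var member in exiting)
            Console.WriteLine($"gossip exiting member {member} to second-oldest node");
    }

    public static void Main()
    {
        var changed = new[] { "a1:Exiting", "a2:Up", "a3:Exiting" };

        // Materialize the lazy Where(...) once at the call site - the same
        // idea as the .ToArray() capture in the diff below.
        var exiting = changed.Where(m => m.EndsWith(":Exiting")).ToArray();
        NotifyOldest(exiting);
    }
}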

@@ -2339,15 +2338,19 @@ public void LeaderActionsOnConvergence()
}

PublishMembershipState();
- GossipExitingMembersToOldest(changedMembers.Where(i => i.Status == MemberStatus.Exiting));
+ GossipExitingMembersToOldest(changedMembers.Where(i => i.Status == MemberStatus.Exiting).ToArray());

Capture an array and pass it here, in order to solve the multiple enumeration problem. No real perf hit here - this only happens when leaders are converging on cluster membership status changes.

}

/// <summary>
/// Gossip the Exiting change to the two oldest nodes for quick dissemination to potential Singleton nodes
/// </summary>
/// <param name="exitingMembers"></param>
- private void GossipExitingMembersToOldest(IEnumerable<Member> exitingMembers)
+ private void GossipExitingMembersToOldest(IReadOnlyCollection<Member> exitingMembers)

Fix for the multi-enumeration problem.

@@ -711,7 +703,7 @@ private void Gated(bool writerTerminated, bool earlyUngateRequested)
// remote address at the EndpointManager level stopping this actor. In case the remote system becomes reachable
// again it will be immediately quarantined due to out-of-sync system message buffer and becomes quarantined.
// In other words, this action is safe.
- if (_bailoutAt != null && _bailoutAt.IsOverdue)
+ if (_bailoutAt is { IsOverdue: true })

ReSharper suggested some pattern matching here.
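For reference, a small sketch of the equivalence ReSharper is suggesting (the Deadline type here is a stand-in, not Akka's): the property pattern folds the null check and the property test into one expression.

public sealed class Deadline
{
    public bool IsOverdue { get; set; }
}

public static class PatternDemo
{
    // Classic null check followed by a property test...
    public static bool OverdueClassic(Deadline bailoutAt)
        => bailoutAt != null && bailoutAt.IsOverdue;

    // ...and the equivalent property pattern: it only matches when bailoutAt
    // is non-null AND IsOverdue is true.
    public static bool OverduePattern(Deadline bailoutAt)
        => bailoutAt is { IsOverdue: true };
}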

@@ -1195,7 +1187,7 @@ private void Initializing()
{
// Assert handle == None?
Context.Parent.Tell(
new ReliableDeliverySupervisor.GotUid((int)handle.ProtocolHandle.HandshakeInfo.Uid, RemoteAddress));

Unnecessary cast.

}

- private Receive ReceiveWithAlwaysLogging()
+ private bool ReceiveWithAlwaysLogging(object message)

Minor perf optimization - avoid allocating new behaviors here if we can help it.

}

- private Receive ReceiveWithMaxCountLogging()
+ private bool ReceiveWithMaxCountLogging(object message)

Refactored to be just a method, instead of allocating a new Receive delegate each time.
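The shape of the refactor, sketched outside of Akka's internals (assuming the Receive delegate is Akka.NET's bool Receive(object message) signature): the old form returned a freshly built delegate every time the behavior was requested, while the new form is an ordinary instance method that can be invoked directly.

public delegate bool Receive(object message);

public sealed class LoggingSketch
{
    private int _count;

    // Before: each call builds and returns a new Receive delegate.
    public Receive ReceiveWithAlwaysLoggingFactory()
        => message =>
        {
            _count++;
            System.Console.WriteLine($"[{_count}] {message}");
            return true;
        };

    // After: a plain method with the same shape - no per-call delegate
    // allocation on the hot path.
    public bool ReceiveWithAlwaysLogging(object message)
    {
        _count++;
        System.Console.WriteLine($"[{_count}] {message}");
        return true;
    }
}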

@@ -27,7 +27,6 @@ public MessageExtractor(int maxNumberOfShards) : base(maxNumberOfShards)
public override object EntityMessage(object message)
=> message switch
{
- ShardingEnvelope e => e.Message,

Resolving a build warning by removing automatically handled sharding messages.

@Aaronontheweb changed the title from "[WIP] improve Akka.Cluster / Akka.Remote DeadLetter logging" to "improve Akka.Cluster / Akka.Remote DeadLetter logging" Apr 10, 2024

public override string ToString()
{
return $"GossipEnvelope(from={From}, to={To}, gossip={Gossip}, deadline={Deadline})";

This is the primary DeadLetter logging change I wanted to make here - it should make it much easier to understand what was lost in the DeadLetter pile, in terms of missing gossip messages, when things are shutting down inside cluster nodes.
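As the PublishChanges entries above suggest, the dead-letter log line renders the undelivered message via its ToString(), which by default is just the CLR type name. A minimal illustration of the pattern (a stand-in type, not the real GossipEnvelope):

public sealed class ExampleGossipEnvelope
{
    public ExampleGossipEnvelope(string from, string to, string gossip)
    {
        From = from;
        To = to;
        Gossip = gossip;
    }

    public string From { get; }
    public string To { get; }
    public string Gossip { get; }

    // With this override, the "Message content:" tail of a dead-letter entry
    // shows the payload details instead of the bare type name.
    public override string ToString()
        => $"ExampleGossipEnvelope(from={From}, to={To}, gossip={Gossip})";
}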

@Aaronontheweb marked this pull request as ready for review April 10, 2024 19:22
@Arkatufus (Contributor) left a comment:

LGTM

@Aaronontheweb merged commit 49fe208 into akkadotnet:dev Apr 10, 2024
7 of 12 checks passed
@Aaronontheweb deleted the remoting-deadletters branch April 10, 2024 20:32