Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed: Akka.Cluster.Singleton no longer moves when node with matching role and higher AppVersion joins cluster #7197

Open
wants to merge 13 commits into
base: dev
Choose a base branch
from

Conversation

Aaronontheweb
Copy link
Member

@Aaronontheweb Aaronontheweb commented May 23, 2024

Changes

Fixes #7196 - not done yet.

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

@Aaronontheweb Aaronontheweb changed the title WIP Akka.Cluster.Singleton fixes Fixed: Akka.Cluster.Singleton no longer moves when node with matching role and higher AppVersion joins cluster May 23, 2024
@Aaronontheweb Aaronontheweb marked this pull request as ready for review May 23, 2024 17:08
Copy link
Member Author

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detailed my changes


[Fact(DisplayName =
"Singletons should not move to higher AppVersion nodes until after older incarnation is downed")]
public async Task Bugfix7196Spec()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

End to End reproduction spec:

  1. Forms 2 node cluster: 1 with the role enabled for singleton ( _hostNodeV1), the other without (Sys). Both nodes are running with AppVersion == 1.0.0.
  2. Sys will create the ClusterSingletonProxy - which will point to the SingletonManger and the resulting Singleton actor hosted on _hostNodeV1.
  3. _otherNodeV2 joins the cluster with AppVersion == 1.0.2. The bug from Akka.Cluster.Tools.Singleton: singleton moves as soon as node with higher AppVersion joins cluster? #7196 meant that this node would IMMEDIATELY become the Oldest and the singleton would move (and the old one would not die, creating a split brain.)
  4. We validate that the singleton is still on _hostNodeV1 - which means we've fixed the bug, but we still need to make sure the oldest gets replaced if the current one leaves.
  5. We down _hostNodeV1 and observe that the singleton migrates over to _otherNodeV2 afterwards.

This validates, end to end, that the issue is resolved. The only way we could make this more robust is adding a second _hostNodeV1 instance and observe that _otherNodeV2 gets elected in favor over it, but we have other tests below that ensure this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests for my OldestChangedBuffersState abstraction, which is now shared by both the SingletonProxy and the OldestChangedBuffer

@@ -5,6 +5,7 @@
// </copyright>
//-----------------------------------------------------------------------

#nullable enable
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enable nullability here - just trying to do that as I go back through older code.

@@ -102,7 +99,8 @@ public ClusterSingletonProxy(string singletonManagerPath, ClusterSingletonProxyS
_memberAgeComparer = settings.ConsiderAppVersion
? MemberAgeOrdering.DescendingWithAppVersion
: MemberAgeOrdering.Descending;
_membersByAge = ImmutableSortedSet<Member>.Empty.WithComparer(_memberAgeComparer);
_state = new OldestChangedBufferState(ImmutableSortedSet<Member>.Empty.WithComparer(_memberAgeComparer),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the new OldestChangedBufferState instead of separate fields

{
var (newState, oldestChanged) = _state.AddMember(member);
_state = newState;
if (oldestChanged)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check to see if the oldest has changed, if so retry identifying the singleton (consistent with the behavior we had before, just no Action delegates and the TrackChanges method to do it.)


State = State with { MembersByAge = newMembersByAge };

// compute the initial oldest - doesn't matter if it's not "safe" or not. That's our parent's problem.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment for details - behavior is the same as before, just computed by the State

_membersByAge = _membersByAge.Remove(member);
_membersByAge = _membersByAge.Add(member);
});
_changes = _changes.Enqueue(new OldestChanged(State.CurrentOldest?.UniqueAddress));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same behavior as before, but this is just now internally handled by the State.

/// <summary>
/// Immutable data object that represents the state of the oldest changed buffer.
/// </summary>
internal sealed record OldestChangedBufferState
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New immutable data structure designed to do two things:

  1. Predict who the next leader should be
  2. Not actually allow the leader to be changed until the previous leader has exited the cluster <-- THIS IS THE KEY BUGFIX.

if (MatchingRole(state, member))
{
// remove then add node to replace it, as it's possible that the upNumber is changed
return ComputeNextOldest(state with { MembersByAge = state.MembersByAge.Remove(member).Add(member) });
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same "replacement" behavior that we used inside the OldestChanged buffer before.

public static (OldestChangedBufferState newState, bool oldestChanged) ComputeNextOldest(this OldestChangedBufferState state)
{
// if the current oldest has not been removed, then it remains the oldest
if(state.CurrentOldest is not null && state.MembersByAge.Contains(state.CurrentOldest))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key bugfix happens right here - we don't allow a new "oldest" node to be decided until AFTER the previous one is removed from the cluster. This prevents split brains from happening AND it stops the ClusterSingletonManager from moving immediately after a new with a high AppVersion and matching role joins the cluster.

@Aaronontheweb
Copy link
Member Author

I have a racy failure / instability of some kind in my test. Working on ruling that out before I get anyone else to look at this PR. I think it might have been related to the auto-downing setting I copied and pasted from the Restart2Spec.

Copy link
Contributor

@Arkatufus Arkatufus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Aaronontheweb Aaronontheweb added the NO MERGE Don't merge. label May 24, 2024
@Aaronontheweb
Copy link
Member Author

So this "fix" will actually create additional problems - I'm thinking that the right thing to do here is to remove the "consider app version" support from Akka.Cluster.Singleton altogether. That problem doesn't seem workable without adding another communication layer around singletons, which I am not inclined to do. Better that we have a single heuristic for deciding where singletons live so we can keep the design of the infrastructure as simple as possible and eliminate any possibility of split brains.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Akka.Cluster.Tools.Singleton: singleton moves as soon as node with higher AppVersion joins cluster?
2 participants