Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow PersistentShardCoordinator to tolerate duplicate ShardHomeAllocated messages #5967

Conversation

Aaronontheweb
Copy link
Member

@Aaronontheweb Aaronontheweb commented May 26, 2022

Fixes #5604

Changes

Allows the PersistentShardCoordinator to tolerate multiple ShardHomeAllocated messages in the journal for the same shard - this is done in order to allow the sharding system to try its best to restart itself without bringing the system offline in the event of duplicate data appearing in the journal.

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

@Aaronontheweb Aaronontheweb changed the title Allow PersistentShardCoordinator to tolerate duplicate `ShardHomeAl… Allow PersistentShardCoordinator to tolerate duplicate ShardHomeAllocated messages May 26, 2022
Copy link
Member Author

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed my changes and offered a rationale.

/// itself and go offline.</param>
/// <exception cref="ArgumentException">Thrown if an event is illegal in the current state.</exception>
/// <returns>An update copy of this state.</returns>
public State Updated(IDomainEvent e, bool isRecovering = false)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now accept a new parameter here, isRecovering - when this value is set to true then we enable "reader tolerance" inside the PersistentShardCoordinator.

@@ -153,7 +156,27 @@ public State Updated(IDomainEvent e)
if (!Regions.TryGetValue(message.Region, out var shardRegions))
throw new ArgumentException($"Region {message.Region} not registered", nameof(e));
if (Shards.ContainsKey(message.Shard))
throw new ArgumentException($"Shard {message.Shard} is already allocated", nameof(e));
{
if (!isRecovering)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've decided to implement reader tolerance in only one location for now the ShardHomeAllocated message, since this is where the issue has come up most often and it's also probably the safest place to do it.

// we're going to allow new value to overwrite previous
var newRegions = Regions;
var previousRegion = Shards[message.Shard];
if (Regions.TryGetValue(previousRegion, out var previousShards))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ALGO is as follows:

  1. Take previous ShardRegion actor, if any, remove shard from its collection.
  2. Add shard to new ShardRegion actor. Add shard to its collection.
  3. It is very likely that both ShardRegion actors are the same in many cases, in which these two operations are identical to what the normal happy path does - but in case they are not, results in the new ShardHomeAllocated message overwriting the old one.

@@ -1346,26 +1369,26 @@ protected override bool ReceiveRecover(object message)
switch (evt)
{
case ShardRegionRegistered _:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure all of the Recovery methods are updated to support the new isRecovering flag.

Copy link
Contributor

@Arkatufus Arkatufus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Arkatufus Arkatufus merged commit 4fa7abb into akkadotnet:v1.4 May 26, 2022
Arkatufus added a commit to Arkatufus/akka.net that referenced this pull request May 26, 2022
…erate duplicate ShardHomeAllocated messages
@Aaronontheweb Aaronontheweb deleted the fix-5604-tolerant-persistent-shard-coordinator branch May 27, 2022 16:24
@Aaronontheweb
Copy link
Member Author

Conversation on how safe these changes are #5970 (comment)

Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this pull request May 30, 2022
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this pull request May 31, 2022
@Aaronontheweb Aaronontheweb mentioned this pull request May 31, 2022
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants