-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow PersistentShardCoordinator
to tolerate duplicate ShardHomeAllocated
messages
#5967
Allow PersistentShardCoordinator
to tolerate duplicate ShardHomeAllocated
messages
#5967
Conversation
…located` messages close akkadotnet#5604
PersistentShardCoordinator
to tolerate duplicate `ShardHomeAl…PersistentShardCoordinator
to tolerate duplicate ShardHomeAllocated
messages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed my changes and offered a rationale.
/// itself and go offline.</param> | ||
/// <exception cref="ArgumentException">Thrown if an event is illegal in the current state.</exception> | ||
/// <returns>An update copy of this state.</returns> | ||
public State Updated(IDomainEvent e, bool isRecovering = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now accept a new parameter here, isRecovering
- when this value is set to true
then we enable "reader tolerance" inside the PersistentShardCoordinator
.
@@ -153,7 +156,27 @@ public State Updated(IDomainEvent e) | |||
if (!Regions.TryGetValue(message.Region, out var shardRegions)) | |||
throw new ArgumentException($"Region {message.Region} not registered", nameof(e)); | |||
if (Shards.ContainsKey(message.Shard)) | |||
throw new ArgumentException($"Shard {message.Shard} is already allocated", nameof(e)); | |||
{ | |||
if (!isRecovering) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've decided to implement reader tolerance in only one location for now the ShardHomeAllocated
message, since this is where the issue has come up most often and it's also probably the safest place to do it.
// we're going to allow new value to overwrite previous | ||
var newRegions = Regions; | ||
var previousRegion = Shards[message.Shard]; | ||
if (Regions.TryGetValue(previousRegion, out var previousShards)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ALGO is as follows:
- Take previous
ShardRegion
actor, if any, removeshard
from its collection. - Add shard to new
ShardRegion
actor. Addshard
to its collection. - It is very likely that both
ShardRegion
actors are the same in many cases, in which these two operations are identical to what the normal happy path does - but in case they are not, results in the newShardHomeAllocated
message overwriting the old one.
@@ -1346,26 +1369,26 @@ protected override bool ReceiveRecover(object message) | |||
switch (evt) | |||
{ | |||
case ShardRegionRegistered _: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure all of the Recovery
methods are updated to support the new isRecovering
flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…erate duplicate ShardHomeAllocated messages
Conversation on how safe these changes are #5970 (comment) |
…rdHomeAllocated` messages (akkadotnet#5967)" This reverts commit 4fa7abb.
…rdHomeAllocated` messages (akkadotnet#5967)" This reverts commit 4fa7abb.
Fixes #5604
Changes
Allows the
PersistentShardCoordinator
to tolerate multipleShardHomeAllocated
messages in the journal for the same shard - this is done in order to allow the sharding system to try its best to restart itself without bringing the system offline in the event of duplicate data appearing in the journal.Checklist
For significant changes, please ensure that the following have been completed (delete if not relevant):
PersistentShardCoordinator
a tolerant reader #5604