Replies: 2 comments
-
We have a system in place that attempts to do this today for Akka.Cluster.Sharding, but the problem comes down to a matter of timing - it's not really feasible to implement a global pause on message traffic to a particular node in a given cluster under all circumstances. Some nodes leave abruptly, some nodes lag due to non-software factors, and a global lock on a ShardRegion would need to be checked every time we messaged anything in the region (even during the happy path.) Because of that last factor, messages can get lost in transit today despite our buffering and hand-off system - even happy-path handoffs require a small amount of eventual consistency, and messages can be lost during that window.

The approach I'd recommend for solving this is to keep the Sharding infrastructure as-is, so it stays as simple and as fast as possible, and to layer a reliable delivery mechanism on top of it. That way more robust modes of delivery can be added as-needed without complicating the sharding infrastructure (which is already fairly complex.) One of the projects we still have planned for the Akka.NET 1.5 project lifecycle is a new package, "Akka.Delivery" or "Akka.ReliableDelivery" (we're open to suggestions on naming). We're getting some interest from users in this module already - so I think we'll likely start work on it sooner rather than later.
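To make the "reliable delivery layer on top of sharding" idea concrete, here is a minimal, language-agnostic sketch (in Python, since the point is the protocol, not the Akka.NET API - the `Producer`/`Consumer` names and all details here are invented for illustration and are not the Akka.Delivery design). The sender assigns sequence numbers and buffers every message until it is acknowledged; the receiver delivers in order and acks only what it has processed, so anything lost during a handoff is simply redelivered:

```python
class Producer:
    """Sender side: assigns sequence numbers, buffers unacked messages."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = {}            # seq -> payload, kept until acked

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload  # survives a lost in-flight message
        return (seq, payload)

    def on_ack(self, seq):
        self.unacked.pop(seq, None)

    def pending(self):
        # Everything that must be redelivered after a timeout or handoff.
        return sorted(self.unacked.items())


class Consumer:
    """Receiver side: delivers in order, dedupes, acks processed seqs."""

    def __init__(self):
        self.delivered = []
        self.expected = 0            # next in-order sequence number

    def receive(self, seq, payload):
        if seq == self.expected:
            self.delivered.append(payload)
            self.expected += 1
        # Ack only messages we have actually processed.
        return seq if seq < self.expected else None


p, c = Producer(), Consumer()
msgs = [p.send(x) for x in ["a", "b", "c"]]

# Simulate "b" lost in transit during a handoff: only "a" and "c" arrive.
for seq, payload in (msgs[0], msgs[2]):
    ack = c.receive(seq, payload)
    if ack is not None:
        p.on_ack(ack)

# Redelivering everything still unacked recovers "b", then "c", in order.
for seq, payload in p.pending():
    ack = c.receive(seq, payload)
    if ack is not None:
        p.on_ack(ack)
```

The key design point is that the delivery guarantee lives entirely in this layer - the sharding transport underneath is free to drop messages during rebalances without any global locking.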
-
@Aaronontheweb maybe also porting the sharding pattern as part of #4720 would ease @Binelli's concerns.
-
Hi,
Sometimes, when I start a new node in the cluster and the rebalance occurs, some messages get dropped from the Shard that is being moved to another node. I know the reason for this is to avoid reordering.
Couldn't we avoid dropping these messages using the following strategy:
1 - ShardCoordinator sends a "CoordinatedBeginHandOff" message to the ShardRegion that owns the Shard, containing the ShardId and the IActorRefs of all the ShardRegions that are part of the cluster
2 - ShardRegion receives the message and sends the BeginHandOff message to each of the other ShardRegions. It then waits for all BeginHandOffAcks and forwards them to the ShardCoordinator
3 - When it receives the last BeginHandOffAck, it can perform the BeginHandOff process itself and send the final BeginHandOffAck back to the ShardCoordinator
4 - ShardCoordinator then sends the ShardRegion the HandOff
This way, I think we could be sure that the ShardRegion that owns the Shard being moved will not receive messages from other ShardRegions after they send their BeginHandOffAcks. Any messages buffered between processing BeginHandOff and receiving the HandOff message would all come from the same (local) origin, so once the HandOff process completes they can be forwarded to the new destination without any risk of reordering.
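The four steps above can be sketched as a small simulation. This is not Akka.Cluster.Sharding code - the class and message names just mirror the ones in the proposal, and the coordinator drives the fan-out directly for brevity instead of relaying through the owning ShardRegion. The invariant being illustrated is that once a region has acked BeginHandOff, it buffers messages for the frozen shard locally rather than forwarding them, so nothing can race the HandOff:

```python
class ShardRegion:
    def __init__(self, name):
        self.name = name
        self.frozen = set()    # shard ids for which we acked BeginHandOff
        self.buffered = []     # locally-originated messages held back
        self.inbox = []        # messages actually delivered to this region

    def begin_hand_off(self, shard_id):
        # Steps 2-3: stop forwarding to this shard, then ack.
        self.frozen.add(shard_id)
        return ("BeginHandOffAck", shard_id, self.name)

    def deliver(self, shard_id, msg, owner):
        # After acking, messages for the shard are buffered locally
        # instead of being forwarded, so they can be flushed in their
        # original (local) order once the handoff completes.
        if shard_id in self.frozen:
            self.buffered.append((shard_id, msg))
        else:
            owner.inbox.append((shard_id, msg, self.name))


class ShardCoordinator:
    def rebalance(self, shard_id, owner, other_regions):
        # Step 1: CoordinatedBeginHandOff (fan-out driven here directly).
        acks = [r.begin_hand_off(shard_id) for r in other_regions]  # step 2
        acks.append(owner.begin_hand_off(shard_id))                 # step 3
        assert all(a[0] == "BeginHandOffAck" for a in acks)
        return ("HandOff", shard_id)                                # step 4


coord = ShardCoordinator()
owner, a, b = ShardRegion("owner"), ShardRegion("a"), ShardRegion("b")
hand_off = coord.rebalance("shard-1", owner, [a, b])

# A message arriving at region "a" after its ack is buffered, not
# forwarded, so the old owner's inbox stays quiet during the handoff.
a.deliver("shard-1", "late-msg", owner)
```

What the simulation cannot show is the timing problem raised in the reply above: a real region can crash or lag before sending its ack, which is why the coordinated pause is hard to guarantee under all circumstances.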