Improve the heuristic on when to send a snapshot or events to a follower #7784

deepthidevaki · 2021-09-08T08:34:29Z

When the leader replicates to a follower, it can either send the next event to be replicated or a snapshot at a later index, if one exists. If the follower is lagging behind by many events, it is more efficient to send a snapshot and skip all the missing events. However, if the follower is only lagging behind by a few events, replicating a snapshot has more overhead. Currently we always send the snapshot if one exists. Now that the follower builds its own state, replicating a snapshot is wasteful, because the state built by the follower is thrown away and it has to restart from the new snapshot.

This behavior was observed in one of our benchmarks.
There are frequent snapshot replications:

Frequent re-installation of the stream processor also leads to a temporary increase in memory consumption:

Frequent replication of snapshots is not optimal if the snapshot is big. Hence we need a better heuristic for when to send a snapshot and when to send events even though a snapshot exists at a higher index. A simple strategy would be to send events if the number of events up to the snapshot index is below a threshold, and otherwise send the snapshot. A more complex strategy could compare the size of the snapshot with the size of the missing events.
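
To make the two strategies concrete, here is a minimal sketch. The class, method, and parameter names are hypothetical and not taken from the Zeebe code base; it assumes the leader knows the next index the follower expects, and for the size-based variant that it can estimate the byte size of the missing events and of the snapshot.

```java
/** Hypothetical sketch: names are illustrative, not Zeebe's actual API. */
final class ReplicationHeuristic {

  private ReplicationHeuristic() {}

  /**
   * Simple strategy: prefer events when the follower is missing fewer than
   * {@code threshold} events up to and including the snapshot index.
   */
  static boolean preferEventsByCount(
      final long followerNextIndex, final long snapshotIndex, final long threshold) {
    // Events the follower still needs before it reaches the snapshot index.
    final long missingEvents = snapshotIndex - followerNextIndex + 1;
    return missingEvents < threshold;
  }

  /**
   * More complex strategy: prefer events when the missing events are smaller
   * (in bytes) than the snapshot that would be sent instead.
   */
  static boolean preferEventsBySize(final long missingEventsBytes, final long snapshotBytes) {
    return missingEventsBytes < snapshotBytes;
  }
}
```

With a threshold of 10, a follower expecting index 95 while the snapshot is at index 100 would get events (6 missing), while a follower expecting index 1 would get the snapshot.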
This also means we shouldn't compact the log immediately after taking a snapshot, but wait until the followers have caught up with either the snapshot or the events. For safety, we should compact after a specific timeout even if the followers have not caught up.
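
The delayed compaction could look roughly like the following sketch. Everything here is an assumption for illustration: `CompactionGuard`, `canCompact`, and the idea of tracking one "match index" per follower (the highest index the follower is known to have, whether via events or an installed snapshot) are not taken from the Zeebe code base.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Collection;

/** Hypothetical sketch: decides whether the log may be compacted up to a snapshot. */
final class CompactionGuard {

  // Upper bound on how long compaction is postponed for lagging followers.
  private final Duration maxDelay;

  CompactionGuard(final Duration maxDelay) {
    this.maxDelay = maxDelay;
  }

  boolean canCompact(
      final long snapshotIndex,
      final Instant snapshotTakenAt,
      final Instant now,
      final Collection<Long> followerMatchIndexes) {
    // Safety valve: after the timeout, compact even if followers are behind.
    if (Duration.between(snapshotTakenAt, now).compareTo(maxDelay) >= 0) {
      return true;
    }
    // Otherwise wait until every follower has reached the snapshot index.
    for (final long matchIndex : followerMatchIndexes) {
      if (matchIndex < snapshotIndex) {
        return false;
      }
    }
    return true;
  }
}
```

A follower that never catches up only delays compaction by `maxDelay`, so disk usage stays bounded.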

Zelldon · 2021-09-13T09:16:28Z

I like the second idea, but I would prefer the first and simpler solution for now. If we see issues with that, we can still improve and move to a more sophisticated solution, I think.

lenaschoenburg · 2021-10-05T12:53:50Z

I've started to look into this a bit and wanted to note down what I found out and also confirm with you that I'm looking in the right direction.

I think the starting point for replicating snapshots or events is in LeaderAppender.tryToReplicateSnapshot where we try to replicate a snapshot if possible and otherwise we replicate events. If I understand correctly, we currently always meet the conditions to replicate the snapshot. It should be possible to "just" add a new condition that checks (heuristically) if it would be beneficial to only replicate the events instead.

deepthidevaki · 2021-10-05T13:42:02Z

You are looking in the right direction.

> If I understand correctly, we currently always meet the conditions to replicate the snapshot.

If the follower is slower, then currently we "always" meet the condition. But it can also happen that the follower is expecting an event that comes after the snapshot. In this case, the leader will always send the event and not the snapshot. This issue is about the case where the follower is slower: should we send the snapshot or the event?

> It should be possible to "just" add a new condition that checks (heuristically) if it would be beneficial to only replicate the events instead.

Yes. You can add another condition to decide if you should send the snapshot or the event.

deepthidevaki · 2021-10-05T13:43:36Z

Remember that the events may already have been deleted. So if the heuristic decides to send the event, the leader may not always have that event. In that case it has to send the snapshot.
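
Putting the cases from this discussion together, the decision could be sketched as below. This is not the actual LeaderAppender code: `ReplicationDecision`, `decide`, and `firstLogIndex` (the lowest index still present in the leader's log after compaction) are hypothetical names, and the sketch assumes a snapshot exists.

```java
/** Hypothetical sketch of the leader's choice between events and a snapshot. */
final class ReplicationDecision {

  enum Action {
    SEND_EVENTS,
    SEND_SNAPSHOT
  }

  private ReplicationDecision() {}

  static Action decide(
      final long followerNextIndex,
      final long snapshotIndex,
      final long firstLogIndex,
      final long threshold) {
    // The follower expects an event after the snapshot: the snapshot is useless to it.
    if (followerNextIndex > snapshotIndex) {
      return Action.SEND_EVENTS;
    }
    // The event the follower needs was already deleted: only the snapshot can help.
    if (followerNextIndex < firstLogIndex) {
      return Action.SEND_SNAPSHOT;
    }
    // Both options are possible, so apply the count-based heuristic.
    final long missingEvents = snapshotIndex - followerNextIndex + 1;
    return missingEvents < threshold ? Action.SEND_EVENTS : Action.SEND_SNAPSHOT;
  }
}
```

The availability check comes before the heuristic, so a small lag never leads the leader to try sending an event it no longer has.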

lenaschoenburg · 2021-10-07T10:50:02Z

I've tried to write a test that validates the new behaviour by disconnecting a follower, appending new entries and then reconnecting the follower. I've noticed that in the onInstall of the PassiveRole we always try to send a snapshot, regardless of how much the follower is lagging behind. Should we change this to use the new heuristic as well or should we leave it as is?

If we leave it as is I think I might need to write a new test helper that supports stopping/slowing down a follower so that I can create a follower that is lagging behind. (This seems related to #4586 which was just closed).

deepthidevaki · 2021-10-07T11:02:22Z

onInstall deals with receiving snapshots. It doesn't trigger sending new snapshots. If you have issues with the test, it could be something else. Let's look into it together.

deepthidevaki added the kind/toil, scope/broker, and area/performance labels Sep 8, 2021

npepinpe added this to Planned in Zeebe Sep 9, 2021

lenaschoenburg self-assigned this Oct 5, 2021

lenaschoenburg moved this from Planned to In progress in Zeebe Oct 6, 2021

lenaschoenburg mentioned this issue Oct 8, 2021

Prefer replicating events instead of snapshots #7957

Merged

9 tasks

npepinpe moved this from In progress to Review in progress in Zeebe Oct 8, 2021

lenaschoenburg mentioned this issue Oct 11, 2021

Configurable threshold for deciding between replicating events or snapshots #7968

Closed

This was referenced Oct 13, 2021

Unexpected drop of performance with 1.2 #7955

Closed

OutOfMemory in follower #7992

Closed

ghost closed this as completed in 8a4adb8 Oct 15, 2021

Zeebe automation moved this from Review in progress to Done Oct 15, 2021

menski added the Release: 1.3.0-alpha1 label Nov 8, 2021

korthout added the version:1.3.0 label Jan 4, 2022

KerstinHebel removed this from Done in Zeebe Mar 23, 2022

This issue was closed.
