-
Notifications
You must be signed in to change notification settings - Fork 571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the heuristic on when to send a snapshot or events to a follower #7784
Comments
I like the second idea, but I would prefer the first and simpler solution for now. If we see issues with that we can still improve and move to a more sophisticated solution I think |
I've started to look into this a bit and wanted to note down what I found out and also confirm with you that I'm looking in the right direction. I think the starting point for replicating snapshots or events is in |
You are in the right direction.
If the follower is slower, then currently we "always" meet the condition. But it can also happen that the follower is expecting an event which is after the snapshot. In this case, the leader will always send the event and not the snapshot. This issue is regarding when the follower is slower, should we send the snapshot or the event.
Yes. You can add another condition to decide if you should send the snapshot or the event. |
Remember that it can happen that events are already deleted. So if the heuristics decide to send the event, it may not always have that the event. In that case it has to sent the snapshot. |
I've tried to write a test that validates the new behaviour by disconnecting a follower, appending new entries and then reconnecting the follower. I've noticed that in the If we leave it as is I think I might need to write a new test helper that supports stopping/slowing down a follower so that I can create a follower that is lagging behind. (This seems related to #4586 which was just closed). |
|
When the leader replicates to a follower, it can either send the next event to be replicated or a snapshot at a later index if one exists. If the follower is lagging behind many events it is more efficient to send a snapshot and skip all the missing events. However if the follower is only lagging behind by a few events, replicating snapshot has more overhead. Now we always send the snapshot if the snapshot exists. Now that the follower is building its own state, replicating snapshot is wasteful because the state build by the follower will be thrown away and it has to restart from the new snapshot.
This behavior was observed in one of our benchmark.
![image](https://user-images.githubusercontent.com/1997478/132472330-3da72d30-0b0a-4fe4-83d0-bc096621e5d6.png)
There are frequent snapshot replications:
Frequent re-installation of streamprocessor also leads to temporary increase in memory consumption:
![image](https://user-images.githubusercontent.com/1997478/132472594-b28e8b20-bc28-4c2f-ab0b-cdf1f3cf0439.png)
Frequent replication of snapshot is not optimal if the snapshot is big. Hence we need a better heuristic on when to send a snapshot vs when to send the event even if a snapshot exists at a higher index. A simple strategy can be to send the events if the number of events until the snapshot index is less than a threshold, else send the snapshot. A more complex strategy could be based on the size of snapshot vs size of missing events.
This also means we shouldn't compact the logs immediately after taking a snapshot, but wait until the followers are caught up with either the snapshot or the events. For safety, compact after a specific timeout even if followers are not caught up.
The text was updated successfully, but these errors were encountered: