Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prefer replicating events instead of snapshots #7957

Merged
5 commits merged into from
Oct 15, 2021

Conversation

lenaschoenburg
Copy link
Member

Description

This adds a threshold on number of events a follower can lag behind a leader to decide wether we should replicate events or a snapshot. This should save resources because replicating a low number of events should be more efficient than replicating a snapshot.
We can't always replicate events even if the lag is below the threshold because some events might not be available anymore due to log compaction.

Related issues

closes #7784

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatibility with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/0.25) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criterias of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement

@lenaschoenburg
Copy link
Member Author

There are a couple of tests that I'll need to adjust because they are assuming that we'd replicate snapshots in most cases.

@lenaschoenburg
Copy link
Member Author

@deepthidevaki Seems like there are still some unresolved test failures that I have to work on, sorry for the early request for review.

@deepthidevaki
Copy link
Contributor

@oleschoenburg Let me know if you need help with the failing tests.

@lenaschoenburg
Copy link
Member Author

Tests seem to run through now (except a temporary issue for LGTM, not sure how to restart that). At least on my machine the relevant tests seem unusually flaky. One issue seems to be that during tests, filling a segment takes too long. I'd like to add a large payload to every message that is used to fill the segment to hopefully speed up this process.

Copy link
Contributor

@deepthidevaki deepthidevaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀
Please see my comment on the new "condition" and the test to reproduce the error. Othwerise it looks good, just have some optional comments.

This enables us to perform snapshots on members without triggering log compaction. To achieve this we remove the RaftSnapshotListener and (optionally) run compaction synchronously.
… be preferred

Adds a config value that indicates how much a follower must lag behind the leader before we prefer replicating a snapshot instead of events. The default value of 100 was chosen arbitrarily and might need readjustment.
Copy link
Contributor

@deepthidevaki deepthidevaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! 🎉

Please see my comment before merging.

Various tests assumed that we always replicate snapshots. Here we either remove tests that are no longer necessary or adjust tests such that snapshots are replicated, for example by triggering log compaction
@lenaschoenburg
Copy link
Member Author

I'll run some benchmarks before merging to confirm that the current threshold of 100 is a sensible value.

@lenaschoenburg
Copy link
Member Author

Looking at the benchmarks it appears that this PR has no performance impact. With the current benchmark setup, some followers are lagging behind by >10k events which means that the current threshold of 100 is too low to prevent replicating snapshots.

Throughput

We can still find a better threshold or a different method to choose between replicating snapshots or events later.

@lenaschoenburg
Copy link
Member Author

bors merge

@deepthidevaki
Copy link
Contributor

Thanks @oleschoenburg for running the benchmark.
If the follower is lagging behind be ~10K events, then it is better to send the snapshot instead of the events. As you said it make sense to evaluate what would be a good threshold for it. 100 might be low, but 10K is too big.

@ghost
Copy link

ghost commented Oct 15, 2021

Build succeeded:

@ghost ghost merged commit 8a4adb8 into develop Oct 15, 2021
@ghost ghost deleted the 7784-event-replication-heuristic branch October 15, 2021 09:32
@Zelldon
Copy link
Member

Zelldon commented Oct 22, 2021

@npepinpe are we planning to backport this?

@npepinpe
Copy link
Member

Would this fix or help fix any bugs?

@Zelldon
Copy link
Member

Zelldon commented Oct 22, 2021

My current assumption is that it fixes #7955 but I'm still investigating. I started a new benchmark yesterday and will keep it running for a while.

@npepinpe
Copy link
Member

npepinpe commented Oct 22, 2021

Then we can do that 👍 @oleschoenburg @deepthidevaki - how do you see these changes? How is the risk/value ratio of backporting this? It doesn't seem like too much at a quick glance so I'd be fine with backporting, but you probably have a better idea.

If you think it's fine and worth it, then can one of you please just walk Ole through how to backport things? Thanks!

@deepthidevaki
Copy link
Contributor

I don't see any risk backporting this. If there is a chance this fixes the bug, we can backport it.
In @oleschoenburg 's benchmark, there were still many "InstallRequests" because the follower was lagging behind by 1000s of events.

@lenaschoenburg
Copy link
Member Author

I'm not sure about the risk, I'll defer to @deepthidevaki for that. As for "is it worth": I'd be positively surprised if this solves #7955 because in our (admittedly limited) benchmarks we weren't able to see any impact.

@Zelldon
Copy link
Member

Zelldon commented Oct 22, 2021

Lets wait until next week, then I will check my benchmark again. :)

@Zelldon
Copy link
Member

Zelldon commented Oct 25, 2021

We can scratch that from our list. It is still failing with one partition and this fix. http://34.77.165.228/d/I4lo7_EZk/zeebe?orgId=1&from=1634801065067&to=1634854108575&var-DS_PROMETHEUS=Prometheus&var-namespace=zell-chaos-cw42&var-pod=All&var-partition=All

general

We can see again lot of install requests 🤷
raft

I think the breaking of performance is also highly depended on the state size

runtime
snapshot

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve the heuristic on when to send a snapshot or events to a follower
5 participants