release-23.1: kvserver: don't put a timeout on circuit breaker probes #106054
Conversation
Thanks for opening a backport. Please check the backport criteria before merging:

- If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
- Add a brief release justification to the body of your PR to justify this backport.

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
Is this ready to be merged? Also the 22.2 backport?
Yes, my scheduled reminder to merge it just fired :-)
Backport 1/1 commits from #105896.
Going to let this bake for a week or two, just in case.
Release justification: stability fix coming out of a post-mortem
/cc @cockroachdb/release
Before this commit, replica circuit breaker probes used a timeout. This was
done as a safeguard: even if there were some bug in reproposal code, we could
conceivably end up with an eternally stuck probe even though new commands would
go through.
In reality though, well-intended is often the opposite of well done. Under
longer outages, these probe attempts piled up in the `r.mu.proposals` map,
where they wouldn't get removed even if the probe got canceled - because we
remove a proposal only when it applies (this is necessary for commands that
hold latches, though probes don't hold latches and thus could be treated
differently).
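The lifecycle described above can be sketched in a toy model. This is purely illustrative: the type and method names (`proposalMap`, `propose`, `cancel`, `apply`) are hypothetical and are not CockroachDB's actual code; the point is only that cancellation does not remove an entry, while apply does.

```go
package main

import "fmt"

// proposal models a pending raft proposal; illustrative only.
type proposal struct {
	id       int
	canceled bool
}

// proposalMap mimics the lifecycle described above: entries are added
// when proposed and removed only when they apply.
type proposalMap map[int]*proposal

func (m proposalMap) propose(id int) { m[id] = &proposal{id: id} }

// cancel marks the proposal's client as gone, but crucially does NOT
// delete the entry: removal happens only at apply time.
func (m proposalMap) cancel(id int) {
	if p, ok := m[id]; ok {
		p.canceled = true
	}
}

// apply is the only place an entry is deleted.
func (m proposalMap) apply(id int) { delete(m, id) }

func main() {
	m := proposalMap{}
	// During an outage, each probe attempt times out and is canceled,
	// but nothing ever applies, so the map only grows.
	for attempt := 0; attempt < 100; attempt++ {
		m.propose(attempt)
		m.cancel(attempt)
	}
	fmt.Println(len(m)) // 100: every canceled probe is still pinned in memory
}
```

Under this model, each timed-out probe attempt leaks one map entry for the duration of the outage, which is the memory build-up the paragraph describes.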
This led to memory build-up on nodes with affected replicas, but another effect
of having many probe proposals is that they would all put new entries into the
raft log every couple of seconds. A constant arrival rate of requests (given by
the probes) plus the regular duplication of everything that is inflight implies
quadratic growth of the number of entries in the log. Since we currently don't
have an effective limit on the memory usage in the replication layer, these
large raft logs could have a profoundly destabilizing effect on clusters once
they started to recover, and in extreme cases would lead into a metastable
regime.
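The quadratic growth argument above can be made concrete with a little arithmetic. In this sketch (an assumption-laden model, not actual raft code), each "reproposal tick" adds one fresh probe while every still-inflight proposal gets re-appended to the log, giving a total of 1 + 2 + ... + n = n(n+1)/2 entries after n ticks:

```go
package main

import "fmt"

// entriesWithTimeout models the pre-fix behavior sketched above:
// every reproposal tick, one new probe arrives (the previous one timed
// out but stays inflight), and every inflight proposal is appended to
// the log again. Purely illustrative arithmetic.
func entriesWithTimeout(ticks int) int {
	inflight, total := 0, 0
	for t := 0; t < ticks; t++ {
		inflight++        // a fresh probe attempt is proposed
		total += inflight // every inflight proposal is re-appended
	}
	return total // = ticks*(ticks+1)/2, i.e. quadratic in ticks
}

func main() {
	for _, n := range []int{10, 100, 1000} {
		fmt.Printf("%4d ticks -> %6d log entries\n", n, entriesWithTimeout(n))
	}
}
```

With 1000 ticks this yields 500500 entries, which is why a long outage could leave behind raft logs large enough to destabilize the recovering cluster.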
This commit removes the timeout from the probe, meaning that unless the node
is restarted, it will send ~one probe per replica and block until this probe
returns. Under the hood, the replication layer will still re-add this probe
to the log periodically, however this only results in linear growth.
Follow-up work can then reduce this linear growth further.
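For contrast, the post-fix behavior can be sketched the same way (again a hypothetical model, not the actual replication code): a single probe per replica is proposed once and blocks, and the replication layer re-adds only that one entry per reproposal tick, so growth is linear in the outage duration.

```go
package main

import "fmt"

// entriesSingleProbe models the post-fix behavior: one never-canceled
// probe stays inflight, and only that single entry is re-appended to
// the log each reproposal tick. Illustrative only.
func entriesSingleProbe(ticks int) int {
	const inflight = 1 // the lone blocking probe
	total := 0
	for t := 0; t < ticks; t++ {
		total += inflight
	}
	return total // grows linearly: total == ticks
}

func main() {
	fmt.Println(entriesSingleProbe(1000)) // 1000 entries, vs. 500500 pre-fix
}
```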
Touches #103908. I'm leaving the issue since the goal should be to not even
have linear growth in this case; will re-title.
To verify these changes, I ran a three-node local roachprod cluster with and
without the changes[^1]. After initial up-replication, I stopped nodes 2 and 3 and
had a coffee. Upon returning, I looked at the ranges endpoint for the remaining
node and found a range that had the circuit breaker tripped, noting its
rangeID. I then stopped the node, and dumped the raft log for that rangeID.
Unsurprisingly, this confirmed that we were only seeing reproposals of a single
probe in the log with this PR, and multiple separate probes being reproposed
multiple times without this PR. In other words, we were seeing linear and not
quadratic growth with this PR.
Epic: CRDB-25287
Release note (bug fix): under prolonged unavailability (such as loss of
quorum), affected ranges would exhibit raft log growth that was quadratic as a
function of the duration of the outage. Now this growth is approximately linear
instead.
[^1]: and disabled CheckQuorum, which independently fixes this bug; however, this PR is intended to be backported.