
release-23.1: kvserver: don't put a timeout on circuit breaker probes #106054

Merged 1 commit into cockroachdb:release-23.1 from backport23.1-105896 on Jul 17, 2023

Conversation

@tbg (Member) commented on Jul 3, 2023

Backport 1/1 commits from #105896.

Going to let this bake for a week or two, just in case.

Release justification: stability fix coming out of a post-mortem

/cc @cockroachdb/release


Before this commit, replica circuit breaker probes used a timeout. This was
intended as a safeguard: if there were some bug in the reproposal code, a probe
could conceivably end up stuck forever even though new commands would go
through.

In reality, though, well-intended is often the opposite of well done. Under
longer outages, these probe attempts piled up in the `r.mu.proposals` map,
where they wouldn't get removed even if the probe got canceled, because we
remove a proposal only when it applies (this is necessary for commands that
hold latches; probes don't hold latches and thus could be treated
differently).
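
To make the mechanism concrete, here is a minimal Go sketch of the difference, assuming hypothetical stand-ins (`sendProbe` and the channels below are illustration only, not the actual kvserver code): with a deadline on the probe's context, every timeout abandons one attempt and triggers a fresh one, while a plain cancelable context keeps a single probe pending until it returns.

```go
// Minimal sketch, not the actual kvserver code: sendProbe and the channels
// below are hypothetical stand-ins used only to illustrate the change from a
// deadline-bound probe context to a cancelable one without a deadline.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sendProbe blocks until the (simulated) replica recovers or ctx is done.
func sendProbe(ctx context.Context, recovered <-chan struct{}) error {
	select {
	case <-recovered:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Before: the probe ran under a timeout. During a long outage each attempt
	// timed out, was abandoned, and a fresh probe (a fresh proposal) followed,
	// even though the abandoned proposal stayed in the proposals map.
	stuck := make(chan struct{}) // never closed: simulates a prolonged outage
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	if err := sendProbe(ctx, stuck); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("probe timed out; a new probe would be issued, piling up proposals")
	}

	// After: the probe runs under a cancelable context with no deadline, so at
	// most one probe per replica stays pending until it returns.
	recovers := make(chan struct{})
	go func() {
		time.Sleep(50 * time.Millisecond)
		close(recovers) // pretend the outage ends
	}()
	ctx2, cancel2 := context.WithCancel(context.Background())
	defer cancel2()
	if err := sendProbe(ctx2, recovers); err == nil {
		fmt.Println("single long-lived probe returned once the range recovered")
	}
}
```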

This pile-up led to memory build-up on nodes with affected replicas, but
another effect of having many probe proposals was that they would all put new
entries into the raft log every couple of seconds. A constant arrival rate of
requests (given by the probes), plus the regular duplication of everything that
is in flight, implies quadratic growth in the number of entries in the log.
Since we currently don't have an effective limit on memory usage in the
replication layer, these large raft logs could have a profoundly destabilizing
effect on clusters once they started to recover, and in extreme cases would
lead into a metastable regime.

This commit removes the timeout from the probe, meaning that unless the node
is restarted, it will send roughly one probe per replica and block until this
probe returns. Under the hood, the replication layer will still periodically
re-add this probe to the log, but this only results in linear growth.
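
As a back-of-envelope check of the quadratic-versus-linear claim, here is a small simulation under a deliberately simplified model (one new probe per reproposal interval in the old behavior, one long-lived probe in the new one; this is not the real replication code):

```go
// Simplified model of raft log entry counts during an outage. Each "tick" is a
// reproposal interval. This only illustrates the growth rates described above.
package main

import "fmt"

func main() {
	const ticks = 10

	// Old behavior (modeled): every tick the probe times out, a new probe
	// proposal arrives, and every still-pending proposal is reproposed,
	// appending one more entry each to the log.
	pendingOld, logOld := 0, 0
	// New behavior (modeled): a single long-lived probe is reproposed each tick.
	logNew := 0

	for t := 1; t <= ticks; t++ {
		pendingOld++         // a fresh probe arrives (old behavior)
		logOld += pendingOld // all pending proposals are appended again
		logNew++             // the single probe is appended again
		fmt.Printf("tick %2d: old=%3d entries (quadratic), new=%2d entries (linear)\n",
			t, logOld, logNew)
	}
	// After k ticks the old model has k*(k+1)/2 entries; the new one has k.
}
```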

Follow-up work can then reduce this linear growth further.

Touches #103908. I'm leaving the issue open, since the goal should be to not
even have linear growth in this case; I will re-title it.


To verify these changes, I ran a three-node local roachprod cluster with and
without the changes¹. After initial up-replication, I stopped nodes 2 and 3 and
had a coffee. Upon returning, I looked at the ranges endpoint for the remaining
node and found a range that had the circuit breaker tripped, noting its
rangeID. I then stopped the node, and dumped the raft log for that rangeID.
Unsurprisingly, this confirmed that we were only seeing reproposals of a single
probe in the log with this PR, and multiple separate probes being reproposed
multiple times without this PR. In other words, we were seeing linear and not
quadratic growth with this PR.

Epic: CRDB-25287
Release note (bug fix): under prolonged unavailability (such as loss of
quorum), affected ranges would exhibit raft log growth that was quadratic as a
function of the duration of the outage. Now this growth is approximately linear
instead.

Footnotes

  1. and disabled CheckQuorum, which independently fixes this bug; however, this PR is intended to be backported.

@tbg requested a review from a team on July 3, 2023 11:32
@blathers-crl (bot) commented on Jul 3, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user who doesn't know or care about this backport has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl (bot) commented on Jul 3, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.


@tbg requested a review from erikgrinaker on July 3, 2023 11:32
@shralex (Contributor) commented on Jul 17, 2023

Is this ready to be merged? Also the 22.2 backport?

@tbg (Member, Author) commented on Jul 17, 2023

> Is this ready to be merged? Also the 22.2 backport?

Yes, my scheduled reminder to merge it just fired :-)

@tbg merged commit cbcef9f into cockroachdb:release-23.1 on Jul 17, 2023
5 of 6 checks passed
@tbg deleted the backport23.1-105896 branch on July 18, 2023 15:32