
release-23.1: kvserver: don't put a timeout on circuit breaker probes #106054

Merged 1 commit into cockroachdb:release-23.1 from backport23.1-105896 on Jul 17, 2023

Conversation

@tbg (Member) commented on Jul 3, 2023

Backport 1/1 commits from #105896.

Going to let this bake for a week or two, just in case.

Release justification: stability fix coming out of a post-mortem

/cc @cockroachdb/release


Before this commit, replica circuit breaker probes used a timeout. This was
intended as a safeguard: if there were some bug in the reproposal code, a probe
could conceivably end up stuck forever even though new commands would go
through.

In reality, though, well-intended is often the opposite of well done. Under
longer outages, these probe attempts piled up in the `r.mu.proposals` map,
where they wouldn't get removed even if the probe got canceled, because we
remove a proposal only when it applies (this is necessary for commands that
hold latches; probes don't hold latches and thus could be treated
differently).
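
To make the mechanism concrete, here is a minimal Go sketch of the difference, assuming hypothetical stand-ins (`sendProbe` and the channels below are illustration only, not the actual kvserver code): with a deadline on the probe's context, every timeout abandons one attempt and triggers a fresh one, while a plain cancelable context keeps a single probe pending until it returns.

```go
// Minimal sketch, not the actual kvserver code: sendProbe and the channels
// below are hypothetical stand-ins used only to illustrate the change from a
// deadline-bound probe context to a cancelable one without a deadline.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sendProbe blocks until the (simulated) replica recovers or ctx is done.
func sendProbe(ctx context.Context, recovered <-chan struct{}) error {
	select {
	case <-recovered:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Before: the probe ran under a timeout. During a long outage each attempt
	// timed out, was abandoned, and a fresh probe (a fresh proposal) followed,
	// even though the abandoned proposal stayed in the proposals map.
	stuck := make(chan struct{}) // never closed: simulates a prolonged outage
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	if err := sendProbe(ctx, stuck); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("probe timed out; a new probe would be issued, piling up proposals")
	}

	// After: the probe runs under a cancelable context with no deadline, so at
	// most one probe per replica stays pending until it returns.
	recovers := make(chan struct{})
	go func() {
		time.Sleep(50 * time.Millisecond)
		close(recovers) // pretend the outage ends
	}()
	ctx2, cancel2 := context.WithCancel(context.Background())
	defer cancel2()
	if err := sendProbe(ctx2, recovers); err == nil {
		fmt.Println("single long-lived probe returned once the range recovered")
	}
}
```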

This pile-up led to memory build-up on nodes with affected replicas, but
another effect of having many probe proposals was that they would all put new
entries into the raft log every couple of seconds. A constant arrival rate of
requests (given by the probes), plus the regular duplication of everything that
is in flight, implies quadratic growth in the number of entries in the log.
Since we currently don't have an effective limit on memory usage in the
replication layer, these large raft logs could have a profoundly destabilizing
effect on clusters once they started to recover, and in extreme cases would
lead into a metastable regime.

This commit removes the timeout from the probe, meaning that unless the node
is restarted, it will send roughly one probe per replica and block until this
probe returns. Under the hood, the replication layer will still periodically
re-add this probe to the log, but this only results in linear growth.
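
As a back-of-envelope check of the quadratic-versus-linear claim, here is a small simulation under a deliberately simplified model (one new probe per reproposal interval in the old behavior, one long-lived probe in the new one; this is not the real replication code):

```go
// Simplified model of raft log entry counts during an outage. Each "tick" is a
// reproposal interval. This only illustrates the growth rates described above.
package main

import "fmt"

func main() {
	const ticks = 10

	// Old behavior (modeled): every tick the probe times out, a new probe
	// proposal arrives, and every still-pending proposal is reproposed,
	// appending one more entry each to the log.
	pendingOld, logOld := 0, 0
	// New behavior (modeled): a single long-lived probe is reproposed each tick.
	logNew := 0

	for t := 1; t <= ticks; t++ {
		pendingOld++         // a fresh probe arrives (old behavior)
		logOld += pendingOld // all pending proposals are appended again
		logNew++             // the single probe is appended again
		fmt.Printf("tick %2d: old=%3d entries (quadratic), new=%2d entries (linear)\n",
			t, logOld, logNew)
	}
	// After k ticks the old model has k*(k+1)/2 entries; the new one has k.
}
```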

Follow-up work can then reduce this linear growth further.

Touches #103908. I'm leaving the issue open, since the goal should be to not
even have linear growth in this case; I will re-title it.


To verify these changes, I ran a three-node local roachprod cluster with and
without the changes¹. After initial up-replication, I stopped nodes 2 and 3 and
had a coffee. Upon returning, I looked at the ranges endpoint for the remaining
node and found a range that had the circuit breaker tripped, noting its
rangeID. I then stopped the node, and dumped the raft log for that rangeID.
Unsurprisingly, this confirmed that we were only seeing reproposals of a single
probe in the log with this PR, and multiple separate probes being reproposed
multiple times without this PR. In other words, we were seeing linear and not
quadratic growth with this PR.

Epic: CRDB-25287
Release note (bug fix): under prolonged unavailability (such as loss of
quorum), affected ranges would exhibit raft log growth that was quadratic as a
function of the duration of the outage. Now this growth is approximately linear
instead.

Footnotes

  1. and disabled CheckQuorum, which independently fixes this bug; however, this PR is intended to be backported.

@tbg requested a review from a team on July 3, 2023 11:32
@blathers-crl (bot) commented on Jul 3, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user who doesn't know or care about this backport has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl (bot) commented on Jul 3, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.


@tbg requested a review from erikgrinaker on July 3, 2023 11:32
@shralex (Contributor) commented on Jul 17, 2023

Is this ready to be merged? Also the 22.2 backport?

@tbg (Member, Author) commented on Jul 17, 2023

> Is this ready to be merged? Also the 22.2 backport?

Yes, my scheduled reminder to merge it just fired :-)

@tbg merged commit cbcef9f into cockroachdb:release-23.1 on Jul 17, 2023
5 of 6 checks passed
@tbg deleted the backport23.1-105896 branch on July 18, 2023 15:32