
storage: restarted node in need of snapshots can be wedged for long time #37906

Open
tbg opened this issue May 29, 2019 · 13 comments


5 participants
@tbg
Member

commented May 29, 2019

repro steps:

  • set up a three node tpcc-2k cluster (any warehouse count will do, but with more data the problem is more obvious)
  • run tpcc with --tolerate-errors and --wait=false
  • kill one node, but leave everything else running
  • wait a few minutes (maybe 10? more if it doesn't repro)

Restart the dead node and run a SELECT COUNT(*) on the tpcc tables through it

Chances are this will hang for minutes before returning. The reason is that none of the replicas will have been moved off that node while it was down (since there are only three nodes in the cluster), but now all of them will need Raft snapshots. There is no particular ordering in which these snapshots are requested. Some of our COUNT(*) requests (or even worse, liveness requests etc) may hit a replica on the local node in need of a snapshot. The lease request will hang until the snapshot is applied, which can take a long time.

We should do something here, like short-circuiting lease requests (returning a blank NotLeaseholderError) if we can, with reasonable accuracy, detect that the node is in need of a snapshot. This isn't trivial to detect on a follower (it would be easier on the leader, but alas). We could also give snapshots for system ranges higher priority, to resolve the somewhat orthogonal problem that such hung requests on the system ranges can prevent the node from becoming live in the first place, which causes hard-to-diagnose problems.
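The system-range prioritization could be as simple as a stable sort over the pending snapshot queue. Everything below (the rangeID cutoff, the type and function names) is illustrative, not CockroachDB's actual queue:

```go
package main

import (
	"fmt"
	"sort"
)

// Treating rangeIDs below userRangesStart as "system" ranges is an
// illustrative simplification, not how CockroachDB's keyspace actually works.
const userRangesStart = 50

type snapshotRequest struct {
	rangeID int
}

func isSystem(rangeID int) bool { return rangeID < userRangesStart }

// sortSnapshotQueue orders pending snapshot requests so system ranges
// (liveness, meta, etc.) are served first, letting the restarted node
// become live before its user-data replicas catch up.
func sortSnapshotQueue(q []snapshotRequest) {
	sort.SliceStable(q, func(i, j int) bool {
		return isSystem(q[i].rangeID) && !isSystem(q[j].rangeID)
	})
}

func main() {
	q := []snapshotRequest{{rangeID: 120}, {rangeID: 3}, {rangeID: 77}, {rangeID: 12}}
	sortSnapshotQueue(q)
	fmt.Println(q) // system ranges 3 and 12 move to the front
}
```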

@andy-kimball

Contributor

commented Jun 26, 2019

Can the leader detect if one or more of the Raft followers is behind, and if so, delay Raft log truncation until the dead node timeout expires (i.e. 5 minutes)?

@tbg

Member Author

commented Jun 26, 2019

Raft log truncation at the time of writing takes node liveness into account in the sense that it immediately becomes willing to "sacrifice" followers that are not live (i.e. after ~4-5s of downtime). As we discussed in the weekly, this amounts to a historical artifact (we had to be very careful not to let the Raft log grow large because it used to be included in snapshots) that isn't a concern any more; as is apparent from this issue, the current behavior is undesirable.

I suggest the following heuristics:

  • if the Raft log is excessively large (say > 2x larger than a snapshot would be, and at least a few MB - maybe much more; tuning this is TBD), sacrifice the down followers (i.e. truncate the log). Also do this if the store is low on disk. Otherwise,
  • if there is any down follower, don't truncate (regardless of how long the node has been down).

The second step may seem unexpectedly simple, but I think it checks out. If we sacrifice followers the moment they're marked as dead, they will still have most of their replicas (removing them requires sending snapshots elsewhere, so it'll go at a leisurely pace compared to how fast the log queue can truncate). A dead node's replicas are supposedly going to be transferred away completely over time OR the node will restart, at which point log truncation will become possible again (since no dead follower exists any more). And the first criterion protects us from amassing a prohibitively long Raft log. At the same time, it decouples the log queue from the store dead timeout, which is a nice property.
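A minimal sketch of the two-step heuristic above, as a pure decision function. The thresholds are placeholders (tuning is explicitly TBD in the text) and all names are illustrative:

```go
package main

import "fmt"

// Placeholder thresholds; "tuning this is TBD" per the proposal.
const (
	minLogSize     = int64(4 << 20) // at least a few MB before sacrificing
	snapMultiplier = int64(2)       // log must exceed 2x the estimated snapshot
)

// shouldSacrificeDownFollowers sketches the proposed heuristic: truncate the
// log past a down follower only when the Raft log is excessively large
// relative to a snapshot (or the store is low on disk); otherwise, hold
// truncation for as long as any follower is down, regardless of downtime.
func shouldSacrificeDownFollowers(raftLogSize, snapshotSize int64, storeLowOnDisk bool) bool {
	if storeLowOnDisk {
		return true
	}
	return raftLogSize > minLogSize && raftLogSize > snapMultiplier*snapshotSize
}

func main() {
	// 1 MB log vs. a 10 MB snapshot: keep the log, wait for the follower.
	fmt.Println(shouldSacrificeDownFollowers(1<<20, 10<<20, false))
	// 64 MB log vs. a 10 MB snapshot: truncate and sacrifice the follower.
	fmt.Println(shouldSacrificeDownFollowers(64<<20, 10<<20, false))
}
```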

@tbg

Member Author

commented Jun 26, 2019

The other problem we have here is that replicas in need of a snapshot translate the latency of queueing for and finally receiving the snapshot to any clients waiting on the replica. If we can detect, without a timeout, the situation in which a replica is waiting for a snapshot, we're in a good place to prevent this from becoming a problem, because we can refuse the requests with a NotLeaseholderError, which should prompt DistSender to try another replica (eventually discovering and caching the true leaseholder). Ideally we can achieve this without digging into the Raft internals too much, though it's unclear what the Raft internals even provide. On a follower, we don't even receive the latest commit index, and we're also not really sure whether we will receive a snapshot or not. But there will be some way we can reliably figure this out.

However, one spicy problem is that when a request first arrives at such a range, the result will be a lease request proposed to Raft. This lease request's outcome will only be known after the snapshot arrives (and the replica can catch up all the way to the result of the lease request). As a result, if the replica actually succeeds in acquiring the lease (it may well if no other lease has been active since the replica went down), no other replica can serve traffic until the stuck replica receives its snapshot, effectively stalling that range for that duration (which can be many hours depending on the amount of data that needs to be transferred). We could "probe" Raft (with a timeout) before issuing any lease requests and could build a small state machine around the probes that can translate into a circuit breaker to solve the first problem above. "Probing" could either mean submitting a noop through Raft, or sending a one-off RPC to a quorum of replicas and noting the highest committed index, which would be required to be surpassed by that of the local replica before declaring the probe a success.
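The quorum-RPC flavor of probing can be sketched as a pure check: collect committed indexes from a quorum and require the local replica to have caught up to the highest one. All names here are hypothetical, and a real implementation would also need retries, timeouts, and membership-change handling:

```go
package main

import "fmt"

// probeSucceeded sketches the "one-off RPC" probe: gather the committed
// index from a quorum of replicas and declare the probe a success only once
// the local replica's committed index has reached the highest one observed.
// Only then is it reasonably safe to propose a lease request here without
// risking wedging the range behind a pending snapshot.
func probeSucceeded(localCommit uint64, quorumCommits []uint64) bool {
	var highest uint64
	for _, c := range quorumCommits {
		if c > highest {
			highest = c
		}
	}
	return localCommit >= highest
}

func main() {
	// Local replica caught up to the quorum: safe to propose a lease request.
	fmt.Println(probeSucceeded(10, []uint64{7, 9, 10}))
	// Local replica behind: requesting a lease now could stall the range.
	fmt.Println(probeSucceeded(5, []uint64{7, 9, 10}))
}
```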

I think to defuse this issue to 80% we need to fix the log truncation and come up with a relatively reliable indicator for returning a NotLeaseholderError, falling back to a timeout on the lease request to prevent the worst case of requests getting stuck indefinitely.

The situation in which the down node gets a lease is hopefully rare because it needing a snapshot sort of implies that there was another leader in the meantime that wrote more log entries.

Recording the outcome of what @danhhz and I just discussed: we're prioritizing getting something in place for 19.1.3, which is going to be cut in ~a week, so time is of the essence. We're conjecturing that we'll get the most bang for our buck by relaxing the raft log queue truncation heuristics to avoid sacrificing dead followers until the size of the raft log gets bigger than a few MB (already true today). At worst this will lead to few-MB Raft logs, which aren't a concern (for example, during IMPORT/RESTORE we regularly see dozens of MBs in the Raft log, though concentrated in only a handful of large entries). This will achieve the absence of snapshots for any but the "hot" ranges (those that write more than X MB during the downtime), where previously even moderately lukewarm ranges would require snapshots.

We'll get this landed on master ASAP and let it bake, and additionally get some more manual testing in using the steps in the first issue. Additionally we'll return a NotLeaseholderError after a timeout from getting the lease as a failsafe.

@bdarnell

Member

commented Jun 26, 2019

Could the quota pool cause this to escalate into even more severe failures? If the node is up and responding to heartbeats but can't accept new log entries until it gets a snapshot, it will prevent the release of proposal quota and then block writes even on the live nodes.

@tbg

Member Author

commented Jun 26, 2019

@bdarnell the quota pool handles this case, though we'll need to check that it actually works as advertised, I don't think it ever mattered before.

// Only consider followers who are in advance of the quota base
// index. This prevents a follower from coming back online and
// preventing throughput to the range until it has caught up.
if progress.Match < r.mu.proposalQuotaBaseIndex {
	return
}

@danhhz

Contributor

commented Jun 27, 2019

I can reproduce this readily with a ~10m roachtest that's basically Tobi's original instructions. In addition to absurdly long response times on SQL queries, I sometimes can't even get a SQL connection to the just restarted node (getting user info?). (I'm also curious if this effect is the underlying cause of what I was seeing in #36879.)

I've looked at this today and I think we need two mitigations. First, when a node is down for a bit (~minutes), we make it less likely for it to need a snapshot when it is restarted. Second, we lessen the impact if it does need snapshots. I've focused first on things that are hopefully safe enough to backport into 19.1.3, to unblock our affected customer. 19.1.3 is cut a week from Monday, so I'm hoping to get fixes in soon so they can bake for a week and shake out any unforeseen consequences.

To avoid snapshots when a node comes back, we'll make the log truncation less aggressive. I think something like what's described in #37906 (comment) is a good long-term solution, but to me it feels too risky to backport without more baking time than we have. A good compromise is to simply raise the threshold for what the raft log queue considers "recently active". (The raft log queue doesn't truncate when a follower is missing but has been recently active.) "Recently active" currently has a threshold of 10s and shares the MaxQuotaReplicaLivenessDuration constant with the quota pool. This sharing seems to have been introduced in #32439, but not for any particular principled reason. A natural threshold to raise it to is the server.time_until_store_dead cluster setting, which is the time after which we give up on the node and start transferring away its replicas.

Experimentally this seems to work decently well. There are still a small number of snapshots, but it's small enough that the node recovers quickly when it comes back up. We should investigate why this small number of snapshots remains, but in the meantime the second mitigation (lessening the impact of snapshots on a node that's just come up) will cover the rest. Note that this only helps if the node is down for less than server.time_until_store_dead, which defaults to 5m. If I leave the node down for 7m, there is a rush of raft log truncations at 5m, and when it comes back up, it needs tons of snapshots. However, our customer can temporarily raise the server.time_until_store_dead setting while doing the rolling migration and lower it when the migration is done.

For the second mitigation, I think we need some sort of timeout on redirectOnOrAcquireLease, which will return a NotLeaseholderError when it fires. redirectOnOrAcquireLease currently blocks on getting the snapshot, which can take hours if every replica is getting one and this replica is at the end of the line. Relying on this timeout will have an effect on tail latencies, but so far it's the least risky thing I've thought of. Some more sophisticated ideas have been tossed around to do this without the tail latency, but it's nice to have a low-risk, obvious-how-it-works CYA here. I have yet to run experiments for this. The two big open questions are what threshold to use for the timeout and what happens if the timeout fires on the leaseholder.

danhhz added a commit to danhhz/cockroach that referenced this issue Jun 27, 2019

storage: truncate raft log less aggressively when replica is missing
Previously, we'd hold off on truncating the raft log if a replica was
missing but contactable in the last 10 seconds. This meant that if a
node was down for *more* than 10 seconds, there was a good chance that
we'd truncate logs for some of its replicas (especially for its
high-traffic ones) and it would need snapshots for them when it came
back up.

This was for two reasons. First, we've historically assumed that it's
cheaper to catch a far-behind node up with a snapshot than with entries.
Second, when we replicated the raft log truncation, we'd have to include
it in snapshots anyway, but that's changed recently.

The problem is when a node is down for a short time but more than 10
seconds (think a couple minutes). It might come back up to find that it
needs a snapshot for most ranges. We rate limit snapshots fairly
aggressively because they've been disruptive in the past, but this means
that it could potentially take hours for a node to recover from a 2
minute outage.

This would be merely unfortunate if there wasn't a second compounding
issue. A recently restarted node has a cold leaseholder cache. When it
gets traffic for one of its replicas, it first tries itself as the
leaseholder (maybe it will get lucky and won't need the network hop).
Unfortunately, if the replica needs a snapshot, this decision currently
blocks on it. This means that requests sent to the recently started node
could block for as long as the heavily-throttled snapshots take, hours
or even days.

Short outages of more than 10 seconds are reasonably common with routine
maintenance (rolling to a new version, swapping hardware, etc), so it's
likely that customers will hit this (and one did).

This commit ties the threshold that the raft log queue uses for recent
activity to `server.time_until_store_dead`, which is already the
threshold where we give up on a down node and start transferring away
its replicas. It defaults to 5 minutes, which is still short, but it's
more reasonable than 10 seconds. Crucially, it also is a cluster
setting, so it can be overridden.

We'd like to move even further in the direction of leniency about raft
log truncation while a replica is missing, but this is a compromise
that's intended to be less risky to backport to 19.1.x.

Partial mitigation for cockroachdb#37906

Potentially also helps with cockroachdb#36879

Release note (bug fix): Nodes that have been down for less than
`server.time_until_store_dead` now recover more quickly when they
rejoin.

@tbg

Member Author

commented Jun 27, 2019

I posted some discussion of the log truncation approach on the review in #38484. The TL;DR is that the most intuitive safe code change seems not to care about the 5 minute distinction, which is nice. We leave nodes alone when they're up, and when they're not up we also leave them alone unless the log is large, in which case we sacrifice them. (Today the "leave them alone unless" isn't there; we go straight to sacrificing by pretending the followers aren't even there.)

For the timeout in redirectOnOrAcquireLease, I think we'll want to start with something conservative (at least a few seconds; 5?). This is owed to the fact that it'll go into a patch release, and while a shorter timeout would solve the problem "better", it may also "do things" in unexpected places.

For implementing this, do you really have to do anything more than change this code

case <-slowTimer.C:
	slowTimer.Read = true
	log.Warningf(ctx, "have been waiting %s attempting to acquire lease",
		base.SlowRequestThreshold)
	r.store.metrics.SlowLeaseRequests.Inc(1)

to return early with the NotLeaseholderError, plus lowering the duration until that branch fires?

@bdarnell

Member

commented Jun 27, 2019

A long-term solution to the redirectOnOrAcquireLease problem would be to replace raft-level proposal forwarding with an RPC that could return the index of the new proposal. Then we could know to stop waiting if our index is way behind the index of our proposed lease. (or we could forward the proposal with a maxIndex so it could be dropped if the leader is ahead). Replacing the proposal flow is also contemplated in #21849.

danhhz added a commit to danhhz/cockroach that referenced this issue Jun 27, 2019

storage: truncate raft log less aggressively when replica is missing
Previously, we'd hold off on truncating the raft log if a replica was
missing but contactable in the last 10 seconds. This meant that if a
node was down for *more* than 10 seconds, there was a good chance that
we'd truncate logs for some of its replicas (especially for its
high-traffic ones) and it would need snapshots for them when it came
back up.

This was for two reasons. First, we've historically assumed that it's
cheaper to catch a far-behind node up with a snapshot than with entries.
Second, snapshots historically had to include the Raft log which implied
a need to keep the size of the Raft log tightly controlled due to being
pulled into memory at the snapshot receiver, but that's changed
recently.

The problem is when a node is down for longer than 10 seconds but
shorter than the time it takes to upreplicate all of its ranges onto new
nodes. It might come back up to find that it needs a snapshot for most
ranges. We rate limit snapshots fairly aggressively because they've been
disruptive in the past, but this means that it could potentially take
hours for a node to recover from a 2 minute outage.

This would be merely unfortunate if there wasn't a second compounding
issue. A recently restarted node has a cold leaseholder cache. When it
gets traffic for one of its replicas, it first tries itself as the
leaseholder (maybe it will get lucky and won't need the network hop).
Unfortunately, if the replica needs a snapshot, this decision currently
blocks on it. This means that requests sent to the recently started node
could block for as long as the heavily-throttled snapshots take, hours
or even days.

Short outages of more than 10 seconds are reasonably common with routine
maintenance (rolling to a new version, swapping hardware, etc), so it's
likely that customers will hit this (and one did).

This commit avoids truncating the log past any follower's position when
all replicas have recently been active (the quota pool keeps it from
growing without bound in this case). If at least one replica hasn't
recently been active, it holds off any truncation until the log reaches
a size threshold.

Partial mitigation for cockroachdb#37906

Potentially also helps with cockroachdb#36879

Release note (bug fix): Nodes that have been down for less than
`server.time_until_store_dead` now recover more quickly when they
rejoin.
@tbg

Member Author

commented Jun 27, 2019

I proposed something similar in #37906 (comment), but I like the idea of making this about Raft proposal forwarding. If the follower includes its commit index in the proposal, the leader will be in a good position to make a reasonable call via heuristics, though this necessarily involves some magic values, which would be good to avoid. Instead, the leader could refuse proposals unless the follower is in StateReplicate. There are lots of details to figure out with this, but it seems promising.
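The StateReplicate gate can be sketched with states modeled after etcd/raft's progress states. This is a hypothetical illustration of the heuristic, not actual proposal-forwarding code:

```go
package main

import "fmt"

// followerState mirrors etcd/raft's progress states
// (StateProbe, StateReplicate, StateSnapshot).
type followerState int

const (
	stateProbe followerState = iota
	stateReplicate
	stateSnapshot
)

// acceptForwardedProposal sketches the suggested heuristic: the leader
// refuses a proposal forwarded by a follower unless it tracks that follower
// in StateReplicate, i.e. the follower is known to be receiving log entries
// and will learn the proposal's outcome without waiting on a snapshot.
func acceptForwardedProposal(s followerState) bool {
	return s == stateReplicate
}

func main() {
	fmt.Println(acceptForwardedProposal(stateReplicate)) // accepted
	fmt.Println(acceptForwardedProposal(stateSnapshot))  // refused: needs a snapshot
}
```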

danhhz added a commit to danhhz/cockroach that referenced this issue Jun 28, 2019

storage: truncate raft log less aggressively when replica is missing

tbg added a commit to danhhz/cockroach that referenced this issue Jul 2, 2019

storage: truncate raft log less aggressively when replica is missing
Release note (bug fix): Nodes that have been down now recover more
quickly when they rejoin, assuming they weren't down for much more
than the value of the `server.time_until_store_dead` cluster setting
(which defaults to 5 minutes).

craig bot pushed a commit that referenced this issue Jul 2, 2019

Merge #38484
38484: storage: truncate raft log less aggressively when replica is missing r=tbg a=danhhz

Release note (bug fix): Nodes that have been down now recover more
quickly when they rejoin, assuming they weren't down for much more
than the value of the `server.time_until_store_dead` cluster setting
(which defaults to 5 minutes).

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>

tbg added a commit to tbg/cockroach that referenced this issue Jul 3, 2019

storage: truncate raft log less aggressively when replica is missing
@tbg

Member Author

commented Jul 3, 2019

A first mitigation has landed on master and is very likely to be included in the next 19.1.x patch release. While this first mitigation has solved the problem in our testing and should greatly reduce it in most real-world scenarios too, we are working on additional safeguards related to this issue:

  • prevent lease requests from hanging
  • prevent leadership from being attained by a node that is far behind.
@bdarnell

Member

commented Jul 15, 2019

Should we backport #38484 to 2.1 as well?

@danhhz

Contributor

commented Jul 15, 2019

Does 2.1 have the work that removes the raft log from snapshots? If not, we might want to do 2.1-specific stability testing of that PR before backporting.

@tbg

Member Author

commented Jul 16, 2019

No, that's 19.1. With 2.1 we run a higher risk of seeing that snapshot refusal loop again (I think snapshots above 8 MB are refused, and truncation would kick in at 4 MB). I believe the refusal code was often triggered by SSTs, which clock in at ~22 MB each from what I remember, so changing the truncation heuristics likely wouldn't make a big difference for the worse (but it should make one for the better, from what we can tell). But we'd need to do our due diligence there. I'm thinking we'd want the cherry-pick to pass the new roachtest 5x, and a reasonably sized import (1+ hour) that passes reliably today (are there any of them?) should continue to do so after.
