kv: capability to quarantine a range #62199
If we build a quarantining / mark-unavailable tool, we'll also want a way to get a range out of that state. That is probably obvious, but I wanted to note it here so that we don't forget.
With serverless coming, the priority of this kind of thing goes up. We need to avoid host-cluster-wide outages, or at least keep them very short. A panic in the raft log would cause one (granted, not host-cluster-wide, but it would affect multiple customers with no clear mitigation path).
@joshimhoff I think what you really need is #33007. I am skeptical of the need for this quarantine tool in the short term (i.e. the next release) once we have the circuit breaker. The original way we got into this issue at the customer is no longer possible, and to even get into that state they had to muck around with esoteric cluster settings, which we are unlikely to do in the cloud unless something bigger is happening.
TL;DR: I think we ultimately want an offline tool that zeroes out a raft entry for a given rangeID on a store, and to rely on improving unsafe-remove-unsafe-replicas for everything else. But step one is making replicas unavailable instead of crashing on apply-loop failures, and working on the circuit breakers (#33007). A longer, circuitous train of thought follows:

If a raft command reliably panics the node on apply, we have a problem. We can mitigate that to some degree by appropriate use of

In #70860 I suggest an offline tool that replaces a given raft log entry with an empty entry. That seemed like an appropriate first step (but keep reading), and it provides a way out if the panic is completely deterministic: then we know that no replica will successfully apply whatever the command was trying to do to the state machine. We can thus use the tool to replace the log entry with a benign one (either completely offline or in rolling-restart fashion).

Where it gets tricky is if the panic is only "approximately" deterministic. Say two out of three nodes have their replica of r1 stalled, but the third node managed to push through for some reason (maybe it got lucky). Then we don't want to replace the entry in the first two nodes' logs, or the replica consistency checker will get us later. What we'd want to do in this case is truncate the log (on the two nodes) such that the offending entry is gone. The raft leader (the node that managed to apply the entry) can then replicate snapshots to them and things will be back to normal. So we need two offline tools:
Neither is terribly difficult to write. We'd use them as follows:
We cannot get away with using the truncate tool for the "all replicas stalled/crashed" case, unfortunately, at least not easily. The basic problem is that we need the state machine to have its AppliedState at or above the problematic index, but this won't be true (due to the crash). We could say that we move it up by one as part of the recovery, but this is fraught as well: the applied index may lag the problematic command (since application of a command isn't synced), and so we'd also have to teach the tool how to apply raft commands, which is unrealistic. However, there's also the opportunity to replace the "offline raft log truncation" tool with (an improved)
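To make the shape of those two tools concrete, here is a minimal, hedged sketch operating directly on a store's Pebble directory. The helper `raftLogKey` is a hypothetical placeholder and does not reproduce CockroachDB's real raft-log key/value encoding; the rangeID, index, and term values in `main` are made up for illustration.

```go
// Hedged sketch only: the shape of two offline raft-log repair tools.
// raftLogKey is a hypothetical placeholder and does NOT reproduce
// CockroachDB's real key/value encoding; opening an actual store would
// also require the store's Pebble options (comparer, etc.).
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
	"go.etcd.io/raft/v3/raftpb"
)

// raftLogKey is a hypothetical stand-in for the raft log key encoding
// (rangeID + log index).
func raftLogKey(rangeID int64, index uint64) []byte {
	return []byte(fmt.Sprintf("raftlog/%d/%d", rangeID, index))
}

// replaceWithEmptyEntry overwrites the entry at (rangeID, index) with an
// empty EntryNormal that keeps the same term and index, so applying it is
// a no-op on every replica. For the fully deterministic panic case.
func replaceWithEmptyEntry(db *pebble.DB, rangeID int64, index, term uint64) error {
	empty := raftpb.Entry{Term: term, Index: index, Type: raftpb.EntryNormal}
	val, err := empty.Marshal()
	if err != nil {
		return err
	}
	return db.Set(raftLogKey(rangeID, index), val, pebble.Sync)
}

// truncateFrom deletes all log entries in [index, lastIndex] so that the
// raft leader (which did apply the entry) can catch the replica up via a
// snapshot after restart. For the "approximately deterministic" case.
func truncateFrom(db *pebble.DB, rangeID int64, index, lastIndex uint64) error {
	for i := index; i <= lastIndex; i++ {
		if err := db.Delete(raftLogKey(rangeID, i), pebble.Sync); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	db, err := pebble.Open("/path/to/store", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Made-up rangeID/index/term, purely for illustration.
	// truncateFrom(db, 1, 12345, 12400) would be the other mode.
	if err := replaceWithEmptyEntry(db, 1, 12345, 7); err != nil {
		log.Fatal(err)
	}
}
```

The replace variant keeps the entry's term and index so raft's bookkeeping stays intact; the truncate variant removes the suffix of the log so the leader can backfill it (or send a snapshot) after restart.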
This all makes sense to me. In the longer term, as CC serverless grows in scale & reliability requirements, I think the simplicity of saying "this range is now unavailable" via
Yes, agreed.
re:
Yes, we'd immediately return errors when operations hit the range, as in circuit breaking (IIUC circuit breaking). We'd also pause async machinery such as the raft apply loop, hence the ability to mitigate the kind of apply-loop panic discussed above.
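A minimal sketch of what that fail-fast behavior could look like, assuming a hypothetical per-replica `cordoned` flag; the type and method names below are illustrative, not CockroachDB's actual API.

```go
// Hedged sketch: a per-replica "cordoned" flag gating both the request
// path and the apply loop. All names here are hypothetical.
package quarantine

import (
	"errors"
	"sync/atomic"
)

var errRangeCordoned = errors.New("range is cordoned; operations rejected")

type replica struct {
	cordoned atomic.Bool // set manually (or on an apply panic) to quarantine the replica
}

// send is the hypothetical entry point for KV requests to this replica.
// While cordoned, requests fail fast instead of queuing or crashing.
func (r *replica) send(req any) (any, error) {
	if r.cordoned.Load() {
		return nil, errRangeCordoned
	}
	// ... normal evaluation / proposal path would go here ...
	return nil, nil
}

// handleRaftReady is the hypothetical apply-loop tick. While cordoned, it
// skips applying committed entries so a poisoned entry cannot crash the node.
func (r *replica) handleRaftReady() {
	if r.cordoned.Load() {
		return // apply loop paused for this replica
	}
	// ... apply committed raft entries here ...
}
```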
Here is a related POC: #78092. In the POC, the capability is called "cordoning", and it applies to a replica rather than to a whole range. In the POC, cordoning also happens automatically in the face of a below-raft apply panic. There is no CLI tool to trigger it manually, but that could be changed.
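In the spirit of that POC (but not copied from it), a hedged sketch of automatic cordoning on an apply panic might look like the following; all names are hypothetical.

```go
// Hedged sketch: catch a panic while applying committed entries and cordon
// the replica instead of crashing the node. Names are hypothetical.
package main

import (
	"log"
	"sync/atomic"
)

type cordonableReplica struct {
	cordoned atomic.Bool
}

// applyCommitted applies committed entries via applyOne, which may panic
// on a poisoned entry. On panic, the replica is cordoned and application
// stops; the node itself stays up.
func (r *cordonableReplica) applyCommitted(entries [][]byte, applyOne func([]byte)) {
	defer func() {
		if p := recover(); p != nil {
			r.cordoned.Store(true)
			log.Printf("cordoning replica after apply panic: %v", p)
		}
	}()
	for _, e := range entries {
		if r.cordoned.Load() {
			return
		}
		applyOne(e)
	}
}

func main() {
	r := &cordonableReplica{}
	r.applyCommitted([][]byte{{0x01}}, func(e []byte) {
		panic("poisoned entry") // simulate a deterministic apply panic
	})
	log.Printf("node still alive; cordoned=%v", r.cordoned.Load())
}
```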
This came up in the postmortem for internal issue https://cockroachdb.zendesk.com/agent/tickets/21124, where only a handful of ranges were affected but the loss-of-quorum (LoQ) recovery tool had to be run on thousands of ranges since some nodes were crash-looping.
This came up in the context of a postmortem where reading an extremely large value in the raft log for a range was causing a panic and node crashes. In some cases such crashing can spread, if different nodes get picked to host the range (and the crashing does not happen in the act of generating the range snapshot itself). Even if it does not spread, repeatedly crashing all the nodes that host replicas of a range can severely degrade the cluster due to the unavailability of other ranges hosted by those nodes, or, for small clusters, can make most of the nodes unavailable.
The concept of range quarantining exists in other distributed range-partitioned systems. In the CockroachDB context this would extend down to the raft log, preventing any reads/writes of the log. Although other systems support automated quarantining, we would start with manual quarantining, where a human could take into account the other consequences of range data unavailability. Any queries reading/writing that range would immediately return with an error, and the corresponding transactions would fail.
Quarantining is not straightforward in the CockroachDB context: unlike a database that disaggregates storage for a range into a distributed file system, such that data availability can be maintained independently of the database, turning down nodes that host replicas of the range decreases the number of copies and can, in the extreme, result in loss of all data for that range. So we would want to keep track of the current replication level for a range, and of which replicas are up-to-date with respect to having all the committed entries in the raft log, and warn admins against turning down all such up-to-date nodes. Also implied in the above is that rebalancing of quarantined ranges is not possible; this means disk space pressure at a node will need to be alleviated by moving one of the non-quarantined ranges.
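As a rough illustration of that safety check, here is a hedged sketch that refuses to consider a node drain safe if it would remove the last up-to-date replica of any quarantined range; the types and the source of the replication data are hypothetical.

```go
// Hedged sketch: a pre-drain safety check for quarantined ranges. The
// struct and the way replication data is obtained are hypothetical.
package main

import "fmt"

type quarantinedRange struct {
	rangeID        int64
	committedIndex uint64            // highest committed raft log index
	replicaMatch   map[string]uint64 // nodeID -> highest durable log index on that replica
}

// safeToDrain reports whether nodeID can be turned down without losing the
// last up-to-date copy of any quarantined range it hosts, and returns the
// range IDs that would be at risk.
func safeToDrain(nodeID string, ranges []quarantinedRange) (bool, []int64) {
	var atRisk []int64
	for _, r := range ranges {
		if _, hosts := r.replicaMatch[nodeID]; !hosts {
			continue
		}
		// Count up-to-date replicas that would survive the drain.
		survivors := 0
		for n, idx := range r.replicaMatch {
			if n != nodeID && idx >= r.committedIndex {
				survivors++
			}
		}
		if survivors == 0 {
			atRisk = append(atRisk, r.rangeID)
		}
	}
	return len(atRisk) == 0, atRisk
}

func main() {
	ranges := []quarantinedRange{{
		rangeID:        42,
		committedIndex: 100,
		replicaMatch:   map[string]uint64{"n1": 100, "n2": 90, "n3": 100},
	}}
	ok, atRisk := safeToDrain("n3", ranges)
	fmt.Println(ok, atRisk) // true [] — n1 still holds all committed entries
}
```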
cc @andreimatei @lunevalex
Jira issue: CRDB-2825