kvserver: pause replication activity in a cluster #81953

Open
lunevalex opened this issue May 26, 2022 · 2 comments
Labels
A-kv-decom-rolling-restart Decommission and Rolling Restarts A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) P-3 Issues/test failures with no fix SLA T-kv KV Team

@lunevalex
Collaborator

lunevalex commented May 26, 2022

In #81935 we discuss the prioritization of replication activity at the store level. In the same vein, we should consider manual knobs to disable classes of replication/snapshot activity in a cluster. These should be controlled at the cluster level via one or more settings. For example, an operator might start decommissioning a node, which may cause latency impact or instability in the cluster. We have seen that happen before, for a variety of reasons and for numerous customers. In this case it would be extremely helpful to have a single universal knob to pause all decommissioning across all nodes.

We should consider having control over the following buckets of replication activity:

  • Decommissioning
  • Upreplication
  • Rebalancing (i.e. range count, constraint based rebalancing)
  • Load-based rebalancing
  • Load-based lease transfers
  • Lease Transfers (i.e. count based lease transfers)
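
As a rough illustration of what per-class knobs could look like, here is a minimal sketch. The names `ActivityClass`, `PauseSet`, `MaybeRun`, and `ErrPaused` are hypothetical and are not existing CockroachDB APIs. A real version would presumably be backed by cluster settings rather than an in-memory bitmask.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// ActivityClass identifies one bucket of replication activity from the list
// above. These names are hypothetical and exist only for this sketch.
type ActivityClass uint32

const (
	Decommissioning ActivityClass = 1 << iota
	Upreplication
	Rebalancing
	LoadBasedRebalancing
	LoadBasedLeaseTransfers
	LeaseTransfers

	// AllClasses is the "single universal knob": every bucket at once.
	AllClasses = Decommissioning | Upreplication | Rebalancing |
		LoadBasedRebalancing | LoadBasedLeaseTransfers | LeaseTransfers
)

// ErrPaused is returned when work is skipped because its class is paused.
var ErrPaused = errors.New("replication activity class is paused")

// PauseSet is a stand-in for the cluster-level setting(s). It stores the
// paused classes as a bitmask that is safe for concurrent use.
type PauseSet struct {
	paused atomic.Uint32
}

// Pause marks the given classes as paused.
func (p *PauseSet) Pause(classes ActivityClass) {
	for {
		old := p.paused.Load()
		if p.paused.CompareAndSwap(old, old|uint32(classes)) {
			return
		}
	}
}

// Resume clears the paused bit for the given classes.
func (p *PauseSet) Resume(classes ActivityClass) {
	for {
		old := p.paused.Load()
		if p.paused.CompareAndSwap(old, old&^uint32(classes)) {
			return
		}
	}
}

// IsPaused reports whether the given class is currently paused.
func (p *PauseSet) IsPaused(class ActivityClass) bool {
	return p.paused.Load()&uint32(class) != 0
}

// MaybeRun invokes work unless its class is paused, in which case it
// returns ErrPaused without running anything.
func (p *PauseSet) MaybeRun(class ActivityClass, work func()) error {
	if p.IsPaused(class) {
		return ErrPaused
	}
	work()
	return nil
}
```

In this sketch, each queue or rebalancer would wrap its work in `MaybeRun` with its own class, so that an operator calling `Pause(Decommissioning)` stops only decommission-driven replica movement while up-replication and lease transfers continue.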

Jira issue: CRDB-16313

@lunevalex lunevalex added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. O-postmortem Originated from a Postmortem action item. labels May 26, 2022
@lunevalex lunevalex added this to Incoming in KV via automation May 26, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label May 26, 2022
@mwang1026 mwang1026 moved this from Incoming to Prioritized in KV Jun 1, 2022
@irfansharif irfansharif added the A-kv-decom-rolling-restart Decommission and Rolling Restarts label Jun 2, 2022
@irfansharif
Contributor

+cc @AlexTalks. For decommission, I wonder if we should make strides towards cancellation ("recommission") being as non-disruptive as possible WRT foreground tail impact + throughput. Arguably decommission should be too, but it's equally unfortunate that canceling inflight decommission attempts (due to observed app impact) will get you into the same regime of impact because we'd still be shuffling snapshots around (except this time, back to the node we were trying to previously decommission).

@lunevalex
Collaborator Author

There is already a setting to disable/enable the store rebalancer:

var LoadBasedRebalancingMode = settings.RegisterEnumSetting(

@exalate-issue-sync exalate-issue-sync bot removed the O-postmortem Originated from a Postmortem action item. label Mar 7, 2023
@exalate-issue-sync exalate-issue-sync bot added the P-3 Issues/test failures with no fix SLA label Apr 16, 2024