RFE: Start nodes without immediately accepting KV or SQL requests #70122

Open
bobvawter opened this issue Sep 13, 2021 · 2 comments
Labels
A-configurability Pertains to cluster settings, CLI flags, env vars etc C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Originated from a customer P-3 Issues/test failures with no fix SLA T-server-and-security DB Server & Security

Comments

@bobvawter
Member

bobvawter commented Sep 13, 2021

This RFE is motivated by a desire to reduce the risk of adding new nodes to production environments, especially those with non-trivial network configurations. Without presupposing an implementation, it would be useful to be able to require newly-added nodes to be explicitly activated by the operator after they have joined the RPC/gossip mesh, but before they begin accepting KV or SQL requests.

Many of our enterprise customers do not have the luxury of working in flat network topologies, where arbitrary in- or cross-region traffic is guaranteed to "just work". Consider this actual customer scenario:

  • Kubernetes pod IPs are not directly reachable; each pod must instead have its own dedicated Service, necessitating the use of the --advertise-addr flag.
  • Every network flow between a pair of IPs and/or Regions must be accounted for by firewall rules, acted upon by some other team within the company.
  • The teams that manage the CockroachDB cluster, k8s configurations, and network firewalls are disjoint and high-latency.

These sorts of O(n) or O(n^2) configuration issues would ideally be taken care of in an automated, repeatable fashion, but that is not a reality in all situations. We have had customers suffer cluster dysfunction due to asymmetric network reachability that could not be tested for without actually launching a new Cockroach node. Past discussions about a network-quality simulator have uniformly converged on "use CockroachDB itself".
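
To make the asymmetric-reachability failure mode concrete, here is a minimal Go sketch. It assumes a hypothetical reachability matrix in which `reach[i][j]` means "node i can dial node j"; how that matrix gets populated (latency data, probes, and so on) is left open. `AsymmetricPairs` and `DirectedFlows` are illustrative names only, not CockroachDB APIs.

```go
package main

import "fmt"

// AsymmetricPairs returns every pair (i, j) with i < j where exactly one
// direction of the connection works. reach[i][j] is a hypothetical
// observation meaning "node i can dial node j". The matrix is assumed
// to be square.
func AsymmetricPairs(reach [][]bool) [][2]int {
	var out [][2]int
	for i := range reach {
		for j := i + 1; j < len(reach); j++ {
			if reach[i][j] != reach[j][i] {
				out = append(out, [2]int{i, j})
			}
		}
	}
	return out
}

// DirectedFlows is the number of directed node-to-node flows that a full
// mesh of n nodes needs firewall rules for, which is where the O(n^2)
// growth comes from.
func DirectedFlows(n int) int {
	return n * (n - 1)
}

func main() {
	reach := [][]bool{
		{true, true, true},
		{true, true, false}, // node 1 cannot dial node 2 ...
		{true, true, true},  // ... but node 2 can dial node 1
	}
	fmt.Println("asymmetric pairs:", AsymmetricPairs(reach))
	fmt.Println("directed flows for 4 nodes:", DirectedFlows(4))
}
```

A check like this only helps once the new node is actually participating in the mesh, which is the motivation for the activation step proposed below.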

As a straw-man proposal, here is a possible set of ergonomics around an implementation:

  • A new cluster setting cluster.require_node_activation
  • When a new node is started with cockroach start, it will connect to existing nodes and obtain a node ID, but behave as though it were drained and not a valid target for rebalancing.
  • Operators (human or otherwise) would be able to verify node functionality (e.g.: examine the network latency data to verify that full-mesh communication is possible with the newly-added node).
  • An explicit cockroach node activate # command is executed at a time of the operator's choosing.
  • Once a node has been marked as activated, it can never be deactivated, just drained and/or decommissioned.
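
The lifecycle in the straw-man above can be sketched as a small one-way state machine. This is only an illustration of the proposed semantics, not an implementation plan: `NodeState`, `Join`, `Activate`, and `AcceptsRequests` are hypothetical names, and the `requireActivation` parameter stands in for the proposed cluster.require_node_activation setting.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeState is a hypothetical lifecycle state for a newly joined node.
type NodeState int

const (
	// PendingActivation: the node has joined the mesh and has a node ID,
	// but serves no KV or SQL requests and is not a rebalancing target.
	PendingActivation NodeState = iota
	// Active: the operator has explicitly activated the node.
	Active
)

// ErrAlreadyActive is returned when activating an already-active node.
var ErrAlreadyActive = errors.New("node is already active")

// Node models only the activation state from the proposal.
type Node struct {
	ID    int
	State NodeState
}

// Join models a node starting up. requireActivation stands in for the
// proposed cluster.require_node_activation setting (an assumption).
func Join(id int, requireActivation bool) *Node {
	n := &Node{ID: id, State: Active}
	if requireActivation {
		n.State = PendingActivation
	}
	return n
}

// Activate performs the one-way transition. There is deliberately no
// Deactivate: an active node can only be drained or decommissioned.
func (n *Node) Activate() error {
	if n.State == Active {
		return ErrAlreadyActive
	}
	n.State = Active
	return nil
}

// AcceptsRequests reports whether the node should serve KV or SQL
// traffic and be considered for rebalancing.
func (n *Node) AcceptsRequests() bool {
	return n.State == Active
}

func main() {
	n := Join(4, true)
	fmt.Println("before activation:", n.AcceptsRequests())
	if err := n.Activate(); err != nil {
		fmt.Println("activate failed:", err)
	}
	fmt.Println("after activation:", n.AcceptsRequests())
}
```

Whether a repeated activation should be an error or a no-op is an open design choice; the sketch returns an error only to make the one-way nature of the transition visible.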

Jira issue: CRDB-9952

@bobvawter bobvawter added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Originated from a customer A-configurability Pertains to cluster settings, CLI flags, env vars etc labels Sep 13, 2021
@knz knz added this to To do in DB Server & Security via automation Sep 13, 2021
@knz knz removed their assignment Sep 13, 2021
@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label Sep 13, 2021
@knz knz moved this from To do to Queued for roadmapping in DB Server & Security Sep 20, 2021
@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@knz
Contributor

knz commented Aug 24, 2023

still relevant

@lunevalex lunevalex added the P-3 Issues/test failures with no fix SLA label Jan 23, 2024
Projects
DB Server & Security
  
Queued for roadmapping
Development

No branches or pull requests

3 participants