server: auto-discover peer nodes instead of --join #32374

centerorbit · 2018-11-15T04:26:57Z

Is your feature request related to a problem? Please describe.
I've been playing around with cockroach, particularly in Docker. It seems odd to me that one needs to instruct new containers to 'join' to existing containers.

Describe the solution you'd like
I would like to create a new instance within a private subnet, and it has the ability to auto-discover its own friend nodes, and join by itself.

Describe alternatives you've considered
Even the Kubernetes config (found here) defines a default scale of 3 and tells every node to join all three, itself included:
--join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb

It's odd that db-0 is told to join with itself along with 1 and 2. I'm assuming there is already handling to gracefully ignore a node joining to itself, but that still seems like a bit of a 'hack' to get a Kubernetes cluster up and running. It'd be nice if you didn't have to specify the node names at all!

It'd be much easier to say something like:
--join-auto and call it a day.

Additional context
I figure in most cases, DBs will be clustered within their own private subnet, and could use a designated port and broadcast IP to make requests to join. When you want to scale to multiple AZs, either a VPN can be established, or a bridge of some sort to enable communication across two subnets.
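As a rough illustration of the broadcast idea, the sketch below only derives the IPv4 broadcast address that a join request could be sent to. The function name and the port number are made up for this example; nothing here is existing CockroachDB behavior.

```go
package main

import (
	"fmt"
	"net"
)

// broadcastAddr returns the IPv4 broadcast address of a subnet given in
// CIDR notation. It is a hypothetical helper, not part of CockroachDB.
func broadcastAddr(cidr string) (net.IP, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, err
	}
	ip4 := ipnet.IP.To4()
	if ip4 == nil || len(ipnet.Mask) != net.IPv4len {
		return nil, fmt.Errorf("%s is not an IPv4 subnet", cidr)
	}
	bcast := make(net.IP, net.IPv4len)
	for i := range ip4 {
		// Set every host bit (the bits the mask clears) to 1.
		bcast[i] = ip4[i] | ^ipnet.Mask[i]
	}
	return bcast, nil
}

func main() {
	addr, err := broadcastAddr("192.168.1.0/24")
	if err != nil {
		panic(err)
	}
	// 26258 is an invented "discovery" port, used for illustration only.
	fmt.Println(net.JoinHostPort(addr.String(), "26258"))
}
```

A real implementation would still have to send a datagram to that address and collect replies, and broadcast traffic is often filtered in cloud and container networks.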

I'm not sure how instances running in Kubernetes would react to broadcast pings, but they may be able to use the Kube API to discover others to join, there would just need to be some sort of environment detection, or another flag to tell it which auto-discover method to use.

Perhaps something like:
--join-auto=broadcast
--join-auto=kube
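To make the proposal concrete, here is a minimal sketch of how such a flag could be validated with Go's standard `flag` package. The `--join-auto` flag and its two values come from this proposal only; CockroachDB has no such flag, and the type and function names are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// discoveryMode is a hypothetical setting taken from the proposal above.
type discoveryMode string

const (
	modeBroadcast discoveryMode = "broadcast"
	modeKube      discoveryMode = "kube"
)

// String and Set make *discoveryMode satisfy flag.Value.
func (m *discoveryMode) String() string { return string(*m) }

func (m *discoveryMode) Set(s string) error {
	switch discoveryMode(s) {
	case modeBroadcast, modeKube:
		*m = discoveryMode(s)
		return nil
	}
	return fmt.Errorf("unknown discovery mode %q (want broadcast or kube)", s)
}

// parseJoinAuto extracts the discovery mode from the given arguments.
func parseJoinAuto(args []string) (discoveryMode, error) {
	var mode discoveryMode
	fs := flag.NewFlagSet("start", flag.ContinueOnError)
	fs.Var(&mode, "join-auto", "peer discovery method: broadcast or kube")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return mode, nil
}

func main() {
	mode, err := parseJoinAuto(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("discovery mode:", mode)
}
```

Rejecting unknown values at parse time would keep a typo from silently falling back to some default discovery method.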

Jira issue: CRDB-4753


knz · 2018-11-15T13:46:44Z

@centerorbit Thank you for your suggestion. Indeed the use case sounds appealing. However, until/unless CockroachDB serves as its own certification authority, it won't be possible to auto-generate and synchronize secure certificates for nodes that are automatically discovered. I think your proposal will become relevant once there is a CA inside CockroachDB.

bladefist · 2018-11-15T14:11:11Z

This would be cool. The --join command could take a subnet like 192.168.1.0/24
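For illustration, a subnet argument could be expanded into individual candidate addresses roughly as follows. This is only a sketch: `subnetJoinTargets` is a hypothetical helper, the /24 limit is an arbitrary safety choice, and 26257 is simply CockroachDB's default port.

```go
package main

import (
	"fmt"
	"net/netip"
)

// subnetJoinTargets lists every address in the prefix as "ip:port".
// It refuses prefixes shorter than /24 so that a typo cannot generate
// millions of probe targets. Hypothetical helper, not CockroachDB code.
func subnetJoinTargets(cidr string, port uint16) ([]string, error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	prefix = prefix.Masked() // normalize to the network address
	if !prefix.Addr().Is4() || prefix.Bits() < 24 {
		return nil, fmt.Errorf("%s: need an IPv4 prefix of length 24 or more", cidr)
	}
	var targets []string
	for addr := prefix.Addr(); prefix.Contains(addr); addr = addr.Next() {
		targets = append(targets, netip.AddrPortFrom(addr, port).String())
	}
	return targets, nil
}

func main() {
	targets, err := subnetJoinTargets("192.168.1.0/30", 26257)
	if err != nil {
		panic(err)
	}
	for _, t := range targets {
		fmt.Println(t)
	}
}
```

The list includes the network and broadcast addresses, so a real implementation would probably skip those and would also need to tolerate most candidates being unreachable.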

@knz Couldn't we pre-setup the cert to include our entire subnet to plan for growth?

knz · 2018-11-15T14:13:17Z

@mberhault would it make sense to have a cert valid for an entire subnet?

(I know that wildcard certs can be used for web sites, unsure about cockroachdb)

mberhault · 2018-11-15T14:23:59Z

I vaguely recall testing them and I think they work fine with the Go TLS client. We should double check, they recently tightened certificate validation in 1.10.x (though I don't think it impacts this).

Our plan for easier k8s autoscaling was exactly this, instead of per-node CSRs, there would be a single "node" secret storing the wildcard certificate. Adding new nodes would thus require no manual intervention.

IP addresses would not be included in the certificate as there is no way to specify a subnet. Instead, all communication would be DNS based which can use wildcards for host matching (usually, the wildcard only applies at the first level, it does not recurse).
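The first-level-only behavior of wildcard matching can be checked with Go's standard library. The sketch below builds a throwaway self-signed certificate with a single wildcard DNS name and tests a few hostnames against it; the DNS names are illustrative, and this is not how CockroachDB itself issues node certificates.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newSelfSignedCert creates a throwaway self-signed certificate whose only
// DNS SAN is the given pattern. For experimentation only.
func newSelfSignedCert(dnsPattern string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "node"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     []string{dnsPattern},
	}
	// Using the template as its own parent makes the certificate self-signed.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	cert, err := newSelfSignedCert("*.cockroachdb.default.svc.cluster.local")
	if err != nil {
		panic(err)
	}
	hosts := []string{
		"cockroachdb-0.cockroachdb.default.svc.cluster.local", // one label under the wildcard
		"a.b.cockroachdb.default.svc.cluster.local",           // two labels: wildcard does not recurse
		"cockroachdb.default.svc.cluster.local",               // the parent name itself
		"10.0.0.5",                                            // IP address, no IP SAN present
	}
	for _, h := range hosts {
		fmt.Printf("%-52s match=%v\n", h, cert.VerifyHostname(h) == nil)
	}
}
```

Only the first hostname matches, which is consistent with the point above: a new node is covered as long as it is addressed by a DNS name one level below the wildcard, and never by raw IP.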

knz · 2018-11-15T14:28:59Z

@mberhault are you suggesting that the request by OP (see issue title + desc) was already in the works?

mberhault · 2018-11-15T14:55:14Z

It's been talked about, there's no issue for it.

centerorbit · 2018-11-15T23:54:46Z

@knz @mberhault very good point, I've been testing Cockroach locally in --insecure mode, which I'm sure bypasses any certs. Would it make any sense to try and prototype how joining might work in --insecure mode, and then layer in certificates? Or would there need to be extensive rework to support certs even if insecure mode worked?

I don't yet understand the flow involved for a node to acquire the proper certs. I will do work to try and understand @mberhault 's DNS concept, and how that interacts with certs and Cockroach join code.

Thanks for the quick response and feedback!

salzig · 2018-11-28T21:31:48Z

Using all A-records returned by a dns resolve would also be helpful, as swarm allows you to resolve tasks.$servicename to fetch the IP-addresses for all containers currently running as $servicename

centerorbit · 2018-11-29T03:51:17Z

Yeah, I was actually researching how systems like Kubernetes and Docker discover services within a cluster. @salzig is right. Looking up all of the IP addresses that DNS returns for a particular service name, and then using those, would probably be the most straightforward. I'm not sure how the TLS certs are created, but could there be a potential to use those service names for the cert?

See:
https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/#namespaces-and-dns
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#services
https://docs.docker.com/docker-cloud/apps/service-links/#dns-hostnames-vs-service-links

centerorbit · 2018-11-30T04:48:53Z

I'm struggling a bit to get CockroachDB building on my systems. My Chromebook doesn't have enough RAM or disk space to perform builds reasonably (it works, but just barely) and my Windows machine is not very compatible with the builder.sh script (even with Windows Subsystem for Linux)... so I'm still trying to come up with decent ways to work on these platforms.

In the meantime, I think that I could use a similar DNS lookup method to what's documented here: https://jameshfisher.com/2017/08/03/golang-dns-lookup.html

And apply it in this general area of code:

cockroach/pkg/server/config.go

Lines 506 to 516 in 6fb1b00

    
    	// Get the gossip bootstrap resolvers.
    	resolvers, err := cfg.parseGossipBootstrapResolvers()
    	if err != nil {
    		return err
    	}
    	if len(resolvers) > 0 {
    		cfg.GossipBootstrapResolvers = resolvers
    	}
    	return nil
    }

Naturally, if I get it working, I'll need to circle-back and make a branch/PR add flags, figure out the TLS (so it can run in secure mode), tests, etc. But this is where I'm currently at with it.

This will allow the --join CLI option to "find" many nodes to connect to, instead of needing to specify specific individuals. This will cater well to auto-scaling, Kubernetes and Docker DNS behaviors. See: cockroachdb#32374 Release note: None

centerorbit · 2019-02-05T05:05:48Z

Started working on implementation here: master...centerorbit:feature-auto-join

It compiles! Now to come up with a few test scenarios and test Docker, K8s, normal, etc and see if it does what I expect.

Currently (I hope) it just uses the params from the --join flag, and will attempt to look up all hosts from any resolvable names passed in via that. So instead of the form described above (and in the docs):
--join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb

You could instead just say something like:
--join cockroachdb

And the DNS resolver (Kubernetes or Docker) should return the IPs for cockroachdb-0, cockroachdb-1, and cockroachdb-2 (assuming they exist).

(Again, this is certs discussion aside. I'm proof-of-concepting this with the --insecure flag, and then I'll need help figuring out what hoops need to be jumped through for certs.)

sodabrew · 2020-02-20T16:10:49Z

I was just looking into this, and ran down from the Cluster Name RFC to this ticket. I like that your implementation is simple and based on DNS entries; this should integrate easily with Kubernetes, Consul, or "bare cloud" on EC2 with an auto-scaling-group behind a load balancer.

Figuring out the CA situation will be important. Since there needs to be a script or wrapper (be it Kubernetes or hand-written) to fetch a certificate, that same script/wrapper can fetch the IP addresses of the other cluster members. This continues to feel like an area for improvement for CockroachDB.

As an expansion of varieties for discovery, consider the https://github.com/hashicorp/go-discover library from Hashicorp?

knz · 2020-09-03T20:08:55Z

We just got another user request, also pointing to go-discover.

gklijs · 2020-09-09T10:07:05Z

Having an auto-discover would greatly simplify our setup as well. I guess for all the 'cloud native' setups it would be a big improvement.

traverseda · 2021-03-28T14:48:54Z

Struggling to find a way to create cluster-aware apps that doesn't require a bunch of setup, this small change (even running in insecure mode since my overlay network is already secured) would make things a whole lot easier.

Being able to just use a compose file like this would be amazing.

version: "3.2"
services:
  db:
    image: cockroachdb/cockroach:latest
    command: start --insecure --join tasks.{{.Service.Name}}
    volumes:
      - db:/cockroach/cockroach-data
    deploy:
      mode: global

volumes:
  db:

zandeez · 2022-01-05T09:57:10Z

This issue seems fairly inactive but I'd like to add my +1 with some details about my use case and potential workarounds I am considering.

I am using Consul for service discovery, which gives me a number of tools I could use to get around some of the issues above. Given that Consul is also written in Go, native support for Consul and Consul Connect doesn't seem outside the realms of possibility.

Firstly, discovery. With appropriate service definitions, it's possible to publish all instances in Consul and query them to find all existing nodes of the cluster, then apply that to the join parameter on start.

Leader-election / cluster init. There are well-documented processes for using Consul KV Locking to perform leader election, and therefore a mechanism to select a node to run cluster-init on.

Locality. You can query the host's Consul region and datacenter and pass them dynamically to CockroachDB.

mTLS. A few options I'm considering. Using the CA built into Consul or Vault to generate node certificates, or using Consul Connect with CockroachDB (optionally) in insecure mode. The latter allows more granular access controls via intentions. Even then, though, there are new AutoTLS options that may have appeared since this thread was last updated.

It's also possible to write appropriate wrapper scripts to do all this externally.

knz · 2022-01-05T11:52:09Z

Thanks for the reminder. This issue had been mistriaged and fell between the cracks.

@mwang1026 we'll want to place this back on the radar. It's still relevant today and would also simplify (and lower the cost of) our CC infrastructure.

regbo · 2022-01-13T15:49:18Z

I've been looking into this as well. Right now we have Traefik in front of our services and it would be fantastic to have CockroachDB behind the Traefik TLS load balancer, even if it's using insecure mode (we can use an encrypted network on swarm).

Happy to test if anyone is familiar enough with Consul/KV stores to get this going.

Lord-Y · 2023-06-09T07:07:19Z

Any plans to support https://github.com/hashicorp/go-discover or equivalent?

knz · 2023-06-09T10:18:27Z

go-discover is problematic because it would require provisioning API keys to the CockroachDB process, which then in turn would need to be protected somehow. The complexity of provisioning the API keys in a secure way is not trivial, and I'm not sure it would result in a setup that's objectively simpler to operate than the --join flag.

(Something that folk may forget is that it's possible to point the --join flag to a load balancer / DNS name that maps to multiple node addresses, so it's not necessary to update --join every time a node is added/removed.)

Also another thing to consider is that CockroachDB is designed to operate across regions and even across cloud providers and so we need the --join flag anyway because there's no clear way to do service discovery across datacenters / clouds.

knz added O-community Originated from the community C-wishlist A wishlist feature. A-kv-gossip labels Nov 15, 2018

knz assigned kannanlakshmi Nov 15, 2018

knz unassigned kannanlakshmi Jan 5, 2022

knz added this to To do in DB Server & Security via automation Jan 5, 2022

blathers-crl bot added the T-server-and-security DB Server & Security label Jan 5, 2022

knz added the A-server-networking Pertains to network addressing,routing,initialization label Jan 5, 2022

knz changed the title ~~Scaling: Auto-discover instead of --join~~ server: auto-discover peer nodes instead of --join Jan 5, 2022

knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed C-wishlist A wishlist feature. labels Jan 5, 2022
