server: auto-discover peer nodes instead of --join #32374

Open
centerorbit opened this issue Nov 15, 2018 · 20 comments
Labels
A-kv-gossip
A-server-networking: Pertains to network addressing, routing, initialization
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-community: Originated from the community
T-server-and-security: DB Server & Security

Comments

@centerorbit

centerorbit commented Nov 15, 2018

Is your feature request related to a problem? Please describe.
I've been playing around with cockroach, particularly in Docker. It seems odd to me that one needs to instruct new containers to 'join' to existing containers.

Describe the solution you'd like
I would like to create a new instance within a private subnet and have it auto-discover its peer nodes and join them by itself.

Describe alternatives you've considered
Even the Kubernetes config (found here) defines a default scale of 3 and tells every node to join all three, itself included:
--join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb

It's odd that db-0 is told to join itself along with 1 and 2. I'm assuming there is already handling to gracefully ignore a node joining itself, but that still seems like a bit of a 'hack' to get a Kubernetes cluster up and running. It'd be nice if you didn't have to specify the node names at all!

It'd be much easier to say something like:
--join-auto and call it a day.

Additional context
I figure in most cases, DBs will be clustered within their own private subnet, and could use a designated port and broadcast IP to make requests to join. When you want to scale to multiple AZs, either a VPN can be established, or a bridge of some sort to enable communication across two subnets.

I'm not sure how instances running in Kubernetes would react to broadcast pings, but they may be able to use the Kube API to discover others to join, there would just need to be some sort of environment detection, or another flag to tell it which auto-discover method to use.

Perhaps something like:
--join-auto=broadcast
--join-auto=kube

Jira issue: CRDB-4753

@knz knz added O-community Originated from the community C-wishlist A wishlist feature. A-kv-gossip labels Nov 15, 2018
@knz
Contributor

knz commented Nov 15, 2018

@centerorbit Thank you for your suggestion. Indeed the use case sounds appealing. However, until/unless CockroachDB serves as its own certification authority, it won't be possible to auto-generate and synchronize secure certificates for nodes that are automatically discovered. I think your proposal will become relevant once there is a CA inside CockroachDB.

@bladefist

This would be cool. The --join command could take a subnet like 192.168.1.0/24

@knz Couldn't we pre-setup the cert to include our entire subnet to plan for growth?

@knz
Contributor

knz commented Nov 15, 2018

@mberhault would it make sense to have a cert valid for an entire subnet?

(I know that wildcard certs can be used for web sites, unsure about cockroachdb)

@mberhault
Contributor

I vaguely recall testing them and I think they work fine with the Go TLS client. We should double check, they recently tightened certificate validation in 1.10.x (though I don't think it impacts this).

Our plan for easier k8s autoscaling was exactly this, instead of per-node CSRs, there would be a single "node" secret storing the wildcard certificate. Adding new nodes would thus require no manual intervention.

IP addresses would not be included in the certificate as there is no way to specify a subnet. Instead, all communication would be DNS based which can use wildcards for host matching (usually, the wildcard only applies at the first level, it does not recurse).
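
The first-level-only wildcard behavior described above can be checked with Go's standard library. The sketch below is not CockroachDB code: it creates a throwaway self-signed certificate (the `newWildcardCert` helper and the `*.cockroachdb` pattern are assumptions for illustration) and asks `crypto/x509` which host names it accepts.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newWildcardCert creates a throwaway self-signed certificate whose only
// subject alternative name is the given DNS pattern (illustrative helper).
func newWildcardCert(pattern string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "node"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     []string{pattern},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	cert, err := newWildcardCert("*.cockroachdb")
	if err != nil {
		panic(err)
	}
	hosts := []string{
		"cockroachdb-0.cockroachdb", // one label under the wildcard
		"a.b.cockroachdb",           // two labels: the wildcard does not recurse
		"10.0.0.5",                  // IP address: no IP SANs in this certificate
	}
	for _, h := range hosts {
		fmt.Printf("%-28s accepted: %v\n", h, cert.VerifyHostname(h) == nil)
	}
}
```

Only the first host is accepted, which is why this approach depends on nodes addressing each other by DNS name rather than by IP.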

@knz
Contributor

knz commented Nov 15, 2018

@mberhault are you suggesting that the request by OP (see issue title + desc) was already in the works?

@mberhault
Contributor

It's been talked about, there's no issue for it.

@centerorbit
Author

@knz @mberhault very good point, I've been testing Cockroach locally in --insecure mode, which I'm sure bypasses any certs. Would it make any sense to try and prototype how joining might work in --insecure mode, and then layer in certificates? Or would there need to be extensive rework to support certs even if insecure mode worked?

I don't yet understand the flow involved for a node to acquire the proper certs. I will work on understanding @mberhault's DNS concept and how it interacts with certs and the Cockroach join code.

Thanks for the quick response and feedback!

@salzig

salzig commented Nov 28, 2018

Using all A records returned by a DNS lookup would also be helpful, as Swarm lets you resolve tasks.$servicename to fetch the IP addresses of all containers currently running as $servicename.

@centerorbit
Author

Yeah, I was actually researching how systems like Kubernetes and Docker discover services within a cluster. @salzig is right. Looking up all of the IP addresses that DNS returns for a particular service name, and then using those, would probably be the most straightforward approach. I'm not sure how the TLS certs are created, but could those service names potentially be used for the cert?

See:
https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/#namespaces-and-dns
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#services
https://docs.docker.com/docker-cloud/apps/service-links/#dns-hostnames-vs-service-links

@centerorbit
Author

centerorbit commented Nov 30, 2018

I'm struggling a bit to get CockroachDB building on my systems. My Chromebook doesn't have enough RAM or disk space to perform builds reasonably (it works, but just barely) and my Windows machine is not very compatible with the builder.sh script (even with Windows Subsystem for Linux)... so I'm still trying to come up with decent ways to work on these platforms.

In the meantime, I think that I could use a similar DNS lookup method to what's documented here: https://jameshfisher.com/2017/08/03/golang-dns-lookup.html

And apply it in this general area of code:

	// Get the gossip bootstrap resolvers.
	resolvers, err := cfg.parseGossipBootstrapResolvers()
	if err != nil {
		return err
	}
	if len(resolvers) > 0 {
		cfg.GossipBootstrapResolvers = resolvers
	}
	return nil
}

Naturally, if I get it working, I'll need to circle back and make a branch/PR, add flags, figure out the TLS (so it can run in secure mode), write tests, etc. But this is where I'm currently at with it.

centerorbit added a commit to centerorbit/cockroach that referenced this issue Feb 5, 2019
This will allow the --join CLI option to "find" many nodes
to connect to, instead of needing to specify specific individuals.

This will cater well to auto-scaling, Kubernetes and Docker DNS
behaviors.

See: cockroachdb#32374

Release note: None
@centerorbit
Author

centerorbit commented Feb 5, 2019

Started working on implementation here: master...centerorbit:feature-auto-join

It compiles! Now to come up with a few test scenarios and test Docker, K8s, normal, etc and see if it does what I expect.

Currently (I hope) it just uses the params from the --join flag, and will attempt to look up all hosts for any resolvable names passed in that way. Therefore, instead of the form described above (and in the docs):
--join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb

You could instead just say something like:
--join cockroachdb

And the DNS resolver (Kubernetes or Docker) should return the IPs for cockroachdb-0, cockroachdb-1, and cockroachdb-2 (assuming they exist).

(Again, this is certs discussion aside. I'm proof-of-concepting this with the --insecure flag, and then I'll need help figuring out what hoops need to be jumped through for certs.)

@sodabrew

I was just looking into this, and ran down from the Cluster Name RFC to this ticket. I like that your implementation is simple and based on DNS entries; this should integrate easily with Kubernetes, Consul, or "bare cloud" on EC2 with an auto-scaling-group behind a load balancer.

Figuring out the CA situation will be important. Since there needs to be a script or wrapper (be it Kubernetes or hand-written) to fetch a certificate, that same script/wrapper can fetch the IP addresses of the other cluster members. This continues to feel like an area for improvement for CockroachDB.

As an expansion of varieties for discovery, consider the https://github.com/hashicorp/go-discover library from Hashicorp?

@knz
Contributor

knz commented Sep 3, 2020

We just got another user request, also pointing to go-discover.

@gklijs

gklijs commented Sep 9, 2020

Having an auto-discover would greatly simplify our setup as well. I guess for all the 'cloud native' setups it would be a big improvement.

@traverseda

I'm struggling to find a way to create cluster-aware apps that doesn't require a bunch of setup. This small change (even running in insecure mode, since my overlay network is already secured) would make things a whole lot easier.

Being able to just use a compose file like this would be amazing.

version: "3.2"
services:
  db:
    image: cockroachdb/cockroach:latest
    command: start --insecure --join tasks.{{.Service.Name}}
    volumes:
      - db:/cockroach/cockroach-data
    deploy:
      mode: global

volumes:
  db:

@zandeez

zandeez commented Jan 5, 2022

This issue seems fairly inactive but I'd like to add my +1 with some details about my use case and potential workarounds I am considering.

I am using Consul for service discovery, which gives me a number of tools I could use to get around some of the issues above. Given that Consul is also written in Go, native support for Consul and Consul Connect doesn't seem outside the realms of possibility.

Firstly, discovery. With appropriate service definitions, it's possible to publish all instances in Consul and query them to find all existing nodes of the cluster, then apply that to the join parameter on start.

Leader-election / cluster init. There are well-documented processes for using Consul KV Locking to perform leader election, and therefore a mechanism to select a node to run cluster-init on.

Locality. You can query the host's Consul region and datacenter and pass them dynamically to CockroachDB.

mTLS. A few options I'm considering. Using the CA built into Consul or Vault to generate node certificates, or using Consul Connect with CockroachDB (optionally) in insecure mode. The latter allows more granular access controls via intentions. Even then, though, there are new AutoTLS options that may have appeared since this thread was last updated.

It's also possible to write appropriate wrapper scripts to do all this externally.

@knz knz added this to To do in DB Server & Security via automation Jan 5, 2022
@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label Jan 5, 2022
@knz knz added the A-server-networking Pertains to network addressing,routing,initialization label Jan 5, 2022
@knz
Contributor

knz commented Jan 5, 2022

Thanks for the reminder. This issue had been mistriaged and fell between the cracks.

@mwang1026 we'll want to put this back on the radar. It's still relevant today and would also simplify (and lower the cost of) our CC infrastructure.

@knz knz changed the title from "Scaling: Auto-discover instead of --join" to "server: auto-discover peer nodes instead of --join" Jan 5, 2022
@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed C-wishlist A wishlist feature. labels Jan 5, 2022
@regbo

regbo commented Jan 13, 2022

I've been looking into this as well. Right now we have Traefik in front of our services, and it would be fantastic to have CockroachDB behind the Traefik TLS load balancer, even if it's running in insecure mode (we can use an encrypted network on Swarm).

Happy to test if anyone is familiar enough with Consul/KV stores to get this going.

@Lord-Y

Lord-Y commented Jun 9, 2023

Any plans to support https://github.com/hashicorp/go-discover or equivalent?

@knz
Contributor

knz commented Jun 9, 2023

go-discover is problematic because it would require provisioning API keys to the CockroachDB process, which then in turn would need to be protected somehow. The complexity of provisioning the API keys in a secure way is not trivial, and I'm not sure it would result in a setup that's objectively simpler to operate than the --join flag.

(Something that folk may forget is that it's possible to point the --join flag to a load balancer / DNS name that maps to multiple node addresses, so it's not necessary to update --join every time a node is added/removed.)

Also another thing to consider is that CockroachDB is designed to operate across regions and even across cloud providers and so we need the --join flag anyway because there's no clear way to do service discovery across datacenters / clouds.
