cluster-mgr HA considerations #53
Consider using Galera to replicate the SQL DB. I have used Galera in past projects and it works well.
What are your thoughts on having clustermgr use a K/V store such as etcd or boltdb to store persistent data? In the etcd scenario, we could leverage etcd's existing HA deployment model for clustermgr. If all contiv/cluster services run as containers, I think this affects our approach to HA. For example, we should think in terms of "How do we ensure x # of clustermgr/collins/etc. containers are always running in the contiv cluster?"
Yeah, we have discussed this internally before as well, and it is a perfectly reasonable question. The challenge is that cluster manager is a service that is supposed to bootstrap a KV store in a cluster, so if bringing up cluster manager requires us (or the user) to set up etcd first, it kind of defeats the purpose of simplifying cluster provisioning. It is a chicken-and-egg problem.
yes, this kind of goes back to assigning semantics to the host-group management in clusterm, as I just commented on here (#87 (comment)). This is not implemented yet, but basically, since clusterm is aware of how many master nodes are alive, it can take care of ensuring a certain number of master nodes are always online, which in turn makes sure that infrastructure services (like collins, the K8s API server, or swarm replicas) keep functioning even in the event of node failures. Note that this may sound like what k8s or swarm does for containers. But the difference is that cluster manager does it for these services instead, i.e. it ensures a certain number of K8s API server instances are always running. Does this make sense?
My thought is not necessarily to use a single etcd cluster for clusterm and the cluster scheduler. I agree that there should be clear separation between the two etcd clusters. However, I think there is value in minimizing the number of storage permutations in the overall solution. Creating an etcd backend for Collins is not a high priority, but it is a dev effort that would benefit us and the user.
From a k8s standpoint, ^ is covered by podmaster: https://github.com/kubernetes/kubernetes/blob/master/docs/admin/high-availability/podmaster.yaml
hmm, podmaster seems to depend on etcd. Who ensures etcd is up and running?
I think that's where clusterm comes in. Let the cluster schedulers be responsible for managing their services (if supported within the cluster scheduler) and have clusterm handle the infra svcs the schedulers rely on. Thoughts? |
yep, I think we are on the same page here :) quoting myself from above (I didn't cover kv-stores and plugins in there, but I was intending it for all infra services):
This bug tracks the items needed for cluster-mgr's HA. It's a big list and will be addressed by one or more future PRs, preferably each introducing small features.
Current behavior:
Desired behavior and considerations for HA (++: priority high, +: low priority)
Possible approaches (this space will be changing for a bit as I explore and document different approaches):