
Discrete etcd cluster #544

Closed
colhom wants to merge 1 commit into base: master from colhom:discrete-etcd-cluster

Conversation

10 participants
@colhom
Contributor

colhom commented Jun 13, 2016

Work in progress- no review necessary yet (unless you're bored ;)

@aaronlevy aaronlevy changed the title from Discrete etcd cluster to WIP: Discrete etcd cluster Jun 27, 2016

@wolfeidau

wolfeidau commented Jun 28, 2016

Would love to test this once you're happy with it, if you're looking for volunteers.

@colhom

Contributor

colhom commented Jul 5, 2016

@wolfeidau the TLS stuff still isn't done, so it's not usable yet. I've been slacking on this one but will circle back very soon- if you have some time, I'm accepting sub-PRs on the colhom:discrete-etcd-cluster fork branch ;)

@mumoshu

Contributor

mumoshu commented Jul 22, 2016

@colhom What a PR! Great job. Btw, would you mind sharing which part of the TLS stuff isn't done? (I've read through your code but couldn't figure it out.)

@@ -60,6 +60,8 @@ kmsKeyArn: "{{.KMSKeyARN}}"
# Price (Dollars) to bid for spot instances. Omit for on-demand instances.
# workerSpotPrice: "0.05"
# Instance type for etcd node

@mumoshu

mumoshu Jul 22, 2016

Contributor

nit: Missing descriptions and examples for newly added parameters

#etcdInstanceType: "m3.medium"

# Number of etcd nodes (should be 1, 3, 5, and so on to prevent split-brain)
# etcdCount: 1

# Root volume disk size (GiB) for etcd node
#etcdRootVolumeSize: 30

# Data volume disk size (GiB) for etcd node. This one is used by etcd to store its key-value data and write-ahead logs
#etcdDataVolumeSize: 30

@colhom colhom force-pushed the colhom:discrete-etcd-cluster branch from 857324c to c5c91b8 Aug 1, 2016

dnsSuffix = "ec2.internal"
} else {
dnsSuffix = fmt.Sprintf("%s.compute.internal", config.Region)
}

@mumoshu

mumoshu Aug 2, 2016

Contributor

Encountered this yesterday and had patched it exactly the same way you did 😄 Great.

content: |
[Unit]
Description=decrypt kubelet tls assets using amazon kms
Before=kubelet.service

@mumoshu

mumoshu Aug 2, 2016

Contributor

Probably unnecessary because we don't run kubelets on etcd nodes?

@colhom

colhom Aug 2, 2016

Contributor

We have the etcd key/cert and the CA cert.

@mumoshu

mumoshu Aug 2, 2016

Contributor

Ah, sorry. I meant about the Before=kubelet.service part.

@colhom

colhom Aug 2, 2016

Contributor

oh yes, I'm blind ;) that should be changed to etcd2.

This was referenced Aug 2, 2016

@colhom

Contributor

colhom commented Aug 2, 2016

@wolfeidau go ahead and give it a whirl! I recommend jumping ahead to #596 and grabbing the HA control plane stuff as well on top of this PR- should all be functional.

@colhom

Contributor

colhom commented Aug 4, 2016

Hi everyone- update on things.

I've been experimenting with full cluster upgrades and frankly, statically defined etcd clusters (i.e. this PR) do not work well with CloudFormation updates.

The only strategy that really works well "in general" with CloudFormation updates is doing a rolling replacement on an autoscaling group- and currently we don't have a good solution for running etcd robustly in an ASG.

We're currently investigating various possibilities for implementing a dynamic/managed etcd cluster- maybe in an ASG, or more optimally hosting etcd on kubernetes and greatly simplifying the CloudFormation stack as a result. The eventual goal is to entirely decouple Kubernetes cluster level upgrades (control plane, kubelet, etcd) from updating the underlying AWS virtual hardware. Etcd on Kubernetes would be a step in the right direction.

Input?

@colhom

Contributor

colhom commented Aug 4, 2016

The band-aid for now would be to manually ensure that the etcd instances are never updated by CloudFormation after they are created. This is crappy for a lot of reasons, but it's the only workable solution until we can have etcd nodes that aren't pets. At least CoreOS can take care of doing the OS upgrades 👍

On kube-aws up --upgrade, we could pull the existing resource definitions for the etcd instances and splice them into our outgoing template body in the UpdateStack API call, ensuring no doomed-to-fail update action is taken on those instances.

It's ugly, but it's a grokkable hack that would allow us to limp along and take care of our named etcd pets until we have etcd on Kubernetes ready for prime time.

\cc @mumoshu @pieterlange

@pieterlange

pieterlange commented Aug 4, 2016

@colhom I'm personally looking into implementing https://crewjam.com/etcd-aws/ as my backing cluster. I actually had a bad experience today, losing etcd state across an autoscaling group managed by Monsanto's solution.

This is after 3+ months of etcd-aws-cluster running perfectly, btw. There's a lot of work involved in getting software production-ready, and @crewjam's solution seems the most fleshed out, so after tryouts, chances are I'll be going with that. The integrated backup (recovery from complete cluster failure) is the selling point for me after today.

@colhom

Contributor

colhom commented Aug 4, 2016

@pieterlange sorry to hear about your data loss. Out of morbid curiosity- did your etcd instances fail health checks due to load, causing the ASG to replace them one-by-one until your data was gone?

@pieterlange

pieterlange commented Aug 5, 2016

Nothing like that, but I did run them (3) on t2.nanos (with 20 workers and 1 master). I spun up the etcd cluster separately and never really "managed" it... it's a small miracle I got away with it as long as I did :)

@crewjam

crewjam commented Aug 7, 2016

@pieterlange and everyone -- I'd be happy to work with you on this. We did a bit of research and planning on production use of etcd on AWS which resulted in the repo & blog describing it. One caveat though -- for reasons not at all related to the work, our focus shifted a bit and so we haven't actually put it into production. We got close, and we are still planning to, but it hasn't happened just yet.

@colhom colhom force-pushed the colhom:discrete-etcd-cluster branch from 278d69d to 56a77fe Aug 8, 2016

@colhom colhom referenced this pull request Aug 9, 2016

Closed

Cluster upgrades #608

@colhom colhom force-pushed the colhom:discrete-etcd-cluster branch from 56a77fe to ca749f8 Aug 30, 2016

@colhom colhom force-pushed the colhom:discrete-etcd-cluster branch from ca749f8 to bbe383e Sep 8, 2016

@colhom colhom changed the title from WIP: Discrete etcd cluster to Discrete etcd cluster Sep 8, 2016

@colhom

Contributor

colhom commented Sep 8, 2016

Resources map[string]map[string]interface{} `json:"Resources"`
}
func (c *Cluster) lockEtcdResources(cfSvc *cloudformation.CloudFormation, stackBody string) (string, error) {

@aaronlevy

aaronlevy Sep 8, 2016

Member

Can you add a bit more documentation explaining why this is necessary / how the functionality will be replaced (for those of us unfamiliar with cloudformation).

@colhom

colhom Sep 9, 2016

Contributor

Problem: We must ensure that a cloudformation update never affects the etcd cluster instances.

Reason: When updating ec2 instances, cloudformation may choose a destroy and replace strategy, which has the potential to corrupt the etcd cluster.

Solution: lockEtcdResources is a 50-line function which only runs on cluster update- it queries the current state of the etcd cluster and ensures the definition does not change across CloudFormation updates.

Concerns over fragility: This is a simple, mechanical operation- but even those can go wrong.

Assuming:

  • lockEtcdResources() errors out: the update will fail
  • lockEtcdResources() munges the etcd cluster definition: in that case, CloudFormation will attempt to update the etcd cluster instances and fail, because they are marked as "non-updatable" when first created. The CloudFormation update will then roll back, and the user will be left with a non-updatable cluster

In either case, the failure mode is "cluster non-updatable"- which is exactly where every user is today anyway.
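
For those unfamiliar with CloudFormation, here is a minimal sketch of that splice, assuming the aws-sdk-go CloudFormation client; the "InstanceEtcd" resource-name prefix and the error handling are illustrative, not the PR's actual code:

package cluster

import (
	"encoding/json"
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/cloudformation"
)

// lockEtcdResources splices the currently deployed etcd resource definitions
// over the freshly rendered template, so the stack update is a no-op for them.
func lockEtcdResources(cfSvc *cloudformation.CloudFormation, stackName, stackBody string) (string, error) {
	// Fetch the template that is currently deployed.
	out, err := cfSvc.GetTemplate(&cloudformation.GetTemplateInput{
		StackName: aws.String(stackName),
	})
	if err != nil {
		return "", fmt.Errorf("getting current stack template: %v", err)
	}

	var current, outgoing map[string]interface{}
	if err := json.Unmarshal([]byte(*out.TemplateBody), &current); err != nil {
		return "", err
	}
	if err := json.Unmarshal([]byte(stackBody), &outgoing); err != nil {
		return "", err
	}

	curRes, ok := current["Resources"].(map[string]interface{})
	if !ok {
		return "", fmt.Errorf("deployed template has no Resources map")
	}
	outRes, ok := outgoing["Resources"].(map[string]interface{})
	if !ok {
		return "", fmt.Errorf("outgoing template has no Resources map")
	}

	// Copy each existing etcd subtree verbatim into the outgoing template,
	// masking whatever the new template generated for those resources.
	for name, def := range curRes {
		if strings.HasPrefix(name, "InstanceEtcd") { // hypothetical naming convention
			outRes[name] = def
		}
	}

	locked, err := json.Marshal(outgoing)
	if err != nil {
		return "", err
	}
	return string(locked), nil
}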

@colhom

colhom Sep 9, 2016

Contributor

^^ Something to that effect?

ip := subnetCIDR.IP
//TODO:(chom) this is sloppy, but "soon-ish" etcd will be self-hosted so we'll leave this be
for i := 0; i < 3; i++ {
ip = incrementIP(ip)

@aaronlevy

aaronlevy Sep 9, 2016

Member

What is the purpose of this?

@colhom

colhom Sep 9, 2016

Contributor

Assuming a 10.0.0.0/24, we put the first etcd instance at 10.0.0.3. That is, start 3 addresses up from the lowest address in the CIDR.
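
A self-contained sketch of that offset scheme (this incrementIP is an assumed implementation, not necessarily the PR's):

package main

import (
	"fmt"
	"net"
)

// incrementIP returns the next address after ip, carrying across octets.
func incrementIP(ip net.IP) net.IP {
	next := make(net.IP, len(ip))
	copy(next, ip)
	for i := len(next) - 1; i >= 0; i-- {
		next[i]++
		if next[i] != 0 { // stop unless this octet wrapped around
			break
		}
	}
	return next
}

func main() {
	_, subnet, _ := net.ParseCIDR("10.0.0.0/24")
	ip := subnet.IP
	// Offset 3 from the network address, so the first etcd node is 10.0.0.3.
	for i := 0; i < 3; i++ {
		ip = incrementIP(ip)
	}
	fmt.Println(ip) // 10.0.0.3
}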

@aaronlevy

aaronlevy Sep 9, 2016

Member

Any particular reason?

@colhom

colhom Sep 9, 2016

Contributor

Not really- needed to pick a small offset, so 3 seemed like a good one.

@aaronlevy

aaronlevy Sep 9, 2016

Member

What is the reason for the small offset? What are you trying to accomplish here and why? I'm trying to understand why we need this code - right now it seems completely arbitrary.

config.EtcdInstances[etcdIndex] = instance
//TODO: ipv6 support
if len(instance.IPAddress) != 4 {

@aaronlevy

aaronlevy Sep 9, 2016

Member

if instance.IPAddress.To4() == nil?

@colhom

colhom Sep 9, 2016

Contributor

yep
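
For reference, a quick illustration of why the suggested To4() check is the safer one (values here are arbitrary):

package main

import (
	"fmt"
	"net"
)

func main() {
	v4 := net.ParseIP("10.0.0.3") // ParseIP stores IPv4 in 16-byte form
	v6 := net.ParseIP("2001:db8::1")

	// len(ip) != 4 would wrongly flag the IPv4 address here; To4() does not.
	fmt.Println(len(v4), v4.To4() != nil) // 16 true
	fmt.Println(len(v6), v6.To4() != nil) // 16 false
}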

Requires=docker.service
Before=kubelet.service flanneld.service
After=early-docker.service
Requires=early-docker.service

@dghubble

dghubble Sep 9, 2016

Contributor

why is this needed now?

@aaronlevy

aaronlevy Sep 9, 2016

Member

I'm assuming this is because this is being executed via docker run, but I'd prefer we used rkt

@colhom

colhom Sep 9, 2016

Contributor

Yeah, that would be preferable. This PR or follow-up?

@aaronlevy

aaronlevy Sep 9, 2016

Member

This PR.

@dghubble

Contributor

dghubble commented Sep 9, 2016

Is the separate etcd optional? Are you planning to remove/merge this feature with self-hosted Kubernetes? They seem at odds. Also, the addition of etcd TLS here seems orthogonal.

@pieterlange

pieterlange commented Sep 9, 2016

Self-hosted would be really nice, but speaking as a lowly sysadmin: there are enough moving parts in this system already. I need to be able to rely on/trust a stable etcd. I'm even proposing to leave this outside the management of kube-aws altogether in #629!

for encKey in $(find /etc/etcd2/ssl/*.pem.enc);do
tmpPath="/tmp/$(basename $encKey).tmp"
docker run --net host --rm -v /etc/etcd2/ssl:/etc/etcd2/ssl quay.io/coreos/awscli aws --region {{.Region}} kms decrypt --ciphertext-blob fileb://$encKey --output text --query Plaintext | base64 --decode > $tmpPath

@aaronlevy

aaronlevy Sep 9, 2016

Member

would prefer that we use rkt here

@@ -72,6 +72,28 @@ kmsKeyArn: "{{.KMSKeyARN}}"
# Price (Dollars) to bid for spot instances. Omit for on-demand instances.
# workerSpotPrice: "0.05"
## Etcd Cluster config

@aaronlevy

aaronlevy Sep 9, 2016

Member

Probably need documentation going over how the nodes will be distributed across subnets / how that logic works. E.g. we recommend X nodes, with Y subnets, and it will be distributed in this way.

@colhom

colhom Sep 9, 2016

Contributor

Agreed. Will put that in there, along with links to the AWS EC2 SLA.

workerConfig := tlsutil.ClientCertConfig{
CommonName: "kube-worker",
DNSNames: []string{
"*.*.compute.internal",
fmt.Sprintf("*.%s.compute.internal", c.Region),
"*.ec2.internal",

@aaronlevy

aaronlevy Sep 9, 2016

Member

nit: you could do the same region check here to only use this if in us-east-1
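
A sketch of that nit, reusing the dnsSuffix pattern quoted earlier in the diff (the c.Region field and tlsutil.ClientCertConfig come from the surrounding code; the rest is illustrative):

dnsNames := []string{"*.*.compute.internal"}
if c.Region == "us-east-1" {
	// us-east-1 is the only region whose private zone is ec2.internal
	dnsNames = append(dnsNames, "*.ec2.internal")
} else {
	dnsNames = append(dnsNames, fmt.Sprintf("*.%s.compute.internal", c.Region))
}

workerConfig := tlsutil.ClientCertConfig{
	CommonName: "kube-worker",
	DNSNames:   dnsNames,
}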

@colhom

Contributor

colhom commented Sep 9, 2016

Is the separate etcd optional?

No, it's very much mandatory. If etcd is not on separate machines, the Kubernetes EC2 instances cannot be updated, which is a big deal.

It's a pre-req for dynamically scaling compute resources for both control plane and worker pool- something AWS users (including me!) are very excited about.

Until self-hosted etcd is stable and production ready, this is a requirement.

Are you planning to remove/merge this feature with self-hosted Kubernetes? They seem at odds.

Bootkube currently provisions a static single-node etcd instance and points the control plane at it- and that's really it. A single point of non-fault-tolerant failure for your otherwise fault-tolerant control plane.

If anything, this would complement self-hosted by making etcd more robust (>1 node, multi-zone, etc.) and really not change anything about how self-hosted works currently, as neither bootkube nor this PR deals at all with self-hosting etcd.

Also, the addition of etcd TLS here seems orthogonal

If an etcd interface is exposed to any sort of network, TLS encryption is a must. We have just exposed etcd to the subnet, hence encryption becomes mandatory. If it's only listening on localhost, it's not as big of a deal.

@dghubble

@colhom

Contributor

colhom commented Sep 9, 2016

Update- just did some fishing around in bootkube- it appears the multi-node bootkube example actually uses a discrete/external multi-node etcd cluster, just like this PR! So, in reality, this would make bootkube on AWS look almost exactly like the multi-node Vagrant setup, just plus TLS!

@aaronlevy

Member

aaronlevy commented Sep 9, 2016

@colhom that review pass is done. Mostly questions at this stage.

@dghubble @pieterlange I'm personally on the fence about this functionality as well - but @colhom is championing its inclusion, and I won't explicitly block on "I feel iffy about this" if @colhom is willing to support it long-term.

My general concerns:

This feels somewhat like a big workaround. If cloud-formation does not support this type of deployment, why are we fighting to make it work? Parsing out the etcd sections to keep them from being changed really does seem fragile to me - but to be fair, I'm not very familiar with cloud-formation.

We have near-term plans to move toward self-hosted installations and this diverges from that work. This may be fine; it just means we will likely end up with a separate tool (which, to be fair, may have happened anyway).

kube-aws could generate multiple cloud-formation stacks in the case that you want a larger etcd cluster. I'm still unclear on what we gain from trying to force this all into a single deployment (when we have to work around the fact that it doesn't actually work natively).

The single master likely works well for a lot of people (kube-up, GKE, kops, kube-aws (until now)). I also still question whether in-place updates truly cannot occur - can you remind me what the issue is here? Something about EBS volumes not being reattached after a cloud-formation update?

@aaronlevy

Member

aaronlevy commented Sep 9, 2016

Also, fwiw etcd access was always available to the worker nodes (for flannel) -- so the TLS stuff isn't just now necessary due to running multiple etcd nodes -- it's always been a need. But along those same lines - as of v1.4 (and self-hosted) we're moving toward no worker etcd access at all - so another temporary divergence here.

owner: root:root
content: |
#!/bin/bash -e
if [[ "$(wipefs -n -p $1 | grep ext4)" == "" ]];then

@aaronlevy

aaronlevy Sep 9, 2016

Member

If you actually want to do this (it scares me) use full paths to binaries and long flag names

@colhom

colhom Sep 9, 2016

Contributor

I'm trying to get a less sketchy solution in place.

content: |
#!/bin/bash -e
if [[ "$(wipefs -n -p $1 | grep ext4)" == "" ]];then
mkfs.ext4 $1

@aaronlevy

aaronlevy Sep 9, 2016

Member

Quote: "$1"

exit 0
elif [[ $exitStatus -eq 2 ]];then
echo "No ext4 fs detected- will format"
mkfs.ext4 -p "$1"

@crawford

crawford Sep 9, 2016

Member

If you are using -p, why bother with blkid?

@colhom

colhom Sep 9, 2016

Contributor

Just to be explicit? I guess we should just keep this as simple as possible... I'll change it

@colhom colhom force-pushed the colhom:discrete-etcd-cluster branch from 795e45f to 011ad7e Sep 9, 2016

@@ -61,7 +61,7 @@ coreos:
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/opt/bin/ext4-format-volume-once /dev/xvdf
ExecStart=/usr/sbin/mkfs.ext4 -p /dev/xvdf

@colhom

colhom Sep 9, 2016

Contributor

@aaronlevy final answer- wdyt?

@colhom

colhom Sep 10, 2016

Contributor

disregard my previous comment- after talking to crawford I've reverted to first checking with blkid, and then leaving the -p arg to mkfs for extra safety.

@colhom colhom force-pushed the colhom:discrete-etcd-cluster branch from 011ad7e to 795e45f Sep 9, 2016

@dghubble

Contributor

dghubble commented Sep 10, 2016

@colhom Sure thing. It sounds perfectly reasonable for kube-aws to focus on the needs of this particular cluster design, platform, and users. My notes just highlight conflicts between different cluster designs, but it sounds like that's a non-goal, please disregard.

@colhom

Contributor

colhom commented Sep 13, 2016

@aaronlevy any more review items? I've addressed the conditional ext4 formatting logic in a significantly more robust way in the latest two commits, and am running a conformance test on this now.

@pieterlange

pieterlange commented Sep 15, 2016

I do not wish to delay this feature because I think the work is just fine. But maybe it would be better to spin up etcd as a separate cloudformation template - this is exactly what I've been doing and what's reflected in my PR (#629). I guess it kind of depends on whether you can live with having to spin up multiple cloudformation templates. I can :-)

@colhom

Contributor

colhom commented Sep 15, 2016

@pieterlange I have experimented with the separate cf stack approach for etcd- there is too much added complexity in cross-correlating the resources and really not much to be gained (in my opinion).

I've discussed with @aaronlevy and this will not merge into master until after the v1.4 hyperkube and associated changes are in (weeks).

To unblock further development in this area, we will be merging this PR, HA control plane, cluster upgrades, and @mumoshu's kubectl drain PR on a separate experimental branch. After master is updated to v1.4, we will look at merging the experimental stuff back in.

I'll update the community once that branch is ready and passing conformance/e2e tests.

@aaronlevy

Member

aaronlevy commented Sep 15, 2016

I am still wondering about this:

I also still question whether in-place updates truly cannot occur - can you remind me what the issue is here? Something about EBS volumes not being reattached after a cloud-formation update?

@aaronlevy

Member

aaronlevy commented Sep 15, 2016

Also, can you expand on:

there is too much added complexity in cross-correlating the resources

Is it not just etcd network addressability (hostnames/ips) that needs to be shared?

@pieterlange

pieterlange commented Sep 15, 2016

Is it not just etcd network addressability (hostnames/ips) that needs to be shared?

That's basically all I'm doing in the PoC PR. Obviously the situation is a little more hairy than that, and my PR is just a complete hack. But it worked fine for me.

@colhom

Contributor

colhom commented Sep 15, 2016

I also still question whether in-place updates truly cannot occur - can you remind me what the issue is here? Something about EBS volumes not being reattached after a cloud-formation update?

If a CloudFormation update decides it needs to replace a machine, there is no way to ensure the new machine receives the same EBS volume as the old machine. So our single controller/EBS-backed etcd combo is non-updatable via CloudFormation. More generally, any machine running etcd is non-updatable, whether or not it's EBS backed. That's why this PR separates etcd out and quarantines the instances off from updates, until we can host on Kubernetes.

Also, can you expand on:

there is too much added complexity in cross-correlating the resources

Beyond what @pieterlange mentioned, additional points of difficulty:

  • security group references for ingress rules between the two stacks would have to be done manually by making read calls to the CloudFormation API.
  • having the stacks share subnets is quite tricky, so in reality we'd need separate subnets for etcd instances and k8s instances, each managed by their respective stack. Then there would be ACLs in the middle, along with the shared VPC. Basically, you have to create the stacks in a specific order and tear them down in a specific order, with some manual steps ahead of time to "detach" them from each other. Right now we have a single API call to create, and a single API call to destroy.

In general, breaking the "everything is in this single cloudformation stack" invariant greatly changes the nature of kube-aws, and is out of scope for a problem that will be solved in a better way within the next 6 months (or so ™️).

The hackery necessary to quarantine etcd instances from cloudformation updates boils down to masking updates to a few select subtrees of a cloudformation json document. It fits neatly within the confines of a single function which is referenced in one place. You can literally comment out the function and its one reference, and it's gone.

\cc @aaronlevy

@colhom colhom force-pushed the colhom:discrete-etcd-cluster branch from 795e45f to 985e451 Sep 16, 2016

@pieterlange

pieterlange commented Sep 16, 2016

Against my best intentions, I feel like my comment has started to derail/undermine the good work being done here. I trust @colhom's judgment on this item.

I suggest discussing the pros and cons of managing etcd under a separate cloudformation stack in my PR or a new issue, so we don't further pollute this thread. I like the idea of hosting etcd separately and robustly, but that's only because of one nasty prior experience.

@mumoshu

Contributor

mumoshu commented Sep 29, 2016

@colhom Thanks for your great work here and there. I'm really looking forward to seeing this get merged 👍
Sorry for bothering you, but have you had any chance to set up the experimental branch?
I'd like to experiment with #667 on top of it.

@dzavalkinolx

dzavalkinolx commented Nov 2, 2016

We created our fork of kube-aws and merged this request. After ~3 months of experience with it, I have to admit that this is not the way to go in production. Because the 3 etcd nodes have static IPs, there is no option to update them without some manual interaction. Since there is no backup of etcd state to somewhere outside the cluster (for example S3), manual interactions are quite risky. I decided not to upgrade the etcd nodes (by manually editing the generated stack template JSON file). As a result, the controller and worker nodes in one of the clusters I manage are now running the latest CoreOS AMI with the Dirty COW bug fixed, but the etcd nodes run an old AMI.

I believe the right approach is described in this article: https://crewjam.com/etcd-aws/, i.e. a separate ASG for etcd nodes and etcd state mirroring/backup to S3. In that case, etcd nodes can also be upgraded by running a CloudFormation stack update without any manual interaction.

@camilb

camilb commented Nov 2, 2016

@dzavalkinolx I also forked kube-aws and merged #608 and #629, plus some changes. Also started to integrate crewjam's etcd-aws. Got it working, but without TLS for now. I see more people are looking at this solution and I think the best place to track the progress is here: kubernetes-incubator/kube-aws#9

@pieterlange

pieterlange commented Nov 2, 2016

@dzavalkinolx @camilb I made a new issue for this at kubernetes-incubator/kube-aws#27. I think we're not far off from a solution, but it's nice to hear everyone's input on this.
