This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Allow workers to sit inside private subnets within an existing VPC #44

Closed
cknowles opened this issue Nov 8, 2016 · 25 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.

Comments

@cknowles
Contributor

cknowles commented Nov 8, 2016

Use case proposal:

  • Use an existing VPC, e.g. one which already has an RDS deployment inside
  • Private subnets for at least the workers, perhaps the controllers too if we use a bastion (for the first use case we want public controllers so we can SSH into the cluster)
  • Use an existing NAT Gateway per AZ for HA, with the appropriate route table used per worker AZ (a rough config sketch follows below)

Similar to kubernetes/kops#428.

It was part of what coreos/coreos-kubernetes#716 was intended to cover. Pre-filing this issue so we can move discussion out of #35.
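To make the proposal concrete, here is a hypothetical cluster.yaml sketch of the setup being asked for. vpcId and the subnets list (availabilityZone/instanceCIDR) already exist in kube-aws's cluster.yaml; the per-subnet routeTableId is exactly the hypothetical piece this issue asks for, and all IDs/CIDRs below are placeholders:

vpcId: vpc-0123456789abcdef0            # existing VPC, e.g. the one hosting RDS
subnets:
  - availabilityZone: ap-northeast-1a
    instanceCIDR: 10.0.1.0/24
    routeTableId: rtb-aaaaaaaa          # hypothetical per-subnet key: route table whose 0.0.0.0/0 goes to the NAT gateway in AZ a
  - availabilityZone: ap-northeast-1c
    instanceCIDR: 10.0.2.0/24
    routeTableId: rtb-cccccccc          # hypothetical per-subnet key: route table whose 0.0.0.0/0 goes to the NAT gateway in AZ c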

@cknowles cknowles changed the title Allow workers to sit inside a private subnet with a NAT Gateway per AZ Allow workers to sit inside private subnets within an existing VPC Nov 8, 2016
@cknowles
Contributor Author

cknowles commented Nov 8, 2016

@pieterlange @mumoshu the original pull request does two things:

  1. Allows existing route tables to be associated with kube-aws created subnets
  2. Separates out subnets of the controller(s) so their route tables can be different

Using the existing routeTableId option is less usable for this case because, unless we are very careful with the route table, we defeat the HA aspects: we need to route traffic on very specific CIDRs to the right AZs.

@cknowles
Contributor Author

cknowles commented Nov 8, 2016

I have a feeling we should separate out the various use cases; I've mixed them together somewhat because the change needed to cope with all of them was relatively small.

@pieterlange
Contributor

pieterlange commented Nov 8, 2016

Using the existing routeTableId option is less usable for this case because, unless we are very careful with the route table, we defeat the HA aspects: we need to route traffic on very specific CIDRs to the right AZs.

Are you referring to the route to 0.0.0.0/0 via the NAT gateway here? As the NAT gateway can only live in one AZ at a time, there is an inherent dependency here. There is more than one way to skin this cat.

  • Maintain multiple route tables, one per AZ (your suggestion)
  • Keep using one route table and manually / via Lambda fail over to a different NAT gateway on AZ downtime

I believe that ultimately this problem can be solved without constraining operator choices by implementing node pools. The controllers would be in a different pool. Workers in each AZ could be in different pools, one pool per AZ. Or they can be in one pool and use one route table.

@cknowles
Contributor Author

cknowles commented Nov 8, 2016

Yeah, our private subnet route tables look something like:

10.0.0.0/16 -> local
0.0.0.0/0   -> nat-UNIQUEID

Having looked at it again just now, I think there is no way to switch traffic based on the source CIDR in a single route table. In other words, we can't use a single route table while still routing the worker traffic from each individual AZ to the appropriate NAT Gateway in the same AZ.

I think generally we'd want to have a route table per AZ to mirror a typical multi-AZ NAT Gateway setup.
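In raw CloudFormation terms (which is what kube-aws ultimately renders to), the per-AZ layout amounts to something like the sketch below. Resource names and IDs are placeholders, and the 10.0.0.0/16 -> local route is implicit in every VPC route table:

Resources:
  PrivateRouteTableA:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: vpc-0123456789abcdef0          # placeholder: the existing VPC
  DefaultRouteA:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTableA
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: nat-0aaaaaaaaaaaaaaaa   # placeholder: the NAT gateway living in AZ a
  WorkerSubnetARouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: subnet-aaaaaaaa             # placeholder: the worker subnet in AZ a
      RouteTableId: !Ref PrivateRouteTableA
  # ...the same trio repeated for AZ b and AZ c, each pointing at its own NAT
  # gateway, so losing one AZ only takes out that AZ's egress path.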

The original pull request was a simple version based on a whole AZ going down. However, if we are talking about service-level outages per AZ, then I'm not sure how the failover could/should work when a single NAT Gateway is down. Probably something along the lines of what you mention with Lambda/manual failover.

@pieterlange
Contributor

pieterlange commented Nov 8, 2016

if a single NAT Gateway is down

The NAT gateways themselves are HA (or so Amazon says). The issue currently is that if the gateway lives in AZ-a and that entire AZ goes down, AZ-b and AZ-c lose connectivity to the internet until a new NAT gateway is booted in one of those AZs and the route table is updated.

I think generally we'd want to have a route table per AZ to mirror a typical multi-AZ NAT Gateway setup.

👍

@cknowles
Contributor Author

cknowles commented Nov 8, 2016

Yep agreed. From the docs:
NAT gateways in each Availability Zone are implemented with redundancy. Create a NAT gateway in each Availability Zone to ensure zone-independent architecture.

Do you have more details on the node pools? I'd like to have a look over it.

@pieterlange
Contributor

pieterlange commented Nov 8, 2016

@mumoshu wrote something about this in #46 just now :)

@mumoshu mumoshu added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Nov 9, 2016
@mumoshu
Contributor

mumoshu commented Nov 17, 2016

@c-knowles I announced my POC for node pools in #46 (comment) yesterday!

Could it be a foundation for addressing this issue?

With the POC, we now have separate subnet(s) for each node pool (= a set of worker nodes).
Is that enough, or do we need a way to differentiate route tables per subnet/AZ as you've discussed so far?

P.S. I don't recommend reading each commit in the node-pools branch for now, mainly because those are really dirty and not ready for review 😆

@cknowles
Contributor Author

Cool, thanks. I was going to review it soon for this use case. Prior to that, what I would say is that the current setup we are using from the code at coreos/coreos-kubernetes#716 uses a different subnet per AZ because each AZ has its own NAT Gateway. I think control over different AZs within a tool like kube-aws will usually come down to wanting control over what gets set per AZ (with sensible defaults).

@cknowles
Contributor Author

cknowles commented Dec 5, 2016

@mumoshu After reviewing node pools further, I think this use case should be a matter of:

  • For workers, specify the appropriate routeTableId in each AZ via one node pool per AZ, which then allows us to use the autoscaler as per Feature: Node pools #46 (see the sketch below).
  • For controllers, specify the appropriate routeTableId depending on whether public or private controllers are needed.

I think that's it. I'm going to try it out tomorrow. My only reservation is that it's quite a complex setup for what I think is a very common scenario. I also could not see a way to disable the new separate etcd node creation; I don't really want/need one in the dev cluster, but I guess that's not supported any more?
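As a sketch of the per-AZ node pool approach above: assuming the cluster.yaml keys from the node pools POC (key names may differ slightly, and IDs/CIDRs are placeholders), each pool would look roughly like the following, with one such file per AZ plus routeTableId in the main cluster.yaml for the controllers:

# node-pools/pool-a/cluster.yaml (hypothetical pool name; one pool per AZ)
nodePoolName: pool-a
vpcId: vpc-0123456789abcdef0     # the shared, pre-existing VPC
routeTableId: rtb-aaaaaaaa       # private route table whose 0.0.0.0/0 points at the NAT gateway in AZ a
subnets:
  - availabilityZone: ap-northeast-1a
    instanceCIDR: 10.0.1.0/24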

@mumoshu
Contributor

mumoshu commented Dec 6, 2016

@c-knowles Really looking forward to your results!

Regarding the separate etcd nodes: Yes, collocating etcd and control plane in a single node is not supported.

I also believe that a uniform architecture for the dev and prod envs would be good in several ways, e.g. dev-prod parity, less user confusion/support, and less code complexity.
I'm a bit afraid of being obtrusive here, but basically I recommend adopting etcd node separation regardless of dev/prod.

If cost is the concern, I guess you can use smaller instance types for the etcd nodes in a dev env, so the wasted cost would be minimal.

@cknowles
Contributor Author

cknowles commented Dec 6, 2016

I tend to agree, to a certain level. Our dev cluster is where we develop cluster changes/upgrades, and we run automated tests against it for our apps and deployment mechanisms. We may even have a few of those at any time. We actually have a staging cluster after dev as well. The cost went up by at least 50% for any of those clusters (the minimum was 2 nodes, now it's 3). I'd also like to support external etcd, such as compose.io.

Updates/feedback so far:

@mumoshu
Contributor

mumoshu commented Dec 6, 2016

@c-knowles A 50% increase is serious. I'd definitely like to make it more cost effective in other ways.
I know it may sound a bit adventurous, but how about utilizing spot fleets not only for worker nodes but also for controller nodes?

@cknowles
Contributor Author

cknowles commented Dec 6, 2016

Could do. Is there any simple way to provide at least one controller? e.g. some automated ASG/fleet scaling rule+action like "if all spots are going to shut down, increment the on-demand ASG from min 0 to min 1". Or alternatively, leave that for now and assume at least one spot will come back?

@mumoshu
Contributor

mumoshu commented Dec 6, 2016

@c-knowles AFAIK there's no way to coordinate an ASG and a spot fleet like that.

However, I guess setting a very low TargetCapacity (maybe 1) for a spot fleet and enabling the "diversified" strategy to diversify your instance types would work like that.

For example, if you've chosen 1 unit = 3.75GB of memory:

worker:
  spotFleet:
    targetCapacity: 1
    launchSpecifications:
    - weightedCapacity: 1
      instanceType: m3.medium
    - weightedCapacity: 1
      instanceType: c3.large
    - weightedCapacity: 1
      instanceType: c4.large
    - weightedCapacity: 2
      instanceType: m3.large
    - weightedCapacity: 2
      instanceType: m4.large

would ideally bring up only one m3.medium on peaceful days, and if and only if the spot fleet loses its bids, a larger instance would be brought up instead.

c3.large, c4.large, m3.large and m4.large are approximately 2x more expensive than m3.medium.
However, they're spot instances, hence discounts of 80–90%.
High availability plus reduced cost is achieved.

Not tested though 😉

@cknowles
Contributor Author

cknowles commented Dec 6, 2016

That may be ok to start with. Possible future changes could be termination notices linked to an ASG action/lambda. BTW, if you are available in the #sig-aws k8s Slack it may be useful to chat so we can keep it out of the issues.

@mumoshu
Contributor

mumoshu commented Dec 6, 2016

@c-knowles I guess in such a case my kube-spot-termination-notice-handler in combination with cluster-autoscaler would fit.

My kube-spot-termination-notice-handler basically runs kubectl drain on the node almost immediately after it recognizes a termination notice (~2 min before the actual termination).

kubectl drain would end up leaving several pending pods across the cluster.
cluster-autoscaler notices the pending pods and scales up an ASG.

However, I'm not yet sure if cluster-autoscaler supports scaling an ASG down to zero nodes, which I believe is required for your use case.
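For illustration only (this is not the actual kube-spot-termination-notice-handler), the idea can be sketched as a DaemonSet whose pods poll the EC2 instance metadata for a spot termination notice and drain their own node. The image name is hypothetical and assumed to bundle curl and kubectl, and the pod's service account is assumed to have permission to drain nodes:

apiVersion: extensions/v1beta1      # DaemonSet API group of that era
kind: DaemonSet
metadata:
  name: spot-termination-notice-handler
spec:
  template:
    metadata:
      labels:
        app: spot-termination-notice-handler
    spec:
      containers:
      - name: handler
        image: example/spot-termination-notice-handler   # hypothetical image with curl + kubectl
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName    # the node this pod runs on
        command:
        - /bin/sh
        - -c
        - |
          # Poll the EC2 instance metadata; the spot/termination-time key appears
          # roughly two minutes before the instance is reclaimed.
          while true; do
            if curl -sf http://169.254.169.254/latest/meta-data/spot/termination-time > /dev/null; then
              # Drain this node so pods get rescheduled elsewhere before termination.
              kubectl drain "$NODE_NAME" --ignore-daemonsets --force
              sleep 300
            fi
            sleep 5
          done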

@cknowles
Contributor Author

cknowles commented Dec 6, 2016

I have good news! It's possible to fulfil this use case with only minor modifications to kube-aws using node pools. I used one of our shared VPCs and managed to create the main cluster for just the controller(s) in the public subnets while putting a node pool in a private subnet of one AZ. I also rolled out a k8s Service with an ELB to double-check that it attaches correctly to the workers; that works too, with a basic pod responding to HTTP calls.
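A minimal example of the kind of smoke test described above (names and image are illustrative) is a plain HTTP pod exposed by a Service of type LoadBalancer, which should provision an ELB in front of the private-subnet workers:

apiVersion: v1
kind: Service
metadata:
  name: elb-check
spec:
  type: LoadBalancer      # asks the cloud provider for an ELB
  selector:
    app: elb-check
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: v1
kind: Pod
metadata:
  name: elb-check
  labels:
    app: elb-check
spec:
  containers:
  - name: http
    image: nginx:1.11     # any container answering HTTP is enough for the check
    ports:
    - containerPort: 80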

Assuming adding a couple more node pools works, we only need minor mods for this.

I have a few feedback items which I will start putting pull requests in for. I may need some initial feedback on those items before starting on them.

@cknowles
Contributor Author

cknowles commented Dec 7, 2016

@mumoshu on my feedback above, I want to check a few things before doing the pulls if you could provide a very quick comment on each:

  1. Do you know why the worker count min/max are hard-coded to count-1 and count+1? I'd like to change this to explicitly support min/max config and deprecate the existing config. If we do that, we can keep supporting it for at least one release and just set min/max to the count (which is what it was previously, before 1f37a9b).
  2. I will copy some more values from the main stack to the node pool, like the tags and worker instance type.
  3. I will align the main cluster.yaml default with the node pool one (useCalico explicitly set to false).
  4. I'm not sure whether we should allow multi-AZ node pools; thoughts?
  5. Is there a plan to add node pools validate? If so, I will leave the inconsistencies in command naming for that task; otherwise we should remove the instructions from the render output which state that a validation command is available.
  6. Where can I add instructions for node pools? Is it ok inside Documentation?

@mumoshu
Contributor

mumoshu commented Dec 7, 2016

@c-knowles First of all, thanks for your feedback!

1: No. Anyway, it would be nice to make them configurable in cluster.yaml if and only if we keep backward compatibility in stack-template.json, i.e. compute default values for {Worker,Controller}{Min,Max}Count from {Worker,Controller}Count when they're omitted. So 👍
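As a purely illustrative sketch of that backward-compatible shape (the key names below are hypothetical, not necessarily what #142 ends up using):

# Existing key, still honoured:
workerCount: 3
# Hypothetical explicit sizing; each value falls back to workerCount when omitted:
#workerMinCount: 3
#workerMaxCount: 5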

2: I'm not yet sure that copying the instance type is good. I guess you should explicitly select an appropriate instance type regardless of what is selected in the main cluster.

Copying stack tags would basically be good, as I assume stack tags are used to identify the whole cluster's resources, not only the main cluster's or a node pool's.
However, I guess you shouldn't put exactly the same set of tags on both the main cluster and the node pools. Doing that makes it impossible to identify only part of the whole cluster, e.g. you can't identify only the resources from a node pool if all the stacks are tagged identically.

Not 👍 for now but definitely like to discuss more!

3: I'd like to keep cluster.yaml simple. Only keys for required configuration with no default values would be commented out in cluster.yaml. I'd rather fix node pools' cluster.yaml so that it properly comments out useCalico: when useCalico is false in the main cluster's cluster.yaml.

Basically 👍 but I'd rather fix node pools' cluster.yaml instead of the main cluster's.

4: Indeed. For now, I guess we should at least error out when different AZs are specified in multiple subnets, instead of just removing subnets: from node pools' cluster.yaml (just for a minimum delta from the current code, hence an easier pull request, plus an extra period of time to receive feedback from users on whether subnets itself is required or not).

5: Yes! I'd appreciate it if you could add it, but I'm willing to do it myself.

For example, kube-aws node-pools update could be added like this.

6: Thanks for your attention to documentation! I'd greatly appreciate it.

Adding a brand-new document named something like kubernetes-on-aws-node-pools.md and linking it from the existing kubernetes-on-aws.md, which is the entry point of CoreOS's docs, would be nice.

@mumoshu
Contributor

mumoshu commented Dec 8, 2016

@c-knowles

How to update node pool stacks? The Stack Policy is set to {} so updates are currently disallowed. #130?

Thanks for spotting it! Fixed via #140

@cknowles
Contributor Author

cknowles commented Dec 8, 2016

Submitted #142 for the ASG definitions. Getting the defaults to work in golang was trickier than I'd hoped, but it works; we'll see what you think.

@mumoshu
Contributor

mumoshu commented Feb 16, 2017

@c-knowles I believe that almost all the TODOs related to this issue have been addressed via recently merged PRs. Can we close this now?

@mumoshu
Contributor

mumoshu commented Feb 24, 2017

ping @c-knowles

@cknowles
Contributor Author

@mumoshu I'm happy this has been covered now; the remaining items were mostly in #315. For anyone who ends up here, I've ended up using the additions in #315 to have kube-aws manage the subnets and route tables while referencing an existing set of NAT Gateways (node pools set to private).
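For readers looking for the concrete shape: with the #315 additions the configuration looks roughly like the sketch below, with kube-aws creating the private subnets and their route tables but pointing their default routes at pre-existing NAT gateways. Exact key names depend on your kube-aws version, and IDs/CIDRs are placeholders:

vpcId: vpc-0123456789abcdef0
subnets:
  - name: privateA
    availabilityZone: ap-northeast-1a
    instanceCIDR: 10.0.1.0/24
    private: true                     # kube-aws manages the subnet and its route table
    natGateway:
      id: nat-0aaaaaaaaaaaaaaaa       # ...but reuses this existing NAT gateway for 0.0.0.0/0
  - name: privateC
    availabilityZone: ap-northeast-1c
    instanceCIDR: 10.0.2.0/24
    private: true
    natGateway:
      id: nat-0ccccccccccccccc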
