
Support Upgrade Existing EKS Kubernetes #348

Closed
christopherhein opened this issue Dec 17, 2018 · 16 comments

Comments

@christopherhein
Contributor

Why do you want this feature?
EKS currently has clusters running Kubernetes 1.10; this would add a mechanism to upgrade existing clusters to 1.11.

What feature/behavior/change do you want?
I'd like to start a conversation about best practices for how we should support this.

  1. Should we upgrade the nodes automatically?
  2. How should we do this to reduce human error? (Related: Interactively Upgrade EKS Worker Node, aws/containers-roadmap#57)

This is an extension of #344.

@mrichman

This would be awesome because this sucks.

@errordeveloper
Contributor

errordeveloper commented Dec 29, 2018

Let's write down semi-manual instructions first (see #357 (comment)); from there it should become clear what needs automating.

cc @tiffanyfay

@christopherhein
Contributor Author

christopherhein commented Jan 11, 2019

What @tiffanyfay and I have:


  • check if cluster-autoscaler is installed; if so, scale it down to 0 [needs to be implemented]
  • scale kube-dns up by 1 replica [needs to be implemented]
  • if cluster-autoscaler is installed, scale it back to the original replica count [needs to be implemented]
  • Deploy coredns [needs to be implemented]
  • Delete kube-dns [needs to be implemented]
  • Update kube-proxy [needs to be updated]

We might also have to upgrade kube-proxy from 1.10 to 1.11. Need more info.
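For reference, a rough kubectl sketch of the scaling steps above, assuming cluster-autoscaler and kube-dns run as Deployments in kube-system; the replica counts are only examples:

  # record the current cluster-autoscaler replica count, then scale it to 0
  kubectl -n kube-system get deployment cluster-autoscaler -o jsonpath='{.spec.replicas}'
  kubectl -n kube-system scale deployment cluster-autoscaler --replicas=0

  # bump kube-dns by one replica for extra headroom (example: 2 -> 3)
  kubectl -n kube-system scale deployment kube-dns --replicas=3

  # ...replace/upgrade the nodegroup...

  # scale cluster-autoscaler back to the previously recorded count (example: 1)
  kubectl -n kube-system scale deployment cluster-autoscaler --replicas=1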

@mrichman

If going from 1.10 to 1.11, then also swap kube-dns for CoreDNS.
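A minimal sketch of that swap, assuming a CoreDNS manifest prepared per the EKS docs and saved as coredns.yaml (a placeholder name):

  # deploy CoreDNS and wait for it to come up
  kubectl apply -f coredns.yaml
  kubectl -n kube-system rollout status deployment coredns

  # once CoreDNS is serving DNS, scale down and remove kube-dns
  kubectl -n kube-system scale deployment kube-dns --replicas=0
  kubectl -n kube-system delete deployment kube-dns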

@christopherhein
Contributor Author

Good point @mrichman.

@errordeveloper
Contributor

errordeveloper commented Jan 11, 2019 via email

@tiffanyfay
Contributor

@errordeveloper for the 1.11 upgrade, I don't believe so. I'll talk with the team.

And if/when we are good with the steps, I'll work on an update API/command when I'm back to work next week.

@tiffanyfay
Contributor

We also need to update kube-proxy in the list above.

https://docs.aws.amazon.com/eks/latest/userguide/coredns.html
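For the kube-proxy piece, something along these lines should work; the registry account/region and the image tag here are assumptions, so check the EKS user guide for the exact image matching your cluster version:

  # point the kube-proxy daemonset at the image for the upgraded control plane version
  kubectl -n kube-system set image daemonset/kube-proxy \
      kube-proxy=602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.11.5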

@errordeveloper
Contributor

Answering my own questions.

> Why is '--ignore-daemonsets' needed here?

So one cannot normally delete daemonset-owned pods. I still don't get why, but anyway...

> By the way, does it work with multiple ASGs?

Yes, cluster autoscaler is capable of discovering nodegroups.
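To see how a given cluster's autoscaler is finding those groups, one can inspect the deployment args; the tag scheme shown is the usual AWS auto-discovery convention and <cluster-name> is a placeholder:

  # look for the ASG auto-discovery flag on the cluster-autoscaler deployment
  kubectl -n kube-system get deployment cluster-autoscaler -o yaml | grep node-group-auto-discovery
  # typically: --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>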

@errordeveloper
Contributor

errordeveloper commented Jan 14, 2019

I am still not clear on why we need to wire up a temporary SG? And what does key=value:NoSchedule accomplish that cordon/drain doesn't already?

@christopherhein
Contributor Author

christopherhein commented Jan 14, 2019

> Answering my own questions.
>
> Why is '--ignore-daemonsets' needed here?
>
> So one cannot normally delete daemonset-owned pods. I still don't get why, but anyway...
>
> By the way, does it work with multiple ASGs?
>
> Yes, cluster autoscaler is capable of discovering nodegroups.

Yeah, the --ignore-daemonsets flag is necessary or kubectl drain won't work; I didn't look into the full background for why. In practice it doesn't matter for DaemonSets, because once your new ASG came up and was available, the DaemonSet pods would have been scheduled on it automatically.

> I am still not clear on why we need to wire up a temporary SG? And what does key=value:NoSchedule accomplish that cordon/drain doesn't already?

The temporary SG connection between the two ASGs allows cross-service traffic while you drain nodes. So if you have pods running on both sets of ASGs and a service on the new ASG tries to route to a pod running on the old ASG, it can still make the connection during the switch.
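If wired up by hand, that temporary allowance is just a pair of ingress rules between the two node security groups, roughly like this (the sg-OLDNODES/sg-NEWNODES IDs are placeholders for the old and new nodegroup SGs):

  # let the old and new nodegroups reach each other on all ports while draining
  aws ec2 authorize-security-group-ingress --group-id sg-OLDNODES --protocol -1 --source-group sg-NEWNODES
  aws ec2 authorize-security-group-ingress --group-id sg-NEWNODES --protocol -1 --source-group sg-OLDNODES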

The cordon/drain vs. NoSchedule distinction is very nuanced. If you cordon, it will start to remove the pods from Services, so doing this takes down your environment if you haven't already moved the workloads manually somehow. So instead we just apply NoSchedule so that the new nodes are the only schedulable ones, then drain, which moves the workloads to the new instances.
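Concretely, that flow looks something like this; the eksctl nodegroup label, the taint key, and the node name are illustrative:

  # taint every node in the old nodegroup so nothing new is scheduled there
  kubectl taint nodes -l alpha.eksctl.io/nodegroup-name=ng-old upgrade=true:NoSchedule

  # then drain each old node; --ignore-daemonsets is required because of the daemonset-owned pods
  kubectl drain <old-node-name> --ignore-daemonsets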

Make sense?

@errordeveloper
Contributor

Thanks, Chris! Do we strictly need the temporary SG? At the moment we are still debating what level of isolation a nodegroup should have (see #419), but I think if there is no isolation (for ordinary ports), we don't need the temporary SG, unless I am missing something?

@errordeveloper
Contributor

errordeveloper commented Jan 15, 2019

A short summary on #419: I'm going to work on adding a shared SG for all nodes, so that all nodegroups are actually equal; there will be options to enable isolation for those who need it. Adding this SG also means that we will have to add plumbing/mechanics for making changes to the cluster stack, which will help future work on upgrades in general.

@tiffanyfay mentioned this issue Feb 9, 2019
@christopherhein changed the title from "Support Upgrade Existing EKS Kubernetes 1.10 -> 1.11" to "Support Upgrade Existing EKS Kubernetes" on Feb 21, 2019
@errordeveloper
Contributor

We should turn #348 (comment) into an actual proposal and write down a basic CLI design. I think we are pretty close to having this implemented.

@christopherhein
Contributor Author

@errordeveloper would you call this done? I think we should close.

@errordeveloper
Contributor

Yes, I think it is!
