[EKS] [request]: On create: only return ACTIVE when endpoint actually usable #654

Open
dpiddockcmp opened this issue Dec 15, 2019 · 15 comments
Labels
EKS Amazon Elastic Kubernetes Service Proposed Community submitted issue

Comments

@dpiddockcmp

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
When an EKS cluster is created, the API reports an "ACTIVE" status before the endpoint can actually process requests. This means the first few attempts to use a new cluster receive connection timeouts. Every project that creates clusters has to implement retry logic for its first access to the cluster's API, usually when updating the aws-auth ConfigMap.

It would be super useful if the API reported an ACTIVE status on newly created clusters only once the endpoint was actually available to process requests. We've already waited over 10 minutes for the cluster to come up, so waiting ~30 seconds more for it to actually be usable wouldn't be a big issue.

Which service(s) is this request for?
EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
The Terraform EKS community module is trying to migrate from running kubectl in a shell to using the kubernetes provider for creating the aws-auth ConfigMap. This would help with cross-platform use. Unfortunately, because "ACTIVE" does not mean "USABLE", we've hit issues chaining the two providers together.

The kubernetes provider itself has refused to implement retry logic on connection timeouts.

Are you currently working around this issue?
Projects that create clusters have some form of retry loop with a sleep.
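
A minimal sketch of such a retry loop, in Python for illustration. The helper name `retry_with_sleep`, its parameters, and the choice of which exceptions count as retryable are assumptions made for this example and are not taken from any project mentioned in this issue.

```python
import time


def retry_with_sleep(operation, attempts=10, delay_seconds=5.0,
                     retry_on=(ConnectionError, TimeoutError),
                     sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Only exceptions listed in retry_on trigger a retry; any other
    exception propagates immediately. The sleep function is injectable
    so the loop can be exercised without real delays.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay_seconds)
```

A caller would wrap its first request to the new cluster, for example `retry_with_sleep(update_aws_auth)`, where `update_aws_auth` is a hypothetical function that raises `TimeoutError` or `ConnectionError` while the endpoint is still unreachable.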

Additional context
Will potentially make other requests that deal with newly created clusters easier: #185, #254, #51


@max-rocket-internet

> the API reports an "ACTIVE" status before the endpoint can actually process requests

It's super annoying in Terraform and also seems like quite a basic bug. Would love a real fix from AWS.

@mikestef9
Contributor

Thanks for the issue report. We are looking into this further.

@barryib

barryib commented Dec 19, 2019

@mikestef9 Thanks for taking the time to look at this. This is indeed an annoying "bug", and we would like to avoid adding some kind of buggy retry or wait logic.

Couldn't this be fixed quickly by checking the Kubernetes /healthz or /readyz endpoint (or something similar) before the EKS API returns an ACTIVE status?

@barryib

barryib commented Dec 26, 2019

FYI, I opened a PR in the AWS provider to wait for the Kubernetes endpoint: hashicorp/terraform-provider-aws#11426. I don't know if it is the right quick win until this issue gets solved. Feedback is welcome.

@mikestef9 added this to Researching in containers-roadmap on Jan 3, 2020

@barryib

barryib commented Jan 27, 2020

@mikestef9 Any updates on this?

@jqmichael

Is this behavior consistently reproducible? We are trying to figure out whether this is due to a race condition that occurs in some scenarios but not in others.

@greenscar

greenscar commented May 21, 2020

@jqmichael This happens every time I run my TF to spin up a new cluster. I have to run it a second time to get the final step to run.

@jlforester

jlforester commented May 26, 2020

> @jqmichael This happens every time I run my TF to spin up a new cluster. I have to run it a second time to get the final step to run.

We also ran into this issue. Our workaround is a few lines of shell script in a provisioner that periodically runs curl to check the endpoint. It can take a minute or more after AWS reports the cluster as ready before it actually becomes usable.
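
The polling approach described above can be sketched as follows. This is a Python illustration, not the shell script from this comment. The function names, defaults, and the decision to treat any HTTP response (including 401 or 403) as "reachable" are assumptions for this example. For a real EKS endpoint the caller would likely need to pass an `ssl.SSLContext` that trusts the cluster CA, otherwise certificate verification fails and the probe reports `False`.

```python
import time
import urllib.error
import urllib.request


def endpoint_responds(url, timeout_seconds=5.0, ssl_context=None):
    """Return True if the server sends back any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds,
                                    context=ssl_context):
            return True
    except urllib.error.HTTPError:
        # An error status still proves the request reached a live server.
        return True
    except OSError:
        # Covers URLError, refused connections, and socket timeouts.
        return False


def wait_for_endpoint(url, max_wait_seconds=120.0, interval_seconds=5.0,
                      probe=endpoint_responds, sleep=time.sleep):
    """Poll url until probe(url) is True or max_wait_seconds has passed."""
    deadline = time.monotonic() + max_wait_seconds
    while not probe(url):
        if time.monotonic() >= deadline:
            return False
        sleep(interval_seconds)
    return True
```

The loop is bounded by a time budget rather than an attempt count, which matches the observation that the delay can vary from a few seconds to more than a minute. To supply a TLS context, a caller could pass `probe=functools.partial(endpoint_responds, ssl_context=ctx)`.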

This isn't the only AWS resource type for which we've had to implement this kind of workaround.

@jqmichael

We definitely need to reproduce this on the EKS side. But just curious, did the initial request ever get a TCP SYN/ACK back? (We are trying to figure out whether the packet gets dropped in the middle or reaches the API server.)
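
One rough way to answer the SYN/ACK question from the client side is to classify a single TCP connection attempt. The sketch below is an illustration in Python; the function name `tcp_probe` and its result labels are assumptions for this example. A completed handshake implies a SYN/ACK came back, a refusal implies something answered with a reset, and a timeout is consistent with the packet being dropped on the way. It cannot tell which hop dropped the packet.

```python
import socket


def tcp_probe(host, port=443, timeout_seconds=5.0):
    """Classify the outcome of one TCP connection attempt.

    Returns "connected" if the handshake completed, "refused" if the
    attempt was actively rejected, "timeout" if nothing came back in
    time, and "error" for anything else (for example a DNS failure).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return "connected"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"
```

Running this against the cluster endpoint's hostname right after the status turns ACTIVE, and logging the result with a timestamp, would show whether early failures are timeouts or refusals.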

@barryib

barryib commented Oct 4, 2020

@jqmichael Were you able to reproduce this? You can use @dpiddockcmp's gist to help you test this quickly: https://gist.github.com/dpiddockcmp/23342f3b601b3432b1ea98ab61af6ba0

@jqmichael

We narrowed it down to the propagation delay in the Network Load Balancer (NLB) data plane after the Auto Scaling group registers targets. The NLB team is launching a campaign to reduce the propagation delay later this year. Until that campaign is finished, the workaround is to retry on the client side until traffic goes through.

@prithviramesh

prithviramesh commented Oct 13, 2020

This also affects terminating instances: API servers on terminating control plane instances still serve requests because of (what seems to be) a de-registration delay in the NLB.

@ueokande

ueokande commented Oct 26, 2020

Does this issue also happen when upgrading a cluster? In our case, we manage an EKS cluster with CloudFormation, and we have sometimes found communication with the API server to be unstable immediately after an upgrade, even when the stack status is UPDATE_COMPLETE.

@annyip

annyip commented Mar 5, 2021

Any updates on this?

@SinghDivneet

SinghDivneet commented Jan 7, 2022

Are there any workarounds for this until we find a fix?
