[EKS] [request]: On create: only return ACTIVE when endpoint actually usable #654

dpiddockcmp · 2019-12-15T12:09:32Z

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
When an EKS cluster is created, the API reports an "ACTIVE" status before the endpoint can actually process requests. This means the first few attempts to use a new cluster receive connection timeouts. Every project that creates clusters has to implement retry logic for the first access to the cluster API, usually when updating the aws-auth ConfigMap.

It would be super useful if the API reported an ACTIVE status on newly created clusters only once the endpoint was actually available to process requests. We've already waited over 10 minutes for the cluster to come up, so waiting ~30 seconds more for it to actually be usable wouldn't be a big issue.

Which service(s) is this request for?
EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
The Terraform EKS community module is trying to migrate from running kubectl in a shell to using the kubernetes provider for creating the aws-auth ConfigMap. This would help with cross-platform use. Unfortunately, because "ACTIVE" does not mean "USABLE", we've hit issues chaining the two providers together.

The kubernetes provider itself has refused to implement retry logic on connection timeouts.

Are you currently working around this issue?
Projects that create clusters have some form of retry loop with a sleep.

Additional context
Will potentially make other requests that deal with newly created clusters easier: #185, #254, #51

max-rocket-internet · 2019-12-16T14:55:00Z

> the API reports an "ACTIVE" status before the endpoint can actually process requests

It's super annoying in Terraform and also seems like quite a basic bug. Would love a real fix from AWS.

mikestef9 · 2019-12-17T07:00:23Z

Thanks for the issue report. We are looking into this further

barryib · 2019-12-19T21:29:06Z

@mikestef9 Thanks for taking the time to look at this. This really is an annoying "bug", and we would like to avoid adding some kind of buggy retry or wait logic.

Couldn't this be fixed quickly by checking the Kubernetes /healthz or /readyz endpoint (or something similar) before the EKS API returns an ACTIVE status?

barryib · 2019-12-26T23:56:59Z

FYI, I opened a PR in the AWS provider to wait for the Kubernetes endpoint: hashicorp/terraform-provider-aws#11426. I don't know if it is the right quick win until this issue gets solved. Feedback is welcome.

barryib · 2020-01-27T22:15:32Z

@mikestef9 Any updates on this?

jqmichael · 2020-03-03T17:59:09Z

Is this behavior consistently reproducible? Trying to figure out whether this is due to a race condition that occurs in some scenarios but not others.

greenscar · 2020-05-21T17:27:05Z

@jqmichael This happens every time I run my TF to spin up a new cluster. I have to run a second time to get the final step run.

jlforester · 2020-05-26T12:02:34Z

> @jqmichael This happens every time I run my TF to spin up a new cluster. I have to run a second time to get the final step run.

We also ran into this issue. Our workaround is a few lines of shell script in a provisioner that periodically runs curl to check the endpoint. It can take a minute or more after AWS reports the cluster as ready before it actually becomes usable.

This isn't the only AWS resource type for which we've had to implement this kind of workaround.
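
For readers who want the same polling idea outside a shell provisioner, here is a minimal Python sketch. It is not the script described above. The function names, the 300-second deadline, and the 5-second interval are assumptions, and the URL is a placeholder for the cluster endpoint. Any HTTP response, including 401 or 403, is treated as proof that the endpoint is reachable.

```python
import ssl
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the server sends any HTTP response at all."""
    # Assumption: certificate verification is skipped because this probe
    # only tests reachability; do not reuse this context for real requests.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        urllib.request.urlopen(url, timeout=timeout, context=ctx)
        return True
    except urllib.error.HTTPError:
        # A 4xx/5xx status still means something answered on the endpoint.
        return True
    except (urllib.error.URLError, OSError):
        # Connection timeout, refusal, DNS failure, and similar.
        return False


def wait_for_endpoint(url, deadline_s=300.0, interval_s=5.0,
                      probe=http_probe, sleep=time.sleep,
                      clock=time.monotonic) -> bool:
    """Poll `url` until `probe` succeeds or `deadline_s` seconds elapse."""
    start = clock()
    while True:
        if probe(url):
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

A call might look like `wait_for_endpoint("https://<cluster-endpoint>/healthz")`, with the real endpoint substituted. The `probe`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a network.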

jqmichael · 2020-05-26T20:37:57Z

We definitely need to reproduce this on the EKS side. But just curious, did the initial request ever get a TCP SYN/ACK back? (Trying to figure out whether the packet gets dropped in the middle or reaches the apiserver.)
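
One rough way to answer that question from the client side, without a packet capture, is to attempt a plain TCP connection and see how it fails. The sketch below is an illustration only. The function name `tcp_probe` and the result labels are assumptions, and the outcome is only a hint: "connected" means the three-way handshake completed, "timeout" suggests no SYN/ACK arrived in time, and "refused" means a reset came back.

```python
import socket


def tcp_probe(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Classify the outcome of a single TCP connection attempt."""
    try:
        # create_connection returns only after the handshake has completed.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No SYN/ACK within the timeout: the packet was likely dropped.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but nothing listening.
        return "refused"
    except OSError as exc:
        # DNS failure, unreachable network, and similar.
        return f"error: {exc}"
```

Running `tcp_probe("<cluster-endpoint-host>")` in a loop right after the cluster turns ACTIVE would show whether the early failures are timeouts or resets.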

barryib · 2020-10-04T19:49:15Z

@jqmichael Were you able to reproduce this? You can use @dpiddockcmp's gist to test it quickly: https://gist.github.com/dpiddockcmp/23342f3b601b3432b1ea98ab61af6ba0

jqmichael · 2020-10-13T18:07:26Z

We narrowed it down to the propagation delay in the Network Load Balancer (NLB) dataplane after the AutoScalingGroup registers targets. The NLB team is launching a campaign to reduce the propagation delay later this year. Until that campaign is finished, the workaround is to retry on the client side until the traffic goes through.
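
A client-side retry of this kind could be sketched as follows. This is a generic illustration, not code from EKS or any provider. The helper name, the six attempts, and the 2-second base delay doubling up to 30 seconds are all assumptions to tune for your own tooling.

```python
import time


def retry_on_connection_error(call, attempts=6, base_delay_s=2.0,
                              max_delay_s=30.0, sleep=time.sleep):
    """Run `call`, retrying with exponential backoff on connection errors."""
    delay = base_delay_s
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            # Only connection-level failures are retried; other exceptions
            # propagate immediately. The last failure is re-raised.
            if attempt == attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
```

The first request against a new cluster, such as the aws-auth ConfigMap update, would be wrapped as `retry_on_connection_error(lambda: do_first_request())`, where `do_first_request` is a hypothetical function standing in for whatever client call your project makes.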

prithviramesh · 2020-10-13T18:13:17Z

This also affects terminating instances: API servers on terminating control plane instances still serve requests because of (what seems to be) a de-registration delay in the NLB.

ueokande · 2020-10-26T09:10:09Z

Does this issue also happen when upgrading a cluster? In our case, we manage an EKS cluster with CloudFormation, and we sometimes find that communication with the API server is unstable immediately after an upgrade, even though the stack status is UPDATE_COMPLETE.

annyip · 2021-03-05T19:52:39Z

any updates on this?

SinghDivneet · 2022-01-07T05:26:58Z

are there any workarounds for this until we find a fix?

dpiddockcmp added the Proposed Community submitted issue label Dec 15, 2019

dpiddockcmp mentioned this issue Dec 15, 2019

Error: Post https://xxxx.eks.amazonaws.com/api/v1/namespaces/kube-system/configmaps: dial tcp xxx:443: i/o timeout terraform-aws-modules/terraform-aws-eks#621

Closed

mikestef9 added the EKS Amazon Elastic Kubernetes Service label Dec 15, 2019

barryib mentioned this issue Dec 26, 2019

Wait for kubernetes API to be ready during EKS cluster creation hashicorp/terraform-provider-aws#11426

Closed

mikestef9 added this to Researching in containers-roadmap Jan 3, 2020

satadruroy mentioned this issue Aug 17, 2020

eks aws_auth configmap management may cause race conditions SUSE/cap-terraform#84

Open

barryib mentioned this issue May 4, 2021

improvement: Use time_sleep instead of local-exec terraform-aws-modules/terraform-aws-eks#1253

Closed

2 tasks

mantoine96 mentioned this issue Jun 29, 2021

r/aws_emrcontainers_virtual_cluster - new resource hashicorp/terraform-provider-aws#20003

Merged

Zvikan mentioned this issue May 4, 2022

[Bug]: The time out in data.http.eks_cluster_readiness is too short under some circumstances aws-ia/terraform-aws-eks-blueprints#449

Closed

1 task
