
[EKS] [bug] Removed kube-apiservers return 401 Unauthorized instead of closing connection #1810

Open
mhulscher opened this issue Aug 18, 2022 · 2 comments
Labels
EKS Amazon Elastic Kubernetes Service Proposed Community submitted issue

Comments

mhulscher commented Aug 18, 2022

Around the time of the EKS 1.23 release we started noticing that EKS scales its kube-apiservers out and in more aggressively; we see them being replaced more frequently. We also noticed that when a kube-apiserver is removed (it no longer appears in kubectl get endpoints kubernetes -n default), it does not close its existing connections. Instead, whenever a client makes a request to the removed kube-apiserver over an existing connection, the kube-apiserver returns 401 Unauthorized. This seems to happen every time a kube-apiserver is scaled down. Applications may not treat this 401 Unauthorized as a signal to re-establish their connection to the kube-apiserver; instead, they may conclude that certain API resources are unavailable and act accordingly. This happens, for example, with the latest release of cilium.

I believe that whenever a kube-apiserver is removed as an endpoint, it should also immediately close all of its client connections; forcing the clients to establish a new connection.
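Until the server side closes connections on removal, a client can defend itself by treating an unexpected 401 on an established connection as a stale-endpoint signal. The following is a minimal sketch of that idea; ReconnectingClient, the connection-factory callable, and the request/(status, body) shape are all illustrative assumptions, not part of any real Kubernetes client library.

```python
class StaleEndpointError(Exception):
    """Raised when even a freshly opened connection still returns 401."""


class ReconnectingClient:
    """Hypothetical wrapper: on a 401 from a cached connection, reconnect and retry once."""

    def __init__(self, connect):
        # `connect` is a factory returning a connection object exposing
        # .request(path) -> (status, body). In practice this is where a real
        # client would re-resolve the API server endpoint.
        self._connect = connect
        self._conn = None

    def request(self, path):
        if self._conn is None:
            self._conn = self._connect()
        status, body = self._conn.request(path)
        if status == 401:
            # The apiserver behind this connection may have been removed from
            # the endpoint set; open a new connection and retry exactly once.
            self._conn = self._connect()
            status, body = self._conn.request(path)
            if status == 401:
                # A 401 on a fresh connection is a genuine auth failure.
                raise StaleEndpointError(path)
        return status, body
```

Retrying only once on a fresh connection keeps a real credential problem (401 on a new connection, too) from turning into an infinite reconnect loop.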

Related issue: cilium/cilium#20915

@mhulscher mhulscher added the Proposed Community submitted issue label Aug 18, 2022
@mikestef9 mikestef9 added the EKS Amazon Elastic Kubernetes Service label Aug 18, 2022
mhulscher commented Sep 5, 2022

I wanted to provide a small update. Since about a week ago, kube-apiservers that are in the process of being removed still accept client requests even though their connection to etcd is already gone. I am also seeing in-flight requests that are not handled gracefully. This leads to clients reporting errors.

Clients can work around this by retrying failed requests, but it strikes me as "not nice" that kube-apiserver is still accepting requests, even though it can't possibly handle them. As a result, we are now forced to add more and more retry logic to our e2e testing of our EKS clusters. This is a good practice regardless, but I think EKS can improve here.
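The retry workaround described above can be sketched as a small helper that retries a call when it fails with a status the caller has decided to treat as transient during apiserver rotation. ApiError, with_retries, and the chosen status set are assumptions for illustration, not part of any real client.

```python
import time


class ApiError(Exception):
    """Illustrative stand-in for an HTTP error returned by the API server."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def with_retries(fn, attempts=3, backoff_s=0.0, retryable=(401, 500, 503)):
    """Call fn(); on an ApiError with a retryable status, retry up to `attempts` times.

    Non-retryable statuses are re-raised immediately. 401 is included here only
    because of the rotation behavior described in this issue; normally a 401
    would not be worth retrying.
    """
    last = None
    for i in range(attempts):
        try:
            return fn()
        except ApiError as err:
            if err.status not in retryable:
                raise
            last = err
            if i < attempts - 1:
                time.sleep(backoff_s * (2 ** i))  # exponential backoff between tries
    raise last
```

For example, with_retries(lambda: client.create_namespace(...)) would absorb a 401 or two from an apiserver that is mid-rotation while still surfacing persistent failures.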

This, too, only started happening around the time of the EKS 1.23 release.

When trying to create a namespace:
[screenshot of error output omitted]

Trying to delete a pod:
[screenshot of error output omitted]

Applying a statefulset:
[screenshot of error output omitted]

kanor1306 commented Jan 26, 2023

It would be nice to get an explanation from the EKS team as to whether this is expected behavior when dealing with the EKS kube-apiserver. It is a strange situation to have to retry 401s to work around availability issues with an API that is rotating or scaling.

gibsondan added a commit to dagster-io/dagster that referenced this issue Feb 3, 2023
Summary:
401 would typically not be a retryable error, but a user reported hitting it when they scaled up their cluster, and aws/containers-roadmap#1810 seems to suggest retrying as a workaround. The downside of retrying on a 401 seems fairly low as well. Open to push-back on this though.

Test Plan: Existing BK test coverage of the k8s client
