
[EKS] [bug] Removed kube-apiservers return 401 Unauthorized instead of closing connection #1810

Open
mhulscher opened this issue Aug 18, 2022 · 2 comments
Labels
EKS Amazon Elastic Kubernetes Service Proposed Community submitted issue

Comments

mhulscher commented Aug 18, 2022

Around the time of the EKS 1.23 release we started noticing that EKS scales its kube-apiservers out and in more aggressively; we see them being replaced more frequently. We also noticed that when a kube-apiserver is removed (it no longer appears in kubectl get endpoints kubernetes -n default), it does not close its existing connections. Instead, whenever a client makes a request to the removed kube-apiserver over an existing connection, the kube-apiserver returns 401 Unauthorized. This seems to happen every time a kube-apiserver is scaled down. Applications may not treat this 401 Unauthorized as a signal to re-establish their connection to the kube-apiserver; instead, they may conclude that certain API resources are unavailable and act accordingly. This happens, for example, with the latest release of cilium.

I believe that whenever a kube-apiserver is removed as an endpoint, it should also immediately close all of its client connections; forcing the clients to establish a new connection.
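Until the server side closes connections on removal, a client can defend itself by treating an unexpected 401 on an established connection as a stale-endpoint signal. The following is a minimal sketch of that idea; ReconnectingClient, the connection-factory callable, and the request/(status, body) shape are all illustrative assumptions, not part of any real Kubernetes client library.

```python
class StaleEndpointError(Exception):
    """Raised when even a freshly opened connection still returns 401."""


class ReconnectingClient:
    """Hypothetical wrapper: on a 401 from a cached connection, reconnect and retry once."""

    def __init__(self, connect):
        # `connect` is a factory returning a connection object exposing
        # .request(path) -> (status, body). In practice this is where a real
        # client would re-resolve the API server endpoint.
        self._connect = connect
        self._conn = None

    def request(self, path):
        if self._conn is None:
            self._conn = self._connect()
        status, body = self._conn.request(path)
        if status == 401:
            # The apiserver behind this connection may have been removed from
            # the endpoint set; open a new connection and retry exactly once.
            self._conn = self._connect()
            status, body = self._conn.request(path)
            if status == 401:
                # A 401 on a fresh connection is a genuine auth failure.
                raise StaleEndpointError(path)
        return status, body
```

Retrying only once on a fresh connection keeps a real credential problem (401 on a new connection, too) from turning into an infinite reconnect loop.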

Related issue: cilium/cilium#20915

@mhulscher mhulscher added the Proposed Community submitted issue label Aug 18, 2022
@mikestef9 mikestef9 added the EKS Amazon Elastic Kubernetes Service label Aug 18, 2022
mhulscher commented Sep 5, 2022

I wanted to provide a small update. Since about a week ago, kube-apiservers that are in the process of being removed still accept client requests even though their connection to etcd is already gone. I am also seeing in-flight requests that are not handled gracefully. This leads to clients reporting errors.

Clients can work around this by retrying failed requests, but it strikes me as "not nice" that kube-apiserver is still accepting requests, even though it can't possibly handle them. As a result, we are now forced to add more and more retry logic to our e2e testing of our EKS clusters. This is a good practice regardless, but I think EKS can improve here.
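The retry workaround described above can be sketched as a small helper that retries a call when it fails with a status the caller has decided to treat as transient during apiserver rotation. ApiError, with_retries, and the chosen status set are assumptions for illustration, not part of any real client.

```python
import time


class ApiError(Exception):
    """Illustrative stand-in for an HTTP error returned by the API server."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def with_retries(fn, attempts=3, backoff_s=0.0, retryable=(401, 500, 503)):
    """Call fn(); on an ApiError with a retryable status, retry up to `attempts` times.

    Non-retryable statuses are re-raised immediately. 401 is included here only
    because of the rotation behavior described in this issue; normally a 401
    would not be worth retrying.
    """
    last = None
    for i in range(attempts):
        try:
            return fn()
        except ApiError as err:
            if err.status not in retryable:
                raise
            last = err
            if i < attempts - 1:
                time.sleep(backoff_s * (2 ** i))  # exponential backoff between tries
    raise last
```

For example, with_retries(lambda: client.create_namespace(...)) would absorb a 401 or two from an apiserver that is mid-rotation while still surfacing persistent failures.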

This, too, only started happening around the time of the EKS 1.23 release.

When trying to create a namespace:
[screenshot of error output omitted]

Trying to delete a pod:
[screenshot of error output omitted]

Applying a statefulset:
[screenshot of error output omitted]

kanor1306 commented Jan 26, 2023

It would be nice to get an explanation from the EKS team as to whether this is expected behavior when dealing with the EKS kube-apiserver. It is a strange situation to have to retry 401s to work around availability issues with an API that is rotating or scaling.

gibsondan added a commit to dagster-io/dagster that referenced this issue Feb 3, 2023
Summary:
401 would typically not be a retryable error, but a user reported hitting it when they scaled up their cluster, and aws/containers-roadmap#1810 seems to suggest retrying as a workaround. The downside of retrying on a 401 seems fairly low as well. Open to push-back on this though.

Test Plan: Existing BK test coverage of the k8s client
