
[Apiserver-Proxy] Readiness probe fails #5541

Closed
timuthy opened this issue Mar 9, 2022 · 4 comments · Fixed by #5544
Labels
area/networking Networking related kind/bug Bug

Comments

@timuthy
Contributor

timuthy commented Mar 9, 2022

How to categorize this issue?

/area networking
/kind bug

What happened:
We regularly see at least one apiserver-proxy pod in shoot clusters that does not become ready and remains in this state until the pod is deleted or restarted.

Events:
  Type     Reason     Age                      From     Message
  ----     ------     ----                     ----     -------
  Warning  Unhealthy  105s (x9659 over 5h25m)  kubelet  Readiness probe failed: Get "http://10.250.0.19:16910/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
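
The kubelet check from the event above can be reproduced by hand with curl. A minimal sketch; the endpoint below defaults to a localhost placeholder and must be replaced with the pod IP and readiness port from the event when run inside the cluster network:

```shell
# Reproduce the kubelet readiness check by hand (sketch).
# The default endpoint is a local placeholder; substitute the pod IP and
# port from the event, e.g. 10.250.0.19:16910, when probing in-cluster.
ENDPOINT="${ENDPOINT:-127.0.0.1:16910}"
if curl -fsS --max-time 1 "http://${ENDPOINT}/ready"; then
  echo "probe succeeded"
else
  echo "probe failed"
fi
```

With `--max-time 1` this roughly mirrors the probe timeout behavior seen in the event (`Client.Timeout exceeded while awaiting headers`).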

What you expected to happen:
The apiserver-proxy pod to become ready.

How to reproduce it (as minimally and precisely as possible):
Not certain at the moment.

Anything else we need to know?:
The container is heavily CPU-throttled, so the /ready endpoint is probably unable to respond within the probe timeout.

(screenshot: CPU throttling metrics of the apiserver-proxy container)

Interestingly, after the pod is deleted the new one is not throttled.
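
Throttling like this can be confirmed from the container's cgroup CPU statistics. A sketch, assuming standard cgroup mount paths (cgroup v2 exposes cpu.stat at the cgroup root, cgroup v1 under the cpu controller):

```shell
# Show CPU throttling counters for the current cgroup (sketch; the paths
# below are assumptions and depend on the node's cgroup v1/v2 setup).
for f in /sys/fs/cgroup/cpu.stat /sys/fs/cgroup/cpu/cpu.stat; do
  if [ -r "$f" ]; then
    echo "== $f =="
    grep -E 'nr_periods|nr_throttled|throttled' "$f" || true
  fi
done
echo "cgroup check complete"
```

A steadily growing nr_throttled relative to nr_periods indicates the container keeps hitting its CPU quota.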

Environment:

  • Gardener version: v1.41.2
  • Kubernetes version (use kubectl version): v1.21.7
@timuthy timuthy added the kind/bug Bug label Mar 9, 2022
@gardener-robot gardener-robot added the area/networking Networking related label Mar 9, 2022
@timuthy
Contributor Author

timuthy commented Mar 9, 2022

/cc @DockToFuture @ScheererJ

@ScheererJ
Contributor

The Envoy proxy in the proxy container does not seem to have come up correctly, i.e. neither the port for forwarding to the kube-apiserver nor the metrics port is open.

@ScheererJ
Contributor

ScheererJ commented Mar 9, 2022

It looks like the Envoy proxy hangs in some kind of busy loop, as it continuously consumes CPU time. This also explains why it is throttled.

@ScheererJ
Contributor

strace shows that the envoy proxy container is repeatedly calling sched_setaffinity.

sched_setaffinity(0, 128, [12])         = 0
 > /usr/glibc-compat/lib/libc.so.6() [0xe5af9]
 > /usr/local/bin/envoy() [0x3a9470a]
 > /usr/local/bin/envoy() [0x3a6fb96]
 > /usr/local/bin/envoy() [0x3a6fa00]
 > /usr/local/bin/envoy() [0x3a6e89d]
 > /usr/local/bin/envoy() [0x3a6e426]
 > /usr/local/bin/envoy() [0x3a6bc7a]
 > /usr/local/bin/envoy() [0x3b075ae]
 > /usr/local/bin/envoy() [0x3a6f534]
 > /usr/local/bin/envoy() [0x3b04e7d]
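
A trace with stack frames like the one above can be captured by attaching strace to the running process. A sketch, assuming a process named envoy on the node and an strace build with stack-unwinding support for -k:

```shell
# Trace sched_setaffinity calls of a running envoy with stack traces.
# Sketch: "envoy" as the process name is an assumption; -k requires an
# strace built with stack-unwinding (libunwind/libdw) support.
PID=$(pgrep -x envoy | head -n 1)
if [ -n "$PID" ]; then
  strace -f -k -e trace=sched_setaffinity -p "$PID"
else
  echo "no envoy process found"
fi
```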

gdb paints a similar picture.

(gdb) bt
#0  0x00007f9c81b7daf9 in sched_setaffinity () from target:/usr/glibc-compat/lib/libc.so.6
#1  0x000055f5f9c422b5 in ?? ()
#2  0x000055f5f9c4270a in ?? ()
#3  0x000055f5f9c1db96 in ?? ()
#4  0x000055f5f9c1da00 in ?? ()
#5  0x000055f5f9c1c89d in ?? ()
#6  0x000055f5f9c1c426 in ?? ()
#7  0x000055f5f9c19c7a in ?? ()
#8  0x000055f5f9cb55ae in ?? ()
#9  0x000055f5f9c1d534 in ?? ()
#10 0x000055f5f9cb2e7d in ?? ()
#11 0x00007f9c81abec1c in __libc_start_main () from target:/usr/glibc-compat/lib/libc.so.6
#12 0x000055f5f765ae6a in ?? ()

The envoy code base does not seem to have a direct call to sched_setaffinity.

We will address the issue with a liveness probe and restart the container if it does not come up in 30 seconds (#5544).
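
For reference, a liveness probe along those lines could look like the sketch below; the field values are illustrative assumptions, not the exact configuration from #5544:

```yaml
# Illustrative sketch only -- not the exact values from #5544.
livenessProbe:
  httpGet:
    path: /ready
    port: 16910            # readiness port from the event above
  initialDelaySeconds: 30  # give envoy time to come up
  periodSeconds: 10
  failureThreshold: 3      # restart after sustained failures
```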
