
k9s running very slowly when opening namespace with 13k pods #1462

Closed
cesher opened this issue Feb 11, 2022 · 14 comments
Labels
enhancement (New feature or request), noodle, question (Further information is requested)

Comments

@cesher

cesher commented Feb 11, 2022




Describe the bug
We have a cluster with 13k pods in a single namespace. When I start k9s in that namespace, it can take minutes for the pods to load and for the terminal to become responsive again, and even then there is a lot of latency/stutter. I understand that 13k pods is not very common, and it would be totally fine to say that this is outside the scope of what k9s is designed to handle. But I saw this other ticket that seemed very similar, only for secrets. I wonder if we could do something like that here: only load metadata to list the pods and fetch the details on demand?
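
For illustration, here is a rough sketch of what I mean by "metadata only", using client-go's metadata-only client; the kubeconfig handling and the namespace name are just placeholders, not k9s's actual code:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative client setup from the local kubeconfig (not how k9s wires its client).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// The metadata client returns PartialObjectMetadata: names, labels, timestamps,
	// but no pod spec or status, so the payload for ~13k pods is far smaller than a full list.
	mc, err := metadata.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pods := schema.GroupVersionResource{Version: "v1", Resource: "pods"}
	// "huge-namespace" is a placeholder namespace name used only for illustration.
	list, err := mc.Resource(pods).Namespace("huge-namespace").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, item := range list.Items {
		fmt.Println(item.Name, item.CreationTimestamp)
	}
}
```

Full details would then be fetched on demand, e.g. a single Get for the pod the user drills into.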

To Reproduce
Steps to reproduce the behavior:

  1. Open k9s
  2. Navigate to the namespace with 13k pods
  3. Wait about a minute
  4. See pods
  5. /<pod-name-filter>
  6. Wait a couple of minutes
  7. See the handful of pods I care about; if the filtered set is small, k9s works fine again, but if not there is still latency

Expected behavior
K9s does not slow down.

Screenshots
It's my employer's cluster, so I doubt I can post a picture here. Let me know if there is something else I can provide.

Versions (please complete the following information):

  • OS: macOS Big Sur
  • K9s: v0.25.18
  • K8s: v1.19.7

Additional context
The cluster is running on AWS EKS

@derailed added the enhancement label on Feb 13, 2022
@yonatang

A very similar phenomenon happens when viewing a very large list of namespaces in a cluster (18k+ namespaces). The UI becomes unresponsive, to the point that it is unusable.

@toddljones

Seems this could possibly be solved with some pagination?

@Semmu

Semmu commented Oct 19, 2022

Similar issue here as well, even though our cluster is much smaller (~50 namespaces, ~600 pods, ~300 Helm installations).

It is especially slow when doing Helm operations.

@gotosre

gotosre commented Jan 17, 2023

+1. Could this be optimized to use multiple cores?

@max-sixty

Does anyone have any short-term solutions for this?

I'm running with 10K jobs, each with a pod, and it's becoming faster to use kubectl directly.

I would be very happy to have data refreshed much less frequently (e.g. on request, or every few minutes), but be able to search for jobs & pods quickly.

@derailed added the question label on Dec 10, 2023
@derailed
Owner

@cesher Thank you all for piping in!
Are you guys still experiencing issues with load lags?

Using the latest k9s rev v0.29.1.

  1. Loaded ~10k+ namespaces
  2. Loaded ~10k+ pods with metrics server enabled
    Viewing/Filtering seems to remain responsive (~1s).

Please add details here if that's not the case. Thank you!

@GMartinez-Sisti
Sponsor

Hi @derailed, I've been having this issue since roughly v0.31.5. Even with just 300 namespaces the UI slows down considerably (GKE and EKS). When typing a filter I can see my characters appearing one by one, every 5 seconds or so. Even after jumping to a namespace's pods the slowness doesn't go away, and I have to kill k9s and restart it with the namespace passed as an argument for it to be usable.

Anything I can do to help, please let me know!

@derailed
Owner

@GMartinez-Sisti Very kind! Thank you Gabriel!
I'd like to resolve some of these issues, or at least see how we can improve here, but I don't have access to larger clusters at this juncture.
I can instrument the code and toss it over the fence to see where the bottlenecks might be, or perhaps (best) we could do a joint session on your clusters and see if we can figure out the root cause...

@GMartinez-Sisti
Sponsor

GMartinez-Sisti commented Feb 18, 2024

I was wondering if just having a lot of objects would be enough to trigger it, and was able to reproduce it locally:

→ kind create cluster --name test
→ for i in {1..300}; do kubectl create ns "ns-$i"; done

If you try this, then open k9s and navigate to namespaces, it will be extremely slow, to the point of being unusable.

Hope it helps 😄

PS: happy to test dev builds! Just let me know.

@derailed
Owner

@GMartinez-Sisti Very kind! Thank you Gabriel! Dang, you are correct!
I reran the tests I had earlier on this ticket and things did change for the worse recently ;(
It had loaded 10k namespaces previously with no hitch ;(
I'll take another pass...

@alexnovak

Hey @derailed! Thanks for being so responsive to these comments thus far. While looking through the source code, I noticed that the ListOptions we provide whenever we try to get all objects tend to be pretty sparse.

For example (and correct me if I'm wrong here), I think this is the line we hit most commonly when we get the list of objects within a namespace, while we use this one for namespace collection.

In both of these cases, we provide an "empty" ListOptions for the call. Lists are typically pretty expensive for Kubernetes to perform, since it has to do a significant amount of scanning against etcd to collect all the data before passing it back up. Based on my understanding, there are a few ways we could make this more responsive (a sketch of both follows the list).

  1. Add ResourceVersion: "0" and ResourceVersionMatch: "NotOlderThan" to these list calls (docs here). This will cause Kubernetes to return whatever it has in its own storage cache instead of having to perform a passthrough to etcd. Experimenting locally with a cluster that has a large number of objects (a little over 3k) under a namespace, I see that this makes the page load much faster, from 10 seconds to 4 seconds. This does, however, have a drawback: since the data comes from the Kubernetes API server's cache, there's a chance of inconsistency between the state returned here and what's present in etcd. Since k9s isn't guaranteed to match the exact state of the cluster anyway (I think the update cadence is once every two seconds?), this might not be that big of a deal.
  2. Add pagination to the list calls. If we were to set a limit on the number of responses in our ListOptions, the API would respond more quickly with batches of results. List calls that go through to etcd can still be very expensive for the server, but this would at least make the application feel more responsive. There would probably need to be some unfun refactoring in the codebase to handle results coming over a channel instead of as the return value. This can be done in addition to the solution in 1 for increased zippiness.
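
Roughly, here's what both options could look like with client-go. This is only a sketch against a standard clientset with a placeholder namespace name, not the actual k9s code path:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative client setup from the local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ns := "huge-namespace" // placeholder namespace name

	// Option 1: serve the list from the API server's watch cache instead of a
	// quorum read against etcd. The result may lag the true cluster state slightly.
	cached, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{
		ResourceVersion:      "0",
		ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("cached list:", len(cached.Items), "pods")

	// Option 2: paginate with Limit/Continue so results arrive in chunks and the
	// UI can start rendering before the whole list is assembled. (Shown separately
	// because the server may ignore Limit for requests answered from its cache.)
	opts := metav1.ListOptions{Limit: 500}
	total := 0
	for {
		page, err := client.CoreV1().Pods(ns).List(context.TODO(), opts)
		if err != nil {
			panic(err)
		}
		total += len(page.Items)
		if page.Continue == "" {
			break
		}
		opts.Continue = page.Continue
	}
	fmt.Println("paginated list:", total, "pods")
}
```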

@derailed
Owner

derailed commented Mar 1, 2024

@alexnovak Very kind! Thank you so much for this research Alex!!

You are absolutely correct. I've spent a lot of time on this cycle working on improving perf for the next drop and accumulated additional gray hairs in the process...

This is indeed tricky with the current state of affairs since all filtering/parsing is done mostly client side and thus expects full lists.
I was able to reduce the lags reported above significantly, to 1-3 secs on initial load and then milliseconds thereafter, with 10k entries.
But this book will indeed remain open, as more work needs to happen, especially in light of clusters getting bigger to avoid multi-cluster costs. Hence more strain on k9s ;(

Though I don't think k9s will ever accommodate the 13k-pods-in-a-single-namespace scenario, with everyone's help we can make it more bearable.
Though pagination would indeed help, TBH I am not a fan ;( I think it's clunky, and in the end what is the point of plowing through a 10k+ resource list?
Hence I feel we can improve things by trying to narrow the result set as early as possible, to avoid additional lags.

Let's see if we're a bit happier with the next drop, and we will dive into individual use cases thereafter...

@GMartinez-Sisti
Sponsor

Hi @derailed! Thank you for the great explanation.

in the end what is the point of plowing through a 10k+ resource list?

I can add my opinion on this: with so many resources, and assuming they are pods, my goal is usually to check whatever is not healthy; otherwise it is indeed unmanageable. So maybe we could have an option to filter for "everything except Ready" when there are more than x pods.
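
For example, something along these lines, pushed to the server with a field selector. This is just a sketch with a hypothetical helper and a standard client-go clientset; note that Kubernetes doesn't expose the Ready condition as a pod field selector, so filtering on status.phase only approximates "except Ready":

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ListUnhealthyPods is a hypothetical helper, not part of k9s: it asks the API
// server only for pods whose phase is not Running, so the large healthy majority
// never crosses the wire. Running-but-not-Ready pods are not caught by this
// approximation, since readiness is a condition rather than a field selector.
func ListUnhealthyPods(ctx context.Context, client kubernetes.Interface, ns string) ([]corev1.Pod, error) {
	list, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase!=Running",
	})
	if err != nil {
		return nil, err
	}
	return list.Items, nil
}
```

The "more than x pods" threshold would then just be a client-side decision about which ListOptions to send.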

@derailed
Owner

derailed commented Mar 2, 2024

@GMartinez-Sisti Thank you for the feedback Gabriel!

Let's see if we're happier with v0.32.0... then we can track perf issues in separate tickets.

@derailed closed this as completed on Mar 2, 2024