
k9s running very slowly when opening namespace with 13k pods #1462

Closed
cesher opened this issue Feb 11, 2022 · 14 comments
Labels
enhancement (New feature or request), noodle, question (Further information is requested)

Comments

@cesher

cesher commented Feb 11, 2022




Describe the bug
We have a cluster with 13k pods in a single namespace. When I start k9s in that namespace, it can take minutes for the pods to load and for the terminal to become responsive again, and even then there is a lot of latency/stutter. I understand that 13k pods is not very common, and it would be totally fine to say that this is outside the scope of what k9s is designed to handle. But I saw this other ticket that seemed very similar, only for secrets. I wonder if we could do something like that here: only load metadata to list the pods and fetch the details on demand?
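
For illustration, here is a rough sketch of what I mean by "metadata only", using client-go's metadata-only client; the kubeconfig handling and the namespace name are just placeholders, not k9s's actual code:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative client setup from the local kubeconfig (not how k9s wires its client).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// The metadata client returns PartialObjectMetadata: names, labels, timestamps,
	// but no pod spec or status, so the payload for ~13k pods is far smaller than a full list.
	mc, err := metadata.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pods := schema.GroupVersionResource{Version: "v1", Resource: "pods"}
	// "huge-namespace" is a placeholder namespace name used only for illustration.
	list, err := mc.Resource(pods).Namespace("huge-namespace").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, item := range list.Items {
		fmt.Println(item.Name, item.CreationTimestamp)
	}
}
```

Full details would then be fetched on demand, e.g. a single Get for the pod the user drills into.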

To Reproduce
Steps to reproduce the behavior:

  1. Open k9s
  2. Navigate to the namespace with 13k pods
  3. Wait about a minute
  4. See pods
  5. /<pod-name-filter>
  6. Wait a couple of minutes
  7. See the handful of pods I care about; if the filtered set is small, k9s works fine again, but if not there is still latency

Expected behavior
K9s does not slow down.

Screenshots
It's my employer's cluster, so I doubt I can post a picture here. Let me know if there is something else I can provide.

Versions (please complete the following information):

  • OS: macOS Big Sur
  • K9s: v0.25.18
  • K8s: v1.19.7

Additional context
The cluster is running on AWS EKS

@derailed added the enhancement label on Feb 13, 2022
@yonatang

A very similar phenomenon happens when viewing a very large list of namespaces in a cluster (18k+ namespaces). The UI becomes unresponsive, to the point that it is unusable.

@toddljones

Seems this could possibly be solved with some pagination?

@Semmu

Semmu commented Oct 19, 2022

Similar issue here as well, even though our cluster is much smaller (~50 namespaces, ~600 pods, ~300 Helm installations).

It is especially slow when doing Helm operations.

@gotosre

gotosre commented Jan 17, 2023

+1. Could this be optimized to use multiple cores?

@max-sixty

Does anyone have any short-term solutions for this?

I'm running with 10K jobs, each with a pod, and it's becoming faster to use kubectl directly.

I would be very happy to have data refreshed much less frequently (e.g. on request, or every few minutes), but be able to search for jobs & pods quickly.

@derailed added the question label on Dec 10, 2023
@derailed
Owner

@cesher Thank you all for piping in!
Are you guys still experiencing issues with load lags?

Using the latest k9s rev v0.29.1.

  1. Loaded ~10k+ namespaces
  2. Loaded ~10k+ pods with metrics server enabled
    Viewing/Filtering seems to remain responsive (~1s).

Please add details here if that's not the case. Thank you!

@GMartinez-Sisti
Sponsor

Hi @derailed, I've been having this issue since roughly v0.31.5. Even with just 300 namespaces the UI slows down considerably (GKE and EKS). When typing a filter I can see my characters appearing one by one, every 5 seconds or so. Even after jumping to a namespace's pods the slowness doesn't go away, and I have to kill k9s and restart it with the namespace passed as an argument for it to be usable.

Anything I can do to help, please let me know!

@derailed
Owner

@GMartinez-Sisti Very kind! Thank you Gabriel!
I'd like to resolve some of these issues, or at least see how we can improve here, but I don't have access to larger clusters at this juncture.
I can instrument the code and toss it over the fence to see where the bottlenecks might be, or perhaps (best) we could do a joint session on your clusters and see if we can figure out the root cause...

@GMartinez-Sisti
Sponsor

GMartinez-Sisti commented Feb 18, 2024

I was wondering if just having a lot of objects would be enough to trigger it, and was able to reproduce it locally:

→ kind create cluster --name test
→ for i in {1..300}; do kubectl create ns "ns-$i"; done

If you try this, then open k9s and navigate to namespaces, it will be extremely slow, to the point of being unusable.

Hope it helps 😄

PS: happy to test dev builds! Just let me know.

@derailed
Owner

@GMartinez-Sisti Very kind! Thank you Gabriel! Dang, you are correct!
I reran the tests I had earlier on this ticket and things did change for the worse recently ;(
It had loaded 10k namespaces previously with no hitch ;(
I'll take another pass...

@alexnovak

Hey @derailed! Thanks for being so responsive to these comments thus far. While looking through the source code, I noticed that the ListOptions we provide whenever we try to get all objects tend to be pretty sparse.

For example (and correct me if I'm wrong here), I think this is the line we hit most commonly when we get the list of objects within a namespace, while we use this one for namespace collection.

In both of these cases, we provide an "empty" ListOptions for the call. Lists are typically pretty expensive for Kubernetes to perform, since it has to do a significant amount of scanning against etcd to collect all the data before passing it back up. Based on my understanding, there are a few ways we could make this more responsive (a sketch of both follows the list).

  1. Add ResourceVersion: "0" and ResourceVersionMatch: "NotOlderThan" to these list calls (docs here). This will cause Kubernetes to return whatever it has in its own storage cache instead of having to perform a passthrough to etcd. Experimenting locally with a cluster that has a large number of objects (a little over 3k) under a namespace, I see that this makes the page load much faster, from 10 seconds to 4 seconds. This does, however, have a drawback: since the data comes from the Kubernetes API server's cache, there's a chance of inconsistency between the state returned here and what's present in etcd. Since k9s isn't guaranteed to match the exact state of the cluster anyway (I think the update cadence is once every two seconds?), this might not be that big of a deal.
  2. Add pagination to the list calls. If we were to set a limit on the number of responses in our ListOptions, the API would respond more quickly with batches of results. List calls that go through to etcd can still be very expensive for the server, but this would at least make the application feel more responsive. There would probably need to be some unfun refactoring in the codebase to handle results coming over a channel instead of as the return value. This can be done in addition to the solution in 1 for increased zippiness.
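
Roughly, here's what both options could look like with client-go. This is only a sketch against a standard clientset with a placeholder namespace name, not the actual k9s code path:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative client setup from the local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ns := "huge-namespace" // placeholder namespace name

	// Option 1: serve the list from the API server's watch cache instead of a
	// quorum read against etcd. The result may lag the true cluster state slightly.
	cached, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{
		ResourceVersion:      "0",
		ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("cached list:", len(cached.Items), "pods")

	// Option 2: paginate with Limit/Continue so results arrive in chunks and the
	// UI can start rendering before the whole list is assembled. (Shown separately
	// because the server may ignore Limit for requests answered from its cache.)
	opts := metav1.ListOptions{Limit: 500}
	total := 0
	for {
		page, err := client.CoreV1().Pods(ns).List(context.TODO(), opts)
		if err != nil {
			panic(err)
		}
		total += len(page.Items)
		if page.Continue == "" {
			break
		}
		opts.Continue = page.Continue
	}
	fmt.Println("paginated list:", total, "pods")
}
```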

@derailed
Owner

derailed commented Mar 1, 2024

@alexnovak Very kind! Thank you so much for this research Alex!!

You are absolutely correct. I've spent a lot of time on this cycle working on improving perf for the next drop and accumulated additional gray hairs in the process...

This is indeed tricky with the current state of affairs since all filtering/parsing is done mostly client side and thus expects full lists.
I was able to reduce the lags reported above significantly, to 1-3 secs on initial load and then milliseconds thereafter, with 10k entries.
But this book will indeed remain open, as more work needs to happen, especially in light of clusters getting bigger to avoid multi-cluster costs. Hence more strain on k9s ;(

Though I don't think k9s will ever accommodate the 13k-pods-in-a-single-namespace scenario, with everyone's help we can make it more bearable.
Though pagination would indeed help, TBH I am not a fan ;( I think it's clunky, and in the end what is the point of plowing through a 10k+ resource list?
Hence I feel we can improve things by trying to narrow the result set as early as possible, to avoid additional lags.

Let's see if we're a bit happier with the next drop, and we will dive into individual use cases thereafter...

@GMartinez-Sisti
Sponsor

Hi @derailed! Thank you for the great explanation.

in the end what is the point of plowing through a 10k+ resource list?

I can add my opinion on this: with so many resources, and assuming they are pods, my goal is usually to check whatever is not healthy; otherwise it is indeed unmanageable. So maybe we could have an option to filter for "everything except Ready" when there are more than x pods.
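
For example, something along these lines, pushed to the server with a field selector. This is just a sketch with a hypothetical helper and a standard client-go clientset; note that Kubernetes doesn't expose the Ready condition as a pod field selector, so filtering on status.phase only approximates "except Ready":

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ListUnhealthyPods is a hypothetical helper, not part of k9s: it asks the API
// server only for pods whose phase is not Running, so the large healthy majority
// never crosses the wire. Running-but-not-Ready pods are not caught by this
// approximation, since readiness is a condition rather than a field selector.
func ListUnhealthyPods(ctx context.Context, client kubernetes.Interface, ns string) ([]corev1.Pod, error) {
	list, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase!=Running",
	})
	if err != nil {
		return nil, err
	}
	return list.Items, nil
}
```

The "more than x pods" threshold would then just be a client-side decision about which ListOptions to send.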

@derailed
Owner

derailed commented Mar 2, 2024

@GMartinez-Sisti Thank you for the feedback Gabriel!

Let's see if we're happier with v0.32.0... then we can track perf issues in separate tickets.

@derailed closed this as completed on Mar 2, 2024