CFP: move namespace watcher to cilium-operator #29127
Comments
Neat, that's definitely a large-scale problem :). I would suggest bringing this up in the weekly community meeting, to see if there's wider interest in pursuing such an enhancement — so that you don't have to carry all the load on your own.
@julianwiedmann Do we have the community meeting next week, given that it's close to Thanksgiving?
Another option is:
I have some clarifying questions to better understand your scale and the issues you are facing; I just want to make sure that any of the proposed ideas would actually solve your case.
I understand that the k8s control plane was suffering without a FlowSchema, and that with a FlowSchema the agent was taking a long time to get ready. I wonder whether giving more concurrency shares to the corresponding priority level would help? It's always a tradeoff between cpu/memory usage of the k8s control plane and how long it takes Cilium to get ready.
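For concreteness, a dedicated priority level for cilium-agent could look roughly like the fragment below. This is a sketch only: the names, share count, queue sizes, and the `cilium` service account are placeholder assumptions, and the `v1beta2` API version matches the 1.23 cluster mentioned later in the thread (newer clusters use `v1beta3`/`v1`, where `assuredConcurrencyShares` became `nominalConcurrencyShares`).

```yaml
# Hypothetical sketch — tune shares/queues for your cluster, not recommended values.
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: PriorityLevelConfiguration
metadata:
  name: cilium-agent
spec:
  type: Limited
  limited:
    assuredConcurrencyShares: 40   # raise to let agents list faster, at higher control-plane cost
    limitResponse:
      type: Queue
      queue:
        queues: 64
        queueLengthLimit: 50
        handSize: 6
---
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: FlowSchema
metadata:
  name: cilium-agent
spec:
  priorityLevelConfiguration:
    name: cilium-agent
  matchingPrecedence: 500
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: cilium            # assumed service account name/namespace
            namespace: kube-system
      resourceRules:
        - verbs: ["list", "watch", "get"]
          apiGroups: [""]
          resources: ["namespaces"]
          clusterScope: true
```

Raising `assuredConcurrencyShares` shifts the tradeoff toward faster agent readiness at the cost of more apiserver CPU/memory spent serving those lists.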
Going back to your proposals:
All these proposals assume, though, that namespace lists are the only issue — but my feeling is that we would see a similar issue with CiliumEndpoint CRD lists, for example.
We add some metadata to the namespace object — for example, annotations that tell the monitoring system to gather specific metrics or logs for the pods in that namespace. We also extended the user isolation model, so we add related indicators to the namespace annotations.
We are using 1.23.6. I believe the improvement you mentioned is "gzip compression switched from level 4 to level 1 to improve large list call latencies in exchange for higher network bandwidth usage (10-50% higher)". We can upgrade or cherry-pick this change to get the benefit.
The number of pods in our cluster is also large, so for performance reasons we are using high-scale-ipcache mode, which doesn't store pod IPs in the ipcache. That means there's no need to watch all the CiliumEndpoints in high-scale-ipcache mode; only CiliumEndpoints with well-known identities are needed (code).
Definitely, giving more concurrency shares can speed up getting ready — it's a tradeoff. However, by minimizing unnecessary API load, both the client and server sides benefit. That would be a win-win.
We expect the watch to be resumable, which would greatly reduce the load while restarting an apiserver instance. But it cannot be resumed successfully in our environment; I analyzed this issue.
If the client connects directly, resumption works; but if the client connects to the load-balancer VIP of the apiserver, the watch ends with an error (a very short watch).
client-go treats the "connection refused" error as retriable, so it won't trigger re-listing. Anyway, this is something I will work on fixing on the k8s side.
We have dedicated master nodes running etcd, apiserver, kube-controller-manager, and kube-scheduler. There are 5-9 master nodes per cluster; each has about 64-80 CPU cores and around 370GB of memory.
We seldom modify namespace labels, but we haven't restricted customers from doing so. Additionally, as mentioned, we extended the user isolation model, so there are cases where we do modify namespace labels during account transfers. Such occurrences are infrequent, however.
Yes, that's the exact improvement I had in mind. It improves latency by ~3x for list requests, IIRC. There is also an option to disable compression, which gives an additional ~2x improvement, but that only makes sense if your network throughput allows it.
This unfortunately won't bring much benefit: the k8s control plane does not have indexes, so for each list request it will still process all objects to return a few. It would be some improvement, but probably not a major one from the control plane's point of view.
Yeah, makes sense. If you want, feel free to join the k8s sig-scalability meeting to get more attention on this issue. On a side note, you will definitely be interested in KEP-3157, as it will drastically reduce k8s control plane load with respect to list requests.
From observation of our production environment, the cost (cpu/memory/latency) of encoding is much larger than that of filtering for large list responses. I believe that even without a label indexer for the CRD, this approach would improve the CEP list calls. I'm also wondering: could we support defining a label indexer for CRDs?
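To make the "label indexer" idea concrete, here is a minimal in-memory sketch in Go of what such an index could look like: instead of scanning (and encoding) every object per list request, a cache keeps a map from `key=value` to object names, so an equality-selector lookup touches only the matching objects. The types and names here are illustrative only — this is not a Kubernetes or client-go API.

```go
package main

import "fmt"

// labelIndex maps "key=value" to the set of object names carrying that label.
// Purely illustrative; real implementations (e.g. client-go's cache.Indexers)
// differ in shape and semantics.
type labelIndex struct {
	byLabel map[string]map[string]struct{}
}

func newLabelIndex() *labelIndex {
	return &labelIndex{byLabel: map[string]map[string]struct{}{}}
}

// add records every label of an object under its "key=value" bucket.
func (ix *labelIndex) add(name string, labels map[string]string) {
	for k, v := range labels {
		key := k + "=" + v
		if ix.byLabel[key] == nil {
			ix.byLabel[key] = map[string]struct{}{}
		}
		ix.byLabel[key][name] = struct{}{}
	}
}

// lookup answers a single equality selector in time proportional to the
// result size, rather than scanning all stored objects.
func (ix *labelIndex) lookup(k, v string) []string {
	out := []string{}
	for name := range ix.byLabel[k+"="+v] {
		out = append(out, name)
	}
	return out
}

func main() {
	ix := newLabelIndex()
	ix.add("cep-1", map[string]string{"ns": "team-a"})
	ix.add("cep-2", map[string]string{"ns": "team-b"})
	fmt.Println(ix.lookup("ns", "team-a"))
}
```

The point of the sketch is the asymmetry the comment describes: the expensive part of a large list is encoding every object, which an index avoids entirely for selective queries.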
Back to the three proposals: I wonder whether upstream would accept any of them?
Hi there! I wanted to mention that Rob proposed an xDS adapter for Cilium, which may be relevant here. PTAL: https://docs.google.com/document/d/1U4pO_dTaHERKOtrneNA8njW19HSVbq3sBM3x8an4878/edit#heading=h.y3v1ksm0ev6r
Cilium Feature Proposal
Is your proposed feature related to a problem?
On large-scale clusters (several thousand nodes), we have tens of thousands of namespaces for customers. The raw data for a single "list all namespaces" request is almost 200MB.
Serving these list requests from every cilium-agent puts huge stress on the APIServer.
Adding a FlowSchema to rate-limit cilium-agent causes the agent to take a very long time to get ready.
To support large-scale clusters, we want to reduce the API load from cilium-agent as much as possible.
Describe the feature you'd like
The namespace watcher in cilium-agent is responsible for triggering security-label changes on CiliumEndpoints when namespace labels change.
The idea is to move the namespace watcher from cilium-agent to cilium-operator.
cilium-operator watches all namespaces. When it receives an update event, it compares the namespace's labels with the security labels of the CiliumEndpoints in that namespace; if there are changes, it adds an annotation such as "cilium.io/reconcile" to those CiliumEndpoints.
cilium-agent watches CiliumEndpoints. When it receives an update event and the new object carries the "cilium.io/reconcile" annotation, it triggers the security-label change for that CiliumEndpoint, fetching the namespace via a GET API call instead of from the local cache to generate the new labels.
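The operator-side check described above can be sketched as pure Go functions. This is a hypothetical illustration only: the `ciliumEndpoint` struct stands in for the real CRD type, and the `k8s:io.cilium.k8s.namespace.labels.` prefix is assumed to be how namespace-derived security labels are keyed.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed prefix for namespace-derived security labels, and the annotation
// key from the proposal text.
const nsLabelPrefix = "k8s:io.cilium.k8s.namespace.labels."
const reconcileAnnotation = "cilium.io/reconcile"

// ciliumEndpoint is a stand-in for the real CiliumEndpoint CRD type.
type ciliumEndpoint struct {
	SecurityLabels map[string]string
	Annotations    map[string]string
}

// namespaceLabelsChanged reports whether the namespace-derived portion of the
// endpoint's security labels no longer matches the namespace's labels.
func namespaceLabelsChanged(nsLabels map[string]string, cep ciliumEndpoint) bool {
	// Collect the namespace-derived labels currently on the endpoint.
	current := map[string]string{}
	for k, v := range cep.SecurityLabels {
		if strings.HasPrefix(k, nsLabelPrefix) {
			current[strings.TrimPrefix(k, nsLabelPrefix)] = v
		}
	}
	if len(current) != len(nsLabels) {
		return true
	}
	for k, v := range nsLabels {
		if current[k] != v {
			return true
		}
	}
	return false
}

// markForReconcile is what the operator would do on a namespace update event:
// annotate each stale endpoint so that the agent — which watches
// CiliumEndpoints anyway — re-resolves the namespace via a GET call.
func markForReconcile(nsLabels map[string]string, ceps []*ciliumEndpoint) int {
	marked := 0
	for _, cep := range ceps {
		if namespaceLabelsChanged(nsLabels, *cep) {
			if cep.Annotations == nil {
				cep.Annotations = map[string]string{}
			}
			cep.Annotations[reconcileAnnotation] = "true"
			marked++
		}
	}
	return marked
}

func main() {
	stale := &ciliumEndpoint{SecurityLabels: map[string]string{nsLabelPrefix + "team": "old"}}
	fresh := &ciliumEndpoint{SecurityLabels: map[string]string{nsLabelPrefix + "team": "new"}}
	// Only the stale endpoint gets the annotation; the fresh one is left alone.
	fmt.Println(markForReconcile(map[string]string{"team": "new"}, []*ciliumEndpoint{stale, fresh}))
}
```

Because only stale endpoints are annotated, the agent-side GET calls scale with the number of actually-changed namespaces rather than with cluster size, which is the load reduction the proposal is after.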