CFP: move namespace watcher to cilium-operator #29127

Closed
yingnanzhang666 opened this issue Nov 13, 2023 · 11 comments
Labels
area/operator, kind/feature, kind/performance, sig/agent, sig/k8s, stale

Comments

@yingnanzhang666 (Contributor) commented Nov 13, 2023

Cilium Feature Proposal


Is your proposed feature related to a problem?

On large-scale clusters (several thousand nodes), we have tens of thousands of namespaces for customers. The raw data for a single list-all-namespaces request is almost 200 MB.
These "list" calls from cilium-agent put huge stress on the APIServer.
Adding a FlowSchema to rate-limit cilium-agent causes the agent to take a very long time to become ready.

To support large-scale clusters, we want to reduce the API load from cilium-agent as much as possible.

Describe the feature you'd like

The namespace watcher in cilium-agent is responsible for triggering security-label changes on CiliumEndpoints when namespace labels change.
The idea is to move the namespace watcher from cilium-agent to cilium-operator.

cilium-operator watches all namespaces. When it receives an update event, it compares the namespace's labels with the security labels of the CiliumEndpoints in that namespace; if they differ, it adds an annotation such as "cilium.io/reconcile" to those CiliumEndpoints.
cilium-agent watches its CiliumEndpoints. When it receives an update event and the new object carries the "cilium.io/reconcile" annotation, it triggers the security-label change for that CiliumEndpoint. It then fetches the namespace via a GET API call, instead of the local cache, to generate the new labels. A sketch of the operator side of this flow follows the diagram below.

sequenceDiagram
    participant client
    participant apiserver
    participant cilium-operator
    participant cilium-agent
    cilium-operator->>apiserver: watch all namespaces
    cilium-agent->>apiserver: watch local ciliumendpoints
    client->>apiserver: update namespace labels
    apiserver-->>cilium-operator: namespace update event
    cilium-operator->>apiserver: patch ciliumendpoints in this namespace with reconcile annotation
    apiserver-->>cilium-agent: ciliumendpoint update event
    cilium-agent->>cilium-agent: trigger security labels reconcile
    cilium-agent->>apiserver: send GET ns request to fetch namespace meta labels
    cilium-agent->>cilium-agent: reconcile security labels of ciliumendpoint
    cilium-agent->>apiserver: remove reconcile annotation from ciliumendpoint
    cilium-agent->>apiserver: create new ciliumidentity & update ciliumendpoint
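
For illustration, here is a minimal sketch of the operator-side half of this flow, using client-go's namespace informer and the dynamic client for CiliumEndpoints. The "cilium.io/reconcile" annotation name is the one suggested above; the overall structure, the per-namespace List, and the error handling are illustrative assumptions, not Cilium's actual implementation.

```go
// Sketch only: operator-side namespace watcher that marks CiliumEndpoints for
// reconciliation when namespace labels change.
package main

import (
	"context"
	"fmt"
	"reflect"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

var cepGVR = schema.GroupVersionResource{Group: "cilium.io", Version: "v2", Resource: "ciliumendpoints"}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	k8sClient := kubernetes.NewForConfigOrDie(cfg)
	dynClient := dynamic.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(k8sClient, 10*time.Minute)
	nsInformer := factory.Core().V1().Namespaces().Informer()
	nsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldNS, newNS := oldObj.(*corev1.Namespace), newObj.(*corev1.Namespace)
			// Only label changes are relevant to endpoint security labels.
			if reflect.DeepEqual(oldNS.Labels, newNS.Labels) {
				return
			}
			markEndpointsForReconcile(dynClient, newNS.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // run until killed
}

// markEndpointsForReconcile annotates every CiliumEndpoint in the namespace so
// that the owning agents receive an update event and re-resolve identity labels.
func markEndpointsForReconcile(dyn dynamic.Interface, namespace string) {
	ceps, err := dyn.Resource(cepGVR).Namespace(namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		fmt.Printf("listing CiliumEndpoints in %s: %v\n", namespace, err)
		return
	}
	patch := []byte(`{"metadata":{"annotations":{"cilium.io/reconcile":"true"}}}`)
	for _, cep := range ceps.Items {
		_, err := dyn.Resource(cepGVR).Namespace(namespace).Patch(
			context.TODO(), cep.GetName(), types.MergePatchType, patch, metav1.PatchOptions{})
		if err != nil {
			fmt.Printf("patching %s/%s: %v\n", namespace, cep.GetName(), err)
		}
	}
}
```

In a real implementation the operator would presumably reuse its existing CiliumEndpoint store rather than issuing a namespaced List for every namespace event.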
@yingnanzhang666 yingnanzhang666 added the kind/feature label Nov 13, 2023
@julianwiedmann julianwiedmann added the sig/k8s, kind/performance, area/operator, sig/agent labels Nov 16, 2023
@julianwiedmann (Member)

Neat, that's definitely a large-scale problem :).

I would suggest bringing this up in the weekly community meeting to see if there's wider interest in pursuing such an enhancement, so you don't have to carry all the load on your own.

@yingnanzhang666 (Contributor, Author)

@julianwiedmann Will we have the community meeting next week, given that it's close to Thanksgiving?
Can we start the discussion on Slack or here first? Do you have any concerns about the idea of moving the namespace watcher to the operator?

@yingnanzhang666 (Contributor, Author) commented Nov 22, 2023

Another option:
cilium-agent watches only the individual namespaces of the pods on its node.
For example, if only ns1/pod1 and ns2/pod2 run on node1, cilium-agent on node1 registers just two watchers, each watching a single namespace with fieldSelector=metadata.name=<namespace>. This mirrors how kubelet watches the ConfigMaps and Secrets mounted by the pods running on its node; a sketch follows below.
The downside is that a lot of watchers get registered across the cluster.
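
A rough sketch of such a per-namespace watcher; the helper name watchSingleNamespace is hypothetical, but the fieldSelector pattern is the one described above.

```go
// Sketch only: agent-side informer restricted to a single namespace object.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchSingleNamespace starts an informer whose list/watch is restricted to one
// namespace via a fieldSelector on metadata.name, so the apiserver never has to
// serve the full namespace list to this agent.
func watchSingleNamespace(client kubernetes.Interface, name string, stop <-chan struct{}) {
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		10*time.Minute,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = fields.OneTermEqualSelector("metadata.name", name).String()
		}),
	)
	nsInformer := factory.Core().V1().Namespaces().Informer()
	nsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			ns := newObj.(*corev1.Namespace)
			fmt.Printf("namespace %s updated, labels: %v\n", ns.Name, ns.Labels)
			// Here the agent would recompute security labels for its local
			// endpoints in this namespace.
		},
	})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
}
```

One such informer per namespace that hosts a local pod gives roughly namespaces-per-node watches per agent, which is the "lot of watchers" tradeoff noted above.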

@marseel (Contributor) commented Nov 29, 2023

I have some clarifying questions to better understand your scale and the issues you are facing; I just want to make sure that the proposed ideas would actually solve your case.

  1. "The raw data for one list of all namespaces requests is almost 200MB." This seems super high; usually namespaces are really light objects, a few hundred bytes each. Do you have some complex structure in your namespace objects, many labels, etc.?
  2. You mentioned that you are running 1.23. What specific version? List call latency was significantly improved in k8s v1.23.11.
  3. How many pods do you run in your cluster? Do you see a similar issue with CiliumEndpoint CRD lists? It seems counterintuitive to me that namespaces are the only issue; I would expect that you are running a lot of Pods, and their corresponding CiliumEndpoints should probably also cause list-related issues 🤔

These "list" calls from cilium-agent put huge stress on the APIServer.
Adding a FlowSchema to rate-limit cilium-agent causes the agent to take a very long time to become ready.

I understand that the k8s control plane was suffering without the FlowSchema, and that with the FlowSchema the agent was taking a long time to get ready. I am wondering whether giving more concurrency shares to the corresponding PriorityLevel would help. It's always a tradeoff between CPU/memory usage of the k8s control plane and how long it takes Cilium to get ready.

  4. During the Cilium community meeting, you mentioned that this issue usually happens when doing a rolling upgrade of the k8s control plane. If that's the case, I think it would be worth creating an issue for k8s itself. The Efficient Watch Resumption KEP, which is beta since 1.21 (so you probably have it enabled) and stable in 1.24, should prevent clients from doing list requests when switching between k8s control plane nodes.

  5. Do you mind sharing what setup you have for your k8s control plane nodes? I would be interested in how many control-plane nodes you are running, the machine sizes, and the general numbers of objects within your cluster.

  6. Do you change namespace labels? I am asking because, if not, it might make sense to drop the informer altogether, do a single GET of the namespace, and not react to namespace changes.

Going back to your proposals:

  • Moving the watcher to the operator makes sense, but I would be worried about higher churn on the CiliumEndpoint CRD - basically 2 extra events per CEP whenever a namespace label changes, for all pods within that namespace - I've previously seen issues with this even at smaller scale. Also, time to converge would probably be significant after changing the labels of a namespace that has many pods.
  • Watching only single namespaces per running pod seems like a better solution. I'm not sure how many pods you are running, but assuming it is in the hundreds of thousands, these additional watchers might also matter and cause issues for the k8s control plane.
  • Alternatively, if you are not changing namespace labels, the simplest solution would be to do a single GET of the namespace when needed and never open a watch. While Cilium would then not react to namespace label changes, it would definitely help your k8s control plane, and it wouldn't suffer from hundreds of thousands of additional open watches (see the sketch at the end of this comment).

All these proposals assume, though, that namespace lists are the only issue; my feeling is that we would see a similar problem with CiliumEndpoint CRD lists, for example.
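
To make the last option concrete, here is a minimal sketch of the "single GET, never watch" approach; the helper name is hypothetical.

```go
// Sketch only: fetch namespace labels on demand, with no informer and no watch,
// so the agent never reacts to later label changes.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// namespaceLabels issues a single GET for the namespace and returns its labels.
func namespaceLabels(ctx context.Context, client kubernetes.Interface, name string) (map[string]string, error) {
	ns, err := client.CoreV1().Namespaces().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, fmt.Errorf("getting namespace %q: %w", name, err)
	}
	return ns.Labels, nil
}
```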

@yingnanzhang666 (Contributor, Author)

  1. "The raw data for one list of all namespaces requests is almost 200MB." This seems super high, usually, namespaces are really light objects, like a few hundred bytes. Do you have some complex structure of your namespace objects/many labels etc?

We add some metadata to the namespace object - for example, annotations that tell the monitoring system to gather specific metrics or logs for the pods in the namespace. We also extended the user isolation model, so we add some related indicators to the namespace annotations.

  2. You mentioned that you are running 1.23. What specific version? List call latency was significantly improved in k8s v1.23.11.

We are using 1.23.6. I believe the improvement you mentioned is "gzip compression switched from level 4 to level 1 to improve large list call latencies in exchange for higher network bandwidth usage (10-50% higher)". We can upgrade or cherry-pick this change to leverage the benefit.

  3. How many pods do you run in your cluster? Do you see a similar issue with CiliumEndpoint CRD lists? It seems to me counterintuitive that namespaces are the only issue, I would expect that you are running a lot of Pods and their corresponding CiliumEndpoints should also probably cause list-related issues 🤔

The number of pods is also large in our cluster, so for performance reasons we are using high-scale-ipcache mode, which does not store pod IPs in the ipcache. That means there is no need to watch all CiliumEndpoints in high-scale-ipcache mode; only CiliumEndpoints with a well-known identity are needed (code).
Another change of mine, besides the namespace watcher, is to list/watch only a small group of CiliumEndpoints (the CiliumEndpoints related to this node plus the CiliumEndpoints with a well-known identity).
How can this be achieved? Add a label indicating the node name or well-known identity when creating the CiliumEndpoint, and change the CiliumEndpoint watcher to initialize with a label selector; a rough sketch is shown below.
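
A rough sketch of what such a label-selector-filtered CiliumEndpoint informer could look like with client-go's dynamic informer factory. The label key "cep.cilium.io/owner" and the helper name are hypothetical placeholders for whatever label would be set at CEP creation time.

```go
// Sketch only: agent-side CiliumEndpoint informer restricted by a label selector.
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
)

var cepGVR = schema.GroupVersionResource{Group: "cilium.io", Version: "v2", Resource: "ciliumendpoints"}

// newFilteredCEPInformer list/watches only the CiliumEndpoints whose (hypothetical)
// owner label matches this node or the "well-known" marker, so the agent only
// receives the CEPs it actually cares about.
func newFilteredCEPInformer(dyn dynamic.Interface, nodeName string) cache.SharedIndexInformer {
	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(
		dyn,
		10*time.Minute,
		metav1.NamespaceAll,
		func(opts *metav1.ListOptions) {
			// Set-based selector: either the CEPs scheduled to this node or the
			// ones carrying a well-known identity.
			opts.LabelSelector = "cep.cilium.io/owner in (" + nodeName + ",well-known)"
		},
	)
	return factory.ForResource(cepGVR).Informer()
}
```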

  I understand that k8s control plane was suffering without flowschema and with flowschema agent was taking a lot of time to get ready. I am wondering, maybe giving more concurrency shares to the corresponding Priority Level would help? It's always a tradeoff between cpu/memory usage of k8s control plane and how long it takes Cilium to get ready.

Definitely, giving more concurrency shares can make the agent ready faster; it's a tradeoff. However, by minimizing unnecessary API load, both the client and the server side benefit - a win-win situation.

  4. During Cilium community meeting, you mentioned that this issue usually happens when doing a rolling upgrade of k8s control plane. If that's the case, I think it would be worth creating an issue for k8s itself. Efficient watch resumption KEP which is beta since 1.21 (so you probably have it enabled) and stable in 1.24, should prevent clients from doing list request when switching between k8s control plane nodes.

We expect the watch to be resumed, which would greatly reduce the load when restarting an apiserver instance. But it cannot be resumed successfully in our environment; I analyzed this issue.
If the client connects to an apiserver instance directly, the watch can be resumed.

***  194298 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
***  194298 reflector.go:383] k8s.io/kubernetes/pkg/kubelet/kubelet.go:527: Failed to watch *v1.Node: Get https://***/api/v1/nodes?allowWatchBookmarks=true&fieldSelector=metadata.name%3D***&resourceVersion=503983615&timeoutSeconds=543&watch=true: dial tcp ***: connect: connection refused

But if the client connects to the LB VIP of the apiserver, the watch ends with an error (very short watch).

***   31802 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
***   31802 reflector.go:405] object-"***"/"***": watch of *v1.Secret ended with: very short watch: object-"***"/"***": Unexpected watch close - watch lasted less than a second and no items received

client-go treats the "connection refused" error as retriable, so it won't trigger re-listing.
With our LB behavior, the watch request doesn't receive ConnectionRefused; instead, the error probably matches "net.IsProbableEOF(err) || net.IsTimeout(err)". But client-go only re-watches in the ConnectionRefused case; other errors trigger a re-list. r.listerWatcher.Watch(options) -> (r *Request) Watch(ctx context.Context). A simplified sketch of this distinction is shown below.
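
A simplified sketch of that distinction, using the real error-classification helpers from k8s.io/apimachinery/pkg/util/net; this paraphrases the behavior described above rather than quoting the actual reflector code.

```go
// Sketch only: how different watch-stream errors lead to retry-watch vs. relist.
package main

import (
	"errors"
	"fmt"
	"io"
	"syscall"

	utilnet "k8s.io/apimachinery/pkg/util/net"
)

// watchRetryDecision mirrors, in simplified form, the distinction described
// above: only connection-refused errors let the reflector retry the watch in
// place; anything else ends the watch and leads to a full relist.
func watchRetryDecision(err error) string {
	switch {
	case utilnet.IsConnectionRefused(err):
		return "retry watch (cache and resourceVersion are kept)"
	case utilnet.IsProbableEOF(err) || utilnet.IsTimeout(err):
		return "watch ends; reflector falls back to a relist"
	default:
		return "watch ends; reflector falls back to a relist"
	}
}

func main() {
	// Direct apiserver restart: connection refused, so the watch is retried.
	fmt.Println(watchRetryDecision(syscall.ECONNREFUSED))
	// LB silently closing the stream: looks like an EOF, so a relist follows.
	fmt.Println(watchRetryDecision(io.ErrUnexpectedEOF))
	// Any other stream error also ends in a relist.
	fmt.Println(watchRetryDecision(errors.New("very short watch")))
}
```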

Anyway, this is something I will put effort into fixing on the k8s side.

  5. Do you mind sharing what setup do you have for your k8s control plane nodes? I would be interested in how many control planes you are running and what are the machine sizes and general numbers of objects within your cluster.

We have dedicated master nodes that run etcd, apiserver, kube-controller-manager and kube-scheduler. There are 5-9 master nodes per cluster; each master node has about 64-80 CPU cores and around 370 GB of memory.
Each cluster has 2000-6000 nodes.
Namespaces: 50k, pods: 100k, services: 5k, secrets: 200k, etc. The spec size of our namespaces and pods is larger than the usual upstream case: we store lots of metadata in annotations and labels, so the total size of the objects is large.

  6. Do you change namespace labels? I am asking, because if not, maybe it would make sense to drop informer altogether and just do a single get of namespace and do not react to namespace changes.

We seldom modify namespace labels, but we haven't restricted customers from doing so. Additionally, as mentioned, we extended the user isolation model, so there are cases where we do modify namespace labels during account transfers. However, such occurrences are infrequent.

@marseel (Contributor) commented Nov 30, 2023

We are using 1.23.6. I believe the improvement you mentioned is "gzip compression switched from level 4 to level 1 to improve large list call latencies in exchange for higher network bandwidth usage (10-50% higher)". We can upgrade or cherry-pick this change to leverage the benefit.

Yes, that's the exact improvement I had in mind. It improves latency by x3 for list requests IIRC. There is also an option to disable compression, which gives additional x2 improvement, but it makes sense only if your network throughput allows it.
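
For reference, a minimal sketch of disabling compression on the client side via rest.Config's DisableCompression field; whether this client-side knob or the server-side APIResponseCompression feature gate is the option meant above is left open.

```go
// Sketch only: a client whose requests opt out of gzip response compression,
// trading network bandwidth for lower list latency and apiserver CPU.
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newUncompressedClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	cfg.DisableCompression = true
	return kubernetes.NewForConfig(cfg)
}
```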

Add the label to indicate the nodeName or well-known on ciliumendpoint when creating the ciliumendpoint. And change the ciliumendpoint watcher to init with label selector.

This unfortunately won't bring much benefit - the k8s control plane does not have label indexes, so for each list request it will still process all objects in order to return a few. It will be some improvement, but probably not a major one from the k8s control plane's point of view.

But client-go only treat ConnectionRefused case to re-watch, others will trigger relisting. Anyway, this is one thing I will take effort to fix in k8s side.

Yeah, makes sense. If you want, feel free to join the k8s sig-scalability meeting to get more attention on this issue.

On a side note, you will definitely be interested in KEP-3157 as it will drastically reduce k8s control plane load with respect to list requests.

@yingnanzhang666 (Contributor, Author)

This unfortunately won't bring much benefit - the k8s control plane does not have label indexes, so for each list request it will still process all objects in order to return a few. It will be some improvement, but probably not a major one from the k8s control plane's point of view.

From observation of our production environment, the cost (CPU/memory/latency) of encoding is much larger than the cost of filtering for large list calls. I believe that even without a label indexer for CRDs, this approach will improve the CEP list calls. I am also wondering: could we support defining a label indexer for CRDs?

@yingnanzhang666 (Contributor, Author)

Back to the three proposals: I wonder whether upstream will accept any of them?

@linsun (Contributor) commented Jan 12, 2024

Hi there! I wanted to mention the xDS adapter for Cilium that Rob proposed, which may be relevant here. PTAL: https://docs.google.com/document/d/1U4pO_dTaHERKOtrneNA8njW19HSVbq3sBM3x8an4878/edit#heading=h.y3v1ksm0ev6r


This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale label Mar 13, 2024

This issue has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this as not planned Mar 27, 2024