k8s: CEP GC runs in steps #6029

Merged: 2 commits merged into cilium:master from raybejjani:cep-smaller-gc on Nov 15, 2018

Conversation

raybejjani
Contributor

@raybejjani raybejjani commented Oct 24, 2018

On large clusters the CEP GC would fetch the entire list of CEPs. These
can grow to significant sizes (100kB in one instance) and this causes
extreme degradation in kube-apiserver. We now iterate 10 CEPs at a
time, thus lowering the load we put on the cluster.

I also return the cache.Store object from our k8s helper factories. No functionality has changed there.

I began by trying to watch for CEPs in the controller but the code became a little complex. This was partly because only the node that should do GC needs to watch for CEPs, but that meant creating a non-shared informer (shared informers cannot stop once they begin watching for a specific type). In the end, limiting the list to a small number seemed like a reasonable compromise, especially since the GC isn't a critical operation here.
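
To make the chunked-list approach concrete, here is a rough sketch only. It assumes the generated CiliumV2 clientset's CiliumEndpoints(namespace).List(opts) call shown in the review excerpt further down, plus a hypothetical maybeGarbageCollect helper, and its error handling is deliberately simplified rather than being the PR's exact logic:

// Page through CiliumEndpoints in small chunks instead of one large List.
// listOpts.Limit caps each response; the apiserver hands back a Continue
// token until the full set has been paged through.
// meta_v1 is k8s.io/apimachinery/pkg/apis/meta/v1.
listOpts := meta_v1.ListOptions{Limit: 10}
for {
	ceps, err := ciliumClient.CiliumEndpoints(meta_v1.NamespaceAll).List(listOpts)
	if err != nil {
		// Simplified: give up on this pass and let the next periodic GC run retry.
		return err
	}
	for _, cep := range ceps.Items {
		maybeGarbageCollect(cep) // hypothetical: delete the CEP if it is an orphan
	}
	if ceps.Continue == "" {
		break // last page
	}
	listOpts.Continue = ceps.Continue
}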

Relates to: #5913



@raybejjani raybejjani requested a review from a team as a code owner October 24, 2018 09:31
@raybejjani raybejjani requested a review from a team October 24, 2018 09:31
@raybejjani
Contributor Author

test-me-please

@raybejjani raybejjani added kind/enhancement This would improve or streamline existing functionality. pending-review area/daemon Impacts operation of the Cilium daemon. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. labels Oct 24, 2018
@raybejjani
Contributor Author

test-me-please

1 similar comment
@raybejjani
Contributor Author

test-me-please

ceps, err := ciliumClient.CiliumEndpoints(meta_v1.NamespaceAll).List(listOpts)
switch {
case ceps.Continue != "" && err != nil:
// this is ok, it means we saw a 410 ResourceExpired error but we
Member

The comment is helpful, but is this documented somewhere? I'd suggest putting a link in a comment here to said documentation.

Contributor Author

Good point. I was, in fact, checking this incorrectly. I've fixed it.

@ianvernon
Member

@raybejjani do you have any benchmarks from manual testing that you can post here to validate the fix?

@raybejjani
Contributor Author

raybejjani commented Oct 25, 2018

@ianvernon I added 10k CEPs, all of which would be GCed. That took 12
The no-op GC took

Oct 25 13:54:30 k8s1 cilium-agent[29443]: level=info msg="Time taken 48.722672ms" controller="sync-to-k8s-ciliumendpoint-gc (k8s1)" subsys=endpoint

The real one took

Oct 25 14:14:14 k8s1 cilium-agent[32403]: level=info msg="Time taken 6m33.807662252s" controller="sync-to-k8s-ciliumendpoint-gc (k8s1)" endpointID=0 k8sPodName=default/xwing-2316 subsys=endpoint

Oddly, listing the CEPs after this still shows them but the count is decreasing... so I think there is a delay on the delete.

I'll run this again with no fetch limit... I'll be a few minutes

@raybejjani
Contributor Author

Ok, maybe not... I stopped the agent and the count stopped dropping. I think another GC run had started... although that shouldn't have happened (the interval is 30 minutes)

@raybejjani
Contributor Author

test-me-please

@raybejjani
Contributor Author

Running this again but with 1000 CEPs (10k was taking too long)

No CEPs to delete

Oct 25 14:38:35 k8s1 cilium-agent[15383]: level=info msg="Time taken 54.671552ms" controller="sync-to-k8s-ciliumendpoint-gc (k8s1)" subsys=endpoint

1000 CEPs to delete

Oct 25 14:51:04 k8s1 cilium-agent[20124]: level=info msg="Time taken 4m27.606818717s, orphans: 1002" controller="sync-to-k8s-ciliumendpoint-gc (k8s1)" endpointID=0 k8sPodName=default/xwing-999 subsys=endpoint

1000 CEPs to delete without a fetch limit

Oct 25 15:00:00 k8s1 cilium-agent[26276]: level=info msg="Time taken 4m2.605554694s, orphans: 1000" controller="sync-to-k8s-ciliumendpoint-gc (k8s1)" endpointID=0 k8sPodName=default/xwing-999 subsys=endpoint

1000 CEPs to delete with Propagation: Background

Oct 25 15:19:18 k8s1 cilium-agent[5325]: level=info msg="Time taken 4m3.37785256s, orphans: 1000" controller="sync-to-k8s-ciliumendpoint-gc (k8s1)" endpointID=0 k8sPodName=default/xwing-999 subsys=endpoint

I don't think we can extrapolate to much larger numbers, necessarily, since there may be some cliff somewhere. In any case, the goal was to lower the likelihood of overloading the apiserver (or the etcd behind it) when fetching these.
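
As a side note on the "Propagation: Background" run above, here is a minimal sketch of a delete with that policy, assuming the pre-1.17 clientset signature Delete(name, *DeleteOptions), the usual apimachinery types, and a logrus-style log variable; this is not necessarily the PR's exact call site:

// Delete one orphaned CEP and let the apiserver return immediately,
// garbage-collecting any dependents in the background.
// k8serrors is k8s.io/apimachinery/pkg/api/errors.
propagation := meta_v1.DeletePropagationBackground
err := ciliumClient.CiliumEndpoints(cep.Namespace).Delete(
	cep.Name,
	&meta_v1.DeleteOptions{PropagationPolicy: &propagation},
)
if err != nil && !k8serrors.IsNotFound(err) {
	// A CEP that is already gone is fine; other errors are worth logging
	// but should not abort the rest of the GC pass.
	log.WithError(err).Warn("Cannot delete orphaned CEP")
}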

@raybejjani
Contributor Author

test-me-please

pkg/endpoint/endpoint.go: 4 review threads (outdated, resolved)
@raybejjani raybejjani left a comment
Contributor Author

I'll run some more benchmarks but I updated the code a little.

pkg/endpoint/endpoint.go: 3 review threads (outdated, resolved)
@raybejjani
Contributor Author

Ok, I finally ran with different limit values. It doesn't seem to make a difference:

  • No-op reference. Time taken 1.043820863s, CEPs: 0
  • Limit 10. Time taken 4m4.20950814s, CEPs: 1000
  • Limit 25. Time taken 4m4.236100466s, CEPs: 1000
  • Limit 50. Time taken 4m4.291099082s, CEPs: 1000
  • No limit. Time taken 4m4.207404591s, CEPs: 1000

@raybejjani
Contributor Author

test-me-please

@coveralls

coveralls commented Nov 3, 2018

Coverage Status

Coverage decreased (-0.1%) to 44.072% when pulling e41505c on raybejjani:cep-smaller-gc into e2b43dd on cilium:master.

@tgraf tgraf added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Nov 7, 2018
@tgraf tgraf added this to the 1.4-bugfix milestone Nov 7, 2018
@raybejjani
Contributor Author

test-me-please

1 similar comment
@raybejjani
Contributor Author

test-me-please

aanm and others added 2 commits November 14, 2018 17:40
The watcher also keeps a local cache of objects. We can use this for
lookups in places like garbage-collection.

Signed-off-by: Ray Bejjani <ray@covalent.io>
On large clusters the CEP GC would fetch the entire list of CEPs. These
can grow to significant sizes (100kB in one instance) and this causes
extreme degradation in kube-apiserver. We now iterate 10 CEPs at a
time, thus lowering the load we put on the cluster.

Signed-off-by: Ray Bejjani <ray@covalent.io>
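
As a rough illustration of the first commit above (the helper-factory signatures themselves aren't shown in this thread), the watcher's cache.Store lets the GC answer lookups locally instead of making another apiserver round trip. The podStore variable, the key scheme, and the markOrphan helper below are assumptions, with keys following client-go's default MetaNamespaceKeyFunc "namespace/name" convention:

// podStore is assumed to be the cache.Store returned by a pod watcher factory.
_, exists, err := podStore.GetByKey(cep.Namespace + "/" + cep.Name)
if err != nil {
	return err
}
if !exists {
	// No pod backs this CEP in the local cache, so treat it as a GC
	// candidate, without an extra round trip to the apiserver.
	markOrphan(cep) // hypothetical helper
}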
@raybejjani
Contributor Author

test-me-please

@raybejjani raybejjani merged commit 0a17787 into cilium:master Nov 15, 2018
@raybejjani raybejjani deleted the cep-smaller-gc branch November 15, 2018 12:30
@WingkaiHo

Will this patch be merged into the 1.2 version?

@tgraf
Member

tgraf commented Mar 1, 2019

@WingkaiHo I'm assuming you ran into scale issues with CEP. We recommend upgrading to 1.4 to use CEP at scale. Backporting only this commit will not be sufficient to resolve all the scale issues present in 1.2 and 1.3.
