
Conversation

@aauren
Collaborator

@aauren aauren commented May 17, 2020

FYI @murali-reddy @mrueg @filintod

This is a fix for: #795

This change does two things broken up by commit:

  1. Converts networkPoliciesInfo to be a function variable rather than a variable on the npc struct. Through extensive testing I've found that this gives a relatively small memory performance gain, but still worth doing IMO since this variable isn't referenced outside of the Sync function and its dependency functions. Converting the variable allows the GC to free up cached data sooner.
  2. Converts the Sync() method to use semaphores and converts all handlers to call the function asynchronously. Pausing the handler flow and waiting for the Sync method to finish causes the k8s watchers to back up and hold on to pod, network policy, and namespace object metadata until the handlers finish doing a full sync.
    In large clusters that have high churn on pods, network policies, and/or namespaces, this results in kube-router never being able to catch up with the volume of updates it is receiving, and the streamwatcher's cache of k8s objects grows unbounded.
    Because every sync is a full sync we need to limit ourselves to a single thread of execution, but we utilize a semaphore so that if we try to acquire the lock and fail we can return quickly without blocking anything (sketch below).
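
A minimal sketch of the semaphore gating described in point 2, using golang.org/x/sync/semaphore (the function names here are illustrative, not the exact code in this PR):

package main

import (
	"context"
	"log"
	"time"

	"golang.org/x/sync/semaphore"
)

// fullSyncSem allows at most one full sync to run at a time.
var fullSyncSem = semaphore.NewWeighted(1)

// requestFullSync is what an informer handler would call. If a full sync is
// already in flight it returns immediately instead of blocking the handler.
func requestFullSync() {
	if !fullSyncSem.TryAcquire(1) {
		log.Println("full sync already in progress, skipping")
		return
	}
	go func() {
		defer fullSyncSem.Release(1)
		fullSync()
	}()
}

// fullSync stands in for the real iptables sync, which can take minutes.
func fullSync() {
	time.Sleep(2 * time.Second)
	log.Println("full sync complete")
}

func main() {
	// Simulate a burst of handler events; most of them return immediately.
	for i := 0; i < 5; i++ {
		requestFullSync()
		time.Sleep(500 * time.Millisecond)
	}
	// Wait for any in-flight sync to finish before exiting.
	_ = fullSyncSem.Acquire(context.Background(), 1)
}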

As an example, in our medium-sized cluster it takes about 1.5 minutes for kube-router to perform a full iptables sync. However, in this cluster we have multiple small cronjobs that execute every minute. When this happens, kube-router is never able to catch up, and the queue containing these pod changes slowly becomes longer and longer.

When testing kube-router-1.0.0-rc3 without this patch, I find that over the course of an hour with six once-a-minute cronjobs running I see a memory gain of ~50 MB/hour. With this patch, memory stays constant with no noticeable increases.

Disclaimer: I spent about an hour trying to get dep to update the Gopkg.toml and Gopkg.lock files for golang.org/x/sync but wasn't able to get it to work correctly. Multiple packages failed to update while running the dep ensure step, and it consistently failed to import the package with an error like the following:

Solving failure: No versions of golang.org/x/sync met constraints:
	master: Could not introduce golang.org/x/sync@master, as its subpackage golang.org/x/sync does not contain usable Go code (*build.NoGoError).. (Package is required by (root)

So to save time I manually vendored the sync package. @murali-reddy let me know if you're able to get dep to work for you and I'll update my PR.

A special thanks to @liggitt for setting me on the right track and giving me some really good hints that allowed us to find the handler contention.

@mrueg mrueg requested a review from murali-reddy May 17, 2020 14:46
@filintod
Contributor

LGTM. Nice find with the stacking syncs.

@murali-reddy
Member

thanks @aauren for being persistent and finding the problem.

I don't have a large cluster to reproduce the problem and test the patch myself, so I have to ask these questions to be sure we are chasing the right problem.

Converts networkPoliciesInfo to be a function variable rather than a variable on the npc struct. Through extensive testing I've found that this gives a relatively small memory performance gain, but still worth doing IMO since this variable isn't referenced outside of the Sync function and its dependency functions.

I would like kube-router to be stateless unless there is some performance gain in caching the processed result. In this case, having global state for network policies would avoid re-building it on every update (add/delete/update events of pods, namespaces, and network policy objects). Given that in the current design it's always a full sync, I am fine with this change. If in the future we change the design (should the need arise to deal with larger scale) to handle updates partially (scoped to that particular pod/namespace/network policy), we might reconsider having global state back.

Converting the variable allows the GC to free up cached data sooner.

I am not sure of this. Is there a perceived benefit with just this change (of moving networkPoliciesInfo to a variable of the sync handler function)? In both cases it was just a pointer and must have been allocated on the heap.

Pausing the handler flow and waiting for the Sync method to finish causes the k8s watchers to back up and hold on to pod, network policy, and namespace object metadata until the handlers finish doing a full sync.

What I am wondering is whether there is a memory leak or memory is just being held. It's possible that if the rate at which events come in exceeds the rate at which sync() runs complete, memory is held, but only for API objects. From the goroutine dump you shared offline I did not see any indication that there are a huge number of goroutines waiting for completion. Can you please take another goroutine dump to confirm that this is indeed a case of back pressure from the network policy sync causing the k8s watchers to back up and hold on?

Converts the Sync() method to use semaphores and converts all handlers to call the function asynchronously.

In principle I agree we should do heavy processing like sync() asynchronously rather than as part of the handler that handles the API watch event. We should make that change. There is a neat work queue with a rate limiter in client-go (https://godoc.org/k8s.io/client-go/util/workqueue) for processing objects asynchronously (https://github.com/kubernetes/client-go/tree/master/examples/workqueue). But as we are not interested in the object itself, perhaps we can make a simpler change.
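
For reference, a minimal sketch of that workqueue pattern (illustrative only, not what this PR ends up using):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Rate-limited queue; adding a key that is already queued is a no-op,
	// so bursts of events coalesce into a single pending work item.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Handlers would only enqueue a key; the heavy work happens in the worker below.
	go func() {
		for i := 0; i < 100; i++ {
			queue.Add("full-sync")
		}
		time.Sleep(3 * time.Second)
		queue.ShutDown()
	}()

	// A single worker, so at most one sync runs at a time.
	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		fmt.Printf("processing %v\n", key)
		time.Sleep(time.Second) // stand-in for the actual full sync
		queue.Done(key)
		queue.Forget(key)
	}
}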

In large clusters that have high churn on pods, network policies, and/or namespaces, this results in kube-router never being able to catch up with the volume of updates it is receiving, and the streamwatcher's cache of k8s objects grows unbounded.
Because every sync is a full sync we need to limit ourselves to a single thread of execution, but we utilize a semaphore so that if we try to acquire the lock and fail we can return quickly without blocking anything.

So effectively we want to coalesce the events to perform a single sync() where possible, and limit ourselves to a single thread of execution. I would suggest we use a single dedicated goroutine to perform sync(), which waits on a channel for sync requests. We can use a 1-capacity channel with a non-blocking sender. A rough sketch of the idea is below; I will try to post a gist if needed.
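
Something along these lines (an illustration of the pattern, not the final code; the names are made up):

package main

import (
	"fmt"
	"time"
)

func main() {
	stopCh := make(chan struct{})

	// Capacity 1: pending requests coalesce. If a request is already queued,
	// further sends are dropped, but the queued request still guarantees one
	// more sync after the current one finishes.
	fullSyncRequest := make(chan struct{}, 1)

	// Dedicated goroutine: the only place the full sync ever runs, so there is
	// a single thread of execution for syncs.
	go func() {
		for {
			select {
			case <-stopCh:
				return
			case <-fullSyncRequest:
				fmt.Println("running full sync")
				time.Sleep(time.Second) // stand-in for the real sync()
			}
		}
	}()

	// Non-blocking sender, called from the informer handlers.
	requestSync := func() {
		select {
		case fullSyncRequest <- struct{}{}:
		default: // a sync request is already pending; it will cover these changes too
		}
	}

	// 100 rapid updates collapse into at most two syncs (one in flight, one queued).
	for i := 0; i < 100; i++ {
		requestSync()
	}
	time.Sleep(3 * time.Second)
	close(stopCh)
}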

With a semaphore it's possible we will miss the latest update. For example, if a thread is already processing an event and doing the sync, and we get 100 updates, all of them return because there is an update in progress. But that in-progress sync does not necessarily pick up the latest changes. So we want at least one more sync() to coalesce all 100 updates.

@aauren
Collaborator Author

aauren commented May 18, 2020

I would like kube-router to be stateless unless there is some performance gain in caching the processed result. In this case, having global state for network policies would avoid re-building it on every update (add/delete/update events of pods, namespaces, and network policy objects). Given that in the current design it's always a full sync, I am fine with this change. If in the future we change the design (should the need arise to deal with larger scale) to handle updates partially (scoped to that particular pod/namespace/network policy), we might reconsider having global state back.

Full agreement here. We should keep kube-router stateless where possible, but if there is a large enough gain in maintaining global state we should be willing to adapt in the future. In this case, the variable is never used from the controller outside that method, so it's safe to remove this bit of unused global state.

I am not sure of this. Is there a perceived benefit with just this change (of moving networkPoliciesInfo to a variable of the sync handler function)? In both cases it was just a pointer and must have been allocated on the heap.

The change here means that the reference to that pointer is released sooner, so it should allow the GC to clean up the heap sooner than if it were referenced from the controller. I chose to leave it as a pointer reference to reduce the size of the change, but if you're OK with it, I can change all of the instances to pass by value; then it would only be allocated on the stack and never touch the heap. That's where I actually started, but I reverted to passing by reference because I wasn't sure how big of a change I wanted to introduce here.

What I am wondering is whether there is a memory leak or memory is just being held. It's possible that if the rate at which events come in exceeds the rate at which sync() runs complete, memory is held, but only for API objects. From the goroutine dump you shared offline I did not see any indication that there are a huge number of goroutines waiting for completion. Can you please take another goroutine dump to confirm that this is indeed a case of back pressure from the network policy sync causing the k8s watchers to back up and hold on?

Yes, the word "leak" here is not correct. It's just memory being held by the streamwatcher processes. I'm willing to take goroutine dumps, but I can already tell you that there will not be a spike in goroutines: I kept a close eye on the go_goroutines metric from kube-router and never saw a spike no matter how much memory climbed.

While I'm not familiar with the internals of the Kubernetes apimachinery, from the external signs that I've seen it appears to only start a single goroutine for each watcher here:

informerFactory := informers.NewSharedInformerFactory(kr.Client, 0)
It then calls the handler within that single goroutine. This is consistent with what I saw in the core dumps that I took and also with what I saw while monitoring the kube-router logs emitted from the network_policy_controller. If we make Sync() a synchronous action and pause handler execution for the duration of Sync(), you will see no additional pod, namespace, or network policy changes until the Sync() call is complete.

As such, I would not expect to see the number of goroutines increase as the handlers become blocked; instead, the same 3 informers will become blocked on the handlers and will begin to build up their caches with a bunch of pointers to Kubernetes metadata that they cannot flush, and the heap will fill.

In principle I agree we should do heavy processing like sync() asynchronously rather than as part of the handler that handles the API watch event. We should make that change. There is a neat work queue with a rate limiter in client-go (https://godoc.org/k8s.io/client-go/util/workqueue) for processing objects asynchronously (https://github.com/kubernetes/client-go/tree/master/examples/workqueue). But as we are not interested in the object itself, perhaps we can make a simpler change.

I'm not familiar with workqueue, I'll take a look.

So effectively we want to coalesce the events to perform a single sync() where possible, and limit ourselves to a single thread of execution. I would suggest we use a single dedicated goroutine to perform sync(), which waits on a channel for sync requests. We can use a 1-capacity channel with a non-blocking sender. I will try to post a gist if needed.

With a semaphore it's possible we will miss the latest update. For example, if a thread is already processing an event and doing the sync, and we get 100 updates, all of them return because there is an update in progress. But that in-progress sync does not necessarily pick up the latest changes. So we want at least one more sync() to coalesce all 100 updates.

This makes sense to me; in the current semaphore implementation there is definitely a chance that we'll lose a work item. I'll rework my solution to try a single-item channel instead.

aauren added 4 commits May 18, 2020 16:52
Take networkPoliciesInfo off of the npc struct and convert it to a stack
variable that is easy to clean up.
Kubernetes informers will block on handler execution and will then begin
to accumulate cached Kubernetes object information into the heap. This
change moves the full sync logic into its own goroutine where full
syncs are triggered and gated via writing to a single-item channel.

This ensures that:
- Syncs will only happen one at a time (as they are full syncs and we
  can't process multiple at once)
- Sync requests are only ever delayed and never lost as they will be
  added to the request channel
- After we make a sync request we return fast to ensure that the handler
  execution returns fast and that we don't block the Kubernetes
  informers
Now that we are better managing requests for full syncs we no longer
need to manage readyForUpdates on the npc controller. We already enforce
non-blocking handlers and a single sync execution chain; whether the
request comes from the controller in the form of a periodic sync or from
a Kubernetes informer, either way the result is a non-blocking, single
thread of execution, full sync.
@aauren aauren force-pushed the fix_memory_leak branch from ccbf03e to 5763e29 May 18, 2020 22:33
@aauren
Collaborator Author

aauren commented May 18, 2020

@murali-reddy I switched to a channel-based approach as recommended, and I changed networkPoliciesInfo to be completely stack-based by removing the pass-by-reference logic.

Let me know what you think.

Member

@murali-reddy murali-reddy left a comment


Overall looks good. A couple of small nits.

glog.Info("Starting network policy controller full sync goroutine")
go func(fullSyncRequest <-chan struct{}, stopCh <-chan struct{}) {
for {
select {
Member


Go's select statement does not have priority, so it's better to give priority to case <-stopCh; otherwise we run into a corner case where full syncs can continue to run in spite of stopCh being closed.
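
For example, something like this (a sketch of the pattern only, reusing the names from the snippet above):

for {
	// Check stopCh first so shutdown wins even when sync requests keep arriving.
	select {
	case <-stopCh:
		glog.Info("Shutting down network policies full sync goroutine")
		return
	default:
	}
	select {
	case <-stopCh:
		glog.Info("Shutting down network policies full sync goroutine")
		return
	case <-fullSyncRequest:
		// perform the full sync
	}
}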

select {
case <-stopCh:
	glog.Info("Shutting down network policies full sync goroutine")
	return
Member


We might as well add this goroutine to the WaitGroup.

Collaborator Author


I created an additional WaitGroup here and then wait on it before the WaitGroup passed to Run() sends done. Is that OK? Or did you want me to piggyback off the initial WaitGroup that was passed to Run()?

Member


@aauren I meant the existing WaitGroup that was passed to Run(), so that the whole process will exit gracefully (by waiting here: https://github.com/cloudnativelabs/kube-router/blob/v1.0.0-rc3/pkg/cmd/kube-router.go#L190) when all of the goroutines exit.
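
Roughly like this (an illustrative fragment; wg is the WaitGroup passed into Run(), and fullSyncRequest/stopCh are the channels from the snippet above):

wg.Add(1)
go func(fullSyncRequest <-chan struct{}, stopCh <-chan struct{}) {
	// Signal the shared WaitGroup on exit so kube-router's main wg.Wait() can return.
	defer wg.Done()
	for {
		select {
		case <-stopCh:
			return
		case <-fullSyncRequest:
			// run the full sync
		}
	}
}(fullSyncRequest, stopCh)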

Collaborator Author


Updated to use the original WaitGroup.

@aauren aauren force-pushed the fix_memory_leak branch from 453cf45 to 42baaff May 20, 2020 02:47
@murali-reddy
Member

I manually verified the scenarios below:

  • pod/namespace/network policy add/delete/update events result in a full sync of network policies
  • periodic full syncs run at the scheduled interval
  • on bulk events, verified that full sync requests are coalesced into a single run of the full sync

overall LGTM

@murali-reddy murali-reddy merged commit 837554b into cloudnativelabs:master May 20, 2020
@aauren aauren mentioned this pull request May 20, 2020
FabienZouaoui pushed a commit to SirDataFR/kube-router that referenced this pull request Jun 22, 2020

* feat(gitignore): don't track intellij files

* fact(network_policy): networkPoliciesInfo -> stack

Take networkPoliciesInfo off of the npc struct and convert it to a stack
variable that is easy to clean up.

* fix(network_policy): k8s obj memory accumulation

Kubernetes informers will block on handler execution and will then begin
to accumulate cached Kubernetes object information into the heap. This
change moves the full sync logic into its own goroutine where full
syncs are triggered and gated via writing to a single-item channel.

This ensures that:
- Syncs will only happen one at a time (as they are full syncs and we
  can't process multiple at once)
- Sync requests are only ever delayed and never lost as they will be
  added to the request channel
- After we make a sync request we return fast to ensure that the handler
  execution returns fast and that we don't block the Kubernetes
  informers

* fact(network_policy): rework readyForUpdates

Now that we are better managing requests for full syncs we no longer
need to manage readyForUpdates on the npc controller. We already enforce
non-blocking handlers and a single sync execution chain; whether the
request comes from the controller in the form of a periodic sync or from
a Kubernetes informer, either way the result is a non-blocking, single
thread of execution, full sync.

* fix(network_policy): address PR feedback
