Fix Memory Consumption in network_policy_controller #902
Conversation
LGTM. Nice find with the stacking syncs.
Thanks @aauren for being persistent and finding the problem. I don't have a large cluster to reproduce the problem and test the patch myself, so I have to ask these questions to be sure we are chasing the right problem.
I would like kube-router to be stateless unless there is some performance gain in caching the processed result. In this case, having global state for network policies would avoid re-building it on every update (add/delete/update events of pods, namespaces, and network policy objects). Given that in the current design it is always a full sync, I am fine with this change. If in the future we change the design (should the need arise to deal with larger scale) to handle updates partially (scoped to the particular pod/namespace/network policy), we might reconsider bringing global state back.
I am not sure about this one. Is there a perceived benefit to just this change (moving networkPoliciesInfo to a variable of the sync handler function)? In both cases it was just a pointer and must have been allocated on the heap.
What I am wondering is whether there is a memory leak or the memory is just being held. If the rate at which events arrive exceeds the rate at which sync() calls complete, then I agree memory is held, but only for API objects. From the goroutine dump you shared offline I did not see any indication that there is a huge number of goroutines waiting for completion. Can you please check again by taking another goroutine dump to confirm this is indeed a case of back pressure from the network policy sync causing the k8s watchers to back up and hold on to objects?
In principle I agree we should do heavy processing like sync() asynchronously rather than as part of the handler handling the API watch event. We should make that change. There is a neat work queue with a rate limiter in client-go (https://godoc.org/k8s.io/client-go/util/workqueue) for processing objects asynchronously (https://github.com/kubernetes/client-go/tree/master/examples/workqueue). But since we are not interested in the object itself, perhaps we can make a simpler change.
So effectively we want to coalesce the events to perform a single sync() where possible, and limit ourselves to a single thread of execution. I would suggest we use a single dedicated goroutine that waits on a channel to perform the sync(). We can use a 1-capacity channel with a non-blocking sender. I will try to post a gist if needed. With a semaphore it's possible we will miss the latest update. For example, if a thread is already processing an event and doing the sync, and we get 100 updates, all of them return because there is an update in progress, but that does not necessarily pick up the latest changes. So we want at least one more sync() to coalesce all 100 updates.
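A minimal sketch of the pattern being suggested, using placeholder names (requestFullSync, fullSyncRequestChan, doFullSync) rather than the actual kube-router identifiers: a 1-capacity channel with a non-blocking sender, drained by a single dedicated sync goroutine.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // fullSyncRequestChan has capacity 1: at most one sync request is ever pending.
    var fullSyncRequestChan = make(chan struct{}, 1)

    // requestFullSync is what an event handler calls; it never blocks. If a
    // request is already pending, this update is coalesced into it.
    func requestFullSync() {
        select {
        case fullSyncRequestChan <- struct{}{}:
        default:
            // a sync request is already queued; it will cover this change too
        }
    }

    // runFullSyncLoop is the single dedicated goroutine that performs syncs.
    func runFullSyncLoop(stopCh <-chan struct{}, wg *sync.WaitGroup) {
        defer wg.Done()
        for {
            select {
            case <-stopCh:
                return
            case <-fullSyncRequestChan:
                doFullSync()
            }
        }
    }

    func doFullSync() {
        time.Sleep(2 * time.Second) // stand-in for the expensive iptables sync
        fmt.Println("full sync complete")
    }

    func main() {
        stopCh := make(chan struct{})
        var wg sync.WaitGroup
        wg.Add(1)
        go runFullSyncLoop(stopCh, &wg)

        // 100 rapid updates coalesce into at most two full syncs
        for i := 0; i < 100; i++ {
            requestFullSync()
        }
        time.Sleep(5 * time.Second)
        close(stopCh)
        wg.Wait()
    }

The non-blocking send lets handlers return immediately, while the buffered slot guarantees that at least one more sync runs after the current one, picking up any updates that arrived in the meantime.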
Full agreement here. We should keep kube-router stateless where possible, but if there is a large enough gain in maintaining global state we should be willing to adapt in the future. In this case, networkPoliciesInfo is never used from the controller outside that method, so it's safe to remove some unused global state.
The change here means that the reference to that pointer is released sooner, so it should allow the GC to clean up the heap sooner than if it were referenced from the controller. I chose to leave it as a pointer reference to reduce the size of the change, but if you're ok with it, I can change all of the instances to pass by value; then it would only be allocated on the stack and never touch the heap. That's where I actually started, but I reverted to passing by reference because I wasn't sure how big of a change I wanted to introduce here.
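A simplified sketch of the refactor being discussed, using invented placeholder types rather than the real kube-router structs: networkPoliciesInfo stops being a long-lived field on the controller and becomes a local value scoped to a single sync run.

    package policysync

    // networkPolicyInfo is an invented stand-in for the processed policy data.
    type networkPolicyInfo struct {
        name      string
        namespace string
    }

    type controller struct {
        // old approach: a long-lived pointer field such as
        //   networkPoliciesInfo *[]networkPolicyInfo
        // keeps the processed data reachable for the controller's whole lifetime
    }

    // buildPolicyInfo stands in for the code that walks the informer caches.
    func (c *controller) buildPolicyInfo() []networkPolicyInfo {
        return []networkPolicyInfo{{name: "example", namespace: "default"}}
    }

    func (c *controller) fullPolicySync() {
        // new approach: a local value scoped to this sync run
        policies := c.buildPolicyInfo()
        for range policies {
            // program iptables/ipsets for each policy ...
        }
        // policies is unreachable once this function returns, so the GC can
        // reclaim it without waiting on the controller's lifetime
    }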
Yes, the word "leak" here is not correct. It's just memory being held by the streamwatcher processes. I'm willing to take goroutine dumps, but I can already tell you that there will not be a spike in goroutines, as I kept a close eye on the goroutine count during testing. While I'm not familiar with the internals of the Kubernetes apimachinery, from the external signs that I've seen it appears to only start a single goroutine for each watcher here: kube-router/pkg/cmd/kube-router.go, line 96 (commit 803bd90).
Because we make Sync() a synchronous action and pause handler execution for the duration of Sync(), you will see no additional pod, namespace, or network policy changes processed until the Sync() call is complete.
As such, I would not expect to see the number of goroutines increase as the handlers become blocked; instead, the same 3 informers will become blocked on the handlers and will begin to build up their caches with a bunch of pointers to Kubernetes metadata that they cannot flush, and the heap will fill.
I'm not familiar with workqueue; I'll take a look.
This makes sense to me. In the current semaphore implementation there is definitely a chance that we'll lose a work item. I'll rework my solution to try a single-item channel instead.
Take networkPoliciesInfo off of the npc struct and convert it to a stack variable that is easy to clean up.
Kubernetes informers will block on handler execution and will then begin to accumulate cached Kubernetes object information into the heap. This change moves the full sync logic into its own goroutine where full syncs are triggered and gated via writing to a single-item channel. This ensures that:
- Syncs will only happen one at a time (as they are full syncs and we can't process multiple at once)
- Sync requests are only ever delayed and never lost, as they will be added to the request channel
- After we make a sync request we return fast, to ensure that the handler execution returns fast and we don't block the Kubernetes informers
Now that we are better managing requests for full syncs, we no longer need to manage readyForUpdates on the npc controller. We already enforce non-blocking handlers and a single sync execution chain; whether the request comes from the controller in the form of a periodic sync or from a Kubernetes informer, the result is a non-blocking, single-threaded full sync.
@murali-reddy I switched to a channel-based approach as recommended, and I fixed networkPoliciesInfo to be completely stack-based by removing the pass-by-reference logic. Let me know what you think.
overall looks good. couple of small nits.
    glog.Info("Starting network policy controller full sync goroutine")
    go func(fullSyncRequest <-chan struct{}, stopCh <-chan struct{}) {
        for {
            select {
Go's select statement does not have priority among its cases, so it's better to give priority to case <-stopCh; otherwise we run into a corner case where full syncs can continue to run in spite of stopCh being closed.
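A minimal sketch of that suggestion, with placeholder names (fullSyncRequestChan, fullPolicySync) standing in for the actual identifiers: poll stopCh in its own non-blocking select before waiting on either channel, so a pending sync request can never win the race against a closed stop channel.

    package policysync

    // fullSyncLoop gives stopCh priority: it is checked in its own
    // non-blocking select before the loop waits on either channel.
    func fullSyncLoop(fullSyncRequestChan <-chan struct{}, stopCh <-chan struct{}, fullPolicySync func()) {
        for {
            select {
            case <-stopCh:
                return
            default:
            }
            select {
            case <-stopCh:
                return
            case <-fullSyncRequestChan:
                fullPolicySync()
            }
        }
    }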
    select {
    case <-stopCh:
        glog.Info("Shutting down network policies full sync goroutine")
        return
We might as well add this goroutine to the WaitGroup.
I created an additional WaitGroup here and then wait on it before the WaitGroup passed to Run sends done. Is that ok? Or did you want me to piggyback off the initial WaitGroup that was passed to Run()?
@aauren I meant the existing WaitGroup that was passed to Run(), so that the whole process will exit gracefully (by waiting here: https://github.com/cloudnativelabs/kube-router/blob/v1.0.0-rc3/pkg/cmd/kube-router.go#L190) when all the goroutines exit.
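A rough sketch of that wiring, with simplified placeholder types rather than the real kube-router signatures: the full-sync goroutine registers with the same WaitGroup that the caller passes to Run(), so the wg.Wait() in kube-router.go also waits for this goroutine to exit.

    package policysync

    import "sync"

    // npcController is a simplified stand-in for the real controller type.
    type npcController struct {
        fullSyncRequestChan chan struct{}
    }

    func (npc *npcController) fullPolicySync() { /* expensive iptables work */ }

    // Run attaches the full-sync goroutine to the caller's WaitGroup so that
    // wg.Wait() at shutdown also waits for this goroutine to return.
    func (npc *npcController) Run(stopCh <-chan struct{}, wg *sync.WaitGroup) {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for {
                select {
                case <-stopCh:
                    return
                case <-npc.fullSyncRequestChan:
                    npc.fullPolicySync()
                }
            }
        }()
    }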
Updated to use the original waitgroup
I manually verified the below scenarios.
Overall LGTM
* feat(gitignore): don't track intellij files
* fact(network_policy): networkPoliciesInfo -> stack
* fix(network_policy): k8s obj memory accumulation
* fact(network_policy): rework readyForUpdates
* fix(network_policy): address PR feedback
FYI @murali-reddy @mrueg @filintod
This is a fix for: #795
This change does two things broken up by commit:
- Takes networkPoliciesInfo off of the npc struct and converts it to a stack variable that is easy to clean up.
- Changes the Sync() method to use semaphores and converts all handlers to call the function asynchronously. Pausing the handler flow and waiting for the Sync() method to finish causes the k8s watchers to back up and hold on to pod, network policy, and namespace object metadata until the handlers finish doing a full sync. In large clusters that have a high churn of pods, network policies, and/or namespaces, this means kube-router is never able to catch up with the amount of updates it is receiving, and the streamwatcher's cache of k8s objects grows unbounded.
Because every sync is a full sync, we need to limit ourselves to a single thread of execution, but we utilize a semaphore so that if we try to acquire the lock and fail, we can return quickly without blocking anything.
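An illustrative sketch of that semaphore approach (which the discussion above later reworks into the channel-based design), using golang.org/x/sync/semaphore, the package vendored for this change; the function names here are placeholders rather than the actual kube-router code.

    package policysync

    import "golang.org/x/sync/semaphore"

    // A weighted semaphore of size 1 acts as a try-lock around the full sync.
    var syncSem = semaphore.NewWeighted(1)

    // requestSync is what an informer handler would call: if a full sync is
    // already running, TryAcquire fails and the handler returns immediately,
    // so it never blocks the informer that invoked it.
    func requestSync(fullPolicySync func()) {
        if !syncSem.TryAcquire(1) {
            return
        }
        go func() {
            defer syncSem.Release(1)
            fullPolicySync()
        }()
    }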
As an example, in our medium-sized cluster it takes about 1.5 minutes for kube-router to perform a full iptables sync. However, in this cluster we have multiple small cronjobs that execute every minute. When this happens, kube-router is never able to catch up, and the queue containing these pod changes slowly becomes longer and longer.
When testing kube-router-1.0.0-rc3 without this patch, I find that over the course of an hour with six 1-minute cronjobs running, I see a memory gain of ~50 MB per hour. With this patch, memory stays constant with no noticeable increases.
Disclaimer: I spent about an hour trying to get dep to work to update the Gopkg.toml and Gopkg.lock files for golang.org/x/sync but wasn't able to get it working correctly. Multiple packages failed to update while running the dep ensure step, and it consistently failed to import the package with an error. So to save time I manually vendored the sync package. @murali-reddy let me know if you're able to get dep to work for you and I'll update my PR.
A special thanks to @liggitt for setting me on the right track and giving me some really good hints that allowed us to find the handler contention.