
Conversation

@aauren
Collaborator

@aauren aauren commented May 17, 2020

FYI @murali-reddy @mrueg @filintod

This is a fix for: #795

This change does two things broken up by commit:

  1. Converts networkPoliciesInfo to be a function variable rather than a variable on the npc struct. Through extensive testing I've found that this gives a relatively small memory performance gain, but still worth doing IMO since this variable isn't referenced outside of the Sync function and its dependency functions. Converting the variable allows the GC to free up cached data sooner.
  2. Converts the Sync() method to use semaphores and converts all handlers to call the function asynchronously. Pausing the handler flow and waiting for the Sync method to finish causes the k8s watchers to back up and hold on to pod, network policy, and namespace object metadata until the handlers finish doing a full sync.
    In large clusters that have high churn on pods, network policies, and/or namespaces, this results in kube-router never being able to catch up with the volume of updates it is receiving, and the streamwatcher's cache of k8s objects grows unbounded.
    Because every sync is a full sync we need to limit ourselves to a single thread of execution, but we utilize a semaphore so that if we try to acquire the lock and fail we can return quickly without blocking anything (sketch below).
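
A minimal sketch of the semaphore gating described in point 2, using golang.org/x/sync/semaphore (the function names here are illustrative, not the exact code in this PR):

package main

import (
	"context"
	"log"
	"time"

	"golang.org/x/sync/semaphore"
)

// fullSyncSem allows at most one full sync to run at a time.
var fullSyncSem = semaphore.NewWeighted(1)

// requestFullSync is what an informer handler would call. If a full sync is
// already in flight it returns immediately instead of blocking the handler.
func requestFullSync() {
	if !fullSyncSem.TryAcquire(1) {
		log.Println("full sync already in progress, skipping")
		return
	}
	go func() {
		defer fullSyncSem.Release(1)
		fullSync()
	}()
}

// fullSync stands in for the real iptables sync, which can take minutes.
func fullSync() {
	time.Sleep(2 * time.Second)
	log.Println("full sync complete")
}

func main() {
	// Simulate a burst of handler events; most of them return immediately.
	for i := 0; i < 5; i++ {
		requestFullSync()
		time.Sleep(500 * time.Millisecond)
	}
	// Wait for any in-flight sync to finish before exiting.
	_ = fullSyncSem.Acquire(context.Background(), 1)
}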

As an example, in our medium-sized cluster it takes about 1.5 minutes for kube-router to perform a full iptables sync. However, in this cluster we have multiple small cronjobs that execute every minute. When this happens, kube-router is never able to catch up, and the queue containing these pod changes slowly becomes longer and longer.

When testing kube-router-1.0.0-rc3 without this patch, I find that over the course of an hour with six once-a-minute cronjobs running I see a memory gain of ~50 MB/hour. With this patch, memory stays constant with no noticeable increases.

Disclaimer: I spent about an hour trying to get dep to update the Gopkg.toml and Gopkg.lock files for golang.org/x/sync but wasn't able to get it to work correctly. Multiple packages failed to update while running the dep ensure step, and it consistently failed to import the package with an error like the following:

Solving failure: No versions of golang.org/x/sync met constraints:
	master: Could not introduce golang.org/x/sync@master, as its subpackage golang.org/x/sync does not contain usable Go code (*build.NoGoError).. (Package is required by (root)

So to save time I manually vendored the sync package. @murali-reddy let me know if you're able to get dep to work for you and I'll update my PR.

A special thanks to @liggitt for setting me on the right track and giving me some really good hints that allowed us to find the handler contention.

@mrueg mrueg requested a review from murali-reddy May 17, 2020 14:46
@filintod
Contributor

LGTM. Nice find with the stacking syncs.

@murali-reddy
Member

thanks @aauren for being persistent and finding the problem.

I don't have a large cluster to reproduce the problem and test the patch myself, so I have to ask these questions to be sure we are chasing the right problem.

Converts networkPoliciesInfo to be a function variable rather than a variable on the npc struct. Through extensive testing I've found that this gives a relatively small memory performance gain, but still worth doing IMO since this variable isn't referenced outside of the Sync function and its dependency functions.

I would like kube-router to be stateless unless there is some performance gain in caching the processed result. In this case, having global state for network policies would avoid re-building it on every update (add/delete/update events of pods, namespaces, and network policy objects). Given that in the current design it's always a full sync, I am fine with this change. If in the future we change the design (should the need arise to deal with larger scale) to handle updates partially (scoped to that particular pod/namespace/network policy), we might reconsider having global state back.

Converting the variable allows the GC to free up cached data sooner.

I am not sure of this. Is there a perceived benefit with just this change (of moving networkPoliciesInfo to a variable of the sync handler function)? In both cases it was just a pointer and must have been allocated on the heap.

Pausing the handler flow and waiting for the Sync method to finish causes the k8s watchers to back up and hold on to pod, network policy, and namespace object metadata until the handlers finish doing a full sync.

What I am wondering is whether there is a memory leak or memory is just being held. It's possible that if the rate at which events come in exceeds the rate at which sync() runs complete, memory is held, but only for API objects. From the goroutine dump you shared offline I did not see any indication that there are a huge number of goroutines waiting for completion. Can you please take another goroutine dump to confirm that this is indeed a case of back pressure from the network policy sync causing the k8s watchers to back up and hold on?

Converts the Sync() method to use semaphores and converts all handlers to call the function asynchronously.

In principle I agree we should do heavy processing like sync() asynchronously rather than as part of the handler that handles the API watch event. We should make that change. There is a neat work queue with a rate limiter in client-go (https://godoc.org/k8s.io/client-go/util/workqueue) for processing objects asynchronously (https://github.com/kubernetes/client-go/tree/master/examples/workqueue). But as we are not interested in the object itself, perhaps we can make a simpler change.
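
For reference, a minimal sketch of that workqueue pattern (illustrative only, not what this PR ends up using):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Rate-limited queue; adding a key that is already queued is a no-op,
	// so bursts of events coalesce into a single pending work item.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Handlers would only enqueue a key; the heavy work happens in the worker below.
	go func() {
		for i := 0; i < 100; i++ {
			queue.Add("full-sync")
		}
		time.Sleep(3 * time.Second)
		queue.ShutDown()
	}()

	// A single worker, so at most one sync runs at a time.
	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		fmt.Printf("processing %v\n", key)
		time.Sleep(time.Second) // stand-in for the actual full sync
		queue.Done(key)
		queue.Forget(key)
	}
}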

In large clusters that have high churn on pods, network policies, and/or namespaces, this results in kube-router never being able to catch up with the volume of updates it is receiving, and the streamwatcher's cache of k8s objects grows unbounded.
Because every sync is a full sync we need to limit ourselves to a single thread of execution, but we utilize a semaphore so that if we try to acquire the lock and fail we can return quickly without blocking anything.

So effectively we want to coalesce the events to perform a single sync() where possible, and limit ourselves to a single thread of execution. I would suggest we use a single dedicated goroutine to perform sync(), which waits on a channel for sync requests. We can use a 1-capacity channel with a non-blocking sender. A rough sketch of the idea is below; I will try to post a gist if needed.
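
Something along these lines (an illustration of the pattern, not the final code; the names are made up):

package main

import (
	"fmt"
	"time"
)

func main() {
	stopCh := make(chan struct{})

	// Capacity 1: pending requests coalesce. If a request is already queued,
	// further sends are dropped, but the queued request still guarantees one
	// more sync after the current one finishes.
	fullSyncRequest := make(chan struct{}, 1)

	// Dedicated goroutine: the only place the full sync ever runs, so there is
	// a single thread of execution for syncs.
	go func() {
		for {
			select {
			case <-stopCh:
				return
			case <-fullSyncRequest:
				fmt.Println("running full sync")
				time.Sleep(time.Second) // stand-in for the real sync()
			}
		}
	}()

	// Non-blocking sender, called from the informer handlers.
	requestSync := func() {
		select {
		case fullSyncRequest <- struct{}{}:
		default: // a sync request is already pending; it will cover these changes too
		}
	}

	// 100 rapid updates collapse into at most two syncs (one in flight, one queued).
	for i := 0; i < 100; i++ {
		requestSync()
	}
	time.Sleep(3 * time.Second)
	close(stopCh)
}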

With a semaphore it's possible we will miss the latest update. For example, if a thread is already processing an event and doing the sync, and we get 100 updates, all of them return because there is an update in progress. But that in-progress sync does not necessarily pick up the latest changes. So we want at least one more sync() to coalesce all 100 updates.

@aauren
Collaborator Author

aauren commented May 18, 2020

I would like kube-router to be stateless unless there is some performance gain in caching the processed result. In this case, having global state for network policies would avoid re-building it on every update (add/delete/update events of pods, namespaces, and network policy objects). Given that in the current design it's always a full sync, I am fine with this change. If in the future we change the design (should the need arise to deal with larger scale) to handle updates partially (scoped to that particular pod/namespace/network policy), we might reconsider having global state back.

Full agreement here. We should keep kube-router stateless where possible, but if there is a large enough gain in maintaining global state we should be willing to adapt in the future. In this case, the variable is never used from the controller outside that method, so it's safe to remove this bit of unused global state.

I am not sure of this. Is there a perceived benefit with just this change (of moving networkPoliciesInfo to a variable of the sync handler function)? In both cases it was just a pointer and must have been allocated on the heap.

The change here means that the reference to that pointer is released sooner, so it should allow the GC to clean up the heap sooner than if it were referenced from the controller. I chose to leave it as a pointer reference to reduce the size of the change, but if you're OK with it, I can change all of the instances to pass by value; then it would only be allocated on the stack and never touch the heap. That's where I actually started, but I reverted to passing by reference because I wasn't sure how big of a change I wanted to introduce here.

What I am wondering is whether there is a memory leak or memory is just being held. It's possible that if the rate at which events come in exceeds the rate at which sync() runs complete, memory is held, but only for API objects. From the goroutine dump you shared offline I did not see any indication that there are a huge number of goroutines waiting for completion. Can you please take another goroutine dump to confirm that this is indeed a case of back pressure from the network policy sync causing the k8s watchers to back up and hold on?

Yes, the word "leak" here is not correct. It's just memory being held by the streamwatcher processes. I'm willing to take goroutine dumps, but I can already tell you that there will not be a spike in goroutines: I kept a close eye on the go_goroutines metric from kube-router and never saw a spike no matter how much memory climbed.

While I'm not familiar with the internals of the Kubernetes apimachinery, from the external signs that I've seen it appears to only start a single goroutine for each watcher here:

informerFactory := informers.NewSharedInformerFactory(kr.Client, 0)
It then calls the handler within that single goroutine. This is consistent with what I saw in the core dumps that I took and also with what I saw while monitoring the kube-router logs emitted from the network_policy_controller. If we make Sync() a synchronous action and pause handler execution for the duration of Sync(), you will see no additional pod, namespace, or network policy changes until the Sync() call is complete.

As such, I would not expect to see the number of goroutines increase as the handlers become blocked; instead, the same 3 informers will become blocked on the handlers and will begin to build up their caches with a bunch of pointers to Kubernetes metadata that they cannot flush, and the heap will fill.

In principle I agree we should do heavy processing like sync() asynchronously rather than as part of the handler that handles the API watch event. We should make that change. There is a neat work queue with a rate limiter in client-go (https://godoc.org/k8s.io/client-go/util/workqueue) for processing objects asynchronously (https://github.com/kubernetes/client-go/tree/master/examples/workqueue). But as we are not interested in the object itself, perhaps we can make a simpler change.

I'm not familiar with workqueue, I'll take a look.

So effectively we want to coalesce the events to perform a single sync() where possible, and limit ourselves to a single thread of execution. I would suggest we use a single dedicated goroutine to perform sync(), which waits on a channel for sync requests. We can use a 1-capacity channel with a non-blocking sender. I will try to post a gist if needed.

With a semaphore it's possible we will miss the latest update. For example, if a thread is already processing an event and doing the sync, and we get 100 updates, all of them return because there is an update in progress. But that in-progress sync does not necessarily pick up the latest changes. So we want at least one more sync() to coalesce all 100 updates.

This makes sense to me; in the current semaphore implementation there is definitely a chance that we'll lose a work item. I'll rework my solution to try a single-item channel instead.

aauren added 4 commits May 18, 2020 16:52
Take networkPoliciesInfo off of the npc struct and convert it to a stack
variable that is easy to clean up.
Kubernetes informers will block on handler execution and will then begin
to accumulate cached Kubernetes object information into the heap. This
change moves the full sync logic into its own goroutine where full
syncs are triggered and gated via writing to a single-item channel.

This ensures that:
- Syncs will only happen one at a time (as they are full syncs and we
  can't process multiple at once)
- Sync requests are only ever delayed and never lost as they will be
  added to the request channel
- After we make a sync request we return fast to ensure that the handler
  execution returns fast and that we don't block the Kubernetes
  informers
Now that we are better managing requests for full syncs we no longer
need to manage readyForUpdates on the npc controller. We already enforce
non-blocking handlers and a single sync execution chain; whether the
request comes from the controller in the form of a periodic sync or from
a Kubernetes informer, either way the result is a non-blocking, single
thread of execution, full sync.
@aauren aauren force-pushed the fix_memory_leak branch from ccbf03e to 5763e29 May 18, 2020 22:33
@aauren
Collaborator Author

aauren commented May 18, 2020

@murali-reddy I switched to a channel-based approach as recommended, and I changed networkPoliciesInfo to be completely stack-based by removing the pass-by-reference logic.

Let me know what you think.

Member

@murali-reddy murali-reddy left a comment


Overall looks good. A couple of small nits.

glog.Info("Starting network policy controller full sync goroutine")
go func(fullSyncRequest <-chan struct{}, stopCh <-chan struct{}) {
for {
select {
Member


Go's select statement does not have priority, so it's better to give priority to case <-stopCh; otherwise we run into a corner case where full syncs can continue to run in spite of stopCh being closed.
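
For example, something like this (a sketch of the pattern only, reusing the names from the snippet above):

for {
	// Check stopCh first so shutdown wins even when sync requests keep arriving.
	select {
	case <-stopCh:
		glog.Info("Shutting down network policies full sync goroutine")
		return
	default:
	}
	select {
	case <-stopCh:
		glog.Info("Shutting down network policies full sync goroutine")
		return
	case <-fullSyncRequest:
		// perform the full sync
	}
}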

select {
case <-stopCh:
	glog.Info("Shutting down network policies full sync goroutine")
	return
Member


We might as well add this goroutine to the WaitGroup.

Collaborator Author


I created an additional WaitGroup here and then wait on it before the WaitGroup passed to Run() sends done. Is that OK? Or did you want me to piggyback off the initial WaitGroup that was passed to Run()?

Member


@aauren I meant the existing WaitGroup that was passed to Run(), so that the whole process will exit gracefully (by waiting here: https://github.com/cloudnativelabs/kube-router/blob/v1.0.0-rc3/pkg/cmd/kube-router.go#L190) when all of the goroutines exit.
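
Roughly like this (an illustrative fragment; wg is the WaitGroup passed into Run(), and fullSyncRequest/stopCh are the channels from the snippet above):

wg.Add(1)
go func(fullSyncRequest <-chan struct{}, stopCh <-chan struct{}) {
	// Signal the shared WaitGroup on exit so kube-router's main wg.Wait() can return.
	defer wg.Done()
	for {
		select {
		case <-stopCh:
			return
		case <-fullSyncRequest:
			// run the full sync
		}
	}
}(fullSyncRequest, stopCh)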

Collaborator Author


Updated to use the original WaitGroup.

@aauren aauren force-pushed the fix_memory_leak branch from 453cf45 to 42baaff May 20, 2020 02:47
@murali-reddy
Member

I manually verified the scenarios below:

  • pod/namespace/network policy add/delete/update events result in a full sync of network policies
  • periodic full syncs run at the scheduled interval
  • on bulk events, verified that full sync requests are coalesced into a single run of the full sync

overall LGTM

@murali-reddy murali-reddy merged commit 837554b into cloudnativelabs:master May 20, 2020
@aauren aauren mentioned this pull request May 20, 2020
FabienZouaoui pushed a commit to SirDataFR/kube-router that referenced this pull request Jun 22, 2020

* feat(gitignore): don't track intellij files

* fact(network_policy): networkPoliciesInfo -> stack

Take networkPoliciesInfo off of the npc struct and convert it to a stack
variable that is easy to clean up.

* fix(network_policy): k8s obj memory accumulation

Kubernetes informers will block on handler execution and will then begin
to accumulate cached Kubernetes object information into the heap. This
change moves the full sync logic into its own goroutine where full
syncs are triggered and gated via writing to a single-item channel.

This ensures that:
- Syncs will only happen one at a time (as they are full syncs and we
  can't process multiple at once)
- Sync requests are only ever delayed and never lost as they will be
  added to the request channel
- After we make a sync request we return fast to ensure that the handler
  execution returns fast and that we don't block the Kubernetes
  informers

* fact(network_policy): rework readyForUpdates

Now that we are better managing requests for full syncs we no longer
need to manage readyForUpdates on the npc controller. We already enforce
non-blocking handlers and a single sync execution chain; whether the
request comes from the controller in the form of a periodic sync or from
a Kubernetes informer, either way the result is a non-blocking, single
thread of execution, full sync.

* fix(network_policy): address PR feedback
