
Fix for service backend entry leak in cilium_lb4_backends_v2 #23749

Closed
wants to merge 1 commit

Conversation

hemanthmalla
Member

Backends are meant to be reused across services. When service backends are updated, the existence of every backend is looked up by its hash. Currently, the check is made only against the current service's backends, not against the global backends.

This can result in duplicate backend entries. Without this commit, whether a duplicate is created depends on a race between releasing the old backend ID during deletion and updating the reference count during creation. Currently, when the service restoration process finds a duplicate entry on agent startup, restoration stops and the rest of the entries are leaked. Over time, with enough restarts, we could max out the backend BPF map.

This commit makes the lookup global, and entries in the backendByHash map are deleted only after the corresponding data is deleted from the lbmap. This should prevent duplicate entries in the backend map.
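In rough terms, the intended behavior looks like the sketch below. This is only an illustration of the idea; the Manager and Service types and the upsertBackend helper are made-up names, not the actual Cilium code. The point is that the global backendByHash map is consulted before a new backend ID is acquired, so a stale global entry gets reused instead of duplicated.

package main

import "fmt"

// Illustrative types only; names loosely mirror the PR description and are
// not the exact Cilium definitions.
type Backend struct {
    ID   int
    Addr string
}

type Service struct {
    backendByHash map[string]*Backend // this service's view of its backends
}

type Manager struct {
    backendByHash   map[string]*Backend // global view across all services
    backendRefCount map[string]int
    nextID          int
}

// upsertBackend reuses a globally known backend for hash if one exists, so a
// stale global entry never leads to a second ID for the same address.
func (m *Manager) upsertBackend(svc *Service, hash, addr string) *Backend {
    if b, ok := svc.backendByHash[hash]; ok {
        return b // already referenced by this service
    }
    m.backendRefCount[hash]++
    if global, ok := m.backendByHash[hash]; ok {
        svc.backendByHash[hash] = global // reuse instead of allocating a new ID
        return global
    }
    m.nextID++
    b := &Backend{ID: m.nextID, Addr: addr}
    m.backendByHash[hash] = b
    svc.backendByHash[hash] = b
    return b
}

func main() {
    m := &Manager{backendByHash: map[string]*Backend{}, backendRefCount: map[string]int{}}
    s1 := &Service{backendByHash: map[string]*Backend{}}
    s2 := &Service{backendByHash: map[string]*Backend{}}
    b1 := m.upsertBackend(s1, "h1", "10.0.0.1:8080")
    b2 := m.upsertBackend(s2, "h1", "10.0.0.1:8080")
    fmt.Println(b1.ID == b2.ID) // true: the backend is reused, not duplicated
}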

Fixes #23551

Signed-off-by: Hemanth Malla hemanth.malla@datadoghq.com

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 14, 2023
@hemanthmalla
Member Author

/test

@hemanthmalla hemanthmalla marked this pull request as ready for review February 15, 2023 13:01
@hemanthmalla hemanthmalla requested a review from a team as a code owner February 15, 2023 13:01
@brb brb self-requested a review February 15, 2023 13:51
@brb
Member

brb commented Feb 15, 2023

@hemanthmalla Nice catch!

This can result in duplicate backend entries

I am struggling a bit to understand how this can lead to duplicate backend entries. Let's assume that a backend exists within another service, while a given service doesn't have it. The if s.backendRefCount.Add(hash) check evaluates to false, so the backend won't be added to newBackend. Am I missing something?

@hemanthmalla
Member Author

@brb That's correct. I missed accounting for this behavior of backendRefCount.Add() :(
Back to the drawing board to understand where the leaks are coming from.

@hemanthmalla
Member Author

@brb Spent some more time looking into this. The race is between the removal of a backend in terminating state and the deletion of the entire service.
When a backend is in terminating state, there's an optimization in place to avoid sending new traffic to it. This is implemented by skipping the addition of entries to the frontend map while the backend entry is still in place.

UpsertService()
  -> updateBackendsCacheLocked()
    -> upsertServiceIntoLBMaps()
      -> Check and exclude terminating backends
        -> lbmap.UpsertService()
          -> updateServiceEndpoint()

// Skip adding the terminating backend to the service map so that it
// won't be selected to serve new requests. However, the backend is still
// kept in the affinity and backend maps so that existing connections
// are able to terminate gracefully. The final clean-up for the backend
// will happen once the agent receives a delete event for the service
// endpoint as it'll be part of obsoleteBackendIDs/obsoleteSVCBackendIDs
// list passed to this function.
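For illustration, the optimization boils down to something like the sketch below (made-up names, not the actual Cilium implementation): terminating backends are filtered out of the service's slot entries, while their backend-map entries are left untouched so existing connections can drain.

package main

import "fmt"

// Illustrative sketch of the optimization quoted above; names are made up.
type backend struct {
    id          int
    terminating bool
}

// serviceSlots builds the list of backend IDs that should receive new
// traffic; terminating backends are skipped but not deleted anywhere.
func serviceSlots(backends []backend) []int {
    var slots []int
    for _, b := range backends {
        if b.terminating {
            continue // no new traffic, but the backend map entry stays
        }
        slots = append(slots, b.id)
    }
    return slots
}

func main() {
    bs := []backend{{513, true}, {515, false}, {516, false}}
    fmt.Println(serviceSlots(bs)) // [515 516]; 513 stays only in the backend map
}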

We start seeing issues when this service with a terminating backend gets deleted:

delK8sSVCs()
  -> DeleteService()
    -> deleteServiceLocked()
      -> deleteBackendsFromCacheLocked() [decrements ref count and releases backend ID]
        -> lbmap.DeleteService() [attempts to delete all entries based on #backends]
           * Cmd to delete the terminating entry fails and the method bails immediately
           * Obsolete backend entries are left behind
           * Lock released

Now the ref count for the orphan backend's hash is 0 and the backend ID is released. If this backend is now added to any service, we'll see a duplicate entry.

UpsertService()
  -> updateBackendsCacheLocked()
     * if b, found := svc.backendByHash[hash] returns false because it's checked against the local service
     * if s.backendRefCount.Add(hash) returns true because the count was previously reset
       -> AcquireBackendID(backend.L3n4Addr) returns a new ID since the lookup on alloc.entities fails
         -> upsertServiceIntoLBMaps()
           -> s.lbmap.AddBackend()

And we have a duplicate backend entry.
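To make the ordering concrete, here is a toy sketch of the race (purely illustrative; the lbmap type, the IDs, and the helper below are hypothetical stand-ins, not Cilium code). The cache cleanup succeeds, the lbmap deletion fails, and the next upsert for the same address allocates a second entry:

package main

import (
    "errors"
    "fmt"
)

// Toy stand-in for the BPF backend map; all names here are hypothetical.
type lbmap struct{ backends map[int]string }

func (m *lbmap) deleteService(failOnTerminating bool) error {
    if failOnTerminating {
        // Mimics "cmd to delete the terminating entry fails and the method
        // bails": backend entries are left behind in the map.
        return errors.New("unable to delete element")
    }
    return nil
}

func main() {
    m := &lbmap{backends: map[int]string{513: "10.0.0.1:8080"}}
    refCount := map[string]int{"h1": 1}
    idReleased := false

    // DeleteService: cache cleanup happens first...
    refCount["h1"]--  // deleteBackendsFromCacheLocked: ref count drops to 0
    idReleased = true // ...and backend ID 513 is released
    // ...then the lbmap deletion fails and the method bails out.
    if err := m.deleteService(true); err != nil {
        fmt.Println("delete failed:", err) // entry 513 is now orphaned in the map
    }

    // Next UpsertService for the same address: the ref count was reset and the
    // ID was released, so a fresh ID is acquired and a second entry is added.
    if refCount["h1"] == 0 && idReleased {
        refCount["h1"]++
        newID := 518 // stand-in for AcquireBackendID handing out a new ID
        m.backends[newID] = "10.0.0.1:8080"
    }
    fmt.Println("backend map:", m.backends) // two entries for the same address
}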

@aditighag
Member

@hemanthmalla Good catch! You might be onto something here! Thanks for the debugging notes.

There is a difference in the logic around how we treat terminating backends in v1.11 (terminating backends are removed from the BPF maps) vs. v1.12 (terminating backends are moved to the end of a service entry in the BPF maps). What version are you running (based on the linked code above, it looks like v1.11)? Was there a Cilium upgrade/restart involved?

@hemanthmalla
Member Author

I was also able to reproduce this with the following:

  • Create a deployment that traps TERM and does nothing for 90 seconds: trap 'sleep 90' TERM && sleep infinity
  • Create a ClusterIP service
  • Delete one of the pods with a long enough grace period: kubectl delete pod <> --grace-period 60
  • Delete the service.
level=warning msg="Error deleting service by frontend" error="Unable to delete service entry REDACTED_IP:36895: unable to delete element REDACTED_IP:8080 from map cilium_lb4_services_v2: no such file or directory" k8sNamespace=hemanth-test k8sSvcName=leak-repro-service obj="REDACTED_IP:8080" subsys=k8s-watcher

After creation:

root@ip-10-128-102-12:/home/cilium# ./cilium bpf lb list --frontends | grep REDACTED_IP -A4
REDACTED_IP:8080       516 0 (90) [0x0 0x0]
                       513 0 (90) [0x0 0x0]
                       515 0 (90) [0x0 0x0]
                       0 3 (90) [0x0 0x0]

While one of the pods is in terminating state (513 goes away):

root@ip-10-128-102-12:/home/cilium# ./cilium bpf lb list --frontends | grep REDACTED_IP -A4
REDACTED_IP:8080       515 0 (90) [0x0 0x0]
                       516 0 (90) [0x0 0x0]
                       518 0 (90) [0x0 0x0]
                       0 3 (90) [0x0 0x0]

All 4 backends are leaked after the service is deleted:

root@ip-10-128-102-12:/home/cilium# ./cilium bpf lb list --backends | grep -E '(513|515|516|518)'
518   any://IP1:8080
513   any://IP2:8080
515   any://IP3:8080
516   any://IP4:8080

@hemanthmalla
Member Author

@aditighag we're observing this in 1.11. We did upgrade from 1.10, but I don't think this has anything to do with the upgrade. I did notice the skip-adding-to-frontend bits changed in 1.12 and master, but I have yet to take a closer look.

@aditighag
Copy link
Member

aditighag commented Feb 17, 2023

@hemanthmalla Thanks! I have a fix + unit test in progress for the v1.11 changes. I'll need to take a closer look at >= v1.12 to see if the problem exists there.

@aditighag
Member

@hemanthmalla Can you give this fix a shot - #23858?

@hemanthmalla
Member Author

@aditighag thank you for the fix and unit test. I think we need the commit from this PR as well to avoid duplicates arising from other failure modes in the future.
I tested #23858 and this commit against the reproducer. The fix works as expected and entries are cleaned up.

I need to understand why integration tests are failing here though. Do you think any of these could be flakes?

@aditighag
Member

@aditighag thank you for the fix and unit test. I think we need the commit from this PR as well to avoid duplicates arising from other failure modes in the future. I tested #23858 and this commit against the reproducer.

I'm not sure I follow: As Martynas pointed out above, the existing code (s.backendRefCount.Add(hash)) already guards against having duplicate backend entries, right? Or were you talking about some other changes that are not yet pushed to this PR?

The fix works as expected and entries are cleaned up.

Thanks for testing the fix. The PR explains why we could have duplicate backend entries in some cases.

I need to understand why integration tests are failing here though. Do you think any of these could be flakes?

The Travis failure is related. As for the Jenkins failures, here are the documented steps - https://docs.cilium.io/en/v1.13/contributing/testing/ci/. Anyway, you may not need this PR?

@hemanthmalla
Member Author

@aditighag in a scenario where a backend belongs only to one service, when the k8s service is deleted and cleanup fails halfway through, s.backendRefCount would be zero (hash deleted from the map). At this stage, cleanup only went through for the ref count and backend ID, but nothing else.

delK8sSVCs()
  -> DeleteService()
    -> deleteServiceLocked()
      -> deleteBackendsFromCacheLocked() [decrements ref count and releases backend ID]  <--- Zeroed here
        -> lbmap.DeleteService() [attempts to delete all entries based on #backends]
           * Cmd to delete the terminating entry fails and the method bails immediately
           * Obsolete backend entries are left behind
           * Lock released

During the next UpsertService call for a service that uses this backend, s.backendRefCount.Add(hash) will not guard against a duplicate entry, right?

@ghost

ghost commented Feb 22, 2023

So we actually have two problems: backend entries leak when the service lbmap deletion fails, and existing backends fail to be reused. Am I right?

@aditighag
Member

aditighag commented Feb 24, 2023

@aditighag in a scenario where a backend belongs only to one service, when the k8s service is deleted and cleanup fails halfway through, s.backendRefCount would be zero (hash deleted from the map). At this stage, cleanup only went through for the ref count and backend ID, but nothing else.

delK8sSVCs()
  -> DeleteService()
    -> deleteServiceLocked()
      -> deleteBackendsFromCacheLocked() [decrements ref count and releases backend ID]  <--- Zeroed here
        -> lbmap.DeleteService() [attempts to delete all entries based on #backends]
           * Cmd to delete the terminating entry fails and the method bails immediately
           * Obsolete backend entries are left behind
           * Lock released

During the next UpsertService call for a service that uses this backend, s.backendRefCount.Add(hash) will not guard against a duplicate entry, right?

After deleteBackendsFromCacheLocked returns for a service with terminating backend(s), backends have already been removed from the in-memory state in the Service instance (provided that these backends were only associated with that sole service). However, the agent and datapath are now out of sync, because the backends weren't deleted from BPF maps. So in order to remove duplicate backends from the BPF maps, you would have to scan the backends map during the restore operation (see restoreBackendsLocked). Have you tried checking for duplicates here - https://github.com/cilium/cilium/blob/master/pkg/service/service.go#L1371-L1371?
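For illustration, a duplicate check along those lines could look roughly like the sketch below (hypothetical names and types; dedupeRestoredBackends is not an existing Cilium function): while walking the dumped backends, any entry whose address was already restored is deleted, keeping the first ID.

package main

import "fmt"

// Hypothetical sketch of a duplicate check on the restore path; names and
// types are illustrative, not Cilium's.
type dumpedBackend struct {
    id   int
    addr string // stand-in for the backend's L3n4Addr hash
}

// dedupeRestoredBackends keeps the first ID seen for each address and asks the
// caller to delete any later entry with the same address.
func dedupeRestoredBackends(dump []dumpedBackend, deleteByID func(int)) map[string]int {
    seen := make(map[string]int) // addr -> first restored ID
    for _, b := range dump {
        if _, ok := seen[b.addr]; ok {
            deleteByID(b.id) // duplicate entry for an already restored address
            continue
        }
        seen[b.addr] = b.id
    }
    return seen
}

func main() {
    dump := []dumpedBackend{
        {513, "10.0.0.1:8080"},
        {518, "10.0.0.1:8080"}, // leaked duplicate of the same address
        {515, "10.0.0.2:8080"},
    }
    restored := dedupeRestoredBackends(dump, func(id int) {
        fmt.Println("deleting duplicate backend", id)
    })
    fmt.Println(restored)
}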

The issue with your changes is that if a backend is common across multiple services, then you'll skip adding the backend to subsequent services. And when the first service is deleted, you would end up deleting the backend.

@aditighag aditighag left a comment

@hemanthmalla
Member Author

@aditighag I've been OOO for the last week. If not sooner, I'll check and get back on the 8th. Sorry for the delay.

@aditighag
Member

If not sooner, I'll check and get back on the 8th.

@hemanthmalla Any update on the PR? We can turn it into a draft in the meantime.

@christarazi christarazi added kind/bug This is a bug in the Cilium logic. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Mar 22, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 22, 2023
@christarazi christarazi added needs-backport/1.11 affects/v1.11 This issue affects v1.11 branch labels Mar 22, 2023
@hemanthmalla
Member Author

@aditighag apologies for the delay.

you'll skip adding the backend to subsequent services. And when the first service is deleted, you would end up deleting the backend

Thanks for catching this. What if we moved the part that adds the backend to the svc, svc.backendByHash[hash] = backends[i], out of the backendByHash lookup block?

So in order to remove duplicate backends from the BPF maps...

I intended this PR only to avoid the leak in the first place. If we also plan on handling already leaked / duplicate entries, can we do that in a different PR?

@hemanthmalla hemanthmalla force-pushed the hmalla/svc_backend_leak branch 2 times, most recently from f221f4a to 2286073 Compare March 24, 2023 22:12
@aditighag
Member

I intended this PR only to avoid the leak in the first place.

The leak was fixed in this PR. We have a PR to not skip backend restore on failures, and Jared also backported it. So I'm not sure what this PR is trying to address?
@jaredladvina / @hemanthmalla: Can you comment on to what extent you are still seeing the original bug's implications?

@hemanthmalla
Member Author

Primarily I wanted to address the issue that if we don't look up backends in the global in-memory map, there is a possibility of hitting the same bug again in the future, especially because a ref count of zero for a backend doesn't mean that the entry was deleted from the BPF map.

@hemanthmalla
Member Author

I realize the current logic is flawed as well, since the ref count doesn't get updated. Pushing an update to fix that.

Follow up for cilium#23858. During backend deletion, reference count
decrement and backend ID release operations could go through, but
deletion of entries from bpf map might fail. This commit adds
additional defenses to avoid potential leaks or duplicates from
similar scenarios in the future.

Also see cilium#23749 (comment) for more details on past race conditions.

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
if b, found := svc.backendByHash[hash]; !found {
if s.backendRefCount.Add(hash) {
globalBackend, existsGlobally := s.backendByHash[hash]
Member

Umm... When there is a leak, datapath state goes out of sync with userspace. Will this work when a backend is associated with only one service?

Member Author

My understanding was that s.backendByHash would still have the entry because it's deleted only after the entry is removed from the lbmap. Since the backend ID might be released, I'm attempting to restore it if an entry exists in s.backendByHash.

cilium/pkg/service/service.go

Lines 1391 to 1399 in 9629343

if s.backendRefCount[hash] == 0 {
log.WithField(logfields.BackendID, b.ID).
Debug("Removing orphan backend")
// The b.ID is unique across IPv4/6, hence attempt
// to clean it from both maps, and ignore errors.
DeleteBackendID(b.ID)
s.lbmap.DeleteBackendByID(b.ID)
delete(s.backendByHash, hash)
}

Member

I completely agree with the motivation to prevent such bugs from happening in the future. The fix looks fine based on the current logic, but we are relying on the ordering of the release of the backend ID and the deletion from the lbmap. Could you add unit tests?

@aditighag aditighag left a comment

Sorry for the delay! Thanks for the revised fix.


@brb brb removed their request for review April 25, 2023 07:53
@github-actions

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label May 26, 2023
@github-actions

github-actions bot commented Jun 9, 2023

This pull request has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this Jun 9, 2023
Labels
affects/v1.11 This issue affects v1.11 branch
kind/bug This is a bug in the Cilium logic.
release-note/bug This PR fixes an issue in a previous release of Cilium.
sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.