
Fix for service backend entry leak in cilium_lb4_backends_v2 #23749

Closed
wants to merge 1 commit

Conversation

hemanthmalla
Member

Backends are meant to be reused across services. When service backends are updated, the existence of every backend is looked up by its hash. Currently, the check is made only against the current service's backends, not against the global backends.

This can result in duplicate backend entries. Without this commit, whether a duplicate is created depends on a race between releasing the old backend ID during deletion and updating the reference count during creation. Currently, when the service restoration process finds a duplicate entry on agent startup, restoration stops and the rest of the entries are leaked. Over time, with enough restarts, we could max out the backend BPF map.

This commit makes the lookup global, and entries in the backendByHash map are deleted only after the corresponding data is deleted from the lbmap. This should prevent duplicate entries in the backend map.
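In rough terms, the intended behavior looks like the sketch below. This is only an illustration of the idea; the Manager and Service types and the upsertBackend helper are made-up names, not the actual Cilium code. The point is that the global backendByHash map is consulted before a new backend ID is acquired, so a stale global entry gets reused instead of duplicated.

package main

import "fmt"

// Illustrative types only; names loosely mirror the PR description and are
// not the exact Cilium definitions.
type Backend struct {
    ID   int
    Addr string
}

type Service struct {
    backendByHash map[string]*Backend // this service's view of its backends
}

type Manager struct {
    backendByHash   map[string]*Backend // global view across all services
    backendRefCount map[string]int
    nextID          int
}

// upsertBackend reuses a globally known backend for hash if one exists, so a
// stale global entry never leads to a second ID for the same address.
func (m *Manager) upsertBackend(svc *Service, hash, addr string) *Backend {
    if b, ok := svc.backendByHash[hash]; ok {
        return b // already referenced by this service
    }
    m.backendRefCount[hash]++
    if global, ok := m.backendByHash[hash]; ok {
        svc.backendByHash[hash] = global // reuse instead of allocating a new ID
        return global
    }
    m.nextID++
    b := &Backend{ID: m.nextID, Addr: addr}
    m.backendByHash[hash] = b
    svc.backendByHash[hash] = b
    return b
}

func main() {
    m := &Manager{backendByHash: map[string]*Backend{}, backendRefCount: map[string]int{}}
    s1 := &Service{backendByHash: map[string]*Backend{}}
    s2 := &Service{backendByHash: map[string]*Backend{}}
    b1 := m.upsertBackend(s1, "h1", "10.0.0.1:8080")
    b2 := m.upsertBackend(s2, "h1", "10.0.0.1:8080")
    fmt.Println(b1.ID == b2.ID) // true: the backend is reused, not duplicated
}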

Fixes #23551

Signed-off-by: Hemanth Malla hemanth.malla@datadoghq.com

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 14, 2023
@hemanthmalla
Member Author

/test

@hemanthmalla hemanthmalla marked this pull request as ready for review February 15, 2023 13:01
@hemanthmalla hemanthmalla requested a review from a team as a code owner February 15, 2023 13:01
@brb brb self-requested a review February 15, 2023 13:51
@brb
Member

brb commented Feb 15, 2023

@hemanthmalla Nice catch!

This can result in duplicate backend entries

I am struggling a bit to understand how this can lead to duplicate backend entries. Let's assume that a backend exists within another service, while a given service doesn't have it. The if s.backendRefCount.Add(hash) check evaluates to false, so the backend won't be added to newBackend. Am I missing something?

@hemanthmalla
Member Author

@brb That's correct. I missed accounting for this behavior of backendRefCount.Add() :(
Back to the drawing board to understand where the leaks are coming from.

@hemanthmalla
Member Author

@brb Spent some more time looking into this. The race is between the removal of a backend in terminating state and the deletion of the entire service.
When a backend is in terminating state, there's an optimization in place to avoid sending new traffic to it. This is implemented by skipping the addition of entries to the frontend map while the backend entry is still in place.

UpsertService()
  -> updateBackendsCacheLocked()
    -> upsertServiceIntoLBMaps()
      -> Check and exclude terminating backends
        -> lbmap.UpsertService()
          -> updateServiceEndpoint()

// Skip adding the terminating backend to the service map so that it
// won't be selected to serve new requests. However, the backend is still
// kept in the affinity and backend maps so that existing connections
// are able to terminate gracefully. The final clean-up for the backend
// will happen once the agent receives a delete event for the service
// endpoint as it'll be part of obsoleteBackendIDs/obsoleteSVCBackendIDs
// list passed to this function.
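For illustration, the optimization boils down to something like the sketch below (made-up names, not the actual Cilium implementation): terminating backends are filtered out of the service's slot entries, while their backend-map entries are left untouched so existing connections can drain.

package main

import "fmt"

// Illustrative sketch of the optimization quoted above; names are made up.
type backend struct {
    id          int
    terminating bool
}

// serviceSlots builds the list of backend IDs that should receive new
// traffic; terminating backends are skipped but not deleted anywhere.
func serviceSlots(backends []backend) []int {
    var slots []int
    for _, b := range backends {
        if b.terminating {
            continue // no new traffic, but the backend map entry stays
        }
        slots = append(slots, b.id)
    }
    return slots
}

func main() {
    bs := []backend{{513, true}, {515, false}, {516, false}}
    fmt.Println(serviceSlots(bs)) // [515 516]; 513 stays only in the backend map
}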

We start seeing issues when this service with a terminating backend gets deleted:

delK8sSVCs()
  -> DeleteService()
    -> deleteServiceLocked()
      -> deleteBackendsFromCacheLocked() [decrements ref count and releases backend ID]
        -> lbmap.DeleteService() [attempts to delete all entries based on #backends]
           * Cmd to delete the terminating entry fails and the method bails immediately
           * Obsolete backend entries are left behind
           * Lock released

Now the ref count for the orphan backend's hash is 0 and the backend ID is released. If this backend is now added to any service, we'll see a duplicate entry.

UpsertService()
  -> updateBackendsCacheLocked()
     * if b, found := svc.backendByHash[hash] returns false because it's checked against the local service
     * if s.backendRefCount.Add(hash) returns true because the count was previously reset
       -> AcquireBackendID(backend.L3n4Addr) returns a new ID since the lookup on alloc.entities fails
         -> upsertServiceIntoLBMaps()
           -> s.lbmap.AddBackend()

And we have a duplicate backend entry.
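To make the ordering concrete, here is a toy sketch of the race (purely illustrative; the lbmap type, the IDs, and the helper below are hypothetical stand-ins, not Cilium code). The cache cleanup succeeds, the lbmap deletion fails, and the next upsert for the same address allocates a second entry:

package main

import (
    "errors"
    "fmt"
)

// Toy stand-in for the BPF backend map; all names here are hypothetical.
type lbmap struct{ backends map[int]string }

func (m *lbmap) deleteService(failOnTerminating bool) error {
    if failOnTerminating {
        // Mimics "cmd to delete the terminating entry fails and the method
        // bails": backend entries are left behind in the map.
        return errors.New("unable to delete element")
    }
    return nil
}

func main() {
    m := &lbmap{backends: map[int]string{513: "10.0.0.1:8080"}}
    refCount := map[string]int{"h1": 1}
    idReleased := false

    // DeleteService: cache cleanup happens first...
    refCount["h1"]--  // deleteBackendsFromCacheLocked: ref count drops to 0
    idReleased = true // ...and backend ID 513 is released
    // ...then the lbmap deletion fails and the method bails out.
    if err := m.deleteService(true); err != nil {
        fmt.Println("delete failed:", err) // entry 513 is now orphaned in the map
    }

    // Next UpsertService for the same address: the ref count was reset and the
    // ID was released, so a fresh ID is acquired and a second entry is added.
    if refCount["h1"] == 0 && idReleased {
        refCount["h1"]++
        newID := 518 // stand-in for AcquireBackendID handing out a new ID
        m.backends[newID] = "10.0.0.1:8080"
    }
    fmt.Println("backend map:", m.backends) // two entries for the same address
}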

@aditighag
Member

@hemanthmalla Good catch! You might be onto something here! Thanks for the debugging notes.

There is a difference in the logic around how we treat terminating backends in v1.11 (terminating backends are removed from the BPF maps) vs. v1.12 (terminating backends are moved to the end of a service entry in the BPF maps). What version are you running (based on the linked code above, it looks like v1.11)? Was there a Cilium upgrade/restart involved?

@hemanthmalla
Member Author

I was also able to reproduce this with the following:

  • Create a deployment that traps TERM and does nothing for 90 seconds: trap 'sleep 90' TERM && sleep infinity
  • Create a ClusterIP service
  • Delete one of the pods with a long enough grace period: kubectl delete pod <> --grace-period 60
  • Delete the service.
level=warning msg="Error deleting service by frontend" error="Unable to delete service entry REDACTED_IP:36895: unable to delete element REDACTED_IP:8080 from map cilium_lb4_services_v2: no such file or directory" k8sNamespace=hemanth-test k8sSvcName=leak-repro-service obj="REDACTED_IP:8080" subsys=k8s-watcher

After creation:

root@ip-10-128-102-12:/home/cilium# ./cilium bpf lb list --frontends | grep REDACTED_IP -A4
REDACTED_IP:8080       516 0 (90) [0x0 0x0]
                       513 0 (90) [0x0 0x0]
                       515 0 (90) [0x0 0x0]
                       0 3 (90) [0x0 0x0]

While one of the pods is in terminating state (513 goes away):

root@ip-10-128-102-12:/home/cilium# ./cilium bpf lb list --frontends | grep REDACTED_IP -A4
REDACTED_IP:8080       515 0 (90) [0x0 0x0]
                       516 0 (90) [0x0 0x0]
                       518 0 (90) [0x0 0x0]
                       0 3 (90) [0x0 0x0]

All 4 backends are leaked after the service is deleted:

root@ip-10-128-102-12:/home/cilium# ./cilium bpf lb list --backends | grep -E '(513|515|516|518)'
518   any://IP1:8080
513   any://IP2:8080
515   any://IP3:8080
516   any://IP4:8080

@hemanthmalla
Member Author

@aditighag we're observing this in 1.11. We did upgrade from 1.10, but I don't think this has anything to do with the upgrade. I did notice the skip-adding-to-frontend bits changed in 1.12 and master, but I have yet to take a closer look.

@aditighag
Copy link
Member

aditighag commented Feb 17, 2023

@hemanthmalla Thanks! I have a fix + unit test in progress for the v1.11 changes. I'll need to take a closer look at >= v1.12 to see if the problem exists there.

@aditighag
Member

@hemanthmalla Can you give this fix a shot - #23858?

@hemanthmalla
Member Author

@aditighag thank you for the fix and unit test. I think we need the commit from this PR as well to avoid duplicates arising from other failure modes in the future.
I tested #23858 and this commit against the reproducer. The fix works as expected and entries are cleaned up.

I need to understand why integration tests are failing here though. Do you think any of these could be flakes?

@aditighag
Member

@aditighag thank you for the fix and unit test. I think we need the commit from this PR as well to avoid duplicates arising from other failure modes in the future. I tested #23858 and this commit against the reproducer.

I'm not sure I follow: As Martynas pointed out above, the existing code (s.backendRefCount.Add(hash)) already guards against having duplicate backend entries, right? Or were you talking about some other changes that are not yet pushed to this PR?

The fix works as expected and entries are cleaned up.

Thanks for testing the fix. The PR explains why we could have duplicate backend entries in some cases.

I need to understand why integration tests are failing here though. Do you think any of these could be flakes?

The Travis failure is related. As for the Jenkins failures, here are the documented steps - https://docs.cilium.io/en/v1.13/contributing/testing/ci/. Anyway, you may not need this PR?

@hemanthmalla
Member Author

@aditighag in a scenario where a backend belongs only to one service, when the k8s service is deleted and cleanup fails halfway through, s.backendRefCount would be zero (hash deleted from the map). At this stage, cleanup only went through for the ref count and backend ID, but nothing else.

delK8sSVCs()
  -> DeleteService()
    -> deleteServiceLocked()
      -> deleteBackendsFromCacheLocked() [decrements ref count and releases backend ID]  <--- Zeroed here
        -> lbmap.DeleteService() [attempts to delete all entries based on #backends]
           * Cmd to delete the terminating entry fails and the method bails immediately
           * Obsolete backend entries are left behind
           * Lock released

During the next UpsertService call for a service that uses this backend, s.backendRefCount.Add(hash) will not guard against a duplicate entry, right?

@ghost

ghost commented Feb 22, 2023

So we actually have two problems: backend entries leak when the service lbmap deletion fails, and existing backends fail to be reused. Am I right?

@aditighag
Member

aditighag commented Feb 24, 2023

@aditighag in a scenario where a backend belongs only to one service, when the k8s service is deleted and cleanup fails halfway through, s.backendRefCount would be zero (hash deleted from the map). At this stage, cleanup only went through for the ref count and backend ID, but nothing else.

delK8sSVCs()
  -> DeleteService()
    -> deleteServiceLocked()
      -> deleteBackendsFromCacheLocked() [decrements ref count and releases backend ID]  <--- Zeroed here
        -> lbmap.DeleteService() [attempts to delete all entries based on #backends]
           * Cmd to delete the terminating entry fails and the method bails immediately
           * Obsolete backend entries are left behind
           * Lock released

During the next UpsertService call for a service that uses this backend, s.backendRefCount.Add(hash) will not guard against a duplicate entry, right?

After deleteBackendsFromCacheLocked returns for a service with terminating backend(s), backends have already been removed from the in-memory state in the Service instance (provided that these backends were only associated with that sole service). However, the agent and datapath are now out of sync, because the backends weren't deleted from BPF maps. So in order to remove duplicate backends from the BPF maps, you would have to scan the backends map during the restore operation (see restoreBackendsLocked). Have you tried checking for duplicates here - https://github.com/cilium/cilium/blob/master/pkg/service/service.go#L1371-L1371?
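For illustration, a duplicate check along those lines could look roughly like the sketch below (hypothetical names and types; dedupeRestoredBackends is not an existing Cilium function): while walking the dumped backends, any entry whose address was already restored is deleted, keeping the first ID.

package main

import "fmt"

// Hypothetical sketch of a duplicate check on the restore path; names and
// types are illustrative, not Cilium's.
type dumpedBackend struct {
    id   int
    addr string // stand-in for the backend's L3n4Addr hash
}

// dedupeRestoredBackends keeps the first ID seen for each address and asks the
// caller to delete any later entry with the same address.
func dedupeRestoredBackends(dump []dumpedBackend, deleteByID func(int)) map[string]int {
    seen := make(map[string]int) // addr -> first restored ID
    for _, b := range dump {
        if _, ok := seen[b.addr]; ok {
            deleteByID(b.id) // duplicate entry for an already restored address
            continue
        }
        seen[b.addr] = b.id
    }
    return seen
}

func main() {
    dump := []dumpedBackend{
        {513, "10.0.0.1:8080"},
        {518, "10.0.0.1:8080"}, // leaked duplicate of the same address
        {515, "10.0.0.2:8080"},
    }
    restored := dedupeRestoredBackends(dump, func(id int) {
        fmt.Println("deleting duplicate backend", id)
    })
    fmt.Println(restored)
}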

The issue with your changes is that if a backend is common across multiple services, then you'll skip adding the backend to subsequent services. And when the first service is deleted, you would end up deleting the backend.

@aditighag aditighag left a comment

@hemanthmalla
Member Author

@aditighag I've been OOO for the last week. If not sooner, I'll check and get back on the 8th. Sorry for the delay.

@aditighag
Member

If not sooner, I'll check and get back on the 8th.

@hemanthmalla Any update on the PR? We can turn it into a draft in the meantime.

@christarazi christarazi added kind/bug This is a bug in the Cilium logic. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Mar 22, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 22, 2023
@christarazi christarazi added needs-backport/1.11 affects/v1.11 This issue affects v1.11 branch labels Mar 22, 2023
@hemanthmalla
Member Author

@aditighag apologies for the delay.

you'll skip adding the backend to subsequent services. And when the first service is deleted, you would end up deleting the backend

Thanks for catching this. What if we moved the part that adds the backend to the svc, svc.backendByHash[hash] = backends[i], out of the backendByHash lookup block?

So in order to remove duplicate backends from the BPF maps...

I intended this PR only to avoid the leak in the first place. If we also plan on handling already leaked / duplicate entries, can we do that in a different PR?

@hemanthmalla hemanthmalla force-pushed the hmalla/svc_backend_leak branch 2 times, most recently from f221f4a to 2286073 Compare March 24, 2023 22:12
@aditighag
Member

I intended this PR only to avoid the leak in the first place.

The leak was fixed in this PR. We have a PR to not skip backend restore on failures, and Jared also backported it. So I'm not sure what this PR is trying to address?
@jaredladvina / @hemanthmalla: Can you comment on to what extent you are still seeing the original bug's implications?

@hemanthmalla
Member Author

Primarily I wanted to address the issue that if we don't look up backends in the global in-memory map, there is a possibility of hitting the same bug again in the future, especially because a ref count of zero for a backend doesn't mean that the entry was deleted from the BPF map.

@hemanthmalla
Member Author

I realize the current logic is flawed as well, since the ref count doesn't get updated. Pushing an update to fix that.

Follow up for cilium#23858. During backend deletion, reference count
decrement and backend ID release operations could go through, but
deletion of entries from bpf map might fail. This commit adds
additional defenses to avoid potential leaks or duplicates from
similar scenarios in the future.

Also see cilium#23749 (comment) for more details on past race conditions.

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
if b, found := svc.backendByHash[hash]; !found {
if s.backendRefCount.Add(hash) {
globalBackend, existsGlobally := s.backendByHash[hash]
Member

Umm... When there is a leak, datapath state goes out of sync with userspace. Will this work when a backend is associated with only one service?

Member Author

My understanding was that s.backendByHash would still have the entry because it's deleted only after the entry is removed from the lbmap. Since the backend ID might be released, I'm attempting to restore it if an entry exists in s.backendByHash.

cilium/pkg/service/service.go

Lines 1391 to 1399 in 9629343

if s.backendRefCount[hash] == 0 {
log.WithField(logfields.BackendID, b.ID).
Debug("Removing orphan backend")
// The b.ID is unique across IPv4/6, hence attempt
// to clean it from both maps, and ignore errors.
DeleteBackendID(b.ID)
s.lbmap.DeleteBackendByID(b.ID)
delete(s.backendByHash, hash)
}

Member

I completely agree with the motivation to prevent such bugs from happening in the future. The fix looks fine based on the current logic, but we are relying on the ordering of the release of the backend ID and the deletion from the lbmap. Could you add unit tests?

@aditighag aditighag left a comment

Sorry for the delay! Thanks for the revised fix.


@brb brb removed their request for review April 25, 2023 07:53
@github-actions

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label May 26, 2023
@github-actions

github-actions bot commented Jun 9, 2023

This pull request has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this Jun 9, 2023
Labels
affects/v1.11 This issue affects v1.11 branch
kind/bug This is a bug in the Cilium logic.
release-note/bug This PR fixes an issue in a previous release of Cilium.
sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.