Fix for excess IP release race condition | handshake between agent and operator #17939
Conversation
Awesome, thanks a ton for addressing this long-standing issue.
I think we need to ensure that allocateNext doesn't hand out IPs which are marked for release too (see inline comment); the rest are smaller nits.
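The concern above can be sketched as a guard in the allocation path. This is a hypothetical stand-in (`pickIP` is not the actual allocateNext signature), showing only the rule: skip any IP that has an entry in the release-ips map, otherwise a freshly allocated IP could be released from under the pod.

```go
package main

import "fmt"

// pickIP is a hypothetical stand-in for allocateNext's IP selection: it
// must skip IPs that the operator has marked for release, otherwise a
// freshly allocated IP could be released from under the pod.
func pickIP(pool []string, releaseIPs map[string]string) (string, bool) {
	for _, ip := range pool {
		if _, marked := releaseIPs[ip]; marked {
			continue
		}
		return ip, true
	}
	return "", false
}

func main() {
	pool := []string{"10.0.0.1", "10.0.0.2"}
	release := map[string]string{"10.0.0.1": "marked-for-release"}
	ip, ok := pickIP(pool, release)
	fmt.Println(ip, ok) // 10.0.0.2 true
}
```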
Force-pushed from d11f4ed to 618bfd5
Force-pushed from 618bfd5 to c76cc5a
/test
Job 'Cilium-PR-K8s-1.22-kernel-4.9' has 1 failure, but it might be a new flake since it also hit 1 known flake: #17919 (93.81)
There's one more corner case that needs addressing. After the handshake is complete and the operator is done releasing an IP, the CiliumNode status (release-ips) and spec (pool) are updated in two consecutive requests. There's a tiny window between the two updates where the entry is removed from .status.ipam.release-ips while the IP is still present in .spec.ipam.pool. @gandro / @aanm Can I flip the order of updates only when there are changes to the IP release status? (See the comments at lines 700 to 701 in eeb7f1b.)
Oh wow, good catch! In dynamic cluster pool, we avoid this by having the agent remove the entry from …
While I'm also not sure, I believe the order might have to do with the fact that the agent needs to know the ENI metadata for each IP present in spec. If the agent sees an IP in the pool before it knows about its ENI metadata from status (e.g. subnet CIDR and ENI MAC), this error will be hit (see line 528 in 124fa2b).
The two CI failures (besides checkpatch) look like unrelated flaky tests.
Thanks for forward porting! The changes LGTM overall. I am marking my review as approved, but there are suggestions that I'd like to see followed up on in a later PR, so that we can get this PR in. Feel free to address my comments as you see fit in this PR, however. After this is merged, we can create a tracking issue and collect all the comments that need following up there. Please use "resolve conversation" for comments that you've already addressed in this PR.
Note: for any suggestions related to commits, could you address those in this PR, as those can't be changed in a follow-up?
pkg/ipam/crd.go (outdated)
```go
releaseUpstreamSyncNeeded := false
// ACK or NACK IPs marked for release by the operator
for ip, status := range n.ownNode.Status.IPAM.ReleaseIPs {
	if status != ipamOption.IPAMMarkForRelease || n.ownNode.Spec.IPAM.Pool == nil {
		continue
	}
	// NACK the IP, if this node doesn't own the IP
	if _, ok := n.ownNode.Spec.IPAM.Pool[ip]; !ok {
		n.ownNode.Status.IPAM.ReleaseIPs[ip] = ipamOption.IPAMDoNotRelease
		continue
	}
	// Retrieve the appropriate allocator
	var allocator *crdAllocator
	var ipFamily Family
	if ipAddr := net.ParseIP(ip); ipAddr != nil {
		if ipAddr.To4() != nil {
			ipFamily = IPv4
		} else {
			ipFamily = IPv6
		}
	}
	if ipFamily == "" {
		continue
	}
	for _, a := range n.allocators {
		if a.family == ipFamily {
			allocator = a
		}
	}
	if allocator == nil {
		continue
	}

	allocator.mutex.Lock()
	if _, ok := allocator.allocated[ip]; ok {
		// IP still in use, update the operator to stop releasing the IP.
		n.ownNode.Status.IPAM.ReleaseIPs[ip] = ipamOption.IPAMDoNotRelease
	} else {
		n.ownNode.Status.IPAM.ReleaseIPs[ip] = ipamOption.IPAMReadyForRelease
	}
	allocator.mutex.Unlock()
	releaseUpstreamSyncNeeded = true
}
```
Would it make sense to move this code to an isolated function and have it return releaseUpstreamSyncNeeded?
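A refactor along the lines of this suggestion might look like the sketch below. The function name `ackOrNackReleaseIPs`, the simplified types, and the string states are assumptions for illustration; the real code operates on the CiliumNode CRD and `crdAllocator` shown in the hunk above.

```go
package main

import (
	"fmt"
	"net"
)

// Hypothetical stand-ins for the ipamOption release states.
const (
	markForRelease  = "marked-for-release"
	readyForRelease = "ready-for-release"
	doNotRelease    = "do-not-release"
)

// allocator is a simplified stand-in for crdAllocator.
type allocator struct {
	family    string // "ipv4" or "ipv6"
	allocated map[string]bool
}

// ackOrNackReleaseIPs is a hypothetical extraction of the loop from the
// review hunk: it ACKs or NACKs IPs marked for release and reports
// whether an upstream sync is needed.
func ackOrNackReleaseIPs(pool map[string]bool, releaseIPs map[string]string, allocators []*allocator) bool {
	syncNeeded := false
	for ip, status := range releaseIPs {
		if status != markForRelease || pool == nil {
			continue
		}
		// NACK the IP if this node doesn't own it.
		if !pool[ip] {
			releaseIPs[ip] = doNotRelease
			continue
		}
		// Pick the allocator for the IP's address family.
		family := "ipv6"
		if addr := net.ParseIP(ip); addr != nil && addr.To4() != nil {
			family = "ipv4"
		}
		var alloc *allocator
		for _, a := range allocators {
			if a.family == family {
				alloc = a
			}
		}
		if alloc == nil {
			continue
		}
		if alloc.allocated[ip] {
			// IP still in use: tell the operator to stop releasing it.
			releaseIPs[ip] = doNotRelease
		} else {
			releaseIPs[ip] = readyForRelease
		}
		syncNeeded = true
	}
	return syncNeeded
}

func main() {
	pool := map[string]bool{"10.0.0.1": true, "10.0.0.2": true}
	release := map[string]string{
		"10.0.0.1": markForRelease, // in use    -> NACK
		"10.0.0.2": markForRelease, // free      -> ACK
		"10.0.0.3": markForRelease, // not owned -> NACK
	}
	a := &allocator{family: "ipv4", allocated: map[string]bool{"10.0.0.1": true}}
	fmt.Println(ackOrNackReleaseIPs(pool, release, []*allocator{a}))
	fmt.Println(release["10.0.0.1"], release["10.0.0.2"], release["10.0.0.3"])
}
```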
```go
"available":         n.stats.AvailableIPs,
"used":              n.stats.UsedIPs,
"excess":            n.stats.ExcessIPs,
"excessIps":         a.release.IPsToRelease,
"releasing":         ipsToRelease,
"selectedInterface": a.release.InterfaceID,
"selectedPoolID":    a.release.PoolID,
```
Ideally, since these are part of an INFO-level log, they should be inside the logfields package. However, other parts of the code suffer from the same issue, so let's just follow up on this later; it doesn't have to be part of this PR.
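The suggestion amounts to replacing string literals with shared constants. A minimal sketch, assuming hypothetical constant names (in Cilium these would be declared once in the logfields package and reused at every call site):

```go
package main

import "fmt"

// Hypothetical log-field keys, mirroring the style of the logfields
// package; the values match the literals in the hunk above.
const (
	AvailableIPs = "available"
	UsedIPs      = "used"
	ExcessIPs    = "excess"
)

func main() {
	// Call sites then reference the shared constants instead of
	// repeating string literals.
	fields := map[string]int{AvailableIPs: 10, UsedIPs: 4, ExcessIPs: 2}
	fmt.Println(fields[ExcessIPs])
}
```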
Force-pushed from 5bd0924 to 77f286b
@christarazi Thanks for the review. Addressed most of the feedback except the conversations still unresolved. There's also one more refactor suggestion in the original PR DataDog#24 (comment) that we can add to the list for the future PR.
/test
Job 'Cilium-PR-K8s-1.23-kernel-4.9' failed and has not been observed before, so may be related to your PR.
Job 'Cilium-PR-K8s-1.22-kernel-4.19' failed and has not been observed before, so may be related to your PR.
@hemanthmalla the unit test failures are legitimate
Currently there's a potential 15 sec delay in updating the CiliumNode CRD after IP allocation to a pod; meanwhile the operator can determine that a node has excess IPs and release an IP, causing pod connectivity issues. A new operator flag `excess-ip-release-delay` is added to control how long the operator should wait before considering an IP for release (defaults to 180 secs). This is done to better handle IP reuse during rollouts. The operator and agent use a new map in the CiliumNode status, .status.ipam.release-ips, to exchange information during the handshake.
Fixes: cilium#13412
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
After the handshake is complete and the operator is done releasing an IP, the CiliumNode status (release-ips) and spec (pool) are updated in two consecutive requests. There's a tiny window between the two updates where the entry is removed from .status.ipam.release-ips but the IP is still present in .spec.ipam.pool. It was possible that the IP could be allocated between these requests. This commit introduces a new state called "released" to deal with this. Now the agent removes the entry from release-ips only when the IP has been removed from .spec.ipam.pool as well.
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
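The ordering rule this commit describes can be sketched as follows. The helper name `cleanupReleaseIPs` and the simplified maps are assumptions for illustration; the state name "released" is from the commit message.

```go
package main

import "fmt"

const released = "released"

// cleanupReleaseIPs is a hypothetical illustration of the rule introduced
// here: an entry in release-ips is deleted only once the IP is also gone
// from spec.ipam.pool, closing the window where the IP could be
// re-allocated between the two CRD updates.
func cleanupReleaseIPs(pool map[string]bool, releaseIPs map[string]string) {
	for ip, status := range releaseIPs {
		if status == released && !pool[ip] {
			delete(releaseIPs, ip)
		}
	}
}

func main() {
	releaseIPs := map[string]string{"10.0.0.5": released}

	// IP still present in spec.ipam.pool: keep the entry.
	cleanupReleaseIPs(map[string]bool{"10.0.0.5": true}, releaseIPs)
	fmt.Println(len(releaseIPs)) // 1

	// IP removed from the pool as well: the entry can go.
	cleanupReleaseIPs(map[string]bool{}, releaseIPs)
	fmt.Println(len(releaseIPs)) // 0
}
```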
Force-pushed from 77f286b to bcdb763
/test
/ci-gke
Edit: Previous failure looks like a variant of #17102 https://github.com/cilium/cilium/runs/4431710422?check_suite_focus=true
Thanks @hemanthmalla! Merging.
```go
for k := range n.enis {
	eniIds = append(eniIds, k)
}
sort.Strings(eniIds)
```
Does the same iteration order ensure that the same ENI is selected to release IPs from across multiple PrepareIPRelease calls? What is the use case? @hemanthmalla
Because we select the ENI with the most addresses available for release in every PrepareIPRelease call.
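The determinism point in this exchange can be illustrated with a minimal sketch (hypothetical data and a simplified selection rule, not the actual PrepareIPRelease logic): map iteration order is random in Go, so sorting the ENI IDs first makes tie-breaking stable across calls.

```go
package main

import (
	"fmt"
	"sort"
)

// selectENI picks the ENI with the most addresses available for release.
// Sorting the IDs first makes the scan order, and therefore tie-breaking,
// deterministic, since Go randomizes map iteration order.
func selectENI(enis map[string]int) string {
	ids := make([]string, 0, len(enis))
	for id := range enis {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	best, bestFree := "", -1
	for _, id := range ids {
		if enis[id] > bestFree {
			best, bestFree = id, enis[id]
		}
	}
	return best
}

func main() {
	// Hypothetical ENI -> addresses available for release; eni-a and
	// eni-b tie, and the sorted order always picks eni-a.
	fmt.Println(selectENI(map[string]int{"eni-b": 3, "eni-a": 3, "eni-c": 1}))
}
```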
Forward port of DataDog#24
Currently there's a 15 sec delay in updating the CiliumNode (CN) CRD after IP allocation to a pod; meanwhile the operator can determine that a node has excess IPs and release an IP, causing pod connectivity issues.
This PR introduces a handshake between the operator and agent before releasing an excess IP to avoid the race condition. A new operator flag excess-ip-release-delay is added to control how long the operator should wait before considering an IP for release (defaults to 180 secs). This is done to better handle IP reuse during rollouts. The operator and agent use a new map in the CiliumNode status, .status.ipam.release-ips, to exchange information during the handshake.
Fixes: #13412
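The handshake above can be sketched as a small state machine over the release-ips map. The state names follow the commits in this PR; the `agentAck` helper and the exact transitions are an assumption based on this description, not the actual implementation.

```go
package main

import "fmt"

// States exchanged through .status.ipam.release-ips, as described above.
const (
	MarkForRelease  = "marked-for-release" // operator: IP is excess
	ReadyForRelease = "ready-for-release"  // agent: IP is unused, go ahead
	DoNotRelease    = "do-not-release"     // agent: IP is in use, abort
	Released        = "released"           // operator: IP has been released
)

// agentAck models the agent's half of the handshake: it answers a
// marked-for-release entry based on whether the IP is still allocated,
// and leaves every other state untouched.
func agentAck(state string, inUse bool) string {
	if state != MarkForRelease {
		return state
	}
	if inUse {
		return DoNotRelease
	}
	return ReadyForRelease
}

func main() {
	fmt.Println(agentAck(MarkForRelease, true))  // do-not-release
	fmt.Println(agentAck(MarkForRelease, false)) // ready-for-release
}
```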
The following alternative options were considered:
- We could force a CN write to go through before allocation, but that can result in too many writes from the agent.
- The operator picks the ENI with the most free IPs to release excess IPs from; this is done so that the ENI could potentially be released in the future. We could move the excess detection logic to the agent, but this involves calling into cloud-provider-specific implementations, which AFAIU the agent IPAM implementation does not do.
Note: DataDog#24 already received some valuable feedback from @aanm, @gandro, and @christarazi.