
Fix for excess IP release race condition | handshake between agent and operator #17939

Merged

Conversation

@hemanthmalla (Member) commented Nov 19, 2021

Forward port of DataDog#24

Currently there can be up to a 15 second delay between IP allocation to a pod and the corresponding CiliumNode (CN) CRD update. In that window, the operator can decide that the node has excess IPs and release an IP that is actually in use, causing pod connectivity issues.

This PR introduces a handshake between the operator and the agent before an excess IP is released, avoiding the race condition. A new operator flag `excess-ip-release-delay` controls how long the operator waits before considering an IP for release (default: 180 seconds); the delay helps with IP reuse during rollouts. The operator and agent exchange handshake state via a new map in the CiliumNode status, `.status.ipam.release-ips`.

Fixes: #13412

(diagram: handshake_upstream)
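The handshake described above can be sketched as a small state machine over the `.status.ipam.release-ips` map. The state names below mirror the `ipamOption.*` constants visible in the diff; the plain strings, the `agentAck` helper, and its signature are illustrative, not the actual Cilium API:

```go
package main

import "fmt"

// Handshake states exchanged via .status.ipam.release-ips.
// Names mirror the constants seen in the diff; the string values
// here are illustrative.
const (
	MarkedForRelease = "marked-for-release" // operator: IP looks excess, asks agent
	ReadyForRelease  = "ready-for-release"  // agent: IP unused, safe to release
	DoNotRelease     = "do-not-release"     // agent: IP in use (or not owned), abort
	Released         = "released"           // operator: IP released from the pool
)

// agentAck sketches the agent's side of the handshake: it ACKs
// (ready-for-release) or NACKs (do-not-release) an operator request.
func agentAck(state string, inPool, allocated bool) string {
	if state != MarkedForRelease {
		return state // nothing for the agent to do
	}
	if !inPool || allocated {
		return DoNotRelease
	}
	return ReadyForRelease
}

func main() {
	fmt.Println(agentAck(MarkedForRelease, true, false))  // ready-for-release
	fmt.Println(agentAck(MarkedForRelease, true, true))   // do-not-release
	fmt.Println(agentAck(MarkedForRelease, false, false)) // do-not-release
}
```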

The following alternative options were considered :

  • We could force a CN write to go through before each allocation, but that could result in too many writes from the agent.

  • The operator picks the ENI with the most free IPs to release excess IPs from, so that the ENI itself could potentially be released later. We could move the excess-detection logic to the agent, but that would require calling into cloud-provider-specific implementations, which (AFAIU) the agent-side IPAM implementation does not do today.

Fix for excess IP release race condition. A new operator flag `excess-ip-release-delay` is introduced to control the waiting period before marking an IP for release.

Note : DataDog#24 already received some valuable feedback from @aanm @gandro and @christarazi

@hemanthmalla hemanthmalla requested a review from a team as a code owner November 19, 2021 17:47
@hemanthmalla hemanthmalla requested review from a team November 19, 2021 17:47
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 19, 2021
@aanm aanm removed the request for review from tgraf November 19, 2021 23:58
@aanm aanm unassigned tgraf Nov 19, 2021
@aanm aanm added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Nov 19, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 19, 2021
@gandro gandro added area/eni Impacts ENI based IPAM. sig/ipam IP address management, including cloud IPAM labels Nov 22, 2021
@gandro (Member) left a comment

Awesome, thanks a ton for addressing this long-standing issue.

I think we need to ensure that allocateNext doesn't hand out IPs which are marked for release too (see inline comment), the rest are smaller nits.

pkg/ipam/crd.go (outdated, resolved)
pkg/ipam/crd.go (outdated, resolved)
pkg/ipam/types/types.go (outdated, resolved)
@hemanthmalla hemanthmalla force-pushed the hemanthmalla/ip_release_handshake_1.10 branch 2 times, most recently from d11f4ed to 618bfd5 Compare November 23, 2021 20:04
@hemanthmalla hemanthmalla requested a review from a team as a code owner November 23, 2021 20:04
@hemanthmalla hemanthmalla force-pushed the hemanthmalla/ip_release_handshake_1.10 branch from 618bfd5 to c76cc5a Compare November 23, 2021 21:42
@joestringer (Member) commented Nov 24, 2021

/test

Job 'Cilium-PR-K8s-1.22-kernel-4.9' has 1 failure, but it might be a new flake since it also hit 1 known flake: #17919 (93.81)

@hemanthmalla (Member, Author) commented Nov 24, 2021

There's one more corner case that needs addressing.

After the handshake is complete and the operator is done releasing an IP, CiliumNode status (release-ips) and spec (pool) are updated in two consecutive requests. There's a tiny window between the two updates where the entry has been removed from .status.ipam.release-ips but the IP is still present in spec.ipam.pool. It might be possible for the IP to be allocated between these requests.

@gandro / @aanm Can I flip the order of updates only when there are changes to the IP release status? Comments in syncToAPIServer() say that the status should always be updated first, though, and I don't fully understand why.

cilium/pkg/ipam/node.go

Lines 700 to 701 in eeb7f1b

// Always update the status first to ensure that the IPAM information
// is synced for all addresses that are marked as available.

@gandro (Member) commented Nov 24, 2021

> There's one more corner case that needs addressing.
>
> After the handshake is complete and the operator is done releasing an IP, CiliumNode status (release-ips) and spec (pool) are updated in two consecutive requests. There's a tiny window between the two updates where the entry has been removed from .status.ipam.release-ips but the IP is still present in spec.ipam.pool. It might be possible for the IP to be allocated between these requests.

Oh wow, good catch! In dynamic cluster pool, we avoid this by having the agent remove the entry from status.ipam.release-ips once it is no longer present in spec.pool, which might also be an option here.

> @gandro / @aanm Can I flip the order of updates only when there are changes to the IP release status? Comments in syncToAPIServer() say that the status should always be updated first, though, and I don't fully understand why.

While I'm also not sure, I believe the ordering has to do with the fact that the agent needs to know the ENI metadata for each IP present in spec. If the agent sees an IP in the pool before it knows about the IP's ENI metadata from status (e.g. subnet CIDR and ENI MAC), this error will be hit:

return nil, fmt.Errorf("unable to find ENI %s", ipInfo.Resource)
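A sketch of why status must be written before spec: the agent resolves ENI metadata from status for every IP it sees in the spec pool, so an IP appearing in spec before its ENI appears in status hits the lookup error above. The `node` type and `resolve` helper below are illustrative stand-ins, not the real CiliumNode types:

```go
package main

import "fmt"

// Illustrative stand-ins for CiliumNode spec/status fields.
type node struct {
	pool map[string]string // spec.ipam.pool: IP -> owning ENI ID
	enis map[string]bool   // status: ENI ID -> metadata is known
}

// resolve mimics the agent's allocation path: every pooled IP must
// have its ENI metadata present in status, or allocation errors out.
func resolve(n node, ip string) error {
	eni, ok := n.pool[ip]
	if !ok {
		return fmt.Errorf("IP %s not in pool", ip)
	}
	if !n.enis[eni] {
		return fmt.Errorf("unable to find ENI %s", eni)
	}
	return nil
}

func main() {
	// Spec written before status: the pool references an ENI the
	// agent has no metadata for yet, so allocation fails.
	n := node{pool: map[string]string{"10.0.0.5": "eni-123"}, enis: map[string]bool{}}
	fmt.Println(resolve(n, "10.0.0.5")) // unable to find ENI eni-123

	// Status written first: metadata present, allocation succeeds.
	n.enis["eni-123"] = true
	fmt.Println(resolve(n, "10.0.0.5")) // <nil>
}
```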


@christarazi (Member) left a comment

Thanks for forward porting! The changes LGTM overall. I am marking my review as approved, but there are suggestions that I'd like to see followed up on in a later PR, so that we can get this PR in. Feel free to address my comments as you see fit in this PR, however. After this is merged, we can create a tracking issue and put all the comments that need following up in there. Please use "resolve conversation" for comments that you've already addressed in this PR.

Note, for any suggestions related to commits, could you address those in this PR as those can't be changed in a followup?

operator/provider_aws_flags.go (outdated, resolved)
pkg/ipam/crd.go (outdated)
Comment on lines 334 to 383:
```go
releaseUpstreamSyncNeeded := false
// ACK or NACK IPs marked for release by the operator
for ip, status := range n.ownNode.Status.IPAM.ReleaseIPs {
	if status != ipamOption.IPAMMarkForRelease || n.ownNode.Spec.IPAM.Pool == nil {
		continue
	}
	// NACK the IP, if this node doesn't own the IP
	if _, ok := n.ownNode.Spec.IPAM.Pool[ip]; !ok {
		n.ownNode.Status.IPAM.ReleaseIPs[ip] = ipamOption.IPAMDoNotRelease
		continue
	}
	// Retrieve the appropriate allocator
	var allocator *crdAllocator
	var ipFamily Family
	if ipAddr := net.ParseIP(ip); ipAddr != nil {
		if ipAddr.To4() != nil {
			ipFamily = IPv4
		} else {
			ipFamily = IPv6
		}
	}
	if ipFamily == "" {
		continue
	}
	for _, a := range n.allocators {
		if a.family == ipFamily {
			allocator = a
		}
	}
	if allocator == nil {
		continue
	}

	allocator.mutex.Lock()
	if _, ok := allocator.allocated[ip]; ok {
		// IP still in use, update the operator to stop releasing the IP.
		n.ownNode.Status.IPAM.ReleaseIPs[ip] = ipamOption.IPAMDoNotRelease
	} else {
		n.ownNode.Status.IPAM.ReleaseIPs[ip] = ipamOption.IPAMReadyForRelease
	}
	allocator.mutex.Unlock()
	releaseUpstreamSyncNeeded = true
}
```
Review comment (Member):
Would it make sense to move this code into an isolated function and have it return releaseUpstreamSyncNeeded?
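One self-contained piece of the loop above that extracts cleanly is the address-family detection. A minimal sketch of such a helper (the `familyOf` name is illustrative; the `Family` constants mirror the ones used in the snippet):

```go
package main

import (
	"fmt"
	"net"
)

type Family string

const (
	IPv4 Family = "ipv4"
	IPv6 Family = "ipv6"
)

// familyOf returns the address family of ip, or "" if ip does not parse.
// net.ParseIP accepts both IPv4 and IPv6 text forms; To4 is non-nil
// only for addresses representable in 4 bytes.
func familyOf(ip string) Family {
	addr := net.ParseIP(ip)
	if addr == nil {
		return ""
	}
	if addr.To4() != nil {
		return IPv4
	}
	return IPv6
}

func main() {
	fmt.Println(familyOf("10.0.0.1")) // ipv4
	fmt.Println(familyOf("fd00::1"))  // ipv6
	fmt.Println(familyOf("bogus"))    // (empty string)
}
```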

pkg/ipam/node.go (outdated, resolved)
pkg/ipam/node.go (outdated, resolved)
Comment on lines +659 to +682:
```go
"available":         n.stats.AvailableIPs,
"used":              n.stats.UsedIPs,
"excess":            n.stats.ExcessIPs,
"excessIps":         a.release.IPsToRelease,
"releasing":         ipsToRelease,
"selectedInterface": a.release.InterfaceID,
"selectedPoolID":    a.release.PoolID,
```
Review comment (Member):
Ideally, since these are part of an INFO-level log, they should live in the logfields package. However, other parts of the code suffer from the same issue, so let's follow up on this later; it doesn't have to be part of this PR.

pkg/ipam/crd.go Outdated Show resolved Hide resolved
pkg/ipam/crd.go Outdated Show resolved Hide resolved
@hemanthmalla (Member, Author) commented:
@christarazi Thanks for the review. Addressed most of the feedback, except for the conversations that are still unresolved. There's also one more refactor suggestion in the original PR, DataDog#24 (comment), that we can add to the list for the follow-up PR.

@christarazi (Member) commented Nov 29, 2021

/test

Job 'Cilium-PR-K8s-1.23-kernel-4.9' failed and has not been observed before, so may be related to your PR:


Test Name

K8sServicesTest Checks graceful termination of service endpoints Checks client terminates gracefully on service endpoint deletion

Failure Output

FAIL: Timed out after 30.001s.

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-4.9 so I can create a new GitHub issue to track it.

Job 'Cilium-PR-K8s-1.22-kernel-4.19' failed and has not been observed before, so may be related to your PR:


Test Name

K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests NodePort

Failure Output

FAIL: Request from k8s1 to service http://[fd04::12]:31301 failed

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.22-kernel-4.19 so I can create a new GitHub issue to track it.

@aanm (Member) commented Nov 30, 2021

@hemanthmalla the unit test failures are legitimate

Currently there's a potential 15 sec delay in updating ciliumnode CRD after IP
allocation to a pod, meanwhile the operator can determine that a node has
excess IPs and release the IP causing pod connectivity issues.

A new operator flag `excess-ip-release-delay` is added to control how long the
operator should wait before considering an IP for release (defaults to 180 secs).
This is done to better handle IP reuse during rollouts. Operator and agent use
a new map in cilium node status .status.ipam.release-ips to exchange information
during the handshake.

Fixes: cilium#13412

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
After the handshake is complete and the operator is done releasing an IP,
CiliumNode status (release-ips) and spec (pool) are updated in two
consecutive requests. There's a tiny window between the two updates where
the entry is removed from .status.ipam.release-ips but the IP is still
present in spec.ipam.pool. It was possible that the IP could be allocated
between these requests.

This commit introduces a new state called released to deal with this.
Now agent removes the entry from release-ips only when the IP was
removed from .spec.ipam.pool as well.

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
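The second commit's fix can be sketched as: an entry in `release-ips` that has reached the `released` state is only deleted once the IP has also disappeared from `spec.ipam.pool`, so there is never a moment where the IP is allocatable while its release record is already gone. The `pruneReleased` helper and its types below are illustrative, not the actual agent code:

```go
package main

import "fmt"

// pruneReleased removes entries in the "released" state from the
// release-ips map, but only once the IP is no longer present in the
// spec pool. Deleting from a map while ranging over it is safe in Go.
func pruneReleased(releaseIPs map[string]string, pool map[string]bool) {
	for ip, state := range releaseIPs {
		if state == "released" && !pool[ip] {
			delete(releaseIPs, ip)
		}
	}
}

func main() {
	release := map[string]string{
		"10.0.0.5": "released", // still in spec pool: keep the record
		"10.0.0.6": "released", // gone from spec pool: safe to prune
	}
	pool := map[string]bool{"10.0.0.5": true}
	pruneReleased(release, pool)
	fmt.Println(release) // map[10.0.0.5:released]
}
```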
@hemanthmalla hemanthmalla force-pushed the hemanthmalla/ip_release_handshake_1.10 branch from 77f286b to bcdb763 Compare December 6, 2021 13:47
@gandro (Member) commented Dec 6, 2021

/test

@gandro (Member) commented Dec 6, 2021

/ci-gke

Edit: Previous failure looks like a variant of #17102

https://github.com/cilium/cilium/runs/4431710422?check_suite_focus=true

@aanm aanm removed their request for review December 9, 2021 03:01
@aanm aanm removed their assignment Dec 9, 2021
@aanm aanm added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 9, 2021
@nbusseneau (Member) commented:
Thanks @hemanthmalla! Merging.

```go
for k := range n.enis {
	eniIds = append(eniIds, k)
}
sort.Strings(eniIds)
```
Review comment (Contributor):
Does the same iteration order ensure the same ENI is selected for release across multiple PrepareIPRelease calls? What is the use case here? @hemanthmalla

Reply (Contributor):
Because each call to PrepareIPRelease selects the ENI with the most addresses available for release.
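The sorted iteration in the snippet under review can be combined with the "most free IPs" rule to make the selection fully deterministic: Go map iteration order is random, so sorting the IDs first gives a stable tie-break when several ENIs have the same number of free IPs. A sketch under that assumption (the `pickENI` helper and its input shape are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// pickENI returns the ENI with the most free IPs. Sorting the IDs
// before scanning makes the result deterministic even when several
// ENIs tie, since Go map iteration order is randomized.
func pickENI(freeIPs map[string]int) string {
	ids := make([]string, 0, len(freeIPs))
	for id := range freeIPs {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	best, bestFree := "", -1
	for _, id := range ids {
		if freeIPs[id] > bestFree {
			best, bestFree = id, freeIPs[id]
		}
	}
	return best
}

func main() {
	// eni-a and eni-b tie on free IPs; the sorted scan always
	// picks eni-a, call after call.
	fmt.Println(pickENI(map[string]int{"eni-b": 3, "eni-a": 3, "eni-c": 1})) // eni-a
}
```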

Labels
area/eni Impacts ENI based IPAM. backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/ipam IP address management, including cloud IPAM
Development

Successfully merging this pull request may close these issues.

AWS ENI: Pod assigned IP that is not attached to an ENI
9 participants