Fix for excess IP release race condition | handshake between agent and operator #17939
Conversation
Awesome, thanks a ton for addressing this long-standing issue.
I think we need to ensure that allocateNext doesn't hand out IPs which are marked for release too (see inline comment); the rest are smaller nits.
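The concern above can be sketched as a guard in the allocation path. This is a hypothetical stand-in (`pickIP` is not the actual allocateNext signature), showing only the rule: skip any IP that has an entry in the release-ips map, otherwise a freshly allocated IP could be released from under the pod.

```go
package main

import "fmt"

// pickIP is a hypothetical stand-in for allocateNext's IP selection: it
// must skip IPs that the operator has marked for release, otherwise a
// freshly allocated IP could be released from under the pod.
func pickIP(pool []string, releaseIPs map[string]string) (string, bool) {
	for _, ip := range pool {
		if _, marked := releaseIPs[ip]; marked {
			continue
		}
		return ip, true
	}
	return "", false
}

func main() {
	pool := []string{"10.0.0.1", "10.0.0.2"}
	release := map[string]string{"10.0.0.1": "marked-for-release"}
	ip, ok := pickIP(pool, release)
	fmt.Println(ip, ok) // 10.0.0.2 true
}
```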
Force-pushed from d11f4ed to 618bfd5
Force-pushed from 618bfd5 to c76cc5a
/test
Job 'Cilium-PR-K8s-1.22-kernel-4.9' has 1 failure, but it might be a new flake since it also hit 1 known flake: #17919 (93.81)
There's one more corner case that needs addressing. After the handshake is complete and the operator is done releasing an IP, the CiliumNode status (release-ips) and spec (pool) are updated in two consecutive requests. There's a tiny window between the two updates where the entry is removed from .status.ipam.release-ips while the IP is still present in .spec.ipam.pool. @gandro / @aanm Can I flip the order of updates only when there are changes to the IP release status? (See the comments at lines 700 to 701 in eeb7f1b.)
Oh wow, good catch! In dynamic cluster pool, we avoid this by having the agent remove the entry from …
While I'm also not sure, I believe the order might have to do with the fact that the agent needs to know the ENI metadata for each IP present in spec. If the agent sees an IP in the pool before it knows about its ENI metadata from status (e.g. subnet CIDR and ENI MAC), this error will be hit (see line 528 in 124fa2b).
The two CI failures (besides checkpatch) look like unrelated flaky tests.
Thanks for forward porting! The changes LGTM overall. I am marking my review as approved, but there are suggestions that I'd like to see followed up on in a later PR, so that we can get this PR in. Feel free to address my comments as you see fit in this PR, however. After this is merged, we can create a tracking issue and collect all the comments that need following up there. Please use "resolve conversation" for comments that you've already addressed in this PR.
Note: for any suggestions related to commits, could you address those in this PR, as those can't be changed in a follow-up?
pkg/ipam/crd.go (outdated)
```go
releaseUpstreamSyncNeeded := false
// ACK or NACK IPs marked for release by the operator
for ip, status := range n.ownNode.Status.IPAM.ReleaseIPs {
	if status != ipamOption.IPAMMarkForRelease || n.ownNode.Spec.IPAM.Pool == nil {
		continue
	}
	// NACK the IP, if this node doesn't own the IP
	if _, ok := n.ownNode.Spec.IPAM.Pool[ip]; !ok {
		n.ownNode.Status.IPAM.ReleaseIPs[ip] = ipamOption.IPAMDoNotRelease
		continue
	}
	// Retrieve the appropriate allocator
	var allocator *crdAllocator
	var ipFamily Family
	if ipAddr := net.ParseIP(ip); ipAddr != nil {
		if ipAddr.To4() != nil {
			ipFamily = IPv4
		} else {
			ipFamily = IPv6
		}
	}
	if ipFamily == "" {
		continue
	}
	for _, a := range n.allocators {
		if a.family == ipFamily {
			allocator = a
		}
	}
	if allocator == nil {
		continue
	}

	allocator.mutex.Lock()
	if _, ok := allocator.allocated[ip]; ok {
		// IP still in use, update the operator to stop releasing the IP.
		n.ownNode.Status.IPAM.ReleaseIPs[ip] = ipamOption.IPAMDoNotRelease
	} else {
		n.ownNode.Status.IPAM.ReleaseIPs[ip] = ipamOption.IPAMReadyForRelease
	}
	allocator.mutex.Unlock()
	releaseUpstreamSyncNeeded = true
}
```
Would it make sense to move this code to an isolated function and have it return releaseUpstreamSyncNeeded?
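A refactor along the lines of this suggestion might look like the sketch below. The function name `ackOrNackReleaseIPs`, the simplified types, and the string states are assumptions for illustration; the real code operates on the CiliumNode CRD and `crdAllocator` shown in the hunk above.

```go
package main

import (
	"fmt"
	"net"
)

// Hypothetical stand-ins for the ipamOption release states.
const (
	markForRelease  = "marked-for-release"
	readyForRelease = "ready-for-release"
	doNotRelease    = "do-not-release"
)

// allocator is a simplified stand-in for crdAllocator.
type allocator struct {
	family    string // "ipv4" or "ipv6"
	allocated map[string]bool
}

// ackOrNackReleaseIPs is a hypothetical extraction of the loop from the
// review hunk: it ACKs or NACKs IPs marked for release and reports
// whether an upstream sync is needed.
func ackOrNackReleaseIPs(pool map[string]bool, releaseIPs map[string]string, allocators []*allocator) bool {
	syncNeeded := false
	for ip, status := range releaseIPs {
		if status != markForRelease || pool == nil {
			continue
		}
		// NACK the IP if this node doesn't own it.
		if !pool[ip] {
			releaseIPs[ip] = doNotRelease
			continue
		}
		// Pick the allocator for the IP's address family.
		family := "ipv6"
		if addr := net.ParseIP(ip); addr != nil && addr.To4() != nil {
			family = "ipv4"
		}
		var alloc *allocator
		for _, a := range allocators {
			if a.family == family {
				alloc = a
			}
		}
		if alloc == nil {
			continue
		}
		if alloc.allocated[ip] {
			// IP still in use: tell the operator to stop releasing it.
			releaseIPs[ip] = doNotRelease
		} else {
			releaseIPs[ip] = readyForRelease
		}
		syncNeeded = true
	}
	return syncNeeded
}

func main() {
	pool := map[string]bool{"10.0.0.1": true, "10.0.0.2": true}
	release := map[string]string{
		"10.0.0.1": markForRelease, // in use    -> NACK
		"10.0.0.2": markForRelease, // free      -> ACK
		"10.0.0.3": markForRelease, // not owned -> NACK
	}
	a := &allocator{family: "ipv4", allocated: map[string]bool{"10.0.0.1": true}}
	fmt.Println(ackOrNackReleaseIPs(pool, release, []*allocator{a}))
	fmt.Println(release["10.0.0.1"], release["10.0.0.2"], release["10.0.0.3"])
}
```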
```go
"available":         n.stats.AvailableIPs,
"used":              n.stats.UsedIPs,
"excess":            n.stats.ExcessIPs,
"excessIps":         a.release.IPsToRelease,
"releasing":         ipsToRelease,
"selectedInterface": a.release.InterfaceID,
"selectedPoolID":    a.release.PoolID,
```
Ideally, since these are part of an INFO-level log, they should be inside the logfields package. However, other parts of the code suffer from the same issue, so let's just follow up on this later; it doesn't have to be part of this PR.
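The suggestion amounts to replacing string literals with shared constants. A minimal sketch, assuming hypothetical constant names (in Cilium these would be declared once in the logfields package and reused at every call site):

```go
package main

import "fmt"

// Hypothetical log-field keys, mirroring the style of the logfields
// package; the values match the literals in the hunk above.
const (
	AvailableIPs = "available"
	UsedIPs      = "used"
	ExcessIPs    = "excess"
)

func main() {
	// Call sites then reference the shared constants instead of
	// repeating string literals.
	fields := map[string]int{AvailableIPs: 10, UsedIPs: 4, ExcessIPs: 2}
	fmt.Println(fields[ExcessIPs])
}
```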
Force-pushed from 5bd0924 to 77f286b
@christarazi Thanks for the review. Addressed most of the feedback except the conversations still unresolved. There's also one more refactor suggestion in the original PR DataDog#24 (comment) that we can add to the list for the future PR.
/test
Job 'Cilium-PR-K8s-1.23-kernel-4.9' failed and has not been observed before, so may be related to your PR.
Job 'Cilium-PR-K8s-1.22-kernel-4.19' failed and has not been observed before, so may be related to your PR.
@hemanthmalla the unit test failures are legitimate
Currently there's a potential 15 sec delay in updating the CiliumNode CRD after IP allocation to a pod; meanwhile the operator can determine that a node has excess IPs and release an IP, causing pod connectivity issues. A new operator flag `excess-ip-release-delay` is added to control how long the operator should wait before considering an IP for release (defaults to 180 secs). This is done to better handle IP reuse during rollouts. The operator and agent use a new map in the CiliumNode status, .status.ipam.release-ips, to exchange information during the handshake.
Fixes: cilium#13412
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
After the handshake is complete and the operator is done releasing an IP, the CiliumNode status (release-ips) and spec (pool) are updated in two consecutive requests. There's a tiny window between the two updates where the entry is removed from .status.ipam.release-ips but the IP is still present in .spec.ipam.pool. It was possible that the IP could be allocated between these requests. This commit introduces a new state called "released" to deal with this. Now the agent removes the entry from release-ips only when the IP has been removed from .spec.ipam.pool as well.
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
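The ordering rule this commit describes can be sketched as follows. The helper name `cleanupReleaseIPs` and the simplified maps are assumptions for illustration; the state name "released" is from the commit message.

```go
package main

import "fmt"

const released = "released"

// cleanupReleaseIPs is a hypothetical illustration of the rule introduced
// here: an entry in release-ips is deleted only once the IP is also gone
// from spec.ipam.pool, closing the window where the IP could be
// re-allocated between the two CRD updates.
func cleanupReleaseIPs(pool map[string]bool, releaseIPs map[string]string) {
	for ip, status := range releaseIPs {
		if status == released && !pool[ip] {
			delete(releaseIPs, ip)
		}
	}
}

func main() {
	releaseIPs := map[string]string{"10.0.0.5": released}

	// IP still present in spec.ipam.pool: keep the entry.
	cleanupReleaseIPs(map[string]bool{"10.0.0.5": true}, releaseIPs)
	fmt.Println(len(releaseIPs)) // 1

	// IP removed from the pool as well: the entry can go.
	cleanupReleaseIPs(map[string]bool{}, releaseIPs)
	fmt.Println(len(releaseIPs)) // 0
}
```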
Force-pushed from 77f286b to bcdb763
/test
/ci-gke
Edit: Previous failure looks like a variant of #17102 https://github.com/cilium/cilium/runs/4431710422?check_suite_focus=true
Thanks @hemanthmalla! Merging.
```go
for k := range n.enis {
	eniIds = append(eniIds, k)
}
sort.Strings(eniIds)
```
Does the same iteration order ensure that the same ENI is selected to release IPs from across multiple PrepareIPRelease calls? What is the use case? @hemanthmalla
Because we select the ENI with the most addresses available for release in every PrepareIPRelease call.
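The determinism point in this exchange can be illustrated with a minimal sketch (hypothetical data and a simplified selection rule, not the actual PrepareIPRelease logic): map iteration order is random in Go, so sorting the ENI IDs first makes tie-breaking stable across calls.

```go
package main

import (
	"fmt"
	"sort"
)

// selectENI picks the ENI with the most addresses available for release.
// Sorting the IDs first makes the scan order, and therefore tie-breaking,
// deterministic, since Go randomizes map iteration order.
func selectENI(enis map[string]int) string {
	ids := make([]string, 0, len(enis))
	for id := range enis {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	best, bestFree := "", -1
	for _, id := range ids {
		if enis[id] > bestFree {
			best, bestFree = id, enis[id]
		}
	}
	return best
}

func main() {
	// Hypothetical ENI -> addresses available for release; eni-a and
	// eni-b tie, and the sorted order always picks eni-a.
	fmt.Println(selectENI(map[string]int{"eni-b": 3, "eni-a": 3, "eni-c": 1}))
}
```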
Forward port of DataDog#24
Currently there's a 15 sec delay in updating the CiliumNode (CN) CRD after IP allocation to a pod; meanwhile the operator can determine that a node has excess IPs and release an IP, causing pod connectivity issues.
This PR introduces a handshake between the operator and agent before releasing an excess IP to avoid the race condition. A new operator flag excess-ip-release-delay is added to control how long the operator should wait before considering an IP for release (defaults to 180 secs). This is done to better handle IP reuse during rollouts. The operator and agent use a new map in the CiliumNode status, .status.ipam.release-ips, to exchange information during the handshake.
Fixes: #13412
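The handshake above can be sketched as a small state machine over the release-ips map. The state names follow the commits in this PR; the `agentAck` helper and the exact transitions are an assumption based on this description, not the actual implementation.

```go
package main

import "fmt"

// States exchanged through .status.ipam.release-ips, as described above.
const (
	MarkForRelease  = "marked-for-release" // operator: IP is excess
	ReadyForRelease = "ready-for-release"  // agent: IP is unused, go ahead
	DoNotRelease    = "do-not-release"     // agent: IP is in use, abort
	Released        = "released"           // operator: IP has been released
)

// agentAck models the agent's half of the handshake: it answers a
// marked-for-release entry based on whether the IP is still allocated,
// and leaves every other state untouched.
func agentAck(state string, inUse bool) string {
	if state != MarkForRelease {
		return state
	}
	if inUse {
		return DoNotRelease
	}
	return ReadyForRelease
}

func main() {
	fmt.Println(agentAck(MarkForRelease, true))  // do-not-release
	fmt.Println(agentAck(MarkForRelease, false)) // ready-for-release
}
```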
The following alternative options were considered:
- We could force a CN write to go through before allocation, but that can result in too many writes from the agent.
- The operator picks the ENI with the most free IPs to release excess IPs from; this is done so that the ENI could potentially be released in the future. We could move the excess detection logic to the agent, but this involves calling into cloud-provider-specific implementations, which AFAIU the agent IPAM implementation does not do.
Note: DataDog#24 already received some valuable feedback from @aanm, @gandro, and @christarazi.