Track max pods, simplify warm IP pool management #2745

Merged
merged 1 commit into aws:master from max_pods on Jan 9, 2024

Conversation

Contributor

@jdn5126 jdn5126 commented Jan 5, 2024

What type of PR is this?
enhancement

Which issue does this PR fix:
N/A

What does this PR do / Why do we need it:
I set out trying to do two things in this PR:

  1. Handle the low-hanging fruit of not allocating more IPs to the node than max-pods, so as not to waste VPC IPs.
  2. Clean up and simplify the warm IP pool management logic

On point 1, I partially succeeded, as we now do not increase the datastore pool when more than max-pods IPs are allocated to the node. This check cannot be exact since IPs can be allocated in chunks (prefix mode or "full-ENI" mode). Note that this check only needs to be enforced on the allocation side, not the deallocation side.
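To make the shape of that check concrete, here is a minimal sketch (not the PR's actual code; the helper name and struct are illustrative, though the stats.TotalIPs >= maxPods comparison matches what was discussed in review):

```go
// Illustrative sketch only: skip growing the IP pool once the node already
// holds at least maxPods worth of IPs. poolStats stands in for the fields of
// datastore.DataStoreStats that matter here.
package main

import "fmt"

type poolStats struct {
	TotalIPs    int // IPs currently attached to the node
	AssignedIPs int // IPs currently assigned to pods
}

// shouldSkipPoolIncrease is a hypothetical helper showing the guard: when
// TotalIPs >= maxPods, the warm-target logic is short-circuited and the pool
// is not grown. The check is deliberately approximate, since IPs arrive in
// chunks (a /28 prefix or a full ENI) and TotalIPs may overshoot maxPods.
func shouldSkipPoolIncrease(stats poolStats, maxPods int) bool {
	return stats.TotalIPs >= maxPods
}

func main() {
	stats := poolStats{TotalIPs: 32, AssignedIPs: 20}
	fmt.Println(shouldSkipPoolIncrease(stats, 29)) // true: do not allocate more IPs
}
```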

On point 2, I had very limited success, as the warm IP pool management logic is a behemoth. I made some simplifications, primarily around limiting how many times we calculate DataStoreStats. Calls to GetIPStats acquire a datastore lock that can slow down pod IP acquisition. The conclusion I came to is that the warm pool logic has to be significantly refactored and broken up.
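One of the simplifications can be sketched like this: fetch DataStoreStats once per reconcile pass and hand the result to every check that needs it, rather than having each check call GetIPStats and take the datastore lock again. Types, thresholds, and function names below are illustrative placeholders, not the CNI's actual warm-target rules:

```go
// Sketch of the "compute stats once, reuse everywhere" direction. The real
// GetIPStats acquires a datastore lock; this stub only mimics the call shape.
package main

import "fmt"

type dataStoreStats struct {
	TotalIPs    int
	AssignedIPs int
}

type dataStore struct {
	stats dataStoreStats
}

// GetIPStats stands in for the lock-acquiring call we want to issue only once.
func (d *dataStore) GetIPStats() dataStoreStats {
	return d.stats
}

// Both checks receive already-fetched stats instead of re-querying the
// datastore. The thresholds are illustrative placeholders.
func poolTooLow(stats dataStoreStats, warmTarget int) bool {
	return stats.TotalIPs-stats.AssignedIPs < warmTarget
}

func poolTooHigh(stats dataStoreStats, warmTarget int) bool {
	return stats.TotalIPs-stats.AssignedIPs > 2*warmTarget
}

func main() {
	ds := &dataStore{stats: dataStoreStats{TotalIPs: 10, AssignedIPs: 7}}
	stats := ds.GetIPStats() // one lock acquisition per pass
	fmt.Println(poolTooLow(stats, 5), poolTooHigh(stats, 5))
}
```

The same idea shows up in the new isDatastorePoolTooLow signature below, which returns the stats it computed so callers can reuse them.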

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
N/A

Testing done on this change:
Manually verified that all integration tests pass after this change. Also verified that when max pods is set to a low value, excess IPs are not allocated.

Will this PR introduce any new dependencies?:
No

Will this break upgrades or downgrades? Has updating a running cluster been tested?:
No, Yes

Does this change require updates to the CNI daemonset config files to work?:
No

Does this PR introduce any user-facing change?:
Yes

Track max pods, simplify warm IP pool management

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jdn5126 jdn5126 requested a review from a team as a code owner January 5, 2024 20:34
@jdn5126 jdn5126 force-pushed the max_pods branch 3 times, most recently from dc58d22 to ca26b91 on January 5, 2024 at 22:43
```go
	return short > 0
}

func (c *IPAMContext) isDatastorePoolTooLow() (bool, *datastore.DataStoreStats) {
	stats := c.dataStore.GetIPStats(ipV4AddrFamily)
	// If max pods has been reached, pool is not too low
```
Contributor

If max pods has been reached, pool is not too low

I think this comment is a bit confusing, as max pods is mentioned without context. I got confused while reading this.

If we write it as if c.maxPods <= stats.TotalIPs, then it seems alright.

Here we are saying that the node's maxPods is less than or equal to the total IPs available in the datastore, and so isDatastorePoolTooLow is false.

Contributor Author

Yeah, what this function checks is a bit confusing. isDatastorePoolTooLow looks at the number of currently available IPs and compares that number to the warm/minimum targets. If there are not enough available IPs, we return true.

The maxPods check here is short-circuiting that logic by saying, "forget about warm targets, do not allocate any more IPs if we have already allocated more than maxPods worth."

Contributor

I understood it better when thinking of it like this:

c.maxPods <= stats.TotalIPs

Because we are sending stats as the argument, I think you went with the validation as stats.TotalIPs >= c.maxPods, explaining that we have more than maxPods' worth.

Sounds good to me.

Contributor

@orsenthil orsenthil left a comment

LGTM.

@jchen6585 jchen6585 merged commit 1de6c1d into aws:master Jan 9, 2024
6 checks passed
```diff
@@ -499,17 +500,26 @@ func (c *IPAMContext) nodeInit() error {
 		}, 30*time.Second)
 	}
 
+	// Make a k8s client request for the current node so that max pods can be derived
+	node, err := k8sapi.GetNode(ctx, c.k8sClient)
```
Contributor

We should ideally remove the need for k8sClient usage unless a feature like custom networking needs it. We could enhance CNINode to carry this information. /cc @haouc

Contributor Author

Yeah, I thought about that, but we already have a non-optional k8s client call for the node when we attach an ENI, to check whether the node is currently marked as schedulable. It would be nice to have this node state available in CNINode, but that would still require a k8s client call to retrieve.
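For context, deriving max pods only needs the kubelet-reported pods capacity on the Node object. A rough sketch, assuming a controller-runtime client (the helper name and wiring are illustrative, not the actual pkg/k8sapi code):

```go
// Illustrative sketch: read the "pods" capacity that kubelet reports on the
// Node object. The real helper in this repo may differ.
package k8sutil

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getMaxPods is a hypothetical helper name.
func getMaxPods(ctx context.Context, c client.Client, nodeName string) (int, error) {
	var node corev1.Node
	if err := c.Get(ctx, types.NamespacedName{Name: nodeName}, &node); err != nil {
		return 0, err
	}
	// Capacity["pods"] is populated by kubelet (e.g. from --max-pods).
	pods := node.Status.Capacity[corev1.ResourcePods]
	return int(pods.Value()), nil
}
```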

Contributor

@jayanthvn jayanthvn Jan 11, 2024

Sure. With secondary IPs we would never go over max pods, since max pods takes secondary IPs into account, and once max pods is reached, new IPs pretty much cannot be assigned because the ENIs will be full. This would be beneficial for prefix delegation, but it is still tricky since allocation happens in chunks of 16.
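To illustrate why the check stays approximate under prefix delegation (the numbers here are illustrative):

```go
// Illustrative arithmetic: with /28 prefixes, IPs arrive 16 at a time, so the
// node can hold more IPs than maxPods even with the new guard in place.
package main

import "fmt"

func main() {
	const prefixSize = 16 // IPs per /28 prefix in prefix delegation mode
	maxPods := 110        // example kubelet max-pods value

	prefixesNeeded := (maxPods + prefixSize - 1) / prefixSize // ceil(110/16) = 7
	totalIPs := prefixesNeeded * prefixSize                   // 112 IPs attached

	fmt.Printf("prefixes=%d totalIPs=%d overshoot=%d\n",
		prefixesNeeded, totalIPs, totalIPs-maxPods)
}
```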

Contributor Author

Right, that's why we don't enforce a hard check on increase (or decrease). We simply do not increase, even when the warm targets tell us to, once max pods has been reached.

Contributor

Also, in PD mode: let's say we have WARM_PREFIX set to 1 and the node is already at max pods. With this change, we will not be honoring the WARM_PREFIX in this state, as the node is already at capacity w.r.t. maxPods. If the user now churns the pods on this node (say, a rollout restart of all deployments), IPs assigned to recently deleted pods can't be reused for the next 30s, so the new pods will not have any IPs and we would need to pull in a new prefix anyway? The current workflow would have assigned an additional prefix, the new pods would have consumed IPs from the additional prefix, and we would have released the old prefix after the cooldown period. Did I understand the new flow right?

Contributor Author

Your understanding is correct. Over-provisioning could help in a case where the user restarts all pods at once. I do not think that is a case that we should design for, though. The downsides of over-provisioning (cost and resource utilization) seem much higher than slower recovery after restarting all pods, which should be an extremely rare event.
