Track max pods, simplify warm IP pool management #2745

Merged
merged 1 commit into aws:master from max_pods on Jan 9, 2024

Conversation

Contributor

@jdn5126 jdn5126 commented Jan 5, 2024

What type of PR is this?
enhancement

Which issue does this PR fix:
N/A

What does this PR do / Why do we need it:
I set out trying to do two things in this PR:

  1. Handle the low-hanging fruit of not allocating more IPs to the node than max-pods, so as not to waste VPC IPs.
  2. Clean up and simplify the warm IP pool management logic

On point 1, I partially succeeded, as we now do not increase the datastore pool when more than max-pods IPs are allocated to the node. This check cannot be exact since IPs can be allocated in chunks (prefix mode or "full-ENI" mode). Note that this check only needs to be enforced on the allocation side, not the deallocation side.
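To make the shape of that check concrete, here is a minimal sketch (not the PR's actual code; the helper name and struct are illustrative, though the stats.TotalIPs >= maxPods comparison matches what was discussed in review):

```go
// Illustrative sketch only: skip growing the IP pool once the node already
// holds at least maxPods worth of IPs. poolStats stands in for the fields of
// datastore.DataStoreStats that matter here.
package main

import "fmt"

type poolStats struct {
	TotalIPs    int // IPs currently attached to the node
	AssignedIPs int // IPs currently assigned to pods
}

// shouldSkipPoolIncrease is a hypothetical helper showing the guard: when
// TotalIPs >= maxPods, the warm-target logic is short-circuited and the pool
// is not grown. The check is deliberately approximate, since IPs arrive in
// chunks (a /28 prefix or a full ENI) and TotalIPs may overshoot maxPods.
func shouldSkipPoolIncrease(stats poolStats, maxPods int) bool {
	return stats.TotalIPs >= maxPods
}

func main() {
	stats := poolStats{TotalIPs: 32, AssignedIPs: 20}
	fmt.Println(shouldSkipPoolIncrease(stats, 29)) // true: do not allocate more IPs
}
```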

On point 2, I had very limited success, as the warm IP pool management logic is a behemoth. I made some simplifications, primarily around limiting how many times we calculate DataStoreStats. Calls to GetIPStats acquire a datastore lock that can slow down pod IP acquisition. The conclusion I came to is that the warm pool logic has to be significantly refactored and broken up.
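One of the simplifications can be sketched like this: fetch DataStoreStats once per reconcile pass and hand the result to every check that needs it, rather than having each check call GetIPStats and take the datastore lock again. Types, thresholds, and function names below are illustrative placeholders, not the CNI's actual warm-target rules:

```go
// Sketch of the "compute stats once, reuse everywhere" direction. The real
// GetIPStats acquires a datastore lock; this stub only mimics the call shape.
package main

import "fmt"

type dataStoreStats struct {
	TotalIPs    int
	AssignedIPs int
}

type dataStore struct {
	stats dataStoreStats
}

// GetIPStats stands in for the lock-acquiring call we want to issue only once.
func (d *dataStore) GetIPStats() dataStoreStats {
	return d.stats
}

// Both checks receive already-fetched stats instead of re-querying the
// datastore. The thresholds are illustrative placeholders.
func poolTooLow(stats dataStoreStats, warmTarget int) bool {
	return stats.TotalIPs-stats.AssignedIPs < warmTarget
}

func poolTooHigh(stats dataStoreStats, warmTarget int) bool {
	return stats.TotalIPs-stats.AssignedIPs > 2*warmTarget
}

func main() {
	ds := &dataStore{stats: dataStoreStats{TotalIPs: 10, AssignedIPs: 7}}
	stats := ds.GetIPStats() // one lock acquisition per pass
	fmt.Println(poolTooLow(stats, 5), poolTooHigh(stats, 5))
}
```

The same idea shows up in the new isDatastorePoolTooLow signature below, which returns the stats it computed so callers can reuse them.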

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
N/A

Testing done on this change:
Manually verified that all integration tests pass after this change. Also verified that when max pods is set to a low value, excess IPs are not allocated.

Will this PR introduce any new dependencies?:
No

Will this break upgrades or downgrades? Has updating a running cluster been tested?:
No, Yes

Does this change require updates to the CNI daemonset config files to work?:
No

Does this PR introduce any user-facing change?:
Yes

Track max pods, simplify warm IP pool management

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jdn5126 jdn5126 requested a review from a team as a code owner January 5, 2024 20:34
@jdn5126 jdn5126 force-pushed the max_pods branch 3 times, most recently from dc58d22 to ca26b91 on January 5, 2024 at 22:43
```go
	return short > 0
}

func (c *IPAMContext) isDatastorePoolTooLow() (bool, *datastore.DataStoreStats) {
	stats := c.dataStore.GetIPStats(ipV4AddrFamily)
	// If max pods has been reached, pool is not too low
```
Contributor

If max pods has been reached, pool is not too low

I think this comment is a bit confusing, as max pods is mentioned without context. I got confused while reading this.

If we write it as if c.maxPods <= stats.TotalIPs, then it seems alright.

Here we are saying that the node's maxPods is less than or equal to the total IPs available in the datastore, and so isDatastorePoolTooLow is false.

Contributor Author

Yeah, what this function checks is a bit confusing. isDatastorePoolTooLow looks at the number of currently available IPs and compares that number to the warm/minimum targets. If there are not enough available IPs, we return true.

The maxPods check here is short-circuiting that logic by saying, "forget about warm targets, do not allocate any more IPs if we have already allocated more than maxPods worth."

Contributor

I understood it better when thinking of it like this:

c.maxPods <= stats.TotalIPs

Because we are sending stats as the argument, I think you went with the validation as stats.TotalIPs >= c.maxPods, explaining that we have more than maxPods' worth.

Sounds good to me.

Contributor

@orsenthil orsenthil left a comment

LGTM.

@jchen6585 jchen6585 merged commit 1de6c1d into aws:master Jan 9, 2024
6 checks passed
```diff
@@ -499,17 +500,26 @@ func (c *IPAMContext) nodeInit() error {
 		}, 30*time.Second)
 	}
 
+	// Make a k8s client request for the current node so that max pods can be derived
+	node, err := k8sapi.GetNode(ctx, c.k8sClient)
```
Contributor

We should ideally remove the need for k8sClient usage unless a feature like custom networking needs it. We could enhance CNINode to carry this information. /cc @haouc

Contributor Author

Yeah, I thought about that, but we already have a non-optional k8s client call for the node when we attach an ENI, to check whether the node is currently marked as schedulable. It would be nice to have this node state available in CNINode, but that would still require a k8s client call to retrieve.
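For context, deriving max pods only needs the kubelet-reported pods capacity on the Node object. A rough sketch, assuming a controller-runtime client (the helper name and wiring are illustrative, not the actual pkg/k8sapi code):

```go
// Illustrative sketch: read the "pods" capacity that kubelet reports on the
// Node object. The real helper in this repo may differ.
package k8sutil

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getMaxPods is a hypothetical helper name.
func getMaxPods(ctx context.Context, c client.Client, nodeName string) (int, error) {
	var node corev1.Node
	if err := c.Get(ctx, types.NamespacedName{Name: nodeName}, &node); err != nil {
		return 0, err
	}
	// Capacity["pods"] is populated by kubelet (e.g. from --max-pods).
	pods := node.Status.Capacity[corev1.ResourcePods]
	return int(pods.Value()), nil
}
```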

Contributor

@jayanthvn jayanthvn Jan 11, 2024

Sure. With secondary IPs we would never go over max pods, since max pods takes secondary IPs into account, and once max pods is reached, new IPs pretty much cannot be assigned because the ENIs will be full. This would be beneficial for prefix delegation, but it is still tricky since allocation happens in chunks of 16.
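To illustrate why the check stays approximate under prefix delegation (the numbers here are illustrative):

```go
// Illustrative arithmetic: with /28 prefixes, IPs arrive 16 at a time, so the
// node can hold more IPs than maxPods even with the new guard in place.
package main

import "fmt"

func main() {
	const prefixSize = 16 // IPs per /28 prefix in prefix delegation mode
	maxPods := 110        // example kubelet max-pods value

	prefixesNeeded := (maxPods + prefixSize - 1) / prefixSize // ceil(110/16) = 7
	totalIPs := prefixesNeeded * prefixSize                   // 112 IPs attached

	fmt.Printf("prefixes=%d totalIPs=%d overshoot=%d\n",
		prefixesNeeded, totalIPs, totalIPs-maxPods)
}
```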

Contributor Author

Right, that's why we don't enforce a hard check on increase (or decrease). We simply do not increase, even when the warm targets tell us to, once max pods has been reached.

Contributor

Also, in PD mode: let's say we have WARM_PREFIX set to 1 and the node is already at max pods. With this change, we will not be honoring the WARM_PREFIX in this state, as the node is already at capacity w.r.t. maxPods. If the user now churns the pods on this node (say, a rollout restart of all deployments), IPs assigned to recently deleted pods can't be reused for the next 30s, so the new pods will not have any IPs and we would need to pull in a new prefix anyway? The current workflow would have assigned an additional prefix, the new pods would have consumed IPs from the additional prefix, and we would have released the old prefix after the cooldown period. Did I understand the new flow right?

Contributor Author

Your understanding is correct. Over-provisioning could help in a case where the user restarts all pods at once. I do not think that is a case that we should design for, though. The downsides of over-provisioning (cost and resource utilization) seem much higher than slower recovery after restarting all pods, which should be an extremely rare event.
