multi-pool: Support allocating from new IPAM pools on demand #25765
Conversation
/test
Code looks good to me, one tiny nit :)
I'm not the best person to ask about the functionality though, so it would be good if a better expert has time to also review 😄
option.BindEnv(Vp, option.IPAMMultiPoolNodePreAlloc)
flags.Var(option.NewNamedMapOptions(option.IPAMMultiPoolPreAllocation, &option.Config.IPAMMultiPoolPreAllocation, nil),
	option.IPAMMultiPoolPreAllocation,
	fmt.Sprintf("Defines how the minimum number of IPs a node should pre-allocate from each pool (default %s)", defaults.IPAMMultiPoolPreAllocation))
Given this is a string, might `%q` not be clearer to the user? `(default default=8)` vs `(default "default=8")`
Thanks! Yes, I noticed this too, and I am torn on this. It's not really a string, it's a map. But `option.NewNamedMapOptions` does not allow me to define a default map, which is why I'm setting a default "string" value (which is later parsed as a map).
I've gone back and forth between `%s` and `%q`. I ended up with `%s`, because other non-string types (e.g. integers, durations, and booleans) don't use quotes either:
--allocator-list-timeout duration Timeout for listing allocator state before exiting (default 3m0s)
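For illustration, a minimal standalone Go sketch of the difference the two verbs make in the rendered usage text (hypothetical snippet, not the actual flag registration):

package main

import "fmt"

func main() {
	def := "default=8" // default value rendered into the usage text

	// %s prints the value bare, matching how non-string defaults
	// (durations, integers, booleans) are rendered.
	fmt.Printf("pre-allocate from each pool (default %s)\n", def)
	// -> pre-allocate from each pool (default default=8)

	// %q wraps the value in Go quotes, signalling it is a literal string.
	fmt.Printf("pre-allocate from each pool (default %q)\n", def)
	// -> pre-allocate from each pool (default "default=8")
}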
Another note here is that you don't need to explicitly mention the default here because `flags` does it for you, AFAIR.
Unfortunately, for some reason it doesn't here. See the generated docs, there is no duplicate `default`.
One thought occurs to me: is the owner a pod or the CNI container ID? If it's a container ID, then that is unique for every CNI ADD (even retries), so we will have a churn issue. If it's a pod, then we're OK.
Good point. It's the pod: cilium/plugins/cilium-cni/main.go, line 121 (commit 2588462).
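For illustration only, a minimal sketch (hypothetical owner strings, not Cilium's actual code) of why the choice of owner key matters for churn, assuming pending allocations are deduplicated by owner:

package main

import "fmt"

func main() {
	pending := map[string]struct{}{}

	// Owner keyed by pod: retries of the same CNI ADD upsert the same key,
	// so the pending-allocation count (and thus the requested IP demand)
	// stays stable.
	for i := 0; i < 3; i++ {
		pending["default/nginx-7d9c"] = struct{}{}
	}
	fmt.Println(len(pending)) // 1

	// If the owner were the CNI container ID, every retry would introduce a
	// new key and inflate the demand reported to the operator.
	for i := 0; i < 3; i++ {
		pending[fmt.Sprintf("container-%d", i)] = struct{}{}
	}
	fmt.Println(len(pending)) // 4
}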
This commit contains no functional changes. It changes the function signature of a few internal functions to use the `ipam.Pool` type instead of a string. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Before this commit, the agent would sort the list of requested and allocated pools in reverse alphabetical order. This was not intended, as the operator sorts the list of allocated pools in alphabetical order. This commit fixes that by actually using the "less" operator for the "less" comparison function in `sort.Slice`. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
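A minimal sketch (illustrative pool names, not the actual Cilium code) of the bug class this fixes: passing a "greater than" comparison to `sort.Slice` yields reverse alphabetical order, while the intended "less than" comparison yields the ascending order the operator uses:

package main

import (
	"fmt"
	"sort"
)

func main() {
	pools := []string{"mars", "default", "jupiter"}

	// Buggy: using ">" in the "less" function sorts in reverse order.
	sort.Slice(pools, func(i, j int) bool { return pools[i] > pools[j] })
	fmt.Println(pools) // [mars jupiter default]

	// Fixed: the "less" function actually uses "<", matching the
	// alphabetical order the operator uses for allocated pools.
	sort.Slice(pools, func(i, j int) bool { return pools[i] < pools[j] })
	fmt.Println(pools) // [default jupiter mars]
}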
This commit fixes the sort order of IPv4 vs IPv6 CIDRs in the multipool IPAM unit tests. This change is purely cosmetic and has no impact on test correctness, but the new order reflects the sort order also used by the operator when it populates that field. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit renames and exposes the flag to configure how many IPs a node should pre-allocate from each pool. The flag was previously hidden due to uncertainty about how IPAM pools should be configured. But since then, it has become clear that it does make sense for certain workloads to pre-allocate, while for others, dynamic allocation makes more sense. Therefore, this commit exposes a way to configure Cilium to always pre-allocate IPs for pools listed in this option. Pools not listed in this option will only request as many IPs as are actually used, which will be added in a subsequent commit. The default value of `default=8` means that (unless specified otherwise), we will always require at least 8 spare IPs in the default pool. This matches the default behavior of cloud-based IPAM modes, cf. `defaults.IPAMPreAllocation`, which is also set to 8. This commit also slightly cleans up the usage of the flag in `multipool.go` for upcoming commits. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit adds a new `pendingAllocationsPerPool` object, which allows the caller to track which IP owners have pending (i.e. requested but not yet fulfilled) allocations. This will be used in the next commit to compute the number of IPs the agent should request from the operator. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
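A rough, simplified sketch of what such a structure could look like, using hypothetical type and method names modelled on this description (the actual implementation lives in `pkg/ipam/multipool.go` and differs in its details):

package main

import "fmt"

// Pool and Family stand in for the ipam.Pool and ipam.Family types.
type Pool string
type Family string

// pendingAllocationsPerPool tracks owners with requested but not yet
// fulfilled allocations, per pool.
type pendingAllocationsPerPool struct {
	pending map[Pool]map[string]Family
}

func newPendingAllocationsPerPool() *pendingAllocationsPerPool {
	return &pendingAllocationsPerPool{pending: map[Pool]map[string]Family{}}
}

// upsertPendingAllocation records that owner is still waiting for an IP from
// pool. Upserting the same owner twice does not increase demand.
func (p *pendingAllocationsPerPool) upsertPendingAllocation(pool Pool, owner string, family Family) {
	if p.pending[pool] == nil {
		p.pending[pool] = map[string]Family{}
	}
	p.pending[pool][owner] = family
}

// markAsAllocated drops the pending entry once the allocation succeeded.
func (p *pendingAllocationsPerPool) markAsAllocated(pool Pool, owner string, family Family) {
	delete(p.pending[pool], owner)
}

// pendingForPool reports how many allocations are still outstanding for a
// pool; this count feeds into the IP demand requested from the operator.
func (p *pendingAllocationsPerPool) pendingForPool(pool Pool) int {
	return len(p.pending[pool])
}

func main() {
	p := newPendingAllocationsPerPool()
	p.upsertPendingAllocation("mars", "default/nginx", "ipv4")
	fmt.Println(p.pendingForPool("mars")) // 1
	p.markAsAllocated("mars", "default/nginx", "ipv4")
	fmt.Println(p.pendingForPool("mars")) // 0
}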
This commit rewrites the way we compute the number of requested IPs for each pool. Previously, we required the user to specify in advance what pool the agent would allocate from. This, however, is very limiting, since it's often not known in advance what workload a particular node will run. Therefore, this commit adds a mechanism where, upon observing an allocation attempt for a new pool, we update the local CiliumNode CRD to request additional IPs for that pool. This then allows the allocation request to be successful upon retry. This retry mechanism is similar to how Cilium IPAM also works e.g. in ENI mode (where if the local node is out of IPs, we fail the CNI ADD request and expect kubelet to retry later). This PR also implicitly adds support for allocating from pools without a "pre-allocate" value. This allows users to configure very small pools for specific workloads and assign those few IPs only if they are actually used. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
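As a worked illustration of the sizing formula quoted in the diff further down (a standalone sketch; `neededIPs` here is a simplified stand-in, assuming "round up" means rounding to the next multiple of the pre-alloc value):

package main

import "fmt"

// neededIPs rounds (inUse + pending + preAlloc) up to the next multiple of
// preAlloc, so the node always keeps a buffer of spare IPs.
func neededIPs(inUse, pending, preAlloc int) int {
	if preAlloc == 0 {
		// Pools configured without a pre-alloc value only request what is
		// actually in use or currently pending.
		return inUse + pending
	}
	need := inUse + pending + preAlloc
	return ((need + preAlloc - 1) / preAlloc) * preAlloc
}

func main() {
	// Example: 5 IPs in use, 1 pending CNI ADD, pre-alloc of 8 for "default":
	// 5 + 1 + 8 = 14, rounded up to the next multiple of 8 -> 16.
	fmt.Println(neededIPs(5, 1, 8)) // 16

	// A small pool without pre-allocation only asks for what it uses.
	fmt.Println(neededIPs(2, 0, 0)) // 2
}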
Force-pushed from 9533c25 to 6bff6cc
Addressed the feedback and rebased. No functional changes:
diff --git a/pkg/ipam/multipool.go b/pkg/ipam/multipool.go
index 18439b361e86..9b443dd9a2be 100644
--- a/pkg/ipam/multipool.go
+++ b/pkg/ipam/multipool.go
@@ -352,11 +352,11 @@ func neededIPCeil(numIP int, preAlloc int) int {
// computeNeededIPsPerPoolLocked computes how many IPs we want to request from
// the operator for each pool. The formula we use for each pool is basically
//
-// neededIPs = roundUp(inUseIPs + pendingIPs + preAllocIPs, preAlloc)
+// neededIPs = roundUp(inUseIPs + pendingIPs + preAllocIPs, preAllocIPs)
//
-// inUseIPs Number of IPs that are currently actively in use
-// pendingIPs Number of IPs that have been requested, but not yet assigned
-// preAllocIPs Minimum number of IPs that we want to pre-allocate as a buffer
+// inUseIPs Number of IPs that are currently actively in use
+// pendingIPs Number of IPs that have been requested, but not yet assigned
+// preAllocIPs Minimum number of IPs that we want to pre-allocate as a buffer
//
// Rounded up to the next multiple of preAllocIPs.
func (m *multiPoolManager) computeNeededIPsPerPoolLocked() map[Pool]types.IPAMPoolDemand {
@@ -596,17 +596,11 @@ func (m *multiPoolManager) poolByFamilyLocked(poolName Pool, family Family) *pod
return nil
}
-func (m *multiPoolManager) allocateNext(owner string, poolName Pool, family Family, syncUpstream bool) (result *AllocationResult, err error) {
+func (m *multiPoolManager) allocateNext(owner string, poolName Pool, family Family, syncUpstream bool) (*AllocationResult, error) {
m.mutex.Lock()
defer m.mutex.Unlock()
defer func() {
- if err != nil {
- m.pendingIPsPerPool.upsertPendingAllocation(poolName, owner, family)
- } else {
- m.pendingIPsPerPool.markAsAllocated(poolName, owner, family)
- }
-
if syncUpstream {
m.k8sUpdater.TriggerWithReason("allocation of next IP")
}
@@ -614,27 +608,25 @@ func (m *multiPoolManager) allocateNext(owner string, poolName Pool, family Fami
pool := m.poolByFamilyLocked(poolName, family)
if pool == nil {
+ m.pendingIPsPerPool.upsertPendingAllocation(poolName, owner, family)
return nil, fmt.Errorf("unable to allocate from pool %q (family %s): pool not (yet) available", poolName, family)
}
ip, err := pool.allocateNext()
if err != nil {
+ m.pendingIPsPerPool.upsertPendingAllocation(poolName, owner, family)
return nil, err
}
+
+ m.pendingIPsPerPool.markAsAllocated(poolName, owner, family)
return &AllocationResult{IP: ip, IPPoolName: poolName}, nil
}
-func (m *multiPoolManager) allocateIP(ip net.IP, owner string, poolName Pool, family Family, syncUpstream bool) (result *AllocationResult, err error) {
+func (m *multiPoolManager) allocateIP(ip net.IP, owner string, poolName Pool, family Family, syncUpstream bool) (*AllocationResult, error) {
m.mutex.Lock()
defer m.mutex.Unlock()
defer func() {
- if err != nil {
- m.pendingIPsPerPool.upsertPendingAllocation(poolName, owner, family)
- } else {
- m.pendingIPsPerPool.markAsAllocated(poolName, owner, family)
- }
-
if syncUpstream {
m.k8sUpdater.TriggerWithReason("allocation of specific IP")
}
@@ -642,13 +634,17 @@ func (m *multiPoolManager) allocateIP(ip net.IP, owner string, poolName Pool, fa
pool := m.poolByFamilyLocked(poolName, family)
if pool == nil {
+ m.pendingIPsPerPool.upsertPendingAllocation(poolName, owner, family)
return nil, fmt.Errorf("unable to reserve IP %s from pool %q (family %s): pool not (yet) available", ip, poolName, family)
}
- err = pool.allocate(ip)
+ err := pool.allocate(ip)
if err != nil {
+ m.pendingIPsPerPool.upsertPendingAllocation(poolName, owner, family)
return nil, err
}
+
+ m.pendingIPsPerPool.markAsAllocated(poolName, owner, family)
return &AllocationResult{IP: ip, IPPoolName: poolName}, nil
}
/test
This PR adds support for dynamically allocating IPs from a new pool. Previously, we required the user to specify in advance what pool the agent would allocate from. This, however, is very limiting, since it's often not known in advance what workload a particular node will run.
Therefore, this PR adds a mechanism where, upon observing an allocation attempt for a new pool, we update the local CiliumNode CRD to request additional IPs for that pool. This then allows the allocation request to be successful upon retry. This retry mechanism is similar to how Cilium IPAM also works e.g. in ENI mode (where if the local node is out of IPs, we fail the CNI ADD request and expect kubelet to retry later).
This PR also implicitly adds support for allocating from pools without a "pre-allocate" value. This allows users to configure very small pools for specific workloads and assign those few IPs only if they are actually used.
ℹ️ Review per commit. The first few commits are mainly cleanup; the main logic is contained in the last two commits.
For further context, see:
ipam.cilium.io/ip-pool annotation #25511
CI and documentation will be added in a separate PR. See meta issue #25470