Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v1.9] Backport of #17492 #17493

Merged

Conversation

christarazi
Copy link
Member

Once this PR is merged, you can update the PR labels via:

$ for pr in 17492; do contrib/backporting/set-labels.py $pr done 1.9; done

@christarazi christarazi added backport/1.9 kind/backports This PR provides functionality previously merged into master. labels Sep 29, 2021
@christarazi
Copy link
Member Author

/test-backport-1.9

@christarazi christarazi force-pushed the pr/christarazi/v1.9-backport-17492 branch from 006ecef to d1268ca Compare September 29, 2021 23:20
@christarazi
Copy link
Member Author

/test-backport-1.9

[ upstream commit c88e06f ]

This will be used in the subsequent commit to check for a specific error
when allocating an IP.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit baa9e9a ]

In CRD-backed IPAM modes such as ENI mode, some IPs were recently
removed from the IPAM pool [1]. In user environments where Cilium was
running a version prior to [1], it is possible for endpoints to be
assigned "unavailable" IPs. Upon upgrade (to a version which includes
[1]), endpoint restoration will fail with [2].

In order to workaround this failure and not disrupt the upgrade, this
change introduces a hidden flag
(`--bypass-ip-availability-upon-restore`) which will inform Cilium to
continue on if the restored endpoint's IP is not available for
reallocation, bypassing the specific error "IP is not available". Other
errors will not be bypassed, in order to reduce the scope of this
stop-gap solution.

With the flag set, restored endpoints which had "unavailable" IPs will
keep them. Any new endpoints / pods will be assigned fresh, valid IPs
from the pool.

This flag is only meant to be enabled with CRD-backed IPAM modes such as
ENI mode. The reason is because of the change described in [1], where
the primary ENI IP was removed from the IPAM pool. In any other mode
that this flag is enabled in, the user is warned that the flag is not
intended for other modes and will have no effect.

This patch is intended to be reverted in the future, as this stop-gap
solution will no longer be required, as users of Cilium don't upgrade
from versions prior to [1]. I propose that we revert this in the
following release that this patch makes it in (N+1).

How was this tested?

1) Deployed a Cilium version that doesn't include [1] on EKS cluster
2) Created a Deployment object which I scaled to max out the ENI IPs,
   such that at least one pod is assigned an "unavailable" IP
3) Upgraded Cilium to a version which does include [1] and observe [2]
   failures
4) Reset cluster back to state from (2)
5) Upgrade Cilium to the version that contains this commit
6) Observe log msgs from this commit and endpoint restoration succeeding
7) Scale Deployment to 0 and back up, to restart all pods
8) Observe that they all get fresh IPs and none of the "unavailable"
   IPs

[1]: cilium#15453
[2]:

```
{
"time":"2021-09-20T16:57:00.400086481Z",
"level":"WARN",
"origin":"cilium.io/agent",
"message":"Unable to restore endpoint, ignoring",
"params":{
    "endpointID":"992",
    "error":"Failed to re-allocate IP of endpoint: unable to reallocate 10.0.133.193 IPv4 address: IP 10.0.133.193 is not available",
    "k8sPodName":"default/pod-1",
    "subsys":"daemon"
  }
}
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi christarazi force-pushed the pr/christarazi/v1.9-backport-17492 branch from d1268ca to 020cc98 Compare September 30, 2021 21:35
@christarazi christarazi marked this pull request as ready for review September 30, 2021 21:36
@christarazi christarazi requested a review from a team as a code owner September 30, 2021 21:36
@christarazi
Copy link
Member Author

christarazi commented Sep 30, 2021

/test-backport-1.9

Edit: GKE images failed to pull: https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/6507/

@christarazi
Copy link
Member Author

christarazi commented Oct 1, 2021

/test-gke

Edit: failed again, turns out it's #17282

@joestringer joestringer merged commit 43eb445 into cilium:v1.9 Oct 1, 2021
@christarazi christarazi deleted the pr/christarazi/v1.9-backport-17492 branch October 1, 2021 23:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/backports This PR provides functionality previously merged into master.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants