v1.14 Backports 2023-09-08 #28021

Merged

Conversation

margamanterola
Member

Once this PR is merged, you can update the PR labels via:

for pr in 28012; do contrib/backporting/set-labels.py $pr done 1.14; done

or with

make add-labels BRANCH=v1.14 ISSUES=28012

@margamanterola margamanterola added kind/backports This PR provides functionality previously merged into master. backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. labels Sep 8, 2023
@margamanterola
Member Author

/test-backport-1.14

@margamanterola
Member Author

/ci-ipsec-upgrade

@margamanterola margamanterola marked this pull request as ready for review September 8, 2023 13:43
@margamanterola margamanterola requested a review from a team as a code owner September 8, 2023 13:43
@margamanterola
Member Author

Failures are unrelated to IPSec (currently v1.14 is failing due to a recent commit that broke some tests)

[ upstream commit c9ea7a5 ]

When rolling cilium-agent or doing an upgrade while running a stress
test with encryption enabled, a small number of XfrmInNoStates errors
are seen. To capture the error state (a cilium_host IP without an xfrm
state rule) you need to get into the pod near pod init and get somewhat
lucky that init took longer than usual. For example, I ran `ip x s` in
a pod about 15 seconds after launch and captured a case with new
XfrmInNoStates errors and a cilium_host IP assigned, but no xfrm state
rule for it. The packets received in that window are dropped.
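
One way to spot the error window described above is to compare two snapshots of the kernel's xfrm counters. The following is a minimal sketch, not part of this commit: it assumes the counters are available in the `/proc/net/xfrm_stat` text format (one `Name value` pair per line), and the helper names `parseXfrmStat` and `xfrmInDelta` are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseXfrmStat parses text in the /proc/net/xfrm_stat format
// ("<CounterName> <value>" per line) into a map of counters.
// Hypothetical helper for illustration only.
func parseXfrmStat(text string) (map[string]uint64, error) {
	counters := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or unexpected lines
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("counter %s: %w", fields[0], err)
		}
		counters[fields[0]] = v
	}
	return counters, sc.Err()
}

// xfrmInDelta returns the inbound counters (prefix "XfrmIn") that
// increased between two snapshots, with the size of each increase.
func xfrmInDelta(before, after map[string]uint64) map[string]uint64 {
	delta := make(map[string]uint64)
	for name, now := range after {
		if strings.HasPrefix(name, "XfrmIn") && now > before[name] {
			delta[name] = now - before[name]
		}
	}
	return delta
}

func main() {
	// Sample snapshots; on a real node the text would be read from
	// /proc/net/xfrm_stat before and after the agent rollout.
	before, err := parseXfrmStat("XfrmInError 0\nXfrmInNoStates 2\nXfrmOutError 0\n")
	if err != nil {
		panic(err)
	}
	after, err := parseXfrmStat("XfrmInError 0\nXfrmInNoStates 5\nXfrmOutError 1\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(xfrmInDelta(before, after))
}
```

A non-empty delta for XfrmInNoStates shortly after pod init would be consistent with packets arriving for an IP that has no xfrm state rule yet.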

The conclusion is that remote nodes learn the new router IP before we
have the xfrm state rule loaded. The remote nodes then start using that
IP as the IPSec tunnel outer IP, resulting in errors when the packets
reach the local node, which does not have the xfrm rule yet. The errors
eventually resolve, but some packets are lost in the meantime.

This happens for two reasons. First, we configure the datapath after we
push node object updates. This is wrong because we need to init the
IPSec code path before we teach remote nodes about the new IP. Second,
the configuration of the datapath does a lookup in the node object's
IPAddresses{}, which is only populated from the k8s watcher in the
tunnel case. So we only have the fully populated node object after we
receive it through the k8s watcher. Again, it's possible that other
nodes have already seen the event and started pushing traffic with the
new IPs.

To resolve this, move the IPSec init code so that it creates the xfrm
rules needed for the new IPs before we publish them to the k8s node
object. And instead of pulling the IPs out of the node object, simply
pull them directly from the node module. This resolves the
XfrmInNoStates and XfrmIn*Policy* errors I've seen.
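
The ordering constraint described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this commit: the interfaces `xfrmInstaller` and `nodePublisher`, the function `announceRouterIP`, and the `recorder` type are all hypothetical names that stand in for the IPSec init path and the k8s node object update.

```go
package main

import (
	"errors"
	"fmt"
)

// xfrmInstaller is a hypothetical stand-in for the IPSec init path
// that loads xfrm state rules for a local router IP.
type xfrmInstaller interface {
	InstallState(routerIP string) error
}

// nodePublisher is a hypothetical stand-in for the code that writes
// the router IP into the k8s node object seen by remote nodes.
type nodePublisher interface {
	PublishRouterIP(routerIP string) error
}

// announceRouterIP installs the xfrm state first and publishes the IP
// only if that succeeded, so remote nodes never learn an IP that the
// local node cannot yet decrypt traffic for.
func announceRouterIP(ip string, x xfrmInstaller, p nodePublisher) error {
	if err := x.InstallState(ip); err != nil {
		return fmt.Errorf("install xfrm state for %s: %w", ip, err)
	}
	return p.PublishRouterIP(ip)
}

// recorder implements both interfaces and records the call order.
type recorder struct {
	events      []string
	failInstall bool
}

func (r *recorder) InstallState(ip string) error {
	if r.failInstall {
		return errors.New("simulated xfrm failure")
	}
	r.events = append(r.events, "xfrm-state:"+ip)
	return nil
}

func (r *recorder) PublishRouterIP(ip string) error {
	r.events = append(r.events, "publish:"+ip)
	return nil
}

func main() {
	r := &recorder{}
	if err := announceRouterIP("10.0.0.1", r, r); err != nil {
		panic(err)
	}
	fmt.Println(r.events)
}
```

The point of the sketch is only the sequencing: the publish step is unreachable until the state install has returned successfully, and the IP is passed in directly rather than read back from the published node object.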

I can consistently reproduce the errors with about 30 nodes, an
httpperf test running from a pod on every node, and a 'rollout' of the
cilium agent repeated for a while. Running this for 2-3 hours almost
guarantees that errors pop up, and usually they happen much sooner.
Initially I saw these errors in upgrade tests, which is another way to
reproduce them.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Margarita Manterola <margamanterola@gmail.com>
@margamanterola
Member Author

/test-backport-1.14

Contributor

@michi-covalent michi-covalent left a comment

@jrfastab could you also approve this 🙏

@margamanterola
Member Author

/ci-ipsec-upgrade

@michi-covalent michi-covalent merged commit 69b4ceb into cilium:v1.14 Sep 8, 2023
57 checks passed
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 8, 2023
Labels
backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
3 participants