New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pr/v1.11 backport 2023 05 11 #25382
Pr/v1.11 backport 2023 05 11 #25382
Conversation
[ upstream commit d39ca10 ] While extending datapath coverage, Martynas found a new bug affecting IPsec + VXLAN + endpoint routes. In that configuration, cross-node pod connectivity seems to fail and we see the IPsec XfrmInNoPols error counter increasing. By tracing the connection, we can see that the packet disappears on the receiving node, after decryption, between bpf_overlay (second traversal) and the lxc device. On that path, given decryption already happened, we should match the FWD XFRM policy: src 0.0.0.0/0 dst 10.244.0.0/24 uid 0 dir fwd action allow index 106 priority 2975 share any flag (0x00000000) lifetime config: limit: soft (INF)(bytes), hard (INF)(bytes) limit: soft (INF)(packets), hard (INF)(packets) expire add: soft 0(sec), hard 0(sec) expire use: soft 0(sec), hard 0(sec) lifetime current: 0(bytes), 0(packets) add 2023-01-13 14:34:18 use 2023-01-13 14:34:22 mark 0/0xf00 Clearly, given the non-zero XfrmInNoPols, the packet doesn't match the policy. Note XfrmInNoPols is also reported for FWD; there is no XfrmFwdNoPols. Checking the source code [1], we can see that when endpoint routes are enabled, we encode the source security identity into the packet mark. We thus won't match the 0/0xf00 mark on the FWD XFRM policy. We shouldn't need to match on any packet mark for the FWD XFRM policy; we want to allow all packets through. This commit therefore removes the packet mark match for the FWD direction. 1 - https://github.com/cilium/cilium/blob/v1.13.0-rc4/bpf/lib/l3.h#L151-L154 Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
[ upstream commit 54fa995 ] [ Backporter's notes: Dropped call to 'helpers.RunsOnAKS()' which does not exist in v1.11 ] The previous commit fixed a connectivity bug affecting the above configuration. We can now extend tests to cover that configuration. Note that these tests are soon going to be removed and replaced by the new GitHub workflow. However, we may need to backport this pull request to stable branches where the GitHub workflow doesn't exist. Therefore, the corresponding extension of the workflow test will be done in a separate pull request. Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
fd817a7
to
64c6c25
Compare
@pchaigno Dropped call to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My PRs look good. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my commit lgtm
My changes LGTM |
[ upstream commit 71af0a2 ] [ Backporter's node: Added '!privileged_tests' to the test ] There are condition possible in which CEP changes CES. This leads to CEP being present in multiple CESs for some time. In such cases the standard logic may not work as it always expect to have a single CEP representation. This commit changes to logic to handle multiple CEPs properly. Signed-off-by: Alan Kutniewski <kutniewski@google.com> Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
[ upstream commit 7335aa3 ] Another option would be to quarantine the test and find an assignee to make the test more robust, but I assert that we don't need test coverage for monitor verbose output. Fixes: cilium#25178 Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
[ upstream commit 9115e05 ] We correctly detect that we failed to allocate a new node ID (due to exhaustion of the idpool), but then still go ahead and map it. This leads to spurious errors which include "Failed to map node IP address to allocated ID". Instead, don't try to map NoID and return it directly. Fixes: af88b42 (datapath: Introduce node IDs) Suggested-by: Paul Chaignon <paul@cilium.io> Signed-off-by: David Bimmler <david.bimmler@isovalent.com> Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
[ upstream commit 2045d59 ] We currently install the default-drop XFRM policy when we install the XFRM policies and states for the local node. It is however possible for us to start installing XFRM policies and states for remote nodes before we handle the local one. The default-drop XFRM policy is a safety measure for when we move XFRM policies around. Because we don't always have a way to atomically update XFRM policies, it's possible that we end up with a very short time where no encryption XFRM OUT policy is matching a subset of traffic. The default-drop policy ensures that we drop such traffic instead of letting it leave the node as plain-text. We therefore want this default-drop XFRM policy to be installed before we update any other other XFRM policy. This commit therefore moves its installation before any other XFRM update instead of before just local-node XFRM updates. Fixes: 7d44f37 ("ipsec: Catch-default default drop policy for encryption") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
64c6c25
to
b5aefd3
Compare
Added |
/test-backport-1.11 Job 'Cilium-PR-K8s-1.21-kernel-4.9' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.9/2628/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Then please upload the Jenkins artifacts to that issue. |
Build finished. |
/test-1.17-4.9 |
/test-1.21-4.9 |
/test-1.21-5.4 |
/test-1.22-4.19 |
/test-1.23-4.9 |
/test-runtime |
/test-upstream-k8s |
/test-1.16-netnext |
/test-1.18-4.9 |
/test-1.19-4.9 |
test-1.17-4.9 failed with known flake #24394 |
test-1.21-4.9 failed with known flake #24394 |
Travis run succeeded, but build status was still "in progress". Two instances of kernel 4.9 flake. Merging. |
v1.11 backports 2023-05-11
Once this PR is merged, you can update the PR labels via: