Cilium with IPSec tunneling fails to start on 3033.2.2 #626

Closed
shosti opened this issue Feb 11, 2022 · 13 comments
Labels
kind/bug Something isn't working

Comments

@shosti

shosti commented Feb 11, 2022

Description

Since updating to 3033.2.2, all of my Cilium pods are in a CrashLoopBackOff state with the following error message:

level=fatal msg="IPSec with tunneling requires support for xfrm state output masks (Linux 4.19 or later)." error="invalid argument" subsys=daemon

After rolling back to 3033.2.1, Cilium starts up again.

Impact

Cilium fails -> all other pods can't get network -> general mayhem 😈

Environment and steps to reproduce

  1. Set-up:
  2. Task: Automatic upgrade
  3. Action(s): Update from 3033.2.1 to 3033.2.2, then check kubectl logs for a Cilium pod
  4. Error:
level=fatal msg="IPSec with tunneling requires support for xfrm state output masks (Linux 4.19 or later)." error="invalid argument" subsys=daemon

Expected behavior

Cilium pods start up correctly.

@shosti shosti added the kind/bug Something isn't working label Feb 11, 2022
@pothos
Member

pothos commented Feb 11, 2022

Thanks for the report. Maybe our Cilium tests need to be extended to catch cases like this in the future.
It looks like these xfrm changes could be related: https://lwn.net/Articles/882912/ (Linux 5.10.94)

@pothos
Member

pothos commented Feb 11, 2022

In the meantime, make sure you disable automatic updates: https://www.flatcar.org/docs/latest/setup/releases/update-strategies/#disable-automatic-updates

@pothos
Member

pothos commented Feb 11, 2022

I looked at the kernel config under /proc and there is no unexpected difference between the two versions (unlike a bugfix kernel update some time ago, where the config had changed).
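
For anyone who wants to repeat that comparison, here is a minimal sketch in Go. It assumes you have already saved the decompressed kernel config text from each version (typically exposed as /proc/config.gz when the kernel is built with that option). The function names `parseConfig` and `diffConfig` and the sample options in `main` are hypothetical, since the real configs were not posted in this thread.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// parseConfig collects KEY=value assignments from kernel config text,
// skipping blank lines and comments such as "# CONFIG_FOO is not set".
func parseConfig(text string) map[string]string {
	opts := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if key, val, ok := strings.Cut(line, "="); ok {
			opts[key] = val
		}
	}
	return opts
}

// diffConfig returns a sorted list describing every option whose value
// differs between the two configs, including options present in only one.
func diffConfig(oldCfg, newCfg map[string]string) []string {
	keys := make(map[string]struct{})
	for k := range oldCfg {
		keys[k] = struct{}{}
	}
	for k := range newCfg {
		keys[k] = struct{}{}
	}
	var out []string
	for k := range keys {
		o, inOld := oldCfg[k]
		n, inNew := newCfg[k]
		if !inOld {
			o = "<unset>"
		}
		if !inNew {
			n = "<unset>"
		}
		if inOld != inNew || o != n {
			out = append(out, fmt.Sprintf("%s: %s -> %s", k, o, n))
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical excerpts, used only to exercise the functions.
	oldText := "CONFIG_XFRM=y\nCONFIG_XFRM_INTERFACE=m\n"
	newText := "CONFIG_XFRM=y\nCONFIG_XFRM_INTERFACE=y\n"
	for _, d := range diffConfig(parseConfig(oldText), parseConfig(newText)) {
		fmt.Println(d)
	}
}
```

An empty result would match what was observed here: the configs are the same, so the behavior change has to come from the kernel code itself.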

@tormath1
Contributor

Here's a minimal repro, extracted from the part of Cilium that fails: https://gist.github.com/tormath1/bf3af973de9a4232698cc42199496496

With strace, we can see that we're hitting the following change: torvalds/linux@8dce439

Investigating on the netlink side...

@jepio
Member

jepio commented Feb 11, 2022

@tormath1 if it's failing on the changelink then it's a kernel bug (if_id is always 0).

@tormath1
Contributor

@jepio based on the linked commit, if_id needs to be different from 0... In the repro, it works fine when if_id is explicitly set to a nonzero value, for example 1.

@jepio
Member

jepio commented Feb 11, 2022

Missed the lower half of the gist on my phone xD

@pothos
Member

pothos commented Feb 11, 2022

It's really worth discussing how this (LTS) bugfix kernel update ended up in this situation, given the guiding principle that the userspace ABI should not be broken. Maybe this change could be reverted in the next kernel bugfix release?
@borkmann - do you have an opinion on this from the Cilium (and kernel) side?

Edit: also posted to lkml: https://marc.info/?l=linux-kernel&m=164483790014524&w=2

tormath1 added a commit to tormath1/cilium that referenced this issue Feb 24, 2022
in this patch:
https://patchwork.kernel.org/project/netdevbpf/patch/20220106093606.3046771-6-steffen.klassert@secunet.com/
we see that `if_id` must be different from 0 for policy and
state construction.

With a 0 value, it makes the creation of the dummy interface fail with
the following error:
```
level=fatal msg="IPSec with tunneling requires support for xfrm state output masks (Linux 4.19 or later)." error="invalid argument" subsys=daemon
```

Related-To: flatcar/Flatcar#626
Signed-off-by: Mathieu Tortuyaux <mtortuyaux@microsoft.com>
kaworu pushed a commit to cilium/cilium that referenced this issue Feb 28, 2022
in this patch:
https://patchwork.kernel.org/project/netdevbpf/patch/20220106093606.3046771-6-steffen.klassert@secunet.com/
we see that `if_id` must be different from 0 for policy and
state construction.

With a 0 value, it makes the creation of the dummy interface fail with
the following error:
```
level=fatal msg="IPSec with tunneling requires support for xfrm state output masks (Linux 4.19 or later)." error="invalid argument" subsys=daemon
```

Related-To: flatcar/Flatcar#626
Signed-off-by: Mathieu Tortuyaux <mtortuyaux@microsoft.com>
@tormath1
Contributor

tormath1 commented Mar 1, 2022

Hi @shosti, Cilium's PR has been merged. If you're feeling adventurous, you can build a Cilium image and try deploying it by updating this value: https://github.com/cilium/cilium/blob/v1.11.0/install/kubernetes/cilium/values.yaml#L84

Otherwise, the fix should be part of the next Cilium release (1.11.3). :)

@pothos
Member

pothos commented Mar 2, 2022

The proposal to revert upstream got rejected, but we decided to do it for Flatcar anyway: flatcar-archive/coreos-overlay#1682. This is not part of a release yet.

@tormath1
Contributor

tormath1 commented Mar 3, 2022

Hi @shosti, it seems that Cilium has already fixed the issue as of the latest release, 1.11.2. Do you think you could give it a try?
I tested with the IPSec test case (flatcar/mantle#292) and it seems to work as expected.

@pothos
Member

pothos commented Mar 3, 2022

As additional info, there is a regression: cilium/cilium#19019

Unfortunately, this workaround breaks IPsec connectivity between nodes. Once the XFRMA_IF_ID is set to the placeholder value (1), traffic that should be encrypted leaves the node without any encryption. On GKE and self-managed clusters, that's the only noticeable impact. However, on AKS and EKS, we also have BPF logic to rewrite the outer IP address to the proper IP. This still happens despite the failure to encrypt traffic, leading to packet drops.

Edit: With 1.11.2 this is not a problem

@pothos
Member

pothos commented Mar 4, 2022

I think we can close this now, as a Cilium release was done and we have reverted the change for next week's release anyway.

@pothos pothos closed this as completed Mar 4, 2022
@tormath1 tormath1 moved this from In Progress to Ready to Release - 2022-02-28 in Flatcar Container Linux Releases Planning Mar 8, 2022