v1.10 backport 2021-10-08 #17559
Merged
Conversation
[ upstream commit 13ebeb0 ]

This adds two hidden/undocumented options to the agent which allow Cilium in KPR=strict mode to be deployed with Mirantis Kubernetes Engine (MKE):

  --enable-mke=true
  --mke-cgroup-mount=""    (auto-detection as default, or for manual specification:)
  --mke-cgroup-mount="/sys/fs/cgroup/net_cls,net_prio"

MKE adds a number of Docker containers onto each MKE node which are otherwise neither visible to nor managed by Cilium, for example:

  # docker network inspect ucp-bridge -f "{{json .Containers }}" | jq . | grep Name
    "Name": "ucp-kv",
    "Name": "ucp-kube-controller-manager",
    "Name": "ucp-kube-apiserver",
    "Name": "ucp-swarm-manager",
    "Name": "ucp-kubelet",
    "Name": "ucp-auth-store",
    "Name": "ucp-cluster-root-ca",
    "Name": "ucp-hardware-info",
    "Name": "ucp-client-root-ca",
    "Name": "ucp-kube-scheduler",
    "Name": "ucp-proxy",
    "Name": "ucp-controller",

These containers [0] hold components such as the kube-apiserver, which live in their own network namespaces with their own private address range of 172.16.0.0/12. The links to the hostns are set up by MKE as veth pairs connected to a bridge device:

  [...]
  59: br-61d49ba5e56d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
      link/ether 02:42:b2:e4:55:ff brd ff:ff:ff:ff:ff:ff
      inet 172.19.0.1/16 brd 172.19.255.255 scope global br-61d49ba5e56d
         valid_lft forever preferred_lft forever
  61: vethd56c086@if60: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-61d49ba5e56d state UP group default
      link/ether 06:ad:07:c6:55:e8 brd ff:ff:ff:ff:ff:ff link-netnsid 4
  63: veth7db52f6@if62: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-61d49ba5e56d state UP group default
      link/ether aa:10:e2:d8:b7:6c brd ff:ff:ff:ff:ff:ff link-netnsid 5
  65: vethe23d66c@if64: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-61d49ba5e56d state UP group default
      link/ether ba:f1:e3:de:ce:a0 brd ff:ff:ff:ff:ff:ff link-netnsid 6
  [...]
This is different from regular K8s deployments, where such components reside in the hostns. For the socket LB, which is enabled in KPR=strict deployments, this is problematic: these containers are not seen the same way as the hostns, so not all the translations we perform in the hostns (such as accessing a NodePort via loopback/local addressing) might take place. We noticed this in particular in combination with the work in f7303af ("Adds a new option to skip socket lb when in pod ns"), which is needed to get the Istio use case working under MKE: the latter treats these MKE system containers the same way as application Pods and therefore disables the socket LB for them, while no bpf_lxc-style per-packet translation gets attached from tc, hence a complete lack of service translation in this scenario.

One observation in MKE environments is that cgroup-v2 is only supported with MCR 20.10, which is not available in every deployment at this point. However, MKE makes use of cgroup-v1, and under /sys/fs/cgroup/net_cls/ it populates both the com.docker.ucp/ and docker/ subdirectories. One idea for a non-intrusive fix to get KPR=strict deployments working is to tag these containers' net_cls controllers with a magic marker, which can then be read out from the socket LB via the kernel extension we added some time ago [1]. Given that get_cgroup_classid() relies on 'current' as the task it queries, we can determine that such containers should get similar service handling behavior as in the hostns. This works reliably, as 'current' points to the application doing the syscall, which is always in process context. Pods live under /sys/fs/cgroup/net_cls/kubepods/, whereas all MKE containers live under /sys/fs/cgroup/net_cls/{com.docker.ucp,docker}/. Upon start, the agent sets a net_cls tag for all paths under the latter. On the cgroup side, this walks all sockets of all processes of a given cgroup and tags them.
In case MKE sets up a subpath under the latter, it will automatically inherit the net_cls tag as per cgroup semantics. This has two limitations which were found to be acceptable: i) this will only work in Kind environments with the kernel fixes we upstreamed in [2], and ii) no other application on the node can use the same net_cls tag. Running MKE on Kind is not supported at the moment, so i) is a non-issue right now, and it is very unlikely to run into collisions related to ii). This approach has been tested on RHEL8, and Duffie asserted that connectivity works as expected when testing manually.

For the record, two alternative options were weighed against this approach: i) attaching cgroup-v2 non-root programs, and ii) per-packet translation at the tc level. Unfortunately, i) was not an option, since MKE does not support cgroup-v2 for the near future, and therefore MKE-related containers are also not in their own cgroup-v2 path in the unified hierarchy; otherwise it would have allowed for a clean way to override default behavior for specific containers. Option ii) would have been very intrusive: the agent would need to detect MKE-related veth devices, attach to tc ingress and tc egress, and we would have to split out the bpf_lxc service translation bits or attach some form of stripped-down bpf_lxc object to them in order to perform DNAT and reverse DNAT. The approach taken here achieves the same in just a few extra lines of code.
[0] https://docs.mirantis.com/mke/3.4/ref-arch/manager-nodes.html
    https://docs.mirantis.com/mke/3.4/ref-arch/worker-nodes.html
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a52ae4e32a61ad06ef67f0b3123adbdbac4fb83
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8520e224f547cd070c7c8f97b1fc6d58cff7ccaa
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=78cc316e9583067884eb8bd154301dc1e9ee945c

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Duffie Cooley <dcooley@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
pchaigno removed the review requests for jibi, kaworu, errordeveloper, nathanjsweet, tklauser, tgraf, pchaigno, christarazi, jrajahalme and qmonnet on October 11, 2021 07:50
pchaigno unassigned gandro, jibi, kaworu, errordeveloper, nathanjsweet, tklauser, tgraf, pchaigno, christarazi and qmonnet on Oct 11, 2021
errordeveloper approved these changes on Oct 11, 2021
/test-backport-1.10

Job 'Cilium-PR-Runtime-4.9' failed and has not been observed before, so it may be related to your PR.

Edit: the Jenkins image runs all timed out, again. Closed and reopened the PR.
/test-backport-1.10

retest-1.16-netnext

test-1.16-netnext
Once this PR is merged, you can update the PR labels via: