
bpf: Add extension for running sock LB on MKE-related containers #17513

Merged
merged 1 commit into master from pr/mke-fix on Oct 5, 2021

Conversation

@borkmann (Member) commented Oct 1, 2021

See commit msg.

@borkmann borkmann added the sig/datapath (Impacts bpf/ or low-level forwarding details, including map management and monitor messages.), release-note/misc (This PR makes changes that have no direct user impact.) and sig/loadbalancing labels on Oct 1, 2021
@borkmann borkmann requested a review from brb October 1, 2021 12:50
@borkmann borkmann requested review from a team October 1, 2021 12:50
@borkmann borkmann requested review from a team as code owners October 1, 2021 12:50
@borkmann borkmann requested a review from ti-mo October 1, 2021 12:50
@borkmann borkmann force-pushed the pr/mke-fix branch 2 times, most recently from 4ea1883 to d4a6a2e Compare October 1, 2021 13:21
@borkmann (Member, Author) commented Oct 1, 2021

test-me-please

@joamaki (Contributor) left a comment

Looks reasonable given the constraints in this situation.

@borkmann (Member, Author) commented Oct 1, 2021

test-me-please

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:


Test Name

K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests LoadBalancer Connectivity to endpoint via LB

Failure Output

FAIL: Can not connect to service "http://192.168.1.146" from outside cluster (1/10)

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.16-net-next so I can create a new GitHub issue to track it.

@borkmann (Member, Author) commented Oct 1, 2021

(rebased)

@borkmann (Member, Author) commented Oct 1, 2021

(extended commit message to describe other approaches that have been evaluated as well just for the record)

@borkmann (Member, Author) commented Oct 1, 2021

test-me-please

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:


Test Name

K8sPolicyTest Basic Test Validate to-entities policies Validate toEntities All

Failure Output

FAIL: Found 2 k8s-app=cilium logs matching list of errors that must be investigated:

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.16-net-next so I can create a new GitHub issue to track it.

@joestringer (Member) left a comment

Thanks for the well-written explanation in the commit message. Seems like the most reasonable option we have for supporting MKE today, and this code is pretty self-contained. I can't wait for some combination of socket cookies or cgroup v2 to let us solve this problem without platform-specific tricks like this.

@borkmann (Member, Author) commented Oct 4, 2021

/mlh new-flake Cilium-PR-K8s-1.16-net-next

👍 created #17521

@brb (Member) left a comment

The logic LGTM, a few nits.

Also, maybe it makes sense to extend the commit msg in order to mention the Istio usecase wrt the socket-lb bypass (that's how we discovered the issue) in the following paragraph:

This is different compared to regular K8s deployments where such components
reside in the hostns. For the socket LB which is enabled in KPR=strict deployments
this is problematic as these containers are not seen the same way as hostns and
therefore not all the translations might take place as we perform in hostns (like
accessing NodePort via loopback/local addressing, etc).

7 review threads on daemon/cmd/kube_proxy_replacement.go (resolved)
@borkmann borkmann force-pushed the pr/mke-fix branch 2 times, most recently from 74b56d9 to 94b99ad Compare October 4, 2021 14:44
@borkmann (Member, Author) commented Oct 4, 2021

test-me-please

This adds two hidden/undocumented options to the agent which allow Cilium in
KPR=strict mode to be deployed with Mirantis Kubernetes Engine (MKE):

  --enable-mke=true
  --mke-cgroup-mount=""    (auto-detection as default, or for manual specification:)
  --mke-cgroup-mount="/sys/fs/cgroup/net_cls,net_prio"

MKE adds a number of Docker containers onto each MKE node which are otherwise
neither visible to nor managed by Cilium, for example:

  docker network inspect ucp-bridge -f "{{json .Containers }}"  | jq . | grep Name
    "Name": "ucp-kv",
    "Name": "ucp-kube-controller-manager",
    "Name": "ucp-kube-apiserver",
    "Name": "ucp-swarm-manager",
    "Name": "ucp-kubelet",
    "Name": "ucp-auth-store",
    "Name": "ucp-cluster-root-ca",
    "Name": "ucp-hardware-info",
    "Name": "ucp-client-root-ca",
    "Name": "ucp-kube-scheduler",
    "Name": "ucp-proxy",
    "Name": "ucp-controller",

They [0] contain components like the kube-apiserver which live in their own
network namespaces with their own private address range of 172.16.0.0/12. The
links to the hostns are set up by MKE as veth pairs connected to a bridge
device:

  [...]
  59: br-61d49ba5e56d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
      link/ether 02:42:b2:e4:55:ff brd ff:ff:ff:ff:ff:ff
      inet 172.19.0.1/16 brd 172.19.255.255 scope global br-61d49ba5e56d
       valid_lft forever preferred_lft forever
  61: vethd56c086@if60: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-61d49ba5e56d state UP group default
      link/ether 06:ad:07:c6:55:e8 brd ff:ff:ff:ff:ff:ff link-netnsid 4
  63: veth7db52f6@if62: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-61d49ba5e56d state UP group default
      link/ether aa:10:e2:d8:b7:6c brd ff:ff:ff:ff:ff:ff link-netnsid 5
  65: vethe23d66c@if64: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-61d49ba5e56d state UP group default
      link/ether ba:f1:e3:de:ce:a0 brd ff:ff:ff:ff:ff:ff link-netnsid 6
  [...]

This is different compared to regular K8s deployments where such components
reside in the hostns. For the socket LB which is enabled in KPR=strict deployments
this is problematic as these containers are not seen the same way as hostns and
therefore not all the translations might take place as we perform in hostns (like
accessing NodePort via loopback/local addressing, etc). We've noticed this in
particular in combination with the work in f7303af ("Adds a new option to skip
socket lb when in pod ns") which is needed to get the Istio use case working
under MKE: the latter treats these MKE system containers the same way as
application Pods and therefore disables socket LB for them, while no bpf_lxc
style per-packet translation gets attached from tc either, hence a complete
lack of service translation in this scenario.

One observation in MKE environments is that cgroups-v2 is only supported with
MCR 20.10 which is not available in every deployment at this point. However,
MKE makes use of cgroup-v1 and under /sys/fs/cgroup/net_cls/ it populates both
the com.docker.ucp/ and docker/ subdirectories. One idea for a non-intrusive
fix to get KPR=strict deployments working is to tag these containers' net_cls
controllers with a magic marker which can then be read out from the socket LB
via the kernel extension we added some time ago [1]. Given this relies on
'current' as the task, we can query get_cgroup_classid() to determine that such
sockets should get the same service handling behavior as in hostns. This works
reliably as 'current' points to the application doing the syscall, which is
always in process context.
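
To make the mechanism concrete, the following is a minimal sketch (not
Cilium's actual bpf_sock.c code) of how a socket LB program attached at
cgroup/connect4 could consult the classid of the calling task, assuming a
kernel carrying the extension from [1]. The marker value, section and
function names below are illustrative placeholders only:

  /* Hypothetical sketch, not Cilium's implementation. */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* Placeholder marker; must match what the agent writes into
   * net_cls.classid of the MKE cgroup paths. */
  #define MKE_CLASSID_MARKER 0xcafe0001

  SEC("cgroup/connect4")
  int sock4_connect(struct bpf_sock_addr *ctx)
  {
          /* With [1], get_cgroup_classid() is callable from cgroup v2
           * sock_addr hooks and returns the net_cls (cgroup v1) classid
           * of 'current', i.e. the task doing the connect(2). */
          if (bpf_get_cgroup_classid(NULL) == MKE_CLASSID_MARKER) {
                  /* Treat this socket like one created in hostns and
                   * apply the full service translation. */
          }
          /* ... regular socket LB service lookup and DNAT ... */
          return 1;
  }

  char _license[] SEC("license") = "GPL";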

Pods are under /sys/fs/cgroup/net_cls/kubepods/ whereas all MKE containers are
under /sys/fs/cgroup/net_cls/{com.docker.ucp,docker}/. Upon start, the agent
will set a net_cls tag for all paths under the latter. On the cgroup side,
this will walk all sockets of all processes of a given cgroup and tag them.
If MKE sets up a subpath under the latter at a later point, it will
automatically inherit the net_cls tag as per cgroup semantics.
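
In Cilium the tagging itself is done from the agent in Go (cf. the review
threads on daemon/cmd/kube_proxy_replacement.go above); stripped of
auto-detection and error handling, the underlying operation boils down to
writing the marker into net_cls.classid, roughly as in this standalone C
sketch with placeholder paths and marker value:

  /* Illustrative only; the real agent walks all subpaths under the MKE
   * cgroup directories and does this in Go. */
  #include <stdio.h>

  #define MKE_CLASSID_MARKER 0xcafe0001u

  /* Writing net_cls.classid makes the kernel walk all sockets of all
   * processes in that cgroup and retag them; subpaths created later
   * inherit the value as per cgroup semantics. */
  static int tag_net_cls(const char *cgroup_path, unsigned int classid)
  {
          char file[512];
          FILE *f;

          snprintf(file, sizeof(file), "%s/net_cls.classid", cgroup_path);
          f = fopen(file, "w");
          if (!f)
                  return -1;
          fprintf(f, "%u\n", classid);
          fclose(f);
          return 0;
  }

  int main(void)
  {
          /* Placeholder mount point; the agent auto-detects it or takes
           * it from --mke-cgroup-mount. */
          const char *base = "/sys/fs/cgroup/net_cls,net_prio";
          char path[512];

          snprintf(path, sizeof(path), "%s/com.docker.ucp", base);
          tag_net_cls(path, MKE_CLASSID_MARKER);
          snprintf(path, sizeof(path), "%s/docker", base);
          tag_net_cls(path, MKE_CLASSID_MARKER);
          return 0;
  }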

This has two limitations which were found to be acceptable: i) this will only
work in Kind environments with the kernel fixes we upstreamed in [2], and
ii) no other application on the node can use the same net_cls tag. Running MKE
on Kind is not supported at the moment, so i) is a non-issue right now, and
collisions related to ii) are very unlikely.

This approach has been tested on RHEL8, and Duffie confirmed that connectivity
works as expected when testing manually.

For the record, two alternative options were weighed against this approach:
i) attaching cgroups-v2 non-root programs, and ii) per-packet translation at
the tc level. Unfortunately i) was not an option since MKE will not support
cgroups-v2 in the near future and therefore MKE-related containers are also
not in their own cgroup-v2 path in the unified hierarchy; otherwise it would
have allowed for a clean way to override default behavior for specific
containers. Option ii) would have been very intrusive: the agent would need to
detect MKE-related veth devices, attach to tc ingress and tc egress, and we
would have to split out the bpf_lxc service translation bits or attach some
form of stripped-down bpf_lxc object to them in order to perform DNAT and
reverse DNAT. The approach taken here achieves the same in just a few lines
of extra code.

  [0] https://docs.mirantis.com/mke/3.4/ref-arch/manager-nodes.html
      https://docs.mirantis.com/mke/3.4/ref-arch/worker-nodes.html
  [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a52ae4e32a61ad06ef67f0b3123adbdbac4fb83
  [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8520e224f547cd070c7c8f97b1fc6d58cff7ccaa
      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=78cc316e9583067884eb8bd154301dc1e9ee945c

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Duffie Cooley <dcooley@isovalent.com>

@borkmann (Member, Author) commented Oct 4, 2021

test-me-please

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:


Test Name

K8sBookInfoDemoTest Bookinfo Demo Tests bookinfo demo

Failure Output

FAIL: DNS entry is not ready after timeout

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.16-net-next so I can create a new GitHub issue to track it.

@borkmann (Member, Author) commented Oct 4, 2021

retest-gke

@borkmann (Member, Author) commented Oct 5, 2021

/mlh new-flake Cilium-PR-K8s-1.16-net-next

👍 created #17534

@borkmann borkmann merged commit 13ebeb0 into master Oct 5, 2021
@borkmann borkmann deleted the pr/mke-fix branch October 5, 2021 06:19
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.5 Oct 5, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.5 Oct 8, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.5 Oct 12, 2021