bpf: Add extension for running sock LB on MKE-related containers
This adds two hidden/undocumented options to the agent which allow Cilium in
KPR=strict mode to be deployed with Mirantis Kubernetes Engine (MKE):

  --enable-mke=true
  --mke-cgroup-mount=""    (auto-detection as default, or for manual specification:)
  --mke-cgroup-mount="/sys/fs/cgroup/net_cls,net_prio"

MKE adds a number of Docker containers to each MKE node which are otherwise
neither visible to nor managed by Cilium, for example:

  docker network inspect ucp-bridge -f "{{json .Containers }}"  | jq . | grep Name
    "Name": "ucp-kv",
    "Name": "ucp-kube-controller-manager",
    "Name": "ucp-kube-apiserver",
    "Name": "ucp-swarm-manager",
    "Name": "ucp-kubelet",
    "Name": "ucp-auth-store",
    "Name": "ucp-cluster-root-ca",
    "Name": "ucp-hardware-info",
    "Name": "ucp-client-root-ca",
    "Name": "ucp-kube-scheduler",
    "Name": "ucp-proxy",
    "Name": "ucp-controller",

They [0] contain components such as the kube-apiserver, which then live in their
own network namespaces with a private address range of 172.16.0.0/12. The links
to the hostns are set up by MKE as veth pairs which are connected to a bridge
device:

  [...]
  59: br-61d49ba5e56d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
      link/ether 02:42:b2:e4:55:ff brd ff:ff:ff:ff:ff:ff
      inet 172.19.0.1/16 brd 172.19.255.255 scope global br-61d49ba5e56d
       valid_lft forever preferred_lft forever
  61: vethd56c086@if60: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-61d49ba5e56d state UP group default
      link/ether 06:ad:07:c6:55:e8 brd ff:ff:ff:ff:ff:ff link-netnsid 4
  63: veth7db52f6@if62: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-61d49ba5e56d state UP group default
      link/ether aa:10:e2:d8:b7:6c brd ff:ff:ff:ff:ff:ff link-netnsid 5
  65: vethe23d66c@if64: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-61d49ba5e56d state UP group default
      link/ether ba:f1:e3:de:ce:a0 brd ff:ff:ff:ff:ff:ff link-netnsid 6
  [...]

This differs from regular K8s deployments, where such components reside in the
hostns. For the socket LB, which is enabled in KPR=strict deployments, this is
problematic: these containers are not seen the same way as the hostns, so some of
the translations we perform in the hostns might not take place (such as accessing
NodePort via loopback/local addressing). We noticed this in particular in
combination with the work in f7303af ("Adds a new option to skip socket lb when
in pod ns"), which is needed to get the Istio use-case working under MKE. That
option treats these MKE system containers in the same way as application Pods and
therefore disables the socket LB for them. Since no bpf_lxc-style per-packet
translation is attached from tc for these containers either, service translation
is missing entirely in this scenario.

One observation in MKE environments is that cgroups-v2 is only supported with
MCR 20.10, which is not available in every deployment at this point. However, MKE
makes use of cgroup-v1, and under /sys/fs/cgroup/net_cls/ it populates both the
com.docker.ucp/ and docker/ subdirectories. One idea for a non-intrusive fix to
get KPR=strict deployments working is to tag these containers' net_cls
controllers with a magic marker which can then be read out from the socket LB
with the kernel extension we added some time ago [1]. Given that this relies on
'current' as the task, we can call get_cgroup_classid() to determine that the
task should get similar service handling behavior as in the hostns. This works
reliably as 'current' points to the application doing the syscall, which is
always in process context.

Pods are under /sys/fs/cgroup/net_cls/kubepods/ whereas all MKE containers are
under /sys/fs/cgroup/net_cls/{com.docker.ucp,docker}/. Upon start, the agent sets
a net_cls tag for all paths under the latter. On the cgroup side, this walks all
sockets of all processes of a given cgroup and tags them. In case MKE sets up a
subpath under the latter, it automatically inherits the net_cls tag as per
cgroup semantics.

This has two limitations which were found to be acceptable: i) this will only work
in Kind environments with kernel fixes we upstreamed in [2], and ii) no other
application on the node can use the same net_cls tag. Running MKE on Kind is not
supported at the moment, so i) is a non-issue right now. And it's very unlikely
to run into collisions related to ii).

This approach has been tested on RHEL8, and Duffie confirmed that connectivity
works as expected when testing manually.

For the record, two alternative options were weighed against this approach:
i) attaching cgroups-v2 non-root programs, and ii) per-packet translation at the
tc level. Unfortunately, i) was not an option since MKE will not support
cgroups-v2 in the near future, and therefore MKE-related containers are also not
in their own cgroup-v2 path in the unified hierarchy. Otherwise it would have
allowed for a clean way to override default behavior for specific containers.
Option ii) would have been very intrusive: the agent would need to detect
MKE-related veth devices and attach to tc ingress and tc egress, and we would
have to split out the bpf_lxc service translation bits or attach some form of
stripped-down bpf_lxc object to them in order to perform DNAT and reverse DNAT.
The approach taken here achieves the same in just a few lines of extra code.

  [0] https://docs.mirantis.com/mke/3.4/ref-arch/manager-nodes.html
      https://docs.mirantis.com/mke/3.4/ref-arch/worker-nodes.html
  [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a52ae4e32a61ad06ef67f0b3123adbdbac4fb83
  [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8520e224f547cd070c7c8f97b1fc6d58cff7ccaa
      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=78cc316e9583067884eb8bd154301dc1e9ee945c

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Duffie Cooley <dcooley@isovalent.com>
borkmann committed Oct 5, 2021
1 parent b36c93b commit 13ebeb0
Showing 7 changed files with 136 additions and 1 deletion.
13 changes: 12 additions & 1 deletion bpf/bpf_sock.c
@@ -83,6 +83,16 @@ void ctx_set_port(struct bpf_sock_addr *ctx, __be16 dport)
 	ctx->user_port = (__u32)dport;
 }

+static __always_inline __maybe_unused bool task_in_extended_hostns(void)
+{
+#ifdef ENABLE_MKE
+	/* Extension for non-Cilium managed containers on MKE. */
+	return get_cgroup_classid() == MKE_HOST;
+#else
+	return false;
+#endif
+}
+
 static __always_inline __maybe_unused bool
 ctx_in_hostns(void *ctx __maybe_unused, __net_cookie *cookie)
 {
@@ -91,7 +101,8 @@ ctx_in_hostns(void *ctx __maybe_unused, __net_cookie *cookie)

 	if (cookie)
 		*cookie = own_cookie;
-	return own_cookie == HOST_NETNS_COOKIE;
+	return own_cookie == HOST_NETNS_COOKIE ||
+	       task_in_extended_hostns();
 #else
 	if (cookie)
 		*cookie = 0;
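The effect of the bpf_sock.c change can be modelled outside of BPF. The following is only a sketch in Go, not the datapath code: `inHostNS` and its parameters are hypothetical names introduced for illustration, and the cookie values used in `main` are made up. It shows that a task counts as "hostns" either through the netns cookie or, with the MKE extension enabled, through the net_cls classid marker.

```go
package main

import "fmt"

// mkeHostClassID mirrors option.HostExtensionMKE from this commit.
const mkeHostClassID uint32 = 0x1bda7a

// inHostNS is a hypothetical model of ctx_in_hostns(): a socket is handled
// like a host namespace socket if its netns cookie matches the host's
// cookie, or, when the MKE extension is enabled, if the calling task's
// net_cls classid carries the MKE marker.
func inHostNS(ownCookie, hostCookie uint64, classID uint32, mkeEnabled bool) bool {
	if ownCookie == hostCookie {
		return true
	}
	return mkeEnabled && classID == mkeHostClassID
}

func main() {
	const hostCookie = 1 // made-up value for illustration

	// Regular host process.
	fmt.Println(inHostNS(1, hostCookie, 0, false))
	// MKE system container: own netns, but tagged via net_cls.
	fmt.Println(inHostNS(42, hostCookie, mkeHostClassID, true))
	// Application Pod: own netns, no tag.
	fmt.Println(inHostNS(42, hostCookie, 0, true))
}
```

The classid comparison only widens the set of tasks treated as hostns; an untagged Pod is handled exactly as before.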
3 changes: 3 additions & 0 deletions bpf/include/bpf/helpers.h
@@ -46,6 +46,9 @@ static __u64 BPF_FUNC(jiffies64);
static __sock_cookie BPF_FUNC(get_socket_cookie, void *ctx);
static __net_cookie BPF_FUNC(get_netns_cookie, void *ctx);

/* Legacy cgroups */
static __u32 BPF_FUNC(get_cgroup_classid);

/* Debugging */
static __printf(1, 3) void
BPF_FUNC(trace_printk, const char *fmt, int fmt_size, ...);
8 changes: 8 additions & 0 deletions daemon/cmd/daemon_main.go
@@ -563,6 +563,14 @@ func init() {
	flags.Bool(option.EnableLocalRedirectPolicy, false, "Enable Local Redirect Policy")
	option.BindEnv(option.EnableLocalRedirectPolicy)

	flags.Bool(option.EnableMKE, false, "Enable BPF kube-proxy replacement for MKE environments")
	flags.MarkHidden(option.EnableMKE)
	option.BindEnv(option.EnableMKE)

	flags.String(option.CgroupPathMKE, "", "Cgroup v1 net_cls mount path for MKE environments")
	flags.MarkHidden(option.CgroupPathMKE)
	option.BindEnv(option.CgroupPathMKE)

	flags.String(option.NodePortMode, option.NodePortModeSNAT, "BPF NodePort mode (\"snat\", \"dsr\", \"hybrid\")")
	flags.MarkHidden(option.NodePortMode)
	option.BindEnv(option.NodePortMode)
93 changes: 93 additions & 0 deletions daemon/cmd/kube_proxy_replacement.go
@@ -8,17 +8,22 @@ package cmd
import (
	"errors"
	"fmt"
	"io"
	"math"
	"net"
	"os"
	"path/filepath"
	"strconv"
	"strings"

	"github.com/cilium/cilium/pkg/bpf"
	"github.com/cilium/cilium/pkg/datapath/linux/probes"
	"github.com/cilium/cilium/pkg/datapath/loader"
	datapathOption "github.com/cilium/cilium/pkg/datapath/option"
	"github.com/cilium/cilium/pkg/logging/logfields"
	"github.com/cilium/cilium/pkg/mac"
	"github.com/cilium/cilium/pkg/maglev"
	"github.com/cilium/cilium/pkg/mountinfo"
	"github.com/cilium/cilium/pkg/node"
	"github.com/cilium/cilium/pkg/option"
	"github.com/cilium/cilium/pkg/probe"
@@ -230,6 +235,28 @@ func initKubeProxyReplacementOptions() (bool, error) {
		// be v4-in-v6 connections even if the agent has v6 support disabled.
		probe.HaveIPv6Support()

		if option.Config.EnableMKE {
			foundClassid := false
			foundCookie := false
			if h := probesManager.GetHelpers("cgroup_sock_addr"); h != nil {
				if _, ok := h["bpf_get_cgroup_classid"]; ok {
					foundClassid = true
				}
				if _, ok := h["bpf_get_netns_cookie"]; ok {
					foundCookie = true
				}
			}
			if !foundClassid || !foundCookie {
				if strict {
					log.Fatalf("BPF kube-proxy replacement under MKE with --%s needs kernel 5.7 or newer", option.EnableMKE)
				} else {
					option.Config.EnableHostServicesTCP = false
					option.Config.EnableHostServicesUDP = false
					log.Warnf("Disabling host reachable services under MKE with --%s. Needs kernel 5.7 or newer.", option.EnableMKE)
				}
			}
		}

		option.Config.EnableHostServicesPeer = true
		if option.Config.EnableIPv4 {
			if err := bpf.TestDummyProg(bpf.ProgTypeCgroupSockAddr, bpf.BPF_CGROUP_INET4_GETPEERNAME); err != nil {
@@ -488,6 +515,12 @@ func finishKubeProxyReplacementInit(isKubeProxyReplacementStrict bool) error {
	// | After this point, BPF NodePort should not be disabled |
	// +-------------------------------------------------------+

	// For MKE, we only need to change/extend the socket LB behavior in case
	// of kube-proxy replacement. Otherwise, nothing else is needed.
	if option.Config.EnableMKE && option.Config.EnableHostReachableServices {
		markHostExtension()
	}

	if !option.Config.EnableHostLegacyRouting {
		msg := ""
		switch {
@@ -597,6 +630,66 @@ func disableNodePort() {
	option.Config.EnableHostLegacyRouting = true
}

// markHostExtension tells the socket LB that MKE managed containers belong
// to the "hostns" as well despite them residing in their own netns. We use
// net_cls as a marker.
func markHostExtension() {
	prefix := option.Config.CgroupPathMKE
	if prefix == "" {
		mountInfos, err := mountinfo.GetMountInfo()
		if err != nil {
			log.WithError(err).Fatal("Cannot retrieve mount infos for MKE")
		}
		for _, mountInfo := range mountInfos {
			if mountInfo.FilesystemType == "cgroup" &&
				strings.Contains(mountInfo.SuperOptions, "net_cls") {
				// There can be multiple entries with the same mountpoint.
				// Assert that there is no conflict.
				if prefix != "" && prefix != mountInfo.MountPoint {
					log.Fatalf("Multiple cgroup v1 net_cls mounts: %s, %s",
						prefix, mountInfo.MountPoint)
				}
				prefix = mountInfo.MountPoint
			}
		}
	}
	if prefix == "" {
		log.Fatal("Cannot retrieve cgroup v1 net_cls mount info for MKE")
	}
	log.WithField(logfields.Path, prefix).Info("Found cgroup v1 net_cls mount on MKE")
	err := filepath.Walk(prefix,
		func(path string, info os.FileInfo, err error) error {
			if err != nil {
				return err
			}
			if !info.IsDir() || strings.Contains(path, "kubepods") || path == prefix {
				return nil
			}
			log.WithField(logfields.Path, path).Info("Marking as MKE host extension")
			f, err := os.OpenFile(path+"/net_cls.classid", os.O_RDWR, 0644)
			if err != nil {
				return err
			}
			defer f.Close()
			valBytes, err := io.ReadAll(f)
			if err != nil {
				return err
			}
			class, err := strconv.Atoi(string(valBytes[:len(valBytes)-1]))
			if err != nil {
				return err
			}
			if class != 0 && class != option.HostExtensionMKE {
				return errors.New("net_cls.classid already in use")
			}
			_, err = io.WriteString(f, fmt.Sprintf("%d", option.HostExtensionMKE))
			return err
		})
	if err != nil {
		log.WithError(err).Fatal("Cannot mark MKE-related container")
	}
}
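The auto-detection branch of markHostExtension() relies on Cilium's internal mountinfo package. As a rough standalone illustration of the same idea, the sketch below parses text in /proc/self/mountinfo format directly. `findNetClsMount` and `hasOption` are hypothetical helpers, not part of Cilium, and the sample lines in `main` are made up. Unlike the commit, which uses strings.Contains on the super options, this sketch matches `net_cls` as an exact comma-separated option.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// findNetClsMount scans text in /proc/self/mountinfo format and returns the
// mount point of the cgroup v1 hierarchy that carries the net_cls controller.
// It fails if no such mount exists or if two different mount points match.
func findNetClsMount(mountinfo string) (string, error) {
	found := ""
	for _, line := range strings.Split(mountinfo, "\n") {
		// Fields before " - " describe the mount; the fields after it are
		// filesystem type, mount source and super options.
		parts := strings.SplitN(line, " - ", 2)
		if len(parts) != 2 {
			continue
		}
		pre, post := strings.Fields(parts[0]), strings.Fields(parts[1])
		if len(pre) < 5 || len(post) < 3 || post[0] != "cgroup" {
			continue
		}
		if !hasOption(post[2], "net_cls") {
			continue
		}
		mountPoint := pre[4] // fifth field is the mount point
		if found != "" && found != mountPoint {
			return "", fmt.Errorf("multiple cgroup v1 net_cls mounts: %s, %s", found, mountPoint)
		}
		found = mountPoint
	}
	if found == "" {
		return "", errors.New("no cgroup v1 net_cls mount found")
	}
	return found, nil
}

// hasOption reports whether the comma-separated option list contains want.
func hasOption(opts, want string) bool {
	for _, o := range strings.Split(opts, ",") {
		if o == want {
			return true
		}
	}
	return false
}

func main() {
	// Made-up sample resembling a cgroup v1 host.
	sample := "33 25 0:29 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid shared:14 - cgroup cgroup rw,net_cls,net_prio\n" +
		"34 25 0:30 / /sys/fs/cgroup/memory rw,nosuid shared:15 - cgroup cgroup rw,memory\n"
	mountPoint, err := findNetClsMount(sample)
	fmt.Println(mountPoint, err)
}
```

Passing --mke-cgroup-mount explicitly skips this detection step entirely, which is the escape hatch when a node has an unusual cgroup layout.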

// checkNodePortAndEphemeralPortRanges checks whether the ephemeral port range
// does not clash with the nodeport range to prevent the BPF nodeport from
// hijacking an existing connection on the local host which source port is
4 changes: 4 additions & 0 deletions pkg/datapath/linux/config/config.go
@@ -266,6 +266,10 @@ func (h *HeaderfileWriter) WriteNodeConfig(w io.Writer, cfg *datapath.LocalNodeC
	if option.Config.EnableHealthDatapath {
		cDefinesMap["ENABLE_HEALTH_CHECK"] = "1"
	}
	if option.Config.EnableMKE && option.Config.EnableHostReachableServices {
		cDefinesMap["ENABLE_MKE"] = "1"
		cDefinesMap["MKE_HOST"] = fmt.Sprintf("%d", option.HostExtensionMKE)
	}
	if option.Config.EnableRecorder {
		cDefinesMap["ENABLE_CAPTURE"] = "1"
		if option.Config.EnableIPv4 {
14 changes: 14 additions & 0 deletions pkg/option/config.go
@@ -309,6 +309,12 @@ const (
	// EnableLocalRedirectPolicy enables support for local redirect policy
	EnableLocalRedirectPolicy = "enable-local-redirect-policy"

	// EnableMKE enables MKE specific 'chaining' for kube-proxy replacement
	EnableMKE = "enable-mke"

	// CgroupPathMKE points to the cgroupv1 net_cls mount instance
	CgroupPathMKE = "mke-cgroup-mount"

	// LibDir enables the directory path to store runtime build environment
	LibDir = "lib-dir"

@@ -1729,6 +1735,12 @@ type DaemonConfig struct {
	// EnableRecorder enables the datapath pcap recorder
	EnableRecorder bool

	// EnableMKE enables MKE specific 'chaining' for kube-proxy replacement
	EnableMKE bool

	// CgroupPathMKE points to the cgroupv1 net_cls mount instance
	CgroupPathMKE string

	// KubeProxyReplacementHealthzBindAddr is the KubeProxyReplacement healthz server bind addr
	KubeProxyReplacementHealthzBindAddr string

@@ -2443,6 +2455,8 @@ func (c *DaemonConfig) Populate() {
	c.EnableSessionAffinity = viper.GetBool(EnableSessionAffinity)
	c.EnableBandwidthManager = viper.GetBool(EnableBandwidthManager)
	c.EnableRecorder = viper.GetBool(EnableRecorder)
	c.EnableMKE = viper.GetBool(EnableMKE)
	c.CgroupPathMKE = viper.GetString(CgroupPathMKE)
	c.EnableHostFirewall = viper.GetBool(EnableHostFirewall)
	c.EnableLocalRedirectPolicy = viper.GetBool(EnableLocalRedirectPolicy)
	c.EncryptInterface = viper.GetStringSlice(EncryptInterface)
2 changes: 2 additions & 0 deletions pkg/option/constants.go
@@ -36,3 +36,5 @@ const (
	ClockSourceKtime BPFClockSource = iota
	ClockSourceJiffies
)

const HostExtensionMKE = 0x1bda7a
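
For readers unfamiliar with net_cls: the kernel treats a classid as a 32-bit value of the form 0xAAAABBBB, where AAAA is the major and BBBB the minor handle, and net_cls.classid shows it in decimal when read. The sketch below is for illustration only and is not part of this commit; `splitClassID` is a hypothetical helper. It shows what the marker looks like in both forms.

```go
package main

import "fmt"

// hostExtensionMKE mirrors option.HostExtensionMKE from this commit.
const hostExtensionMKE uint32 = 0x1bda7a

// splitClassID splits a net_cls classid of the form 0xAAAABBBB into its
// major (AAAA) and minor (BBBB) handle.
func splitClassID(id uint32) (major, minor uint16) {
	return uint16(id >> 16), uint16(id & 0xffff)
}

func main() {
	major, minor := splitClassID(hostExtensionMKE)
	// The agent writes the decimal form into net_cls.classid.
	fmt.Printf("classid %d = %#x:%#x\n", hostExtensionMKE, major, minor)
	// prints "classid 1825402 = 0x1b:0xda7a"
}
```

Any other application on the node using traffic class 0x1b:0xda7a would collide with the marker, which is limitation ii) from the commit message.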
