BPF MASQ for veth mode and ip-masq-agent #11148

Merged
merged 13 commits into from
Apr 30, 2020
4 changes: 4 additions & 0 deletions Documentation/cmdref/cilium-agent.md
@@ -54,13 +54,15 @@ cilium-agent [flags]
--disable-iptables-feeder-rules strings Chains to ignore when installing feeder rules.
--egress-masquerade-interfaces string Limit egress masquerading to interface selector
--enable-auto-protect-node-port-range Append NodePort range to net.ipv4.ip_local_reserved_ports if it overlaps with ephemeral port range (net.ipv4.ip_local_port_range) (default true)
--enable-bpf-masquerade Masquerade packets from endpoints leaving the host with BPF instead of iptables
--enable-endpoint-health-checking Enable connectivity health checking between virtual endpoints (default true)
--enable-endpoint-routes Use per endpoint routes instead of routing via cilium_host
--enable-external-ips Enable k8s service externalIPs feature (requires enabling enable-node-port) (default true)
--enable-health-checking Enable connectivity health checking (default true)
--enable-host-port Enable k8s hostPort mapping feature (requires enabling enable-node-port) (default true)
--enable-host-reachable-services Enable reachability of services for host applications (beta)
--enable-hubble Enable hubble server
--enable-ip-masq-agent Enable BPF ip-masq-agent
--enable-ipsec Enable IPSec support
--enable-ipv4 Enable IPv4 support (default true)
--enable-ipv4-fragment-tracking Enable IPv4 fragments tracking for L4-based lookups (default true)
@@ -104,6 +106,8 @@ cilium-agent [flags]
--identity-change-grace-period duration Time to wait before using new identity on endpoint identity change (default 5s)
--install-iptables-rules Install base iptables rules for cilium to mainly interact with kube-proxy (and masquerading) (default true)
--ip-allocation-timeout duration Time after which an incomplete CIDR allocation is considered failed (default 2m0s)
--ip-masq-agent-config-path string ip-masq-agent configuration file path (default "/etc/config/ip-masq-agent")
--ip-masq-agent-sync-period duration ip-masq-agent configuration file synchronization period (default 1m0s)
--ipam string Backend to use for IPAM (default "hostscope-legacy")
--ipsec-key-file string Path to IPSec key file
--ipv4-node string IPv4 address of node (default "auto")
1 change: 1 addition & 0 deletions Documentation/cmdref/cilium_bpf.md
@@ -28,6 +28,7 @@ Direct access to local BPF maps
* [cilium bpf ct](../cilium_bpf_ct) - Connection tracking tables
* [cilium bpf endpoint](../cilium_bpf_endpoint) - Local endpoint map
* [cilium bpf ipcache](../cilium_bpf_ipcache) - Manage the IPCache mappings for IP/CIDR <-> Identity
* [cilium bpf ipmasq](../cilium_bpf_ipmasq) - ip-masq-agent CIDRs
* [cilium bpf lb](../cilium_bpf_lb) - Load-balancing configuration
* [cilium bpf metrics](../cilium_bpf_metrics) - BPF datapath traffic metrics
* [cilium bpf nat](../cilium_bpf_nat) - NAT mapping tables
29 changes: 29 additions & 0 deletions Documentation/cmdref/cilium_bpf_ipmasq.md
@@ -0,0 +1,29 @@
<!-- This file was autogenerated via cilium cmdref, do not edit manually-->

## cilium bpf ipmasq

ip-masq-agent CIDRs

### Synopsis

ip-masq-agent CIDRs

### Options

```
-h, --help help for ipmasq
```

### Options inherited from parent commands

```
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
```

### SEE ALSO

* [cilium bpf](../cilium_bpf) - Direct access to local BPF maps
* [cilium bpf ipmasq list](../cilium_bpf_ipmasq_list) - List ip-masq-agent CIDRs

33 changes: 33 additions & 0 deletions Documentation/cmdref/cilium_bpf_ipmasq_list.md
@@ -0,0 +1,33 @@
<!-- This file was autogenerated via cilium cmdref, do not edit manually-->

## cilium bpf ipmasq list

List ip-masq-agent CIDRs

### Synopsis

List ip-masq-agent CIDRs. Packets sent from pods to IPs from these CIDRs avoid masquerading.

```
cilium bpf ipmasq list [flags]
```

### Options

```
-h, --help help for list
-o, --output string json| jsonpath='{}'
```

### Options inherited from parent commands

```
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
```

### SEE ALSO

* [cilium bpf ipmasq](../cilium_bpf_ipmasq) - ip-masq-agent CIDRs

23 changes: 2 additions & 21 deletions Documentation/gettingstarted/ipvlan.rst
@@ -25,6 +25,7 @@ datapath instead of the default veth-based one.
- L7 policy enforcement
- NAT64
- IPVLAN with tunneling
- BPF-based masquerading

.. note::

@@ -73,13 +74,7 @@ connect to kube-apiserver.

Masquerading with iptables in L3-only mode is not possible since netfilter
hooks are bypassed in the kernel in this mode, hence L3S (symmetric) had
to be introduced in the kernel at the cost of performance. However, Cilium
supports its own BPF-based masquerading which does not rely in any way on
iptables masquerading. If the ``global.installIptablesRules`` parameter is set
to ``"false"`` and ``global.masquerade`` set to ``"true"``, then Cilium will
use the more efficient BPF-based masquerading where ipvlan can remain in
L3 mode as well (instead of L3S). A Linux kernel v4.16 or higher would be
required for BPF-based masquerading.
to be introduced in the kernel at the cost of performance.

Example ConfigMap extract for ipvlan in pure L3 mode:

@@ -107,20 +102,6 @@ masquerading all traffic leaving the node:
--set global.masquerade=true \\
--set global.autoDirectNodeRoutes=true

Example ConfigMap extract for ipvlan in L3 mode with more efficient
BPF-based masquerading instead of iptables-based:

.. parsed-literal::

helm install cilium |CHART_RELEASE| \\
--namespace kube-system \\
--set global.datapathMode=ipvlan \\
--set global.ipvlan.masterDevice=bond0 \\
--set global.tunnel=disabled \\
--set global.masquerade=true \\
--set global.installIptablesRules=false \\
--set global.autoDirectNodeRoutes=true

Verify that it has come up correctly:

.. parsed-literal::
2 changes: 2 additions & 0 deletions Documentation/spelling_wordlist.txt
@@ -333,6 +333,7 @@ integrations
io
ip
ipcache
ipmasq
iproute
ipsec
iptables
@@ -405,6 +406,7 @@ lwt
macOS
matchLabels
matchPattern
masq
mc
mediabot
memcache
1 change: 1 addition & 0 deletions Makefile
@@ -324,6 +324,7 @@ generate-k8s-api:
maps:eventsmap\
maps:fragmap\
maps:ipcache\
maps:ipmasq\
maps:lbmap\
maps:lxcmap\
maps:metricsmap\
23 changes: 3 additions & 20 deletions bpf/bpf_netdev.c
@@ -449,7 +449,8 @@ int tail_handle_ipv4(struct __ctx_buff *ctx)

ret = handle_ipv4(ctx, proxy_identity);
if (IS_ERR(ret))
return send_drop_notify_error(ctx, proxy_identity, ret, CTX_ACT_DROP, METRIC_INGRESS);
return send_drop_notify_error(ctx, proxy_identity,
ret, CTX_ACT_DROP, METRIC_INGRESS);
return ret;
}

@@ -694,15 +695,6 @@ int from_netdev(struct __ctx_buff *ctx)
/* Pass unknown traffic to the stack */
return CTX_ACT_OK;

#ifdef ENABLE_MASQUERADE
cilium_dbg_capture(ctx, DBG_CAPTURE_SNAT_PRE, ctx_get_ifindex(ctx));
ret = snat_process(ctx, BPF_PKT_DIR);
if (ret != CTX_ACT_OK) {
return ret;
}
cilium_dbg_capture(ctx, DBG_CAPTURE_SNAT_POST, ctx_get_ifindex(ctx));
#endif /* ENABLE_MASQUERADE */

return do_netdev(ctx, proto);
}

@@ -721,16 +713,7 @@ int to_netdev(struct __ctx_buff *ctx __maybe_unused)
ret = nodeport_nat_fwd(ctx, false);
if (IS_ERR(ret))
return send_drop_notify_error(ctx, 0, ret, CTX_ACT_DROP, METRIC_EGRESS);
#elif defined(ENABLE_MASQUERADE)
__u16 proto;
if (!validate_ethertype(ctx, &proto))
/* Pass unknown traffic to the stack */
return CTX_ACT_OK;
cilium_dbg_capture(ctx, DBG_CAPTURE_SNAT_PRE, ctx_get_ifindex(ctx));
ret = snat_process(ctx, BPF_PKT_DIR);
if (!ret)
cilium_dbg_capture(ctx, DBG_CAPTURE_SNAT_POST, ctx_get_ifindex(ctx));
#endif /* ENABLE_MASQUERADE */
#endif
return ret;
}

15 changes: 0 additions & 15 deletions bpf/bpf_xdp.c
@@ -23,21 +23,6 @@
# undef CIDR6_LPM_PREFILTER
#endif

struct lpm_v4_key {
struct bpf_lpm_trie_key lpm;
__u8 addr[4];
};

struct lpm_v6_key {
struct bpf_lpm_trie_key lpm;
__u8 addr[16];
};

struct lpm_val {
/* Just dummy for now. */
__u8 flags;
};

#ifdef CIDR4_FILTER
struct bpf_elf_map __section_maps CIDR4_HMAP_NAME = {
.type = BPF_MAP_TYPE_HASH,
29 changes: 11 additions & 18 deletions bpf/init.sh
@@ -25,15 +25,14 @@ XDP_DEV=$7
XDP_MODE=$8
MTU=$9
IPSEC=${10}
MASQ=${11}
ENCRYPT_DEV=${12}
HOSTLB=${13}
HOSTLB_UDP=${14}
CGROUP_ROOT=${15}
BPFFS_ROOT=${16}
NODE_PORT=${17}
NODE_PORT_BIND=${18}
MCPU=${19}
ENCRYPT_DEV=${11}
HOSTLB=${12}
HOSTLB_UDP=${13}
CGROUP_ROOT=${14}
BPFFS_ROOT=${15}
NODE_PORT=${16}
NODE_PORT_BIND=${17}
MCPU=${18}

ID_HOST=1
ID_WORLD=2
@@ -336,13 +335,7 @@ function bpf_load()
NODE_MAC=$(ip link show $DEV | grep ether | awk '{print $2}')
NODE_MAC="{.addr=$(mac2array $NODE_MAC)}"

if [ "$WHERE" == "ingress" ]; then
OPTS_DIR="-DBPF_PKT_DIR=1"
else
OPTS_DIR="-DBPF_PKT_DIR=0"
fi

OPTS="${OPTS} ${OPTS_DIR} -DNODE_MAC=${NODE_MAC} -DCALLS_MAP=${CALLS_MAP}"
OPTS="${OPTS} -DNODE_MAC=${NODE_MAC} -DCALLS_MAP=${CALLS_MAP}"
bpf_compile $IN $OUT obj "$OPTS"
tc qdisc replace dev $DEV clsact || true
[ -z "$(tc filter show dev $DEV $WHERE | grep -v 'pref 1 bpf chain 0 $\|pref 1 bpf chain 0 handle 0x1')" ] || tc filter del dev $DEV $WHERE
@@ -367,7 +360,7 @@ function bpf_load_cgroups()
CGRP=$8
BPFMNT=$9

OPTS="${OPTS} ${OPTS_DIR} -DCALLS_MAP=${CALLS_MAP}"
OPTS="${OPTS} -DCALLS_MAP=${CALLS_MAP}"
bpf_compile $IN $OUT obj "$OPTS"

TMP_FILE="$BPFMNT/tc/globals/cilium_cgroups_$WHERE"
@@ -555,7 +548,7 @@ if [ "$MODE" = "direct" ] || [ "$MODE" = "ipvlan" ] || [ "$MODE" = "routed" ] ||
fi

bpf_load $NATIVE_DEV "$COPTS" "ingress" bpf_netdev.c bpf_netdev.o "from-netdev" $CALLS_MAP
if [ "$MASQ" = "true" ] || [ "$NODE_PORT" = "true" ]; then
if [ "$NODE_PORT" = "true" ]; then
bpf_load $NATIVE_DEV "$COPTS" "egress" bpf_netdev.c bpf_netdev.o "to-netdev" $CALLS_MAP
else
bpf_unload $NATIVE_DEV "egress"
15 changes: 15 additions & 0 deletions bpf/lib/common.h
@@ -719,6 +719,21 @@ static __always_inline int redirect_peer(int ifindex __maybe_unused,
#endif /* ENABLE_HOST_REDIRECT */
}

struct lpm_v4_key {
struct bpf_lpm_trie_key lpm;
__u8 addr[4];
};

struct lpm_v6_key {
struct bpf_lpm_trie_key lpm;
__u8 addr[16];
};

struct lpm_val {
/* Just dummy for now. */
__u8 flags;
};

#include "overloadable.h"

#endif /* __LIB_COMMON_H_ */
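
Note on the `lpm_v4_key` layout moved into `bpf/lib/common.h` above: the struct pairs a prefix length with the address bytes, which is the shape a longest-prefix-match (LPM) lookup works with. As a rough userspace illustration of how such a key could be compared against a CIDR entry (for example, a CIDR whose destinations should skip masquerading), here is a minimal sketch. It is not code from this PR. `struct lpm_hdr` is a hypothetical stand-in for the kernel's `struct bpf_lpm_trie_key`, and `lpm_v4_make` and `lpm_v4_match` are names invented for this example. A real datapath would presumably perform a lookup on an LPM trie map instead of comparing keys by hand.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct bpf_lpm_trie_key:
 * only the prefix length matters for this illustration. */
struct lpm_hdr {
	uint32_t prefixlen;
};

/* Same shape as the lpm_v4_key in the diff, using the stand-in header. */
struct lpm_v4_key {
	struct lpm_hdr lpm;
	uint8_t addr[4];
};

/* Build a key from four address bytes (network order) and a prefix length.
 * A CIDR entry uses its own prefix length; a lookup key uses 32. */
static struct lpm_v4_key lpm_v4_make(const uint8_t addr[4], uint32_t prefixlen)
{
	struct lpm_v4_key key;

	memset(&key, 0, sizeof(key));
	key.lpm.prefixlen = prefixlen;
	memcpy(key.addr, addr, sizeof(key.addr));
	return key;
}

/* Return 1 if the first entry->lpm.prefixlen bits of both addresses agree,
 * 0 otherwise. Checks a single entry only; picking the longest of several
 * matching entries is left to the caller (or to the kernel's LPM trie). */
static int lpm_v4_match(const struct lpm_v4_key *entry,
			const struct lpm_v4_key *key)
{
	uint32_t bits, full, rem;

	assert(entry != NULL && key != NULL);

	bits = entry->lpm.prefixlen;
	if (bits > 32)
		return 0;

	full = bits / 8; /* whole bytes covered by the prefix */
	rem = bits % 8;  /* leftover bits in the next byte */

	if (memcmp(entry->addr, key->addr, full) != 0)
		return 0;

	if (rem) {
		uint8_t mask = (uint8_t)(0xff << (8 - rem));

		if ((entry->addr[full] ^ key->addr[full]) & mask)
			return 0;
	}
	return 1;
}
```

With an entry built as `lpm_v4_make(net, 8)` for `10.0.0.0`, a lookup key for `10.1.2.3` (prefix length 32) matches, and one for `192.168.0.1` does not.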