BPF MASQ for veth, ip-masq-agent and multi-dev NodePort #10878

Closed
26 commits
9599c4e
datapath: Remove ipvlan MASQ and dead code
brb Feb 21, 2020
b428ac3
daemon: Add --enable-bpf-masquerade flag
brb Feb 24, 2020
f6eb947
datapath: Enable BPF MASQ for veth mode in IPv4
brb Feb 20, 2020
adcad29
datapath: Fix BPF MASQ for veth when DSR is enabled
brb Feb 25, 2020
0f1e039
helm: Add bpfMasquerade param
brb Feb 25, 2020
b7d38f7
probe: Set RLIMIT_MEMLOCK=RLIM_INFINITY for HaveFullLPM
brb Mar 19, 2020
8663a12
daemon: Add ip-masq-agent flags
brb Feb 28, 2020
ec4a1a7
maps/ipmasq: Add LPM CIDR map for ip-masq-agent
brb Feb 28, 2020
cbc4c1b
ipmasq: Add ip-masq-agent package
brb Feb 28, 2020
ca66961
ipmasq: Configure via YAML instead of JSON
brb Mar 3, 2020
8f40a64
datapath: Add ip-masq-agent implementation
brb Apr 24, 2020
738b777
cli: Add cilium bpf ipmasq list
brb Mar 19, 2020
7067619
helm: Add ipMasqAgent option
brb Mar 3, 2020
8c0f828
datapath: Replace ct_entry.backend_id with ifindex
brb Mar 6, 2020
4cc6da7
common: Move GoArray2C to common
brb Mar 18, 2020
1b7b012
daemon, datapath: Attach bpf_netdev.o to multiple netdevs
brb Apr 24, 2020
3bfb3cf
datapath: Use direct routing iface for fwding NodePort requests
brb Apr 24, 2020
143e91d
cli: Cleanup multiple bpf_netdev.o programs upon cleanup
brb Mar 18, 2020
bede087
daemon: Check NodePort --device names and ifindices
brb Mar 18, 2020
4a48e62
cli: Add NodePort devices to cilium status
brb Mar 18, 2020
c76dd35
daemon: Warn about rp_filter=1 in DSR mode
brb Mar 26, 2020
62a9dfe
test: Add secondary private network to CI
brb Mar 18, 2020
9bd896f
test: Add BPF and ip-masq-agent integration tests
brb Feb 27, 2020
851c6cc
test: Add NodePort tests for multiple devices
brb Mar 18, 2020
703147e
test: Extend testNodePort
brb Mar 19, 2020
023eb20
test: Allow to specify sport in testDSR()
brb Apr 9, 2020
6 changes: 5 additions & 1 deletion Documentation/cmdref/cilium-agent.md
@@ -46,20 +46,22 @@ cilium-agent [flags]
--datapath-mode string Datapath mode name (default "veth")
-D, --debug Enable debugging mode
--debug-verbose strings List of enabled verbose debug groups
- -d, --device string Device facing cluster/external network for direct L3 (non-overlay mode) (default "undefined")
+ -d, --device strings List of devices facing cluster/external network for attaching bpf_netdev (first device should be one used for direct routing if tunneling is disabled)
Review comment (Member):

Maybe an aside from this PR as a whole, but where does this (first device should be one used for direct routing if tunneling is disabled) restriction come from? Why can't we just detect the default route to determine which device that is?

Reply (Member Author):

Not a default route, but a device with NodeIP. Please see #10878 (comment).

The restriction exists because we cannot forward a packet to a remote node via the device it arrived on if that device does not have a podCIDR route.

--disable-cnp-status-updates Do not send CNP NodeStatus updates to the Kubernetes api-server (recommended to run with "cnp-node-status-gc=false" in cilium-operator)
--disable-conntrack Disable connection tracking
--disable-endpoint-crd Disable use of CiliumEndpoint CRD
--disable-iptables-feeder-rules strings Chains to ignore when installing feeder rules.
--egress-masquerade-interfaces string Limit egress masquerading to interface selector
--enable-auto-protect-node-port-range Append NodePort range to net.ipv4.ip_local_reserved_ports if it overlaps with ephemeral port range (net.ipv4.ip_local_port_range) (default true)
+ --enable-bpf-masquerade Masquerade packets from endpoints leaving the host with BPF instead of iptables
--enable-endpoint-health-checking Enable connectivity health checking between virtual endpoints (default true)
--enable-endpoint-routes Use per endpoint routes instead of routing via cilium_host
--enable-external-ips Enable k8s service externalIPs feature (requires enabling enable-node-port) (default true)
--enable-health-checking Enable connectivity health checking (default true)
--enable-host-port Enable k8s hostPort mapping feature (requires enabling enable-node-port) (default true)
--enable-host-reachable-services Enable reachability of services for host applications (beta)
--enable-hubble Enable hubble server
+ --enable-ip-masq-agent Enable BPF ip-masq-agent
--enable-ipsec Enable IPSec support
--enable-ipv4 Enable IPv4 support (default true)
--enable-ipv4-fragment-tracking Enable IPv4 fragments tracking for L4-based lookups (default true)
@@ -102,6 +104,8 @@ cilium-agent [flags]
--identity-change-grace-period duration Time to wait before using new identity on endpoint identity change (default 5s)
--install-iptables-rules Install base iptables rules for cilium to mainly interact with kube-proxy (and masquerading) (default true)
--ip-allocation-timeout duration Time after which an incomplete CIDR allocation is considered failed (default 2m0s)
+ --ip-masq-agent-config-path string ip-masq-agent configuration file path (default "/etc/config/ip-masq-agent")
+ --ip-masq-agent-sync-period duration ip-masq-agent configuration file synchronization period (default 1m0s)
--ipam string Backend to use for IPAM (default "hostscope-legacy")
--ipsec-key-file string Path to IPSec key file
--ipv4-node string IPv4 address of node (default "auto")
1 change: 1 addition & 0 deletions Documentation/cmdref/cilium_bpf.md
@@ -28,6 +28,7 @@ Direct access to local BPF maps
* [cilium bpf ct](../cilium_bpf_ct) - Connection tracking tables
* [cilium bpf endpoint](../cilium_bpf_endpoint) - Local endpoint map
* [cilium bpf ipcache](../cilium_bpf_ipcache) - Manage the IPCache mappings for IP/CIDR <-> Identity
+ * [cilium bpf ipmasq](../cilium_bpf_ipmasq) - ip-masq-agent CIDRs
* [cilium bpf lb](../cilium_bpf_lb) - Load-balancing configuration
* [cilium bpf metrics](../cilium_bpf_metrics) - BPF datapath traffic metrics
* [cilium bpf nat](../cilium_bpf_nat) - NAT mapping tables
29 changes: 29 additions & 0 deletions Documentation/cmdref/cilium_bpf_ipmasq.md
@@ -0,0 +1,29 @@
<!-- This file was autogenerated via cilium cmdref, do not edit manually-->

## cilium bpf ipmasq

ip-masq-agent CIDRs

### Synopsis

ip-masq-agent CIDRs

### Options

```
-h, --help help for ipmasq
```

### Options inherited from parent commands

```
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
```

### SEE ALSO

* [cilium bpf](../cilium_bpf) - Direct access to local BPF maps
* [cilium bpf ipmasq list](../cilium_bpf_ipmasq_list) - List ip-masq-agent CIDRs

33 changes: 33 additions & 0 deletions Documentation/cmdref/cilium_bpf_ipmasq_list.md
@@ -0,0 +1,33 @@
<!-- This file was autogenerated via cilium cmdref, do not edit manually-->

## cilium bpf ipmasq list

List ip-masq-agent CIDRs

### Synopsis

List ip-masq-agent CIDRs. Packets sent from pods to IPs from the CIDRs avoid masquerading

```
cilium bpf ipmasq list [flags]
```

### Options

```
-h, --help help for list
-o, --output string json| jsonpath='{}'
```

### Options inherited from parent commands

```
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
```

### SEE ALSO

* [cilium bpf ipmasq](../cilium_bpf_ipmasq) - ip-masq-agent CIDRs

22 changes: 1 addition & 21 deletions Documentation/gettingstarted/ipvlan.rst
@@ -73,13 +73,7 @@ connect to kube-apiserver.

Masquerading with iptables in L3-only mode is not possible since netfilter
hooks are bypassed in the kernel in this mode, hence L3S (symmetric) had
- to be introduced in the kernel at the cost of performance. However, Cilium
- supports its own BPF-based masquerading which does not rely in any way on
- iptables masquerading. If the ``global.installIptablesRules`` parameter is set
- to ``"false"`` and ``global.masquerade`` set to ``"true"``, then Cilium will
- use the more efficient BPF-based masquerading where ipvlan can remain in
- L3 mode as well (instead of L3S). A Linux kernel v4.16 or higher would be
- required for BPF-based masquerading.
+ to be introduced in the kernel at the cost of performance.

Example ConfigMap extract for ipvlan in pure L3 mode:

@@ -107,20 +101,6 @@ masquerading all traffic leaving the node:
--set global.masquerade=true \\
--set global.autoDirectNodeRoutes=true

- Example ConfigMap extract for ipvlan in L3 mode with more efficient
- BPF-based masquerading instead of iptables-based:
-
- .. parsed-literal::
-
- helm install cilium |CHART_RELEASE| \\
- --namespace kube-system \\
- --set global.datapathMode=ipvlan \\
- --set global.ipvlan.masterDevice=bond0 \\
- --set global.tunnel=disabled \\
- --set global.masquerade=true \\
- --set global.installIptablesRules=false \\
- --set global.autoDirectNodeRoutes=true
-
Review comment (Member):

Do you plan to re-add the documentation here on how to use ipvlan in L3-only mode with the ip masq agent?

Reply (Member Author, @brb, Apr 14, 2020):

Yep, but only after we have tested that the ipvlan masq actually works (in a follow-up).

Verify that it has come up correctly:

.. parsed-literal::
4 changes: 2 additions & 2 deletions Documentation/gettingstarted/kubeproxy-free.rst
@@ -132,7 +132,7 @@ the Cilium agent is running in the desired mode:
.. parsed-literal::

kubectl exec -it -n kube-system cilium-fmh8d -- cilium status | grep KubeProxyReplacement
- KubeProxyReplacement: Strict [NodePort (SNAT, 30000-32767, XDP: NONE), HostPort, ExternalIPs, HostReachableServices (TCP, UDP)]
+ KubeProxyReplacement: Strict (eth0) [NodePort (SNAT, 30000-32767, XDP: NONE), HostPort, ExternalIPs, HostReachableServices (TCP, UDP)]

As a next, optional step, we deploy nginx pods, create a new NodePort service and
validate that Cilium installed the service correctly.
@@ -589,7 +589,7 @@ The current Cilium kube-proxy replacement mode can also be introspected through
.. parsed-literal::

kubectl exec -it -n kube-system cilium-xxxxx -- cilium status | grep KubeProxyReplacement
- KubeProxyReplacement: Strict [NodePort (SNAT, 30000-32767, XDP: NONE), HostPort, ExternalIPs, HostReachableServices (TCP, UDP)]
+ KubeProxyReplacement: Strict (eth0) [NodePort (SNAT, 30000-32767, XDP: NONE), HostPort, ExternalIPs, HostReachableServices (TCP, UDP)]

Limitations
###########
2 changes: 2 additions & 0 deletions Documentation/spelling_wordlist.txt
@@ -324,6 +324,7 @@ integrations
io
ip
ipcache
+ ipmasq
iproute
ipsec
iptables
@@ -396,6 +397,7 @@ lwt
macOS
matchLabels
matchPattern
+ masq
mc
mediabot
memcache
1 change: 1 addition & 0 deletions Makefile
@@ -324,6 +324,7 @@ generate-k8s-api:
$(call generate_k8s_api_deepcopy,github.com/cilium/cilium/pkg,"k8s:types")
$(call generate_k8s_api_deepcopy,github.com/cilium/cilium/pkg,"maps:policymap")
$(call generate_k8s_api_deepcopy,github.com/cilium/cilium/pkg,"maps:ipcache")
+ $(call generate_k8s_api_deepcopy,github.com/cilium/cilium/pkg,"maps:ipmasq")
$(call generate_k8s_api_deepcopy,github.com/cilium/cilium/pkg,"maps:lxcmap")
$(call generate_k8s_api_deepcopy,github.com/cilium/cilium/pkg,"maps:tunnel")
$(call generate_k8s_api_deepcopy,github.com/cilium/cilium/pkg,"maps:encrypt")
3 changes: 3 additions & 0 deletions api/v1/models/kube_proxy_replacement.go

Some generated files are not rendered by default.

4 changes: 4 additions & 0 deletions api/v1/openapi.yaml
@@ -1672,6 +1672,10 @@ definitions:
- Strict
- Probe
- Partial
+ devices:
+   type: array
+   items:
+     type: string
features:
type: object
properties:
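
The `devices` array added to the API definition above lines up with the `Strict (eth0)` status output shown in the kubeproxy-free.rst change. As a rough Go sketch of how such a status string could be assembled: the `kubeProxyReplacement` struct is a trimmed, hypothetical stand-in for the generated model, `formatStatus` is a made-up name, and the `", "` separator for several devices is an assumption, since the diff only shows a single device.

```go
package main

import (
	"fmt"
	"strings"
)

// kubeProxyReplacement holds only the two fields used here.
// Hypothetical, trimmed stand-in for the generated API model.
type kubeProxyReplacement struct {
	Mode    string
	Devices []string
}

// formatStatus renders the mode followed by the device list in parentheses.
// Assumption: multiple devices are joined with ", ".
func formatStatus(kpr kubeProxyReplacement) string {
	if len(kpr.Devices) == 0 {
		return kpr.Mode
	}
	return fmt.Sprintf("%s (%s)", kpr.Mode, strings.Join(kpr.Devices, ", "))
}

func main() {
	fmt.Println(formatStatus(kubeProxyReplacement{Mode: "Strict", Devices: []string{"eth0"}}))
}
```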
12 changes: 12 additions & 0 deletions api/v1/server/embedded_spec.go

Some generated files are not rendered by default.

2 changes: 2 additions & 0 deletions bpf/bpf_lxc.c
@@ -895,6 +895,7 @@ ipv6_policy(struct __ctx_buff *ctx, int ifindex, __u32 src_label, __u8 *reason,

ct_state_new.src_sec_id = src_label;
ct_state_new.node_port = ct_state.node_port;
+ ct_state_new.ifindex = ct_state.ifindex;
ret = ct_create6(get_ct_map6(&tuple), &CT_MAP_ANY6, &tuple, ctx, CT_INGRESS,
&ct_state_new, verdict > 0);
if (IS_ERR(ret))
@@ -1110,6 +1111,7 @@ ipv4_policy(struct __ctx_buff *ctx, int ifindex, __u32 src_label, __u8 *reason,

ct_state_new.src_sec_id = src_label;
ct_state_new.node_port = ct_state.node_port;
+ ct_state_new.ifindex = ct_state.ifindex;
ret = ct_create4(get_ct_map4(&tuple), &CT_MAP_ANY4, &tuple, ctx, CT_INGRESS,
&ct_state_new, verdict > 0);
if (IS_ERR(ret))
23 changes: 3 additions & 20 deletions bpf/bpf_netdev.c
@@ -449,7 +449,8 @@ int tail_handle_ipv4(struct __ctx_buff *ctx)

ret = handle_ipv4(ctx, proxy_identity);
if (IS_ERR(ret))
- return send_drop_notify_error(ctx, proxy_identity, ret, CTX_ACT_DROP, METRIC_INGRESS);
+ return send_drop_notify_error(ctx, proxy_identity,
+ ret, CTX_ACT_DROP, METRIC_INGRESS);
return ret;
}

@@ -694,15 +695,6 @@ int from_netdev(struct __ctx_buff *ctx)
/* Pass unknown traffic to the stack */
return CTX_ACT_OK;

- #ifdef ENABLE_MASQUERADE
- cilium_dbg_capture(ctx, DBG_CAPTURE_SNAT_PRE, ctx_get_ifindex(ctx));
- ret = snat_process(ctx, BPF_PKT_DIR);
- if (ret != CTX_ACT_OK) {
- return ret;
- }
- cilium_dbg_capture(ctx, DBG_CAPTURE_SNAT_POST, ctx_get_ifindex(ctx));
- #endif /* ENABLE_MASQUERADE */
-
return do_netdev(ctx, proto);
}

@@ -721,16 +713,7 @@ int to_netdev(struct __ctx_buff *ctx __maybe_unused)
ret = nodeport_nat_fwd(ctx, false);
if (IS_ERR(ret))
return send_drop_notify_error(ctx, 0, ret, CTX_ACT_DROP, METRIC_EGRESS);
- #elif defined(ENABLE_MASQUERADE)
- __u16 proto;
- if (!validate_ethertype(ctx, &proto))
- /* Pass unknown traffic to the stack */
- return CTX_ACT_OK;
- cilium_dbg_capture(ctx, DBG_CAPTURE_SNAT_PRE, ctx_get_ifindex(ctx));
- ret = snat_process(ctx, BPF_PKT_DIR);
- if (!ret)
- cilium_dbg_capture(ctx, DBG_CAPTURE_SNAT_POST, ctx_get_ifindex(ctx));
- #endif /* ENABLE_MASQUERADE */
+ #endif
return ret;
}

15 changes: 0 additions & 15 deletions bpf/bpf_xdp.c
@@ -23,21 +23,6 @@
# undef CIDR6_LPM_PREFILTER
#endif

- struct lpm_v4_key {
- struct bpf_lpm_trie_key lpm;
- __u8 addr[4];
- };
-
- struct lpm_v6_key {
- struct bpf_lpm_trie_key lpm;
- __u8 addr[16];
- };
-
- struct lpm_val {
- /* Just dummy for now. */
- __u8 flags;
- };
-
#ifdef CIDR4_FILTER
struct bpf_elf_map __section_maps CIDR4_HMAP_NAME = {
.type = BPF_MAP_TYPE_HASH,
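
For readers unfamiliar with the `lpm_v4_key` layout that this hunk removes from bpf_xdp.c, the Go sketch below builds the equivalent 8-byte key from user space: a 4-byte prefix length (the `struct bpf_lpm_trie_key` header) followed by the 4 address bytes. `lpmV4Key` is a hypothetical helper, not part of Cilium, and it assumes a little-endian host, because the kernel reads the prefix length in host byte order.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// lpmV4Key encodes an IPv4 CIDR as an LPM trie key:
// bytes 0..3 hold the prefix length, bytes 4..7 the network address.
func lpmV4Key(cidr string) ([8]byte, error) {
	var key [8]byte
	_, n, err := net.ParseCIDR(cidr)
	if err != nil {
		return key, err
	}
	ip4 := n.IP.To4()
	if ip4 == nil {
		return key, fmt.Errorf("%q is not an IPv4 CIDR", cidr)
	}
	ones, _ := n.Mask.Size()
	// Assumption: little-endian host (prefixlen is stored in host byte order).
	binary.LittleEndian.PutUint32(key[0:4], uint32(ones))
	// The address bytes stay in network (big-endian) order.
	copy(key[4:8], ip4)
	return key, nil
}

func main() {
	key, err := lpmV4Key("10.0.0.0/8")
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", key[:])
}
```

An IPv6 key would follow the same pattern with 16 address bytes, matching the removed `lpm_v6_key`.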