
bpf: Inter-cluster SNAT with ClusterIP global service #24212

Merged
25 commits
f1ed5fd
Small bug fixes for per-cluster CT and SNAT map
YutaroHayakawa Feb 18, 2023
6b73dc6
bpf: Define UINT8_MAX
YutaroHayakawa Mar 22, 2023
38c0f90
bpf: Change ipcache_lookup4/6 to accept __u32 cluster_id
YutaroHayakawa Mar 22, 2023
a48e4ec
bpf: Introduce CLUSTER_ID and IPV4_INTER_CLUSTER_SNAT macros
YutaroHayakawa Mar 17, 2023
91014c3
bpf: Add a helper function to extract ClusterID from identity
YutaroHayakawa Mar 17, 2023
b609fb0
bpf: Add helper functions to transfer ClusterID with mark
YutaroHayakawa Mar 17, 2023
0919b29
bpf: Add helper functions to get per-cluster CT
YutaroHayakawa Jan 27, 2023
5a8de7d
bpf: Add helper functions to get per-cluster SNAT maps
YutaroHayakawa Jan 27, 2023
b21fe53
bpf,hubble: Introduce DROP_CT_NO_MAP_FOUND
YutaroHayakawa Mar 17, 2023
2a3e567
bpf,hubble: Introduce DROP_SNAT_NO_MAP_FOUND
YutaroHayakawa Mar 17, 2023
42cf67a
bpf,hubble: Introduce DROP_INVALID_CLUSTER_ID
YutaroHayakawa Mar 22, 2023
0f540b2
bpf: Introduce from_tunnel field to ctmap value
YutaroHayakawa Feb 22, 2023
3b451b4
bpf: Always enable per-packet LB for cluster-aware addressing
YutaroHayakawa Mar 17, 2023
36f0789
bpf: Client/Egress Obtain service backend's ClusterID
YutaroHayakawa Mar 17, 2023
e2a039a
bpf: Client/Egress Lookup/Create per-cluster CT map entry on egress
YutaroHayakawa Mar 17, 2023
7d5c36e
bpf: Client/Egress Cluster-aware egress network policy and tunnel red…
YutaroHayakawa Mar 17, 2023
795e3e1
bpf: Client/Egress Inter-cluster SNAT egress
YutaroHayakawa Mar 17, 2023
0921bd3
bpf: Server/Ingress Request path of the inter-cluster communication
YutaroHayakawa Mar 16, 2023
6be195e
bpf: Server/Egress Reply path of the inter-cluster communication
YutaroHayakawa Mar 17, 2023
fa37b58
bpf: Client/Ingress Inter-cluster SNAT ingress
YutaroHayakawa Mar 17, 2023
c1f2237
bpf: Client/Ingress Reply path of the client cluster lxc
YutaroHayakawa Feb 22, 2023
b5996c2
bpf,test: Initialize per-cluster CT/SNAT for test
YutaroHayakawa Mar 16, 2023
32fd6aa
bpf,test: Add BPF unit tests for inter-cluster SNAT communication
YutaroHayakawa Mar 4, 2023
75802ca
bpf,test: Add complexity-test scenarios for inter-cluster SNAT
YutaroHayakawa Mar 12, 2023
9a274db
bpf,test: Disable coverage reports for some tests
YutaroHayakawa Mar 18, 2023
6 changes: 5 additions & 1 deletion .github/workflows/lint-bpf-checks.yaml
@@ -150,8 +150,12 @@ jobs:
persist-credentials: false
fetch-depth: 0
- name: Run BPF tests with code coverage reporting
env:
# Disable coverage report for these test cases since they are hitting
# https://github.com/cilium/coverbee/issues/7
NOCOVER_PATTERN: "inter_cluster_snat_clusterip.*|session_affinity_test.o|tc_egressgw_redirect.o|tc_egressgw_snat.o|tc_nodeport_lb4_dsr_backend.o|tc_nodeport_lb4_dsr_lb.o|tc_nodeport_lb4_nat_backend.o|tc_nodeport_lb4_nat_lb.o|xdp_nodeport_lb4_dsr_lb.o|xdp_nodeport_lb4_nat_backend.o|xdp_nodeport_lb4_nat_lb.o|xdp_nodeport_lb4_test.o"
run: |
make -C test run_bpf_tests COVER=1 || (echo "Run 'make -C test run_bpf_tests COVER=1' locally to investigate failures"; exit 1)
make -C test run_bpf_tests COVER=1 NOCOVER="$NOCOVER_PATTERN" || (echo "Run 'make -C test run_bpf_tests COVER=1 NOCOVER=\"$NOCOVER_PATTERN\"' locally to investigate failures"; exit 1)
- name: Archive code coverage results
uses: actions/upload-artifact@0b7f8abb1508181956e8e162db84b466c27e18ce # v3.1.2
with:
3 changes: 3 additions & 0 deletions api/v1/flow/README.md
@@ -957,6 +957,9 @@ here.
| NAT46 | 187 | |
| NAT64 | 188 | |
| AUTH_REQUIRED | 189 | |
| CT_NO_MAP_FOUND | 190 | |
| SNAT_NO_MAP_FOUND | 191 | |
| INVALID_CLUSTER_ID | 192 | |



351 changes: 182 additions & 169 deletions api/v1/flow/flow.pb.go

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions api/v1/flow/flow.proto
@@ -375,6 +375,9 @@ enum DropReason {
NAT46 = 187;
NAT64 = 188;
AUTH_REQUIRED = 189;
CT_NO_MAP_FOUND = 190;
[Review comment from a maintainer on this line: @YutaroHayakawa FYI, @ysksuzuki is using the drop number 190 in his PR. Whoever is ready to merge first wins. Let me know when you're ready and I'll give you 🍏 for API.]

SNAT_NO_MAP_FOUND = 191;
INVALID_CLUSTER_ID = 192;
}

enum TrafficDirection {
3 changes: 3 additions & 0 deletions api/v1/observer/observer.pb.go

Some generated files are not rendered by default.

3 changes: 2 additions & 1 deletion bpf/Makefile
@@ -75,7 +75,8 @@ LB_OPTIONS = \
-DENABLE_IPV6:-DENCAP_IFINDEX:-DTUNNEL_MODE:-DENABLE_IPSEC:-DENABLE_NODEPORT:-DENABLE_NODEPORT_ACCELERATION:-DENABLE_SESSION_AFFINITY:-DENABLE_BANDWIDTH_MANAGER:-DENABLE_SRC_RANGE_CHECK:-DLB_SELECTION:-DLB_SELECTION_MAGLEV: \
-DENABLE_IPV6:-DENCAP_IFINDEX:-DTUNNEL_MODE:-DENABLE_IPSEC:-DENABLE_NODEPORT:-DENABLE_NODEPORT_ACCELERATION:-DENABLE_SESSION_AFFINITY:-DENABLE_BANDWIDTH_MANAGER:-DENABLE_SRC_RANGE_CHECK:-DLB_SELECTION:-DLB_SELECTION_MAGLEV:-DENABLE_SOCKET_LB_HOST_ONLY: \
-DENABLE_IPV6:-DENCAP_IFINDEX:-DTUNNEL_MODE:-DENABLE_IPSEC:-DENABLE_NODEPORT:-DENABLE_NODEPORT_ACCELERATION:-DENABLE_SESSION_AFFINITY:-DENABLE_BANDWIDTH_MANAGER:-DENABLE_SRC_RANGE_CHECK:-DLB_SELECTION:-DLB_SELECTION_MAGLEV:-DENABLE_SOCKET_LB_HOST_ONLY:-DENABLE_L7_LB:-DENABLE_SCTP: \
-DENABLE_IPV6:-DENCAP_IFINDEX:-DTUNNEL_MODE:-DENABLE_IPSEC:-DENABLE_NODEPORT:-DENABLE_NODEPORT_ACCELERATION:-DENABLE_SESSION_AFFINITY:-DENABLE_BANDWIDTH_MANAGER:-DENABLE_SRC_RANGE_CHECK:-DLB_SELECTION:-DLB_SELECTION_MAGLEV:-DENABLE_SOCKET_LB_HOST_ONLY:-DENABLE_L7_LB:-DENABLE_SCTP:-DENABLE_VTEP:
-DENABLE_IPV6:-DENCAP_IFINDEX:-DTUNNEL_MODE:-DENABLE_IPSEC:-DENABLE_NODEPORT:-DENABLE_NODEPORT_ACCELERATION:-DENABLE_SESSION_AFFINITY:-DENABLE_BANDWIDTH_MANAGER:-DENABLE_SRC_RANGE_CHECK:-DLB_SELECTION:-DLB_SELECTION_MAGLEV:-DENABLE_SOCKET_LB_HOST_ONLY:-DENABLE_L7_LB:-DENABLE_SCTP:-DENABLE_VTEP: \
-DENABLE_IPV6:-DENCAP_IFINDEX:-DTUNNEL_MODE:-DENABLE_IPSEC:-DENABLE_NODEPORT:-DENABLE_NODEPORT_ACCELERATION:-DENABLE_SESSION_AFFINITY:-DENABLE_BANDWIDTH_MANAGER:-DENABLE_SRC_RANGE_CHECK:-DLB_SELECTION:-DLB_SELECTION_MAGLEV:-DENABLE_SOCKET_LB_HOST_ONLY:-DENABLE_L7_LB:-DENABLE_SCTP:-DENABLE_VTEP:-DENABLE_CLUSTER_AWARE_ADDRESSING:-DENABLE_INTER_CLUSTER_SNAT:

# These options are intended to max out the BPF program complexity. They are
# load-tested as well.
5 changes: 3 additions & 2 deletions bpf/bpf_host.c
@@ -575,7 +575,8 @@ handle_ipv4(struct __ctx_buff *ctx, __u32 secctx,
#endif

return ipv4_local_delivery(ctx, l3_off, secctx, ip4, ep,
METRIC_INGRESS, from_host, false);
METRIC_INGRESS, from_host, false,
false, 0);
}

/* Below remainder is only relevant when traffic is pushed via cilium_host.
@@ -1200,7 +1201,7 @@ int cil_to_netdev(struct __ctx_buff *ctx __maybe_unused)
* handle_nat_fwd tail calls in the majority of cases,
* so control might never return to this program.
*/
ret = handle_nat_fwd(ctx);
ret = handle_nat_fwd(ctx, 0);
if (IS_ERR(ret))
return send_drop_notify_error(ctx, 0, ret,
CTX_ACT_DROP,
95 changes: 79 additions & 16 deletions bpf/bpf_lxc.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
/* Copyright Authors of Cilium */

#include "bpf/types_mapper.h"
#include <bpf/ctx/skb.h>
#include <bpf/api.h>

@@ -65,7 +66,8 @@
#if !defined(ENABLE_SOCKET_LB_FULL) || \
defined(ENABLE_SOCKET_LB_HOST_ONLY) || \
defined(ENABLE_L7_LB) || \
defined(ENABLE_SCTP)
defined(ENABLE_SCTP) || \
defined(ENABLE_CLUSTER_AWARE_ADDRESSING)
# define ENABLE_PER_PACKET_LB 1
#endif

@@ -80,6 +82,7 @@ static __always_inline int __per_packet_lb_svc_xlate_4(void *ctx, struct iphdr *
struct lb4_service *svc;
struct lb4_key key = {};
__u16 proxy_port = 0;
__u32 cluster_id = 0;
int l4_off;
int ret = 0;

@@ -105,13 +108,13 @@
#endif /* ENABLE_L7_LB */
ret = lb4_local(get_ct_map4(&tuple), ctx, ETH_HLEN, l4_off,
&key, &tuple, svc, &ct_state_new,
has_l4_header, false);
has_l4_header, false, &cluster_id);
if (IS_ERR(ret))
return ret;
}
skip_service_lookup:
/* Store state to be picked up on the continuation tail call. */
lb4_ctx_store_state(ctx, &ct_state_new, proxy_port);
lb4_ctx_store_state(ctx, &ct_state_new, proxy_port, cluster_id);
ep_tail_call(ctx, CILIUM_CALL_IPV4_CT_EGRESS);
return DROP_MISSED_TAIL_CALL;
}
@@ -173,6 +176,22 @@ static __always_inline int __per_packet_lb_svc_xlate_6(void *ctx, struct ipv6hdr *
#error "Either ENABLE_ARP_PASSTHROUGH or ENABLE_ARP_RESPONDER can be defined"
#endif

#ifdef ENABLE_IPV4
static __always_inline void *
select_ct_map4(struct __ctx_buff *ctx __maybe_unused, int dir __maybe_unused,
struct ipv4_ct_tuple *tuple)
{
__u32 cluster_id = 0;
#ifdef ENABLE_CLUSTER_AWARE_ADDRESSING
if (dir == CT_EGRESS)
cluster_id = ctx_load_meta(ctx, CB_CLUSTER_ID_EGRESS);
else if (dir == CT_INGRESS)
cluster_id = ctx_load_meta(ctx, CB_CLUSTER_ID_INGRESS);
#endif
return get_cluster_ct_map4(tuple, cluster_id);
}
#endif

#define TAIL_CT_LOOKUP4(ID, NAME, DIR, CONDITION, TARGET_ID, TARGET_NAME) \
declare_tailcall_if(CONDITION, ID) \
int NAME(struct __ctx_buff *ctx) \
@@ -184,6 +203,7 @@ int NAME(struct __ctx_buff *ctx) \
void *data, *data_end; \
struct iphdr *ip4; \
__u32 zero = 0; \
void *map; \
\
ct_state = (struct ct_state *)&ct_buffer.ct_state; \
tuple = (struct ipv4_ct_tuple *)&ct_buffer.tuple; \
@@ -197,7 +217,11 @@ int NAME(struct __ctx_buff *ctx) \
\
l4_off = ETH_HLEN + ipv4_hdrlen(ip4); \
\
ct_buffer.ret = ct_lookup4(get_ct_map4(tuple), tuple, ctx, l4_off, \
map = select_ct_map4(ctx, DIR, tuple); \
if (!map) \
return DROP_CT_NO_MAP_FOUND; \
\
ct_buffer.ret = ct_lookup4(map, tuple, ctx, l4_off, \
DIR, ct_state, &ct_buffer.monitor); \
if (ct_buffer.ret < 0) \
return ct_buffer.ret; \
@@ -780,17 +804,25 @@ static __always_inline int handle_ipv4_from_lxc(struct __ctx_buff *ctx, __u32 *d
enum ct_status ct_status;
__u16 proxy_port = 0;
bool from_l7lb = false;
__u32 cluster_id = 0;
void *ct_map, *ct_related_map = NULL;

if (!revalidate_data(ctx, &data, &data_end, &ip4))
return DROP_INVALID;

has_l4_header = ipv4_has_l4_header(ip4);

#ifdef ENABLE_PER_PACKET_LB
/* Restore ct_state from per packet lb handling in the previous tail call. */
lb4_ctx_restore_state(ctx, &ct_state_new, ip4->daddr, &proxy_port, &cluster_id);
hairpin_flow = ct_state_new.loopback;
#endif /* ENABLE_PER_PACKET_LB */

/* Determine the destination category for policy fallback. */
if (1) {
struct remote_endpoint_info *info;

info = lookup_ip4_remote_endpoint(ip4->daddr, 0);
info = lookup_ip4_remote_endpoint(ip4->daddr, cluster_id);
if (info && info->sec_label) {
*dst_id = info->sec_label;
tunnel_endpoint = info->tunnel_endpoint;
@@ -804,12 +836,6 @@
ip4->daddr, *dst_id);
}

#ifdef ENABLE_PER_PACKET_LB
/* Restore ct_state from per packet lb handling in the previous tail call. */
lb4_ctx_restore_state(ctx, &ct_state_new, ip4->daddr, &proxy_port);
hairpin_flow = ct_state_new.loopback;
#endif /* ENABLE_PER_PACKET_LB */

l4_off = ETH_HLEN + ipv4_hdrlen(ip4);

ct_buffer = map_lookup_elem(&CT_TAIL_CALL_BUFFER4, &zero);
@@ -903,10 +929,19 @@ static __always_inline int handle_ipv4_from_lxc(struct __ctx_buff *ctx, __u32 *d
* reverse NAT.
*/
ct_state_new.src_sec_id = SECLABEL;

ct_map = get_cluster_ct_map4(tuple, cluster_id);
if (!ct_map)
return DROP_CT_NO_MAP_FOUND;

ct_related_map = get_cluster_ct_any_map4(cluster_id);
if (!ct_related_map)
return DROP_CT_NO_MAP_FOUND;

/* We could avoid creating related entries for legacy ClusterIP
* handling here, but it turns out the verifier cannot handle it.
*/
ret = ct_create4(get_ct_map4(tuple), &CT_MAP_ANY4, tuple, ctx,
ret = ct_create4(ct_map, ct_related_map, tuple, ctx,
CT_EGRESS, &ct_state_new, proxy_port > 0, from_l7lb, false);
if (IS_ERR(ret))
return ret;
@@ -1011,7 +1046,8 @@ static __always_inline int handle_ipv4_from_lxc(struct __ctx_buff *ctx, __u32 *d
policy_clear_mark(ctx);
/* If the packet is from L7 LB it is coming from the host */
return ipv4_local_delivery(ctx, ETH_HLEN, SECLABEL, ip4,
ep, METRIC_EGRESS, from_l7lb, hairpin_flow);
ep, METRIC_EGRESS, from_l7lb, hairpin_flow,
false, 0);
}
}

@@ -1088,8 +1124,24 @@ static __always_inline int handle_ipv4_from_lxc(struct __ctx_buff *ctx, __u32 *d
{
struct tunnel_key key = {};

if (cluster_id > UINT8_MAX)
return DROP_INVALID_CLUSTER_ID;

key.ip4 = ip4->daddr & IPV4_MASK;
key.family = ENDPOINT_KEY_IPV4;
key.cluster_id = (__u8)cluster_id;

#ifdef ENABLE_CLUSTER_AWARE_ADDRESSING
/*
* The destination is a remote node, but the connection originated from the
* tunnel. The remote cluster may have performed SNAT for the inter-cluster
* communication and this is the reply; in that case, send it back to the tunnel.
*/
if (ct_status == CT_REPLY) {
if (identity_is_remote_node(*dst_id) && ct_state->from_tunnel)
tunnel_endpoint = ip4->daddr;
}
#endif

ret = encap_and_redirect_lxc(ctx, tunnel_endpoint, encrypt_key,
&key, node_id, SECLABEL, *dst_id,
Expand All @@ -1101,6 +1153,13 @@ static __always_inline int handle_ipv4_from_lxc(struct __ctx_buff *ctx, __u32 *d
*/
else if (ret == CTX_ACT_OK)
goto encrypt_to_stack;
#ifdef ENABLE_CLUSTER_AWARE_ADDRESSING
/* When we redirect, put cluster_id into mark */
else if (ret == CTX_ACT_REDIRECT) {
ctx_set_cluster_id_mark(ctx, cluster_id);
return ret;
}
#endif
/* This is either redirect by encap code or an error has
* occurred either way return and stack will consume ctx.
*/
@@ -1639,7 +1698,7 @@ TAIL_CT_LOOKUP6(CILIUM_CALL_IPV6_CT_INGRESS, tail_ipv6_ct_ingress, CT_INGRESS,
static __always_inline int
ipv4_policy(struct __ctx_buff *ctx, int ifindex, __u32 src_label, enum ct_status *ct_status,
struct ipv4_ct_tuple *tuple_out, __s8 *ext_err, __u16 *proxy_port,
bool from_host __maybe_unused)
bool from_host __maybe_unused, bool from_tunnel)
{
struct ct_state ct_state_on_stack __maybe_unused, *ct_state, ct_state_new = {};
struct ipv4_ct_tuple tuple_on_stack __maybe_unused, *tuple;
@@ -1791,6 +1850,7 @@ ipv4_policy(struct __ctx_buff *ctx, int ifindex, __u32 src_label, enum ct_status

if (ret == CT_NEW) {
ct_state_new.src_sec_id = src_label;
ct_state_new.from_tunnel = from_tunnel;
ret = ct_create4(get_ct_map4(tuple), &CT_MAP_ANY4, tuple, ctx, CT_INGRESS,
&ct_state_new, *proxy_port > 0, false,
verdict == DROP_POLICY_AUTH_REQUIRED);
@@ -1845,16 +1905,19 @@ int tail_ipv4_policy(struct __ctx_buff *ctx)
int ret, ifindex = ctx_load_meta(ctx, CB_IFINDEX);
__u32 src_label = ctx_load_meta(ctx, CB_SRC_LABEL);
bool from_host = ctx_load_meta(ctx, CB_FROM_HOST);
bool from_tunnel = ctx_load_meta(ctx, CB_FROM_TUNNEL);
bool proxy_redirect __maybe_unused = false;
enum ct_status ct_status = 0;
__u16 proxy_port = 0;
__s8 ext_err = 0;

ctx_store_meta(ctx, CB_SRC_LABEL, 0);
ctx_store_meta(ctx, CB_CLUSTER_ID_INGRESS, 0);
ctx_store_meta(ctx, CB_FROM_HOST, 0);
ctx_store_meta(ctx, CB_FROM_TUNNEL, 0);

ret = ipv4_policy(ctx, ifindex, src_label, &ct_status, &tuple,
&ext_err, &proxy_port, from_host);
&ext_err, &proxy_port, from_host, from_tunnel);
if (ret == POLICY_ACT_PROXY_REDIRECT) {
ret = ctx_redirect_to_proxy4(ctx, &tuple, proxy_port, from_host);
proxy_redirect = true;
@@ -1933,7 +1996,7 @@ int tail_ipv4_to_endpoint(struct __ctx_buff *ctx)
ctx_store_meta(ctx, CB_SRC_LABEL, 0);

ret = ipv4_policy(ctx, 0, src_identity, &ct_status, NULL,
&ext_err, &proxy_port, true);
&ext_err, &proxy_port, true, false);
if (ret == POLICY_ACT_PROXY_REDIRECT) {
ret = ctx_redirect_to_proxy_hairpin_ipv4(ctx, proxy_port);
proxy_redirect = true;