Add DSR IPv4 for NodePort BPF #9473
coveralls commented Oct 22, 2019
8bf72cd to 3a5e6c8
stale bot commented Nov 27, 2019
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
test-me-please
github-actions bot commented Dec 5, 2019
Release note label not set, please set the appropriate release note.
This commit adds direct server return (DSR) support for the NodePort BPF for IPv4 in the direct routing mode.

The main idea of DSR is to avoid SNAT'ing an original request sent to an LB, so that a backend can reply directly to the client (the originator of the request) and the original source IP is preserved.

To achieve this, we introduce a new IPv4 option which stores a NodePort service IP and port number. The option is set by bpf_netdev running on a public iface of an intermediate node which received the original request. Once the option has been set, the request (whose dst IP addr is DNAT'd to the backend IP addr) is forwarded to a node running the backend. After receiving the fwd'd request, bpf_lxc of the backend parses the option, stores the svc addr:port in the NAT table and sets the "dsr" bit in a CT entry.

When sending a reply to the client, bpf_lxc finds out that the "dsr" bit was set, does a lookup in the NAT table to find the mapping, and finally rewrites the source addr and port to the svc addr and port.

The current approach has a shortcoming: if the request size is > (MTU - 8 bytes), the request will be dropped after we append the IPv4 option. To partially solve this, in the case of TCP we set the option only for SYN packets, which should have an empty payload. However, the problem still exists for TCP with SYN cookies and for UDP packets. For those cases, a client needs to decrease its MTU by 8 bytes.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
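To make the 8-byte overhead concrete, here is a minimal sketch of how a service address and port could be packed into an IPv4 option and how the MTU check follows from it. Only the 8-byte size comes from the commit message. The field order, the option type value `DSR_OPT_TYPE`, and all function names are assumptions made for illustration, not the layout used by the datapath.

```python
import ipaddress
import struct
from typing import Tuple

# Assumed values for illustration only; the real option type and layout may differ.
DSR_OPT_TYPE = 0x9A  # hypothetical option type
DSR_OPT_LEN = 8      # the 8 bytes mentioned in the commit message


def encode_dsr_option(svc_ip: str, svc_port: int) -> bytes:
    """Pack type, length, service port and service IPv4 address into 8 bytes."""
    addr = int(ipaddress.IPv4Address(svc_ip))
    # "!" selects network byte order without padding: 1 + 1 + 2 + 4 = 8 bytes.
    return struct.pack("!BBHI", DSR_OPT_TYPE, DSR_OPT_LEN, svc_port, addr)


def decode_dsr_option(opt: bytes) -> Tuple[str, int]:
    """Return (service IP, service port) or raise if this is not our option."""
    opt_type, opt_len, port, addr = struct.unpack("!BBHI", opt)
    if opt_type != DSR_OPT_TYPE or opt_len != DSR_OPT_LEN:
        raise ValueError("not a DSR option")
    return str(ipaddress.IPv4Address(addr)), port


def fits_after_option(packet_len: int, mtu: int) -> bool:
    """True if the packet still fits into the MTU once the option is appended."""
    return packet_len + DSR_OPT_LEN <= mtu
```

With a 1500-byte MTU, a 1492-byte request still fits after the option is appended, while a 1493-byte request does not, which is the drop case described above.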
Signed-off-by: Martynas Pumputis <m@lambda.lt>
For the DSR test case, we need to schedule the test-k8s2 (prev. test-k8s1) pod on k8s2. Otherwise, a request from the client-from-outside Docker container running on k8s1 to the pod via k8s2 (sending via k8s1 does not test the DSR) would be dropped by the kernel due to a routing loop detection mechanism:
1) k8s2 recv: client-from-outside (192.168.10.10) @ k8s1 -> k8s2:NodePort
2) k8s2 fwd to k8s1: client-from-outside (192.168.10.10) @ k8s1 -> Pod @ k8s1
3) k8s1 recv the packet on enp0s8, and has a route "192.168.10.0/24 dev $DOCKER_BRIDGE" <- kernel detects a potential loop.
Signed-off-by: Martynas Pumputis <m@lambda.lt>
In the case of DSR, the following CT and NAT entries are created on a host which runs a service endpoint and to which a client request is forwarded:
* NAT: endpoint -> client (XLATE_SRC aka TUPLE_F_OUT)
* CT: client -> endpoint (TUPLE_F_IN)
Previously, the CT GC ignored NAT entries when a corresponding CT entry was of the TUPLE_F_IN type. Therefore, the DSR NAT entries were never collected.
Signed-off-by: Martynas Pumputis <m@lambda.lt>
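The relationship between the two entries can be illustrated with a toy garbage collector. This is a simplified sketch, not the agent's GC: the dictionaries stand in for the CT and NAT maps, the key shapes and the `gc` function are hypothetical, and the handling of TUPLE_F_OUT entries is reduced to a single lookup.

```python
from typing import Dict, Tuple

# Illustrative direction flags; treat the numeric values as assumptions.
TUPLE_F_OUT = 0
TUPLE_F_IN = 1

Addr = Tuple[str, int]          # (IPv4 address, port)
CTKey = Tuple[Addr, Addr, int]  # (src, dst, direction flag)
NATKey = Tuple[Addr, Addr]      # (src, dst)


def gc(ct: Dict[CTKey, int], nat: Dict[NATKey, Addr], now: int) -> int:
    """Delete expired CT entries together with the NAT entries tied to them.

    ct maps a key to its expiry time; returns the number of CT entries removed.
    """
    removed = 0
    for key in list(ct):
        if ct[key] > now:
            continue  # entry is still alive
        src, dst, flags = key
        del ct[key]
        removed += 1
        if flags == TUPLE_F_IN:
            # DSR case: CT is client -> endpoint, NAT is endpoint -> client,
            # so the NAT key is the CT tuple with src and dst swapped.
            nat.pop((dst, src), None)
        else:
            nat.pop((src, dst), None)
    return removed
```

Skipping the `TUPLE_F_IN` branch reproduces the leak described in the commit message: the CT entry expires, but the endpoint -> client NAT entry stays behind.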
Signed-off-by: Martynas Pumputis <m@lambda.lt>
This is going to be needed by some k8sT/Services.go tests. Signed-off-by: Martynas Pumputis <m@lambda.lt>
Previously, we ran curl from the "client-from-outside" container
in the tests which required sending requests from a third host.
We simulated the third host by running a container
("client-from-outside") in a Docker network which was not managed
by Cilium.
Unfortunately, requests sent to a NodePort service from the container
were handled by bpf_sock.c, which prevented us from testing the NodePort
implementation in bpf_netdev.c.
Fix it by introducing a "real" host and running curl from it.
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
As we are planning to support multiple mutually inclusive modes for NodePort, introduce a flag to store them. Also, re-use the flag for enabling the DSR option. Signed-off-by: Martynas Pumputis <m@lambda.lt>
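A flag that holds several non-exclusive modes is usually a bitmask. The sketch below shows the idea only; the mode names and bit positions are hypothetical and are not taken from the PR.

```python
from enum import IntFlag


class NodePortMode(IntFlag):
    # Hypothetical modes and bit assignments, for illustration only.
    SNAT = 1 << 0
    DSR = 1 << 1


def dsr_enabled(modes: NodePortMode) -> bool:
    """True if the DSR bit is set, regardless of which other modes are on."""
    return bool(modes & NodePortMode.DSR)
```

Because each mode owns one bit, `NodePortMode.SNAT | NodePortMode.DSR` enables both at once, and new modes can be added without changing how existing ones are tested.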
Add revalidate_data() to avoid verifier complaining on the 4.19.57 kernel when loading bpf_lxc.o:
level=warning msg=" R0=inv(id=0,umax_value=2147483647,var_off=(0x0; 0x7fffffff)) R1=inv0 R3=inv0 R4=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R5=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv0 R8=inv(id=0) R9=inv0 R10=fp0,call_-1 fp-104=0 fp-112=0 fp-120=0 fp-152=0 fp-216=0" subsys=datapath-loader
level=warning msg="1007: (7b) *(u64 *)(r10 -152) = r3" subsys=datapath-loader
level=warning msg="1008: (b7) r7 = 0" subsys=datapath-loader
level=warning msg="1009: (71) r1 = *(u8 *)(r8 +0)" subsys=datapath-loader
level=warning msg="R8 invalid mem access 'inv'" subsys=datapath-loader
Signed-off-by: Martynas Pumputis <m@lambda.lt>
(Hopefully) final PTAL:
Anyway, TODO for the follow-up PRs is the following:
CI failed due to
Seemed to be a flake. Re-running.
brb commented Oct 22, 2019 (edited)
This PR adds direct server return (DSR) support for the NodePort BPF for IPv4 in the direct routing mode.

The main idea of DSR is to avoid SNAT'ing an original request sent to an LB, so that a backend can reply directly to the client (the originator of the request) and the original source IP is preserved.

To achieve this, we introduce a new IPv4 option which stores a NodePort service IP and port number. The option is set by `bpf_netdev` running on a public iface of an intermediate node which received the original request. Once the option has been set, the request (whose dst IP addr is DNAT'd to the backend IP addr) is forwarded to a node running the backend. After receiving the fwd'd request, `bpf_lxc` of the backend parses the option, stores the svc addr and port in the NAT table and sets the `dsr` bit in a CT entry.

When sending a reply to the client, `bpf_lxc` of the backend finds out that the `dsr` bit was set, does a lookup in the NAT table to find the mapping, and finally rewrites the source addr and port to the svc addr and port.

The current approach has a shortcoming: if the request size is `> (MTU - 8 bytes)`, the request will be dropped after we append the IPv4 option.

To partially solve this, in the case of TCP we set the option only for `SYN` packets, which should have an empty payload. However, the problem still exists for TCP with SYN cookies and for UDP packets. For those cases, a client needs to decrease its MTU by 8 bytes.

DSR for IPv6 and docs will be submitted in a follow-up PR.
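The request and reply paths described above can be walked through with a small simulation. This is a sketch under stated assumptions, not the BPF datapath: `Packet`, `Backend`, and `lb_forward` are hypothetical names, the CT and NAT tables are plain dictionaries, and the option is modelled as a field on the packet instead of bytes in the IPv4 header.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Addr = Tuple[str, int]  # (IPv4 address, port)


@dataclass
class Packet:
    src: Addr
    dst: Addr
    proto: str = "tcp"
    syn: bool = False
    dsr_opt: Optional[Addr] = None  # svc addr:port carried in the IPv4 option


def lb_forward(pkt: Packet, backend: Addr) -> Packet:
    """Intermediate node: DNAT to the backend, keep the client as source."""
    svc = pkt.dst
    # For TCP the option is set on SYN only; other protocols always carry it.
    set_opt = pkt.proto != "tcp" or pkt.syn
    return Packet(pkt.src, backend, pkt.proto, pkt.syn, svc if set_opt else None)


@dataclass
class Backend:
    # Toy stand-ins for the CT and NAT tables.
    ct: Dict[Tuple[Addr, Addr], dict] = field(default_factory=dict)
    nat: Dict[Tuple[Addr, Addr], Addr] = field(default_factory=dict)

    def on_request(self, pkt: Packet) -> None:
        """Parse the option, store the svc mapping, set the dsr bit."""
        entry = self.ct.setdefault((pkt.src, pkt.dst), {"dsr": False})
        if pkt.dsr_opt is not None:
            self.nat[(pkt.dst, pkt.src)] = pkt.dsr_opt
            entry["dsr"] = True

    def on_reply(self, pkt: Packet) -> Packet:
        """Rewrite the reply source to the svc addr:port if the dsr bit is set."""
        entry = self.ct.get((pkt.dst, pkt.src))
        if entry and entry["dsr"]:
            svc = self.nat.get((pkt.src, pkt.dst))
            if svc is not None:
                pkt.src = svc
        return pkt
```

A SYN from a client to the service is forwarded with the client address intact and the service address in the option; the reply leaves the backend with the service address as its source, so the client sees the address it originally contacted.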
Reviewable per commit.
Related #8979