Skip to content
Please note that GitHub no longer supports Internet Explorer.

We recommend upgrading to the latest Microsoft Edge, Google Chrome, or Firefox.

Learn more
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DSR IPv4 for NodePort BPF #9473

Merged
merged 13 commits into from Jan 17, 2020
Merged

Add DSR IPv4 for NodePort BPF #9473

merged 13 commits into from Jan 17, 2020

Conversation

@brb
Copy link
Member

brb commented Oct 22, 2019

This PR adds a direct server return (DSR) support for the NodePort BPF for IPv4 and in the direct routing mode.

The main idea of DSR is to avoid SNAT'ing an original request sent to an LB, so that a backend could directly reply to a client (the originator of the request) and the original source IP could be preserved.

To achieve this, we introduce a new IPv4 option which stores a NodePort service IP and port number. The option is set by bpf_netdev running on a public iface of an intermediate node which received the original request. Once the option has been set, the request (the dst IP addr of the request is DNAT'd to the backend IP addr) is forwarded to a node running the backend. After receiving the fwd'd request, bpf_lxc of the backend parses the option, stores the svc addr and port in the NAT table and sets the dsr bit in a CT entry.

When sending a reply to the client, bpf_lxc of the backend finds out that the dsr bit was set, does a lookup in the NAT table to find the mapping, and finally rewrites the source addr and port to the svc addr and port.

The current approach has a shortcoming that if the request size is > (MTU - 8bytes), the request will be dropped after we append the IPv4 option.

To partially solve this, in the case of TCP we set the option only for SYN packets which should have an empty payload. However, the problem still exists for TCP with SYN cookies and UDP packets. For those cases, a client needs to decrease its MTU by 8bytes.

Add direct server return (DSR) for NodePort BPF

DSR for IPv6 and docs will be submitted in a follow-up PR.

Reviewable per commit.

Related #8979


This change is Reviewable

@brb brb force-pushed the pr/brb/dsr-v3 branch from ca3a6ca to ca304ec Oct 22, 2019
@coveralls

This comment has been minimized.

Copy link

coveralls commented Oct 22, 2019

Coverage Status

Coverage decreased (-0.001%) to 45.9% when pulling c209a99 on pr/brb/dsr-v3 into abcfef6 on master.

@brb brb force-pushed the pr/brb/dsr-v3 branch 3 times, most recently from 8bf72cd to 3a5e6c8 Oct 28, 2019
@stale

This comment has been minimized.

Copy link

stale bot commented Nov 27, 2019

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 27, 2019
@brb brb removed the stale label Nov 27, 2019
@brb brb force-pushed the pr/brb/dsr-v3 branch from 3a5e6c8 to bf0bc18 Dec 5, 2019
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Dec 5, 2019

test-me-please

@github-actions

This comment has been minimized.

Copy link

github-actions bot commented Dec 5, 2019

Release note label not set, please set the appropriate release note.

@brb

This comment has been minimized.

Copy link
Member Author

brb commented Dec 6, 2019

test-me-please

@brb brb force-pushed the pr/brb/dsr-v3 branch from b7fb02e to 9339e18 Dec 9, 2019
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Dec 9, 2019

test-me-please

@brb brb force-pushed the pr/brb/dsr-v3 branch from 9339e18 to 397d84f Dec 9, 2019
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Dec 9, 2019

test-me-please

@brb brb force-pushed the pr/brb/dsr-v3 branch from 397d84f to 1c094ce Dec 9, 2019
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Dec 9, 2019

test-me-please

1 similar comment
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Dec 9, 2019

test-me-please

@brb brb force-pushed the pr/brb/dsr-v3 branch from 1c094ce to 001e097 Dec 10, 2019
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Dec 10, 2019

test-me-please

1 similar comment
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Dec 11, 2019

test-me-please

@brb brb force-pushed the pr/brb/dsr-v3 branch from 7ce09a5 to 7028bd8 Dec 11, 2019
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Dec 11, 2019

test-me-please

3 similar comments
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Dec 16, 2019

test-me-please

@brb

This comment has been minimized.

Copy link
Member Author

brb commented Dec 16, 2019

test-me-please

@brb

This comment has been minimized.

Copy link
Member Author

brb commented Dec 17, 2019

test-me-please

@brb brb force-pushed the pr/brb/dsr-v3 branch from 7028bd8 to bd70c35 Dec 18, 2019
bpf/lib/nodeport.h Outdated Show resolved Hide resolved
brb added 11 commits Oct 22, 2019
This commit adds a direct server return (DSR) support for the NodePort
BPF for IPv4 and in the direct routing mode.

The main idea of DSR is to avoid SNAT'ing an original request sent to an
LB, so that a backend could directly reply to a client (the originator
of the request) and the original source IP could be preserved.

To achieve this, we introduce a new IPv4 option which stores a NodePort
service IP and port number. The option is set by bpf_netdev running on
a public iface of an intermediate node which received the original request.
Once the option has been set, the request (the dst IP addr of the request is
DNAT'd to the backend IP addr) is forwarded to a node running the backend.
After receiving the fwd'd request, bpf_lxc of the backend parses the
option, stores the svc addr:port in the NAT table and sets the "dsr" bit
in a CT entry.

When sending a reply to the client, bpf_lxc finds out that the "dsr" bit
was set, does a lookup in the NAT table to find the mapping, and finally
rewrites the source addr and port to the svc addr and port.

The current approach has a shortcoming that if the request size is > (MTU -
8bytes), the request will be dropped after we append the IPv4 option.

To partially solve this, in the case of TCP we set the option only for
SYN packets which should have an empty payload. However, the problem
still exists for TCP with SYN cookies and UDP packets. For those cases,
a client needs to decrease its MTU by 8bytes.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
For the DSR test case, we need to schedule the test-k8s2 (prev.
test-k8s1) pod on k8s2. Otherwise, a request from the
client-from-outside Docker container running on k8s1 to the pod
via k8s2 (sending via k8s1 does not test the DSR) would be dropped
by the kernel due to a routing loop detection mechanism:

1) k8s2 recv: client-from-outside (192.168.10.10) @ k8s1 -> k8s2:NodePort
2) k8s2 fwd to k8s1: client-from-outside (192.168.10.10) @ k8s1 -> Pod @ k8s1
3) k8s1 recv the packet on enp0s8, and has a route
   "192.168.10.0/24 dev $DOCKER_BRIDGE" <- kernel detects a potential loop.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
In the case of DSR, the following CT and NAT entries are created on
a host which runs a service endpoint and to which a client request is
forwarded:

* NAT: endpoint -> client (XLATE_SRC aka TUPLE_F_OUT)
* CT:  client -> endpoint (TUPLE_F_IN)

Previously, the CT GC ignored NAT entries when a corresponding CT entry was
of the TUPLE_F_IN type. Therefore, the DSR NAT entries could not have been
collected.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
This is going to be needed by some k8sT/Services.go tests.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Previously, we ran curl from the "client-from-outside" container
in the tests which required sending requests from a third host.

We simulated the third host by running a container
("client-from-outside") in a Docker network which was not managed
by Cilium.

Unfortunately, requests sent to a NodePort service from the container
were handled by bpf_sock.c which prevented from testing the NodePort
implementation in bpf_netdev.c.

Fix it by introducing a "real" host, and run curl from it.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
As we are planning to support multiple mutually inclusive modes for
NodePort, introduce a flag to store them. Also, re-use the flag
for enabling the DSR option.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Add revalidate_data() to avoid verifier complaining on the 4.19.57
kernel when loading bpf_lxc.o:

level=warning msg=" R0=inv(id=0,umax_value=2147483647,var_off=(0x0; 0x7fffffff)) R1=inv0 R3=inv0 R4=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R5=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv0 R8=inv(id=0) R9=inv0 R10=fp0,call_-1 fp-104=0 fp-112=0 fp-120=0 fp-152=0 fp-216=0" subsys=datapath-loader
level=warning msg="1007: (7b) *(u64 *)(r10 -152) = r3" subsys=datapath-loader
level=warning msg="1008: (b7) r7 = 0" subsys=datapath-loader
level=warning msg="1009: (71) r1 = *(u8 *)(r8 +0)" subsys=datapath-loader
level=warning msg="R8 invalid mem access 'inv'" subsys=datapath-loader

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@brb brb force-pushed the pr/brb/dsr-v3 branch from 130bddf to 3a76dc3 Jan 16, 2020
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Jan 16, 2020

test-me-please

@brb

This comment has been minimized.

Copy link
Member Author

brb commented Jan 16, 2020

(Hopefully) final PTAL:

  • @aanm I've disabled cilium-agent on the k8s3 node. Please review 3a76dc3.
  • @borkmann I've fixed some cosmetics.

Anyway, TODO for the follow-up PRs is the following:

  • IPv6
  • Documentation
  • Check whether DSR works in the tunneling mode (also, maybe consider dropping the nodeport handling in bpf_overlay because due to the optimization all cases might be handled by bpf_sock)
  • Move some DSR code from bpf_lxc to inline functions in nodeport.h
  • Consider supporting other IPv4 options
  • CI
    • Set the NoSchedule taint for the k8s3 node to avoid potential flakes
    • Get rid of the affinity in the demo_ds.yaml
    • Don't label the k8s3 node if it doesn't exist
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Jan 16, 2020

test-me-please

1 similar comment
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Jan 16, 2020

test-me-please

@brb brb force-pushed the pr/brb/dsr-v3 branch from a5f66e9 to 9b422ed Jan 16, 2020
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Jan 16, 2020

test-me-please

2 similar comments
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Jan 16, 2020

test-me-please

@brb

This comment has been minimized.

Copy link
Member Author

brb commented Jan 16, 2020

test-me-please

@brb brb force-pushed the pr/brb/dsr-v3 branch from b6baf1f to c209a99 Jan 17, 2020
@brb

This comment has been minimized.

Copy link
Member Author

brb commented Jan 17, 2020

test-me-please

@brb

This comment has been minimized.

Copy link
Member Author

brb commented Jan 17, 2020

CI failed due to Suite-k8s-1.11.K8sDatapathConfig MonitorAggregation Checks that monitor aggregation flags send notifications:

/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:430
Could not locate final FIN notification in monitor log
Expected
    <int>: 0
to be >
    <int>: 0
/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/k8s-1.11-gopath/src/github.com/cilium/cilium/test/k8sT/DatapathConfiguration.go:197

Seemed to be a flake. Re-running.

@brb

This comment has been minimized.

Copy link
Member Author

brb commented Jan 17, 2020

test-me-please

@aanm aanm merged commit ff54dbd into master Jan 17, 2020
5 checks passed
5 checks passed
Cilium-Ginkgo-Tests Build finished.
Details
Hound 1 violation found.
Mergeability Mergeable!
Details
Travis CI - Pull Request Build Passed
Details
coverage/coveralls Coverage decreased (-0.001%) to 45.9%
Details
@aanm aanm deleted the pr/brb/dsr-v3 branch Jan 17, 2020
@borkmann borkmann moved this from In progress (1.7) to Done in kube-proxy removal Jan 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
5 participants
You can’t perform that action at this time.