
Add IPv6 BIG TCP support #20349

Merged: 10 commits into cilium:master on Jul 8, 2022

Conversation

@NikAleksandrov commented Jun 29, 2022

Add a new IPv6 BIG TCP [1] option and infrastructure that changes a device's GRO and GSO maximum sizes depending on their current values and the new option (enable-ipv6-big-tcp). BIG TCP lets the network stack build bigger TSO/GRO packets (the maximum grows from 64k to 512k). It can be used only with IPv6 and improves latency and performance because fewer packets traverse the stack. It works by inserting a new temporary Hop-by-Hop header right after the IPv6 header, which is stripped before the packet is sent on the wire.

The defaults (65536) are restored only if necessary (i.e. if the GSO/GRO max sizes were previously changed); when the option is enabled, the sizes are set to 196608. The bpf_dynptr_data helper is used to probe for a 5.19+ kernel. BIG TCP can be used only in BPF host routing mode, without encryption and tunneling; checks enforce this compatibility. If BIG TCP cannot tune all external interfaces, it reverts to the defaults. It is currently supported only on mlx4/mlx5 and veth devices; any device that inherits GRO/GSO maximum sizes from others (e.g. bonding) should also work.
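The sizing logic described above can be sketched as follows (a hypothetical helper for illustration only, not Cilium's actual code):

```python
GSO_GRO_DEFAULT = 65536   # kernel default maximum
BIG_TCP_MAX = 196608      # size applied when enable-ipv6-big-tcp is on

def desired_max_size(big_tcp_enabled, current):
    """Return the GSO/GRO max size to program on a device,
    or None if the current value can be left untouched."""
    if big_tcp_enabled:
        return BIG_TCP_MAX
    # When disabled, only reset devices whose size was previously changed.
    if current != GSO_GRO_DEFAULT:
        return GSO_GRO_DEFAULT
    return None
```

The same value is applied to both gso_max_size and gro_max_size; on failure to tune any external interface, all devices are reverted to the default.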

LWN also has a nice summary of BIG TCP.

If the new BIG TCP Cilium option is enabled and initialized it can be seen in the logs:
level=info msg="Setting up IPv6 BIG TCP" subsys=big-tcp
level=info msg="Setting gso_max_size to 196608 and gro_max_size to 196608" device=eth0 subsys=big-tcp

Netperf benchmarks:

- TCP_RR
MIN_LATENCY  P90_LATENCY  P99_LATENCY  THROUGHPUT (trans/sec)

BIG TCP Disabled
60            91           136         13087.53

BIG TCP Enabled
38            76           109         15629.73

- TCP_STREAM (Mbps)
16386.21 (BIG TCP disabled)
23181.41 (BIG TCP enabled)

Standalone tests using a Kind cluster are added. BIG TCP currently works only with veth and Mellanox devices,
so a Kind cluster is needed to test it properly. I've left a TODO note that the tests should be migrated to the
new e2e infra once it lands.
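The kind of check the tests perform can be approximated by parsing `ip -d link show <dev>` output for the configured size (a hypothetical helper for illustration; the actual tests live in the test suite):

```python
import re

def parse_gso_max_size(ip_link_output):
    """Extract gso_max_size from `ip -d link show <dev>` output,
    returning it as an int, or None if the field is absent."""
    m = re.search(r"gso_max_size (\d+)", ip_link_output)
    return int(m.group(1)) if m else None

# Abbreviated example of an iproute2 detail line:
sample = "... gso_max_size 196608 gso_max_segs 65535 gro_max_size 196608 ..."
```

As the test description notes, checking gso_max_size alone is sufficient in practice, since both sizes are programmed together.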

Note that pods need to be restarted when the option is changed, because the veth configuration is applied when the
devices are created, before they are moved into the target netns.

Patch-set overview:
Patch 01 - adds basic BIG TCP infrastructure for enabling/disabling it and checking requirements
Patch 02 - adds GRO/GSO max sizes to the daemon configuration status so they can be exposed for veth device configuration
Patch 03 - adds the new "--enable-ipv6-big-tcp" option to the daemon
Patch 04 - uses the exposed GRO/GSO max sizes to configure the newly created pod veth devices
Patch 05 - adds helm support for the new option (enableIPv6BIGTCP)
Patch 06 - exposes the IPv6 BIG TCP state in "cilium status" (IPv6 BIG TCP: Enabled/Disabled)
Patch 07 - documents the new feature in the tuning docs
Patch 08 - adds BIG TCP tests which provision a Kind cluster, verify the option is properly enabled and run netperf

Requirements to enable:

  • Kernel >= 5.19
  • eBPF Host-Routing
  • eBPF-based kube-proxy replacement
  • eBPF-based masquerading
  • Tunneling and encryption disabled
  • Supported NICs: mlx4, mlx5
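Putting the requirements together, a Helm values fragment for enabling the feature might look like this (a sketch: apart from enableIPv6BIGTCP, which this PR adds, the other option names are the usual Cilium Helm settings of this era and should be checked against your chart version; eBPF host routing is enabled automatically when its prerequisites are met):

```yaml
enableIPv6BIGTCP: true         # new option added by this PR
ipv6:
  enabled: true                # BIG TCP is IPv6-only
kubeProxyReplacement: strict   # eBPF-based kube-proxy replacement
bpf:
  masquerade: true             # eBPF-based masquerading
tunnel: disabled               # tunneling must be disabled
encryption:
  enabled: false               # encryption must be disabled
```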

Thanks,
Nik

[1] https://lore.kernel.org/netdev/20220513183408.686447-3-eric.dumazet@gmail.com/T/

@NikAleksandrov requested reviews from several teams as code owner June 29, 2022 14:40
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 29, 2022
@NikAleksandrov (Author):

/test

@borkmann borkmann requested review from jrfastab and brb June 29, 2022 15:06
@borkmann borkmann added release-note/major This PR introduces major new functionality to Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. kind/performance There is a performance impact of this. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jun 29, 2022
@NikAleksandrov (Author):

I accidentally pushed the wrong auto-generated model file, sorry about that.
I've pushed the correct one now.

@NikAleksandrov (Author):

/test

@NikAleksandrov (Author):

/test

@NikAleksandrov (Author) commented Jul 6, 2022

/test

Job 'Cilium-PR-K8s-1.23-kernel-4.19' failed:


Test Name

K8sFQDNTest Restart Cilium validate that FQDN is still working

Failure Output

FAIL: Timed out after 240.000s.

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-4.19 so I can create one.

@NikAleksandrov (Author):

/test

@NikAleksandrov (Author) commented Jul 6, 2022

/test

Job 'Cilium-PR-K8s-GKE' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) Checks in-cluster KPR with L7 policy Tests NodePort with L7 Policy

Failure Output

FAIL: Error creating resource /home/jenkins/workspace/Cilium-PR-K8s-GKE@5/src/github.com/cilium/cilium/test/k8s/manifests/l7-policy-demo.yaml: Cannot retrieve cilium pod cilium-86x5l policy revision: cannot get the revision 

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

@NikAleksandrov (Author) commented Jul 7, 2022

/test

Job 'Cilium-PR-K8s-GKE' failed:


Test Name

K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output

FAIL: Timed out after 61.429s.

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

@rolinh (Member) left a comment:

LGTM for the vendor and API changes 👍

@tklauser tklauser added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Jul 7, 2022
Nikolay Aleksandrov added 10 commits July 8, 2022 09:29
Update vishvananda/netlink to support setting IFLA_GRO_MAX_SIZE and
IFLA_GSO_MAX_SIZE which is needed for BIG TCP.

Signed-off-by: Nikolay Aleksandrov <nikolay@isovalent.com>
Add basic IPv6 BIG TCP infrastructure that changes a device's
GRO and GSO max sizes depending on their current values and a new
enable-ipv6-big-tcp option. BIG TCP lets the network stack
build bigger TSO/GRO packets (the maximum grows from 64k to 512k).
It can be used only with IPv6 and improves latency and performance
because fewer packets traverse the stack. It works by inserting a new
temporary Hop-by-Hop header right after the IPv6 header. Note that this
may break eBPF programs which assume the TCP header immediately follows
the IPv6 header.

The defaults (65536) are restored only if necessary (i.e. if the GSO/GRO
max sizes were previously changed); when the option is enabled, the sizes
are set to 196608. The bpf_dynptr_data helper is used to probe for a 5.19+
kernel. BIG TCP can be used only in BPF host routing mode, without
encryption and tunneling; checks enforce this compatibility. If BIG TCP
cannot tune all external interfaces, it reverts to the defaults.
It is currently supported only on mlx4/mlx5 and veth devices; any
device that inherits GRO/GSO maximum sizes from others (e.g. bonding)
should also work.

If BIG TCP is enabled and initialized it can be seen in the logs:
level=info msg="Setting up IPv6 BIG TCP" subsys=big-tcp
level=info msg="Setting gso_max_size to 196608 and gro_max_size to 196608" device=eth0 subsys=big-tcp

Benchmarks:

- TCP_RR
MIN_LATENCY  P90_LATENCY  P99_LATENCY  THROUGHPUT (trans/sec)

BIG TCP Disabled
60           91           136          13087.53

BIG TCP Enabled
38           76           109          15629.73

- TCP_STREAM (Mbps)
16386.21 (BIG TCP disabled)
23181.41 (BIG TCP enabled)

Signed-off-by: Nikolay Aleksandrov <nikolay@isovalent.com>
Add new GRO/GSO max size fields to the daemon configuration status API.
We need to expose them so they can be configured on pod veth devices.

Signed-off-by: Nikolay Aleksandrov <nikolay@isovalent.com>
Add daemon support for the new option, initialize it in its NewDaemon
call and expose the initialized GRO/GSO max sizes (BIG TCP config) through
the daemon configuration status.

Signed-off-by: Nikolay Aleksandrov <nikolay@isovalent.com>
Allow configuring GRO/GSO max sizes when setting up veth devices. These
are needed to enable BIG TCP support. They are configured only if > 0.
Pass the configured values when setting up veths.

Signed-off-by: Nikolay Aleksandrov <nikolay@isovalent.com>
Add Helm setting for IPv6 BIG TCP (enableIPv6BIGTCP) which defaults to
false. Used "make -C install/kubernetes cilium/values.yaml" to
autogenerate the values.yaml file.

Signed-off-by: Nikolay Aleksandrov <nikolay@isovalent.com>
Add current BIG TCP state to the Status Response model and expose it
in "cilium status". The struct naming (IPV6BigTCP) is due to the
automatic generation.

Output looks like:
$ kubectl -n kube-system exec cilium-rmxzw -- cilium status
...
IPv6 BIG TCP:            Enabled
...

Signed-off-by: Nikolay Aleksandrov <nikolay@isovalent.com>
Add a new entry explaining the BIG TCP feature, its requirements, how to
enable it and how to validate if it was successfully enabled.

Signed-off-by: Nikolay Aleksandrov <nikolay@isovalent.com>
Add BIG TCP's kernel requirements to "Required Kernel Versions for
Advanced Features"

Signed-off-by: Nikolay Aleksandrov <nikolay@isovalent.com>
Add standalone BIG TCP tests. They use a Kind cluster and set up Cilium
with BIG TCP enabled, then verify that gso_max_size is set properly.
(gro_max_size is not verified because of its availability in iproute2;
gso_max_size has been supported for a long time, and if it was set
properly, the GRO max size was set as well.) Lastly, they run a netperf
TCP_RR test between the client and server netperf pods. The test needs
veth devices for BIG TCP support, so Kind is the natural choice; once
e2e Kind testing infra is added, these tests can be moved and integrated
(a TODO note is left in the test as well).

Signed-off-by: Nikolay Aleksandrov <nikolay@isovalent.com>
@NikAleksandrov (Author) commented Jul 8, 2022

/test

Job 'Cilium-PR-K8s-GKE' failed:


Test Name

K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output

FAIL: Timed out after 61.388s.

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

@NikAleksandrov NikAleksandrov removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Jul 8, 2022
@NikAleksandrov (Author):

The ConformanceAKS test failure is unrelated to BIG TCP; it fails during Cilium cleanup. Details:

Run pkill -f "cilium.*hubble.*port-forward|kubectl.*port-forward.*hubble-relay"
Error: Process completed with exit code 1.

@NikAleksandrov NikAleksandrov added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 8, 2022
@borkmann borkmann merged commit 571263e into cilium:master Jul 8, 2022
@aanm aanm added this to the v1.13.0-rc0 milestone Aug 5, 2022
Labels: kind/performance, ready-to-merge, release-note/major, sig/datapath

10 participants