
feat(connlib): utilise GSO for UDP sockets #7210

Merged
thomaseizinger merged 20 commits into main from chore/gso on Dec 2, 2024

Conversation

@thomaseizinger
Member

@thomaseizinger thomaseizinger commented Nov 1, 2024

Context

At present, connlib sends UDP packets one at a time. Sending a packet requires a syscall, which is quite expensive. Under load, i.e. during a speedtest, syscalls account for over 50% of our CPU time [0]. To improve this situation, we need to make use of GSO (generic segmentation offload). With GSO, we can send multiple packets to the same destination in a single syscall.

The tricky question is: how do we get multiple UDP packets ready at once so we can send them in a single syscall? Our TUN interface only feeds us packets one at a time, and connlib's state machine is single-threaded. Additionally, we currently only have a single EncryptBuffer in which the to-be-sent datagram sits.

1. Stack-allocating encrypted IP packets

As a first step, we get rid of the single EncryptBuffer and instead stack-allocate each encrypted IP packet. Due to our small MTU, these packets are only around 1300 bytes. Stack-allocating them requires a few memcpys, but those cost only single-digit percentages of CPU time, which is nothing compared to how much time we spend on UDP syscalls. With the EncryptBuffer out of the way, we can now "freely" move the EncryptedPacket structs around and, technically, have multiple of them at the same time.
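The idea can be sketched roughly like this. This is a minimal, self-contained sketch: the type name `EncryptedPacket` comes from the PR, but the fields, the MTU constant and the constructor are illustrative assumptions, not connlib's actual implementation.

```rust
// Hypothetical sketch of a stack-allocated encrypted packet.
// The buffer size is illustrative; the small MTU keeps packets ~1300 bytes.
const MAX_PACKET: usize = 1380;

#[derive(Clone, Copy)]
struct EncryptedPacket {
    buf: [u8; MAX_PACKET],
    len: usize,
}

impl EncryptedPacket {
    /// Copies `payload` into a stack buffer; this memcpy is cheap
    /// compared to the cost of a syscall.
    fn new(payload: &[u8]) -> Option<Self> {
        if payload.len() > MAX_PACKET {
            return None;
        }
        let mut buf = [0u8; MAX_PACKET];
        buf[..payload.len()].copy_from_slice(payload);
        Some(Self { buf, len: payload.len() })
    }

    fn as_slice(&self) -> &[u8] {
        &self.buf[..self.len]
    }
}

fn main() {
    let pkt = EncryptedPacket::new(b"example ciphertext").expect("fits in MTU");
    // Being `Copy`, the struct can be moved around freely without a
    // shared buffer, so several packets can be in flight at once.
    let moved = pkt;
    assert_eq!(moved.as_slice(), b"example ciphertext".as_slice());
}
```

Without the shared EncryptBuffer, nothing forces the packets to be processed one at a time anymore.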

2. Implementing GSO

The GSO interface allows you to pass multiple packets of the same length for the same destination in a single syscall, meaning we cannot just batch up arbitrary UDP packets. Counterintuitively, making use of GSO requires us to do more copying: in particular, we change the interface of Io such that "sending" a packet essentially looks up a BytesMut buffer by destination and packet length and appends the payload to that buffer.
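A minimal sketch of that lookup-and-append behaviour, using a plain `HashMap` and `Vec<u8>` in place of connlib's real `GsoQueue`/`BytesMut` internals (the names and signatures here are illustrative assumptions, not the actual API):

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Sketch of a GSO send queue. Datagrams are grouped by destination and
/// segment size; each group can later be flushed with a single `sendmsg`
/// carrying a `UDP_SEGMENT` control message that tells the kernel how to
/// split the buffer back into individual datagrams.
#[derive(Default)]
struct GsoQueue {
    buffers: HashMap<(SocketAddr, usize), Vec<u8>>,
}

impl GsoQueue {
    /// "Sending" appends the payload to the buffer for this
    /// (destination, segment size) pair instead of issuing a syscall.
    fn send(&mut self, dst: SocketAddr, segment_size: usize, payload: &[u8]) {
        debug_assert!(payload.len() <= segment_size);
        self.buffers
            .entry((dst, segment_size))
            .or_default()
            .extend_from_slice(payload);
    }

    /// Drains the queue; each yielded entry corresponds to one syscall.
    fn flush(&mut self) -> impl Iterator<Item = ((SocketAddr, usize), Vec<u8>)> + '_ {
        self.buffers.drain()
    }
}

fn main() {
    let dst: SocketAddr = "10.0.0.1:51820".parse().unwrap();
    let mut queue = GsoQueue::default();

    // Three equally-sized packets to one destination collapse into one buffer.
    for _ in 0..3 {
        queue.send(dst, 1280, &[0xAB; 1280]);
    }

    let batches: Vec<_> = queue.flush().collect();
    assert_eq!(batches.len(), 1); // one syscall instead of three
    assert_eq!(batches[0].1.len(), 3 * 1280);
}
```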

3. Batch-read IP packets

In order to actually perform GSO, we need to process more than a single IP packet in one event-loop tick. We achieve this by batch-reading up to 50 IP packets from the mpsc-channel that connects connlib's main event-loop with the dedicated thread that reads and writes to the TUN device. These reads and writes happen concurrently to connlib's packet processing. Thus, it is likely that by the time connlib is ready to process another IP packet, multiple have been read from the device and are sitting in the channel. Batch-processing these IP packets means that the buffers in our GsoQueue are more likely to contain more than a single datagram.

Imagine you are running a file upload. The OS will send many packets, most at max MTU, to the same destination IP via the TUN device. It is likely that we read 10-20 of these packets in one batch (i.e. within a single "tick" of the event loop). All of them will be appended to the same buffer in the GsoQueue, and on the next event-loop tick, they will all be flushed out in a single syscall.
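The batch-read step can be sketched like this, using a plain `std::sync::mpsc` channel in place of connlib's actual async channel (the helper name `recv_batch` is hypothetical):

```rust
use std::sync::mpsc;

/// Sketch of batch-reading up to `max` items from the channel that
/// connects the TUN reader thread to connlib's main event loop.
fn recv_batch<T>(rx: &mpsc::Receiver<T>, max: usize) -> Vec<T> {
    let mut batch = Vec::with_capacity(max);
    while batch.len() < max {
        match rx.try_recv() {
            Ok(item) => batch.push(item),
            Err(_) => break, // channel empty (or closed): process what we have
        }
    }
    batch
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Simulate the TUN thread having read 120 packets while the event
    // loop was busy processing.
    for i in 0..120u32 {
        tx.send(i).unwrap();
    }

    // One event-loop tick drains at most 50 of them.
    let batch = recv_batch(&rx, 50);
    assert_eq!(batch.len(), 50);
}
```

Because the TUN thread keeps reading concurrently, the channel tends to refill while the event loop processes a batch, which is what keeps the GsoQueue buffers full.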

Results

Overall, this results in a significant reduction of syscalls for sending UDP messages. In [1], we spend only 16% of our CPU time in udpv6_sendmsg, whereas on main [0], we spent 34%. Do note that these numbers are relative to the total CPU time spent per program run and thus can't be compared directly (i.e. you cannot just do 34 - 16 and say we now spend 18% less time sending UDP packets). Nevertheless, this appears to be a great improvement.

In terms of throughput, we achieve a ~60% improvement in our benchmark suite. That one runs on localhost though, so the gain might not carry over one-to-one to a real network.


@thomaseizinger
Member Author

Still need to test this on other platforms but looks promising :)

I am especially curious what impact this will have on the gateway.

@jamilbk
Member

jamilbk commented Nov 1, 2024

Still need to test this on other platforms but looks promising :)

I am especially curious what impact this will have on the gateway.

Just to make sure I'm following, what other platforms would GSO be applicable to? I thought this was a Linux-only API?

Yeah the Gateway is the main thing that will benefit from this for sure.

jamilbk
jamilbk previously approved these changes Nov 1, 2024
Member

@jamilbk jamilbk left a comment


Trading a few mem copies for fewer syscalls seems to make sense.

Curious where a bit more time profiling and optimization can get us! It would be nice to be able to have benchmarks / blog post to point users to on sales calls. Sizing is one of the main questions / friction points admins have when deploying Gateways.

@thomaseizinger
Member Author

Just to make sure I'm following, what other platforms would GSO be applicable to? I thought this was a Linux-only API?

As of recently, quinn-udp also has support for GSO on Apple: quinn-rs/quinn#1993.

@thomaseizinger
Member Author

thomaseizinger commented Nov 1, 2024

Do note that these numbers are relative to the total CPU time spent per program run and thus can't be compared directly (i.e. you cannot just do 34 - 21 and say we now spend 13% less time sending UDP packets).

@conectado You are better at statistics than me, am I thinking through this correctly? :)


@thomaseizinger
Member Author

Curious where a bit more time profiling and optimization can get us!

The next thing will be creating multiple threads for the TUN device and reading / writing those in parallel.

@thomaseizinger
Member Author

thomaseizinger commented Nov 3, 2024

@jamilbk If you have the time, I'd be interested to learn if you get any better upload speed on your Mac with this (don't forget to build in release mode). I only have 40MBit upload here and that pretty much tops out already anyway.

@thomaseizinger
Member Author

It would be nice to be able to have benchmarks / blog post to point users to on sales calls. Sizing is one of the main questions / friction points admins have when deploying Gateways.

#7243

On top of that, we could easily build something where we run benchmarks on differently sized machines to see how that affects the throughput.

@conectado
Contributor

Do note that these numbers are relative to the total CPU time spent per program run and thus can't be compared directly (i.e. you cannot just do 34 - 21 and say we now spend 13% less time sending UDP packets).

@conectado You are better at statistics than me, am I thinking through this correctly? :)

I think you're correct! Those numbers are relative to CPU time, and if we measured the program for long enough to average out any outlier events and assume the same conditions, it should actually represent 13% less CPU time spent on udpv6_sendmsg, which probably means it was called fewer times, as there is no reason to assume a speed-up in that function itself.

But as you said, we can't assume it means we spent 13% less time processing a packet, as that time could be diverted to any other stack frame. To actually be sure of the time improvement, we would need to see the whole packet processing in a single stack frame.

We might actually be able to do that if we write some benchmarks with a custom scheduler? It would be an interesting avenue to explore if we want to continue making performance improvements.

@thomaseizinger thomaseizinger added this pull request to the merge queue Dec 2, 2024
Merged via the queue into main with commit 0a65541 Dec 2, 2024
@thomaseizinger thomaseizinger deleted the chore/gso branch December 2, 2024 01:24
github-merge-queue bot pushed a commit that referenced this pull request Dec 5, 2024
## Context

At present, we only have a single thread that reads and writes to the
TUN device on all platforms. On Linux, it is possible to open the file
descriptor of a TUN device multiple times by setting the
`IFF_MULTI_QUEUE` option using `ioctl`. Using multi-queue, we can then
spawn multiple threads that concurrently read and write to the TUN
device. This is critical for achieving better throughput.
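Roughly, the flag setup looks like this. The constants are the standard values from Linux's `<linux/if_tun.h>`; the helper function is illustrative, and the actual `ioctl` is only shown in comments since it requires the `libc` crate and `CAP_NET_ADMIN`.

```rust
/// Flag values from Linux's <linux/if_tun.h> / <linux/if.h>.
const IFF_TUN: u16 = 0x0001; // TUN device (layer 3, no Ethernet header)
const IFF_NO_PI: u16 = 0x1000; // no extra packet-information header
const IFF_MULTI_QUEUE: u16 = 0x0100; // allow opening the device multiple times

/// Builds the `ifr_flags` value each thread would pass in its `ifreq`
/// when issuing `TUNSETIFF` on its own `/dev/net/tun` file descriptor.
fn multi_queue_flags() -> u16 {
    IFF_TUN | IFF_NO_PI | IFF_MULTI_QUEUE
}

fn main() {
    // The actual setup (requires CAP_NET_ADMIN and the `libc` crate)
    // would look roughly like:
    //
    //   let fd = File::open("/dev/net/tun")?;          // once per queue/thread
    //   let mut ifr: libc::ifreq = unsafe { std::mem::zeroed() };
    //   // copy the interface name into ifr.ifr_name, set ifr_flags, then:
    //   // unsafe { libc::ioctl(fd.as_raw_fd(), TUNSETIFF, &ifr) };
    //
    // Every fd opened with IFF_MULTI_QUEUE becomes one queue of the same
    // interface, so each thread reads/writes via its own fd.
    assert_eq!(multi_queue_flags(), 0x1101);
}
```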

## Solution

`IFF_MULTI_QUEUE` is a Linux-only thing and therefore only applies to
headless-client, GUI-client on Linux and the Gateway (it may also be
possible on Android, I haven't tried). As such, we need to first change
our internal abstractions a bit to move the creation of the TUN thread
to the `Tun` abstraction itself. For this, we change the interface of
`Tun` to the following:

- `poll_recv_many`: An API, inspired by tokio's `mpsc::Receiver` where
multiple items in a channel can be batch-received.
- `poll_send_ready`: Mimics the API of `Sink` to check whether more
items can be written.
- `send`: Mimics the API of `Sink` to actually send an item.
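The interface described above could be sketched as follows, with an in-memory dummy implementation to make it concrete (the signatures, the `IpPacket` alias and the `DummyTun` type are illustrative assumptions, not the actual trait):

```rust
use std::collections::VecDeque;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

type IpPacket = Vec<u8>; // stand-in for connlib's real packet type

/// Sketch of the described `Tun` interface.
trait Tun {
    /// Batch-receive up to `max` packets, like tokio's `Receiver::recv_many`.
    fn poll_recv_many(&mut self, cx: &mut Context<'_>, buf: &mut Vec<IpPacket>, max: usize) -> Poll<usize>;
    /// `Sink`-style readiness check, upholding backpressure.
    fn poll_send_ready(&mut self, cx: &mut Context<'_>) -> Poll<()>;
    /// `Sink`-style send; only valid after `poll_send_ready` returned `Ready`.
    fn send(&mut self, packet: IpPacket);
}

/// Purely in-memory implementation for illustration.
#[derive(Default)]
struct DummyTun {
    inbound: VecDeque<IpPacket>,
    outbound: Vec<IpPacket>,
}

impl Tun for DummyTun {
    fn poll_recv_many(&mut self, _cx: &mut Context<'_>, buf: &mut Vec<IpPacket>, max: usize) -> Poll<usize> {
        if self.inbound.is_empty() {
            return Poll::Pending; // a real impl would register the waker here
        }
        let n = max.min(self.inbound.len());
        buf.extend(self.inbound.drain(..n));
        Poll::Ready(n)
    }

    fn poll_send_ready(&mut self, _cx: &mut Context<'_>) -> Poll<()> {
        Poll::Ready(()) // an unbounded buffer is always writable
    }

    fn send(&mut self, packet: IpPacket) {
        self.outbound.push(packet);
    }
}

/// No-op waker so we can drive the poll functions without a runtime.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        raw()
    }
    fn noop(_: *const ()) {}
    fn raw() -> RawWaker {
        static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    unsafe { Waker::from_raw(raw()) }
}

fn main() {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);

    let mut tun = DummyTun::default();
    tun.inbound.extend([vec![1], vec![2], vec![3]]);

    let mut batch = Vec::new();
    assert_eq!(tun.poll_recv_many(&mut cx, &mut batch, 100), Poll::Ready(3));

    assert_eq!(tun.poll_send_ready(&mut cx), Poll::Ready(()));
    tun.send(vec![9]);
    assert_eq!(tun.outbound.len(), 1);
}
```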

With these APIs in place, we can implement various (performance)
improvements for the different platforms.

- On Linux, this allows us to spawn multiple threads to read and write
from the TUN device and send all packets into the same channel. The `Io`
component of `connlib` then uses `poll_recv_many` to read batches of up
to 100 packets at once. This ties in well with #7210 because we can then
use GSO to send the encrypted packets in single syscalls to the OS.
- On Windows, we already have a dedicated recv thread because `WinTun`'s
most-convenient API uses blocking IO. As such, we can now also tie into
that by batch-receiving from this channel.
- In addition to using multiple threads, this API now also uses correct
readiness checks on Linux, Darwin and Android to uphold backpressure in
case we cannot write to the TUN device.

## Configuration

Local testing has shown that 2 threads give the best performance for a
local `iperf3` run. I suspect this is because there is only so much
traffic that a single application (i.e. `iperf3`) can generate. With
more than 2 threads, the throughput actually drops drastically because
`connlib`'s main thread is too busy with lock-contention and triggering
`Waker`s for the TUN threads (which mostly idle around if there are 4+
of them). I've made it configurable on the Gateway though so we can
experiment with this during concurrent speedtests etc.

In addition, switching `connlib` to a single-threaded tokio runtime
further increased the throughput. I suspect due to less task / context
switching.

## Results

Local testing with `iperf3` shows some very promising results. We now
achieve a throughput of 2+ Gbit/s.

```
Connecting to host 172.20.0.110, port 5201
Reverse mode, remote host 172.20.0.110 is sending
[  5] local 100.80.159.34 port 57040 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   274 MBytes  2.30 Gbits/sec
[  5]   1.00-2.00   sec   279 MBytes  2.34 Gbits/sec
[  5]   2.00-3.00   sec   216 MBytes  1.82 Gbits/sec
[  5]   3.00-4.00   sec   224 MBytes  1.88 Gbits/sec
[  5]   4.00-5.00   sec   234 MBytes  1.96 Gbits/sec
[  5]   5.00-6.00   sec   238 MBytes  2.00 Gbits/sec
[  5]   6.00-7.00   sec   229 MBytes  1.92 Gbits/sec
[  5]   7.00-8.00   sec   222 MBytes  1.86 Gbits/sec
[  5]   8.00-9.00   sec   223 MBytes  1.87 Gbits/sec
[  5]   9.00-10.00  sec   217 MBytes  1.82 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec  22247             sender
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec                  receiver

iperf Done.
```

This is a pretty solid improvement over what is in `main`:

```
Connecting to host 172.20.0.110, port 5201
[  5] local 100.65.159.3 port 56970 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  90.4 MBytes   758 Mbits/sec  1800    106 KBytes
[  5]   1.00-2.00   sec  93.4 MBytes   783 Mbits/sec  1550   51.6 KBytes
[  5]   2.00-3.00   sec  92.6 MBytes   777 Mbits/sec  1350   76.8 KBytes
[  5]   3.00-4.00   sec  92.9 MBytes   779 Mbits/sec  1800   56.4 KBytes
[  5]   4.00-5.00   sec  93.4 MBytes   783 Mbits/sec  1650   69.6 KBytes
[  5]   5.00-6.00   sec  90.6 MBytes   760 Mbits/sec  1500   73.2 KBytes
[  5]   6.00-7.00   sec  87.6 MBytes   735 Mbits/sec  1400   76.8 KBytes
[  5]   7.00-8.00   sec  92.6 MBytes   777 Mbits/sec  1600   82.7 KBytes
[  5]   8.00-9.00   sec  91.1 MBytes   764 Mbits/sec  1500   70.8 KBytes
[  5]   9.00-10.00  sec  92.0 MBytes   771 Mbits/sec  1550   85.1 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   917 MBytes   769 Mbits/sec  15700             sender
[  5]   0.00-10.00  sec   916 MBytes   768 Mbits/sec                  receiver

iperf Done.
```