feat(connlib): utilise GSO for UDP sockets#7210
Still need to test this on other platforms but looks promising :) I am especially curious what impact this will have on the gateway.
Just to make sure I'm following, what other platforms would GSO be applicable to? I thought this was a Linux-only API?

Yeah the Gateway is the main thing that will benefit from this for sure.
jamilbk
left a comment
Trading a few mem copies for fewer syscalls seems to make sense.
Curious where a bit more time profiling and optimization can get us! It would be nice to have benchmarks / a blog post to point users to on sales calls. Sizing is one of the main questions / friction points admins have when deploying Gateways.
@conectado You are better at statistics than me, am I thinking through this correctly? :)
The next thing will be creating multiple threads for the TUN device and reading / writing those in parallel.
@jamilbk If you have the time, I'd be interested to learn if you get any better upload speed on your Mac with this (don't forget to build in release mode). I only have 40 MBit upload here and that pretty much tops out already anyway.
On top of that, we could easily build something where we run benchmarks on differently sized machines to see how that affects the throughput.
I think you're correct! Those numbers are relative to CPU time, and if we consider that we measured the program for long enough to average out any outlier events, and we assume the same conditions, it should actually represent 13% less CPU time spent there.

But as you said, we can't assume it means we spent 13% less time processing a packet, as that time could be diverted to any other stack frame. To actually be sure of the time improvement, we would need to see the whole packet processing in a single stack frame. Which we might actually be able to do if we make some benchmarks with a custom scheduler? It would be an interesting avenue to explore if we want to continue making performance improvements.
Wow! The performance improvement of this in our benchmark suite is insane. The throughput for TCP download jumped from 380 MBit to 620 MBit!
This always returns > 0 unless the channel is closed (which it should never be).
We never have trailing bytes because `GsoQueue` always tracks by segment-size.
## Context

At present, we only have a single thread that reads and writes to the TUN device on all platforms. On Linux, it is possible to open the file descriptor of a TUN device multiple times by setting the `IFF_MULTI_QUEUE` option using `ioctl`. Using multi-queue, we can then spawn multiple threads that concurrently read and write to the TUN device. This is critical for achieving better throughput.

## Solution

`IFF_MULTI_QUEUE` is a Linux-only feature and therefore only applies to the headless client, the GUI client on Linux and the Gateway (it may also be possible on Android, I haven't tried). As such, we first need to change our internal abstractions a bit to move the creation of the TUN thread into the `Tun` abstraction itself. For this, we change the interface of `Tun` to the following:

- `poll_recv_many`: An API, inspired by tokio's `mpsc::Receiver`, where multiple items in a channel can be batch-received.
- `poll_send_ready`: Mimics the API of `Sink` to check whether more items can be written.
- `send`: Mimics the API of `Sink` to actually send an item.

With these APIs in place, we can implement various (performance) improvements for the different platforms:

- On Linux, this allows us to spawn multiple threads that read and write from the TUN device and send all packets into the same channel. The `Io` component of `connlib` then uses `poll_recv_many` to read batches of up to 100 packets at once. This ties in well with #7210 because we can then use GSO to send the encrypted packets to the OS in single syscalls.
- On Windows, we already have a dedicated recv thread because `WinTun`'s most convenient API uses blocking IO. As such, we can now also tie into that by batch-receiving from this channel.
- In addition to using multiple threads, this API now also uses correct readiness checks on Linux, Darwin and Android to uphold backpressure in case we cannot write to the TUN device.
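To make the new interface concrete, here is a minimal sketch of what the described `Tun` API could look like. The exact names and types in `connlib` may differ; in particular, the `IpPacket` stand-in type here is an assumption for illustration only.

```rust
use std::io;
use std::task::{Context, Poll};

/// Stand-in for connlib's packet type (assumed for illustration).
pub struct IpPacket(pub Vec<u8>);

pub trait Tun {
    /// Batch-receive up to `max` packets into `buf`, returning how many were
    /// received. Inspired by tokio's `mpsc::Receiver::poll_recv_many`.
    fn poll_recv_many(
        &mut self,
        cx: &mut Context<'_>,
        buf: &mut Vec<IpPacket>,
        max: usize,
    ) -> Poll<usize>;

    /// Check whether another packet can be written, akin to `Sink::poll_ready`.
    /// This is what upholds backpressure towards the TUN device.
    fn poll_send_ready(&mut self, cx: &mut Context<'_>) -> Poll<io::Result<()>>;

    /// Send a single packet, akin to `Sink::start_send`. Must only be called
    /// after `poll_send_ready` returned `Poll::Ready(Ok(()))`.
    fn send(&mut self, packet: IpPacket) -> io::Result<()>;
}
```

Splitting `send` into a readiness check plus a non-blocking write (rather than a single async `send`) is what lets the caller apply backpressure without holding a future across event-loop ticks.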
## Configuration

Local testing has shown that 2 threads give the best performance for a local `iperf3` run. I suspect this is because there is only so much traffic that a single application (i.e. `iperf3`) can generate. With more than 2 threads, the throughput actually drops drastically because `connlib`'s main thread is too busy with lock contention and triggering `Waker`s for the TUN threads (which mostly idle around if there are 4+ of them). I've made it configurable on the Gateway though, so we can experiment with this during concurrent speedtests etc.

In addition, switching `connlib` to a single-threaded tokio runtime further increased the throughput, I suspect due to less task / context switching.

## Results

Local testing with `iperf3` shows some very promising results. We now achieve a throughput of 2+ Gbit/s:

```
Connecting to host 172.20.0.110, port 5201
Reverse mode, remote host 172.20.0.110 is sending
[  5] local 100.80.159.34 port 57040 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   274 MBytes  2.30 Gbits/sec
[  5]   1.00-2.00   sec   279 MBytes  2.34 Gbits/sec
[  5]   2.00-3.00   sec   216 MBytes  1.82 Gbits/sec
[  5]   3.00-4.00   sec   224 MBytes  1.88 Gbits/sec
[  5]   4.00-5.00   sec   234 MBytes  1.96 Gbits/sec
[  5]   5.00-6.00   sec   238 MBytes  2.00 Gbits/sec
[  5]   6.00-7.00   sec   229 MBytes  1.92 Gbits/sec
[  5]   7.00-8.00   sec   222 MBytes  1.86 Gbits/sec
[  5]   8.00-9.00   sec   223 MBytes  1.87 Gbits/sec
[  5]   9.00-10.00  sec   217 MBytes  1.82 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec  22247            sender
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec                 receiver

iperf Done.
```

This is a pretty solid improvement over what is in `main`:

```
Connecting to host 172.20.0.110, port 5201
[  5] local 100.65.159.3 port 56970 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  90.4 MBytes   758 Mbits/sec  1800    106 KBytes
[  5]   1.00-2.00   sec  93.4 MBytes   783 Mbits/sec  1550   51.6 KBytes
[  5]   2.00-3.00   sec  92.6 MBytes   777 Mbits/sec  1350   76.8 KBytes
[  5]   3.00-4.00   sec  92.9 MBytes   779 Mbits/sec  1800   56.4 KBytes
[  5]   4.00-5.00   sec  93.4 MBytes   783 Mbits/sec  1650   69.6 KBytes
[  5]   5.00-6.00   sec  90.6 MBytes   760 Mbits/sec  1500   73.2 KBytes
[  5]   6.00-7.00   sec  87.6 MBytes   735 Mbits/sec  1400   76.8 KBytes
[  5]   7.00-8.00   sec  92.6 MBytes   777 Mbits/sec  1600   82.7 KBytes
[  5]   8.00-9.00   sec  91.1 MBytes   764 Mbits/sec  1500   70.8 KBytes
[  5]   9.00-10.00  sec  92.0 MBytes   771 Mbits/sec  1550   85.1 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   917 MBytes   769 Mbits/sec  15700            sender
[  5]   0.00-10.00  sec   916 MBytes   768 Mbits/sec                 receiver

iperf Done.
```
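For illustration, the `IFF_MULTI_QUEUE` mechanism described above boils down to issuing the same `TUNSETIFF` ioctl once per desired queue. The following is a simplified, Linux-only sketch (constants taken from `<linux/if_tun.h>`, `struct ifreq` layout hand-declared to stay dependency-free), not connlib's actual code; it also assumes a short device name and `CAP_NET_ADMIN`:

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::io::AsRawFd;

// Constants from <linux/if.h> / <linux/if_tun.h>.
const TUNSETIFF: u64 = 0x4004_54ca; // _IOW('T', 202, int)
const IFF_TUN: i16 = 0x0001;
const IFF_NO_PI: i16 = 0x1000;
const IFF_MULTI_QUEUE: i16 = 0x0100;

/// Mirrors the layout of `struct ifreq`: a 16-byte name followed by a
/// 24-byte union, of which we only use the leading `short` flags field.
#[repr(C)]
struct IfReq {
    name: [u8; 16],
    flags: i16,
    _pad: [u8; 22],
}

extern "C" {
    fn ioctl(fd: i32, request: u64, arg: *mut IfReq) -> i32;
}

/// Attach one more queue to the TUN device `name`. Calling this N times
/// yields N independent file descriptors that N threads can read/write
/// concurrently. Assumes `name` is shorter than 16 bytes.
fn open_queue(name: &str) -> io::Result<File> {
    let file = OpenOptions::new().read(true).write(true).open("/dev/net/tun")?;

    let mut ifr = IfReq {
        name: [0; 16],
        flags: IFF_TUN | IFF_NO_PI | IFF_MULTI_QUEUE,
        _pad: [0; 22],
    };
    ifr.name[..name.len()].copy_from_slice(name.as_bytes());

    if unsafe { ioctl(file.as_raw_fd(), TUNSETIFF, &mut ifr) } < 0 {
        return Err(io::Error::last_os_error());
    }

    Ok(file)
}
```

Each returned `File` is a separate queue; as noted above, local testing found 2 reader threads to be the sweet spot for a single `iperf3` stream.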
## Context

At present, `connlib` sends UDP packets one at a time. Sending a packet requires us to make a syscall, which is quite expensive. Under load, i.e. during a speedtest, syscalls account for over 50% of our CPU time [0]. In order to improve this situation, we need to somehow make use of GSO (generic segmentation offload). With GSO, we can send multiple packets to the same destination in a single syscall.

The tricky question here is: how can we achieve having multiple UDP packets ready at once so we can send them in a single syscall? Our TUN interface only feeds us packets one at a time and `connlib`'s state machine is single-threaded. Additionally, we currently only have a single `EncryptBuffer` in which the to-be-sent datagram sits.

## 1. Stack-allocating encrypted IP packets

As a first step, we get rid of the single `EncryptBuffer` and instead stack-allocate each encrypted IP packet. Due to our small MTU, these packets are only around 1300 bytes. Stack-allocating them requires a few memcpy's, but those are in the single-digit % range in terms of CPU time. That is nothing compared to how much time we are spending on UDP syscalls. With the `EncryptBuffer` out of the way, we can now "freely" move the `EncryptedPacket` structs around and - technically - we can have multiple of them at the same time.

## 2. Implementing GSO

The GSO interface allows you to pass multiple packets of the same length and for the same destination in a single syscall, meaning we cannot just batch up arbitrary UDP packets. Counterintuitively, making use of GSO requires us to do more copying: in particular, we change the interface of `Io` such that "sending" a packet essentially performs a lookup of a `BytesMut` buffer by destination and packet length and appends the payload to that buffer.

## 3. Batch-read IP packets

In order to actually perform GSO, we need to process more than a single IP packet in one event-loop tick. We achieve this by batch-reading up to 50 IP packets from the mpsc channel that connects `connlib`'s main event loop with the dedicated thread that reads and writes to the TUN device. These reads and writes happen concurrently to `connlib`'s packet processing. Thus, it is likely that by the time `connlib` is ready to process another IP packet, multiple have been read from the device and are sitting in the channel. Batch-processing these IP packets means that the buffers in our `GsoQueue` are more likely to contain more than a single datagram.

Imagine you are running a file upload. The OS will send many packets to the same destination IP - likely at max MTU - to the TUN device. It is likely that we read 10-20 of these packets in one batch (i.e. within a single "tick" of the event loop). All of them will be appended to the same buffer in the `GsoQueue` and, on the next event-loop tick, flushed out in a single syscall.

## Results

Overall, this results in a significant reduction of syscalls for sending UDP messages. In [1], we spend only a total of 16% of our CPU time in `udpv6_sendmsg`, whereas in [0] (`main`), we spent a total of 34%. Do note that these numbers are relative to the total CPU time spent per program run and thus can't be compared directly (i.e. you cannot just do 34 - 16 and say we now spend 18% less time sending UDP packets). Nevertheless, this appears to be a great improvement.

In terms of throughput, we achieve a ~60% improvement in our benchmark suite. That one is running on localhost though, so it might not be reflected exactly like that in a real network.
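The lookup-and-append scheme from step 2 can be sketched as follows. This is an illustrative model of the idea, not `connlib`'s actual `GsoQueue` (which uses `BytesMut`; a plain `Vec<u8>` is used here to stay dependency-free, and the GSO rule that only the final segment may be shorter is elided):

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Sketch of the batching idea: datagrams for the same (destination,
/// segment-size) pair are appended to one contiguous buffer and later
/// flushed to the OS in a single GSO-enabled syscall.
#[derive(Default)]
struct GsoQueue {
    buffers: HashMap<(SocketAddr, usize), Vec<u8>>,
}

impl GsoQueue {
    /// "Sending" a packet is just a lookup by destination and packet length,
    /// followed by appending the payload to the matching buffer.
    fn send(&mut self, dst: SocketAddr, payload: &[u8]) {
        self.buffers
            .entry((dst, payload.len()))
            .or_default()
            .extend_from_slice(payload);
    }

    /// Drain all buffers. Each yielded (dst, segment_size, bytes) triple
    /// would be handed to the OS in ONE syscall, with `segment_size` passed
    /// as the `UDP_SEGMENT` control message so the kernel splits the buffer
    /// back into individual datagrams on the wire.
    fn flush(&mut self) -> Vec<(SocketAddr, usize, Vec<u8>)> {
        self.buffers
            .drain()
            .map(|((dst, seg), buf)| (dst, seg, buf))
            .collect()
    }
}
```

Because tracking is always by segment size, a flushed buffer's length is an exact multiple of its segment size, which is why there are never any trailing bytes.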