
feat(linux): multi-threaded TUN device operations#7449

Merged
thomaseizinger merged 9 commits into main from chore/multi-threaded-tun on Dec 5, 2024

@thomaseizinger thomaseizinger commented Dec 2, 2024

Member

Context

At present, we only have a single thread that reads from and writes to the TUN device on all platforms. On Linux, it is possible to open the file descriptor of a TUN device multiple times by setting the IFF_MULTI_QUEUE option using ioctl. Using multi-queue, we can then spawn multiple threads that concurrently read from and write to the TUN device. This is critical for achieving better throughput.
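
For readers unfamiliar with multi-queue TUN, here is a minimal, std-only sketch of what "opening the device multiple times" could look like. It is not the code from this PR. The constant values are taken from the Linux headers, the `TUNSETIFF` value assumes x86_64 or aarch64, the `ioctl` declaration assumes a glibc-style signature, and `IfReq`, `multi_queue_request`, `open_queue` and the use of `IFF_NO_PI` are assumptions made for illustration.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::raw::{c_int, c_short, c_ulong};
use std::os::unix::io::AsRawFd;

// Constants from <linux/if.h> and <linux/if_tun.h>.
const IFNAMSIZ: usize = 16;
const IFF_TUN: c_short = 0x0001;
const IFF_MULTI_QUEUE: c_short = 0x0100;
const IFF_NO_PI: c_short = 0x1000;
// _IOW('T', 202, int); this value is an assumption for x86_64 / aarch64.
const TUNSETIFF: c_ulong = 0x4004_54ca;

/// Hypothetical stand-in for the kernel's `struct ifreq` (40 bytes on 64-bit Linux).
#[repr(C)]
pub struct IfReq {
    pub name: [u8; IFNAMSIZ],
    pub flags: c_short,
    _pad: [u8; 22],
}

unsafe extern "C" {
    fn ioctl(fd: c_int, request: c_ulong, ...) -> c_int;
}

/// Builds the request that asks for one queue of a multi-queue TUN device.
pub fn multi_queue_request(name: &str) -> io::Result<IfReq> {
    let bytes = name.as_bytes();
    // Leave room for the trailing NUL byte the kernel expects.
    if bytes.is_empty() || bytes.len() >= IFNAMSIZ {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "bad interface name"));
    }
    let mut req = IfReq {
        name: [0; IFNAMSIZ],
        flags: IFF_TUN | IFF_NO_PI | IFF_MULTI_QUEUE,
        _pad: [0; 22],
    };
    req.name[..bytes.len()].copy_from_slice(bytes);
    Ok(req)
}

/// Opens `/dev/net/tun` and attaches the new descriptor to `name` as one queue.
/// Calling this N times with the same name is intended to yield N queues.
/// Requires CAP_NET_ADMIN on a real system.
pub fn open_queue(name: &str) -> io::Result<File> {
    let mut req = multi_queue_request(name)?;
    let file = OpenOptions::new().read(true).write(true).open("/dev/net/tun")?;
    // SAFETY: `req` is a valid, correctly sized request that outlives the call.
    let ret = unsafe { ioctl(file.as_raw_fd(), TUNSETIFF, &mut req as *mut IfReq) };
    if ret < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(file)
}
```

Each `File` returned by `open_queue` could then be handed to its own thread.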

Solution

IFF_MULTI_QUEUE is Linux-only and therefore only applies to the headless-client, the GUI-client on Linux and the Gateway (it may also be possible on Android; I haven't tried). As such, we first need to change our internal abstractions a bit to move the creation of the TUN thread into the Tun abstraction itself. For this, we change the interface of Tun to the following:

  • poll_recv_many: An API inspired by tokio's mpsc::Receiver, where multiple items in a channel can be received as one batch.
  • poll_send_ready: Mimics the API of Sink to check whether more items can be written.
  • send: Mimics the API of Sink to actually send an item.
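
To make the shape of these three functions concrete, here is a rough sketch. The signatures are guesses based on the descriptions above, not copied from the diff. `Packet`, `LoopbackTun` and `noop_waker` are hypothetical helpers that only exist to exercise the trait in memory.

```rust
use std::collections::VecDeque;
use std::io;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Placeholder packet type (an assumption for this sketch).
pub type Packet = Vec<u8>;

pub trait Tun {
    /// Appends up to `max` packets to `buf`; `Pending` if none are available.
    fn poll_recv_many(&mut self, cx: &mut Context<'_>, buf: &mut Vec<Packet>, max: usize) -> Poll<usize>;
    /// `Ready(Ok(()))` once `send` can accept another packet.
    fn poll_send_ready(&mut self, cx: &mut Context<'_>) -> Poll<io::Result<()>>;
    /// Queues one packet; meant to be called after `poll_send_ready` was ready.
    fn send(&mut self, packet: Packet) -> io::Result<()>;
}

/// Hypothetical in-memory device with a bounded queue.
pub struct LoopbackTun {
    queue: VecDeque<Packet>,
    capacity: usize,
    recv_waker: Option<Waker>,
    send_waker: Option<Waker>,
}

impl LoopbackTun {
    pub fn new(capacity: usize) -> Self {
        Self { queue: VecDeque::new(), capacity, recv_waker: None, send_waker: None }
    }
}

impl Tun for LoopbackTun {
    fn poll_recv_many(&mut self, cx: &mut Context<'_>, buf: &mut Vec<Packet>, max: usize) -> Poll<usize> {
        if self.queue.is_empty() {
            // Remember who to wake once a packet arrives.
            self.recv_waker = Some(cx.waker().clone());
            return Poll::Pending;
        }
        let n = max.min(self.queue.len());
        buf.extend(self.queue.drain(..n));
        // Space was freed, so a blocked sender may proceed.
        if let Some(waker) = self.send_waker.take() {
            waker.wake();
        }
        Poll::Ready(n)
    }

    fn poll_send_ready(&mut self, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        if self.queue.len() >= self.capacity {
            // Backpressure: the caller has to wait until the queue drains.
            self.send_waker = Some(cx.waker().clone());
            return Poll::Pending;
        }
        Poll::Ready(Ok(()))
    }

    fn send(&mut self, packet: Packet) -> io::Result<()> {
        if self.queue.len() >= self.capacity {
            return Err(io::Error::new(io::ErrorKind::WouldBlock, "queue full"));
        }
        self.queue.push_back(packet);
        if let Some(waker) = self.recv_waker.take() {
            waker.wake();
        }
        Ok(())
    }
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Waker that does nothing; handy for polling by hand in examples.
pub fn noop_waker() -> Waker {
    Waker::from(Arc::new(NoopWake))
}
```

Splitting the write path into `poll_send_ready` and `send` lets the caller hold on to a packet until the device can actually take it, rather than dropping it or buffering without bound.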

With these APIs in place, we can implement various (performance) improvements for the different platforms.

  • On Linux, this allows us to spawn multiple threads that read from and write to the TUN device and send all packets into the same channel. The Io component of connlib then uses poll_recv_many to read batches of up to 100 packets at once. This ties in well with #7210 (feat(connlib): utilise GSO for UDP sockets) because we can then use GSO to send the encrypted packets in single syscalls to the OS.
  • On Windows, we already have a dedicated recv thread because WinTun's most-convenient API uses blocking IO. As such, we can now also tie into that by batch-receiving from this channel.
  • In addition to using multiple threads, this API now also uses correct readiness checks on Linux, Darwin and Android to uphold backpressure in case we cannot write to the TUN device.
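
The fan-in described in the list above can be sketched with the standard library alone. This is a blocking stand-in for the async `poll_recv_many` path, using `std::sync::mpsc` instead of the tokio channel. `spawn_readers`, `recv_many` and the dummy packets are hypothetical and only illustrate the pattern of several reader threads, one bounded channel and batch receive.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Placeholder packet type (an assumption for this sketch).
pub type Packet = Vec<u8>;

/// Spawns `threads` producers that each push `per_thread` dummy packets
/// into one shared, bounded channel.
pub fn spawn_readers(threads: usize, per_thread: usize, capacity: usize) -> Receiver<Packet> {
    let (tx, rx) = sync_channel(capacity);
    for id in 0..threads {
        let tx = tx.clone();
        thread::spawn(move || {
            for seq in 0..per_thread {
                // A real thread would block in `read` on its own TUN queue here.
                // `send` blocks while the channel is full, which is the backpressure.
                if tx.send(vec![id as u8, seq as u8]).is_err() {
                    return; // The receiver is gone, so stop reading.
                }
            }
        });
    }
    // The original `tx` is dropped here, so the channel closes once all threads finish.
    rx
}

/// Blocks for the first packet, then drains whatever is already queued, up to `max`.
/// Returns 0 once every sender has hung up.
pub fn recv_many(rx: &Receiver<Packet>, buf: &mut Vec<Packet>, max: usize) -> usize {
    if max == 0 {
        return 0;
    }
    let Ok(first) = rx.recv() else { return 0 };
    buf.push(first);
    let mut received = 1;
    while received < max {
        match rx.try_recv() {
            Ok(packet) => {
                buf.push(packet);
                received += 1;
            }
            Err(_) => break, // Nothing queued right now (or all senders are gone).
        }
    }
    received
}
```

With two threads and `max` set to 100, the consumer pays the wake-up cost once per batch rather than once per packet.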

Configuration

Local testing has shown that 2 threads give the best performance for a local iperf3 run. I suspect this is because there is only so much traffic that a single application (i.e. iperf3) can generate. With more than 2 threads, the throughput actually drops drastically because connlib's main thread is too busy with lock contention and triggering Wakers for the TUN threads (which mostly idle if there are 4+ of them). I've made it configurable on the Gateway though, so we can experiment with this during concurrent speedtests etc.
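
As a rough illustration of such a knob, here is one way a thread count with a default of 2 could be parsed. The environment variable name `TUN_THREADS` and both function names are made up for this sketch. The actual option name on the Gateway is not stated in this description.

```rust
/// Default taken from the local iperf3 testing described above.
pub const DEFAULT_TUN_THREADS: usize = 2;

/// Parses a thread count, falling back to the default on missing,
/// unparsable or zero values.
pub fn parse_tun_threads(raw: Option<&str>) -> usize {
    raw.and_then(|value| value.trim().parse::<usize>().ok())
        .filter(|count| *count >= 1)
        .unwrap_or(DEFAULT_TUN_THREADS)
}

/// Reads the hypothetical `TUN_THREADS` environment variable.
pub fn tun_threads_from_env() -> usize {
    parse_tun_threads(std::env::var("TUN_THREADS").ok().as_deref())
}
```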

In addition, switching connlib to a single-threaded tokio runtime further increased the throughput. I suspect this is due to less task / context switching.

Results

Local testing with iperf3 shows some very promising results. We now achieve a throughput of roughly 2 Gbit/s (1.98 Gbit/s averaged over 10 seconds, peaking at 2.34 Gbit/s).

Connecting to host 172.20.0.110, port 5201
Reverse mode, remote host 172.20.0.110 is sending
[  5] local 100.80.159.34 port 57040 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   274 MBytes  2.30 Gbits/sec
[  5]   1.00-2.00   sec   279 MBytes  2.34 Gbits/sec
[  5]   2.00-3.00   sec   216 MBytes  1.82 Gbits/sec
[  5]   3.00-4.00   sec   224 MBytes  1.88 Gbits/sec
[  5]   4.00-5.00   sec   234 MBytes  1.96 Gbits/sec
[  5]   5.00-6.00   sec   238 MBytes  2.00 Gbits/sec
[  5]   6.00-7.00   sec   229 MBytes  1.92 Gbits/sec
[  5]   7.00-8.00   sec   222 MBytes  1.86 Gbits/sec
[  5]   8.00-9.00   sec   223 MBytes  1.87 Gbits/sec
[  5]   9.00-10.00  sec   217 MBytes  1.82 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec  22247             sender
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec                  receiver

iperf Done.

This is a pretty solid improvement over what is in main, roughly 2.6x the average bitrate (1.98 Gbit/s vs. 769 Mbit/s):

Connecting to host 172.20.0.110, port 5201
[  5] local 100.65.159.3 port 56970 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  90.4 MBytes   758 Mbits/sec  1800    106 KBytes
[  5]   1.00-2.00   sec  93.4 MBytes   783 Mbits/sec  1550   51.6 KBytes
[  5]   2.00-3.00   sec  92.6 MBytes   777 Mbits/sec  1350   76.8 KBytes
[  5]   3.00-4.00   sec  92.9 MBytes   779 Mbits/sec  1800   56.4 KBytes
[  5]   4.00-5.00   sec  93.4 MBytes   783 Mbits/sec  1650   69.6 KBytes
[  5]   5.00-6.00   sec  90.6 MBytes   760 Mbits/sec  1500   73.2 KBytes
[  5]   6.00-7.00   sec  87.6 MBytes   735 Mbits/sec  1400   76.8 KBytes
[  5]   7.00-8.00   sec  92.6 MBytes   777 Mbits/sec  1600   82.7 KBytes
[  5]   8.00-9.00   sec  91.1 MBytes   764 Mbits/sec  1500   70.8 KBytes
[  5]   9.00-10.00  sec  92.0 MBytes   771 Mbits/sec  1550   85.1 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   917 MBytes   769 Mbits/sec  15700             sender
[  5]   0.00-10.00  sec   916 MBytes   768 Mbits/sec                  receiver

iperf Done.

@jamilbk jamilbk left a comment

Member

Will take a look with a fresh mind (+ coffee) tomorrow, I want to understand this one a little more deeply.

@thomaseizinger

Member Author

> Will take a look with a fresh mind (+ coffee) tomorrow, I want to understand this one a little more deeply.

Okay, I can tidy up the commits a bit then.

@thomaseizinger thomaseizinger force-pushed the chore/multi-threaded-tun branch from bf11c13 to 22e4ad7 on December 4, 2024 22:44
@thomaseizinger

Member Author

@jamilbk Commits are now atomic and should be easier to review.

Connlib consists primarily of one task that processes IP packets
sequentially. Moving this task between different tokio worker threads is
actually counter-productive for performance due to context-switching
overhead.

By explicitly creating a single-threaded runtime, we avoid this.

Benchmarking has shown that running 2 threads for the TUN on Linux gives
the best performance. With more than two threads, `connlib`'s main
thread spends too much time on lock contention and on the "waking" code
around the channel shared with the TUN read/write threads.
@thomaseizinger thomaseizinger added this pull request to the merge queue Dec 5, 2024
Merged via the queue into main with commit 90cf191 Dec 5, 2024
@thomaseizinger thomaseizinger deleted the chore/multi-threaded-tun branch December 5, 2024 00:34