feat(linux): multi-threaded TUN device operations#7449
Merged
jamilbk
reviewed
Dec 4, 2024
jamilbk
left a comment
Member
Will take a look with a fresh mind (+ coffee) tomorrow, I want to understand this one a little more deeply.
Member
Author
Okay, I can tidy up the commits a bit then.
Member
Author
@jamilbk Commits are now atomic and should be easier to review.
Connlib consists primarily of one task that processes IP packets sequentially. Moving this task between different tokio worker threads is actually counter-productive for performance due to context-switching overhead. By explicitly creating a single-threaded runtime, we avoid this.
Benchmarking has shown that running 2 threads for the TUN device on Linux gives the best performance. With more than two threads, `connlib`'s main thread spends too much time on lock contention and on the "waking" code around the channel it shares with the TUN read/write threads.
jamilbk
approved these changes
Dec 5, 2024
Context
At present, we only have a single thread that reads and writes to the TUN device on all platforms. On Linux, it is possible to open the file descriptor of a TUN device multiple times by setting the `IFF_MULTI_QUEUE` option using `ioctl`. Using multi-queue, we can then spawn multiple threads that concurrently read and write to the TUN device. This is critical for achieving a better throughput.

Solution
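
As background, here is a minimal sketch of the request that enables multi-queue on a TUN device. It only builds the `ifreq`-shaped argument; on a real system this would be passed to `ioctl(fd, TUNSETIFF, &req)` once per opened `/dev/net/tun` descriptor, using the same interface name each time. The `IfReq` struct, the `multi_queue_request` helper and the use of `IFF_NO_PI` are assumptions made for illustration, not code from this PR.

```rust
// Flag values from <linux/if.h> and <linux/if_tun.h>.
const IFNAMSIZ: usize = 16;
const IFF_TUN: i16 = 0x0001;
const IFF_MULTI_QUEUE: i16 = 0x0100;
const IFF_NO_PI: i16 = 0x1000; // assumption: no packet-info header is wanted

/// `_IOW('T', 202, int)` as encoded on x86 and ARM.
/// Other architectures may encode ioctl numbers differently.
const TUNSETIFF: u64 = 0x4004_54ca;

/// Hypothetical stand-in for `struct ifreq`: the interface name, the flags
/// member of the union, and padding for the remainder of the union.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
struct IfReq {
    name: [u8; IFNAMSIZ],
    flags: i16,
    _pad: [u8; 22],
}

/// Builds the request that asks the kernel for one queue of a multi-queue
/// TUN device. Returns `None` if the name does not fit with its trailing NUL.
fn multi_queue_request(name: &str) -> Option<IfReq> {
    let bytes = name.as_bytes();
    if bytes.len() >= IFNAMSIZ || bytes.contains(&0) {
        return None;
    }

    let mut req = IfReq {
        name: [0; IFNAMSIZ],
        flags: IFF_TUN | IFF_NO_PI | IFF_MULTI_QUEUE,
        _pad: [0; 22],
    };
    req.name[..bytes.len()].copy_from_slice(bytes);

    Some(req)
}
```

Each thread would then own one of the resulting file descriptors, which is what allows concurrent reads and writes without sharing a single descriptor.
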
`IFF_MULTI_QUEUE` is a Linux-only thing and therefore only applies to headless-client, GUI-client on Linux and the Gateway (it may also be possible on Android, I haven't tried). As such, we need to first change our internal abstractions a bit to move the creation of the TUN thread to the `Tun` abstraction itself. For this, we change the interface of `Tun` to the following:
- `poll_recv_many`: An API, inspired by tokio's `mpsc::Receiver`, where multiple items in a channel can be batch-received.
- `poll_send_ready`: Mimics the API of `Sink` to check whether more items can be written.
- `send`: Mimics the API of `Sink` to actually send an item.
With these APIs in place, we can implement various (performance) improvements for the different platforms.
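
To make the shape of this interface concrete, here is a rough sketch using only the standard library. The exact signatures in the PR may differ; the `Packet` alias, the `QueueTun` in-memory implementation and the `noop_waker` helper are hypothetical and exist only so the sketch can be driven by hand.

```rust
use std::collections::VecDeque;
use std::io;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical packet type; a plain byte buffer keeps the sketch self-contained.
type Packet = Vec<u8>;

/// Assumed shape of the three operations named above.
trait Tun {
    /// Moves up to `max` packets into `buf` and returns how many were added.
    fn poll_recv_many(&mut self, cx: &mut Context<'_>, buf: &mut Vec<Packet>, max: usize) -> Poll<usize>;
    /// Ready once `send` can accept another packet.
    fn poll_send_ready(&mut self, cx: &mut Context<'_>) -> Poll<io::Result<()>>;
    /// Queues one packet; callers are expected to check `poll_send_ready` first.
    fn send(&mut self, packet: Packet) -> io::Result<()>;
}

/// In-memory stand-in for a device: two bounded queues instead of a file descriptor.
struct QueueTun {
    inbound: VecDeque<Packet>,
    outbound: VecDeque<Packet>,
    outbound_capacity: usize,
    recv_waker: Option<Waker>,
    send_waker: Option<Waker>,
}

impl QueueTun {
    fn new(outbound_capacity: usize) -> Self {
        Self {
            inbound: VecDeque::new(),
            outbound: VecDeque::new(),
            outbound_capacity,
            recv_waker: None,
            send_waker: None,
        }
    }

    /// Simulates a packet arriving from the device and wakes a pending reader.
    fn push_inbound(&mut self, packet: Packet) {
        self.inbound.push_back(packet);
        if let Some(waker) = self.recv_waker.take() {
            waker.wake();
        }
    }

    /// Simulates the device consuming a packet and wakes a pending writer.
    fn pop_outbound(&mut self) -> Option<Packet> {
        let packet = self.outbound.pop_front();
        if let Some(waker) = self.send_waker.take() {
            waker.wake();
        }
        packet
    }
}

impl Tun for QueueTun {
    fn poll_recv_many(&mut self, cx: &mut Context<'_>, buf: &mut Vec<Packet>, max: usize) -> Poll<usize> {
        if self.inbound.is_empty() {
            // Remember the task so `push_inbound` can wake it later.
            self.recv_waker = Some(cx.waker().clone());
            return Poll::Pending;
        }
        let n = max.min(self.inbound.len());
        buf.extend(self.inbound.drain(..n));
        Poll::Ready(n)
    }

    fn poll_send_ready(&mut self, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        if self.outbound.len() < self.outbound_capacity {
            return Poll::Ready(Ok(()));
        }
        self.send_waker = Some(cx.waker().clone());
        Poll::Pending
    }

    fn send(&mut self, packet: Packet) -> io::Result<()> {
        if self.outbound.len() >= self.outbound_capacity {
            return Err(io::Error::new(io::ErrorKind::WouldBlock, "outbound queue is full"));
        }
        self.outbound.push_back(packet);
        Ok(())
    }
}

/// Waker that does nothing, for polling the sketch outside of a runtime.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn noop_waker() -> Waker {
    Waker::from(Arc::new(NoopWake))
}
```

Splitting the write path into `poll_send_ready` and `send` lets the caller check for capacity before it commits to producing a packet, which is the same split the `Sink` trait makes.
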
The `Io` component of `connlib` then uses `poll_recv_many` to read batches of up to 100 packets at once. This ties in well with feat(connlib): utilise GSO for UDP sockets #7210 because we can then use GSO to send the encrypted packets in single syscalls to the OS.
WinTun's most-convenient API uses blocking IO. As such, we can now also tie into that by batch-receiving from this channel.

Configuration
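
For experimenting with the thread count, the overall arrangement (several TUN threads feeding one bounded channel that the main task drains in batches of up to 100) can be sketched as follows. This is a std-only simulation: the reader threads fabricate packets instead of reading from a real TUN queue, and `spawn_readers` and `recv_many` are hypothetical names rather than functions from `connlib`, which uses an async channel instead of `std::sync::mpsc`.

```rust
use std::sync::mpsc::{Receiver, SyncSender};
use std::thread;

type Packet = Vec<u8>;

/// Spawns `num_threads` reader threads that all push into the same bounded channel.
fn spawn_readers(
    num_threads: usize,
    packets_per_thread: usize,
    tx: SyncSender<Packet>,
) -> Vec<thread::JoinHandle<()>> {
    (0..num_threads)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                for seq in 0..packets_per_thread {
                    // Stand-in for a blocking read on this thread's TUN queue.
                    let packet = vec![id as u8, seq as u8];
                    // `send` blocks while the channel is full, which applies backpressure.
                    if tx.send(packet).is_err() {
                        return; // receiver is gone, stop reading
                    }
                }
            })
        })
        .collect()
}

/// Waits for one packet, then takes whatever else is already queued, up to `max`.
/// Returns 0 once every reader thread has finished and the channel is empty.
fn recv_many(rx: &Receiver<Packet>, buf: &mut Vec<Packet>, max: usize) -> usize {
    if max == 0 {
        return 0;
    }
    let first = match rx.recv() {
        Ok(packet) => packet,
        Err(_) => return 0,
    };
    buf.push(first);

    let mut received = 1;
    while received < max {
        match rx.try_recv() {
            Ok(packet) => {
                buf.push(packet);
                received += 1;
            }
            Err(_) => break, // nothing else queued right now
        }
    }
    received
}
```

Calling `spawn_readers(2, ..)` versus `spawn_readers(4, ..)` with the same consumer loop is a cheap way to see how often the consumer ends up with small batches, although real contention depends on the actual device and channel implementation.
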
Local testing has shown that 2 threads give the best performance for a local `iperf3` run. I suspect this is because there is only so much traffic that a single application (i.e. `iperf3`) can generate. With more than 2 threads, the throughput actually drops drastically because `connlib`'s main thread is too busy with lock contention and triggering `Waker`s for the TUN threads (which mostly idle around if there are 4+ of them). I've made it configurable on the Gateway though so we can experiment with this during concurrent speedtests etc.

In addition, switching `connlib` to a single-threaded tokio runtime further increased the throughput. I suspect this is due to less task / context switching.

Results
Local testing with `iperf3` shows some very promising results. We now achieve a throughput of 2+ Gbit/s.
This is a pretty solid improvement over what is in `main`: