
feat(connlib): decrease connection setup latency #4022

Merged: thomaseizinger merged 2 commits into main from feat/connlib/always-poll-timeout on Mar 9, 2024

Conversation

thomaseizinger (Member)

snownet is built in a SANS-IO way, which means it doesn't have internal timers or IO. It is up to the upper layer to check poll_timeout and call handle_timeout as soon as that deadline expires. The point at which we want to be woken again (i.e. the result of poll_timeout) may change every time snownet's internal state changes. This is especially critical during the initial setup of a connection.

As we learn about our own candidates and the candidates from the other party, we form new pairs. To detect whether a pair is a viable network path, we need to send a STUN request, and when those requests are sent is driven entirely by timers. A newly formed pair should send its first STUN request as soon as possible to minimize latency.

Previously, we did not update the timer that "wakes" snownet via handle_timeout. As a result, we waited unnecessarily long before sending STUN requests to newly formed pairs. With this patch, we check poll_timeout at the end of the Tunnel's poll function and immediately call handle_timeout if the deadline has already passed.
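
For illustration, the contract looks roughly like this. This is a minimal sketch with a made-up `Node`, not snownet's real API: after every state change, the driver re-reads `poll_timeout` and, if the deadline has already passed, calls `handle_timeout` immediately instead of waiting for a previously armed timer.

```rust
use std::time::{Duration, Instant};

// Stand-in for `snownet::Node`; the real state machine is more involved.
struct Node {
    next_timeout: Option<Instant>,
}

impl Node {
    /// When `handle_timeout` needs to be called next. Pure accessor, no IO.
    fn poll_timeout(&self) -> Option<Instant> {
        self.next_timeout
    }

    /// Advances the state machine to `now`, e.g. emitting due STUN requests.
    fn handle_timeout(&mut self, now: Instant) {
        // A real node would queue binding requests for fresh candidate pairs here.
        self.next_timeout = Some(now + Duration::from_millis(50));
    }
}

/// Called by the upper layer after every state change (new candidates, new pairs, ...).
fn drive(node: &mut Node) {
    while let Some(deadline) = node.poll_timeout() {
        let now = Instant::now();

        if deadline > now {
            // Arm an async timer for `deadline` and go back to sleep.
            break;
        }

        node.handle_timeout(now);
    }
}

fn main() {
    let mut node = Node { next_timeout: Some(Instant::now()) };
    drive(&mut node);
}
```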

Currently, str0m throttles updates to handle_timeout in 50ms blocks which still creates some delay. With that commented out, I observed improvements of ~0.7s for establishing new connections. Most of the time, the 2nd ping already goes through!



github-actions bot commented Mar 7, 2024

Terraform Cloud Plan Output

Plan: 8 to add, 7 to change, 8 to destroy.



github-actions bot commented Mar 7, 2024

Performance Test Results

TCP

| Test Name | Received/s | Sent/s | Retransmits |
|---|---|---|---|
| direct-tcp-client2server | 62.1 MiB (-3%) | 62.6 MiB (-3%) | 68 (+33%) |
| direct-tcp-server2client | 53.3 MiB (-2%) | 53.6 MiB (-2%) | 37 (-5%) |
| relayed-tcp-client2server | 29.4 MiB (+1%) | 31.7 MiB (+1%) | 67 (-66%) |
| relayed-tcp-server2client | 27.8 MiB (-0%) | 30.6 MiB (+1%) | 741 (+187%) |

UDP

| Test Name | Total/s | Jitter | Lost |
|---|---|---|---|
| direct-udp-client2server | 50.0 MiB (-0%) | 0.21ms (+12%) | 0.21% (+Infinity%) |
| direct-udp-server2client | 50.0 MiB (-0%) | 0.19ms (-16%) | 0.00% (NaN%) |
| relayed-udp-client2server | 50.0 MiB (-0%) | 0.35ms (+5%) | 19.37% (+2%) |
| relayed-udp-server2client | 50.0 MiB (-0%) | 0.85ms (+106%) | 27.60% (+8%) |

@thomaseizinger (Member, Author)

> Currently, str0m throttles updates to handle_timeout in 50ms blocks which still creates some delay. With that commented out, I observed improvements of ~0.7s for establishing new connections. Most of the time, the 2nd ping already goes through!

This is being dealt with in algesten/str0m#477.

@thomaseizinger (Member, Author)

> Currently, str0m throttles updates to handle_timeout in 50ms blocks which still creates some delay. With that commented out, I observed improvements of ~0.7s for establishing new connections. Most of the time, the 2nd ping already goes through!
>
> This is being dealt with in algesten/str0m#477.

This has been merged, we just need to update our fork now.

@jamilbk (Member) left a comment


Just to make sure I'm understanding this from a high level:

  • snownet uses a single-threaded event loop to manage things (like a game rendering engine)
  • The rate at which we wake up the loop to process new data is fixed with a timer
  • Previously, we waited until the next tick to handle the new candidates that may have arrived between our last tick and now
  • With this PR, we "interrupt" instead, and reset the timer for the main eventloop back to 0.

Do I have that right?

Comment on lines 390 to 402
```diff
 fn poll_timeout(&mut self, cx: &mut Context<'_>) -> Poll<()> {
     if let Some(timeout) = self.node.poll_timeout() {
         let timeout = tokio::time::Instant::from_std(timeout);

         if timeout != self.timeout.deadline() {
             self.timeout.as_mut().reset(timeout)
         }
     }

-    Poll::Pending
+    ready!(self.timeout.poll_unpin(cx));
+    self.node.handle_timeout(self.timeout.deadline().into());
+
+    Poll::Ready(())
```
Collaborator:

Okay, so this is exactly like quinn-proto: this tells the driver when to next wake up the state machine for a timeout?

@thomaseizinger (Member, Author):

Yes!

```rust
sockets: Sockets::new()?,
stats_timer: tokio::time::interval(Duration::from_secs(60)),
timeout: Box::pin(tokio::time::sleep_until(Instant::now().into())),
```
Collaborator:

This is when the connection is first starting? Why is the timeout set to now? Just because it must have some value, and now is a value that will be in the past so it won't have a spurious wakeup later?

@thomaseizinger (Member, Author):

Yes.

@thomaseizinger (Member, Author):

I guess it would be nicer to set it to None and thus correctly handle the (hypothetical) case of Node::poll_timeout returning None.
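
Roughly like this (an illustrative sketch only, assuming the same tokio and futures utilities the eventloop already uses; field names are made up and this is not the actual connlib code):

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::FutureExt; // for `poll_unpin`
use tokio::time::{sleep_until, Instant, Sleep};

struct ConnectionState {
    // `None` means "the node currently has no deadline", so no timer is armed at all.
    timeout: Option<Pin<Box<Sleep>>>,
}

impl ConnectionState {
    /// Mirrors the node's deadline into the (optional) timer.
    fn reset_timeout(&mut self, deadline: Option<Instant>) {
        let Some(deadline) = deadline else {
            self.timeout = None; // Nothing to wake up for.
            return;
        };

        match &mut self.timeout {
            Some(timer) if timer.deadline() != deadline => timer.as_mut().reset(deadline),
            Some(_) => {} // Deadline unchanged; keep the existing timer armed.
            none => *none = Some(Box::pin(sleep_until(deadline))),
        }
    }

    /// Resolves with the deadline once the timer fires; pending forever if there is none.
    fn poll_timeout(&mut self, cx: &mut Context<'_>) -> Poll<Instant> {
        let Some(timer) = self.timeout.as_mut() else {
            return Poll::Pending;
        };

        futures::ready!(timer.poll_unpin(cx));

        Poll::Ready(timer.deadline())
    }
}
```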


```rust
cx.waker().wake_by_ref();

fn poll_timeout(&mut self, cx: &mut Context<'_>) -> Poll<()> {
    if let Some(timeout) = self.node.poll_timeout() {
```
Collaborator:

The Node has its own timeout in addition to the ConnectionState?

What does the ConnectionState's timeout do?

Collaborator:

Node's poll_timeout only indicates the next time handle_timeout needs to be called, and it keeps returning the same value until something changes in the Node's internal state. It's not a future, just an Instant.

ConnectionState's timeout is a future. It is used to keep track of Node's timeout and to have the executor poll the Tunnel when that deadline is reached, so we can pass the event down to Node to handle it.

Comment on lines +126 to 131
```rust
// After any state change, check what the new timeout is and reset it if necessary.
if self.connections_state.poll_timeout(cx).is_ready() {
    cx.waker().wake_by_ref()
}

Poll::Pending
```
Collaborator:

Just because I'm still getting used to the poll_ style, do I understand this right?

  • We call poll_timeout on this connections_state object and pass it our context, so if it needs to register a wakeup, it can do that
  • If the connections state has something ready, we do a wake_by_ref to tell the driver / runtime, "Immediately call me again, there might be more work to do"
  • We return Pending because whatever connections_state needs to do is already handled above here and we don't want to duplicate calling it in these other two blocks. But probably the runtime will immediately wake up our task again, we'll check connections_state, and we'll either tick it or return some Ready it returned to us.
  • In general if we return Ready or have a subtask (sub state machine?) that's ready, the caller should keep polling us until we return Pending and neither us nor our subtasks register an immediate wakeup? e.g. if there were no timeouts somehow, we would be polled until everything was done, and then only wake up when a new packet comes in?
  • And the runtime doesn't actually have a concept of "Keep polling until" because wake_by_ref is how we tell it, "I know I'm saying Pending but call me again as soon as you can schedule me, I may have more work to do"
  • I'm guessing connections_state manages several connections, so internally its poll_timeout checks on something like a timer heap, so if we have 5 connections, it just tells us which timeout will happen earliest.
  • All this runs in a single thread because the runtime overhead of locks and the dev overhead of multi-threading (having to deal with everything being &self and mixing sync and async) is not worth it, since we are not CPU-bound. If we needed an outrageous amount of bandwidth, like 1 Tbps, or if cryptography was more expensive, one CPU core would not suffice, but for our case of ballpark 1 Gbps, there's no need to use extra cores?

@thomaseizinger (Member, Author):

Yes, you've got it all right :)

The runtime ends up polling a future that is never actually Ready, because returning Ready would shut the eventloop down.
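
For illustration (a made-up sketch, not our actual eventloop), the "never Ready" pattern plus the wake_by_ref trick looks roughly like this:

```rust
use std::convert::Infallible;
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Illustrative only: a top-level future in the style discussed above.
// `Eventloop` and its field are made up for this sketch.
struct Eventloop {
    has_more_work: bool,
}

impl Future for Eventloop {
    // Returning `Ready` would terminate the task, so the output type makes
    // that value impossible to construct.
    type Output = Infallible;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // ... drain sockets, advance state machines, reset timers ...

        if self.has_more_work {
            self.has_more_work = false;

            // We still return `Pending`, but ask the runtime to schedule us
            // again as soon as possible: there may be more work to do.
            cx.waker().wake_by_ref();
        }

        Poll::Pending
    }
}
```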

@thomaseizinger (Member, Author):

At the moment, the eventloops are still a bit fragile with regard to the polling rules, but I am hoping to refactor that and make it crystal clear and easy to follow!

@thomaseizinger (Member, Author):

> All this runs in a single thread because the runtime overhead of locks and the dev overhead of multi-threading (having to deal with everything being &self and mixing sync and async) is not worth it, since we are not CPU-bound. If we needed an outrageous amount of bandwidth, like 1 Tbps, or if cryptography was more expensive, one CPU core would not suffice, but for our case of ballpark 1 Gbps, there's no need to use extra cores?

That is the bet, yeah, and so far it seems to be turning out okay :)

There is even more optimisation potential, like io_uring, to further reduce CPU overhead. It would be interesting to know what link speeds you need before this design fails.

@ReactorScram (Collaborator) left a comment

I left some questions trying to make sure I understand it. Also:

  • Are there tests in CI that show the latency decreasing?
  • Is the str0m dep already pointing to their main or does that get updated separately? You mentioned that they merged a patch, but I don't see a recent release on their repo.

Replying to Jamil's comment:

There can't be a fixed wake-up like the 60 FPS graphics tick in a game engine, right?

The 50 ms limit just means that some part of the code refused to 'tick' any faster than every 50 ms, so if some handshake needed 14 ticks to complete, that adds up to 700 ms? But in general if there are no incoming packets, it can sleep for any number of seconds until the next timeout, which is probably a timeout for a keepalive packet, right?

I assume there's something like a timer heap internally, and then the runtime drives it by polling something like Quinn's poll_timeout (https://docs.rs/quinn-proto/0.10.6/quinn_proto/struct.Connection.html#method.poll_timeout) and setting an async timer to sleep until that timeout?
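
To make that guess concrete, something like this is what I have in mind (purely illustrative, made-up names, not the actual implementation):

```rust
use std::time::Instant;

// Illustrative only: `Connection` stands in for whatever per-connection state
// is kept internally; the real crates may organise this differently.
struct Connection {
    next_timeout: Option<Instant>,
}

impl Connection {
    fn poll_timeout(&self) -> Option<Instant> {
        self.next_timeout
    }
}

/// The aggregate timeout is simply the earliest deadline across all connections;
/// the driver then arms a single async timer for that instant.
fn poll_timeout(connections: &[Connection]) -> Option<Instant> {
    connections.iter().filter_map(|c| c.poll_timeout()).min()
}
```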

@ReactorScram (Collaborator)

I think I clicked the wrong button in GitHub and made 5 reviews. Oops. Anyway, I approve it.

@thomaseizinger (Member, Author)

> Are there tests in CI that show the latency decreasing?

Not a test directly, but together with the other PR that prints the setup time, it can be observed in the logs.

> Is the str0m dep already pointing to their main or does that get updated separately? You mentioned that they merged a patch, but I don't see a recent release on their repo.

It is not yet updated, no, because of the timing of when I opened this PR relative to the PR to str0m.

@thomaseizinger (Member, Author) commented Mar 8, 2024

> • snownet uses a single-threaded event loop to manage things (like a game rendering engine)
> • The rate at which we wake up the loop to process new data is fixed with a timer
> • Previously, we waited until the next tick to handle the new candidates that may have arrived between our last tick and now
> • With this PR, we "interrupt" instead, and reset the timer for the main eventloop back to 0.

@jamilbk Almost correct.

  • snownet doesn't have an eventloop at all; only the Tunnel does. A snownet::Node is just a state machine. The eventloop calls it and translates its outputs into side effects (sketched below).
  • We don't reset the timer to 0 but to whatever the Node wants it to be.
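
For example (illustrative only; the real Node emits richer events than this made-up one):

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};

// Made-up event type standing in for whatever the state machine emits.
enum Event {
    Transmit { dst: SocketAddr, payload: Vec<u8> },
}

// The state machine only *describes* IO; the eventloop performs it.
fn translate(socket: &UdpSocket, event: Event) -> io::Result<()> {
    match event {
        Event::Transmit { dst, payload } => {
            socket.send_to(&payload, dst)?;
        }
    }

    Ok(())
}
```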

@thomaseizinger force-pushed the feat/connlib/always-poll-timeout branch from 493e3d5 to 09b3663 on March 9, 2024 09:08
@thomaseizinger added this pull request to the merge queue on Mar 9, 2024
Merged via the queue into main with commit a2f289f on Mar 9, 2024
64 checks passed
@thomaseizinger deleted the feat/connlib/always-poll-timeout branch on March 9, 2024 09:42
github-merge-queue bot pushed a commit that referenced this pull request Mar 11, 2024
…#4047)

As part of implementing #4022,
I realised that I probably composed all the SANS-IO components in the
wrong way. This PR fixes this and adds some documentation on why.
Interestingly, this makes several parts of the design simpler because we end up passing around less state.