
Conversation

@thomaseizinger thomaseizinger commented Sep 27, 2024

Within snownet - connlib's connectivity library - we use ICE to set up a UDP "connection" between a client and a gateway. UDP is an unreliable transport, meaning the only way we can detect that the connection is broken is for both parties to constantly send messages and acknowledgements back and forth. ICE uses STUN binding requests for this.

In the default configuration of str0m, a STUN binding request is sent every 3s, and we tolerate at most 9 missing responses before we consider the connection broken. As these responses go missing, str0m halves this interval, which results in a total ICE timeout of around 17 seconds. We already tweak these values by reducing the number of requests to 8 and setting the interval to 1.5s. This results in a total ICE timeout of ~10s, which effectively means there is at most a 10s lag between the connection breaking and us considering it broken, at which point new packets arriving at the TUN interface can trigger the setup of a new connection with the gateway.

Lowering these timeouts improves the user experience in case of a broken connection because the user doesn't have to wait as long before they can access their resources again. The downside of lowering these timeouts is that we generate a lot of background noise. Especially on mobile devices, this is bad because it prevents the CPU from going to sleep and thus simply being signed into Firezone will drain your battery, even if you don't use it.

Note that this doesn't apply at all if the client application on top detects a network change. In that case, we hard-reset all connections and instantly create new ones.

We attempted to fix this in #5576 by closing idle connections after 5 minutes. This however created new problems such as #6778.

The original problem here is that we send too many STUN messages as soon as a connection is established. Simply increasing the timeout is not an option because it would make the user experience really bad in case the connection actually drops for reasons that the client app can't detect.

In this patch, we attempt to solve this in a different way: Detecting a broken connection is only critical if the user is actively using the tunnel (i.e. sending traffic). If there is no traffic, it doesn't matter if we need longer to detect a broken connection. The user won't notice because their phone is probably in their pocket or something.

With this patch, we now implement the following behaviour:

  • A connection is considered idle after 10s of no application traffic.
  • On idle connections, we send a STUN request every 60s.
  • On idle connections, we wait for at most 4 missing responses before considering the connection broken.
  • Every connection will perform a client-initiated WireGuard keep-alive every 25s, unless there is application traffic.
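The switch between the two profiles can be sketched roughly as follows (a minimal illustration using the values above; `KeepAlivePolicy` and the constant names are hypothetical, not snownet's actual API):

```rust
use std::time::{Duration, Instant};

// Illustrative constants taken from the description above.
const IDLE_AFTER: Duration = Duration::from_secs(10);
const ACTIVE_STUN_INTERVAL: Duration = Duration::from_millis(1500);
const IDLE_STUN_INTERVAL: Duration = Duration::from_secs(60);
const ACTIVE_MAX_MISSING: u32 = 8;
const IDLE_MAX_MISSING: u32 = 4;

/// Hypothetical per-connection keep-alive policy, switching between the
/// active and idle profiles based on when we last saw application traffic.
struct KeepAlivePolicy {
    last_app_traffic: Instant,
}

impl KeepAlivePolicy {
    fn is_idle(&self, now: Instant) -> bool {
        now.duration_since(self.last_app_traffic) >= IDLE_AFTER
    }

    /// How often to send STUN binding requests right now.
    fn stun_interval(&self, now: Instant) -> Duration {
        if self.is_idle(now) { IDLE_STUN_INTERVAL } else { ACTIVE_STUN_INTERVAL }
    }

    /// How many missing responses to tolerate before declaring the
    /// connection broken.
    fn max_missing_responses(&self, now: Instant) -> u32 {
        if self.is_idle(now) { IDLE_MAX_MISSING } else { ACTIVE_MAX_MISSING }
    }

    /// Any application traffic immediately reverts us to the active profile.
    fn on_app_traffic(&mut self, now: Instant) {
        self.last_app_traffic = now;
    }
}

fn main() {
    let start = Instant::now();
    let mut policy = KeepAlivePolicy { last_app_traffic: start };

    // Fresh traffic: active profile.
    assert_eq!(policy.stun_interval(start), ACTIVE_STUN_INTERVAL);

    // 30s without traffic: idle profile.
    let later = start + Duration::from_secs(30);
    assert!(policy.is_idle(later));
    assert_eq!(policy.stun_interval(later), IDLE_STUN_INTERVAL);

    // New traffic flips us straight back to the active profile.
    policy.on_app_traffic(later);
    assert!(!policy.is_idle(later));
}
```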

These values have been chosen while considering the following sources:

  1. RFC4787, REQ-5 requires NATs to keep UDP NAT mappings alive for at least 2 minutes.
  2. conntrack adopts this requirement via the nf_conntrack_udp_timeout_stream configuration.
  3. 25s is the default keep-alive of the WireGuard kernel module.
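A quick sanity check (simple arithmetic, not production code) that the chosen intervals stay comfortably within the 2-minute minimum NAT mapping lifetime from REQ-5:

```rust
fn main() {
    const NAT_MAPPING_LIFETIME_SECS: u64 = 120; // RFC 4787, REQ-5 minimum
    const IDLE_STUN_INTERVAL_SECS: u64 = 60;
    const WIREGUARD_KEEPALIVE_SECS: u64 = 25;

    // Both keep-alives fire at least twice within the minimum NAT mapping
    // lifetime, so even a single lost packet doesn't let the mapping expire.
    assert!(IDLE_STUN_INTERVAL_SECS * 2 <= NAT_MAPPING_LIFETIME_SECS);
    assert!(WIREGUARD_KEEPALIVE_SECS * 2 <= NAT_MAPPING_LIFETIME_SECS);
}
```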

In theory the WireGuard keep-alive itself should be good enough to keep all NAT bindings alive. In practice, missed keep-alives are not exposed by boringtun (the WireGuard implementation we rely on) and thus we need the additional STUN keep-alives to detect broken connections. We set those somewhat conservatively to 60s.

As soon as the user triggers new application traffic, these values revert to their defaults, meaning that even if the connection died just before the user starts using it again, we will know within the usual 10s because we are triggering STUN requests more often again.

Note that existing gateways still implement the "close idle connections after 5 minutes" behaviour. Customers will need to upgrade to a new gateway version to fully benefit from these new always-on, low-power connections.

Resolves: #6778.


@thomaseizinger thomaseizinger changed the title feat(connlib): reduce STUN overhead on idle connections feat(connlib): always-on, lower-power connections Sep 27, 2024
@conectado
Contributor

What will happen after this change with old gateways and new clients (or vice versa)? Should we try to add some test similar to what I tried in #6798?

@conectado
Contributor

> What will happen after this change with old gateways and new clients (or vice versa)? Should we try to add some test similar to what I tried in #6798?

I guess what I'm thinking could be a problem is that, with an old gateway, the connection will always idle out after 5 minutes, and the client won't detect this until ~10 seconds later.

So with an old gateway, if the client is idle for 5 minutes and the gateway closes the connection, and at that exact moment the user tries to use it, they will have to wait ~10 seconds for the client to re-connect.

@thomaseizinger thomaseizinger changed the title feat(connlib): always-on, lower-power connections feat(connlib): always-on, low-power connections Sep 27, 2024
@thomaseizinger
Member Author

> What will happen after this change with old gateways and new clients (or vice versa)? Should we try to add some test similar to what I tried in #6798?

> I guess what I'm thinking could be a problem is that, with an old gateway, the connection will always idle out after 5 minutes, and the client won't detect this until ~10 seconds later.

Yeah. There is no difference in behaviour compared to any other network interruption.

> So with an old gateway, if the client is idle for 5 minutes and the gateway closes the connection, and at that exact moment the user tries to use it, they will have to wait ~10 seconds for the client to re-connect.

Yes, although I'd argue that this is extremely unlikely and, even if it happens, not much of a big deal.

@thomaseizinger
Member Author

> should we try to add some test similar to what I tried in #6798?

What would you like to test? That idling doesn't close the connection? How many packets we transmit?

@ReactorScram
Contributor

ReactorScram commented Sep 27, 2024

So as long as we have good Internet, we always keep up to O(n) low-power connections open, where n is the number of Sites / Gateway Groups?

Contributor

@ReactorScram ReactorScram left a comment


Looks okay, but I'm wondering: if any customer has a setup with, say, 100 Resources, each with a dedicated Gateway in its own Site, doesn't that mean 100 always-on connections?

Maybe no existing customer has this, so we could just keep an eye out for it. I'm thinking of, say, IoT devices that run their own Gateways and can't see each other, so they also get their own Site via the API.

Contributor

@conectado conectado left a comment


The code changes look very good, and I think this is a great improvement! But maybe we should document somewhere the edge case discussed above with an old gateway and a new client?

@conectado
Contributor

> > should we try to add some test similar to what I tried in #6798?
>
> What would you like to test? That idling doesn't close the connection? How many packets we transmit?

I was thinking of adding a test for a gateway with the old behaviour and a client with the new one?

@thomaseizinger
Member Author

> Looks okay, but I'm wondering: if any customer has a setup with, say, 100 Resources, each with a dedicated Gateway in its own Site, doesn't that mean 100 always-on connections?
>
> Maybe no existing customer has this, so we could just keep an eye out for it. I'm thinking of, say, IoT devices that run their own Gateways and can't see each other, so they also get their own Site via the API.

We multiplex everything over a single socket, so even 100 connections don't actually consume more file descriptors or anything.

There will be more packets sent, yes, but I don't think we can avoid this O(n) behaviour. We only have a connection if the user wanted to access a resource, so it is kind of by design.

@thomaseizinger
Member Author

> The code changes look very good, and I think this is a great improvement! But maybe we should document somewhere the edge case discussed above with an old gateway and a new client?

Happy to take on any proposal. I do think it is extremely rare and not very problematic, so I am not sure it is worth the effort.

```rust
let packets_per_sec = num_packets / num_seconds / num_connections;

// This has been chosen through experimentation. It primarily serves as a
// regression tool to ensure our idle-traffic doesn't suddenly spike.
const THRESHOLD: f64 = 2.0;
```
@thomaseizinger
Member Author


I am somewhat surprised by this number. I think it should be less than that, given that we only send one request every 60s when idling.

It is significantly less than before though.

Either my math is wrong, or the packets we send to relays account for that many? It is weird.

I'll investigate in a follow-up.
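For reference, the idle keep-alives alone should only account for a small fraction of that threshold (a back-of-the-envelope estimate counting only packets we send, not responses or relay traffic):

```rust
fn main() {
    // Expected outbound packets per second on a fully idle connection,
    // counting only the two keep-alives introduced in this PR.
    let stun = 1.0 / 60.0; // one STUN binding request every 60s
    let wireguard = 1.0 / 25.0; // one WireGuard keep-alive every 25s

    let expected_idle_rate: f64 = stun + wireguard;

    // ~0.057 packets/s - a long way below the 2.0 threshold, which
    // supports the suspicion that other traffic (e.g. to relays)
    // accounts for the difference.
    assert!(expected_idle_rate < 0.06);
    assert!(expected_idle_rate < 2.0);
}
```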

@thomaseizinger
Member Author

Merging this so I can do some testing with my Android device over the weekend.

@thomaseizinger thomaseizinger added this pull request to the merge queue Sep 27, 2024
Merged via the queue into main with commit 6736bb7 Sep 27, 2024
@thomaseizinger thomaseizinger deleted the feat/idle-increase-stun branch September 27, 2024 23:40
github-merge-queue bot pushed a commit that referenced this pull request Oct 25, 2024
In order to make Firezone more mobile-friendly, waking up the CPU less
often is key. In #6845, we introduced a low-power mode into `snownet`
that sends STUN messages on a longer interval if the connection is idle.
Whilst poking around `boringtun` as part of integrating our fork into the
main codebase, I noticed that we are updating `boringtun`'s timers every
second - even on idle connections.

This PR applies the same idea of #6845 to the timers within `Node`: Idle
connections get "woken" less and if all connections are idle, we avoid
waking the `Node` altogether (unless we need to refresh allocations /
channels).

Calling `handle_timeout` less often revealed an issue in the tests where
we didn't fully process the state changes after invalidating a candidate
from the remote. To fix this, we now call `handle_timeout` directly
after `{add,remove}_remote_candidate`. This isn't super clean because at
first glance, it looks like `handle_timeout` should just be part of the
add/remove candidate function. It is quite common for sans-IO designs to
require calling `handle_timeout` after state has been changed. In
`{Client,Server}Node`, we do it implicitly so that we don't have to do
it in the tests and the event-loop.
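The convention is easier to see in a minimal sketch (illustrative only; this is not snownet's actual `Node` API): state-mutating calls only queue up work, and nothing is fully processed until `handle_timeout` drives the state machine.

```rust
use std::time::Instant;

/// Minimal sans-IO style state machine (hypothetical names throughout).
#[derive(Default)]
struct Node {
    pending_work: Vec<&'static str>,
    transmitted: Vec<&'static str>,
}

impl Node {
    /// Mutates state but performs no I/O and emits no events by itself.
    fn remove_remote_candidate(&mut self) {
        self.pending_work.push("re-nominate candidate pair");
    }

    /// Drives the state machine: only here do queued effects materialise.
    fn handle_timeout(&mut self, _now: Instant) {
        for work in self.pending_work.drain(..) {
            self.transmitted.push(work);
        }
    }
}

fn main() {
    let mut node = Node::default();

    node.remove_remote_candidate();
    // Without a `handle_timeout` call, the state change is not fully
    // processed - the kind of test issue described above.
    assert!(node.transmitted.is_empty());

    node.handle_timeout(Instant::now());
    assert_eq!(node.transmitted, ["re-nominate candidate pair"]);
}
```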

It would be great to test this in some automated fashion but I haven't
figured out how yet. I did temporarily add an `info!` log to the
event-loop of the client and with this patch applied, the entire
event-loop goes to sleep on idle connections. It still does get woken
every now and then but no longer every second!


Development

Successfully merging this pull request may close these issues.

Some applications have connectivity issues when the tunnel idles out
