# feat(connlib): always-on, low-power connections #6845
## Conversation
What will happen after this change with old gateways and new clients (or vice versa)? Should we try to add a test similar to what I tried in #6798?
I guess what could be a problem: with an old gateway, the connection will always idle out after 5 minutes, and the client won't detect this until 10 seconds later. So with an old gateway, if the client is idle for 5 minutes, the gateway closes the connection, and at that exact moment the user tries to use it, they will have to wait 10 seconds for the client to reconnect.
Yeah. There is no difference in behaviour compared to any other network interruption.
Yes, although I'd argue that this is extremely unlikely, and even if it happens, not much of a big deal.
What would you like to test? That idling doesn't close the connection? How many packets we transmit?
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
So as long as we have good Internet, we always keep up to O(n) low-power connections open, where n is the number of Sites / Gateway Groups?
Looks okay, but I'm wondering: if a customer has a setup with, say, 100 Resources, each with a dedicated Gateway in its own Site, doesn't that mean 100 always-on connections?
Maybe no existing customer has this, so we could just keep an eye out for it. I'm thinking of, say, IoT devices that run their own Gateways and can't see each other, so each also gets its own Site via API.
**conectado** left a comment:
The code changes look very good, and I think this is a great improvement! But maybe we should document somewhere the edge case discussed above with an old gateway and a new client?
I was also thinking of adding a test for a gateway with the old behaviour and a client with the new one.
We multiplex everything over a single socket, so even 100 connections don't actually consume more file descriptors or anything like that. There will be more packets sent, yes, but I don't think we can avoid this O(n) behaviour. We only have a connection if the user wanted to access a Resource, so it is kind of by design.
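To illustrate the multiplexing idea described above, here is a minimal sketch of demultiplexing all peers over one socket. The type and field names are illustrative assumptions, not the actual `snownet` API; the real code does considerably more (ICE candidate checks, encryption, etc.).

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

// Hypothetical per-connection state; the real `snownet` types differ.
#[derive(Default)]
struct Connection {
    packets_received: u64,
}

// All connections share one UDP socket; incoming datagrams are
// demultiplexed purely by their source address, so adding more
// connections does not consume more file descriptors.
struct Node {
    connections: HashMap<SocketAddr, Connection>,
}

impl Node {
    /// Route an incoming datagram to the connection it belongs to.
    fn handle_input(&mut self, from: SocketAddr, _payload: &[u8]) -> bool {
        match self.connections.get_mut(&from) {
            Some(conn) => {
                conn.packets_received += 1;
                true
            }
            None => false, // Unknown peer; real code would consult ICE state.
        }
    }
}

fn main() {
    let peer: SocketAddr = "10.0.0.1:3478".parse().unwrap();
    let mut node = Node {
        connections: HashMap::from([(peer, Connection::default())]),
    };

    // A datagram from a known peer is dispatched; an unknown one is not.
    assert!(node.handle_input(peer, b"stun-binding"));
    assert!(!node.handle_input("10.0.0.2:3478".parse().unwrap(), b"unknown"));
}
```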
Happy to take on any proposal. I do think it is extremely rare and not very problematic, so I am not sure it is worth the effort.
```rust
let packets_per_sec = num_packets / num_seconds / num_connections;

// This has been chosen through experimentation. It primarily serves as a
// regression tool to ensure our idle-traffic doesn't suddenly spike.
const THRESHOLD: f64 = 2.0;
```
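For context, the snippet above is the heart of a per-connection idle-traffic budget. A self-contained sketch of how such a regression check might be assembled (function name and sample numbers are made up, not the actual test):

```rust
/// Illustrative recreation of the idle-traffic regression check.
/// The helper name and example figures are hypothetical.
fn idle_packets_per_sec(num_packets: f64, num_seconds: f64, num_connections: f64) -> f64 {
    // Normalise by duration and connection count so the threshold is
    // independent of how long the test ran and how many peers it had.
    num_packets / num_seconds / num_connections
}

// Chosen through experimentation; guards against idle-traffic spikes.
const THRESHOLD: f64 = 2.0;

fn main() {
    // e.g. 300 packets observed over 60s across 5 idle connections.
    let packets_per_sec = idle_packets_per_sec(300.0, 60.0, 5.0);
    assert_eq!(packets_per_sec, 1.0);
    assert!(
        packets_per_sec < THRESHOLD,
        "idle traffic spiked: {packets_per_sec} packets/sec per connection"
    );
}
```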
I am somewhat surprised by this number. I think it should be less than that, given that we only send one message every 60s when idling.
It is significantly less than before, though.
Either my math is wrong, or the packets we send to relays account for that many? It is weird.
I'll investigate in a follow-up.
Merging this so I can do some testing with my Android device over the weekend.
In order to make Firezone more mobile-friendly, waking up the CPU less often is key. In #6845, we introduced a low-power mode into `snownet` that sends STUN messages on a longer interval if the connection is idle. Whilst poking around `boringtun` as part of integrating our fork into the main codebase, I noticed that we are updating `boringtun`'s timers every second - even on idle connections.

This PR applies the same idea as #6845 to the timers within `Node`: idle connections get "woken" less, and if all connections are idle, we avoid waking the `Node` altogether (unless we need to refresh allocations / channels).

Calling `handle_timeout` less often revealed an issue in the tests where we didn't fully process the state changes after invalidating a candidate from the remote. To fix this, we now call `handle_timeout` directly after `{add,remove}_remote_candidate`. This isn't super clean because at first glance, it looks like `handle_timeout` should just be part of the add/remove candidate function. However, it is quite common for sans-IO designs to require calling `handle_timeout` after state has been changed. In `{Client,Server}Node`, we do it implicitly so that we don't have to do it in the tests and the event loop.

It would be great to test this in some automated fashion, but I haven't figured out how yet. I did temporarily add an `info!` log to the event loop of the client, and with this patch applied, the entire event loop goes to sleep on idle connections. It still does get woken every now and then, but no longer every second!
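The sans-IO pattern described above - where a state change only records intent and a follow-up `handle_timeout` call actually advances the state machine - can be sketched as follows. This is a toy model under assumed names, not the real `snownet::Node` API:

```rust
use std::time::Instant;

// Toy sans-IO state machine: mutations only queue work;
// `handle_timeout` is what actually processes it, on the caller's clock.
struct Node {
    pending_candidates: Vec<String>,
    nominated: Vec<String>,
}

impl Node {
    fn add_remote_candidate(&mut self, candidate: &str) {
        // Deliberately does NOT process the candidate here.
        self.pending_candidates.push(candidate.to_owned());
    }

    fn handle_timeout(&mut self, _now: Instant) {
        // Deferred processing happens here.
        self.nominated.append(&mut self.pending_candidates);
    }
}

fn main() {
    let mut node = Node {
        pending_candidates: Vec::new(),
        nominated: Vec::new(),
    };

    node.add_remote_candidate("192.0.2.1:51820");
    assert!(node.nominated.is_empty()); // The state change alone did nothing yet.

    node.handle_timeout(Instant::now()); // The required follow-up call.
    assert_eq!(node.nominated.len(), 1);
}
```

This mirrors why the PR calls `handle_timeout` directly after `{add,remove}_remote_candidate`: without the follow-up call, the state change is only half-applied.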
Within `snownet` - connlib's connectivity library - we use ICE to set up a UDP "connection" between a client and a gateway. UDP is an unreliable transport, meaning the only way we can detect that the connection is broken is for both parties to constantly send messages and acknowledgements back and forth. ICE uses STUN binding requests for this.

In the default configuration of `str0m`, a STUN binding is sent every 3s, and we tolerate at most 9 missing responses before we consider the connection broken. As responses go missing, `str0m` halves this interval, which results in a total ICE timeout of around 17 seconds. We already tweak these values by reducing the number of requests to 8 and setting the interval to 1.5s. This results in a total ICE timeout of ~10s, which effectively means that there is at most a 10s lag between the connection breaking and us considering it broken, at which point new packets arriving at the TUN interface can trigger the setup of a new connection with the gateway.

Lowering these timeouts improves the user experience in case of a broken connection because the user doesn't have to wait as long before they can access their Resources again. The downside of lowering these timeouts is that we generate a lot of background noise. Especially on mobile devices, this is bad because it prevents the CPU from going to sleep, and thus simply being signed into Firezone will drain your battery, even if you don't use it.
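The two timing configurations described above can be summarised as a small config sketch. The struct and field names are illustrative, not `str0m`'s actual API; the values and the commented totals come from the description above:

```rust
use std::time::Duration;

// Illustrative config shape; `str0m` exposes these knobs differently.
struct IceTimingConfig {
    binding_interval: Duration,
    max_missing_responses: u32,
}

// str0m's defaults, per the description above.
const STR0M_DEFAULTS: IceTimingConfig = IceTimingConfig {
    binding_interval: Duration::from_secs(3),
    max_missing_responses: 9, // total ICE timeout of ~17s with interval halving
};

// The tweaked values Firezone already ships.
const FIREZONE_TWEAKED: IceTimingConfig = IceTimingConfig {
    binding_interval: Duration::from_millis(1500),
    max_missing_responses: 8, // total ICE timeout of ~10s
};

fn main() {
    // The tweaked config probes faster and gives up sooner, trading
    // background noise for a shorter detection lag.
    assert!(FIREZONE_TWEAKED.binding_interval < STR0M_DEFAULTS.binding_interval);
    assert!(FIREZONE_TWEAKED.max_missing_responses < STR0M_DEFAULTS.max_missing_responses);
}
```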
Note that this doesn't apply at all if the client application on top detects a network change. In that case, we hard-reset all connections and instantly create new ones.
We attempted to fix this in #5576 by closing idle connections after 5 minutes. This however created new problems such as #6778.
The original problem here is that we send too many STUN messages as soon as a connection is established. Simply increasing the timeout is not an option because it would make the user experience really bad in case the connection actually drops for reasons that the client app can't detect.
In this patch, we attempt to solve this in a different way: Detecting a broken connection is only critical if the user is actively using the tunnel (i.e. sending traffic). If there is no traffic, it doesn't matter if we need longer to detect a broken connection. The user won't notice because their phone is probably in their pocket or something.
With this patch, we now implement the following behaviour:
These values have been chosen while considering the following sources:
`conntrack` adopts this requirement via the `nf_conntrack_udp_timeout_stream` configuration. In theory, the WireGuard keep-alive itself should be good enough to keep all NAT bindings alive. In practice, missed keep-alives are not exposed by `boringtun` (the WireGuard implementation we rely on), and thus we need the additional STUN keep-alives to detect broken connections. We set those somewhat conservatively to 60s.
As soon as the user triggers new application traffic, these values are reverted to their defaults, meaning even if the connection died just before the user starts using it again, we will know within the usual 10s because we are triggering new STUN requests more often.
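The idle-versus-active policy described above boils down to picking a keep-alive interval based on whether the tunnel is carrying traffic. A minimal sketch, assuming made-up names (the 60s idle interval and ~1.5s active interval come from this PR; everything else is illustrative):

```rust
use std::time::Duration;

// Hypothetical activity classification; the real connlib tracks this
// from packets seen on the TUN interface.
#[derive(Clone, Copy, PartialEq)]
enum Activity {
    Active,
    Idle,
}

fn stun_keepalive_interval(activity: Activity) -> Duration {
    match activity {
        // Fast probing while the user is sending traffic: a broken
        // connection is detected within the usual ~10s.
        Activity::Active => Duration::from_millis(1500),
        // Low-power mode: one keep-alive per minute is enough to hold
        // NAT bindings open while the tunnel is unused.
        Activity::Idle => Duration::from_secs(60),
    }
}

fn main() {
    assert_eq!(stun_keepalive_interval(Activity::Idle), Duration::from_secs(60));
    assert!(
        stun_keepalive_interval(Activity::Active) < stun_keepalive_interval(Activity::Idle)
    );
}
```

The key design point is that the switch back to the fast interval happens on the first outbound application packet, so the window in which a dead connection goes unnoticed only ever overlaps with time the user was not using the tunnel anyway.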
Note that existing gateways still implement the "close idle connections after 5 minutes" behaviour. Customers will need to upgrade to a new gateway version to fully benefit from these new always-on, low-power connections.
Resolves: #6778.