
Reduce frequency of keep-alives on non-nominated pairs #464

Closed

thomaseizinger opened this issue Feb 27, 2024 · 15 comments

@thomaseizinger
Collaborator

Currently, I believe str0m sends keep-alives on all valid candidate pairs at the same rate. Sending frequent keep-alives is useful to detect network partitions. However, I think that is only relevant for the currently nominated pair.

Could we reduce the keep-alives on other pairs to something like 5 or 10 seconds?
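For illustration, a nomination-aware schedule could look something like the sketch below (hypothetical types and placeholder intervals, not str0m's actual internals):

```rust
use std::time::Duration;

/// Hypothetical view of a valid candidate pair; str0m's real bookkeeping
/// differs, this only illustrates the proposed scheduling rule.
struct CandidatePair {
    nominated: bool,
}

/// Keep the nominated pair on a tight schedule so network partitions are
/// detected quickly, and ping the remaining valid pairs far less often.
fn keep_alive_interval(pair: &CandidatePair) -> Duration {
    if pair.nominated {
        Duration::from_secs(2) // assumed "frequent" rate for the active pair
    } else {
        Duration::from_secs(10) // relaxed rate for non-nominated pairs
    }
}
```

The exact numbers are placeholders; the point is only that non-nominated pairs don't need the partition-detection rate.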

@thomaseizinger
Collaborator Author

If we'd ignore trickle ICE, then the spec would actually say to stop sending keep-alives on all but the nominated candidate pair: https://datatracker.ietf.org/doc/html/rfc8445#section-8.3

@thomaseizinger
Collaborator Author

I think what I would want is a behaviour such as "move to completed if we haven't received a new candidate in a while", while still allowing new candidates to be added in case they come through.

I am not sure if time is the best trigger for this condition.

Perhaps what we need is an implementation of the "end of candidates" indication, plus triggering an ICE restart if we learn a new candidate after we have sent "end of candidates"?

@thomaseizinger
Collaborator Author

I think I figured out something that solves this for now. Upon every nomination, I'll invalidate all other candidates with a priority less than or equal to the nominated pair's. This means we can still find a better pair, but we won't keep the others around. This removes the ability to fall back in case the current pair stops working, but in that case we just discard the connection and form a new one (kind of like an ICE restart).
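Roughly, the rule is something like this sketch (hypothetical Pair type and helper, not str0m's real data structures):

```rust
/// Hypothetical candidate-pair record, only to illustrate the pruning rule
/// described above.
struct Pair {
    id: usize,
    prio: u64,
}

/// Upon nomination, invalidate every *other* pair whose priority is less than
/// or equal to the nominated pair's. Higher-priority pairs survive, so a
/// better pair can still be found and nominated later.
fn prune_on_nomination(pairs: &mut Vec<Pair>, nominated: &Pair) {
    pairs.retain(|p| p.id == nominated.id || p.prio > nominated.prio);
}
```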

@algesten
Owner

Yeah. So for WebRTC the goal is to maximize the chances of connectivity. The frequency of STUN requests to the fallbacks could potentially be reduced, but it's not right to remove them altogether.

> If we'd ignore trickle ICE, then the spec would actually say to stop sending keep-alives on all but the nominated candidate pair: https://datatracker.ietf.org/doc/html/rfc8445#section-8.3

This is something we have a deliberate stance on: https://github.com/algesten/str0m/blob/main/docs/ice.md#nomination-doesnt-stop-gathering

And also see: https://github.com/algesten/str0m/blob/main/docs/ice.md#why-aend-of-candidates

Ultimately it seems your application is less concerned with connectivity and more about not being too noisy. That's slightly at odds with the goals of str0m. invalidate_candidate seems like one solution; another thing we could explore is implementing this part of the spec:

https://datatracker.ietf.org/doc/html/rfc5245#section-5.7.3

> an agent MUST limit the total number of connectivity checks the agent performs across all check lists to a specific value, and this value MUST be configurable. A default of 100 is RECOMMENDED. This limit is enforced by discarding the lower-priority candidate pairs until there are less than 100.

If we made this runtime configurable, you could potentially lower the limit to 1-2 after a successful connect.
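A sketch of what that could look like, with made-up names rather than an existing str0m API:

```rust
/// Hypothetical configuration knob; RFC 5245 recommends a default of 100.
struct IceConfig {
    max_candidate_pairs: usize,
}

/// Hypothetical pair record used only for this sketch.
struct Pair {
    prio: u64,
}

/// Enforce the limit from RFC 5245 section 5.7.3 by discarding the
/// lowest-priority pairs. If `max_candidate_pairs` were adjustable at
/// runtime, an application could drop it to 1-2 after a successful connect.
fn enforce_pair_limit(pairs: &mut Vec<Pair>, config: &IceConfig) {
    // Highest priority first, then drop everything past the configured limit.
    pairs.sort_by(|a, b| b.prio.cmp(&a.prio));
    pairs.truncate(config.max_candidate_pairs);
}
```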

@thomaseizinger
Collaborator Author

> Ultimately it seems your application is less concerned with connectivity and more about not being too noisy. That's slightly at odds with the goals of str0m.

I imagine many applications will run into that. We "only" generate a handful of candidates (8-10 per peer) and if most of them are active, str0m will generate multiple megabytes of traffic per minute just for keep-alives. On mobile devices, that is not exactly ideal.

The quoted section sounds interesting; I think that may be another good angle to solve this. We would also need to be somewhat smart about which pairs get dropped. Also, it would be important that this only happens once we've nominated a pair.

@algesten
Owner

> str0m will generate multiple megabytes of traffic per minute just for keep-alives

Did you measure this?

@thomaseizinger
Collaborator Author

> Ultimately it seems your application is less concerned with connectivity and more about not being too noisy.

That is not quite how I'd put it. Connectivity is super important actually but once we've found a pair, we'd rather have that connection fail and make a new one via the signaling layer instead of falling back :)

It appears to me that this is a trade-off that most apps will want to make at some point: have lots of candidates first to maximize connectivity, then prune them to save bandwidth and battery, and reduce noise.

@algesten
Owner

> That is not quite how I'd put it. Connectivity is super important actually but once we've found a pair, we'd rather have that connection fail and make a new one via the signaling layer instead of falling back :)

Yeah. Sorry. Clumsy way of putting it.

> Have lots of candidates first to maximize connectivity, then prune them to save bandwidth and battery, and reduce noise.

Noise aside, this presupposes there is a lot of bandwidth being used here. My gut feeling is that it can't be that much, hence I asked whether you measured it.

@thomaseizinger
Collaborator Author

thomaseizinger commented Feb 28, 2024

>> That is not quite how I'd put it. Connectivity is super important actually but once we've found a pair, we'd rather have that connection fail and make a new one via the signaling layer instead of falling back :)
>
> Yeah. Sorry. Clumsy way of putting it.
>
>> Have lots of candidates first to maximize connectivity, then prune them to save bandwidth and battery, and reduce noise.
>
> Noise aside, this presupposes there is a lot of bandwidth being used here. My gut feeling is that it can't be that much, hence I asked whether you measured it.

We (non-scientifically) measured the total data usage of the Android app while it was just idling for about 5 minutes, at which point the only thing that should be happening is keep-alives, plus some book-keeping with the TURN servers, which admittedly we didn't exclude from that measurement.

I'd have to go back and actually tally up all the binding requests to see how much it really is once I remove all other traffic! Multiple megabytes might have been a bit of a stretch :)

When we turned str0m's debug logs on, there were so many that you could in fact not read any other logs, so it felt like this would have to be a big contribution to the overall data usage!

It shouldn't be too difficult to implement some counters that sum up all traffic generated by our own TURN client and str0m. I'll implement that tomorrow and see what numbers we get.
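For reference, a minimal sketch of such counters (hypothetical code, assuming the tracing crate; the field names simply mirror the stats logged further down):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-connection byte counters, incremented wherever a
/// keep-alive or TURN packet is handed to the socket.
#[derive(Default)]
struct ConnectionStats {
    stun_bytes_to_peer_direct: AtomicU64,
    stun_bytes_to_peer_relayed: AtomicU64,
}

impl ConnectionStats {
    fn record(&self, len: usize, relayed: bool) {
        let counter = if relayed {
            &self.stun_bytes_to_peer_relayed
        } else {
            &self.stun_bytes_to_peer_direct
        };
        counter.fetch_add(len as u64, Ordering::Relaxed);
    }

    /// Log the running totals, e.g. every 10 seconds from a timer task.
    fn log(&self) {
        tracing::info!(
            direct = self.stun_bytes_to_peer_direct.load(Ordering::Relaxed),
            relayed = self.stun_bytes_to_peer_relayed.load(Ordering::Relaxed),
            "STUN bytes to peer"
        );
    }
}
```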

@algesten
Owner

> It shouldn't be too difficult to implement some counters that sum up all traffic generated by our own TURN client and str0m. I'll implement that tomorrow and see what numbers we get.

Excellent! Thanks! I'd love to get some hard data on that.

@thomaseizinger
Collaborator Author

Here are some early logs:

```
2024-02-29T00:00:16.453321Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 6.09 kB, stun_bytes_to_peer_relayed: 14.35 kB }
2024-02-29T00:00:26.453946Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 7.55 kB, stun_bytes_to_peer_relayed: 25.11 kB }
2024-02-29T00:00:36.454360Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 9.15 kB, stun_bytes_to_peer_relayed: 36.73 kB }
2024-02-29T00:00:46.453929Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 10.75 kB, stun_bytes_to_peer_relayed: 48.35 kB }
2024-02-29T00:00:56.454297Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 12.35 kB, stun_bytes_to_peer_relayed: 59.97 kB }
2024-02-29T00:01:06.453849Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 13.95 kB, stun_bytes_to_peer_relayed: 71.59 kB }
2024-02-29T00:01:16.453982Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 15.55 kB, stun_bytes_to_peer_relayed: 83.21 kB }
2024-02-29T00:01:26.454021Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 17.15 kB, stun_bytes_to_peer_relayed: 94.83 kB }
2024-02-29T00:01:36.454246Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 18.75 kB, stun_bytes_to_peer_relayed: 106.45 kB }
2024-02-29T00:01:46.453911Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 20.35 kB, stun_bytes_to_peer_relayed: 118.07 kB }
2024-02-29T00:01:56.454139Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 21.95 kB, stun_bytes_to_peer_relayed: 129.69 kB }
2024-02-29T00:02:06.454218Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 23.55 kB, stun_bytes_to_peer_relayed: 141.31 kB }
2024-02-29T00:02:16.453379Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 25.15 kB, stun_bytes_to_peer_relayed: 152.93 kB }
2024-02-29T00:02:26.454363Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 26.67 kB, stun_bytes_to_peer_relayed: 164.55 kB }
2024-02-29T00:02:36.453653Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 28.27 kB, stun_bytes_to_peer_relayed: 176.17 kB }
2024-02-29T00:02:46.453956Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 29.87 kB, stun_bytes_to_peer_relayed: 187.79 kB }
2024-02-29T00:02:56.454060Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 31.47 kB, stun_bytes_to_peer_relayed: 199.41 kB }
2024-02-29T00:03:06.454192Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 33.07 kB, stun_bytes_to_peer_relayed: 211.03 kB }
2024-02-29T00:03:16.453685Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 34.67 kB, stun_bytes_to_peer_relayed: 222.65 kB }
2024-02-29T00:03:26.454280Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 36.27 kB, stun_bytes_to_peer_relayed: 234.27 kB }
2024-02-29T00:03:36.453754Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 37.87 kB, stun_bytes_to_peer_relayed: 245.89 kB }
2024-02-29T00:03:46.454647Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 39.47 kB, stun_bytes_to_peer_relayed: 257.51 kB }
2024-02-29T00:03:56.453414Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 41.07 kB, stun_bytes_to_peer_relayed: 269.13 kB }
2024-02-29T00:04:06.454558Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 42.67 kB, stun_bytes_to_peer_relayed: 280.75 kB }
2024-02-29T00:04:16.453673Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 44.27 kB, stun_bytes_to_peer_relayed: 292.37 kB }
2024-02-29T00:04:26.453906Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 45.87 kB, stun_bytes_to_peer_relayed: 303.99 kB }
2024-02-29T00:04:36.453788Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 47.47 kB, stun_bytes_to_peer_relayed: 315.61 kB }
2024-02-29T00:04:46.454831Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 49.07 kB, stun_bytes_to_peer_relayed: 327.23 kB }
2024-02-29T00:04:56.453251Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 50.67 kB, stun_bytes_to_peer_relayed: 338.85 kB }
2024-02-29T00:05:06.453654Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 52.27 kB, stun_bytes_to_peer_relayed: 350.47 kB }
2024-02-29T00:05:16.453233Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 53.87 kB, stun_bytes_to_peer_relayed: 362.09 kB }
2024-02-29T00:05:26.453618Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 55.47 kB, stun_bytes_to_peer_relayed: 373.71 kB }
2024-02-29T00:05:36.453282Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 57.07 kB, stun_bytes_to_peer_relayed: 385.33 kB }
2024-02-29T00:05:46.454126Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 58.67 kB, stun_bytes_to_peer_relayed: 396.95 kB }
2024-02-29T00:05:56.453692Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 60.27 kB, stun_bytes_to_peer_relayed: 408.42 kB }
2024-02-29T00:06:06.454238Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 61.87 kB, stun_bytes_to_peer_relayed: 419.88 kB }
2024-02-29T00:06:16.453243Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 63.40 kB, stun_bytes_to_peer_relayed: 431.50 kB }
2024-02-29T00:06:26.454474Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 65.00 kB, stun_bytes_to_peer_relayed: 443.12 kB }
2024-02-29T00:06:36.453941Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 66.60 kB, stun_bytes_to_peer_relayed: 454.74 kB }
2024-02-29T00:06:46.453610Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 68.12 kB, stun_bytes_to_peer_relayed: 466.36 kB }
2024-02-29T00:06:56.454841Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 69.72 kB, stun_bytes_to_peer_relayed: 477.98 kB }
2024-02-29T00:07:06.453429Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 71.32 kB, stun_bytes_to_peer_relayed: 489.60 kB }
2024-02-29T00:07:16.454171Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 72.92 kB, stun_bytes_to_peer_relayed: 501.22 kB }
2024-02-29T00:07:26.454635Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 74.52 kB, stun_bytes_to_peer_relayed: 512.84 kB }
2024-02-29T00:07:36.453232Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 76.12 kB, stun_bytes_to_peer_relayed: 524.46 kB }
2024-02-29T00:07:46.453754Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 77.72 kB, stun_bytes_to_peer_relayed: 536.08 kB }
2024-02-29T00:07:56.454411Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 79.32 kB, stun_bytes_to_peer_relayed: 547.70 kB }
```

This is already with some of the optimisations we discussed applied, i.e. invalidating a lot of candidate pairs. Something is still off because more than one pair is still being tested. The above logs are with 5 pairs, I think.

@thomaseizinger
Collaborator Author

The above comes down to about 80 kB a minute for keep-alives. During that time, we are sending 28 unique messages (14 requests & 14 responses), so I think in total it means we are keeping 14 candidate pairs alive.

If my math doesn't fail me, that is ~6 kB per pair per minute.
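As a quick sanity check of that arithmetic, using the numbers above:

```rust
fn main() {
    // Numbers taken from the logs and the comment above.
    let total_kb_per_minute = 80.0_f64; // ~80 kB/min of keep-alive traffic
    let pairs_kept_alive = 14.0_f64; // 28 unique messages = 14 request/response pairs

    let per_pair = total_kb_per_minute / pairs_kept_alive;
    println!("~{per_pair:.1} kB per pair per minute"); // prints ~5.7
}
```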

@algesten
Owner

This sounds within expected parameters. It will be further improved by a resolution to #490, where the user can reduce the frequency if they so wish.

Notice that 6 kB per pair per minute is nothing for the WebRTC use case.

@algesten
Owner

Can we close this in favor of #490?

@thomaseizinger
Collaborator Author

Yeah, we resolved this by invalidating all other candidate pairs :)
