
Reduce frequency of keep-alives on non-nominated pairs #464

Closed

thomaseizinger opened this issue Feb 27, 2024 · 15 comments

@thomaseizinger
Collaborator

Currently, I believe str0m sends keep-alives on all valid candidate pairs at the same rate. Sending frequent keep-alives is useful to detect network partitions. However, I think that is only relevant for the currently nominated pair.

Could we reduce the keep-alives on other pairs to something like 5 or 10 seconds?
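For illustration, a nomination-aware schedule could look something like the sketch below (hypothetical types and placeholder intervals, not str0m's actual internals):

```rust
use std::time::Duration;

/// Hypothetical view of a valid candidate pair; str0m's real bookkeeping
/// differs, this only illustrates the proposed scheduling rule.
struct CandidatePair {
    nominated: bool,
}

/// Keep the nominated pair on a tight schedule so network partitions are
/// detected quickly, and ping the remaining valid pairs far less often.
fn keep_alive_interval(pair: &CandidatePair) -> Duration {
    if pair.nominated {
        Duration::from_secs(2) // assumed "frequent" rate for the active pair
    } else {
        Duration::from_secs(10) // relaxed rate for non-nominated pairs
    }
}
```

The exact numbers are placeholders; the point is only that non-nominated pairs don't need the partition-detection rate.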

@thomaseizinger
Collaborator Author

If we'd ignore trickle ICE, then the spec would actually say to stop sending keep-alives on all but the nominated candidate pair: https://datatracker.ietf.org/doc/html/rfc8445#section-8.3

@thomaseizinger
Collaborator Author

I think what I would want is a behaviour such as "move to completed if we haven't received a new candidate in a while", while still allowing new candidates to be added in case they come through.

I am not sure if time is the best trigger for this condition.

Perhaps what we need is an implementation of the "end of candidates" indication, plus triggering an ICE restart if we learn a new candidate after we have sent "end of candidates"?

@thomaseizinger
Collaborator Author

I think I figured out something that solves this for now. Upon every nomination, I'll invalidate all other candidates with a priority less than or equal to the nominated pair's. This means we can still find a better pair, but we won't keep the others around. This removes the ability to fall back in case the current pair stops working, but in that case we just discard the connection and form a new one (kind of like an ICE restart).
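Roughly, the rule is something like this sketch (hypothetical Pair type and helper, not str0m's real data structures):

```rust
/// Hypothetical candidate-pair record, only to illustrate the pruning rule
/// described above.
struct Pair {
    id: usize,
    prio: u64,
}

/// Upon nomination, invalidate every *other* pair whose priority is less than
/// or equal to the nominated pair's. Higher-priority pairs survive, so a
/// better pair can still be found and nominated later.
fn prune_on_nomination(pairs: &mut Vec<Pair>, nominated: &Pair) {
    pairs.retain(|p| p.id == nominated.id || p.prio > nominated.prio);
}
```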

@algesten
Owner

Yeah. So for WebRTC the goal is to maximize the chances of connectivity. The frequency of STUN requests to the fallbacks could potentially be reduced, but it's not right to remove them altogether.

> If we'd ignore trickle ICE, then the spec would actually say to stop sending keep-alives on all but the nominated candidate pair: https://datatracker.ietf.org/doc/html/rfc8445#section-8.3

This is something we have a deliberate stance on: https://github.com/algesten/str0m/blob/main/docs/ice.md#nomination-doesnt-stop-gathering

And also see: https://github.com/algesten/str0m/blob/main/docs/ice.md#why-aend-of-candidates

Ultimately it seems your application is less concerned with connectivity and more about not being too noisy. That's slightly at odds with the goals of str0m. invalidate_candidate seems like one solution; another thing we could explore is implementing this part of the spec:

https://datatracker.ietf.org/doc/html/rfc5245#section-5.7.3

> an agent MUST limit the total number of connectivity checks the agent performs across all check lists to a specific value, and this value MUST be configurable. A default of 100 is RECOMMENDED. This limit is enforced by discarding the lower-priority candidate pairs until there are less than 100.

If we made this runtime configurable, you could potentially lower the limit to 1-2 after a successful connect.
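A sketch of what that could look like, with made-up names rather than an existing str0m API:

```rust
/// Hypothetical configuration knob; RFC 5245 recommends a default of 100.
struct IceConfig {
    max_candidate_pairs: usize,
}

/// Hypothetical pair record used only for this sketch.
struct Pair {
    prio: u64,
}

/// Enforce the limit from RFC 5245 section 5.7.3 by discarding the
/// lowest-priority pairs. If `max_candidate_pairs` were adjustable at
/// runtime, an application could drop it to 1-2 after a successful connect.
fn enforce_pair_limit(pairs: &mut Vec<Pair>, config: &IceConfig) {
    // Highest priority first, then drop everything past the configured limit.
    pairs.sort_by(|a, b| b.prio.cmp(&a.prio));
    pairs.truncate(config.max_candidate_pairs);
}
```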

@thomaseizinger
Collaborator Author

> Ultimately it seems your application is less concerned with connectivity and more about not being too noisy. That's slightly at odds with the goals of str0m.

I imagine many applications will run into that. We "only" generate a handful of candidates (8-10 per peer) and if most of them are active, str0m will generate multiple megabytes of traffic per minute just for keep-alives. On mobile devices, that is not exactly ideal.

The quoted section sounds interesting; I think that may be another good angle to solve this. We would also need to be somewhat smart about which pairs get dropped. Also, it would be important that this only happens once we've nominated a pair.

@algesten
Owner

> str0m will generate multiple megabytes of traffic per minute just for keep-alives

Did you measure this?

@thomaseizinger
Collaborator Author

> Ultimately it seems your application is less concerned with connectivity and more about not being too noisy.

That is not quite how I'd put it. Connectivity is super important actually but once we've found a pair, we'd rather have that connection fail and make a new one via the signaling layer instead of falling back :)

It appears to me that this is a trade-off that most apps will want to make at some point: have lots of candidates first to maximize connectivity, then prune them to save bandwidth and battery, and reduce noise.

@algesten
Owner

> That is not quite how I'd put it. Connectivity is super important actually but once we've found a pair, we'd rather have that connection fail and make a new one via the signaling layer instead of falling back :)

Yeah. Sorry. Clumsy way of putting it.

> Have lots of candidates first to maximize connectivity, then prune them to save bandwidth and battery, and reduce noise.

Noise aside, this presupposes there is a lot of bandwidth being used here. My gut feeling is that it can't be that much, hence I asked whether you measured it.

@thomaseizinger
Collaborator Author

thomaseizinger commented Feb 28, 2024

>> That is not quite how I'd put it. Connectivity is super important actually but once we've found a pair, we'd rather have that connection fail and make a new one via the signaling layer instead of falling back :)
>
> Yeah. Sorry. Clumsy way of putting it.
>
>> Have lots of candidates first to maximize connectivity, then prune them to save bandwidth and battery, and reduce noise.
>
> Noise aside, this presupposes there is a lot of bandwidth being used here. My gut feeling is that it can't be that much, hence I asked whether you measured it.

We (non-scientifically) measured the total data usage of the Android app while it was just idling for about 5 minutes, at which point the only thing that should be happening is keep-alives, plus some book-keeping with the TURN servers, which admittedly we didn't exclude from that measurement.

I'd have to go back and actually tally up all the binding requests to see how much it really is once I remove all other traffic! Multiple megabytes might have been a bit of a stretch :)

When we turned str0m's debug logs on, there were so many that you could in fact not read any other logs, so it felt like this would have to be a big contribution to the overall data usage!

It shouldn't be too difficult to implement some counters that sum up all traffic generated by our own TURN client and str0m. I'll implement that tomorrow and see what numbers we get.
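For reference, a minimal sketch of such counters (hypothetical code, assuming the tracing crate; the field names simply mirror the stats logged further down):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-connection byte counters, incremented wherever a
/// keep-alive or TURN packet is handed to the socket.
#[derive(Default)]
struct ConnectionStats {
    stun_bytes_to_peer_direct: AtomicU64,
    stun_bytes_to_peer_relayed: AtomicU64,
}

impl ConnectionStats {
    fn record(&self, len: usize, relayed: bool) {
        let counter = if relayed {
            &self.stun_bytes_to_peer_relayed
        } else {
            &self.stun_bytes_to_peer_direct
        };
        counter.fetch_add(len as u64, Ordering::Relaxed);
    }

    /// Log the running totals, e.g. every 10 seconds from a timer task.
    fn log(&self) {
        tracing::info!(
            direct = self.stun_bytes_to_peer_direct.load(Ordering::Relaxed),
            relayed = self.stun_bytes_to_peer_relayed.load(Ordering::Relaxed),
            "STUN bytes to peer"
        );
    }
}
```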

@algesten
Owner

> It shouldn't be too difficult to implement some counters that sum up all traffic generated by our own TURN client and str0m. I'll implement that tomorrow and see what numbers we get.

Excellent! Thanks! I'd love to get some hard data on that.

@thomaseizinger
Collaborator Author

Here are some early logs:

```
2024-02-29T00:00:16.453321Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 6.09 kB, stun_bytes_to_peer_relayed: 14.35 kB }
2024-02-29T00:00:26.453946Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 7.55 kB, stun_bytes_to_peer_relayed: 25.11 kB }
2024-02-29T00:00:36.454360Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 9.15 kB, stun_bytes_to_peer_relayed: 36.73 kB }
2024-02-29T00:00:46.453929Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 10.75 kB, stun_bytes_to_peer_relayed: 48.35 kB }
2024-02-29T00:00:56.454297Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 12.35 kB, stun_bytes_to_peer_relayed: 59.97 kB }
2024-02-29T00:01:06.453849Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 13.95 kB, stun_bytes_to_peer_relayed: 71.59 kB }
2024-02-29T00:01:16.453982Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 15.55 kB, stun_bytes_to_peer_relayed: 83.21 kB }
2024-02-29T00:01:26.454021Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 17.15 kB, stun_bytes_to_peer_relayed: 94.83 kB }
2024-02-29T00:01:36.454246Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 18.75 kB, stun_bytes_to_peer_relayed: 106.45 kB }
2024-02-29T00:01:46.453911Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 20.35 kB, stun_bytes_to_peer_relayed: 118.07 kB }
2024-02-29T00:01:56.454139Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 21.95 kB, stun_bytes_to_peer_relayed: 129.69 kB }
2024-02-29T00:02:06.454218Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 23.55 kB, stun_bytes_to_peer_relayed: 141.31 kB }
2024-02-29T00:02:16.453379Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 25.15 kB, stun_bytes_to_peer_relayed: 152.93 kB }
2024-02-29T00:02:26.454363Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 26.67 kB, stun_bytes_to_peer_relayed: 164.55 kB }
2024-02-29T00:02:36.453653Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 28.27 kB, stun_bytes_to_peer_relayed: 176.17 kB }
2024-02-29T00:02:46.453956Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 29.87 kB, stun_bytes_to_peer_relayed: 187.79 kB }
2024-02-29T00:02:56.454060Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 31.47 kB, stun_bytes_to_peer_relayed: 199.41 kB }
2024-02-29T00:03:06.454192Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 33.07 kB, stun_bytes_to_peer_relayed: 211.03 kB }
2024-02-29T00:03:16.453685Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 34.67 kB, stun_bytes_to_peer_relayed: 222.65 kB }
2024-02-29T00:03:26.454280Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 36.27 kB, stun_bytes_to_peer_relayed: 234.27 kB }
2024-02-29T00:03:36.453754Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 37.87 kB, stun_bytes_to_peer_relayed: 245.89 kB }
2024-02-29T00:03:46.454647Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 39.47 kB, stun_bytes_to_peer_relayed: 257.51 kB }
2024-02-29T00:03:56.453414Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 41.07 kB, stun_bytes_to_peer_relayed: 269.13 kB }
2024-02-29T00:04:06.454558Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 42.67 kB, stun_bytes_to_peer_relayed: 280.75 kB }
2024-02-29T00:04:16.453673Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 44.27 kB, stun_bytes_to_peer_relayed: 292.37 kB }
2024-02-29T00:04:26.453906Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 45.87 kB, stun_bytes_to_peer_relayed: 303.99 kB }
2024-02-29T00:04:36.453788Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 47.47 kB, stun_bytes_to_peer_relayed: 315.61 kB }
2024-02-29T00:04:46.454831Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 49.07 kB, stun_bytes_to_peer_relayed: 327.23 kB }
2024-02-29T00:04:56.453251Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 50.67 kB, stun_bytes_to_peer_relayed: 338.85 kB }
2024-02-29T00:05:06.453654Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 52.27 kB, stun_bytes_to_peer_relayed: 350.47 kB }
2024-02-29T00:05:16.453233Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 53.87 kB, stun_bytes_to_peer_relayed: 362.09 kB }
2024-02-29T00:05:26.453618Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 55.47 kB, stun_bytes_to_peer_relayed: 373.71 kB }
2024-02-29T00:05:36.453282Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 57.07 kB, stun_bytes_to_peer_relayed: 385.33 kB }
2024-02-29T00:05:46.454126Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 58.67 kB, stun_bytes_to_peer_relayed: 396.95 kB }
2024-02-29T00:05:56.453692Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 60.27 kB, stun_bytes_to_peer_relayed: 408.42 kB }
2024-02-29T00:06:06.454238Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 61.87 kB, stun_bytes_to_peer_relayed: 419.88 kB }
2024-02-29T00:06:16.453243Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 63.40 kB, stun_bytes_to_peer_relayed: 431.50 kB }
2024-02-29T00:06:26.454474Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 65.00 kB, stun_bytes_to_peer_relayed: 443.12 kB }
2024-02-29T00:06:36.453941Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 66.60 kB, stun_bytes_to_peer_relayed: 454.74 kB }
2024-02-29T00:06:46.453610Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 68.12 kB, stun_bytes_to_peer_relayed: 466.36 kB }
2024-02-29T00:06:56.454841Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 69.72 kB, stun_bytes_to_peer_relayed: 477.98 kB }
2024-02-29T00:07:06.453429Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 71.32 kB, stun_bytes_to_peer_relayed: 489.60 kB }
2024-02-29T00:07:16.454171Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 72.92 kB, stun_bytes_to_peer_relayed: 501.22 kB }
2024-02-29T00:07:26.454635Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 74.52 kB, stun_bytes_to_peer_relayed: 512.84 kB }
2024-02-29T00:07:36.453232Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 76.12 kB, stun_bytes_to_peer_relayed: 524.46 kB }
2024-02-29T00:07:46.453754Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 77.72 kB, stun_bytes_to_peer_relayed: 536.08 kB }
2024-02-29T00:07:56.454411Z  INFO connlib::stats: ConnectionStats { stun_bytes_to_peer_direct: 79.32 kB, stun_bytes_to_peer_relayed: 547.70 kB }
```

This is already with some of the optimisations we discussed applied, i.e. invalidating a lot of candidate pairs. Something is still off because more than one pair is still being tested. The above logs are with 5 pairs, I think.

@thomaseizinger
Collaborator Author

The above comes down to about 80 kB a minute for keep-alives. During that time, we are sending 28 unique messages (14 requests & 14 responses), so I think in total it means we are keeping 14 candidate pairs alive.

If my math doesn't fail me, that is ~6 kB per pair per minute.
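As a quick sanity check of that arithmetic, using the numbers above:

```rust
fn main() {
    // Numbers taken from the logs and the comment above.
    let total_kb_per_minute = 80.0_f64; // ~80 kB/min of keep-alive traffic
    let pairs_kept_alive = 14.0_f64; // 28 unique messages = 14 request/response pairs

    let per_pair = total_kb_per_minute / pairs_kept_alive;
    println!("~{per_pair:.1} kB per pair per minute"); // prints ~5.7
}
```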

@algesten
Owner

This sounds within expected parameters. It will be further improved by a resolution to #490, where the user can reduce the frequency if they so wish.

Notice that 6 kB per pair per minute is nothing for the WebRTC use case.

@algesten
Owner

Can we close this in favor of #490?

@thomaseizinger
Collaborator Author

Yeah, we resolved this by invalidating all other candidate pairs :)
