
fix(ice): controlled agent follows nomination of same prio socket #496

Closed

Conversation

thomaseizinger
Collaborator

If a candidate on the controlling IceAgent is invalidated but a same-priority candidate is added, the controlled agent currently does not follow this nomination right away and instead runs into the ICE timeout before it switches over.

The controlling agent is the one that picks the nominated candidate and thus only it should bail early on nomination of same-priority candidates, to avoid socket hopping.
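
Roughly, the guard this boils down to looks like the following (a minimal sketch; the function and parameter names are mine, not str0m's internals):

```rust
// Sketch only; not str0m's actual code.
//
// Only the controlling agent picks the nominated pair, so only it
// should bail early when a nomination for a same-priority pair
// arrives. The controlled agent must follow the nomination instead
// of waiting for the ICE timeout.
fn keep_current_pair(is_controlling: bool, current_prio: u64, nominated_prio: u64) -> bool {
    is_controlling && nominated_prio == current_prio
}
```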

@thomaseizinger
Collaborator Author

Hmm, this doesn't quite seem to work the way I expected because now, the controlled agent will emit a nomination whereas the controlling one doesn't :(

@algesten
Owner

Maybe it would be easier to change the priority of an invalidated candidate to 0?

@thomaseizinger
Collaborator Author

thomaseizinger commented Apr 15, 2024

The issue we are running into is that when one of the agents roams, it sends new candidates to the other party. At this point, the receiving end currently needs to wait for the ICE timeout before it nominates that new candidate.

I think this is usually what an ICE restart is used for, but I find ICE restarts to be too destructive and not really in the spirit of "trickle ICE". Would you be open to adding an `IceAgent::invalidate_remote_candidate` function such that a signalling layer could also trickle candidates that should no longer be used, thus avoiding the ICE timeout?
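
Concretely, I am imagining something like this (a rough sketch; all names and types here are placeholders, not existing str0m API):

```rust
// Hypothetical sketch of the proposed API; not existing str0m code.
struct Candidate; // stand-in for str0m's candidate type

struct IceAgent {
    remote_candidates: Vec<Candidate>,
}

impl IceAgent {
    /// Drop a remote candidate that the signalling layer told us is gone.
    /// Pruning the pairs formed with it lets the agent re-nominate over
    /// the remaining pairs instead of waiting for the ICE timeout.
    fn invalidate_remote_candidate(&mut self, _candidate: &Candidate) {
        // 1. Remove the candidate from `self.remote_candidates`.
        // 2. Discard all candidate pairs using it.
        // 3. Trigger nomination over the remaining pairs.
    }
}
```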

@algesten
Owner

> Maybe it would be easier to change the priority of an invalidated candidate to 0?

or at least some penalty? 🤔

@thomaseizinger
Collaborator Author

> Maybe it would be easier to change the priority of an invalidated candidate to 0?
>
> or at least some penalty? 🤔

I think there are two issues:

  1. The roaming client can no longer send from the old candidate.
  2. The other party has no idea that the previous candidate was invalidated.

How would the adjusted priority get communicated to the other party?

@algesten
Owner

So the old candidate is definitely gone already?

@thomaseizinger
Collaborator Author

> So the old candidate is definitely gone already?

In our case, yes. I'd say that is a fair assumption when calling `invalidate_candidate`.

@algesten
Owner

algesten commented Apr 15, 2024

When you say that an ICE restart is too destructive, what do you mean? The way I think this is supposed to work (if you have worked with the RTCPeerConnection API in JavaScript, for instance) is that there are two cases:

  1. IceAgent detects the disconnect, which results in moving to lesser-priority candidates (that have been kept alive by STUN packets in the background).
  2. User detects the disconnect using some other API (connectivity API?) and forces a quicker restart by using an ICE restart.

@thomaseizinger
Collaborator Author

> When you say that an ICE restart is too destructive, what do you mean?

It resets the credentials and requires an exchange of new ones via the signalling layer. It also means that whatever API you are building can have serious consequences for connectivity if it is called in the wrong circumstances. Compare that to an approach where we are "just" signalling another candidate to the remote and, as suggested above, also signalling when we have detected that we definitely don't want to be using a certain candidate anymore.

The way we detect roaming in our application is by actively performing STUN with a set of servers and refreshing all allocations. If we didn't actually roam, this is idempotent: we just refreshed our allocation and confirmed that we still have the same srflx IP.

If we did roam, STUN will give us a new srflx IP and the refresh will fail with an allocation mismatch. In both cases, I can now invalidate the previous candidate and signal the new ones to the remote.
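
In pseudocode, that check is roughly the following (a sketch; `stun_binding`, `refresh_allocation` and `AllocationMismatch` are made-up stand-ins for our plumbing):

```rust
use std::net::SocketAddr;

// Hypothetical stand-ins for our actual STUN / TURN plumbing.
struct AllocationMismatch;
fn stun_binding() -> SocketAddr { todo!() }
fn refresh_allocation() -> Result<(), AllocationMismatch> { todo!() }

/// Sketch of the roaming check described above.
fn check_for_roaming(old_srflx: SocketAddr) {
    let new_srflx = stun_binding(); // re-STUN against our servers
    let refresh = refresh_allocation(); // refresh the TURN allocation

    if new_srflx == old_srflx && refresh.is_ok() {
        // We did not roam; the refresh was idempotent and we are done.
        return;
    }

    // We roamed: invalidate the previous candidate and signal the
    // new ones to the remote (not shown).
}
```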

> IceAgent detects the disconnect

This is bound to the ICE timeout though, which is quite long. I'd like a solution that doesn't rely on timeouts.

> which results in moving to lesser-priority candidates (that have been kept alive by STUN packets in the background).

When you are actually roaming, this isn't going to work because all your previous network paths will fail. We don't keep any additional paths alive, to minimize chatter, but even if we did, they would all start to fail around the same time (modulo some randomization of when the STUN packets get sent). What ends up happening is that the agent hops from one candidate pair to the next, each failing after 1-2 seconds, which only prolongs the actual detection that all candidates are dead.

> User detects the disconnect using some other API (connectivity API?) and forces a quicker restart by using an ICE restart.

As mentioned above, it would be nice to build this in an idempotent way where this "detection" is allowed to yield false positives and any resulting action can deal with that gracefully. An ICE restart isn't very graceful from what I understand, but perhaps that is what I need to change my mind on?

Are you open to an API to invalidate remote candidates? I'd like to experiment with selectively invalidating candidates on the remote end to force nomination of a new pair before the ICE timeout but I don't want to run experiments with an API that will definitely not get accepted upstream.

@lolgesten

> Are you open to an API to invalidate remote candidates?

Reluctant since this isn't how the standard works.

Would another solution be to pre-negotiate the ICE credentials for the restart, so that the server can be "prepped" with the new credentials ahead of time, meaning there is no signaling involved, just a new gathering phase?

@thomaseizinger
Collaborator Author

thomaseizinger commented Apr 16, 2024

> Are you open to an API to invalidate remote candidates?

> Reluctant since this isn't how the standard works.

Hmm, I was hoping that adding optional, non-standard APIs would be okay. Is `invalidate_candidate` a standardized API? I can't find anything in that regard on https://developer.mozilla.org/en-US/docs/Web/API/RTCPeerConnection.

> Would another solution be to pre-negotiate the ICE credentials for the restart, so that the server can be "prepped" with the new credentials ahead of time, meaning there is no signaling involved, just a new gathering phase?

Isn't this kind of equivalent to invalidating all candidates while not resetting the credentials? And if we are invalidating all candidates, why not also just a specific one?

@lolgesten

I mean sure, we can explore other non-standard methods, but I don't see how doing an ICE restart would take any meaningful amount of time if you take signaling out of the equation. It's like the round trip of a UDP packet, which is what you get with the proposed method also, no?

@thomaseizinger
Collaborator Author

> I mean sure, we can explore other non-standard methods, but I don't see how doing an ICE restart would take any meaningful amount of time if you take signaling out of the equation. It's like the round trip of a UDP packet, which is what you get with the proposed method also, no?

I am not so worried about the time it takes on the whole. My primary goal is to reduce complexity in state management and achieve a smooth transition that does not declare the connection "failed" at any point.

If I do an ICE restart without immediately adding new candidates, the connection will fail. To avoid that, I need to wait for the first new candidate of the new network interface. But that is an async operation (STUN binding / TURN refresh). If we are dealing with a false positive, that async operation never yields new candidates. But if it does, I need to be able to check some state and issue an ICE restart, because the binding / refresh was triggered as part of an indication by higher-level APIs that we are switching networks.

In addition to that, an ICE restart also requires signalling new credentials. Credentials that may have to be "parked" on the other end, in case they get used? Would that happen in the `IceAgent` or would I have to do that?

Maybe some of these quirks can be worked around by always doing an ICE restart and accepting the cost of a bit more chatter? But even that is sub-optimal. What if the previous connection went through a relay and we now settle on a different relay? Probably fine as well.

I can't help but think that this is quite complicated compared to invalidating an existing candidate and thus forcing nomination of the next one that we trickled via the same mechanism as all the initial ones.

@thomaseizinger
Collaborator Author

> non-standard methods

I did chat to some ICE-involved people at the last IETF meeting in Brisbane. If this idea works out, maybe we can put it into a draft and discuss it at the next one in Vancouver in July? I've been meaning to find an excuse to go there 😁

@lolgesten

Sure. Explore non-standard API ;)

@thomaseizinger
Collaborator Author

> Sure. Explore non-standard API ;)

Reading a bit through https://datatracker.ietf.org/doc/html/rfc8839, I noticed that there is a section on extensions to ICE candidate attributes: https://datatracker.ietf.org/doc/html/rfc8839#name-candidate-attribute-extensi

I wonder if it would make sense to add an attribute that signals "this candidate is now invalid"? It feels a bit odd, but at the same time it could find more adoption that way because the same signalling layer can be used. Some food for thought!
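
For illustration, such a line could reuse the extension grammar from that section, say with a hypothetical `invalidated` attribute (entirely made up, just to show the shape):

```
a=candidate:1 1 UDP 2130706431 198.51.100.7 61000 typ host invalidated 1
```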

I'll attempt to implement it with a dedicated message for now :)

@thomaseizinger
Collaborator Author

Closing this PR because this approach is broken. Will open a new one once I have the `invalidate_remote_candidate` function ready.
