
fix(ice): controlled agent follows nomination of same prio socket #496

Closed

Conversation

thomaseizinger
Collaborator

If a candidate on the controlling IceAgent is invalidated but a same-priority candidate is added, the controlled agent currently does not follow this nomination right away and instead runs into the ICE timeout before it switches over.

The controlling agent is the one that picks the nominated candidate and thus only it should bail early on nomination of same-priority candidates, to avoid socket hopping.
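
Roughly, the guard this boils down to looks like the following (a minimal sketch; the function and parameter names are mine, not str0m's internals):

```rust
// Sketch only; not str0m's actual code.
//
// Only the controlling agent picks the nominated pair, so only it
// should bail early when a nomination for a same-priority pair
// arrives. The controlled agent must follow the nomination instead
// of waiting for the ICE timeout.
fn keep_current_pair(is_controlling: bool, current_prio: u64, nominated_prio: u64) -> bool {
    is_controlling && nominated_prio == current_prio
}
```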

@thomaseizinger
Collaborator Author

Hmm, this doesn't quite seem to work the way I expected because now, the controlled agent will emit a nomination whereas the controlling one doesn't :(

@algesten
Owner

Maybe it would be easier to change the priority of an invalidated candidate to 0?

@thomaseizinger
Collaborator Author

thomaseizinger commented Apr 15, 2024

The issue we are running into is that when one of the agents roams, it sends new candidates to the other party. At this point, the receiving end currently needs to wait for the ICE timeout before it nominates that new candidate.

I think this is usually what an ICE restart is used for, but I find ICE restarts to be too destructive and not really in the spirit of "trickle ICE". Would you be open to adding an `IceAgent::invalidate_remote_candidate` function such that a signalling layer could also trickle candidates that should no longer be used, thus avoiding the ICE timeout?
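
Concretely, I am imagining something like this (a rough sketch; all names and types here are placeholders, not existing str0m API):

```rust
// Hypothetical sketch of the proposed API; not existing str0m code.
struct Candidate; // stand-in for str0m's candidate type

struct IceAgent {
    remote_candidates: Vec<Candidate>,
}

impl IceAgent {
    /// Drop a remote candidate that the signalling layer told us is gone.
    /// Pruning the pairs formed with it lets the agent re-nominate over
    /// the remaining pairs instead of waiting for the ICE timeout.
    fn invalidate_remote_candidate(&mut self, _candidate: &Candidate) {
        // 1. Remove the candidate from `self.remote_candidates`.
        // 2. Discard all candidate pairs using it.
        // 3. Trigger nomination over the remaining pairs.
    }
}
```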

@algesten
Owner

> Maybe it would be easier to change the priority of an invalidated candidate to 0?

or at least some penalty? 🤔

@thomaseizinger
Collaborator Author

> Maybe it would be easier to change the priority of an invalidated candidate to 0?
>
> or at least some penalty? 🤔

I think there are two issues:

  1. The roaming client can no longer send from the old candidate.
  2. The other party has no idea that the previous candidate was invalidated.

How would the adjusted priority get communicated to the other party?

@algesten
Owner

So the old candidate is definitely gone already?

@thomaseizinger
Collaborator Author

> So the old candidate is definitely gone already?

In our case, yes. I'd say that is a fair assumption when calling `invalidate_candidate`.

@algesten
Owner

algesten commented Apr 15, 2024

When you say that an ICE restart is too destructive, what do you mean? The way I think this is supposed to work (if you have worked with the RTCPeerConnection API in JavaScript, for instance) is that there are two cases:

  1. IceAgent detects the disconnect, which results in moving to lesser-priority candidates (that have been kept alive by STUN packets in the background).
  2. User detects the disconnect using some other API (connectivity API?) and forces a quicker restart by using an ICE restart.

@thomaseizinger
Collaborator Author

> When you say that an ICE restart is too destructive, what do you mean?

It resets the credentials and requires an exchange of new ones via the signalling layer. It also means that whatever API you are building can have serious consequences for connectivity if it is called in the wrong circumstances. Compare that to an approach where we are "just" signalling another candidate to the remote and, as suggested above, also signalling when we have detected that we definitely don't want to be using a certain candidate anymore.

The way we detect roaming in our application is by actively performing STUN with a set of servers and refreshing all allocations. If we didn't actually roam, this is idempotent: we just refreshed our allocation and confirmed that we still have the same srflx IP.

If we did roam, STUN will give us a new srflx IP and the refresh will fail with an allocation mismatch. In both cases, I can now invalidate the previous candidate and signal the new ones to the remote.
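
In pseudocode, that check is roughly the following (a sketch; `stun_binding`, `refresh_allocation` and `AllocationMismatch` are made-up stand-ins for our plumbing):

```rust
use std::net::SocketAddr;

// Hypothetical stand-ins for our actual STUN / TURN plumbing.
struct AllocationMismatch;
fn stun_binding() -> SocketAddr { todo!() }
fn refresh_allocation() -> Result<(), AllocationMismatch> { todo!() }

/// Sketch of the roaming check described above.
fn check_for_roaming(old_srflx: SocketAddr) {
    let new_srflx = stun_binding(); // re-STUN against our servers
    let refresh = refresh_allocation(); // refresh the TURN allocation

    if new_srflx == old_srflx && refresh.is_ok() {
        // We did not roam; the refresh was idempotent and we are done.
        return;
    }

    // We roamed: invalidate the previous candidate and signal the
    // new ones to the remote (not shown).
}
```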

> IceAgent detects the disconnect

This is bound to the ICE timeout though, which is quite long. I'd like a solution that doesn't rely on timeouts.

> which results in moving to lesser-priority candidates (that have been kept alive by STUN packets in the background).

When you are actually roaming, this isn't going to work because all your previous network paths will fail. We don't keep any additional paths alive, to minimize chatter, but even if we did, they would all start to fail around the same time (modulo some randomization of when the STUN packets get sent). What ends up happening is that the agent hops from one candidate pair to the next, each failing after 1-2 seconds, which only prolongs the actual detection that all candidates are dead.

> User detects the disconnect using some other API (connectivity API?) and forces a quicker restart by using an ICE restart.

As mentioned above, it would be nice to build this in an idempotent way where this "detection" is allowed to yield false positives and any resulting action can deal with that gracefully. An ICE restart isn't very graceful from what I understand, but perhaps that is what I need to change my mind on?

Are you open to an API to invalidate remote candidates? I'd like to experiment with selectively invalidating candidates on the remote end to force nomination of a new pair before the ICE timeout but I don't want to run experiments with an API that will definitely not get accepted upstream.

@lolgesten

> Are you open to an API to invalidate remote candidates?

Reluctant since this isn't how the standard works.

Would another solution be to pre-negotiate the ICE credentials for the restart, so that the server can be "prepped" with the new credentials ahead of time, meaning there is no signaling involved, just a new gathering phase?

@thomaseizinger
Collaborator Author

thomaseizinger commented Apr 16, 2024

> Are you open to an API to invalidate remote candidates?

> Reluctant since this isn't how the standard works.

Hmm, I was hoping that adding optional, non-standard APIs would be okay. Is `invalidate_candidate` a standardized API? I can't find anything in that regard on https://developer.mozilla.org/en-US/docs/Web/API/RTCPeerConnection.

> Would another solution be to pre-negotiate the ICE credentials for the restart, so that the server can be "prepped" with the new credentials ahead of time, meaning there is no signaling involved, just a new gathering phase?

Isn't this kind of equivalent to invalidating all candidates while not resetting the credentials? And if we are invalidating all candidates, why not also just a specific one?

@lolgesten

I mean sure, we can explore other non-standard methods, but I don't see how doing an ICE restart would take any meaningful amount of time if you take signaling out of the equation. It's like the round trip of a UDP packet, which is what you get with the proposed method also, no?

@thomaseizinger
Collaborator Author

> I mean sure, we can explore other non-standard methods, but I don't see how doing an ICE restart would take any meaningful amount of time if you take signaling out of the equation. It's like the round trip of a UDP packet, which is what you get with the proposed method also, no?

I am not so worried about the time it takes on the whole. My primary goal is to reduce complexity in state management and achieve a smooth transition that does not declare the connection "failed" at any point.

If I do an ICE restart without immediately adding new candidates, the connection will fail. To avoid that, I need to wait for the first new candidate of the new network interface. But that is an async operation (STUN binding / TURN refresh). If we are dealing with a false positive, that async operation never yields new candidates. But if it does, I need to be able to check some state and issue an ICE restart, because the binding / refresh was triggered as part of an indication by higher-level APIs that we are switching networks.

In addition to that, an ICE restart also requires signalling new credentials. Credentials that may have to be "parked" on the other end, in case they get used? Would that happen in the `IceAgent` or would I have to do that?

Maybe some of these quirks can be worked around by always doing an ICE restart and accepting the cost of a bit more chatter? But even that is sub-optimal. What if the previous connection went through a relay and we now settle on a different relay? Probably fine as well.

I can't help but think that this is quite complicated compared to invalidating an existing candidate and thus forcing nomination of the next one that we trickled via the same mechanism as all the initial ones.

@thomaseizinger
Collaborator Author

> non-standard methods

I did chat to some ICE-involved people at the last IETF meeting in Brisbane. If this idea works out, maybe we can put it into a draft and discuss it at the next one in Vancouver in July? I've been meaning to find an excuse to go there 😁

@lolgesten

Sure. Explore non-standard API ;)

@thomaseizinger
Collaborator Author

> Sure. Explore non-standard API ;)

Reading a bit through https://datatracker.ietf.org/doc/html/rfc8839, I noticed that there is a section on extensions to ICE candidate attributes: https://datatracker.ietf.org/doc/html/rfc8839#name-candidate-attribute-extensi

I wonder if it would make sense to add an attribute that signals "this candidate is now invalid"? It feels a bit odd, but at the same time it could find more adoption that way because the same signalling layer can be used. Some food for thought!
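
For illustration, such a line could reuse the extension grammar from that section, say with a hypothetical `invalidated` attribute (entirely made up, just to show the shape):

```
a=candidate:1 1 UDP 2130706431 198.51.100.7 61000 typ host invalidated 1
```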

I'll attempt to implement it with a dedicated message for now :)

@thomaseizinger
Collaborator Author

Closing this PR because this approach is broken. Will open a new one once I have the `invalidate_remote_candidate` function ready.
