Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding new candidates and invalidating all previous ones flaps instantly to disconnected #486

Closed
thomaseizinger opened this issue Mar 21, 2024 · 18 comments

Comments

@thomaseizinger
Copy link
Collaborator

When roaming networks, we learn about new interfaces and retire old ones. It appears that, if str0m is told about all of this information at once (i.e. invalidate all current candidates and add new ones), it instantly flaps to Disconnected instead of trying the new candidate pairs.

I think this is due to evaluate_state being called at the very top of handle_timeout:

self.evaluate_state(now);

Is this on purpose? I would have expected str0m first start testing all newly added candidates before concluding that we are Disconnected.

@algesten
Copy link
Owner

Notice that Disconnected doesn't mean it's over. The ICE spec has a Failed
state that is the end of the road that str0m deliberately doesn't have.

We can always come back from Disconnected, which means the app shouldn't do anything drastic in that state.

The question is whether this scenario you're testing has an unreasonable amount of time passing when coming back from Disconnected?

@thomaseizinger
Copy link
Collaborator Author

I see, we currently treat it as drastic so that is probably the bug here.

Let me test with delaying that for a bit.

@thomaseizinger
Copy link
Collaborator Author

Some preliminary testing suggests that this works quite well! Invalidating the old candidate and delaying acting on Disconnected for a few seconds means we can roam quite quickly! :)

@algesten
Copy link
Owner

Wihoo!

@thomaseizinger thomaseizinger closed this as not planned Won't fix, can't repro, duplicate, stale Mar 21, 2024
@thomaseizinger
Copy link
Collaborator Author

thomaseizinger commented Mar 21, 2024

One question though, why doesn't it immediately switch to Checking in this case? The "not immediately acting on Disconnected" works but harms other cases where e.g. the remote peer goes away and we need to fallback to a new one.

I've given it new candidates, shouldn't it switch to Checking instead?

@algesten
Copy link
Owner

I think it's because we nominated, and that nomination goes away. If I recall correctly Checking is maybe only seen before anything is connected.

@thomaseizinger
Copy link
Collaborator Author

Looking closer at the state machine, there seems to never be a transition to Checking unless we were in New or did ice_restart. Yet, in our case, we end up in Completed (we invalidate all candidates apart from the nominated one).

Once we are in Completed, the only transitions are Connected or Disconnected. However, it appears that it would be a valid transition to go from Completed to Checking if we have non-nominated candidates that were just added.

@algesten
Copy link
Owner

I'm wondering if this is a spec thing. I don't think I ended up here by accident.

@thomaseizinger
Copy link
Collaborator Author

I've grepped the spec for Running and couldn't find anything in that regards.

But outside of that, this looks like a bug to me. From str0m's behaviour we are in Checking mode. All the added candidates are being checked. It is just the reported state that says Disconnected.

@algesten
Copy link
Owner

@algesten
Copy link
Owner

Not that we need to follow how browsers do it, but I figured for someone using str0m, it's nice if the states corresponds as much as possible with what they see in the browser.

@thomaseizinger
Copy link
Collaborator Author

I see. This is somewhat of an edge-case though :)

If you invalidate all current candidates and add new ones, you are kind of going back to the start, as if the agent just got created. Technically none of the checks failed as referred to be the linked text.

I am currently testing with a fork. If this is successful, would you mind a patch for this?

@thomaseizinger
Copy link
Collaborator Author

Now I can't reproduce the disconnect anymore. Not sure what I was testing but the state change is now always Connected -> Completed and then Completed -> Connected as soon as I invalidate the old and add the new candidate.

@algesten
Copy link
Owner

I am currently testing with a fork. If this is successful, would you mind a patch for this?

Would a patch help you guys? I'm sort of thinking I like to preserve the (close to browser) behavior if I can. But I'm not feeling very strongly about it.

@thomaseizinger
Copy link
Collaborator Author

I am currently testing with a fork. If this is successful, would you mind a patch for this?

Would a patch help you guys? I'm sort of thinking I like to preserve the (close to browser) behavior if I can. But I'm not feeling very strongly about it.

I am still evaluating! :)
But yes, not flipping to Disconnected where possible would be nice. Currently, we treat Disconnected as a hard error and clean up our connection state. That is simple to understand which is why I'd like to keep it.

I'm sort of thinking I like to preserve the (close to browser) behavior if I can.

I can understand the sentiment! If I'd have to find an argument for it, then I'd probably say that I would put this into the "we are a WebRTC implementation with an unusual API and this is part of it". When looking at the actual behaviour in regards to the protocol then we are in checking state so it doesn't affect compatibility with anything but is more of an API decision towards the application.

@algesten
Copy link
Owner

Related discussion #416

Notice Peter Thatcher agrees with not treating Disconnected as a hard failure (and he wrote the spec 😄)

@thomaseizinger
Copy link
Collaborator Author

thomaseizinger commented Mar 21, 2024

Notice Peter Thatcher agrees with not treating Disconnected as a hard failure (and he wrote the spec 😄)

I am not disagreeing, just want to defer this complexity because it comes with other implications :)

It lengthens the overall timeout before we can reset all the state because the connection is actually broken. The layer above ICE in our app doesn't know about "temporary connection failures" so it can't act on it. While that could be refactored, it doesn't seem to have a lot of value right now.

It seems useful to entirely avoid going into any error state if we can, be it temporary or permanent. In this case, I would even argue that it is somewhat of a premature decision by the state machine. Yes, I've invalidated the last nominated candidate but I've also given you new ones and it is checking them. Why should that output Disconnected? Why doesn't it immediately also output Checking after? Or go to Connected directly?1 One could argue that invalidating != failing STUN bindings.

Maybe we are just hairsplitting the terminology here? It seems impossible to capture this accurately in a single state. Connected / Disconnected is one. Checking / Completed is orthogonal to that, isn't it?

Footnotes

  1. I need to do more testing on different networks but I've observed this in some situation so I need to narrow down further, when this behaviour occurs.

@thomaseizinger
Copy link
Collaborator Author

I've now implemented a timer and I am now more convinced that even though a timer might be correct in the general case, it is a workaround for this particular scenario.

When switching networks (i.e. invalidating current candidates and adding new ones), connectivity is likely going to be restored. How long will that take? 5 seconds? 10? 15? it is a guessing game. Whilst it may be appealing to have a generic handling of "Disconnected is not fatal" with timer, it doesn't work reliably in this case. Note that what I want to avoid here is triggering clean-up procedures of other connection related state which we currently do when we declare a connection as failed.

It seems to be a much more robust solution to have str0m test the new candidates and tell me when those fail. But for that, I need to first know that it is even testing new candidates (i.e. emit Checking). Currently, the state transitions are:

  • Moving to Comleted
  • Switch networks
  • Completed -> Disconnected
  • Maybe Connected but we don't know when

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants