-
Notifications
You must be signed in to change notification settings - Fork 266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refresh relays messages #4548
Comments
We're all on the same page here. Currently wrapping up discussion w/ @thomaseizinger and @AndrewDryga on this topic and think we have a solution. TodoShould be backwards-compatible with existing APIs:
So with the above changes there are now three ways the client can receive new Relays to potentially use:
|
refs #4510 |
@jamilbk Do we need the relays in the init message if we are already going to recieve them as part of the response to the intent? |
|
We also discussed removing this option entirely so first thing to do will be to remove the allow-list of TURN servers and always use all relays. |
@AndrewDryga I had this differently, why return 200 initially? Shouldn't we return 400 until we are connected to the portal? |
@thomaseizinger we need to tell instance group manager and load balancer that relay can receive traffic before the portal starts to give its address to the clients |
Smashed out a series of PRs that move this issue forward: |
To seamlessly migrate relayed connections when relays get re-deployed, we will be introducing a new message from the portal that informs us regarding relays that are shutting down and new ones that became active. Currently, relays are scoped to a particular connection. With the introduction of the above message, it would be unclear, how these new relays should be added to these allow lists. To make this simpler, we remove these allow lists and always use all relays for all connections. Related: #4548.
Upon receiving a SIGTERM, we immediately disconnect from the websocket connection to the portal and set a flag that we are shutting down. Once we are disconnected from the portal and no longer have an active allocations, we exit with 0. A repeated SIGTERM signal will interrupt this process and force the relay to shutdown. Disconnecting from the portal will (eventually) trigger a message to clients and gateways that this relay should no longer be used. Thus, depending on the timeout our supervisor has configured after sending SIGTERM, the relay will continue all TURN operations until the number of allocations drops to 0. Currently, we also allow clients to make new allocations and refreshing existing allocations. In the future, it may make sense to implement a dedicated status code and refuse `ALLOCATE` and `REFRESH` messages whilst we are shutting down. Related: #4548. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
This is another step towards #4548. The portal now includes a list of relays as part of the "init" message. Any time we receive an "init", we will now upsert those relays based on their ID. This requires us to change our internal bookkeeping of relays from indexing them by address to indexing by ID. To ensure that this works correctly, the unit tests are rewritten to use the new `upsert_relays` API. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Whenever we receive a `relays_presence` message from the portal, we invalidate the candidates of all now disconnected relays and make allocations on the new ones. This triggers signalling of new candidates to the remote party and migrates the connection to the newly nominated socket. This still relies on #4613 until we have #4634. Resolves: #4548. --------- Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Not necessarily the cause behind #4521 but might partially be.
Right now we get the relays credentials after we generate a connection intent, we create a new connection and we use the given TURN servers with the given credentials throughout the lifetime of the connection.
However, when there is a redploy the relays can get restarted or deleted making those credentials or whole server invalid.
The portal already have a mechanism implemented where a new relay is deployed before taking down the old one, with a new message for the gateway and client we can greatly reduce or even prevent any downtime.
With the portal broadcasting the new servers for the given connections and with a message for invalidating the old server, we can inform this to snownet and invalidate the corresponding candidates and generate new ones.
Right now the highest priority would be just to be able to inform the new TURN candidates(new servers) since with that the old ones would just timeout, without this the peer connection would be terminated invalidating all DNS mappings which would take a long time to self-heal.
The text was updated successfully, but these errors were encountered: