
Refresh relays messages #4548

Closed
conectado opened this issue Apr 9, 2024 · 8 comments · Fixed by #4568
Assignees
Labels
  • area/connlib: Firezone's core connectivity library
  • area/portal: Portal, panel, web, control plane, you name it!
  • business_value/critical: Required by 100% of our customer base

Comments

@conectado
Collaborator

Not necessarily the cause behind #4521 but might partially be.

Right now, we receive the relay credentials after we generate a connection intent; we then create a new connection and use the given TURN servers with those credentials throughout the lifetime of the connection.

However, when there is a redeploy, the relays can get restarted or deleted, invalidating those credentials or the whole server.

The portal already has a mechanism where a new relay is deployed before the old one is taken down. With a new message for the gateway and client, we can greatly reduce or even prevent any downtime.

With the portal broadcasting the new servers for the given connections, and with a message for invalidating the old server, we can forward this information to snownet, invalidate the corresponding candidates, and generate new ones.

Right now, the highest priority is being able to announce the new TURN candidates (new servers); with those in place, the old ones would simply time out. Without this, the peer connection would be terminated, invalidating all DNS mappings, which would take a long time to self-heal.
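The bookkeeping described above could be sketched roughly as follows. This is an illustrative sketch only, not connlib's actual API: the `Relay` and `RelaySet` types and their fields are hypothetical stand-ins for whatever snownet tracks internally.

```rust
use std::collections::HashMap;

// Hypothetical relay bookkeeping: when the portal announces new TURN
// servers we upsert them by ID, and when a server goes away we remove it
// and return the addresses whose candidates must be withdrawn.
#[derive(Clone, Debug, PartialEq)]
struct Relay {
    id: u64,
    addr: String,
    username: String,
    password: String,
}

#[derive(Default)]
struct RelaySet {
    relays: HashMap<u64, Relay>,
}

impl RelaySet {
    /// Insert or update relays announced by the portal.
    fn upsert(&mut self, new: Vec<Relay>) {
        for r in new {
            self.relays.insert(r.id, r);
        }
    }

    /// Invalidate relays the portal reports as gone, returning the
    /// addresses whose candidates should be removed from connections.
    fn invalidate(&mut self, gone: &[u64]) -> Vec<String> {
        gone.iter()
            .filter_map(|id| self.relays.remove(id).map(|r| r.addr))
            .collect()
    }
}
```

Indexing by ID rather than by address is what makes the upsert-on-announce and invalidate-on-shutdown operations unambiguous when a redeployed relay reuses an address.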

@conectado conectado added area/connlib Firezone's core connectivity library area/portal Portal, panel, web, control plane, you name it! labels Apr 9, 2024
@conectado conectado added this to the 1.0 GA milestone Apr 9, 2024
@jamilbk
Member

jamilbk commented Apr 9, 2024

We're all on the same page here. Currently wrapping up a discussion w/ @thomaseizinger and @AndrewDryga on this topic; we think we have a solution.

Todo

Should be backwards-compatible with existing APIs:

  • Relay: Disconnect websocket on SIGTERM but keep existing data connections active
  • Portal sends message that "X Relay is down, here are the N healthy ones you can use"
  • Portal sends N relays in the init message in addition to request_intent (cc @conectado)

So with the above changes there are now three ways the client can receive new Relays to potentially use:

  • init
  • relay going down
  • request_intent (receiving Relays this way is only for backwards compatibility and can go away once all clients have the above updates)
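The three delivery paths listed above could be modeled as variants of a single message type. This is a hypothetical sketch; the real portal wire format and field names are not shown in this thread.

```rust
// Illustrative shape of the three portal messages that can carry relays.
// All variant and field names here are assumptions, not the real protocol.
enum PortalMessage {
    /// Sent on (re)connect; now also carries the current relay set.
    Init { relays: Vec<String> },
    /// Sent when a relay goes down: which one, and healthy replacements.
    RelaysPresence { disconnected_ids: Vec<u64>, connected: Vec<String> },
    /// Legacy path: relays attached to the connection-intent response,
    /// kept only for backwards compatibility with older clients.
    RequestIntentResponse { relays: Vec<String> },
}

/// Regardless of which path delivered them, the client treats the
/// contained relays uniformly.
fn relays_in(msg: &PortalMessage) -> &[String] {
    match msg {
        PortalMessage::Init { relays }
        | PortalMessage::RelaysPresence { connected: relays, .. }
        | PortalMessage::RequestIntentResponse { relays } => relays,
    }
}
```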

@jamilbk
Member

jamilbk commented Apr 9, 2024

refs #4510

@conectado
Collaborator Author

conectado commented Apr 9, 2024

* [ ]  Portal sends N relays in the init message in addition to request_intent (cc @conectado)

@jamilbk Do we need the relays in the init message if we are already going to receive them as part of the response to the intent?

@AndrewDryga
Collaborator

  • Graceful shutdown of the relay: disconnect the WS connection on SIGTERM or other signals; if there are no allocations, exit immediately;
  • Portal sends a message with the disconnected relay ID and N relays to use; relays will be sent in the init message (with backwards compatibility for now);
  • Relay health checks work in two states: when the relay boots, it starts returning HTTP 200 right away and then connects to the websocket; if it fails to connect to the portal, it stops returning 200. After it has connected to a portal, if it then misses heartbeats for PARTITION_DURATION, it stops returning HTTP 200.
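The health-check behavior in the last bullet amounts to a small state machine. A minimal sketch, assuming a `PortalLink` state and a `PARTITION_DURATION` of 60 seconds (both the type and the value are illustrative; the thread does not specify them):

```rust
use std::time::Duration;

// Assumed partition window; the real value is configuration-dependent.
const PARTITION_DURATION: Duration = Duration::from_secs(60);

// Hypothetical portal-connection state as seen by the health endpoint.
enum PortalLink {
    /// Just booted: report healthy so the load balancer admits us.
    Booting,
    /// Initial connection to the portal failed: report unhealthy.
    FailedToConnect,
    /// Connected at least once: healthy unless heartbeats lapse.
    Connected { missed_heartbeats_for: Duration },
}

/// Whether the health endpoint should return HTTP 200.
fn is_healthy(link: &PortalLink) -> bool {
    match link {
        PortalLink::Booting => true,
        PortalLink::FailedToConnect => false,
        PortalLink::Connected { missed_heartbeats_for } => {
            *missed_heartbeats_for < PARTITION_DURATION
        }
    }
}
```

Returning 200 during `Booting` is what lets the instance group manager and load balancer admit the relay before the portal hands out its address, as discussed further down the thread.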

@thomaseizinger
Member

* [ ]  Portal sends N relays in the init message in addition to request_intent (cc @conectado)

@jamilbk Do we need the relays in the init message if we are already going to receive them as part of the response to the intent?

We also discussed removing this option entirely, so the first thing to do will be to remove the allow-list of TURN servers and always use all relays.

@thomaseizinger
Member

thomaseizinger commented Apr 9, 2024

  • when relay boots it starts returning HTTP 200 right away and then connects to the websocket

@AndrewDryga I had understood this differently: why return 200 initially? Shouldn't we return 400 until we are connected to the portal?

@AndrewDryga
Collaborator

@thomaseizinger we need to tell the instance group manager and load balancer that the relay can receive traffic before the portal starts to give out its address to clients.

github-merge-queue bot pushed a commit that referenced this issue Apr 10, 2024
To seamlessly migrate relayed connections when relays get re-deployed,
we will be introducing a new message from the portal that informs us
about relays that are shutting down and new ones that became active.

Currently, relays are scoped to a particular connection. With the
introduction of the above message, it would be unclear how these new
relays should be added to these allow lists.

To make this simpler, we remove these allow lists and always use all
relays for all connections.

Related: #4548.
@jamilbk jamilbk removed this from the 1.0 GA milestone Apr 10, 2024
github-merge-queue bot pushed a commit that referenced this issue Apr 12, 2024
Upon receiving a SIGTERM, we immediately disconnect from the websocket
connection to the portal and set a flag that we are shutting down.

Once we are disconnected from the portal and no longer have any active
allocations, we exit with 0. A repeated SIGTERM signal will interrupt
this process and force the relay to shut down.

Disconnecting from the portal will (eventually) trigger a message to
clients and gateways that this relay should no longer be used. Thus,
depending on the timeout our supervisor has configured after sending
SIGTERM, the relay will continue all TURN operations until the number of
allocations drops to 0.

Currently, we also allow clients to make new allocations and refresh
existing allocations. In the future, it may make sense to implement a
dedicated status code and refuse `ALLOCATE` and `REFRESH` messages
whilst we are shutting down.

Related: #4548.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
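The shutdown sequence from the commit above can be sketched as a tiny state machine. This is an illustration of the described behavior, not the relay's actual code; the `Relay` struct, its fields, and the exit codes are assumptions.

```rust
// Hypothetical relay shutdown state: first SIGTERM disconnects from the
// portal and flags shutdown; the relay exits 0 once allocations drain;
// a repeated SIGTERM forces an immediate exit.
struct Relay {
    shutting_down: bool,
    portal_connected: bool,
    allocations: usize,
}

impl Relay {
    /// Returns Some(exit_code) if the relay should exit now.
    fn on_sigterm(&mut self) -> Option<i32> {
        if self.shutting_down {
            return Some(1); // repeated SIGTERM: force shutdown (code assumed)
        }
        self.shutting_down = true;
        self.portal_connected = false; // drop the websocket immediately
        self.maybe_exit()
    }

    /// Called whenever an allocation expires or is released.
    fn on_allocation_closed(&mut self) -> Option<i32> {
        self.allocations = self.allocations.saturating_sub(1);
        self.maybe_exit()
    }

    fn maybe_exit(&self) -> Option<i32> {
        (self.shutting_down && self.allocations == 0).then_some(0)
    }
}
```

Because disconnecting from the portal is what (eventually) tells clients and gateways to stop using this relay, the drain time is bounded by the supervisor's post-SIGTERM timeout, as the commit notes.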
github-merge-queue bot pushed a commit that referenced this issue Apr 15, 2024
This is another step towards #4548. The portal now includes a list of
relays as part of the "init" message. Any time we receive an "init", we
will now upsert those relays based on their ID. This requires us to
change our internal bookkeeping of relays from indexing them by address
to indexing by ID.

To ensure that this works correctly, the unit tests are rewritten to use
the new `upsert_relays` API.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
github-merge-queue bot pushed a commit that referenced this issue Apr 20, 2024
Whenever we receive a `relays_presence` message from the portal, we
invalidate the candidates of all now disconnected relays and make
allocations on the new ones. This triggers signalling of new candidates
to the remote party and migrates the connection to the newly nominated
socket.

This still relies on #4613 until we have #4634.

Resolves: #4548.

---------

Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
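The `relays_presence` handling described in this final commit could be sketched as a pure function from the message to a list of connection actions. All type and field names here are hypothetical; only the message name `relays_presence` comes from the commit.

```rust
// Illustrative `relays_presence` payload: which relays went away,
// and which new ones (id, address) are available.
struct RelaysPresence {
    disconnected_ids: Vec<u64>,
    connected: Vec<(u64, String)>,
}

// Actions the connection layer would carry out in response.
#[derive(Debug, PartialEq)]
enum Action {
    /// Withdraw ICE candidates belonging to a now-gone relay.
    InvalidateCandidates { relay_id: u64 },
    /// Make a TURN allocation on a newly announced relay, which in turn
    /// signals fresh candidates to the remote party.
    Allocate { relay_id: u64, addr: String },
}

fn handle_relays_presence(msg: RelaysPresence) -> Vec<Action> {
    let mut actions = Vec::new();
    for id in msg.disconnected_ids {
        actions.push(Action::InvalidateCandidates { relay_id: id });
    }
    for (id, addr) in msg.connected {
        actions.push(Action::Allocate { relay_id: id, addr });
    }
    actions
}
```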