
Refresh relays messages #4548

Closed
conectado opened this issue Apr 9, 2024 · 8 comments · Fixed by #4568
Assignees
Labels
  • area/connlib: Firezone's core connectivity library
  • area/portal: Portal, panel, web, control plane, you name it!
  • business_value/critical: Required by 100% of our customer base

Comments

@conectado
Collaborator

Not necessarily the cause behind #4521 but might partially be.

Right now, we receive the relay credentials after we generate a connection intent; we then create a new connection and use the given TURN servers with those credentials throughout the lifetime of the connection.

However, when there is a redeploy, the relays can get restarted or deleted, invalidating those credentials or the whole server.

The portal already has a mechanism where a new relay is deployed before the old one is taken down. With a new message for the gateway and client, we can greatly reduce or even prevent any downtime.

With the portal broadcasting the new servers for the given connections, and with a message for invalidating the old server, we can forward this information to snownet, invalidate the corresponding candidates, and generate new ones.

Right now, the highest priority is being able to announce the new TURN candidates (new servers); with those in place, the old ones would simply time out. Without this, the peer connection would be terminated, invalidating all DNS mappings, which would take a long time to self-heal.
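The bookkeeping described above could be sketched roughly as follows. This is an illustrative sketch only, not connlib's actual API: the `Relay` and `RelaySet` types and their fields are hypothetical stand-ins for whatever snownet tracks internally.

```rust
use std::collections::HashMap;

// Hypothetical relay bookkeeping: when the portal announces new TURN
// servers we upsert them by ID, and when a server goes away we remove it
// and return the addresses whose candidates must be withdrawn.
#[derive(Clone, Debug, PartialEq)]
struct Relay {
    id: u64,
    addr: String,
    username: String,
    password: String,
}

#[derive(Default)]
struct RelaySet {
    relays: HashMap<u64, Relay>,
}

impl RelaySet {
    /// Insert or update relays announced by the portal.
    fn upsert(&mut self, new: Vec<Relay>) {
        for r in new {
            self.relays.insert(r.id, r);
        }
    }

    /// Invalidate relays the portal reports as gone, returning the
    /// addresses whose candidates should be removed from connections.
    fn invalidate(&mut self, gone: &[u64]) -> Vec<String> {
        gone.iter()
            .filter_map(|id| self.relays.remove(id).map(|r| r.addr))
            .collect()
    }
}
```

Indexing by ID rather than by address is what makes the upsert-on-announce and invalidate-on-shutdown operations unambiguous when a redeployed relay reuses an address.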

@conectado conectado added area/connlib Firezone's core connectivity library area/portal Portal, panel, web, control plane, you name it! labels Apr 9, 2024
@conectado conectado added this to the 1.0 GA milestone Apr 9, 2024
@jamilbk
Member

jamilbk commented Apr 9, 2024

We're all on the same page here. Currently wrapping up a discussion w/ @thomaseizinger and @AndrewDryga on this topic; we think we have a solution.

Todo

Should be backwards-compatible with existing APIs:

  • Relay: Disconnect websocket on SIGTERM but keep existing data connections active
  • Portal sends message that "X Relay is down, here are the N healthy ones you can use"
  • Portal sends N relays in the init message in addition to request_intent (cc @conectado)

So with the above changes there are now three ways the client can receive new Relays to potentially use:

  • init
  • relay going down
  • request_intent (receiving Relays this way is only for backwards compatibility and can go away once all clients have the above updates)
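The three delivery paths listed above could be modeled as variants of a single message type. This is a hypothetical sketch; the real portal wire format and field names are not shown in this thread.

```rust
// Illustrative shape of the three portal messages that can carry relays.
// All variant and field names here are assumptions, not the real protocol.
enum PortalMessage {
    /// Sent on (re)connect; now also carries the current relay set.
    Init { relays: Vec<String> },
    /// Sent when a relay goes down: which one, and healthy replacements.
    RelaysPresence { disconnected_ids: Vec<u64>, connected: Vec<String> },
    /// Legacy path: relays attached to the connection-intent response,
    /// kept only for backwards compatibility with older clients.
    RequestIntentResponse { relays: Vec<String> },
}

/// Regardless of which path delivered them, the client treats the
/// contained relays uniformly.
fn relays_in(msg: &PortalMessage) -> &[String] {
    match msg {
        PortalMessage::Init { relays }
        | PortalMessage::RelaysPresence { connected: relays, .. }
        | PortalMessage::RequestIntentResponse { relays } => relays,
    }
}
```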

@jamilbk
Member

jamilbk commented Apr 9, 2024

refs #4510

@conectado
Collaborator Author

conectado commented Apr 9, 2024

* [ ]  Portal sends N relays in the init message in addition to request_intent (cc @conectado)

@jamilbk Do we need the relays in the init message if we are already going to receive them as part of the response to the intent?

@AndrewDryga
Collaborator

  • Graceful shutdown of the relay: disconnect the WS connection on SIGTERM or other signals; if there are no allocations, exit immediately;
  • Portal sends a message with the disconnected relay ID and N relays to use; relays will be sent in the init message (with backwards compatibility for now);
  • Relay health checks work in two states: when the relay boots, it starts returning HTTP 200 right away and then connects to the websocket; if it fails to connect to the portal, it stops returning 200. After it has connected to a portal, if it then misses heartbeats for PARTITION_DURATION, it stops returning HTTP 200.
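The health-check behavior in the last bullet amounts to a small state machine. A minimal sketch, assuming a `PortalLink` state and a `PARTITION_DURATION` of 60 seconds (both the type and the value are illustrative; the thread does not specify them):

```rust
use std::time::Duration;

// Assumed partition window; the real value is configuration-dependent.
const PARTITION_DURATION: Duration = Duration::from_secs(60);

// Hypothetical portal-connection state as seen by the health endpoint.
enum PortalLink {
    /// Just booted: report healthy so the load balancer admits us.
    Booting,
    /// Initial connection to the portal failed: report unhealthy.
    FailedToConnect,
    /// Connected at least once: healthy unless heartbeats lapse.
    Connected { missed_heartbeats_for: Duration },
}

/// Whether the health endpoint should return HTTP 200.
fn is_healthy(link: &PortalLink) -> bool {
    match link {
        PortalLink::Booting => true,
        PortalLink::FailedToConnect => false,
        PortalLink::Connected { missed_heartbeats_for } => {
            *missed_heartbeats_for < PARTITION_DURATION
        }
    }
}
```

Returning 200 during `Booting` is what lets the instance group manager and load balancer admit the relay before the portal hands out its address, as discussed further down the thread.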

@thomaseizinger
Member

* [ ]  Portal sends N relays in the init message in addition to request_intent (cc @conectado)

@jamilbk Do we need the relays in the init message if we are already going to receive them as part of the response to the intent?

We also discussed removing this option entirely, so the first thing to do will be to remove the allow-list of TURN servers and always use all relays.

@thomaseizinger
Member

thomaseizinger commented Apr 9, 2024

  • when relay boots it starts returning HTTP 200 right away and then connects to the websocket

@AndrewDryga I had understood this differently: why return 200 initially? Shouldn't we return 400 until we are connected to the portal?

@AndrewDryga
Collaborator

@thomaseizinger we need to tell the instance group manager and load balancer that the relay can receive traffic before the portal starts to give out its address to clients.

github-merge-queue bot pushed a commit that referenced this issue Apr 10, 2024
To seamlessly migrate relayed connections when relays get re-deployed,
we will be introducing a new message from the portal that informs us
about relays that are shutting down and new ones that became active.

Currently, relays are scoped to a particular connection. With the
introduction of the above message, it would be unclear how these new
relays should be added to these allow lists.

To make this simpler, we remove these allow lists and always use all
relays for all connections.

Related: #4548.
@jamilbk jamilbk removed this from the 1.0 GA milestone Apr 10, 2024
github-merge-queue bot pushed a commit that referenced this issue Apr 12, 2024
Upon receiving a SIGTERM, we immediately disconnect from the websocket
connection to the portal and set a flag that we are shutting down.

Once we are disconnected from the portal and no longer have any active
allocations, we exit with 0. A repeated SIGTERM signal will interrupt
this process and force the relay to shut down.

Disconnecting from the portal will (eventually) trigger a message to
clients and gateways that this relay should no longer be used. Thus,
depending on the timeout our supervisor has configured after sending
SIGTERM, the relay will continue all TURN operations until the number of
allocations drops to 0.

Currently, we also allow clients to make new allocations and refresh
existing allocations. In the future, it may make sense to implement a
dedicated status code and refuse `ALLOCATE` and `REFRESH` messages
whilst we are shutting down.

Related: #4548.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
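The shutdown sequence from the commit above can be sketched as a tiny state machine. This is an illustration of the described behavior, not the relay's actual code; the `Relay` struct, its fields, and the exit codes are assumptions.

```rust
// Hypothetical relay shutdown state: first SIGTERM disconnects from the
// portal and flags shutdown; the relay exits 0 once allocations drain;
// a repeated SIGTERM forces an immediate exit.
struct Relay {
    shutting_down: bool,
    portal_connected: bool,
    allocations: usize,
}

impl Relay {
    /// Returns Some(exit_code) if the relay should exit now.
    fn on_sigterm(&mut self) -> Option<i32> {
        if self.shutting_down {
            return Some(1); // repeated SIGTERM: force shutdown (code assumed)
        }
        self.shutting_down = true;
        self.portal_connected = false; // drop the websocket immediately
        self.maybe_exit()
    }

    /// Called whenever an allocation expires or is released.
    fn on_allocation_closed(&mut self) -> Option<i32> {
        self.allocations = self.allocations.saturating_sub(1);
        self.maybe_exit()
    }

    fn maybe_exit(&self) -> Option<i32> {
        (self.shutting_down && self.allocations == 0).then_some(0)
    }
}
```

Because disconnecting from the portal is what (eventually) tells clients and gateways to stop using this relay, the drain time is bounded by the supervisor's post-SIGTERM timeout, as the commit notes.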
github-merge-queue bot pushed a commit that referenced this issue Apr 15, 2024
This is another step towards #4548. The portal now includes a list of
relays as part of the "init" message. Any time we receive an "init", we
will now upsert those relays based on their ID. This requires us to
change our internal bookkeeping of relays from indexing them by address
to indexing by ID.

To ensure that this works correctly, the unit tests are rewritten to use
the new `upsert_relays` API.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
github-merge-queue bot pushed a commit that referenced this issue Apr 20, 2024
Whenever we receive a `relays_presence` message from the portal, we
invalidate the candidates of all now disconnected relays and make
allocations on the new ones. This triggers signalling of new candidates
to the remote party and migrates the connection to the newly nominated
socket.

This still relies on #4613 until we have #4634.

Resolves: #4548.

---------

Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
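The `relays_presence` handling described in this final commit could be sketched as a pure function from the message to a list of connection actions. All type and field names here are hypothetical; only the message name `relays_presence` comes from the commit.

```rust
// Illustrative `relays_presence` payload: which relays went away,
// and which new ones (id, address) are available.
struct RelaysPresence {
    disconnected_ids: Vec<u64>,
    connected: Vec<(u64, String)>,
}

// Actions the connection layer would carry out in response.
#[derive(Debug, PartialEq)]
enum Action {
    /// Withdraw ICE candidates belonging to a now-gone relay.
    InvalidateCandidates { relay_id: u64 },
    /// Make a TURN allocation on a newly announced relay, which in turn
    /// signals fresh candidates to the remote party.
    Allocate { relay_id: u64, addr: String },
}

fn handle_relays_presence(msg: RelaysPresence) -> Vec<Action> {
    let mut actions = Vec::new();
    for id in msg.disconnected_ids {
        actions.push(Action::InvalidateCandidates { relay_id: id });
    }
    for (id, addr) in msg.connected {
        actions.push(Action::Allocate { relay_id: id, addr });
    }
    actions
}
```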