
feat(connlib): introduce Session::reconnect #4116

Merged: 8 commits into main from feat/connlib/reconnect-command, Mar 14, 2024

Conversation

@thomaseizinger (Member) commented Mar 13, 2024

I ended up calling it reconnect because that is really what we are doing:

  • We reconnect to the portal.
  • We "reconnect" to all relays, i.e. refresh the allocations.

I decided not to use an ICE restart. An ICE restart clears the local as well as the remote credentials, meaning we would need to run another instance of the signalling protocol. The current control plane does not support this, and it is also unnecessary in our situation. In the case of an actual network change (e.g. WiFi to cellular), refreshing the allocations will turn up new candidates, as that is how we discovered the original ones in the first place. Because we always operate in ICE trickle mode, those candidates are sent to the remote via the control plane and we start testing them.

As those new paths become available, str0m will automatically nominate them in case the current one runs into an ICE timeout. Here is a screen recording of the Linux CLI client where Session::reconnect is triggered via the SIGHUP signal:

Screencast.from.2024-03-14.11-16-47.webm

Provides the infrastructure for: #4028.
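For illustration, the SIGHUP wiring could look roughly like this on the client side (a minimal sketch assuming a tokio runtime; `forward_sighup` is a hypothetical helper and the closure would call the new `Session::reconnect`, whose exact signature is not shown here):

```rust
use tokio::signal::unix::{signal, SignalKind};

/// Forwards every SIGHUP to connlib's reconnect, e.g. `|| session.reconnect()`.
async fn forward_sighup(reconnect: impl Fn()) -> std::io::Result<()> {
    let mut sighup = signal(SignalKind::hangup())?;

    // Each reconnect re-connects to the portal and refreshes the relay
    // allocations; new candidates are trickled to the remote as they appear.
    while sighup.recv().await.is_some() {
        reconnect();
    }

    Ok(())
}
```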


@github-actions (bot) commented Mar 13, 2024

Terraform Cloud Plan Output

Plan: 8 to add, 7 to change, 8 to destroy.

Terraform Cloud Plan

@github-actions (bot) commented Mar 13, 2024

Performance Test Results

TCP

| Test Name | Received/s | Sent/s | Retransmits |
|---|---|---|---|
| direct-tcp-client2server | 204.3 MiB (+2%) | 204.8 MiB (+2%) | 247 (+42%) |
| direct-tcp-server2client | 198.9 MiB (-2%) | 199.9 MiB (-2%) | 605 (-14%) |
| relayed-tcp-client2server | 138.5 MiB (-2%) | 139.2 MiB (-2%) | 126 (-32%) |
| relayed-tcp-server2client | 138.1 MiB (-1%) | 138.5 MiB (-1%) | 181 (+21%) |

UDP

| Test Name | Total/s | Jitter | Lost |
|---|---|---|---|
| direct-udp-client2server | 50.0 MiB (-0%) | 0.05ms (-20%) | 0.00% (NaN%) |
| direct-udp-server2client | 50.0 MiB (-0%) | 0.03ms (+29%) | 0.00% (NaN%) |
| relayed-udp-client2server | 50.0 MiB (+0%) | 0.11ms (-31%) | 0.00% (NaN%) |
| relayed-udp-server2client | 50.0 MiB (+0%) | 0.06ms (-24%) | 0.00% (NaN%) |

@thomaseizinger thomaseizinger changed the base branch from refactor/connlib/commands to refactor/connlib/no-runtime March 13, 2024 03:33
Base automatically changed from refactor/connlib/no-runtime to main March 14, 2024 00:17
@thomaseizinger thomaseizinger marked this pull request as ready for review March 14, 2024 00:21
@thomaseizinger thomaseizinger requested review from ReactorScram, conectado and jamilbk and removed request for ReactorScram and conectado March 14, 2024 00:21
@ReactorScram (Collaborator) commented

Will this be okay if it's called multiple times quickly? The network change detection on Windows is "bouncy". I could probably debounce it with a timer if needed, e.g. when we get an event, wait until 1 full second of no events and then tell connlib to refresh / reconnect.

@thomaseizinger (Member, Author) commented

> Will this be okay if it's called multiple times quickly? The network change detection on Windows is "bouncy". I could probably debounce it with a timer if needed, e.g. when we get an event, wait until 1 full second of no events and then tell connlib to refresh / reconnect.

It should be okay, see the screen-recording above. We will reconnect to the portal each time it is called but that does not affect the data plane.

I would like it to be okay, i.e. if anything, connlib should debounce, not the apps.
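For illustration, the kind of debounce ReactorScram describes could look roughly like this (a sketch assuming a tokio runtime and a hypothetical channel of network-change events; not code from this PR):

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{sleep_until, Instant};

/// Waits for 1 full second of silence on `events` before firing `reconnect` once.
async fn debounce_reconnects(mut events: mpsc::Receiver<()>, reconnect: impl Fn()) {
    const QUIET_PERIOD: Duration = Duration::from_secs(1);

    while events.recv().await.is_some() {
        let mut deadline = Instant::now() + QUIET_PERIOD;

        loop {
            tokio::select! {
                // Another event arrived before the quiet period elapsed: push the deadline out.
                Some(()) = events.recv() => deadline = Instant::now() + QUIET_PERIOD,
                // One full second of no events: trigger a single reconnect.
                () = sleep_until(deadline) => break,
            }
        }

        reconnect();
    }
}
```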

@ReactorScram (Collaborator) left a comment

LGTM. Should I also call this when the DNS servers change? And after that next DNS refactor is ready, will that be replaced with a more specific "Update DNS only" command?

rust/connlib/snownet/src/node.rs (review thread resolved)
```rust
let backoff = self
    .backoff
    .next_backoff()
    .expect("to have backoff right after resetting");
```
Collaborator:

Suggested change:

```diff
-    .expect("to have backoff right after resetting");
+    .expect("should have backoff Instant right after resetting");
```

@thomaseizinger (Member, Author) commented

> LGTM. Should I also call this when the DNS servers change? And after that next DNS refactor is ready, will that be replaced with a more specific "Update DNS only" command?

We will have a dedicated function to update DNS servers.

With the new `reconnect` command, clients can initiate this directly so we don't need to change the backoff.
```diff
@@ -64,6 +65,12 @@ where
         loop {
             match self.rx.poll_recv(cx) {
                 Poll::Ready(Some(Command::Stop)) | Poll::Ready(None) => return Poll::Ready(Ok(())),
                 Poll::Ready(Some(Command::Reconnect)) => {
```
Collaborator:

I wonder if using a channel for this is the most convenient way to go about it.

We might want a different mechanism so that multiple reconnects aren't queued up. I was thinking we could use a Notify; that way we don't need to worry about the bounded channel, and there's no point in doing multiple reconnects in a row anyway, we only want to react to the latest one.

Collaborator:

Yeah, if they're guaranteed idempotent, that would keep the channel from filling up. We did a similar thing for on_update_resources in the Tauri Client:

`self.notify_controller.notify_one();`

It was possible, if I only allowed 5 items in the channel and connlib rapidly sent on_update_resources events, that the channel might fill up and error (since it's not allowed to block the callbacks) before the GUI got around to dealing with them.

So the channel was replaced with a Notify plus something the reader can poll when it's notified, same as if it were a channel that dropped all but the most recent event.

I think we also considered tokio's watch and it wasn't a perfect fit. Notify has worked well.
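To make that pattern concrete, it looks roughly like this (a sketch with hypothetical names and a simplified resource type; not the actual Tauri client code):

```rust
use std::sync::{Arc, Mutex};
use tokio::sync::Notify;

/// Shared state the callbacks overwrite; the reader only ever sees the latest value.
struct Shared {
    latest_resources: Mutex<Vec<String>>,
    notify: Notify,
}

impl Shared {
    /// Called from connlib's callback: overwrite the value and wake the reader.
    /// Coalesces naturally: many rapid updates leave at most one pending wakeup.
    fn on_update_resources(&self, resources: Vec<String>) {
        *self.latest_resources.lock().unwrap() = resources;
        self.notify.notify_one();
    }
}

/// GUI side: wake up when notified and read whatever is current, as if it were
/// a channel that dropped all but the most recent event.
async fn controller_loop(shared: Arc<Shared>) {
    loop {
        shared.notify.notified().await;
        let resources = shared.latest_resources.lock().unwrap().clone();
        // ... hand `resources` to the UI ...
        let _ = resources;
    }
}
```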

Member (Author):

Unfortunately, Notify doesn't have a poll API so it would be a bit clunky to use. If debouncing is what we want, then I can add a small delay to the sending of the command through the channel and cancel the current send if we get another one.

Collaborator:

Huh, yeah it doesn't. And it can't be replicated with AtomicWaker?
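For reference, a poll-able equivalent can be put together from futures' AtomicWaker and an AtomicBool; a rough sketch (not part of this PR):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::task::{Context, Poll};

use futures::task::AtomicWaker;

/// A poll-able, coalescing "reconnect requested" flag: many `notify()` calls
/// collapse into a single wakeup of the poller.
struct PollNotify {
    pending: AtomicBool,
    waker: AtomicWaker,
}

impl PollNotify {
    fn new() -> Self {
        Self {
            pending: AtomicBool::new(false),
            waker: AtomicWaker::new(),
        }
    }

    fn notify(&self) {
        self.pending.store(true, Ordering::SeqCst);
        self.waker.wake();
    }

    /// Returns `Ready` once per batch of notifications, otherwise registers the waker.
    fn poll_notified(&self, cx: &mut Context<'_>) -> Poll<()> {
        // Register before checking the flag so a notification racing with the
        // check still wakes us up.
        self.waker.register(cx.waker());

        if self.pending.swap(false, Ordering::SeqCst) {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}
```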

Collaborator:

Hm, a size-1 channel would also achieve the same effect as Notify.

Collaborator:

Hm, so it's a size-1 channel that uses try_send for both commands; if I somehow did a reconnect and a stop in the same tick, the stop would be silently ignored?

Member (Author):

Yes. The Stop isn't super critical though. If you drop Session, the Runtime gets dropped and with it, all tasks should be stopped.
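Concretely, the caveat being discussed (a sketch, not the actual connlib channel code):

```rust
use tokio::sync::mpsc;

enum Command {
    Reconnect,
    Stop,
}

fn same_tick_demo() {
    let (tx, _rx) = mpsc::channel::<Command>(1);

    // The first command fills the single slot.
    assert!(tx.try_send(Command::Reconnect).is_ok());

    // A second command before the receiver catches up is rejected with `Full`;
    // unless the caller handles the error, the `Stop` is silently dropped.
    assert!(tx.try_send(Command::Stop).is_err());
}
```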

Collaborator:

Even Stop doesn't really need a command so much as an "I want you to be running / not running" flag and a way to notify when it changes.

With channels we have to trade off between three problems: the sender may block, sends may fail silently, or sends may panic.

Member (Author):

Are you suggesting using shared memory instead of channels and just notifying when to re-read the shared memory?

Collaborator:

Kinda, an AtomicBool, so it's not like it has to lock.
But I wrote my last comment at the same time as you wrote "The Stop isn't super critical", so it might just be something to file as an issue and merge this PR anyway.

@conectado (Collaborator) left a comment

Looks good, just left some non-blocking comments.

rust/linux-client/src/main.rs (review thread resolved)
@thomaseizinger thomaseizinger added this pull request to the merge queue Mar 14, 2024
@thomaseizinger (Member, Author) commented

Merging for now, we can always act upon #4116 (comment) in a follow-up.

Merged via the queue into main with commit d092e22 Mar 14, 2024
134 checks passed
@thomaseizinger thomaseizinger deleted the feat/connlib/reconnect-command branch March 14, 2024 23:34