Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(connlib): drop all connections when roaming #5308

Merged
merged 10 commits into from
Jun 25, 2024

Conversation

thomaseizinger
Copy link
Member

@thomaseizinger thomaseizinger commented Jun 11, 2024

Currently, snownet tries to be very clever in how it roams connections. This is/was necessary because we associated DNS-specific state with a connection. More specifically, the assigned proxy IPs for a DNS resource are stored as part of a connection with the gateway.

As a result, DNS resources would always break if the underlying connection in snownet failed. This is quite error prone and means, snownet must be very careful to never-ever fail a connection erroneously. With #5049, we no longer store any important state with a connection and thus, can implement roaming in much simpler way: Drop all connections and let the incoming packets create new ones. This is much more robust as we don't have to "patch" existing state in snownet as part of roaming.

We test this new functionality by adding a RoamClient transition to tunnel_test. This ensures roaming works in a lot of scenarios, including relayed and non-relayed situations as well as roaming between either of them. As a result, we can delete several of the more specific test cases of snownet.

Depends-On: #5049.
Replaces: #5060.
Resolves: #5080.

Copy link

vercel bot commented Jun 11, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
firezone ⬜️ Ignored (Inspect) Visit Preview Jun 25, 2024 0:30am

@github-actions github-actions bot added the kind/refactor Code refactoring label Jun 11, 2024
Copy link

github-actions bot commented Jun 11, 2024

Terraform Cloud Plan Output

Plan: 15 to add, 23 to change, 2 to destroy.

Terraform Cloud Plan

@thomaseizinger
Copy link
Member Author

This already works for CIDR resources but for DNS resources, we still need to wait on #5049.

Copy link

github-actions bot commented Jun 11, 2024

Performance Test Results

TCP

Test Name Received/s Sent/s Retransmits
direct-tcp-client2server 233.8 MiB (+2%) 235.6 MiB (+2%) 159 (-46%)
direct-tcp-server2client 239.1 MiB (+0%) 240.2 MiB (-0%) 222 (-30%)
relayed-tcp-client2server 241.5 MiB (+1%) 242.4 MiB (+0%) 351 (-15%)
relayed-tcp-server2client 245.2 MiB (+2%) 246.3 MiB (+2%) 722 (+1%)

UDP

Test Name Total/s Jitter Lost
direct-udp-client2server 500.0 MiB (+0%) 0.03ms (-57%) 39.68% (-7%)
direct-udp-server2client 500.0 MiB (-0%) 0.01ms (+34%) 22.37% (-6%)
relayed-udp-client2server 500.0 MiB (+0%) 0.04ms (-63%) 53.37% (-3%)
relayed-udp-server2client 500.0 MiB (+0%) 0.02ms (+15%) 37.37% (+2%)

@thomaseizinger thomaseizinger force-pushed the refactor/connlib/5080-drop-connections branch from 0625115 to 9b68bb1 Compare June 13, 2024 02:28
@thomaseizinger thomaseizinger force-pushed the refactor/connlib/5080-drop-connections branch from 9b68bb1 to 1b1fc6e Compare June 17, 2024 01:00
@thomaseizinger thomaseizinger changed the base branch from main to refactor/dns-translation-on-gateway June 17, 2024 01:00
@thomaseizinger thomaseizinger force-pushed the refactor/connlib/5080-drop-connections branch 4 times, most recently from 1fa1523 to 22ba907 Compare June 17, 2024 11:57
@thomaseizinger thomaseizinger changed the title refactor(connlib): drop all connections when roaming refactor(connlib): re-establish connections when roaming Jun 17, 2024
@thomaseizinger thomaseizinger force-pushed the refactor/connlib/5080-drop-connections branch from 22ba907 to 8607b64 Compare June 17, 2024 12:20
@thomaseizinger thomaseizinger changed the title refactor(connlib): re-establish connections when roaming refactor(connlib): drop all connections when roaming Jun 17, 2024
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to slightly adopt this download test: curl appears to not send any packets as part of downloading and thus once interrupted due to roaming, the connection does not get re-established. I added --retry 1 and --continue-at - to retry the download on failure. Previously, the file was just appended via stdout which failed the hash-check after the retry. Now we use curl's --output which means we also need to calculate the hash in the container.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a user-facing documentation of this behavior.

We want users to know that downloads won't resume after roaming automatically, I think this is okay and expected for TCP connections, but it's still good to document.

@thomaseizinger thomaseizinger force-pushed the refactor/dns-translation-on-gateway branch from cba6435 to 56eeaeb Compare June 19, 2024 23:41
@thomaseizinger thomaseizinger force-pushed the refactor/connlib/5080-drop-connections branch from 780eb8c to 66768c1 Compare June 20, 2024 00:20
@thomaseizinger thomaseizinger force-pushed the refactor/dns-translation-on-gateway branch from 56eeaeb to 98f567a Compare June 20, 2024 01:30
@thomaseizinger thomaseizinger force-pushed the refactor/connlib/5080-drop-connections branch from 66768c1 to ad5034d Compare June 20, 2024 03:52
@thomaseizinger thomaseizinger force-pushed the refactor/dns-translation-on-gateway branch from 98f567a to fe6cced Compare June 20, 2024 05:40
@thomaseizinger thomaseizinger force-pushed the refactor/connlib/5080-drop-connections branch from ad5034d to c57e07e Compare June 20, 2024 05:46
@thomaseizinger thomaseizinger force-pushed the refactor/dns-translation-on-gateway branch from fe6cced to f3e770f Compare June 21, 2024 04:09
@thomaseizinger thomaseizinger force-pushed the refactor/connlib/5080-drop-connections branch from c57e07e to 18231be Compare June 21, 2024 04:10
@thomaseizinger thomaseizinger force-pushed the refactor/dns-translation-on-gateway branch from f3e770f to 5e2a885 Compare June 24, 2024 00:14
@thomaseizinger thomaseizinger force-pushed the refactor/connlib/5080-drop-connections branch 2 times, most recently from 59eb3dd to cc74a8b Compare June 24, 2024 01:15
Base automatically changed from refactor/dns-translation-on-gateway to main June 24, 2024 23:57
@thomaseizinger thomaseizinger marked this pull request as ready for review June 25, 2024 00:30
@thomaseizinger thomaseizinger force-pushed the refactor/connlib/5080-drop-connections branch from cc74a8b to 412db5b Compare June 25, 2024 00:30
Copy link
Collaborator

@conectado conectado left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Comment on lines +822 to +824
for id in ids {
self.cleanup_connected_gateway(&id);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a debug_assert! here that all peers are empty would be nice I think

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a user-facing documentation of this behavior.

We want users to know that downloads won't resume after roaming automatically, I think this is okay and expected for TCP connections, but it's still good to document.

@thomaseizinger thomaseizinger added this pull request to the merge queue Jun 25, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 25, 2024
@thomaseizinger thomaseizinger added this pull request to the merge queue Jun 25, 2024
Merged via the queue into main with commit eec615e Jun 25, 2024
156 checks passed
@thomaseizinger thomaseizinger deleted the refactor/connlib/5080-drop-connections branch June 25, 2024 04:07
github-merge-queue bot pushed a commit that referenced this pull request Jun 27, 2024
In order to handle DNS resources, connlib intercepts all DNS requests on
the system once it has started up. The DNS queries are then forwarded to
the original DNS resolver in case the query isn't for one of the
configured DNS resources _except_ if the configured DNS resovler is also
a CIDR resource.

In that case, the DNS query will be tunneled to a gateway and forwarded
to the DNS resolver from there.

Exactly this configuration results in a dead-lock when roaming networks.
To make roaming more reliable, we now drop all connections when
detecting a network change (see #5308). As a result, DNS queries cannot
be tunneled right away. This isn't usually a problem: We just send a
connection intent to the portal to connect to the gateway. Upon a
network change, we also reconnect the websocket to the portal which also
requires to resolve the domain name. Connlib's DNS resolver is still
active at the point and thus, we end up deadlocking ourselves because
the DNS query to resolve the portal's domain is waiting for a connection
to a gateway that can only be established once we are connected to the
portal.

To prevent this, we extend connlib with a "known hosts" feature. These
are DNS records that are defined statically for the lifetime of a
connlib session and can thus always be resolved, regardless of the
connection state with the portal or the gateways. We populate these
records with the portal's API, allowing the reconnect to work without
having connected gateways.

---------

Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/refactor Code refactoring
Projects
None yet
Development

Successfully merging this pull request may close these issues.

connlib: implement reconnect as "drop all connections and wait for new packets to trigger new ones"
2 participants