refactor(connlib): drop all connections when roaming #5308
This already works for CIDR resources, but for DNS resources we still need to wait on #5049.
Had to slightly adapt this download test: `curl` appears to not send any packets as part of downloading and thus, once interrupted due to roaming, the connection does not get re-established. I added `--retry 1` and `--continue-at -` to retry the download on failure. Previously, the file was just appended via stdout, which failed the hash check after the retry. Now we use `curl`'s `--output`, which means we also need to calculate the hash in the container.
We should have user-facing documentation of this behavior.
We want users to know that downloads won't resume automatically after roaming. I think this is okay and expected for TCP connections, but it's still good to document.
Nice!
for id in ids {
    self.cleanup_connected_gateway(&id);
}
A `debug_assert!` here that all peers are empty would be nice, I think.
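As a rough illustration of that suggestion, here is a minimal sketch of a cleanup loop followed by a `debug_assert!`. Only `cleanup_connected_gateway` and the loop itself come from the diff above; `ClientState`, `GatewayId`, the `peers` map and the `reset` method are hypothetical stand-ins, not connlib's actual types.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for connlib's gateway identifier.
type GatewayId = u32;

/// Hypothetical, heavily simplified client state (not connlib's real struct).
#[derive(Default)]
struct ClientState {
    /// Peers keyed by the gateway they are reached through.
    peers: HashMap<GatewayId, String>,
}

impl ClientState {
    /// Removes all state associated with a single gateway.
    fn cleanup_connected_gateway(&mut self, id: &GatewayId) {
        self.peers.remove(id);
    }

    /// Drops every gateway connection, e.g. after a network change.
    fn reset(&mut self) {
        // Collect the ids first so the map is not mutated while iterating it.
        let ids: Vec<GatewayId> = self.peers.keys().copied().collect();

        for id in ids {
            self.cleanup_connected_gateway(&id);
        }

        // Checked in debug builds only: states the invariant the loop must uphold.
        debug_assert!(
            self.peers.is_empty(),
            "all peers should be gone after a reset"
        );
    }
}
```

Because `debug_assert!` is compiled out of release builds by default, a check like this costs nothing in production while still catching a forgotten cleanup path in tests.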
In order to handle DNS resources, connlib intercepts all DNS requests on the system once it has started up. If a query isn't for one of the configured DNS resources, it is forwarded to the original DNS resolver, _except_ if the configured DNS resolver is also a CIDR resource. In that case, the DNS query is tunneled to a gateway and forwarded to the DNS resolver from there.

Exactly this configuration results in a deadlock when roaming between networks. To make roaming more reliable, we now drop all connections when detecting a network change (see #5308). As a result, DNS queries cannot be tunneled right away. This isn't usually a problem: we just send a connection intent to the portal to connect to the gateway. However, upon a network change, we also reconnect the websocket to the portal, which in turn requires resolving the portal's domain name. Connlib's DNS resolver is still active at that point, so we end up deadlocking ourselves: the DNS query to resolve the portal's domain is waiting for a connection to a gateway that can only be established once we are connected to the portal.

To prevent this, we extend connlib with a "known hosts" feature. These are DNS records that are defined statically for the lifetime of a connlib session and can thus always be resolved, regardless of the connection state with the portal or the gateways. We populate these records with the portal's API, allowing the reconnect to work without having connected gateways.

---------

Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
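The lookup order described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not connlib's implementation: `StubResolver`, `Resolution`, and both field names are hypothetical, and the handling of DNS resources is left out entirely.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// What to do with an intercepted DNS query (hypothetical, simplified).
#[derive(Debug, PartialEq)]
enum Resolution {
    /// Answered locally from the static "known hosts" records.
    Known(Vec<IpAddr>),
    /// Must be tunneled to a gateway because the upstream resolver is a CIDR resource.
    Tunnel,
    /// Forwarded directly to the original upstream resolver.
    Forward,
}

/// Hypothetical stand-in for the part of connlib that intercepts DNS queries.
struct StubResolver {
    /// Static records that stay fixed for the lifetime of a session.
    known_hosts: HashMap<String, Vec<IpAddr>>,
    /// Whether the configured upstream resolver is itself a CIDR resource.
    upstream_is_cidr_resource: bool,
}

impl StubResolver {
    fn resolve(&self, domain: &str) -> Resolution {
        // Known hosts are checked first, so they never depend on a gateway connection.
        if let Some(ips) = self.known_hosts.get(domain) {
            return Resolution::Known(ips.clone());
        }

        // Without a static record, a CIDR-resource resolver forces the query through the tunnel.
        if self.upstream_is_cidr_resource {
            return Resolution::Tunnel;
        }

        Resolution::Forward
    }
}
```

With the portal's domain registered in `known_hosts`, reconnecting the websocket no longer waits on a gateway connection, which is what breaks the cycle described above. Every other domain still takes the tunneled or forwarded path.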
Currently, `snownet` tries to be very clever in how it roams connections. This is/was necessary because we associated DNS-specific state with a connection. More specifically, the assigned proxy IPs for a DNS resource are stored as part of a connection with the gateway. As a result, DNS resources would always break if the underlying connection in `snownet` failed. This is quite error-prone and means `snownet` must be very careful to never fail a connection erroneously.

With #5049, we no longer store any important state with a connection and can thus implement roaming in a much simpler way: drop all connections and let the incoming packets create new ones. This is much more robust as we don't have to "patch" existing state in `snownet` as part of roaming.

We test this new functionality by adding a `RoamClient` transition to `tunnel_test`. This ensures roaming works in a lot of scenarios, including relayed and non-relayed situations as well as roaming between either of them. As a result, we can delete several of the more specific test cases of `snownet`.

Depends-On: #5049.
Replaces: #5060.
Resolves: #5080.
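
The "drop everything and let packets recreate connections" approach might look roughly like the sketch below. All names here (`Node`, `Connection`, `on_network_change`, `handle_packet`, `mark_connected`, `intents_sent`) are hypothetical, and keying connections by destination IP is a simplification for illustration; the real `snownet` and connlib types differ.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Hypothetical connection state; the real types carry far more information.
#[derive(Debug, PartialEq)]
enum Connection {
    Connecting,
    Connected,
}

/// Hypothetical, simplified node that owns all connections.
#[derive(Default)]
struct Node {
    connections: HashMap<IpAddr, Connection>,
    /// Counts connection intents, standing in for messages sent to the portal.
    intents_sent: usize,
}

impl Node {
    /// On a network change, forget every connection instead of patching existing state.
    fn on_network_change(&mut self) {
        self.connections.clear();
    }

    /// Handles an outgoing packet. Returns `true` if it can be sent right away.
    fn handle_packet(&mut self, dst: IpAddr) -> bool {
        match self.connections.get(&dst) {
            Some(Connection::Connected) => true,
            // An intent is already in flight; nothing more to do for now.
            Some(Connection::Connecting) => false,
            None => {
                // No connection yet: the packet itself triggers a new one.
                self.connections.insert(dst, Connection::Connecting);
                self.intents_sent += 1;
                false
            }
        }
    }

    /// Marks a connection as established (assumed to be driven by signalling elsewhere).
    fn mark_connected(&mut self, dst: IpAddr) {
        self.connections.insert(dst, Connection::Connected);
    }
}
```

The design point is that roaming needs no dedicated code path: after `on_network_change`, the next packet simply takes the same "no connection yet" branch as the very first packet did.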