feat(connlib): implement idempotent control protocol for gateway#6941

Merged
thomaseizinger merged 4 commits into main from feat/control-protocol-gateway
Oct 18, 2024

@thomaseizinger thomaseizinger commented Oct 4, 2024

This PR implements the new idempotent control protocol for the gateway. We retain backwards-compatibility with old clients to allow admins to perform a disruption-free update to the latest version.

With this new control protocol, we are moving the responsibility of exchanging the proxy IPs we assigned to DNS resources to a p2p protocol between client and gateway. As a result, wildcard DNS resources only get authorized on the first access. Accessing a new domain within the same resource will thus no longer require a roundtrip to the portal.
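To illustrate why a second domain under the same wildcard resource no longer needs a portal roundtrip, here is a minimal sketch. It is not connlib's actual implementation: the `AuthorizedResources` type, its methods, and the simple `*.` suffix-matching rule are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical per-client record of resource patterns the portal has already authorized.
#[derive(Default)]
struct AuthorizedResources {
    patterns: HashSet<String>, // e.g. "*.example.com"
}

impl AuthorizedResources {
    /// Called once, after the portal authorized the first access to the resource.
    fn authorize(&mut self, pattern: &str) {
        self.patterns.insert(pattern.to_owned());
    }

    /// Returns true if `domain` falls under an already-authorized pattern,
    /// i.e. no further portal roundtrip would be needed in this sketch.
    fn covers(&self, domain: &str) -> bool {
        self.patterns.iter().any(|p| matches_pattern(p, domain))
    }
}

/// Assumed matching rule: "*.example.com" matches any subdomain of example.com.
fn matches_pattern(pattern: &str, domain: &str) -> bool {
    match pattern.strip_prefix("*.") {
        Some(suffix) => domain
            .strip_suffix(suffix)
            .is_some_and(|rest| rest.ends_with('.')),
        None => pattern == domain,
    }
}
```

In this sketch, once `authorize("*.example.com")` has run, `covers("app.example.com")` and `covers("db.example.com")` both return true without any further authorization step.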

Overall, users will see a greatly decreased connection setup latency. On top of that, the new protocol will allow us to more easily implement packet buffering which will be another UX boost for Firezone.

@thomaseizinger thomaseizinger self-assigned this Oct 4, 2024
@thomaseizinger thomaseizinger force-pushed the feat/control-protocol-gateway branch from 908c7b7 to a04d4bc on October 8, 2024 05:49
@thomaseizinger thomaseizinger force-pushed the feat/control-protocol-gateway branch from a04d4bc to 5cebba8 on October 8, 2024 06:08
@thomaseizinger thomaseizinger force-pushed the feat/control-protocol-gateway branch from d211416 to cbab854 on October 17, 2024 01:07
@thomaseizinger thomaseizinger force-pushed the feat/control-protocol-gateway branch from cbab854 to f313946 on October 17, 2024 05:54
@thomaseizinger thomaseizinger force-pushed the feat/control-protocol-gateway branch from f313946 to d5bd2c4 on October 17, 2024 23:30

def psk do
random_token(@wg_psk_length, encoder: :base64)
|> String.slice(0, @wg_psk_length)
@thomaseizinger (Member, Author) commented:

@AndrewDryga I had to remove this otherwise the string isn't valid base64 and we can't decode it. If we are receiving this as a string, the encoding needs to be well-defined.
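A small, self-contained sketch of the general problem with truncating a base64 string: depending on where the cut lands, the result either cannot be decoded at all or decodes to fewer bytes than the original key. The `encode` and `decode` helpers below are simplified stand-ins written for this illustration, not the portal's or connlib's code, and the 32-byte key length is an assumption, since the value of `@wg_psk_length` is not shown in this conversation.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 encoding with `=` padding.
fn encode(bytes: &[u8]) -> String {
    let mut out = String::new();
    for chunk in bytes.chunks(3) {
        let b = [chunk[0], *chunk.get(1).unwrap_or(&0), *chunk.get(2).unwrap_or(&0)];
        let n = (u32::from(b[0]) << 16) | (u32::from(b[1]) << 8) | u32::from(b[2]);
        for i in 0..4 {
            if i <= chunk.len() {
                out.push(ALPHABET[((n >> (18 - 6 * i)) & 0x3f) as usize] as char);
            } else {
                out.push('='); // pad the final, incomplete group
            }
        }
    }
    out
}

/// Strict (simplified) decoder: rejects input that is not a whole number of 4-char groups.
fn decode(s: &str) -> Option<Vec<u8>> {
    let bytes = s.as_bytes();
    if bytes.len() % 4 != 0 {
        return None;
    }
    let mut out = Vec::new();
    for group in bytes.chunks(4) {
        let pad = group.iter().rev().take_while(|&&c| c == b'=').count();
        if pad > 2 {
            return None;
        }
        let mut n = 0u32;
        for &c in &group[..4 - pad] {
            let v = ALPHABET.iter().position(|&a| a == c)? as u32;
            n = (n << 6) | v;
        }
        n <<= 6 * pad as u32;
        // The low 3 bytes of `n` hold the group; drop one byte per padding char.
        out.extend_from_slice(&n.to_be_bytes()[1..4 - pad]);
    }
    Some(out)
}
```

With an assumed 32-byte key, the full encoding is 44 characters and round-trips. Cutting it to 30 characters makes `decode` return `None`, and cutting it to 32 characters decodes successfully but yields only 24 bytes.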

@thomaseizinger thomaseizinger force-pushed the feat/control-protocol-gateway branch from 926b84c to 9a61f39 on October 18, 2024 00:24
@thomaseizinger thomaseizinger marked this pull request as ready for review October 18, 2024 00:43
@thomaseizinger (Member, Author) commented:

CI is green! Ready for review @conectado !

@conectado conectado left a comment

Looks great!

};

if !peer.is_allowed(req.resource) {
tracing::debug!(cid = %peer.id(), resource = %req.resource, "Received `AssignedIpsEvent` for resource that is not allowed");
@conectado (Contributor) commented:

This shouldn't happen during normal operation, right? The client would wait until it receives the OK response for the resource before sending this assigned IPs event. So I think we should bump the level of this log, since it means something weird is going on?

return Some(packet);
}

// TODO: Should we throttle concurrent events for the same domain?
@conectado (Contributor) commented:

When we address this TODO, should we just skip processing an event for a domain if another one is already being processed in the upper layer?
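One way the suggestion above could look, as a rough sketch only: track which domains currently have a resolution in flight and drop further events for them until the first one finishes. The `InFlight` type and its methods are hypothetical and do not exist in connlib.

```rust
use std::collections::HashSet;

/// Hypothetical tracker of domains whose event is currently being processed.
#[derive(Default)]
struct InFlight {
    domains: HashSet<String>,
}

impl InFlight {
    /// Returns true if the caller should process the event for `domain`,
    /// false if another event for the same domain is already in flight.
    fn try_begin(&mut self, domain: &str) -> bool {
        // `HashSet::insert` returns false when the value was already present.
        self.domains.insert(domain.to_owned())
    }

    /// Marks processing for `domain` as finished so later events are accepted again.
    fn finish(&mut self, domain: &str) {
        self.domains.remove(domain);
    }
}
```

A caller would invoke `try_begin` when an event arrives and `finish` once the upper layer reports the result; events arriving in between are simply ignored in this sketch.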

@thomaseizinger thomaseizinger added this pull request to the merge queue Oct 18, 2024
Merged via the queue into main with commit ce1e59c Oct 18, 2024
@thomaseizinger thomaseizinger deleted the feat/control-protocol-gateway branch October 18, 2024 16:17
github-merge-queue bot pushed a commit that referenced this pull request Oct 21, 2024
This log should only ever happen if clients are buggy or someone is
using a custom client. Thus worth a `warn`.

Follow-up from #6941.
github-merge-queue bot pushed a commit that referenced this pull request Oct 29, 2024
In order to release #6941, we need to bump the gateway's version to
1.4.0. The portal has a version gate that only admits clients whose
version is >= 1.4.0. Thus, in order to test #6941 on staging,
the version must not yet be bumped and is thus split out into this PR.
github-merge-queue bot pushed a commit that referenced this pull request Dec 4, 2024
Building on top of the gateway PR (#6941), this PR transitions the
clients to the new control protocol. Clients are **not**
backwards-compatible with old gateways. As a result, a certain customer
environment MUST have at least one gateway with the above PR running in
order for clients to be able to establish connections.

With this transition, Clients send explicit events to Gateways whenever
they assign IPs to a DNS resource name. The actual assignment only
happens once and the IPs then remain stable for the duration of the
client session.

When the Gateway receives such an event, it will perform a DNS
resolution of the requested domain name and set up the NAT between the
assigned proxy IPs and the IPs the domain actually resolves to. In order
to support self-healing of any problems that happen during this process,
the client will send an "Assigned IPs" event every time it receives a
DNS query for a particular domain. This in turn will trigger another DNS
resolution on the Gateway. Effectively, this means that DNS queries for
DNS resources propagate to the Gateway, triggering a DNS resolution
there. In case the domain resolves to the same set of IPs, no state is
changed to ensure existing connections are not interrupted.
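The "no state is changed" behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Gateway's real code: `NatTable`, `Entry`, `Outcome`, and `on_assigned_ips` are hypothetical names, and the DNS resolution result is passed in as a plain slice.

```rust
use std::collections::{BTreeSet, HashMap};
use std::net::IpAddr;

#[derive(Debug, PartialEq)]
enum Outcome {
    Unchanged, // same proxy IPs and same resolved set: existing state is kept
    Updated,   // first event for this domain, or the inputs changed
}

/// Hypothetical NAT state for one domain.
struct Entry {
    proxy_ips: Vec<IpAddr>,
    resolved: BTreeSet<IpAddr>,
}

#[derive(Default)]
struct NatTable {
    entries: HashMap<String, Entry>,
}

impl NatTable {
    /// Handles a repeated "Assigned IPs" event idempotently.
    fn on_assigned_ips(
        &mut self,
        domain: &str,
        proxy_ips: &[IpAddr],
        resolved: &[IpAddr],
    ) -> Outcome {
        // Compare as a set so the order of DNS answers does not matter.
        let resolved: BTreeSet<IpAddr> = resolved.iter().copied().collect();
        if let Some(existing) = self.entries.get(domain) {
            if existing.proxy_ips.as_slice() == proxy_ips && existing.resolved == resolved {
                return Outcome::Unchanged;
            }
        }
        self.entries.insert(
            domain.to_owned(),
            Entry { proxy_ips: proxy_ips.to_vec(), resolved },
        );
        Outcome::Updated
    }
}
```

Because a repeated event with an identical resolution returns `Outcome::Unchanged` before touching the table, sending the event on every DNS query is safe in this sketch and cannot disturb established mappings.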

With this new functionality in place, we can delete the old logic around
detecting "expired" IPs. This is considered a bugfix as this logic isn't
currently working as intended. It has been observed multiple times that
the Gateway can loop on this behaviour and resolve the same domain
over and over again. The only theoretical "incompatibility" here is that
pre-1.4.0 clients won't have access to this functionality of triggering
DNS refreshes on a 1.4.2+ Gateway. However, as soon as this PR
merges, we expect all admins to have already upgraded to a 1.4.0+
Gateway anyway which already mandates clients to be on 1.4.0+.

Resolves: #7391.
Resolves: #6828.