fix(gateway): don't invalidate active NAT sessions#8937
Conversation
|
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
There was a problem hiding this comment.
Pull Request Overview
This PR addresses an issue with active NAT sessions being inadvertently invalidated when the Gateway recreates a DNS NAT resource. Key changes include:
- Adding a new function in NatTable to check for existing entries for an "inside" IP.
- Integrating the check into peer setup to skip NAT creation when active sessions exist.
- Updating tests to validate that new NAT setups do not clear established sessions.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| rust/connlib/tunnel/src/peer/nat_table.rs | Introduces the has_entry_for_inside function for querying existing NAT sessions. |
| rust/connlib/tunnel/src/peer.rs | Updates NAT setup logic to skip DNS resource updates if active NAT sessions exist, along with corresponding test modifications. |
| /// Returns true if the NAT table has any entries with the given "inside" IP address. | ||
| pub(crate) fn has_entry_for_inside(&self, ip: IpAddr) -> bool { | ||
| self.table.left_values().any(|(_, c)| c == &ip) | ||
| } |
There was a problem hiding this comment.
This is O(N) but it shouldn't happen very often because we only use it whenever a resolve a DNS query (and it is local to a particular connection of a client which usually doesn't have that many NAT sessions open).
jamilbk
left a comment
There was a problem hiding this comment.
LGTM - should alleviate some issues.
The NAT sessions have a TTL of 1 minute, meaning there needs to be at least 1 outgoing packet from the Client every minute to keep it open.
Ok so I hate to be the bearer of bad news, but this means we still have an issue in the TCP case where DNS records are changing.
Consider the fictional scenario of SSH'ing to github.com. We'll establish the connection successfully, but (as we saw with a previous pilot customer) TCP connections can be held open for up to 2 hours with no packets transmitted by either side. Additionally, the next packet that comes at the 2-hour mark can very well be from server -> client (as we saw with SSH in their environment).
So the fix here might have to be to hold NAT sessions open for up to 2 hours with bidirectional reset triggers (or sigh - inspect the packet for TCP keepalive). But we can save that for another PR.
Yeah, this TTL has been in place for 10 months though, ever since we shipped the reworked DNS resources that created the need for this NAT. |
Yeah low likelihood, but I'll open an issue just in case so we don't lose track. |
Whenever the Gateway is instructed to (re)create the NAT for a DNS resource, it performs a DNS query and then overwrite the existing entries in the NAT table. Depending on how the DNS records are defined, this may lead to a very bad user experience where connections are cut regularly.
In particular, if a service utilises round-robin DNS where a DNS query only ever returns a single entry yet that entry may change as soon as the TTL expires, all connections for this particular DNS resource for a Client get cut.
To fix this, we now first check for active NAT sessions for a given proxy IP and only replace it if we don't have an open NAT session. The NAT sessions have a TTL of 1 minute, meaning there needs to be at least 1 outgoing packet from the Client every minute to keep it open.