
feat(gateway): cache DNS queries for resources #8225

Merged
thomaseizinger merged 5 commits into main from chore/add-dns-cache
Feb 23, 2025

Conversation

@thomaseizinger
Member

With the addition of the Firezone Control Protocol, we are now issuing a lot more DNS queries on the Gateway. Specifically, every DNS query for a DNS resource name always triggers a DNS query on the Gateway. This ensures that changes to DNS entries for resources are picked up without having to build any sort of "stale detection" into the Gateway itself. As a result though, a Gateway has to issue a lot of DNS queries to upstream resolvers, which in 99% or more of cases will return the same result.

To reduce the load on these upstream resolvers, we cache successful DNS query results for 5 minutes.
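As a rough illustration of the approach (a minimal std-only Rust sketch, not the actual Gateway code; all names here are hypothetical), a fixed-TTL positive-result cache looks like:

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Fixed cache lifetime for successful lookups.
const CACHE_TTL: Duration = Duration::from_secs(5 * 60);

/// A minimal positive-result DNS cache: successful lookups are kept
/// for a fixed 5 minutes; errors are never cached.
struct DnsCache {
    entries: HashMap<String, (Vec<IpAddr>, Instant)>,
}

impl DnsCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Returns the cached addresses if the entry is still fresh.
    fn get(&self, name: &str, now: Instant) -> Option<&Vec<IpAddr>> {
        let (addrs, inserted_at) = self.entries.get(name)?;
        if now.duration_since(*inserted_at) < CACHE_TTL {
            Some(addrs)
        } else {
            None
        }
    }

    /// Stores a successful lookup result, stamping it with the current time.
    fn insert(&mut self, name: String, addrs: Vec<IpAddr>, now: Instant) {
        self.entries.insert(name, (addrs, now));
    }
}
```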


@sentry

sentry bot commented Feb 21, 2025

Sentry Issue: GATEWAY-A9

@thomaseizinger
Member Author

Sentry Issue: GATEWAY-A9

My hypothesis is that we are simply overloading libc or the upstream resolver here, which is why those queries fail. It happens a lot, so I think it is worth addressing. A cache can't hurt, and we can always see whether that issue re-appears after this ships.

@jamilbk
Member

jamilbk commented Feb 21, 2025

Why not cache for the record's TTL? Isn't that meant to hint to caching resolvers how long to cache for?

@thomaseizinger
Member Author

Why not cache for the record's TTL? Isn't that meant to hint to caching resolvers how long to cache for?

libc::getaddrinfo, which is what we use to resolve the hostname, doesn't expose that information.
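Rust's standard library resolution path illustrates the point: it wraps getaddrinfo on Unix and surfaces only socket addresses, with no record metadata such as TTLs (a small std-only sketch, not the Gateway's actual code):

```rust
use std::net::{SocketAddr, ToSocketAddrs};

/// Resolves a "host:port" string via the system resolver.
/// Under the hood this goes through libc::getaddrinfo; the returned
/// SocketAddr values carry no TTL or any other record metadata.
fn resolve(host_port: &str) -> std::io::Result<Vec<SocketAddr>> {
    host_port.to_socket_addrs().map(|iter| iter.collect())
}
```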

@jamilbk
Member

jamilbk commented Feb 21, 2025

Why not cache for the record's TTL? Isn't that meant to hint to caching resolvers how long to cache for?

libc::getaddrinfo, which is what we use to resolve the hostname, doesn't expose that information.

I have a sneaking suspicion this will come back to bite us. Consider an admin debugging why DNS resources don't work in their environment. They realize they need to add or remove records in their Route 53 config, or some other internal DNS server. They set the TTL to 60s (or even shorter), try again for a few minutes, and wonder why the resources still don't work.

If we implement this change, what do you think of making it shorter, like 1m? We should also probably document it in https://www.firezone.dev/kb/architecture/critical-sequences#dns-resolution and then make sure to link to that from /kb/deploy/gateways and /kb/deploy/dns.

@jamilbk
Member

jamilbk commented Feb 21, 2025

It should also probably match what we return for the dummy IPs in the client?

@thomaseizinger
Member Author

It should also probably match what we return for the dummy IPs in the client?

That wouldn't necessarily match, because by the time the query runs on the Client, the entry may already be halfway through its expiry.

@thomaseizinger
Member Author

thomaseizinger commented Feb 21, 2025

They set the TTL to 60s (or even shorter), try again for a few minutes and wonder why they still don't work.

Sure, I am not set on a particular TTL. I do think we need one to handle the bursts and N-to-1 nature of Clients to Gateways.

@jamilbk
Member

jamilbk commented Feb 21, 2025

They set the TTL to 60s (or even shorter), try again for a few minutes and wonder why they still don't work.

Sure, I am not set on a particular TTL. I do think we need one to handle the bursts and N-to-1 nature of Clients to Gateways.

Yeah I imagine even a short cache here will be immensely helpful, even 15s or something.

@thomaseizinger
Member Author

Well, 15s seems a bit extreme. In 99% of cases, these records won't change.

Can we just document the 5min and tell admins to either wait that out or restart the Gateway if they really need to flush them? If really required, we can also implement a runtime toggle to flush these caches.

@thomaseizinger
Member Author

@jamilbk Let me know what you think of the docs update.

@jamilbk
Member

jamilbk commented Feb 23, 2025

@jamilbk Let me know what you think of the docs update.

Hmm, I'm still a bit hesitant about the arbitrary 5m timeout. I think this is going to cause issues more often than we think. Agreed 99% of the time the records won't change, but in the 1% of the time, where the admin is likely setting up Firezone for the first time to determine whether it's a fit for their organization, it's likely they'll be tweaking their DNS configuration internally and testing the changes with their local client(s).

Case in point - consider the recent request for SRV records to be supported. The admin will likely hit this 5m timeout, and since the records on the clients themselves don't reflect any changes, they'll need a shell on the Gateway to debug what IP the name is actually resolving to. By the time they get a shell on the Gateway and test again, things might be working again, leaving the admin scratching their head.

Then when they discover (by digging through the architecture docs or, more likely, by asking support) that our "caching resolver" uses an arbitrary 5m timeout that does not respect the record's TTL, that might leave them scratching their head even further. An admin would likely expect any caching resolver in the path from client -> resource to respect the record's TTL.

At this point the admin might be reading the deploy/gateways doc, or more likely just skimming it, while trying to get things working. They're probably already a bit annoyed from troubleshooting, and this is likely to add to that.

The RFC mentions the TTL returned from an authoritative server should be considered an upper bound, so I believe we are violating the RFC with the 5m hardcoded timeout here.

Would it be too difficult just to respect the TTL received by the upstream resolver for this? Then we can just say, "the Gateway caches DNS responses according to the record's TTL" which is common for any caching DNS resolver.

Anyway, sorry for the long-winded response, but since DNS is such a notoriously problematic system as it is, I would hope to avoid adding to that.

@jamilbk
Member

jamilbk commented Feb 23, 2025

Also, this may break things on Kubernetes, for example, where service IPs change frequently, or with any service that uses HashiCorp's Consul to provide resolution.

Cursory research suggests many mature caching resolvers actually return you a "live" TTL that decrements from the moment it first received a reply from the authoritative server. So if the record's TTL is 3600, you may get something like 3000 when you query it (BIND).

(Note: a TTL of 0 means "disable caching", so we should aim to respect that too, so that we behave as expected in environments where the administrator explicitly wants to disable all DNS caching and has the resolver capacity to actually support that.)
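A TTL-respecting cache that treats TTL 0 as "do not cache" and reports a BIND-style decrementing remaining TTL could look like this std-only Rust sketch (all names here are hypothetical, not the Gateway's actual types):

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Caches answers for exactly the record's TTL.
struct TtlCache {
    /// Maps a name to its addresses and the absolute expiry instant.
    entries: HashMap<String, (Vec<IpAddr>, Instant)>,
}

impl TtlCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Stores a response for the record's TTL.
    /// A TTL of 0 means "do not cache", so the entry is skipped entirely.
    fn insert(&mut self, name: String, addrs: Vec<IpAddr>, ttl_secs: u32, now: Instant) {
        if ttl_secs == 0 {
            return;
        }
        let expires_at = now + Duration::from_secs(u64::from(ttl_secs));
        self.entries.insert(name, (addrs, expires_at));
    }

    /// Returns the addresses plus the *remaining* TTL, counting down
    /// from the moment the answer was cached (as BIND-style resolvers do).
    fn get(&self, name: &str, now: Instant) -> Option<(&Vec<IpAddr>, Duration)> {
        let (addrs, expires_at) = self.entries.get(name)?;
        // None once the expiry instant has passed, i.e. the entry is stale.
        let remaining = expires_at.checked_duration_since(now)?;
        Some((addrs, remaining))
    }
}
```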

@jamilbk
Member

jamilbk commented Feb 23, 2025

If we are going with a hard-coded TTL then it might be better to pick something likely under the minimum allowed TTL on popular registrars, like 30s. With the caveat this may break k8s services occasionally.

Though Azure, Cloudflare and many others allow 1s to support "fast-changing environments".

@thomaseizinger
Member Author

Case in point - consider the recent request for SRV records to be supported. The admin will likely hit this 5m timeout, and since the records on the clients themselves don't reflect any changes, they'll need a shell on the Gateway to debug what the actual IP is resolving to. By the time they get a shell on the Gateway and test again, things might be working again, leaving the admin scratching their heads.

With the current intended design, such SRV records would not be intercepted on the Client but directly forwarded to the DNS server on the Gateway and thus bypass this cache.

@thomaseizinger
Member Author

thomaseizinger commented Feb 23, 2025

@jamilbk Let me know what you think of the docs update.

Hmm, I'm still a bit hesitant about the arbitrary 5m timeout. I think this is going to cause issues more often than we think. Agreed 99% of the time the records won't change, but in the 1% of the time, where the admin is likely setting up Firezone for the first time to determine whether it's a fit for their organization, it's likely they'll be tweaking their DNS configuration internally and testing the changes with their local client(s).

This is an interesting point. I wouldn't have expected that admins make changes to their DNS configuration while they set up Firezone. My assumption was that organisations have certain DNS records already in place and when adopting Firezone, they mostly configure those in the portal to ensure they get routed but don't necessarily update their DNS records.

We could add a command-line flag to disable or configure the cache for this setup period?

Would it be too difficult just to respect the TTL received by the upstream resolver for this? Then we can just say, "the Gateway caches DNS responses according to the record's TTL" which is common for any caching DNS resolver.

Unfortunately, libc doesn't expose this information at all. We'd have to refactor the Gateway to have its own DNS client instead of relying on libc in order to learn the actual TTL of the record. We can do that with hickory-resolver which we can also initialise based on the system's /etc/resolv.conf config. Shouldn't be too complicated.

If we are going with a hard-coded TTL then it might be better to pick something likely under the minimum allowed TTL on popular registrars, like 30s. With the caveat this may break k8s services occasionally.

Yes, we can start with a more conservative TTL; it is better than what we currently have.

@jamilbk
Member

jamilbk commented Feb 23, 2025

it is better than what we currently have.

Well yes for the vast majority of cases. There is this edge case now though:

  1. company internal app is (re)deployed to k8s
  2. k8s dns records are updated with new container IPs, ttl is very low or even 0 to disable cache (dns is intra-cluster after all)
  3. healthcheck on containers is green, service is advertised to cluster load balancers
  4. there's now some level of downtime (up to 30s) where the service is down to end users or other services using the Firezone Gateway

I agree this is probably rare. We were bitten by stale DNS cache bugs relating to k8s service deployments on the old project @bmanifold worked on at Cisco so I may be a bit biased here though.

@jamilbk
Member

jamilbk commented Feb 23, 2025

What might be a good idea here is to evict the cache on a negative lookup to mitigate the above.

Long term we probably just need proper (predictable) cache behavior.

@jamilbk
Member

jamilbk commented Feb 23, 2025

I wouldn't have expected that admins make changes to their DNS configuration while they set up Firezone.

A surprisingly common pattern I've noticed admins follow:

  1. They deploy Firezone
  2. They start adding DNS resources with their existing DNS name(s)
  3. They then realize the power of our wildcard matching and go update their internal DNS records to more easily manage access for their DNS resources, so they align with the matching rules they create

@thomaseizinger thomaseizinger added this pull request to the merge queue Feb 23, 2025
Merged via the queue into main with commit 57ce0ee Feb 23, 2025
116 checks passed
@thomaseizinger thomaseizinger deleted the chore/add-dns-cache branch February 23, 2025 04:43
@thomaseizinger
Member Author

thomaseizinger commented Feb 23, 2025

What might be a good idea here is to evict the cache on a negative lookup to mitigate the above.

I think that may counter what this patch is trying to solve. Currently, we receive lookup errors which I think are due to an overloaded resolver. If we evict the cache on errors, we essentially go back to what we have now, with no easing of the load on the upstream resolver.

@thomaseizinger
Member Author

thomaseizinger commented Feb 23, 2025

4. there's now some level of downtime (up to 30s) where the service is down to end users or other services using the Firezone Gateway

That isn't as bad as (and is a lot less likely than) the disruption we currently create due to the lookup errors. Clients will not route packets for DNS resources if the NAT hasn't been set up successfully on the Gateway, which it can't do if we fail to look up the domain.

@thomaseizinger
Member Author

thomaseizinger commented Feb 23, 2025

Long term we probably need just proper (predictable) cache behavior.

DNS is cached aggressively across the Internet so everyone kind of "knows" that it takes a while to propagate. I'd be very surprised if a 30s cache creates an actual problem.

@jamilbk
Member

jamilbk commented Feb 23, 2025

DNS is cached aggressively across the Internet so everyone kind of "knows" that it takes a while to propagate.

There are two types of DNS resources: internal and external. For external resources I agree with you - the resolver is likely to be overloaded if we don't cache. For internal ones (the majority of what customers are using), I think there could be a problem. In those environments, it could be the case that the resolver happily would have answered our queries.

I'd be very surprised if a 30s cache creates an actual problem.

Kubernetes uses a default TTL of 5s. We are creating a stale cache there. Whether that becomes an issue I suppose we'll see.

ttl allows you to set a custom TTL for responses. The default is 5 seconds. The minimum TTL allowed is 0 seconds, and the maximum is capped at 3600 seconds. Setting TTL to 0 will prevent records from being cached.
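The `ttl` knob quoted above is an option of the CoreDNS kubernetes plugin. An illustrative Corefile fragment (values are examples, not a recommendation):

```
# Corefile fragment (CoreDNS kubernetes plugin) - illustrative only
.:53 {
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        ttl 5          # default is 5s; 0 disables caching, max is 3600
    }
    forward . /etc/resolv.conf
}
```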
