feat(gateway): cache DNS queries for resources (#8225)
Conversation
Sentry Issue: GATEWAY-A9
My hypothesis is that we are simply overloading the upstream resolver.
Why not cache for the record's TTL? Isn't that meant to hint to caching resolvers how long to cache for?
I have a sneaking suspicion this will come back to bite us. Consider an admin debugging why DNS resources don't work in their environment. They realize they need to add or remove records in their Route 53 config, or some other internal DNS server, and they set a low TTL so the change propagates quickly. If we implement this change, what do you think of making the cache duration shorter?
It should also probably match what we return for the dummy IPs in the client? |
That wouldn't necessarily match because at the time the query runs on the client, the entry may be half-way through its expiry already. |
Sure, I am not set on a particular TTL. I do think we need one to handle the bursts and the N-to-1 nature of Clients to Gateways.
Yeah, I imagine even a short cache here will be immensely helpful, even 15s or something.
Well, 15s seems a bit extreme. In 99% of cases, these records won't change. Can we just document the 5min cache and tell admins to either wait it out or restart the Gateway if they really need to flush it? If really required, we can also implement a runtime toggle to flush these caches.
@jamilbk Let me know what you think of the docs update. |
Hmm, I'm still a bit hesitant about the arbitrary 5m timeout. I think this is going to cause issues more often than we think. Agreed, 99% of the time the records won't change, but in the 1% of the time, where the admin is likely setting up Firezone for the first time to determine whether it's a fit for their organization, it's likely they'll be tweaking their DNS configuration internally and testing the changes with their local client(s).

Case in point: consider the recent request for SRV records to be supported. The admin will likely hit this 5m timeout, and since the records on the clients themselves don't reflect any changes, they'll need a shell on the Gateway to debug what the actual IP is resolving to. By the time they get a shell on the Gateway and test again, things might be working again, leaving the admin scratching their head. Then, when they discover the arbitrary 5m timeout our "caching resolver" has (by digging through architecture docs, or more likely asking support) that does not respect the commonly used TTL, that might leave them scratching their head even further.

An admin would likely expect any caching resolvers in the path from client -> resource to respect the record's TTL. At this point the admin might be reading the deploy/gateways doc, or more likely is just skimming it while trying to get things working. They're probably already troubleshooting why something's not working and a bit annoyed, and at that moment, this is likely to add to that.

The RFC mentions that the TTL returned from an authoritative server should be considered an upper bound, so I believe we are violating the RFC with the 5m hardcoded timeout here. Would it be too difficult to just respect the TTL received from the upstream resolver? Then we can just say "the Gateway caches DNS responses according to the record's TTL", which is common behaviour for any caching DNS resolver.

Anyway, sorry for the long-winded response, but since DNS is such a notoriously problematic system as it is, I would hope to avoid adding to that.
Also, this may break things on Kubernetes, for example, where service IPs change frequently, or any service that uses HashiCorp's Consul to provide resolution. Cursory research suggests many mature caching resolvers actually return a "live" TTL that decrements from the moment the resolver first received a reply from the authoritative server.
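That decrementing "live TTL" behaviour can be illustrated with a small sketch (the function name here is hypothetical, not the Gateway's actual code): the resolver hands out the record's original TTL minus the time the answer has already spent in its cache, so downstream caches never hold the record longer than the authoritative TTL allows.

```rust
use std::time::{Duration, Instant};

// Sketch of a decrementing "live" TTL. `None` means the entry has expired
// and must be re-fetched from the authoritative server.
fn remaining_ttl(original: Duration, received_at: Instant, now: Instant) -> Option<Duration> {
    // `Duration::checked_sub` returns `None` once the elapsed time exceeds
    // the original TTL, i.e. when the cached record is no longer valid.
    original.checked_sub(now.duration_since(received_at))
}
```

A record received with a 300s TTL and served 120s later would be handed out with a 180s TTL; after 300s it is gone entirely.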
If we are going with a hard-coded TTL, it might be better to pick something likely under the minimum allowed TTL on popular registrars, like 30s, with the caveat that this may break k8s services occasionally. Though Azure, Cloudflare, and many others allow TTLs lower than that.
With the current intended design, such SRV records would not be intercepted on the Client but directly forwarded to the DNS server on the Gateway and thus bypass this cache. |
This is an interesting point. I wouldn't have expected admins to make changes to their DNS configuration while they set up Firezone. My assumption was that organisations have certain DNS records already in place, and when adopting Firezone, they mostly configure those in the portal to ensure they get routed but don't necessarily update their DNS records. We could add a command-line flag to disable or configure the cache for this setup period?
Unfortunately,
Yes, we can start with a more conservative TTL; it is better than what we currently have.
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Well, yes, for the vast majority of cases. There is this edge case now though:
I agree this is probably rare. We were bitten by stale DNS cache bugs relating to k8s service deployments on the old project @bmanifold worked on at Cisco, so I may be a bit biased here.
What might be a good idea here is to evict the cache entry on a negative lookup to mitigate the above. Long term, we probably just need proper (predictable) cache behavior.
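A rough sketch of that eviction idea, assuming a simple name-to-answer map (the types and names are illustrative, not the Gateway's actual data structures):

```rust
use std::collections::HashMap;

// Illustrative cache update: a successful lookup refreshes the entry, while
// a negative lookup (e.g. NXDOMAIN or a resolver error) evicts it, so
// clients are not served a stale answer for a name that no longer resolves.
fn apply_lookup_result(
    cache: &mut HashMap<String, Vec<String>>,
    name: &str,
    result: Result<Vec<String>, String>,
) {
    match result {
        Ok(ips) => {
            cache.insert(name.to_owned(), ips);
        }
        Err(_) => {
            cache.remove(name);
        }
    }
}
```

Note that evicting on *every* error would also forfeit the load-shedding the cache exists for, so a real implementation would need to distinguish authoritative negative answers from transient resolver failures.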
A surprisingly common pattern I've noticed admins follow:
I think that may counter what this patch is trying to solve. Currently, we receive lookup errors which I think are due to an overloaded resolver. If we evict the cache on errors, we essentially go back to what we have now, where there is no easing of the load on the downstream resolver.
That isn't as bad as (and is a lot less likely than) the disruption we currently create due to the lookup errors. Clients will not route packets for DNS resources if the NAT hasn't been set up successfully on the Gateway, which it can't do if we fail to look up the domain.
DNS is cached aggressively across the Internet so everyone kind of "knows" that it takes a while to propagate. I'd be very surprised if a 30s cache creates an actual problem.
There are two types of DNS resources: internal and external. For external resources I agree with you - the resolver is likely to be overloaded if we don't cache. For internal ones (the majority of what customers are using), I think there could be a problem. In those environments, it could be the case that the resolver happily would have answered our queries.
Kubernetes uses a default TTL of 5s. We are creating a stale cache there. Whether that becomes an issue I suppose we'll see.
With the addition of the Firezone Control Protocol, we are now issuing a lot more DNS queries on the Gateway. Specifically, every DNS query for a DNS resource name always triggers a DNS query on the Gateway. This ensures that changes to DNS entries for resources are picked up without having to build any sort of "stale detection" in the Gateway itself. As a result though, a Gateway has to issue a lot of DNS queries to upstream resolvers which in 99% or more cases will return the same result.
To reduce the load on these upstream resolvers, we cache successful results of DNS queries for 5 minutes.
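The fixed-TTL cache described here can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Firezone's actual implementation; the `DnsCache` type and its methods are hypothetical.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

// Hypothetical sketch: successful lookups are kept for a fixed TTL
// (5 minutes in this PR as written), regardless of the record's own TTL.
struct DnsCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, Vec<IpAddr>)>,
}

impl DnsCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Returns the cached answer if it is still within the fixed TTL.
    fn get(&self, name: &str, now: Instant) -> Option<&[IpAddr]> {
        let (inserted_at, ips) = self.entries.get(name)?;
        if now.duration_since(*inserted_at) < self.ttl {
            Some(ips.as_slice())
        } else {
            None
        }
    }

    /// Only successful lookups are cached; errors are never stored.
    fn insert(&mut self, name: String, ips: Vec<IpAddr>, now: Instant) {
        self.entries.insert(name, (now, ips));
    }
}
```

With `DnsCache::new(Duration::from_secs(300))`, a fresh entry is served for 5 minutes and ignored afterwards, which is exactly the trade-off debated in the thread above: predictable load-shedding versus a stale window that ignores the record's own TTL.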