fqdn: Fix GC for dead IPs on live names over limit #22510
Conversation
A user observed that Cilium's FQDN name cache can accumulate many IP addresses over time for DNS names that have frequent IP recycling (such as Amazon S3). In particular, the "--tofqdns-endpoint-max-ip-per-hostname" maximum did not seem to apply to such names as long as there is an active connection for one of the IPs associated with the name. This could cause an increase in memory usage and identity allocations in the cluster, which consumes policymap entries as well.

The problem was that the FQDN garbage collection routines would apply the limits to active IPs for active names and clean up inactive IPs and inactive names, but it would not apply to the inactive IPs that correspond to names with other active IPs. These would hence slip through the garbage collection selection and remain in the cache indefinitely.

This patch fixes it by including the inactive IPs along with active IPs in the list of live names to evaluate, which then ensures they are considered for deletion if they exceed the max-ip-per-hostname limits. Those limits are then enforced based on how recently the IPs have been used, favouring IPs with more recently active connections over IPs that have not been recently used by applications on the node.

Signed-off-by: Joe Stringer <joe@cilium.io>
Force-pushed from 12c4be6 to dcd64c1 (Compare)
/test
Not sure what happened with the test-runtime failure; maybe I triggered the build too soon. Will re-kick it.
/test-runtime
Fix makes sense to me, thanks. There seems to be a pattern popping up with these bugs. It seems we are missing a sort of "matrix" / "permutations" testing, like how we handle GC when there's a mix of alive & dead IPs associated with names, or a mix of lookups & zombies. I see in this PR we are adding that sort of test, but I wonder if we have this coverage in other tests.
There is a bit of that, yeah. There aren't so many permutations though, and it seemed like the rest was already covered. Particularly when it comes to enforcing the limit, it's more just about "alive" names and how the limits get enforced against those. Once we do the
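The "matrix" of permutations the review raises could be exercised with a small table-driven check. The `gcSurvivors` function below is a hypothetical model of the post-fix behaviour (a name with any alive IP is live, alive IPs are kept up to the limit, and dead IPs on a live name only fill the remaining slots); it is not Cilium's GC code:

```go
package main

import "fmt"

// gcSurvivors models how many cached IPs survive GC for one name,
// given counts of alive and dead IPs and the per-hostname limit.
// Illustrative only; the real Cilium GC operates on richer types.
func gcSurvivors(alive, dead, limit int) int {
	if alive == 0 {
		return 0 // fully dead name: everything is collected
	}
	kept := alive
	if kept > limit {
		kept = limit // limits apply to alive IPs too
	}
	if room := limit - kept; room > 0 {
		// Dead IPs on a live name now count against the limit
		// instead of slipping through GC entirely.
		if dead < room {
			kept += dead
		} else {
			kept += room
		}
	}
	return kept
}

func main() {
	// A small alive/dead permutation matrix, in the spirit of the
	// coverage the review asks about.
	cases := []struct{ alive, dead, limit, want int }{
		{2, 0, 4, 2}, // alive only, under limit
		{2, 5, 4, 4}, // mix: dead IPs are bounded by the limit
		{5, 0, 4, 4}, // alive over limit: truncated
		{0, 3, 4, 0}, // dead name: fully collected
	}
	for _, c := range cases {
		got := gcSurvivors(c.alive, c.dead, c.limit)
		fmt.Printf("alive=%d dead=%d limit=%d -> kept=%d (want %d)\n",
			c.alive, c.dead, c.limit, got, c.want)
	}
}
```

Before the fix, the second row would have kept all 7 entries, since the dead IPs on a live name were never considered for eviction.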
CI Triage:
Just adding another data point: I was hitting this issue with connections to S3 in one of my clusters. I applied this patch, and the number of connection entries in cilium fqdn cache list has been kept under tofqdns-endpoint-max-ip-per-hostname for the past 3 days 🥳
Related: #13914
Related: #19452