fqdn: Fix GC for dead IPs on live names over limit #22510

joestringer · 2022-12-02T05:27:20Z

A user observed that Cilium's FQDN name cache can accumulate many IP
addresses over time for DNS names that have frequent IP recycling (such as
Amazon S3). In particular, the --tofqdns-endpoint-max-ip-per-hostname
maximum did not seem to apply to such names as long as there is an
active connection for one of the IPs associated with the name. This
could cause an increase in memory usage and identity allocations in the
cluster, which consumes policymap entries as well.

The problem was that the FQDN garbage collection routines applied the
limits to active IPs for active names and cleaned up inactive IPs for
inactive names, but did not apply the limits to inactive IPs that
correspond to names with other active IPs. Those inactive IPs would
hence slip through the garbage collection selection and remain in the
cache indefinitely.

This patch fixes it by including the inactive IPs along with active IPs
in the list of live names to evaluate, which then ensures they are
considered for deletion if they exceed the max-ip-per-hostname limits.
Those limits are then enforced based on how recently the IPs have been
used, favouring IPs with more recently active connections over IPs that
have not been recently used by applications on the node.

Related: #13914
Related: #19452

Improve garbage collection for FQDNs, particularly for names with high-churn IPs such as Amazon S3.

Signed-off-by: Joe Stringer <joe@cilium.io>

joestringer · 2022-12-02T05:43:56Z

/test

joestringer · 2022-12-02T05:50:50Z

Not sure what happened with test-runtime not-failing failure, maybe I triggered the build too soon. Will re-kick it.

joestringer · 2022-12-02T05:51:00Z

/test-runtime

christarazi

Fix makes sense to me, thanks. There seems to be a pattern popping up with these bugs. It seems we are missing a sort of "matrix" / "permutations" testing, like how do we handle GC when there's a mix of alive & dead IPs associated with names or a mix of lookups & zombies. I see in this PR we are adding that sort of test, but I wonder if we have this coverage in other tests.

joestringer · 2022-12-03T01:08:40Z

There is a bit of that, yeah. There aren't that many permutations though, and it seemed like the rest was already covered. Particularly when it comes to enforcing the limit, it's more just about "alive" names and how the limits get enforced against those. Once we do the GC() call we just get back the lists of alive + dead names and clean up the dead ones, so I think at this point we probably have all the coverage we need for this? I'm open to looking into some other pattern if there's a particular test you have in mind.

joestringer · 2022-12-04T20:07:57Z

CI Triage:

ConformanceAKS - Looks like a variation on CI: ConformanceKind1.19: Install Cilium with Encryption: Unable to execute "kubectl port-forward ..." #22435.
ConformanceGKE - Same test failure as CI: ConformanceAKS: client-egress-to-echo-expression-deny: command terminated with exit code 28 #22529, though the individual command that fails is not the same. Unfortunately the Hubble flow output is not available to confirm (CI 3.0 workflows may not gather hubble flow output for the target failure #22539).
External workloads - Hit CI: External Workloads Conformance tests are broken with cilium-cli v0.12.10 #22451.

michi-covalent

just adding another data point, i was hitting this issue with connections to S3 in one of my clusters. i applied this patch, and the number of connection entries in cilium fqdn cache list has been kept under tofqdns-endpoint-max-ip-per-hostname for the past 3 days 🥳

joestringer added the affects/v1.10, affects/v1.11, affects/v1.12, needs-backport/1.12, and release-note/bug labels Dec 2, 2022

maintainer-s-little-helper bot added and removed the dont-merge/needs-release-note-label label Dec 2, 2022

maintainer-s-little-helper bot added this to Needs backport from master in 1.12.5 Dec 2, 2022

joestringer added the sig/agent label Dec 2, 2022

joestringer force-pushed the submit/zombie-limits-for-inactive-ips-of-active-names branch from 12c4be6 to dcd64c1 Compare December 2, 2022 05:43

joestringer marked this pull request as ready for review December 2, 2022 09:12

joestringer requested review from a team as code owners December 2, 2022 09:12

joestringer requested review from jibi and jrajahalme December 2, 2022 09:12

christarazi approved these changes Dec 3, 2022

michi-covalent approved these changes Dec 5, 2022

joestringer merged commit 335b0ae into cilium:master Dec 5, 2022

joestringer deleted the submit/zombie-limits-for-inactive-ips-of-active-names branch December 5, 2022 21:20

joestringer mentioned this pull request Dec 14, 2022

[v1.12] fqdn: Fix GC for dead IPs on live names over limit #22730

Merged

joestringer added backport-pending/1.12 and removed needs-backport/1.12 labels Dec 14, 2022

maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.12 in 1.12.5 Dec 14, 2022

joaoubaldo mentioned this pull request Dec 15, 2022

Partial traffic drop when CNP allows it #20650

Closed

2 tasks

joestringer moved this from Backport pending to v1.12 to Backport done to v1.12 in 1.12.5 Dec 15, 2022

joestringer added backport-done/1.12 and removed backport-pending/1.12 labels Dec 15, 2022

This was referenced Dec 16, 2022

Prepare for release v1.12.5 #22767

Merged

Prepare for release v1.13.0-rc4 #22851

Merged
