Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix FQDN memory leak #17432

Merged
merged 2 commits into from Sep 22, 2021
Merged

Fix FQDN memory leak #17432

merged 2 commits into from Sep 22, 2021

Conversation

aanm
Copy link
Member

@aanm aanm commented Sep 17, 2021

In the FQDN architecture there's a DNS Cache per endpoint, used to track
which domain names each endpoint makes DNS requests, and a global DNS
Cache where its main functionality is to help tracking which
api.FQDNSelector present in the policy applies to locally running
endpoints. The latter, as opposed to the former, didn't have any
cleanup mechanism for the map that tracked which entries should be
garbage collected, making the global DNS Cache to grow.
This commit prevents those entries from being tracked for Garbage
Collection in the global DNS Cache.

Fixes #16300

Fix memory leak that can occur with the presence of FQDN policies

The public function ForceExpiredByNames is not executed from anywhere so
this function can be safely removed.

Signed-off-by: André Martins <andre@cilium.io>
In the FQDN architecture there's a DNS Cache per endpoint, used to track
which domain names each endpoint makes DNS requests, and a global DNS
Cache where its main functionality is to help tracking which
api.FQDNSelector present in the policy applies to locally running
endpoints. The latter, as opposed to the former, didn't have any
cleanup mechanism for the map that tracked which entries should be
garbage collected, making the global DNS Cache to grow.
This commit prevents those entries from being tracked for Garbage
Collection in the global DNS Cache.

Signed-off-by: André Martins <andre@cilium.io>
@aanm aanm added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.10 labels Sep 17, 2021
@aanm aanm requested review from a team and joamaki September 17, 2021 22:20
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.5 Sep 17, 2021
@aanm
Copy link
Member Author

aanm commented Sep 17, 2021

test-me-please

Copy link
Contributor

@joamaki joamaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to double check that I understood this correctly:

  1. The memory leak was due to the "cleanup" map growing without bounds due to the default DNS cache not running a GC on the cache
  2. The default DNS cache only contains a few entries which is on the order of number of local endpoints
  3. These entries are cleaned up when the endpoints change, but we were left with the "cleanup" map entries?

If I understood correctly, then LGTM.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 20, 2021
Copy link
Member

@aditighag aditighag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spotting! I have some reservations about the temporary fix put in in this PR. /cc @jrajahalme who might have more context.

Edit: Also, is this a regression in recent releases? I'm surprised that the issue wasn't reported until now.

daemon/cmd/fqdn.go Show resolved Hide resolved
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 20, 2021
@aanm
Copy link
Member Author

aanm commented Sep 20, 2021

Just to double check that I understood this correctly:

  1. The memory leak was due to the "cleanup" map growing without bounds due to the default DNS cache not running a GC on the cache

yes

  1. The default DNS cache only contains a few entries which is on the order of number of local endpoints

not quite, from my experiments it contains as many entries as the number of DNS requests performed by each endpoint

  1. These entries are cleaned up when the endpoints change, but we were left with the "cleanup" map entries?

Correct, that is my understanding as well.

If I understood correctly, then LGTM.

@joamaki replied inline. Let me know if this doesn't answer your questions.

@aanm
Copy link
Member Author

aanm commented Sep 20, 2021

Edit: Also, is this a regression in recent releases? I'm surprised that the issue wasn't reported until now.

@aditighag I don't believe it's a regression since this code exists since the feature was introduced.

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the change (depending on my understanding) LGTM. One minor nit on the naming.

Similar to Aditi's comments though, why is this particular change safe. AFAIU, we are setting the map to nil so that we never insert entries into it, effectively disabling the DNS cache for expired FQDNs altogether. Is that safe? I think this is probably what @aditighag was trying to get at.

I assume that it's safe, but would be good to document the "why" in the commit msg.

@@ -160,6 +160,12 @@ func NewDNSCacheWithLimit(minTTL int, limit int) *DNSCache {
return c
}

func (c *DNSCache) DisableCleanupTrack() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (c *DNSCache) DisableCleanupTrack() {
func (c *DNSCache) DisableCleanupTracking() {

@aanm
Copy link
Member Author

aanm commented Sep 20, 2021

I think the change (depending on my understanding) LGTM. One minor nit on the naming.

Similar to Aditi's comments though, why is this particular change safe. AFAIU, we are setting the map to nil so that we never insert entries into it, effectively disabling the DNS cache for expired FQDNs altogether. Is that safe? I think this is probably what @aditighag was trying to get at.

@christarazi AFAICT we are not disabling the DNS cache for expired FQDNs, we are not tracking which entries should be cleaned up by the GC because these entries are cleaned up when the endpoints change.

I assume that it's safe, but would be good to document the "why" in the commit msg.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 21, 2021
@christarazi christarazi merged commit 8983227 into cilium:master Sep 22, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.5 Sep 29, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.5 Sep 29, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.5 Oct 2, 2021
@aanm aanm deleted the pr/fix-fqdn-memory-leak branch October 15, 2021 22:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.10.5
Backport done to v1.10
Development

Successfully merging this pull request may close these issues.

Memory leak with FQDN policies
5 participants