Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fqdn: fix broken sorting algorithm for DNSZombies #27572

Merged
merged 1 commit into from Aug 29, 2023

Conversation

bimmlerd
Copy link
Member

The sorting function for DNSZombies was subtly broken, and didn't do what it advertised. Write some tests to confirm the suspicion and fix the function.

Reported-by: Jussi Maki joamaki@isovalent.com

Prioritization of which DNS mappings to keep was suboptimal, leading to evictions of mappings related to alive connections, worsening performance of fqdn policies and causing spurious logging.

@bimmlerd bimmlerd added kind/bug This is a bug in the Cilium logic. area/fqdn Affects the FQDN policies feature labels Aug 17, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 17, 2023
@bimmlerd bimmlerd added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Aug 17, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 17, 2023
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/fix-sorting-function branch 4 times, most recently from 1d2c827 to f9d1640 Compare August 17, 2023 14:02
@bimmlerd bimmlerd marked this pull request as ready for review August 17, 2023 14:13
@bimmlerd bimmlerd requested review from a team as code owners August 17, 2023 14:13
@bimmlerd
Copy link
Member Author

/test

@bimmlerd
Copy link
Member Author

Given this is a bug, I guess we backport it to at least 1.14. I assess the risk of this breaking something as pretty minimal, so might be worth backporting even further, as it's not great to throw away these mappings for live connections 😓.

@bimmlerd bimmlerd added sig/agent Cilium agent related. needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Aug 17, 2023
@joestringer joestringer added affects/v1.12 This issue affects v1.12 branch affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch labels Aug 17, 2023
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm, nice catch!

pkg/fqdn/cache_test.go Outdated Show resolved Hide resolved
pkg/fqdn/cache_test.go Show resolved Hide resolved
Copy link
Member

@jrajahalme jrajahalme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the bug fix! Could fix the nit by @giorio94 in the test.

@bimmlerd bimmlerd force-pushed the pr/bimmlerd/fix-sorting-function branch from f9d1640 to a5f9713 Compare August 29, 2023 11:33
The sorting function for DNSZombies was subtly broken, and didn't do
what it advertised. Write some tests to confirm the suspicion and fix
the function.

Reported-by: Jussi Maki <joamaki@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/fix-sorting-function branch from a5f9713 to 1483680 Compare August 29, 2023 11:34
@bimmlerd
Copy link
Member Author

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 29, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.2 Aug 29, 2023
@aditighag aditighag merged commit b46867c into cilium:main Aug 29, 2023
60 checks passed
@bimmlerd
Copy link
Member Author

Nominating this for backports all the way to 1.12, as previously mentioned, I assess the risk as low, and this can cause connection drops.

@bimmlerd bimmlerd deleted the pr/bimmlerd/fix-sorting-function branch August 30, 2023 07:58
@wimnat
Copy link

wimnat commented Aug 31, 2023

How would you expect this bug to manifest itself? I'm wondering if it could serve as possible explanation to a problem we have seen recently reported in Slack.

@joamaki
Copy link
Contributor

joamaki commented Sep 4, 2023

When over tofqdns-endpoint-max-ip-per-hostname IPs exist for a name, then this sorting kicks in and tries to evict the oldest entries. Due to the sorting being broken, it may have evicted alive IPs instead leading to connection drops. This was made worse by the conntrack GC interval becoming too long, possibly many hours (this is what marks the aliveness for these entries), leading to this being the only mechanism via which these entries are evicted.

#27870 adds the option of setting the maximum interval for conntrack GC, which should allow configuring it in a way to allow GCing before the max-ips-per-hostname is hit and thus lower the peak amount of allocated identities.

@jibi jibi mentioned this pull request Sep 4, 2023
16 tasks
@jibi jibi added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Sep 4, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.14 in 1.14.2 Sep 4, 2023
@jibi jibi mentioned this pull request Sep 4, 2023
10 tasks
@jibi jibi added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Sep 4, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.7 Sep 4, 2023
@jibi jibi mentioned this pull request Sep 5, 2023
5 tasks
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.12 in 1.12.14 Sep 5, 2023
@jibi jibi added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Sep 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.14 to Backport done to v1.14 in 1.14.2 Sep 7, 2023
@michi-covalent michi-covalent added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Sep 9, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.7 Sep 9, 2023
@michi-covalent michi-covalent added backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. and removed backport-pending/1.12 labels Sep 9, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.12 to Backport done to v1.12 in 1.12.14 Sep 9, 2023
squeed added a commit to squeed/cilium that referenced this pull request Feb 23, 2024
The FQDN proxy will GC IP addresses that are both:
- past their DNS TTLs, and
- no longer in use by active connections

However, many applications do not immediately re-resolve names between
connections, even if the TTL has expired. This can cause traffic to be
dropped unexpectedly.

Previously, this was not an issue, as FQDN GC happened very rarely. This
has been improved, however, by cilium#27572 and cilium#27870. Now, end-users
occasionally being surprised by this.

This sets the default grace period to 60 seconds, in an attempt to find
a good balance between application needs and security.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.12 This issue affects v1.12 branch affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch area/fqdn Affects the FQDN policies feature backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/agent Cilium agent related.
Projects
No open projects
1.12.14
Backport done to v1.12
1.13.7
Backport done to v1.13
1.14.2
Backport done to v1.14
Development

Successfully merging this pull request may close these issues.

None yet

9 participants