Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v1.14] Backport FQDN: fix incorrect reaping of newly-expired IPs #31337

Merged
merged 2 commits into from Mar 12, 2024

Conversation

squeed
Copy link
Contributor

@squeed squeed commented Mar 12, 2024

Once this PR is merged, a GitHub action will update the labels of these PRs:

$ for pr in 31205; do contrib/backporting/set-labels.py $pr done 1.14; done

[ upstream commit 03f8c85 ]

[ backporter's notes: needed to adapt to some gc signature changes, but
  logic unchanged. ]

The FQDN GC subsystem waits before a successful CT GC run before marking
IPs as stale. However, we were erroneously marking CT GC as successful
even on failure, or when only run for a single family.

So, only mark notify FQDN  when we've done a successful GC pass for all
configured families.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
[ upstream commit b284170 ]

[ backporter's notes: one signature change, passing string instead of
  IP. Otherwise unchanged ]

A bug was found where a low-TTL name was incorrectly reaped despite
being part of an active connection. After looking at logs, and
reproducing locally, it was determined that there is an unfortunate
interleaving between the DNS and CT GC loops.

The code attempts to prevent this issue by ensuring that names inserted
after CT GC has started are exempt from reaping. However, we don't
actually track the insertion time, we track the DNS TTL expiration time,
which is strictly in the past. In fact, it can be up to a minute in the
past. We shouldn't rely on timestamps anyways, as the scheduler can
always play tricks on us.

So, if a CT GC run has started and finished in the time between name
expiration and insertion in to zombies, the IP address is immediately
considered dead and unnecessarily reaped.

Timeline:

T1. name expires
T2. CT GC starts and finishes
T3. Zombies.SetCTGCTime(T2)
T4. Zombies.Upsert(name, T1)
T5. Zombies.GC()

At T5, zombies.GC will remove IPs associated with name, because T2 > T1.

The solution is to use an explicit serial number to ensure that CTGC has
completed a full run before we are allowed to delete an IP. We actually
need to let CT GC run twice, as it may have started before this zombie
was added and thus not marked it alive.

Additionally, we already have a grace period, the idle connection
timeout, that gives applications a chance to re-use an expired IP.
However, we did not respect this grace period if the IP in question did
not have an entry in conntrack. So, pad deletion time by this grace
period as well, just to be sure this grace period applies to all
possible deletions.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
@squeed squeed requested a review from a team as a code owner March 12, 2024 08:53
@squeed squeed added kind/backports This PR provides functionality previously merged into master. backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. release-blocker/1.14 This issue will prevent the release of the next version of Cilium. labels Mar 12, 2024
@squeed
Copy link
Contributor Author

squeed commented Mar 12, 2024

/test-backport-1.14

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 12, 2024
@dylandreimerink dylandreimerink merged commit 6a989a2 into cilium:v1.14 Mar 12, 2024
60 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-blocker/1.14 This issue will prevent the release of the next version of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants