policy: Fix selector identity release for FQDN #18166
Conversation
Alexander reports in GitHub issue #18023 that establishing a connection via an FQDN policy, then modifying that FQDN policy, would cause subsequent traffic to the FQDN to be dropped, even if the new policy still allowed the same traffic via a `toFQDNs` statement.

This was caused by overzealous release of CIDR identities while generating a new policy. Although the policy calculation itself keeps all `SelectorCache` entries alive during the policy generation phase (see `cachedSelectorPolicy.setPolicy()`), after the new policy is inserted into the `PolicyCache`, the `distillery` package would clean up the old policy. As part of that cleanup, it would call into the individual selector to call the `RemoveSelectors()` function.

The previous implementation of this logic unintentionally released the underlying identities any time a user of a selector was released, rather than only releasing them when the number of users reached zero and the selector itself would be released. This meant that instead of the `SelectorCache` retaining references to the underlying identities when a policy was updated, the references would be released (and all corresponding BPF resources cleaned up) at the end of the process. This then triggered subsequent connectivity outages.

Fix it by only releasing the identity references once the cached selector itself is removed from the `SelectorCache`.

Fixes: f559cf1 ("selectorcache: Release identities on selector removal")
Fixes: #18023
Reported-by: Alexander Block <ablock84@gmail.com>
Suggested-by: Jarno Rajahalme <jarno@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
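For illustration, here is a minimal, self-contained Go sketch of the corrected reference-counting behaviour. All names here (`selector`, `cache`, `removeUser`, `releaseIdentity`) are hypothetical simplifications for this sketch, not Cilium's actual `SelectorCache` API:

```go
package main

import "fmt"

// identity stands in for a numeric CIDR identity referenced by a selector.
type identity int

// selector is a simplified stand-in for a SelectorCache entry: it is
// shared by multiple policy users and holds references to the identities
// it currently selects.
type selector struct {
	users      map[string]struct{}
	identities []identity
}

// cache is a simplified stand-in for the SelectorCache.
type cache struct {
	selectors map[string]*selector
}

// removeUser detaches one policy user from a selector. The bug was, in
// effect, releasing sel.identities on every call; the fix releases them
// only once the last user is gone and the selector itself is evicted
// from the cache.
func (c *cache) removeUser(key, user string) {
	sel := c.selectors[key]
	delete(sel.users, user)
	if len(sel.users) == 0 {
		// Last user gone: only now is it safe to release the identity
		// references (and the BPF state derived from them).
		for _, id := range sel.identities {
			releaseIdentity(id)
		}
		delete(c.selectors, key)
	}
}

func releaseIdentity(id identity) { fmt.Println("released identity", id) }

func main() {
	c := &cache{selectors: map[string]*selector{
		"fqdn:vagrant-cache.ci.cilium.io": {
			users:      map[string]struct{}{"old-policy": {}, "new-policy": {}},
			identities: []identity{42},
		},
	}}
	// The old policy is cleaned up after the update, but the new policy
	// still uses the selector, so identity 42 must stay alive...
	c.removeUser("fqdn:vagrant-cache.ci.cilium.io", "old-policy")
	// ...and is released only when the last user goes away.
	c.removeUser("fqdn:vagrant-cache.ci.cilium.io", "new-policy")
}
```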
/test
Test that when an FQDN policy is updated to a new policy which still selects the same old FQDN destination, connectivity continues to work. Validated to fail without the previous commit:

K8sFQDNTest
/home/joe/git/cilium/test/ginkgo-ext/scopes.go:473
Validate that FQDN policy continues to work after being updated [It]
/home/joe/git/cilium/test/ginkgo-ext/scopes.go:527

Can't connect to a valid target when it should work
Expected command:
`kubectl exec -n default app2-58757b7dd5-rh7dd -- curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 20 --retry 5 http://vagrant-cache.ci.cilium.io -w "time-> DNS: '%{time_namelookup}(%{remote_ip})', Connect: '%{time_connect}', Transfer '%{time_starttransfer}', total '%{time_total}'"`
To succeed, but it failed:
Exitcode: 28
Err: exit status 28
Stdout: time-> DNS: '0.000016()', Connect: '0.000000', Transfer '0.000000', total '5.000415'
Stderr: command terminated with exit code 28

Signed-off-by: Joe Stringer <joe@cilium.io>
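As a rough illustration of the test flow described above, the sequence can be sketched in Go as plain shell-outs rather than Cilium's actual Ginkgo helpers (which this sketch does not attempt to reproduce). The manifest filenames are hypothetical, and the pod name is the ephemeral one from the CI log:

```go
package main

import (
	"fmt"
	"os/exec"
)

// curlFromPod runs the same connectivity probe as the test: curl from the
// client pod to the FQDN target, giving up after a 5-second connect timeout.
func curlFromPod(pod, target string) error {
	return exec.Command("kubectl", "exec", "-n", "default", pod, "--",
		"curl", "--path-as-is", "-s", "--fail",
		"--connect-timeout", "5", "--max-time", "20", target).Run()
}

// applyPolicy applies a CiliumNetworkPolicy manifest.
func applyPolicy(manifest string) error {
	return exec.Command("kubectl", "apply", "-f", manifest).Run()
}

func main() {
	pod := "app2-58757b7dd5-rh7dd" // ephemeral pod name from the CI log above
	target := "http://vagrant-cache.ci.cilium.io"
	// Hypothetical manifest names: an initial FQDN policy, then an updated
	// policy that still selects the same FQDN destination.
	for _, manifest := range []string{"fqdn-policy.yaml", "fqdn-policy-updated.yaml"} {
		if err := applyPolicy(manifest); err != nil {
			panic(err)
		}
		// Before the fix, the probe after the policy update timed out with
		// curl exit code 28 because the CIDR identities had been released.
		if err := curlFromPod(pod, target); err != nil {
			fmt.Printf("probe after applying %s failed: %v\n", manifest, err)
		}
	}
}
```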
Force-pushed from 6e243e5 to 96844eb (Compare)
Forgot to include the new policy since it was all working locally; the new version includes the replacement policy (same as the original one, with just one extra
/test
🚀
/test-gke

GKE had failed with timeouts while retrieving the Quay images, retriggering.
K8s 1.23 / 4.9 hit an issue that looks similar to #13552, though the full stacktrace is a bit different:

Stacktrace
I have never seen this one; does it ring a bell for anyone?
CI 3.0 GKE workflow hit a Hubble flow listener timeout similar to #17907.
Travis ARM build hit #17444, retriggering as this looks like a transient infra issue.
@nbusseneau I looked into that failure, taking the actual error report. Unfortunately Ginkgo decides to ignore the formatting primitives inside the string and prints it as one giant long line that's hard to read... however, if we just interpret those formatting primitives (and add a bit more tasty, tasty spacing 😋), then we get:
So, it's failing (in pre-flight checks, i.e. before the actual test) due to error output in the
Given that the only failing jobs have been pinpointed as pre-existing issues (i.e. not introduced by this PR), that the new test in the PR passes, and that this is a bugfix, I'll merge this and kick off the backport.
I'll fix up the checkpatch complaint in a separate PR; that one is trivial and also affects existing tests.