policy: Fix selector identity release for FQDN #18166

Merged: 2 commits merged into cilium:master from submit/fix-18023 on Dec 9, 2021

Conversation

@joestringer (Member) commented Dec 8, 2021

Alexander reports in issue #18023 that establishing a connection
via an FQDN policy, then modifying that FQDN policy, would cause
subsequent traffic to the FQDN to be dropped, even if the new policy
still allowed the same traffic via a toFQDNs statement.

This was caused by overzealous release of CIDR identities while
generating a new policy. Although the policy calculation itself keeps
all SelectorCache entries alive during the policy generation phase (see
cachedSelectorPolicy.setPolicy()), after the new policy is inserted
into the PolicyCache, the distillery package would clean up the old
policy. As part of that cleanup, it would call RemoveSelectors() to
release the individual selectors of the old policy.

The previous implementation of this logic unintentionally released the
underlying identities any time a user of a selector was released, rather
than only releasing the underlying identities when the number of users
reached zero and the selector itself would be released. This meant that
rather than the SelectorCache retaining references to the underlying
identities when a policy was updated, the references would be
released (and all corresponding eBPF resources cleaned up) at the end of
the process. This then triggered subsequent connectivity outages.

Fix it by only releasing the identity references once the cached
selector itself is removed from the SelectorCache.

Fixes: f559cf1 ("selectorcache: Release identities on selector removal")
Reported-by: Alexander Block <ablock84@gmail.com>
Suggested-by: Jarno Rajahalme <jarno@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>

Fixes: #18023
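
For illustration, here is a minimal Go sketch (not Cilium's actual types; all names are hypothetical) of the release-on-last-user pattern that the fix restores: identity references held by a cached selector are dropped only when its last user goes away and the selector itself leaves the cache, never on each individual user removal.

```go
// Hypothetical sketch of the "release on last user" pattern; the names here
// (selectorCache, RemoveUser, identityAllocator, ...) are illustrative and
// are not Cilium's actual API.
package main

import "fmt"

// identityAllocator stands in for the component that owns identity references.
type identityAllocator interface {
	Release(ids []int)
}

// selector caches the identities selected on behalf of one or more policies.
type selector struct {
	users      map[string]struct{} // policies currently using this selector
	identities []int               // identity references held on their behalf
}

type selectorCache struct {
	alloc     identityAllocator
	selectors map[string]*selector
}

// RemoveUser drops one user of the named selector. Identity references are
// released only when the user count reaches zero and the selector itself is
// removed from the cache; releasing them on every call (the old behaviour)
// would tear down identities still needed by the replacement policy.
func (sc *selectorCache) RemoveUser(name, user string) {
	sel, ok := sc.selectors[name]
	if !ok {
		return
	}
	delete(sel.users, user)
	if len(sel.users) > 0 {
		return // another policy still references this selector; keep identities
	}
	delete(sc.selectors, name)
	sc.alloc.Release(sel.identities) // safe: no user references the selector anymore
}

type printAlloc struct{}

func (printAlloc) Release(ids []int) { fmt.Println("released identities:", ids) }

func main() {
	sc := &selectorCache{
		alloc: printAlloc{},
		selectors: map[string]*selector{
			"toFQDNs:vagrant-cache.ci.cilium.io": {
				users:      map[string]struct{}{"old-policy": {}, "new-policy": {}},
				identities: []int{16777217, 16777218}, // example CIDR identity IDs
			},
		},
	}
	sc.RemoveUser("toFQDNs:vagrant-cache.ci.cilium.io", "old-policy") // nothing released yet
	sc.RemoveUser("toFQDNs:vagrant-cache.ci.cilium.io", "new-policy") // last user gone: released
}
```

Under this scheme, a policy update that re-adds the same toFQDNs selector keeps the underlying CIDR identities (and their eBPF state) alive across the swap, which is the behaviour the connectivity test below verifies.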

@joestringer joestringer requested a review from a team as a code owner December 8, 2021 02:54
@joestringer joestringer requested a review from a team December 8, 2021 02:54
@joestringer joestringer added needs-backport/1.10 release-note/bug This PR fixes an issue in a previous release of Cilium. labels Dec 8, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 8, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 8, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.6 Dec 8, 2021
@joestringer (Member, Author):

/test

Test that when FQDN policy is updated to a new policy which still
selects the same old FQDN destination, connectivity continues to work.

Validated to fail without the previous commit:

    K8sFQDNTest
    /home/joe/git/cilium/test/ginkgo-ext/scopes.go:473
      Validate that FQDN policy continues to work after being updated [It]
      /home/joe/git/cilium/test/ginkgo-ext/scopes.go:527

      Can't connect to to a valid target when it should work
      Expected command: kubectl exec -n default app2-58757b7dd5-rh7dd --
        curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5
            --max-time 20 --retry 5 http://vagrant-cache.ci.cilium.io
            -w "time-> DNS: '%{time_namelookup}(%{remote_ip})',
                Connect: '%{time_connect}',
                Transfer '%{time_starttransfer}',
                total '%{time_total}'"
      To succeed, but it failed:
      Exitcode: 28
      Err: exit status 28
      Stdout:
             time-> DNS: '0.000016()', Connect: '0.000000',Transfer '0.000000', total '5.000415'
      Stderr:
             command terminated with exit code 28

Signed-off-by: Joe Stringer <joe@cilium.io>
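
For reference, a rough sketch of the shape of such a test, written with plain Ginkgo, Gomega, and kubectl rather than Cilium's own test framework (the pod name, policy manifest file names, and suite bootstrap are assumptions for illustration):

```go
// Illustrative sketch only; the real K8sFQDNTest uses Cilium's test framework.
// The pod name, policy manifests, and target URL below are assumptions.
package fqdn

import (
	"os/exec"
	"testing"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

func TestFQDNPolicyUpdate(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "FQDN policy update")
}

// kubectl runs a kubectl command and returns any execution error.
func kubectl(args ...string) error {
	return exec.Command("kubectl", args...).Run()
}

// curlFrom curls the target URL from inside the given pod.
func curlFrom(pod, url string) error {
	return kubectl("exec", "-n", "default", pod, "--",
		"curl", "--fail", "--connect-timeout", "5", "--max-time", "20", url)
}

var _ = Describe("FQDN policy update", func() {
	const (
		clientPod = "app2"                              // hypothetical client pod
		target    = "http://vagrant-cache.ci.cilium.io" // FQDN from the failure above
	)

	It("continues to allow traffic after the policy is replaced", func() {
		// Apply the initial toFQDNs policy and confirm connectivity.
		Expect(kubectl("apply", "-f", "fqdn-policy.yaml")).To(Succeed())
		Expect(curlFrom(clientPod, target)).To(Succeed())

		// Replace it with a policy that still selects the same FQDN
		// (the same policy plus one extra toFQDNs statement).
		Expect(kubectl("apply", "-f", "fqdn-policy-updated.yaml")).To(Succeed())

		// Before the fix, the selector's CIDR identities were released
		// during the update and this second check timed out.
		Expect(curlFrom(clientPod, target)).To(Succeed())
	})
})
```

Before the fix, the second connectivity check is the one that fails, matching the curl timeout (exit code 28) shown above.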
@joestringer (Member, Author):

Forgot to include the new policy since it was all working locally; the new version includes the replacement policy (same as the original one, with just one extra toFQDNs statement).

@joestringer (Member, Author):

/test

@christarazi (Member) left a comment:


🚀

@nbusseneau (Member) commented Dec 8, 2021

/test-gke

GKE had failed with timeouts while retrieving the Quay images, retriggering.

@nbusseneau (Member):

K8s 1.23 / 4.9 hit an issue that looks similar to #13552, though the full stacktrace is a bit different:

Stacktrace
/home/jenkins/workspace/Cilium-PR-K8s-1.23-kernel-4.9/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:465
cilium pre-flight checks failed
Expected
    <*errors.errorString | 0xc002160ca0>: {
        s: "Cilium validation failed: 4m0s timeout expired: Last polled error: controllers are failing: cilium-agent 'cilium-7cszc': controller ipcache-inject-labels is failing: Exitcode: 0 \nStdout:\n \t KVStore:                Ok   Disabled\n\t Kubernetes:             Ok   1.23+ (v1.23.0-rc.0) [linux/amd64]\n\t Kubernetes APIs:        [\"cilium/v2::CiliumClusterwideNetworkPolicy\", \"cilium/v2::CiliumNetworkPolicy\", \"cilium/v2::CiliumNode\", \"cilium/v2alpha1::CiliumEndpointSlice\", \"core/v1::Namespace\", \"core/v1::Node\", \"core/v1::Pods\", \"core/v1::Service\", \"discovery/v1::EndpointSlice\", \"networking.k8s.io/v1::NetworkPolicy\"]\n\t KubeProxyReplacement:   Disabled   \n\t Host firewall:          Disabled\n\t Cilium:                 Ok   1.11.90 (v1.11.90-96844eb)\n\t NodeMonitor:            Listening for events on 3 CPUs with 64x4096 of shared memory\n\t Cilium health daemon:   Ok   \n\t IPAM:                   IPv4: 6/254 allocated from 10.0.0.0/24, IPv6: 6/254 allocated from fd02::/120\n\t BandwidthManager:       Disabled\n\t Host Routing:           Legacy\n\t Masquerading:           IPTables [IPv4: Disabled, IPv6: Enabled]\n\t Controller Status:      39/40 healthy\n\t   Name                                          Last success   Last error   Count   Message\n\t   bpf-map-sync-cilium_lxc                       5s ago         never        0       no error                                                                                                                                                                    \n\t   cilium-health-ep                              59s ago        never        0       no error                                                                                                                                                                    \n\t   dns-garbage-collector-job                     2s ago         never        0       no error                                                                                                                                                                    \n\t   endpoint-1217-regeneration-recovery           never          never        0       no error                                                                                                                                                                    \n\t   endpoint-1248-regeneration-recovery           never          never        0       no error                                                                                                                                                                    \n\t   endpoint-1474-regeneration-recovery           never          never        0       no error                                                                                                                                                                    \n\t   endpoint-2615-regeneration-recovery           never          never        0       no error                                                                                                                                                                    \n\t   endpoint-2810-regeneration-recovery           never          never        0       no error                                                                                                                                                                    \n\t   endpoint-817-regeneration-recovery            never          never        0       no error                                                                            
                                                                                        \n\t   endpoint-gc                                   4m3s ago       never        0       no error                                                                                                                                                                    \n\t   ipcache-inject-labels                         4m0s ago       9s ago       22      faile...

Gomega truncated this representation as it exceeds 'format.MaxLength'.
Consider having the object provide a custom 'GomegaStringer' representation
or adjust the parameters in Gomega's 'format' package.

Learn more here: https://onsi.github.io/gomega/#adjusting-output

to be nil
/home/jenkins/workspace/Cilium-PR-K8s-1.23-kernel-4.9/src/github.com/cilium/cilium/test/k8sT/assertionHelpers.go:157

I have never seen this one; does it ring a bell to anyone?

@nbusseneau (Member):

CI 3.0 GKE workflow hit a Hubble flow listener timeout similar to #17907.

@nbusseneau (Member):

Travis ARM build hit #17444, retriggering as this looks like a transient infra issue.

@joestringer (Member, Author) commented Dec 9, 2021

@nbusseneau I looked into that failure. Taking the actual error report, unfortunately ginkgo ignores the formatting primitives inside the string and prints it as one giant long line that's hard to read... however, if we just interpret those formatting primitives (and add a bit more tasty, tasty spacing 😋), then we get:

        s: "Cilium validation failed:
        4m0s timeout expired:
        Last polled error: controllers are failing:
        cilium-agent 'cilium-7cszc':
        controller ipcache-inject-labels is failing: Exitcode: 0 
Stdout:
 	 KVStore:                Ok   Disabled
	 Kubernetes:             Ok   1.23+ (v1.23.0-rc.0) [linux/amd64]
	 Kubernetes APIs:        [\"cilium/v2::CiliumClusterwideNetworkPolicy\", \"cilium/v2::CiliumNetworkPolicy\", \"cilium/v2::CiliumNode\", \"cilium/v2alpha1::CiliumEndpointSlice\", \"core/v1::Namespace\", \"core/v1::Node\", \"core/v1::Pods\", \"core/v1::Service\", \"discovery/v1::EndpointSlice\", \"networking.k8s.io/v1::NetworkPolicy\"]
	 KubeProxyReplacement:   Disabled   
	 Host firewall:          Disabled
	 Cilium:                 Ok   1.11.90 (v1.11.90-96844eb)
	 NodeMonitor:            Listening for events on 3 CPUs with 64x4096 of shared memory
	 Cilium health daemon:   Ok   
	 IPAM:                   IPv4: 6/254 allocated from 10.0.0.0/24, IPv6: 6/254 allocated from fd02::/120
	 BandwidthManager:       Disabled
	 Host Routing:           Legacy
	 Masquerading:           IPTables [IPv4: Disabled, IPv6: Enabled]
	 Controller Status:      39/40 healthy
	   Name                                          Last success   Last error   Count   Message
	   bpf-map-sync-cilium_lxc                       5s ago         never        0       no error                                                                                                                                                                    
	   cilium-health-ep                              59s ago        never        0       no error                                                                                                                                                                    
	   dns-garbage-collector-job                     2s ago         never        0       no error                                                                                                                                                                    
	   endpoint-1217-regeneration-recovery           never          never        0       no error                                                                                                                                                                    
	   endpoint-1248-regeneration-recovery           never          never        0       no error                                                                                                                                                                    
	   endpoint-1474-regeneration-recovery           never          never        0       no error                                                                                                                                                                    
	   endpoint-2615-regeneration-recovery           never          never        0       no error                                                                                                                                                                    
	   endpoint-2810-regeneration-recovery           never          never        0       no error                                                                                                                                                                    
	   endpoint-817-regeneration-recovery            never          never        0       no error                                                                                                                                                                    
	   endpoint-gc                                   4m3s ago       never        0       no error                                                                                                                                                                    
	   ipcache-inject-labels                         4m0s ago       9s ago       22      faile...

So, it's failing (in pre-flight checks, i.e. before the actual test) due to error output in the cilium status controller fields. This is related to the kube-apiserver policy feature; Chris has been iterating to improve that feature lately, and we expect that #18150 will fix this issue.

@joestringer (Member, Author):

Given that the only failing jobs have been pinpointed to pre-existing issues (i.e. not introduced by this PR), the new test in the PR passes, and this is a bugfix, I'll merge this and kick off the backport.

@joestringer (Member, Author):

I'll fix up the checkpatch complaint in a separate PR; that one is trivial and also affects existing tests.

@joestringer joestringer merged commit 80ff05a into cilium:master Dec 9, 2021
@joestringer joestringer deleted the submit/fix-18023 branch December 9, 2021 01:03
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.6 Dec 9, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.6 Dec 9, 2021
@tklauser tklauser added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels Dec 16, 2021
@joestringer joestringer added this to Backport done to v1.11 in 1.11.1 Jan 5, 2022
@christarazi christarazi added the sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. label Jun 10, 2022