PolicyRepository: index and replace rules by resource. #32703

squeed · 2024-05-24T09:57:28Z

(Note for reviewers: this also contains a commit fixing some improper use of shared state in tests; it is mostly mechanical but required for this change).

The PolicyRepository is a database of all known network policies (KNP / CNP / CCNP / gRPC Policies). It references policies by Labels, which are an arbitrary set of identifiers for a policy. When an "upstream" policy is updated, a synthetic set of labels is created and used as a key for replacement.

This label-oriented referencing is inefficient. When a policy is upserted, all existing rules must be scanned to see if they are candidates for replacement. Furthermore, it doesn't reflect how policy rules are actually created: each rule has an upstream owning resource, which creates one or more downstream rules in the repository.
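To make the cost concrete, here is a minimal sketch of label-keyed replacement. It is not the actual repository code; `labelSet`, `labelRepo`, and `replaceByLabels` are hypothetical names chosen for illustration. The point is only that every upsert has to visit every stored rule.

```go
package main

import "fmt"

// labelSet is a hypothetical, simplified stand-in for a policy's labels.
type labelSet map[string]struct{}

func newLabelSet(ls ...string) labelSet {
	out := make(labelSet, len(ls))
	for _, l := range ls {
		out[l] = struct{}{}
	}
	return out
}

// contains reports whether every label in want is also present in ls.
func (ls labelSet) contains(want labelSet) bool {
	for l := range want {
		if _, ok := ls[l]; !ok {
			return false
		}
	}
	return true
}

type rule struct {
	labels labelSet
	spec   string
}

// labelRepo stores rules in a flat list with no index (illustrative only).
type labelRepo struct {
	rules []rule
}

// replaceByLabels drops every rule carrying all labels in key, then appends
// newRules. It reports how many rules were inspected and how many removed.
func (r *labelRepo) replaceByLabels(key labelSet, newRules []rule) (scanned, removed int) {
	kept := r.rules[:0] // filter in place
	for _, old := range r.rules {
		scanned++
		if old.labels.contains(key) {
			removed++
			continue
		}
		kept = append(kept, old)
	}
	r.rules = append(kept, newRules...)
	return scanned, removed
}

func main() {
	repo := &labelRepo{}
	for i := 0; i < 1000; i++ {
		repo.rules = append(repo.rules, rule{
			labels: newLabelSet(fmt.Sprintf("policy=p%d", i)),
			spec:   "allow",
		})
	}
	// Replacing a single policy still inspects the whole repository.
	scanned, removed := repo.replaceByLabels(
		newLabelSet("policy=p7"),
		[]rule{{labels: newLabelSet("policy=p7"), spec: "allow-v2"}},
	)
	fmt.Printf("scanned=%d removed=%d total=%d\n", scanned, removed, len(repo.rules))
}
```

Loading N policies one at a time this way costs on the order of N² label comparisons, which is why startup with many policies suffers.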

This change reorients the PolicyRepository to index and replace rules on a per-resource basis. This pattern is already well established within Cilium, most notably the IPCache, and has proven itself. It also means that the standard policy actions do not require a whole-repository scan or evaluating labels.
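A rough sketch of the per-resource idea follows. It is a simplification, not the code in this PR: `ResourceID` here is a plain string standing in for the ipcache resource ID type, `repo` and `replaceByResource` are hypothetical names, and treating `idx` as the rule's position within its owning resource is an assumption made for the example.

```go
package main

import "fmt"

// ResourceID is a simplified stand-in for the owning resource's identifier.
type ResourceID string

type rule struct{ spec string }

// ruleKey identifies one rule. Assumption for this sketch: idx is the rule's
// position within the list of rules derived from the owning resource.
type ruleKey struct {
	resource ResourceID
	idx      uint
}

// repo keeps a flat rule map plus a secondary index by owning resource.
type repo struct {
	rules      map[ruleKey]*rule
	byResource map[ResourceID]map[ruleKey]*rule
}

func newRepo() *repo {
	return &repo{
		rules:      map[ruleKey]*rule{},
		byResource: map[ResourceID]map[ruleKey]*rule{},
	}
}

// replaceByResource swaps out all rules owned by res and returns how many
// old rules were removed. Only that resource's rules are touched.
func (r *repo) replaceByResource(res ResourceID, newRules []*rule) (removed int) {
	for k := range r.byResource[res] {
		delete(r.rules, k)
		removed++
	}
	delete(r.byResource, res)
	if len(newRules) == 0 {
		return removed // plain deletion of the resource
	}
	owned := make(map[ruleKey]*rule, len(newRules))
	for i, nr := range newRules {
		k := ruleKey{resource: res, idx: uint(i)}
		owned[k] = nr
		r.rules[k] = nr
	}
	r.byResource[res] = owned
	return removed
}

func main() {
	r := newRepo()
	r.replaceByResource("cnp/default/web", []*rule{{"allow-80"}, {"allow-443"}})
	r.replaceByResource("cnp/default/db", []*rule{{"allow-5432"}})
	removed := r.replaceByResource("cnp/default/web", []*rule{{"allow-8080"}})
	fmt.Printf("removed=%d total=%d\n", removed, len(r.rules))
}
```

The cost of an upsert now depends on the number of rules owned by the resource, not on the size of the repository, and no labels are evaluated.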

The existing label-based replacement mechanism must be preserved for policies managed by the local API, so the existing code cannot be entirely removed, but it will seldom be used in the field.

(This PR was inspired by #27163, but takes a different tack).

Starting cilium-agent with large numbers of network policies should be much faster.

squeed · 2024-05-24T09:59:42Z

/test

squeed · 2024-05-27T12:38:57Z

Tagging @aanm for review too; you understand this bit of code somewhat.

pkg/policy/fuzz_test.go

pkg/policy/l4_filter_test.go

pkg/policy/repository.go

pkg/policy/repository_test.go

pkg/policy/rule.go

christarazi

LGTM, very nice cleanup! Do we have any performance benchmark numbers that can tell us how much this improves the startup time?

The stateful test scaffolding was getting in the way of future changes, especially surrounding the shared SelectorCache. Tests were adding identities, which affected other tests. Furthermore, testing the package with --count=2 reliably failed due to left-behind state. This mechanical change aggregates all useful variables behind a single struct. No test logic has been changed. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
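As an illustration of the pattern (not the actual test code), the sketch below bundles per-test state behind one struct so that nothing leaks between tests or between repeated runs. `selectorCache`, `testData`, `newTestData`, and `addIdentity` are hypothetical, simplified names.

```go
package main

import "fmt"

// selectorCache is a hypothetical stand-in for the shared SelectorCache.
type selectorCache struct {
	identities map[uint32][]string
}

// testData aggregates everything a test needs. Each test builds its own
// instance instead of mutating package-level variables.
type testData struct {
	sc *selectorCache
}

func newTestData() *testData {
	return &testData{sc: &selectorCache{identities: map[uint32][]string{}}}
}

// addIdentity mutates only this test's private cache.
func (td *testData) addIdentity(id uint32, lbls ...string) {
	td.sc.identities[id] = lbls
}

func main() {
	// Simulate running the same test twice, as with --count=2.
	for run := 1; run <= 2; run++ {
		td := newTestData()
		fmt.Printf("run %d starts with %d identities\n", run, len(td.sc.identities))
		td.addIdentity(100, "app=web")
	}
}
```

Because each run constructs fresh state, the second run starts from the same empty cache as the first.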

UID is not a safe resource for indexing; multiple rules may have the same UID, and UID is not guaranteed to be unique across different resources. Furthermore, informers may coalesce delete + add events into an update, thus losing the UID edge regardless. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
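The coalescing problem can be shown with a small sketch. It is illustrative only: `object`, `resourceID`, and the `kind/namespace/name` key format are assumptions for the example, not the real types. A resource is deleted and recreated with a new UID, but the consumer only ever sees an update.

```go
package main

import "fmt"

// object is a hypothetical, pared-down view of a policy resource.
type object struct {
	kind, namespace, name string
	uid                   string
	rules                 []string
}

// resourceID builds a key from kind, namespace, and name (assumed format).
func resourceID(o object) string {
	return o.kind + "/" + o.namespace + "/" + o.name
}

func main() {
	v1 := object{"cnp", "default", "web", "uid-1", []string{"allow-80"}}
	// The resource was deleted and recreated. The informer coalesced the
	// delete + add into one update, so no delete for uid-1 is delivered.
	v2 := object{"cnp", "default", "web", "uid-2", []string{"allow-443"}}

	byUID := map[string][]string{}
	byResource := map[string][]string{}
	for _, o := range []object{v1, v2} {
		byUID[o.uid] = o.rules             // uid-1 entry is never cleaned up
		byResource[resourceID(o)] = o.rules // same key, old rules replaced
	}

	fmt.Printf("entries keyed by UID: %d\n", len(byUID))
	fmt.Printf("entries keyed by resource: %d\n", len(byResource))
}
```

Keyed by UID, the rules of the old incarnation linger; keyed by the owning resource, the update naturally replaces them.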

squeed · 2024-05-30T10:05:42Z

/test

squeed · 2024-05-30T10:24:58Z

Do we have any performance benchmark numbers that can tell us how much this improves the startup time?

@christarazi I threw together a quick benchmark:

$ benchstat bench-before.txt bench-after.txt 
goos: linux
goarch: amd64
pkg: github.com/cilium/cilium/pkg/policy
cpu: 12th Gen Intel(R) Core(TM) i7-1250U
            │ bench-before.txt │           bench-after.txt            │
            │      sec/op      │    sec/op     vs base                │
AddRules-12      241.740µ ± 7%   1.538µ ± 49%  -99.36% (p=0.000 n=10)

Good enough for me :-)

When upserting a CNP or KNP, we identify existing rules in the repository by a set of labels. However, evaluating this set of labels is expensive, especially as we must check against all label selectors every time we want to add or remove a policy. Rather than using label selectors internally, track policies by owning resource, much the way that prefixes are tracked in the ipcache. Then, when upserting policies, the set of existing rules attached to a given resource can be easily retrieved. The existing behavior is preserved, as it is also exposed via the local gRPC API. However, the k8s handlers no longer use it. Signed-off-by: Casey Callendrello <cdc@isovalent.com>

squeed · 2024-05-30T11:28:03Z

Tests caught a flake -- just some test code that expected a certain ordering. Fixed.

squeed · 2024-05-30T11:28:05Z

/test

aanm

@squeed I'm 15 mins too late, but I left a comment that I believe should be addressed.

aanm · 2024-05-30T12:59:57Z

pkg/policy/rule.go

 	slim_metav1 "github.com/cilium/cilium/pkg/k8s/slim/k8s/apis/meta/v1"
 	"github.com/cilium/cilium/pkg/labels"
 	"github.com/cilium/cilium/pkg/lock"
 	"github.com/cilium/cilium/pkg/option"
 	"github.com/cilium/cilium/pkg/policy/api"
 )

+type ruleKey struct {
+	resource ipcachetypes.ResourceID
+	idx      uint


Index in reference to what? A comment here would be good.

Sounds good, I'll add a comment in an upcoming PR.

aanm · 2024-05-30T13:01:23Z

I also ran a local benchmark to compare the PR against current main and I didn't see any regressions.

squeed requested a review from a team as a code owner May 24, 2024 09:57

squeed requested a review from doniacld May 24, 2024 09:57

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label ("The author needs to describe the release impact of these changes.") May 24, 2024

squeed mentioned this pull request May 24, 2024

repository: shard rules by namespace #27163

Draft

squeed requested a review from aanm May 27, 2024 12:38