repository: shard rules by namespace #27163
base: main
Conversation
This central ruleSlice has for us been one of the main bottlenecks for throughput, and one of the main contributors to CPU usage. With this patch, we now shard it by namespace transparently, so that we don't need to evaluate rules unnecessarily. The main benefit is that when calculating rules for pod X in namespace A, we don't need to evaluate rules that only apply to pods in namespace B. The same applies to rule addition and removal; when a rule is removed in namespace B, we don't need to evaluate rules in namespace A. This is all transparent to users, and an implementation detail, so the sharding mechanism can easily be changed at a later stage.

Based on BenchmarkParseLabel with 10 namespaces/shards, this is the benchstat diff:

```
name          old time/op    new time/op    delta
ParseLabel-7    5.21s ±33%     1.02s ±40%   -80.36%  (p=0.000 n=10+10)

name          old alloc/op   new alloc/op   delta
ParseLabel-7   26.9MB ± 0%    26.9MB ± 0%    -0.33%  (p=0.001 n=9+10)

name          old allocs/op  new allocs/op  delta
ParseLabel-7     520k ± 0%      520k ± 0%    +0.02%  (p=0.000 n=9+9)
```

Signed-off-by: Odin Ugedal <ougedal@palantir.com>
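To make the mechanism concrete, here is a minimal sketch of the sharded layout. The field name `rulesByShard` matches the PR, but the surrounding types are simplified assumptions, not the actual cilium code:

```go
// Sketch only: simplified stand-ins for the real cilium types.
type rule struct{}
type ruleSlice []*rule

// Repository keeps rules in per-namespace shards instead of one flat slice.
type Repository struct {
	// rulesByShard maps a shard key (the policy's namespace, or ""
	// for non-namespaced rules) to the rules in that shard.
	rulesByShard map[string]ruleSlice
}

// add places a rule in the shard for its namespace, so lookups for
// other namespaces never have to visit it.
func (p *Repository) add(shardKey string, r *rule) {
	p.rulesByShard[shardKey] = append(p.rulesByShard[shardKey], r)
}
```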
cc @christarazi. I'll do some more testing, but it would be nice to get some feedback on the overall idea. I think this should reduce our cilium-agent CPU load quite significantly, and it should reduce lock contention as well.
Commit c9919e9f7ae70eb5fbb79abddd5e407e50f26ce1 does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
Thanks for the PR, this is a great idea!
A few comments on the initial approach.
pkg/policy/repository.go (outdated)

```diff
@@ -139,6 +141,18 @@ type Repository struct {
 	getEnvoyHTTPRules func(certificatemanager.SecretManager, *api.L7Rules, string) (*cilium.HttpNetworkPolicyRules, bool)
 }
 
+// getShardKey returns the key to be used for the sharing of Repository.rulesByShard.
```
Suggested change:

```diff
-// getShardKey returns the key to be used for the sharing of Repository.rulesByShard.
+// getShardKey returns the key to be used for the sharding of Repository.rulesByShard.
```

Subtle typo, or did you mean sharing? If the latter, the sentence could somehow be clarified, as it will potentially confuse other readers if said typo was made. :)
Heh, yeah, sharding it should be!
pkg/policy/repository.go (outdated)

```go
func getShardKey(lbls labels.LabelArray) (string, bool) {
	for _, lbl := range lbls {
		if lbl.Key == k8sConst.PolicyLabelNamespace && lbl.Source == labels.LabelSourceK8s {
			return lbl.Value, true
		}
	}
	return "", false
}
```
AFAICT, this returns the label value, but the function is called getShardKey. So for these labels:

```
k8s:foo=bar
k8s:baz=net
```

the result returned would be bar and net, depending on the order of the iteration. Is this the expected behavior? Wouldn't this cause a problem for [1]?
Good point!

I think it should be fine, given it doesn't really make sense to have more than one label with k8s:io.cilium.k8s.policy.namespace=<xyz>. It is set based on the namespace here:

cilium/pkg/k8s/apis/cilium.io/utils/utils.go, lines 345 to 346 (at 6ba173d):

```go
policyLbls := GetPolicyLabels(namespace, name, uid, resourceType)
return append(policyLbls, ruleLbs...).Sort()
```

(for CNP and NetworkPolicy). For CCNP it's not present, and we can fall back to "".

But maybe users can manually set the spec.Labels field with the same source and key, resulting in two of them. I'll look at that case, and I guess we'll have to handle it properly if that's the case. 😄 The same applies to non-namespaced CCNPs: users could manually add a label to those, "tricking" us into thinking it's a namespaced policy when it's not. As I understand it, this should not affect what the policy applies to, just how it's stored in this policy "repo".
Ah yeah, it's a strange edge case. Maybe we can just document that if there is ever more than one namespace label, the one that appears first will be used.
If Cilium prepends the namespace label and we use the first one, then there may be no need to document it at all?
On one hand, a threat actor who could set these labels is a fairly privileged Kubernetes API server attacker; if they can customize the labels of a policy, they already have significant privileges to allow actual traffic. On the other hand, if they only have access to policies in one namespace, they shouldn't be able to "poison" policy shards for other namespaces, even if the result is only some form of additional policy cost (a cost which is currently borne by all users in all namespaces in current versions of Cilium).
pkg/policy/repository.go (outdated)

```go
key, ok := getShardKey(lbls)
relevantShards := p.rulesByShard
if ok {
	relevantShards = map[string]ruleSlice{
		key: p.rulesByShard[key],
	}
}
```
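For context, the excerpt above narrows the lookup to a single shard when a namespace label is present; a simplified sketch of how the surrounding code might then consume relevantShards (the loop body is an assumption, not the actual PR code):

```go
// Iterate only the shards that can possibly match: the single shard
// for the rule's namespace, or all shards when no namespace label was
// found (e.g. cluster-wide policies).
for _, shard := range relevantShards {
	for _, r := range shard {
		// ... evaluate r against the search context ...
		_ = r
	}
}
```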
[1]
Neat proposal! A couple of brief comments from an initial look.

Also, the danger here would be a bug for some corner case of rule format that somehow ends up not getting selected by the rule engine, and hence the policy doesn't apply. I don't have a good grasp on the corner cases, but it'd be nice to see some testing to evaluate those cases. One basic one would be to continue to ensure that CCNPs that select all namespaces will apply to all Pods; see the test sketch below.
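To make that concrete, such a test could look roughly like the following; the helper names are hypothetical, not the cilium test API:

```go
// Hypothetical regression test: a cluster-wide policy (no namespace
// label) must keep matching pods in every namespace after sharding.
func TestClusterwidePolicyAppliesToAllNamespaces(t *testing.T) {
	repo := newTestRepository()             // hypothetical helper
	repo.addClusterwideRule(allowAllRule()) // should land in the "" shard

	for _, ns := range []string{"ns-a", "ns-b"} {
		if !repo.matchesPodInNamespace(ns) { // hypothetical helper
			t.Fatalf("cluster-wide rule did not apply in namespace %s", ns)
		}
	}
}
```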
```diff
@@ -109,7 +109,9 @@ func (p *policyContext) SetDeny(deny bool) bool {
 type Repository struct {
 	// Mutex protects the whole policy tree
 	Mutex lock.RWMutex
-	rules ruleSlice
+	// Internally we shard the rule list to improve performance and maximize throughput
```
I think that this is specifically targeted for policy update performance, is that correct? Also, what's the difference between improving update performance vs. maximizing throughput? Aren't they the same thing?
Suggested change:

```diff
-	// Internally we shard the rule list to improve performance and maximize throughput
+	// Internally we shard the rule list to improve update performance
```
Ah, based on the changes in getMatchingRules(), it looks like this should also improve policy evaluation performance by excluding rule iteration for namespaces that are known to be different from the "current" namespace under iteration. So maybe 'update and evaluation performance'?
Yeah, this started as an experiment where I wanted to improve the policy addition/update/deletion flow, but yeah, it will also reduce policy evaluation time as well! I'll try updating the description when we all agree on the approach and implementation. 😄
Thanks for looking! And thanks for the initial feedback (both @joestringer and @christarazi)!
Yeah, this is my main concern now, so I really want to poke at all those. My main concern is related to #27163 (comment), where users can add extra "namespace" labels to the spec.Labels field.
TODO: will squash when time comes. Signed-off-by: Odin Ugedal <ougedal@palantir.com>
SGTM. Also, when we've narrowed down the user value, it'll be worth updating the release note to say something about improving policy processing performance.
key := "" | ||
matched := false | ||
for _, lbl := range lbls { | ||
if lbl.Key == k8sConst.PolicyLabelNamespace && lbl.Source == labels.LabelSourceK8s { | ||
// If we find two, ignore then and put them into the non-namespaced for easier housekeeping | ||
if matched { | ||
return "", false | ||
} | ||
key = lbl.Value | ||
matched = true | ||
} | ||
} | ||
return key, matched |
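A quick test sketch of the intended fallback behavior; the test itself is an assumption, with names following the thread:

```go
func TestGetShardKey(t *testing.T) {
	nsLabel := func(ns string) labels.Label {
		return labels.NewLabel(k8sConst.PolicyLabelNamespace, ns, labels.LabelSourceK8s)
	}

	// A single namespace label yields that namespace as the shard key.
	if key, ok := getShardKey(labels.LabelArray{nsLabel("ns-a")}); !ok || key != "ns-a" {
		t.Fatalf("got (%q, %v), want (\"ns-a\", true)", key, ok)
	}

	// Two namespace labels fall back to the non-namespaced shard.
	if key, ok := getShardKey(labels.LabelArray{nsLabel("ns-a"), nsLabel("ns-b")}); ok || key != "" {
		t.Fatalf("got (%q, %v), want (\"\", false)", key, ok)
	}
}
```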
The logic in getNamespaceFromLabels and getShardKey seems duplicative; it might be worth implementing one in terms of the other.
key := "" | |
matched := false | |
for _, lbl := range lbls { | |
if lbl.Key == k8sConst.PolicyLabelNamespace && lbl.Source == labels.LabelSourceK8s { | |
// If we find two, ignore then and put them into the non-namespaced for easier housekeeping | |
if matched { | |
return "", false | |
} | |
key = lbl.Value | |
matched = true | |
} | |
} | |
return key, matched | |
if key := getNamespaceFromLabels(lbls); key != "" { | |
return key, true | |
} | |
return "", false |
Sadly, those two functions use different labels; there is one label that is present in endpoint selectors (io.kubernetes.pod.namespace), while there is another one in the rule.Labels field (io.cilium.k8s.policy.namespace).
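For reference, a sketch of the two distinct label keys in play; the constant names below are illustrative, not the actual cilium identifiers:

```go
const (
	// Present in endpoint selectors, identifying the pod's namespace.
	podNamespaceLabel = "io.kubernetes.pod.namespace"
	// Present in rule.Labels, identifying the policy's namespace.
	policyLabelNamespace = "io.cilium.k8s.policy.namespace"
)
```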
I was thinking we could reuse the k8s label search logic by adjusting the function definition for reuse. The refactor I had in mind is a bit more than fits in a GitHub suggestion, so I created odinuge#94.
This pull request has been automatically marked as stale because it has not had recent activity.
I would still like to see this PR move forward.
This pull request has been automatically marked as stale because it has not had recent activity.
Hey! Yes, will rebase and fix this as soon as we get #30338 sorted. :)
This pull request has been automatically marked as stale because it has not had recent activity.
@odinuge I'm picking this up to try and get it merged for v1.16.
The easy alternative is reverting 4cb0940, but I'm not too sure how happy people would be with that. IMO, we should not introduce more "indexes" here until we are 100% sure the current state handles all the edge cases, with tests ensuring that is the case.
This seems like a good chance to convert the PolicyRepository to StateDB over the coming months.
When converting more critical components over to StateDB, we'll need to exercise a lot of care and restraint. I'm open to the idea, but we should roll StateDB out incrementally through the tree, starting with non-critical components and moving towards critical ones, so we can ensure correctness and properly understand the performance characteristics.
Also, @odinuge, it looks like what this PR does is improve the time to insert/replace rules in the repository, and it doesn't particularly improve endpoint validation. Which is fine, but the description is misleading ;-). From your tests, which is more critical: rule addition/deletion time, or endpoint policy calculation?
I filed #32703, which was inspired by this but takes a different tack.
This pull request has been automatically marked as stale because it has not had recent activity.
TBD: Still some stuff missing here, so this is mostly a proof-of-concept.
The more rules, the bigger the gain will be! We also see that the locking in (*Repository).getMatchingRules takes a huge chunk of CPU time, as well as causing a lot of lock contention.
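A rough sketch of the kind of micro-benchmark this refers to; the helpers and shape are assumptions (BenchmarkParseLabel in the cilium tree is the real one):

```go
// Hypothetical benchmark: resolve rules for endpoints spread across
// 10 namespaces, so per-namespace sharding can show its effect.
func BenchmarkResolveAcrossNamespaces(b *testing.B) {
	repo := newTestRepository(10 /* namespaces */, 1000 /* rules each */) // hypothetical helper
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ns := fmt.Sprintf("ns-%d", i%10)
		repo.resolveRulesForNamespace(ns) // hypothetical stand-in for the real lookup
	}
}
```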