policy: Fix concurrent access of SelectorCache #24322
6c94a50 to 18ebb46
/test
Fix makes sense to me.
1. Should we go a step further in this PR as an additional commit to rename GetLabels here to GetLabelsLocked to document the assumption that the SelectorCache is intended to be locked?
2. What do you think of clarifying the locking expectations for the functions in common between the write-locked SelectorCache (ConsumeMapChanges) and the read-locked SelectorCache (DistillPolicy)?
Also, as an FYI the call graph in the commit msg omits ToMapState as a potential caller; it's a sibling of DenyPreferredInsert, btw. Not sure if its omission was intended or not, however I don't think it fundamentally impacts the conclusions drawn from the call graph. It was more that my expectation of the call graph was that it was exhaustive, so when I double-checked locally, I noticed this gap.
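The `Locked` suffix suggested above is a common Go convention for methods that assume the caller already holds the lock. A minimal sketch of how that could look, assuming a simplified cache (the `labelCache` type, its fields, and the label value are hypothetical and not the actual SelectorCache):

```go
package main

import (
	"fmt"
	"sync"
)

// labelCache is a hypothetical stand-in for a cache guarded by an RWMutex.
type labelCache struct {
	mu     sync.RWMutex
	labels map[uint32][]string
}

// GetLabelsLocked returns the labels for id. The caller must already hold
// c.mu (read or write); the suffix documents that expectation.
func (c *labelCache) GetLabelsLocked(id uint32) []string {
	return c.labels[id]
}

// GetLabels takes the read lock itself, so it is safe to call without
// holding c.mu. It must not be called while c.mu is already held.
func (c *labelCache) GetLabels(id uint32) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.GetLabelsLocked(id)
}

func main() {
	c := &labelCache{labels: map[uint32][]string{1: {"reserved:host"}}}

	// Caller without the lock uses the self-locking variant.
	fmt.Println(c.GetLabels(1))

	// Caller that already holds the lock uses the Locked variant.
	c.mu.RLock()
	fmt.Println(c.GetLabelsLocked(1))
	c.mu.RUnlock()
}
```

With this naming, a call to a `...Locked` method from a path that never takes the lock stands out in review.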
Yeah, sounds like (1) and (2) above would be good follow-ups. As it turns out, if you start the callgraph from |
Marco Iorio reports that with previous code, Cilium could crash at
runtime after importing a network policy, with the following error
printed to the logs:

    fatal error: concurrent map read and map write

The path for this issue is printed also in the logs, with the following
call stack:

    pkg/policy.(*SelectorCache).GetLabels(...)
    pkg/policy.(*MapStateEntry).getNets(...)
    pkg/policy.entryIdentityIsSupersetOf(...)
    pkg/policy.MapState.denyPreferredInsertWithChanges(...)
    pkg/policy.MapState.DenyPreferredInsert(...)
    pkg/policy.(*EndpointPolicy).computeDirectionL4PolicyMapEntries(...)
    pkg/policy.(*EndpointPolicy).computeDesiredL4PolicyMapEntries(...)
    pkg/policy.(*selectorPolicy).DistillPolicy(...)
    pkg/policy.(*cachedSelectorPolicy).Consume(...)
    pkg/endpoint.(*Endpoint).regeneratePolicy(...)
    ...

Upon further inspection, this call path is not grabbing the
SelectorCache lock at any point. If we check all of the incoming calls
to this function, we can see multiple higher level functions calling
into this function. The following tree starts from the deepest level of
the call stack and increasing indentation represents one level higher in
the call stack.

    INCOMING CALLS
    - f GetLabels github.com/cilium/cilium/pkg/policy • selectorcache.go
      - f getNets github.com/cilium/cilium/pkg/policy • mapstate.go
        - f entryIdentityIsSupersetOf github.com/cilium/cilium/pkg/policy • mapstate.go
          - f denyPreferredInsertWithChanges github.com/cilium/cilium/pkg/policy • mapstate.go
            - f DenyPreferredInsert github.com/cilium/cilium/pkg/policy • mapstate.go
              - f computeDirectionL4PolicyMapEntries github.com/cilium/cilium/pkg/policy • resolve.go
                - f computeDesiredL4PolicyMapEntries github.com/cilium/cilium/pkg/policy • resolve.go
                  + f DistillPolicy github.com/cilium/cilium/pkg/policy • resolve.go <--- No SelectorCache lock
              - f DetermineAllowLocalhostIngress github.com/cilium/cilium/pkg/policy • mapstate.go
                + f DistillPolicy github.com/cilium/cilium/pkg/policy • resolve.go <--- No SelectorCache lock
            - f consumeMapChanges github.com/cilium/cilium/pkg/policy • mapstate.go
              + f ConsumeMapChanges github.com/cilium/cilium/pkg/policy • resolve.go <--- Already locks the SelectorCache

Read the above tree as "GetLabels() is called by getNets()",
"getNets() is called by entryIdentityIsSupersetOf()", and so on.
Siblings at the same level of indent represent alternate callers of the
function that is one level of indentation less in the tree, i.e.
DenyPreferredInsert() and consumeMapChanges() both call
denyPreferredInsertWithChanges().

As annotated above, we see that calls through DistillPolicy() do not
grab the SelectorCache lock. Given that ConsumeMapChanges() grabs the
SelectorCache lock, we cannot introduce a new lock acquisition in any
descendant function, otherwise it would introduce a deadlock in
goroutines that follow that call path. This provides us the option to
lock at some point from the sibling of consumeMapChanges() or higher in
the call stack.

Given that the ancestors of DenyPreferredInsert() are all from
DistillPolicy(), we can amortize the cost of grabbing the SelectorCache
lock by grabbing it once for the policy distillation phase rather than
putting the lock into DenyPreferredInsert() where the SelectorCache
could be locked and unlocked for each map state entry.

Future work could investigate whether these call paths could make use of
the IdentityAllocator's cache of local identities for the GetLabels()
call rather than relying on the SelectorCache, but for now this patch
should address the immediate locking issue that triggers agent crashes.

CC: Nate Sweet <nathanjsweet@pm.me>
Fixes: c9f0def ("policy: Fix Deny Precedence Bug")
Reported-by: Marco Iorio <marco.iorio@isovalent.com>
Co-authored-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
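The locking strategy described in this commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: `cache`, `update`, `insertEntry`, and `distill` are hypothetical names standing in for the SelectorCache, an identity update, DenyPreferredInsert(), and DistillPolicy() respectively, and none of it is the actual Cilium code.

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a hypothetical stand-in for the SelectorCache.
type cache struct {
	mu     sync.RWMutex
	labels map[int]string
}

// getLabelsLocked reads the map; the caller must hold c.mu.
func (c *cache) getLabelsLocked(id int) string { return c.labels[id] }

// update models a concurrent identity update (the map writer).
func (c *cache) update(id int, label string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.labels[id] = label
}

// insertEntry stands in for the per-entry insert. It takes no lock,
// because some of its callers already hold one.
func insertEntry(c *cache, out map[int]string, id int) {
	out[id] = c.getLabelsLocked(id)
}

// distill stands in for the distillation phase: the read lock is taken
// once for the whole pass instead of once per map state entry.
func distill(c *cache, ids []int) map[int]string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make(map[int]string, len(ids))
	for _, id := range ids {
		insertEntry(c, out, id)
	}
	return out
}

func main() {
	c := &cache{labels: map[int]string{1: "a", 2: "b"}}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // writer
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			c.update(3, "c")
		}
	}()
	go func() { // reader
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			distill(c, []int{1, 2, 3})
		}
	}()
	wg.Wait()

	fmt.Println(len(distill(c, []int{1, 2, 3})))
}
```

Removing the `RLock`/`RUnlock` pair from `distill` leaves the reader unsynchronized with the writer, which is the kind of access that the Go runtime can abort with "concurrent map read and map write". Taking the lock inside `insertEntry` instead would be unsafe for callers that already hold it: `sync.RWMutex` is not reentrant, and its documentation prohibits recursive read locking.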
18ebb46 to 5bb0706
/test
We are running cilium. We are getting the same error as mentioned in #24021
In L4Filter.ToMapState a reference to the SelectorCache is passed down
to denyPreferredInsertWithChanges, in order to get the most specific
CIDR for a given identity with GetNetsLocked. GetNetsLocked requires the
SelectorCache to be read-locked.

denyPreferredInsertWithChanges is also called from
EndpointPolicy.ConsumeMapChanges, where the SelectorCache must already
be locked to avoid concurrent identity updates. In L4Filter.ToMapState,
by contrast, it is possible to lock the SelectorCache just before its
usage in GetNetsLocked. Narrowing the critical section reduces the
contention on the lock, hopefully improving concurrency.

A SelectorCacheWrapper wraps a reference to a SelectorCache and
overrides GetNetsLocked in order to lock the cache for the execution of
the method. The new type satisfies the policy.Identities interface and
can be used interchangeably with the wrapped SelectorCache.

This makes it possible to remove the locking of the SelectorCache for
the entire duration of selectorPolicy.DistillPolicy and
L4DirectionPolicy.updateRedirects.

Related: cilium#24322
Related: cilium#22966
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
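The wrapper approach described in this commit can be sketched roughly as follows. The types below are simplified assumptions for illustration: the `Identities` interface is reduced to a single method, CIDRs are plain strings, and `lookup` is a hypothetical consumer, so the names mirror the commit message but the code is not the actual Cilium implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// Identities is a hypothetical, reduced form of the interface that the
// insert path consumes.
type Identities interface {
	GetNetsLocked(id uint32) []string
}

// selectorCache is a simplified stand-in for the SelectorCache.
type selectorCache struct {
	mutex sync.RWMutex
	nets  map[uint32][]string
}

// GetNetsLocked requires the caller to hold sc.mutex.
func (sc *selectorCache) GetNetsLocked(id uint32) []string {
	return sc.nets[id]
}

// selectorCacheWrapper embeds the cache and shadows GetNetsLocked so that
// the read lock is held only for the duration of the lookup.
type selectorCacheWrapper struct {
	*selectorCache
}

func (w selectorCacheWrapper) GetNetsLocked(id uint32) []string {
	w.mutex.RLock()
	defer w.mutex.RUnlock()
	return w.selectorCache.GetNetsLocked(id)
}

// lookup is a hypothetical consumer that works with either implementation.
func lookup(ids Identities, id uint32) []string {
	return ids.GetNetsLocked(id)
}

func main() {
	sc := &selectorCache{nets: map[uint32][]string{100: {"10.0.0.0/8"}}}

	// Path that already holds the lock: pass the cache directly.
	sc.mutex.RLock()
	fmt.Println(lookup(sc, 100))
	sc.mutex.RUnlock()

	// Path that holds no lock: pass the wrapper, which locks per lookup.
	fmt.Println(lookup(selectorCacheWrapper{sc}, 100))
}
```

The trade-off is that the choice of implementation now encodes the locking contract: in this sketch, passing the wrapper from a path that already holds the write lock would block forever, so the already-locked path has to keep passing the unwrapped cache.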
Fixes: #24021
Supersedes: #24032
Supersedes: #24283