Skip to content

Commit

Permalink
endpoint: lock the selector cache for reading
Browse files Browse the repository at this point in the history
As in 52ace8e, there have been reports of a
runtime crash with with following error:

    fatal error: concurrent map read and map write

(This commit message is heavily inspired by the above commit in the
hopes of achieving a similar level of clarity.)

The path for this issue is printed also in the logs, with the following
call stack:

github.com/cilium/cilium/pkg/policy.(*SelectorCache).GetNetsLocked(...)
github.com/cilium/cilium/pkg/policy.getNets(...)
github.com/cilium/cilium/pkg/policy.identityIsSupersetOf(...)
github.com/cilium/cilium/pkg/policy.(*mapState).DenyPreferredInsertWithChanges(...)

Given the abovementioned commit has fixed the issue on main, there must be a
different call stack present in 1.12. Indeed, as shown below, there's the flow
involving `addNewRedirects` which isn't covered by locking in `DistillPolicy`.

INCOMING CALLS
GetNetsLocked github.com/cilium/cilium/pkg/policy • selectorcache.go
  getNets github.com/cilium/cilium/pkg/policy • mapstate.go
    identityIsSupersetOf github.com/cilium/cilium/pkg/policy • mapstate.go
      DenyPreferredInsertWithChanges github.com/cilium/cilium/pkg/policy • mapstate.go
        addNewRedirectsFromDesiredPolicy github.com/cilium/cilium/pkg/policy • bpf.go
          addNewRedirects github.com/cilium/cilium/pkg/endpoint • bpf.go <--- No SelectorCache lock

        DenyPreferredInsert github.com/cilium/cilium/pkg/policy • mapstate.go
          ToMapState github.com/cilium/cilium/pkg/policy • l4.go
            addNewRedirectsFromDesiredPolicy github.com/cilium/cilium/pkg/policy • bpf.go
              addNewRedirects github.com/cilium/cilium/pkg/endpoint • bpf.go <--- No SelectorCache lock

            computeDirectionL4PolicyMapEntries github.com/cilium/cilium/pkg/policy • resolve.go
              computeDesiredL4PolicyMapEntries github.com/cilium/cilium/pkg/policy • resolve.go
                DistillPolicy github.com/cilium/cilium/pkg/policy • resolve.go <--- SelectorCache.RLock()

          DetermineAllowLocalhostIngress github.com/cilium/cilium/pkg/policy • mapstate.go
            DistillPolicy github.com/cilium/cilium/pkg/policy • resolve.go <--- SelectorCache.RLock()

          computeDirectionL4PolicyMapEntries github.com/cilium/cilium/pkg/policy • resolve.go
            computeDesiredL4PolicyMapEntries github.com/cilium/cilium/pkg/policy • resolve.go
              DistillPolicy github.com/cilium/cilium/pkg/policy • resolve.go <--- SelectorCache.RLock()

        consumeMapChanges github.com/cilium/cilium/pkg/policy • mapstate.go
          ConsumeMapChanges github.com/cilium/cilium/pkg/policy • resolve.go <--- SelectorCache.Lock()

Read the above tree as "GetNetsLocked() is called by getNets()",
"getNets() is called by entryIdentityIsSupersetOf()", and so on.
Siblings at the same level of indent represent alternate callers of the
function that is one level of indentation less in the tree, ie
addNewRedirectsFromDesiredPolicy and DenyPreferredInsert() both call
DenyPreferredInsertWithChanges().

As annotated above, we see that calls through addNewRedirects() do not
grab the SelectorCache lock. Given that consumeMapChanges() grabs the
SelectorCache lock, we cannot introduce a new lock acquisition in any
descendent function, otherwise it would introduce a deadlock in
goroutines that follow that call path. This provides us the option to
lock at some point from the sibling of consumeMapChanges() or higher in
the call stack.

As addNewRedirectsFromDesiredPolicy calls
DenyPreferredInsertWithChanges both directly and indirectly (via
ToMapState), it represents the lower bound of the call stack range where
we can lock. The further up we traverse in the call stack, the greater
the risk of a deadlock becomes, as more and more code now runs under the
SelectorCache lock. Hence, we opt for locking at the lower bound, in
addNewRedirectsFromDesiredPolicy.

Unfortunately, this function lives in pkg endpoint, not policy, and we
thus need to allow pkg-external access to the SelectorCache mutex, which
is not great. The only silver lining is that this hack has an expiration
time, as it is not present on main.

Fixes: d68efba (endpoint: Remove GetLabelsLocked)

Co-authored-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
  • Loading branch information
bimmlerd and tklauser committed Nov 16, 2023
1 parent df576f3 commit 4764e4f
Show file tree
Hide file tree
Showing 2 changed files with 14 additions and 0 deletions.
6 changes: 6 additions & 0 deletions pkg/endpoint/bpf.go
Original file line number Diff line number Diff line change
Expand Up @@ -296,13 +296,19 @@ func (e *Endpoint) addNewRedirectsFromDesiredPolicy(ingress bool, desiredRedirec
direction = trafficdirection.Egress
}

// We need to hold the SelectorCache (read) lock here, as DenyPreferredInsertWithChanges
// will call the getNetsLocked function on the selector cache deeper down in the call
// stack. Similarly, ToMapState calls DenyPreferredInsertWithChanges, hence we take the
// lock at this point already.
selectorCache.RLock()
keysFromFilter := l4.ToMapState(e, direction, selectorCache)
for keyFromFilter, entry := range keysFromFilter {
if entry.IsRedirectEntry() {
entry.ProxyPort = redirectPort
}
e.desiredPolicy.PolicyMapState.DenyPreferredInsertWithChanges(keyFromFilter, entry, adds, nil, old, selectorCache)
}
selectorCache.RUnlock()
}
}

Expand Down
8 changes: 8 additions & 0 deletions pkg/policy/selectorcache.go
Original file line number Diff line number Diff line change
Expand Up @@ -253,6 +253,14 @@ type SelectorCache struct {
userNotes []userNotification
}

func (sc *SelectorCache) RLock() {
sc.mutex.RLock()
}

func (sc *SelectorCache) RUnlock() {
sc.mutex.RUnlock()
}

// GetModel returns the API model of the SelectorCache.
func (sc *SelectorCache) GetModel() models.SelectorCache {
sc.mutex.RLock()
Expand Down

0 comments on commit 4764e4f

Please sign in to comment.