Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1.13] Address selectorcache concurrent read/write #29186

Merged
merged 1 commit into from
Nov 16, 2023

Conversation

tklauser
Copy link
Member

See commit message for details.

v1.13 version of #29167

@tklauser tklauser added kind/bug This is a bug in the Cilium logic. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. kind/backports This PR provides functionality previously merged into master. sig/agent Cilium agent related. backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. labels Nov 15, 2023
@tklauser
Copy link
Member Author

/test-backport-1.13

As in 52ace8e, there have been reports of a
runtime crash with with following error:

    fatal error: concurrent map read and map write

(This commit message is heavily inspired by the above commit in the
hopes of achieving a similar level of clarity.)

The path for this issue is printed also in the logs, with the following
call stack:

github.com/cilium/cilium/pkg/policy.(*SelectorCache).GetNetsLocked(...)
github.com/cilium/cilium/pkg/policy.getNets(...)
github.com/cilium/cilium/pkg/policy.identityIsSupersetOf(...)
github.com/cilium/cilium/pkg/policy.(*mapState).DenyPreferredInsertWithChanges(...)

Given the abovementioned commit has fixed the issue on main, there must be a
different call stack present in 1.13. Indeed, as shown below, there's the flow
involving `addNewRedirects` which isn't covered by locking in `DistillPolicy`.

INCOMING CALLS
GetNetsLocked github.com/cilium/cilium/pkg/policy • selectorcache.go
  getNets github.com/cilium/cilium/pkg/policy • mapstate.go
    identityIsSupersetOf github.com/cilium/cilium/pkg/policy • mapstate.go
      DenyPreferredInsertWithChanges github.com/cilium/cilium/pkg/policy • mapstate.go
        addNewRedirectsFromDesiredPolicy github.com/cilium/cilium/pkg/policy • bpf.go
          addNewRedirects github.com/cilium/cilium/pkg/endpoint • bpf.go <--- No SelectorCache lock

        DenyPreferredInsert github.com/cilium/cilium/pkg/policy • mapstate.go
          ToMapState github.com/cilium/cilium/pkg/policy • l4.go
            addNewRedirectsFromDesiredPolicy github.com/cilium/cilium/pkg/policy • bpf.go
              addNewRedirects github.com/cilium/cilium/pkg/endpoint • bpf.go <--- No SelectorCache lock

            computeDirectionL4PolicyMapEntries github.com/cilium/cilium/pkg/policy • resolve.go
              computeDesiredL4PolicyMapEntries github.com/cilium/cilium/pkg/policy • resolve.go
                DistillPolicy github.com/cilium/cilium/pkg/policy • resolve.go <--- SelectorCache.RLock()

          DetermineAllowLocalhostIngress github.com/cilium/cilium/pkg/policy • mapstate.go
            DistillPolicy github.com/cilium/cilium/pkg/policy • resolve.go <--- SelectorCache.RLock()

          computeDirectionL4PolicyMapEntries github.com/cilium/cilium/pkg/policy • resolve.go
            computeDesiredL4PolicyMapEntries github.com/cilium/cilium/pkg/policy • resolve.go
              DistillPolicy github.com/cilium/cilium/pkg/policy • resolve.go <--- SelectorCache.RLock()

        consumeMapChanges github.com/cilium/cilium/pkg/policy • mapstate.go
          ConsumeMapChanges github.com/cilium/cilium/pkg/policy • resolve.go <--- SelectorCache.Lock()

Read the above tree as "GetNetsLocked() is called by getNets()",
"getNets() is called by entryIdentityIsSupersetOf()", and so on.
Siblings at the same level of indent represent alternate callers of the
function that is one level of indentation less in the tree, ie
addNewRedirectsFromDesiredPolicy and DenyPreferredInsert() both call
DenyPreferredInsertWithChanges().

As annotated above, we see that calls through addNewRedirects() do not
grab the SelectorCache lock. Given that consumeMapChanges() grabs the
SelectorCache lock, we cannot introduce a new lock acquisition in any
descendent function, otherwise it would introduce a deadlock in
goroutines that follow that call path. This provides us the option to
lock at some point from the sibling of consumeMapChanges() or higher in
the call stack.

As addNewRedirectsFromDesiredPolicy calls
DenyPreferredInsertWithChanges both directly and indirectly (via
ToMapState), it represents the lower bound of the call stack range where
we can lock. The further up we traverse in the call stack, the greater
the risk of a deadlock becomes, as more and more code now runs under the
SelectorCache lock. Hence, we opt for locking at the lower bound, in
addNewRedirectsFromDesiredPolicy.

Unfortunately, this function lives in pkg endpoint, not policy, and we
thus need to allow pkg-external access to the SelectorCache mutex, which
is not great. The only silver lining is that this hack has an expiration
time, as it is not present on main.

Fixes: 90f8859 ("endpoint: Remove GetLabelsLocked")

Co-authored-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@tklauser tklauser force-pushed the pr/tklauser/v1.13-fix-selectorcache-locking branch from c2d0cb4 to 2dba4dd Compare November 15, 2023 14:04
@tklauser
Copy link
Member Author

tklauser commented Nov 15, 2023

/test-backport-1.13

Reworded commit message and added a comment, following #29167 (comment)

@tklauser tklauser requested review from nathanjsweet and a team November 16, 2023 08:58
@tklauser tklauser marked this pull request as ready for review November 16, 2023 08:58
@tklauser tklauser requested a review from a team as a code owner November 16, 2023 08:58
@tklauser
Copy link
Member Author

Blocking merge until @cilium/sig-policy reviewed.

@nathanjsweet nathanjsweet merged commit bb96a59 into v1.13 Nov 16, 2023
129 checks passed
@nathanjsweet nathanjsweet deleted the pr/tklauser/v1.13-fix-selectorcache-locking branch November 16, 2023 15:09
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. sig/agent Cilium agent related. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
No open projects
Status: Released
Development

Successfully merging this pull request may close these issues.

None yet

4 participants