sql: fix data race in MemberOfWithAdminOption #98469
This patch ensures that if the leader of a singleflight flight has its context cancelled while waiting for the result of the flight closure, it will block until the closure returns to prevent any possible data races on any data shared with the closure. Release note: None
This patch changes the role members cache singleflight to inherit cancellation since the flight closure uses the caller's transaction. Release note: None
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
@@ -251,7 +251,13 @@ func (c *call) result(ctx context.Context, leader bool) Result {
 	case <-c.c:
 	case <-ctx.Done():
 		op := fmt.Sprintf("%s:%s", c.opName, c.key)
-		if !leader {
+		if leader {
I think that this should only happen if InheritCancelation is true. Another thing that might be good to do here is to log if we wait a long time here. Maybe:
	var timer timeutil.Timer
	defer timer.Stop()
	const slowLogTimeout = time.Second
	timer.Reset(slowLogTimeout)
	start := timeutil.Now()
	select {
	case <-c.c:
		return
	case <-timer.C:
		timer.Read = true
		log.Infof(ctx, "have been waiting for singleflight %s:%s for %v after cancelation", c.opName, c.key, timeutil.Since(start))
	}
	<-c.c
I think that this should only happen if InheritCancelation is true.
There would still be a potential for a data race even if InheritCancelation is false, right? Unless we rely on the fact that all users of the singleflight package set InheritCancelation: true if they're sharing data between the caller and the closure.
I don't know. I think I'd argue instead that the singleflight users that share state and want to exit early are abusing the API. Maybe we should just document that in a comment and not build this into the singleflight package.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @rafiss)
pkg/util/syncutil/singleflight/singleflight.go
line 254 at r2 (raw file):
I think that this should only happen if InheritCancelation is true.
There would still be a potential for a data race even if InheritCancelation is false, right?
It seems to me that the current race is possible both with and without InheritCancelation, but the bad outcome of the race is much more likely without InheritCancelation; with InheritCancelation, both Do and WaitForResult are interrupted at the same time. Without InheritCancelation, only WaitForResult is interrupted, which allows the closure to keep running after the caller has unwound.
I think I'd argue instead that the singleflights users which share state and want to exit early are abusing the API.
The problem, as I see it, is that with the current API it's kinda hard for a caller to opt out of exiting early in the WaitForResult() call. I think maybe it'd be reasonable to make it easy to opt-out, but it needs to be an option and not forced like in the current patch. But,
This patch changes the role members cache singleflight to inherit
cancellation since the flight closure uses the caller's transaction.
I want to understand if this makes good sense, and in turn if we should worry about "sharing data with the closure" at all.
Can you explain to me whether it's a good thing for the leader's txn to be used in the closure by the leader, seeing how the flight is possibly joined by others, with other transactions, and those guys are presumably just fine with the closure having used an unrelated txn? Can we make the closure use a new txn?
Are there other uses of singleflight, perhaps in descriptor lease acquisitions, where there's a similar pattern?
In descriptor leasing we create a new transaction. I argue that we shouldn't be using a shared transaction in the singleflight. It was done here only because it was expedient, IIRC. I argued against it at the time.
Doesn't it need to use the same transaction so that you can be sure that the singleflight lookup is using the same system.role_members table version as the transaction that is trying to check role membership?
If the leader needs the same txn, then how come the followers don't?
I don't think that's the only way. Consider:
|
Because part of the singleflight request key is the system.role_members table version: cockroach/pkg/sql/authorization.go Line 546 in bb3871b
So even though the followers use a different transaction, they know that they will read from the same table version as if they used their own transaction.
Ah OK, so we already do most of this, but you're saying we could also start a different transaction inside of the singleflight and manually set its transaction timestamp. What's the interface for manually changing the timestamp?
Lines 1393 to 1401 in 736a67e
|
I actually had a question about how this works. If the table version is only bumped when the schema changes, wouldn't the table version actually remain the same if new GRANT/REVOKE ROLEs happen? In which case, isn't it possible that the data in the |
We increment the table version here in the GRANT code: cockroach/pkg/sql/grant_role.go Line 209 in 36d39aa
And similar for REVOKE. So I wonder if something like this would work:
|
You don't want |
I'll throw in more of my 2c: I doubt that using something about the query to check permissions makes any sense; it seems like an obvious security hole (permissions are not supposed to be immutable). I remember pestering Marc about it many years ago, and I kinda assumed that we changed it. @ajwerner pointed to #51861 tracking it.
Previously, andreimatei (Andrei Matei) wrote…
I think that this should only happen if InheritCancelation is true.
There would still be a potential for a data race even if InheritCancelation is false, right?
It seems to me that the current race is possible both with and without InheritCancelation, but the bad outcome of the race is much more likely without InheritCancelation; with InheritCancelation, both Do and WaitForResult are interrupted at the same time. Without InheritCancelation, only WaitForResult is interrupted, which allows the closure to keep running after the caller has unwound.
I think I'd argue instead that the singleflights users which share state and want to exit early are abusing the API.
The problem, as I see it, is that with the current API it's kinda hard for a caller to opt out of exiting early in the WaitForResult() call. I think maybe it'd be reasonable to make it easy to opt-out, but it needs to be an option and not forced like in the current patch. But,
This patch changes the role members cache singleflight to inherit cancellation since the flight closure uses the caller's transaction.
I want to understand if this makes good sense, and in turn if we should worry about "sharing data with the closure" at all.
Can you explain to me whether it's a good thing for the leader's txn to be used in the closure by the leader, seeing how the flight is possibly joined by others, with other transactions, and those guys are presumably just fine with the closure having used an unrelated txn? Can we make the closure use a new txn?
Are there other uses of singleflight, perhaps in descriptor lease acquisitions, where there's a similar pattern?
Just in case this was not considered as an option - the caller of WaitForResult can pass in a context without cancelation (e.g. context.TODO()). This will be trivial in the next Go version; for now you gotta build a context by hand and you lose the tracing, etc.
Just in case this was not considered as an option - the caller of WaitForResult can pass in a context without cancelation (e.g. context.TODO()). This will be trivial in the next Go version; for now you gotta build a context by hand and you lose the tracing, etc.
Yeah, we did consider that and #98376 has an implementation for that solution. Just curious, do you mind elaborating on why this will be trivial in the next Go version?
do you mind elaborating on why this will be trivial in the next Go version?
I've created #98617 with an implementation of the idea discussed above to use a fresh transaction within the singleflight. There are more details on that PR of an issue I've observed with this approach.
Closed in favor of #98617
singleflight: mitigate potential data races from context cancellation
This patch ensures that if the leader of a singleflight flight has its
context cancelled while waiting for the result of the flight closure,
it will block until the closure returns to prevent any possible data
races on any data shared with the closure.
Release note: None
sql: change role members cache singleflight to inherit cancellation
This patch changes the role members cache singleflight to inherit
cancellation since the flight closure uses the caller's transaction.
Release note: None
Fixes #95642
Fixes #96539