-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: fix a number of bugs in schema change commit #28381
Conversation
bf283c7
to
9447626
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what's going on with the first commit. I see @knz added that code recently. I certainly hope it can go away, but I don't know what it was about.
LGTM for the second, but left various comments
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/conn_executor.go, line 1853 at r2 (raw file):
case txnStart: case txnCommit: // If we have schema changers to run, release leases early so that schema
how come this can go away? Was it never needed?
pkg/sql/conn_executor_exec.go, line 487 at r2 (raw file):
This is also required to be done
I'm not succeeding in parsing this comment. One part says "required to be done after commit", another part "before the transaction" commits.
The "also" seems to not be tied to anything.
Perhaps the comment could start with what we're doing and why, and then go into why it's safe. And btw, we're inferring that CommitTimestamp()
must have been called since tables != nil
, right? That seems subtle to me, I think it'd be good to assert it and also spell it out in the comment.
pkg/sql/conn_executor_exec.go, line 492 at r2 (raw file):
// Since releasing leases is needed before the transaction commits // it is done once here. ex.extraTxnState.tables.releaseLeases(ex.Ctx())
was there a particular reason why this moved up? Does it make taking the if count == 0
early return below more likely? (in fact, was that early return ever taken before?)
pkg/sql/lease.go, line 945 at r2 (raw file):
} // releaseLease associated with the table version. Guarantee
"guarantee" as opposed to what? I think this comment should be more precise. The release operation is still async.
Also, what was the problem with the old draining check? Was it racy?
I haven't looked into how draining the LeaseManager
works, but I think that if RunAsyncTask
returns an error it means that the stopper is already quiescing and I think that means you can can no longer do any KV ops - they'll just err out.
I think You should test this new code; or, rather, I think the draining check was good and that we should propagate an error from RunAsyncTask()
(which error should hopefully unwind the stack enough and prevent whatever deadlock you're worried about here).
nit: I think "Guarantee the release..." is not a sentence and is a bit ambiguous / hard to read (it's using an infinitive mood verb). Who does the guaranteeing - the function or the caller?
pkg/sql/schema_changer_test.go, line 2096 at r2 (raw file):
// process might not be able to reconstruct the index and thus // recreates the column as non-UNIQUE. For now this is considered // aceeptable.
aceeptable
pkg/sql/table.go, line 371 at r2 (raw file):
// releaseOpt specifies options for tc.releaseTables(). type releaseOpt bool
this is now unused
pkg/sql/table.go, line 408 at r2 (raw file):
dbID, err := dc.getDatabaseID(ctx, tc.leaseMgr.execCfg.DB.Txn, uc.name, false /*required*/) if err != nil || dbID == 0 {
wait, how come we're swallowing this error here?
I know we've discussed replacing this getDatabaseID
with something else, which hopefully wouldn't return errors, but until then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I don't disagree with the 1st commit either, but I would appreciate if you would extend the commit message with a motivation for the change. I distinctly recall that this was needed to ensure that the table collection was updated with the most recent version of the descriptor upon acquiring a lease. How come this is not needed any more?
After reviewing the 2nd commit, I see that it compensates for the 1st commit. But have you tried running CI on the PR with just the 1st commit? Perhaps the two changes cannot really be separated.
Reviewed 2 of 2 files at r1, 8 of 8 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained
de4882a
to
bdf6015
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@knz it's a mystery to me why that was added in the first place. It's not required.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/conn_executor.go, line 1853 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
how come this can go away? Was it never needed?
I've updated the comments to explain this better
pkg/sql/conn_executor_exec.go, line 487 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
This is also required to be done
I'm not succeeding in parsing this comment. One part says "required to be done after commit", another part "before the transaction" commits.
The "also" seems to not be tied to anything.Perhaps the comment could start with what we're doing and why, and then go into why it's safe. And btw, we're inferring that
CommitTimestamp()
must have been called sincetables != nil
, right? That seems subtle to me, I think it'd be good to assert it and also spell it out in the comment.
I've updated the comments here. Do you think I should also export an API from txn to assert that CommitTimestamp() was called?
pkg/sql/conn_executor_exec.go, line 492 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
was there a particular reason why this moved up? Does it make taking the
if count == 0
early return below more likely? (in fact, was that early return ever taken before?)
It's been moved to "kill two birds with one stone" See the comment.
pkg/sql/lease.go, line 945 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
"guarantee" as opposed to what? I think this comment should be more precise. The release operation is still async.
Also, what was the problem with the old draining check? Was it racy?
I haven't looked into how draining theLeaseManager
works, but I think that ifRunAsyncTask
returns an error it means that the stopper is already quiescing and I think that means you can can no longer do any KV ops - they'll just err out.I think You should test this new code; or, rather, I think the draining check was good and that we should propagate an error from
RunAsyncTask()
(which error should hopefully unwind the stack enough and prevent whatever deadlock you're worried about here).nit: I think "Guarantee the release..." is not a sentence and is a bit ambiguous / hard to read (it's using an infinitive mood verb). Who does the guaranteeing - the function or the caller?
I've reverted this change and added a comment to the call to releaseLeases talking about how some leases might not get released and what that means. I'm not happy with this nor happy with the alternative that a user might see a "lease not released" error
pkg/sql/schema_changer_test.go, line 2096 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
aceeptable
Done.
pkg/sql/table.go, line 371 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
this is now unused
Done.
pkg/sql/table.go, line 408 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
wait, how come we're swallowing this error here?
I know we've discussed replacing thisgetDatabaseID
with something else, which hopefully wouldn't return errors, but until then?
This is swallowing the error here because it's better to do so here and keep the API cleaner rather than report it up and swallow it at a higher layer.
We had discussed replacing getDatabaseID because it was thought at the time that this call should not call out to the store and cause a deadlock. We at the time didn't understand why the deadlock was happening. Now we do and fixing it (this change) doesn't require changing this any more. So I'm leaving it as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/conn_executor_exec.go, line 487 at r2 (raw file):
Previously, vivekmenezes wrote…
I've updated the comments here. Do you think I should also export an API from txn to assert that CommitTimestamp() was called?
i think that'd be nice. Something like TimestampLocked()
perhaps.
pkg/sql/table.go, line 408 at r2 (raw file):
Previously, vivekmenezes wrote…
This is swallowing the error here because it's better to do so here and keep the API cleaner rather than report it up and swallow it at a higher layer.
We had discussed replacing getDatabaseID because it was thought at the time that this call should not call out to the store and cause a deadlock. We at the time didn't understand why the deadlock was happening. Now we do and fixing it (this change) doesn't require changing this any more. So I'm leaving it as is.
well at least comment the method saying that it swallows errors. I for one would much prefer swallowing higher up.
bdf6015
to
3195b4a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/sql/conn_executor_exec.go, line 487 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
i think that'd be nice. Something like
TimestampLocked()
perhaps.
Added another commit that adds method OrigTimestampWasObserved() to match the value in the txn protobuf
pkg/sql/table.go, line 408 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
well at least comment the method saying that it swallows errors. I for one would much prefer swallowing higher up.
Added a comment. In this case since this method is called only from one place it's better to swallow it at the bottom.
3195b4a
to
9585c0e
Compare
@andreimatei is the last new commit okay? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I think my preference would be for the "was the ts observed" question to be answered by the client.Txn
, not by the TxnSender
. Even though the TxnCoordSender
naturally knows the answer, I don't think we should burden other implementors of the interface.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/internal/client/txn.go, line 278 at r5 (raw file):
} // OrigTimestampWasObserved returns true if CommitTimestamp() was
as much as possible I think we should strive for the concept of the OrigTimestamp
to not be visible to the client. I'd call CommitTimestampObserved()
or, better yet, CommitTimestampFixed()
(I'd say "locked" is even better, but I guess that clashes with our conventions for "Locked" methods").
Also, this will also return true if SetFixedTimestamp
was called. And I think it makes sense for callers of this method to rely on that. That's another reason why I think "fixed/locked" is a better name (and with documentation that talks about what it means - that the txn cannot change its timestamp later) rather than talking about what other methods were called)
Release note: None
A schema change failure could result in a TableCollection associated with a transaction not being reset properly. A future transaction on the same session could see uncommitted descriptors from a previous transaction. Added TestSchemaUniqueColumnDropFailure that fails before this change. A schema change transaction abort/reset would always call waitForCacheState() to check that dropped databases were dropped from the cache. But a DROP DATABASE schema change that fails will keep the database in the cache resulting in the system locking up. Consolidated call to resetLeases() to before a schema change commit. The call is needed always after a schema change commit and occasionally before a commit, so it is felt it's better to always call it before commit. Also clarified through comments why releaseLeases() will not deadlock. Fixes cockroachdb#27847 Fixes cockroachdb#26961 Release note (sql change): Fix node freeze after DROP DATABASE when command aborts. Release note (sql change): Fix rare use of an older descriptor after DROP INDEX.
Release note: None
9585c0e
to
3f0ae48
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/internal/client/txn.go, line 278 at r5 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
as much as possible I think we should strive for the concept of the
OrigTimestamp
to not be visible to the client. I'd callCommitTimestampObserved()
or, better yet,CommitTimestampFixed()
(I'd say "locked" is even better, but I guess that clashes with our conventions for "Locked" methods").
Also, this will also return true ifSetFixedTimestamp
was called. And I think it makes sense for callers of this method to rely on that. That's another reason why I think "fixed/locked" is a better name (and with documentation that talks about what it means - that the txn cannot change its timestamp later) rather than talking about what other methods were called)
I've called it CommitTimestampFixed and added a comment explaining it
bors r+ |
👎 Rejected by code reviews |
bors r=andreimatei |
28381: sql: fix a number of bugs in schema change commit r=andreimatei a=vivekmenezes A schema change failure could result in a TableCollection associated with a transaction not being reset properly. A future transaction on the same session could see uncommitted descriptors from a previous transaction. Added TestSchemaUniqueColumnDropFailure that fails before this change. A schema change transaction abort/reset would always call waitForCacheState() to check that dropped databases were dropped from the cache. But a DROP DATABASE schema change that fails will keep the database in the cache resulting in the system locking up. Consolidated call to resetLeases() to before a schema change commit. The call is needed always after a schema change commit and occasionally before a commit, so it is felt it's better to always call it before commit. Also clarified through comments why releaseLeases() will not deadlock. Fixes #27847 Fixes #26961 Release notes (sql change): Fix node freeze after DROP DATABASE when command aborts. Release notes (sql change): Fix rare use of an older descriptor after DROP INDEX. Co-authored-by: Vivek Menezes <vivek@cockroachlabs.com>
Build succeeded |
change cockroachdb#25313 introduced getDatabaseID as the callback used in waitForCacheState. However, waitForCacheState holds on to a lock while calling its callback and can get stuck on an intent written by another transaction. The other transaction having written an intent can be trying to acquire the databaseCacheHolder lock, resulting in the system getting stuck in a deadlock. Originally, the deadlock was thought to be cause by another bug which got fixed through cockroachdb#28381, but now we know of an actual legitimate situation in which the deadlock can happen. The fix is to use another callback that will fix cockroachdb#25313 while not going to the store. fixes cockroachdb#29090 Release note: None
change cockroachdb#25313 introduced getDatabaseID as the callback used in waitForCacheState. However, waitForCacheState holds on to a lock while calling its callback and can get stuck on an intent written by another transaction. The other transaction having written an intent can be trying to acquire the databaseCacheHolder lock, resulting in the system getting stuck in a deadlock. Originally, the deadlock was thought to be cause by another bug which got fixed through cockroachdb#28381, but now we know of an actual legitimate situation in which the deadlock can happen. The fix is to use another callback that will fix cockroachdb#25313 while not going to the store. fixes cockroachdb#29090 Release note: None
29291: sql: resolve deadlock in waitForCacheState r=vivekmenezes a=vivekmenezes change #25313 introduced getDatabaseID as the callback used in waitForCacheState. However, waitForCacheState holds on to a lock while calling its callback and can get stuck on an intent written by another transaction. The other transaction having written an intent can be trying to acquire the databaseCacheHolder lock, resulting in the system getting stuck in a deadlock. Originally, the deadlock was thought to be cause by another bug which got fixed through #28381, but now we know of an actual legitimate situation in which the deadlock can happen. The fix is to use another callback that will fix #25313 while not going to the store. fixes #29090 Release note: None Co-authored-by: Vivek Menezes <vivek@cockroachlabs.com>
change cockroachdb#25313 introduced getDatabaseID as the callback used in waitForCacheState. However, waitForCacheState holds on to a lock while calling its callback and can get stuck on an intent written by another transaction. The other transaction having written an intent can be trying to acquire the databaseCacheHolder lock, resulting in the system getting stuck in a deadlock. Originally, the deadlock was thought to be cause by another bug which got fixed through cockroachdb#28381, but now we know of an actual legitimate situation in which the deadlock can happen. The fix is to use another callback that will fix cockroachdb#25313 while not going to the store. fixes cockroachdb#29090 Release note: None
A schema change failure could result in a TableCollection
associated with a transaction not being reset properly. A
future transaction on the same session could see
uncommitted descriptors from a previous transaction.
Added TestSchemaUniqueColumnDropFailure
that fails before this change.
A schema change transaction abort/reset would always call
waitForCacheState() to check that dropped databases
were dropped from the cache. But a DROP DATABASE schema
change that fails will keep the database in the cache
resulting in the system locking up.
Consolidated call to resetLeases() to before a schema change
commit. The call is needed always after a schema change commit
and occasionally before a commit, so it is felt it's
better to always call it before commit. Also clarified
through comments why releaseLeases() will not deadlock.
Fixes #27847
Fixes #26961
Release notes (sql change): Fix node freeze after DROP DATABASE
when command aborts.
Release notes (sql change): Fix rare use of an older descriptor after
DROP INDEX.