Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: fix a number of bugs in schema change commit #28381

Merged
merged 3 commits into from
Aug 11, 2018

Conversation

vivekmenezes
Copy link
Contributor

@vivekmenezes vivekmenezes commented Aug 8, 2018

A schema change failure could result in a TableCollection
associated with a transaction not being reset properly. A
future transaction on the same session could see
uncommitted descriptors from a previous transaction.
Added TestSchemaUniqueColumnDropFailure
that fails before this change.

A schema change transaction abort/reset would always call
waitForCacheState() to check that dropped databases
were dropped from the cache. But a DROP DATABASE schema
change that fails will keep the database in the cache
resulting in the system locking up.

Consolidated call to resetLeases() to before a schema change
commit. The call is needed always after a schema change commit
and occasionally before a commit, so it is felt it's
better to always call it before commit. Also clarified
through comments why releaseLeases() will not deadlock.

Fixes #27847
Fixes #26961

Release notes (sql change): Fix node freeze after DROP DATABASE
when command aborts.
Release notes (sql change): Fix rare use of an older descriptor after
DROP INDEX.

@vivekmenezes vivekmenezes requested review from andreimatei and a team August 8, 2018 15:10
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@vivekmenezes vivekmenezes changed the title Dropfailure sql: fix a number of bugs in schema change commit Aug 8, 2018
@vivekmenezes vivekmenezes force-pushed the dropfailure branch 2 times, most recently from bf283c7 to 9447626 Compare August 8, 2018 15:57
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what's going on with the first commit. I see @knz added that code recently. I certainly hope it can go away, but I don't know what it was about.

LGTM for the second, but left various comments

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/conn_executor.go, line 1853 at r2 (raw file):

	case txnStart:
	case txnCommit:
		// If we have schema changers to run, release leases early so that schema

how come this can go away? Was it never needed?


pkg/sql/conn_executor_exec.go, line 487 at r2 (raw file):

This is also required to be done

I'm not succeeding in parsing this comment. One part says "required to be done after commit", another part "before the transaction" commits.
The "also" seems to not be tied to anything.

Perhaps the comment could start with what we're doing and why, and then go into why it's safe. And btw, we're inferring that CommitTimestamp() must have been called since tables != nil, right? That seems subtle to me, I think it'd be good to assert it and also spell it out in the comment.


pkg/sql/conn_executor_exec.go, line 492 at r2 (raw file):

	// Since releasing leases is needed before the transaction commits
	// it is done once here.
	ex.extraTxnState.tables.releaseLeases(ex.Ctx())

was there a particular reason why this moved up? Does it make taking the if count == 0 early return below more likely? (in fact, was that early return ever taken before?)


pkg/sql/lease.go, line 945 at r2 (raw file):

}

// releaseLease associated with the table version. Guarantee

"guarantee" as opposed to what? I think this comment should be more precise. The release operation is still async.
Also, what was the problem with the old draining check? Was it racy?
I haven't looked into how draining the LeaseManager works, but I think that if RunAsyncTask returns an error it means that the stopper is already quiescing and I think that means you can can no longer do any KV ops - they'll just err out.

I think You should test this new code; or, rather, I think the draining check was good and that we should propagate an error from RunAsyncTask() (which error should hopefully unwind the stack enough and prevent whatever deadlock you're worried about here).

nit: I think "Guarantee the release..." is not a sentence and is a bit ambiguous / hard to read (it's using an infinitive mood verb). Who does the guaranteeing - the function or the caller?


pkg/sql/schema_changer_test.go, line 2096 at r2 (raw file):

// process might not be able to reconstruct the index and thus
// recreates the column as non-UNIQUE. For now this is considered
// aceeptable.

aceeptable


pkg/sql/table.go, line 371 at r2 (raw file):

// releaseOpt specifies options for tc.releaseTables().
type releaseOpt bool

this is now unused


pkg/sql/table.go, line 408 at r2 (raw file):

				dbID, err := dc.getDatabaseID(ctx,
					tc.leaseMgr.execCfg.DB.Txn, uc.name, false /*required*/)
				if err != nil || dbID == 0 {

wait, how come we're swallowing this error here?
I know we've discussed replacing this getDatabaseID with something else, which hopefully wouldn't return errors, but until then?

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't disagree with the 1st commit either, but I would appreciate if you would extend the commit message with a motivation for the change. I distinctly recall that this was needed to ensure that the table collection was updated with the most recent version of the descriptor upon acquiring a lease. How come this is not needed any more?

After reviewing the 2nd commit, I see that it compensates for the 1st commit. But have you tried running CI on the PR with just the 1st commit? Perhaps the two changes cannot really be separated.

Reviewed 2 of 2 files at r1, 8 of 8 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@vivekmenezes vivekmenezes force-pushed the dropfailure branch 2 times, most recently from de4882a to bdf6015 Compare August 9, 2018 15:04
Copy link
Contributor Author

@vivekmenezes vivekmenezes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@knz it's a mystery to me why that was added in the first place. It's not required.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/conn_executor.go, line 1853 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

how come this can go away? Was it never needed?

I've updated the comments to explain this better


pkg/sql/conn_executor_exec.go, line 487 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…
This is also required to be done

I'm not succeeding in parsing this comment. One part says "required to be done after commit", another part "before the transaction" commits.
The "also" seems to not be tied to anything.

Perhaps the comment could start with what we're doing and why, and then go into why it's safe. And btw, we're inferring that CommitTimestamp() must have been called since tables != nil, right? That seems subtle to me, I think it'd be good to assert it and also spell it out in the comment.

I've updated the comments here. Do you think I should also export an API from txn to assert that CommitTimestamp() was called?


pkg/sql/conn_executor_exec.go, line 492 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

was there a particular reason why this moved up? Does it make taking the if count == 0 early return below more likely? (in fact, was that early return ever taken before?)

It's been moved to "kill two birds with one stone" See the comment.


pkg/sql/lease.go, line 945 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

"guarantee" as opposed to what? I think this comment should be more precise. The release operation is still async.
Also, what was the problem with the old draining check? Was it racy?
I haven't looked into how draining the LeaseManager works, but I think that if RunAsyncTask returns an error it means that the stopper is already quiescing and I think that means you can can no longer do any KV ops - they'll just err out.

I think You should test this new code; or, rather, I think the draining check was good and that we should propagate an error from RunAsyncTask() (which error should hopefully unwind the stack enough and prevent whatever deadlock you're worried about here).

nit: I think "Guarantee the release..." is not a sentence and is a bit ambiguous / hard to read (it's using an infinitive mood verb). Who does the guaranteeing - the function or the caller?

I've reverted this change and added a comment to the call to releaseLeases talking about how some leases might not get released and what that means. I'm not happy with this nor happy with the alternative that a user might see a "lease not released" error


pkg/sql/schema_changer_test.go, line 2096 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

aceeptable

Done.


pkg/sql/table.go, line 371 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this is now unused

Done.


pkg/sql/table.go, line 408 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

wait, how come we're swallowing this error here?
I know we've discussed replacing this getDatabaseID with something else, which hopefully wouldn't return errors, but until then?

This is swallowing the error here because it's better to do so here and keep the API cleaner rather than report it up and swallow it at a higher layer.

We had discussed replacing getDatabaseID because it was thought at the time that this call should not call out to the store and cause a deadlock. We at the time didn't understand why the deadlock was happening. Now we do and fixing it (this change) doesn't require changing this any more. So I'm leaving it as is.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/conn_executor_exec.go, line 487 at r2 (raw file):

Previously, vivekmenezes wrote…

I've updated the comments here. Do you think I should also export an API from txn to assert that CommitTimestamp() was called?

i think that'd be nice. Something like TimestampLocked() perhaps.


pkg/sql/table.go, line 408 at r2 (raw file):

Previously, vivekmenezes wrote…

This is swallowing the error here because it's better to do so here and keep the API cleaner rather than report it up and swallow it at a higher layer.

We had discussed replacing getDatabaseID because it was thought at the time that this call should not call out to the store and cause a deadlock. We at the time didn't understand why the deadlock was happening. Now we do and fixing it (this change) doesn't require changing this any more. So I'm leaving it as is.

well at least comment the method saying that it swallows errors. I for one would much prefer swallowing higher up.

Copy link
Contributor Author

@vivekmenezes vivekmenezes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/conn_executor_exec.go, line 487 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

i think that'd be nice. Something like TimestampLocked() perhaps.

Added another commit that adds method OrigTimestampWasObserved() to match the value in the txn protobuf


pkg/sql/table.go, line 408 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

well at least comment the method saying that it swallows errors. I for one would much prefer swallowing higher up.

Added a comment. In this case since this method is called only from one place it's better to swallow it at the bottom.

@vivekmenezes
Copy link
Contributor Author

@andreimatei is the last new commit okay?

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I think my preference would be for the "was the ts observed" question to be answered by the client.Txn, not by the TxnSender. Even though the TxnCoordSender naturally knows the answer, I don't think we should burden other implementors of the interface.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/internal/client/txn.go, line 278 at r5 (raw file):

}

// OrigTimestampWasObserved returns true if CommitTimestamp() was

as much as possible I think we should strive for the concept of the OrigTimestamp to not be visible to the client. I'd call CommitTimestampObserved() or, better yet, CommitTimestampFixed() (I'd say "locked" is even better, but I guess that clashes with our conventions for "Locked" methods").
Also, this will also return true if SetFixedTimestamp was called. And I think it makes sense for callers of this method to rely on that. That's another reason why I think "fixed/locked" is a better name (and with documentation that talks about what it means - that the txn cannot change its timestamp later) rather than talking about what other methods were called)

A schema change failure could result in a TableCollection
associated with a transaction not being reset properly. A
future transaction on the same session could see
uncommitted descriptors from a previous transaction.
Added TestSchemaUniqueColumnDropFailure
that fails before this change.

A schema change transaction abort/reset would always call
waitForCacheState() to check that dropped databases
were dropped from the cache. But a DROP DATABASE schema
change that fails will keep the database in the cache
resulting in the system locking up.

Consolidated call to resetLeases() to before a schema change
commit. The call is needed always after a schema change commit
and occasionally before a commit, so it is felt it's
better to always call it before commit. Also clarified
through comments why releaseLeases() will not deadlock.

Fixes cockroachdb#27847
Fixes cockroachdb#26961

Release note (sql change): Fix node freeze after DROP DATABASE
when command aborts.
Release note (sql change): Fix rare use of an older descriptor after
DROP INDEX.
Copy link
Contributor Author

@vivekmenezes vivekmenezes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/internal/client/txn.go, line 278 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

as much as possible I think we should strive for the concept of the OrigTimestamp to not be visible to the client. I'd call CommitTimestampObserved() or, better yet, CommitTimestampFixed() (I'd say "locked" is even better, but I guess that clashes with our conventions for "Locked" methods").
Also, this will also return true if SetFixedTimestamp was called. And I think it makes sense for callers of this method to rely on that. That's another reason why I think "fixed/locked" is a better name (and with documentation that talks about what it means - that the txn cannot change its timestamp later) rather than talking about what other methods were called)

I've called it CommitTimestampFixed and added a comment explaining it

@vivekmenezes
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 11, 2018

👎 Rejected by code reviews

@knz
Copy link
Contributor

knz commented Aug 11, 2018

bors r=andreimatei

craig bot pushed a commit that referenced this pull request Aug 11, 2018
28381: sql: fix a number of bugs in schema change commit  r=andreimatei a=vivekmenezes

A schema change failure could result in a TableCollection
associated with a transaction not being reset properly. A
future transaction on the same session could see
uncommitted descriptors from a previous transaction.
Added TestSchemaUniqueColumnDropFailure
that fails before this change.

A schema change transaction abort/reset would always call
waitForCacheState() to check that dropped databases
were dropped from the cache. But a DROP DATABASE schema
change that fails will keep the database in the cache
resulting in the system locking up.

Consolidated call to resetLeases() to before a schema change
commit. The call is needed always after a schema change commit
and occasionally before a commit, so it is felt it's
better to always call it before commit. Also clarified
through comments why releaseLeases() will not deadlock.

Fixes #27847
Fixes #26961

Release notes (sql change): Fix node freeze after DROP DATABASE
when command aborts.
Release notes (sql change): Fix rare use of an older descriptor after
DROP INDEX.

Co-authored-by: Vivek Menezes <vivek@cockroachlabs.com>
@craig
Copy link
Contributor

craig bot commented Aug 11, 2018

Build succeeded

@craig craig bot merged commit 3f0ae48 into cockroachdb:master Aug 11, 2018
@vivekmenezes vivekmenezes deleted the dropfailure branch August 11, 2018 16:31
vivekmenezes added a commit to vivekmenezes/cockroach that referenced this pull request Aug 30, 2018
change cockroachdb#25313 introduced getDatabaseID as the callback
used in waitForCacheState. However, waitForCacheState holds
on to a lock while calling its callback and can get stuck
on an intent written by another transaction. The other
transaction having written an intent can be trying to acquire
the databaseCacheHolder lock, resulting in the system getting
stuck in a deadlock.

Originally, the deadlock was thought to be cause by another
bug which got fixed through cockroachdb#28381, but now we know of
an actual legitimate situation in which the deadlock
can happen.

The fix is to use another callback that will fix cockroachdb#25313
while not going to the store.

fixes cockroachdb#29090

Release note: None
vivekmenezes added a commit to vivekmenezes/cockroach that referenced this pull request Sep 12, 2018
change cockroachdb#25313 introduced getDatabaseID as the callback
used in waitForCacheState. However, waitForCacheState holds
on to a lock while calling its callback and can get stuck
on an intent written by another transaction. The other
transaction having written an intent can be trying to acquire
the databaseCacheHolder lock, resulting in the system getting
stuck in a deadlock.

Originally, the deadlock was thought to be cause by another
bug which got fixed through cockroachdb#28381, but now we know of
an actual legitimate situation in which the deadlock
can happen.

The fix is to use another callback that will fix cockroachdb#25313
while not going to the store.

fixes cockroachdb#29090

Release note: None
craig bot pushed a commit that referenced this pull request Sep 12, 2018
29291: sql: resolve deadlock in waitForCacheState r=vivekmenezes a=vivekmenezes

change #25313 introduced getDatabaseID as the callback
used in waitForCacheState. However, waitForCacheState holds
on to a lock while calling its callback and can get stuck
on an intent written by another transaction. The other
transaction having written an intent can be trying to acquire
the databaseCacheHolder lock, resulting in the system getting
stuck in a deadlock.

Originally, the deadlock was thought to be cause by another
bug which got fixed through #28381, but now we know of
an actual legitimate situation in which the deadlock
can happen.

The fix is to use another callback that will fix #25313
while not going to the store.

fixes #29090

Release note: None

Co-authored-by: Vivek Menezes <vivek@cockroachlabs.com>
vivekmenezes added a commit to vivekmenezes/cockroach that referenced this pull request Sep 12, 2018
change cockroachdb#25313 introduced getDatabaseID as the callback
used in waitForCacheState. However, waitForCacheState holds
on to a lock while calling its callback and can get stuck
on an intent written by another transaction. The other
transaction having written an intent can be trying to acquire
the databaseCacheHolder lock, resulting in the system getting
stuck in a deadlock.

Originally, the deadlock was thought to be cause by another
bug which got fixed through cockroachdb#28381, but now we know of
an actual legitimate situation in which the deadlock
can happen.

The fix is to use another callback that will fix cockroachdb#25313
while not going to the store.

fixes cockroachdb#29090

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants