Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix concurrent access for SQL storage. #1370

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

kepkin
Copy link

@kepkin kepkin commented Dec 10, 2018

SERIALIZABLE isolation level is not viable solution, as it will slow down the whole Postgres. And again there must be retries as you will always receive errors like "could not serialize access...".

Previous pull request for retries like #1356 were wrong as it resulted in deadlock in Database (you open transaction inside another transaction in conformance tests).

In SQL world the better solution is to lock the row explicitly with SELECT FOR UPDATE.

Copy link
Contributor

@srenatus srenatus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code-wise, this looks good. I'm curious to play with this a little, and I didn't get to it yet.

Thanks for digging in, and providing some domain-knowledge where it's needed! 😃

@@ -380,13 +386,23 @@ func (c *conn) UpdateKeys(updater func(old storage.Keys) (storage.Keys, error))
firstUpdate := false
// TODO(ericchiang): errors may cause a transaction be rolled back by the SQL
// server. Test this, and consider adding a COUNT() command beforehand.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Is that comment still accurate now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember. Ha

@@ -22,6 +22,10 @@ type flavor struct {
// Optional function to create and finish a transaction.
executeTx func(db *sql.DB, fn func(*sql.Tx) error) error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field could go away now, can't it?

@@ -140,10 +140,16 @@ func (c *conn) UpdateAuthRequest(id string, updater func(a storage.AuthRequest)
return err
}

err = c.flavor.lockForUpdate(tx, "auth_request", "id", r.ID)
if err != nil {
return fmt.Errorf("update auth request: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 All these fmt.Errorf make me nervous -- they're hard to deal with in call sites -- but this is the style this package is written in, so it's fine at add a few. 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can switch to github.com/pkg/errors in another PR :)

}
return tx.Commit()
lockForUpdate: func(tx *trans, table, column, value string) error {
_, err := tx.Exec("SELECT 1 FROM "+table+" WHERE "+column+" = $1 FOR UPDATE NOWAIT;", value)
Copy link
Contributor

@vito vito Dec 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(For my own understanding,) NOWAIT is used here because the intent of this PR is still to fail one transaction in the event of concurrent updates? i.e. same goal as before, just a different (less heavy-handed) method. And so the existing concurrency tests still pass with no changes necessary.

So this doesn't attempt to address #1341 (and associated PR #1342) as-is; that could be done in a separate PR and would also involve changes to the tests. I suppose that could either be done by removing the NOWAIT or adding explicit retry logic, similar to #1342.

I'm happy to submit that PR after this is merged if no one else has plans to already, not trying to increase scope of your PR. Just making sure I understand. 🙂

Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to add an inline comment :)

This works, but the lockForUpdate API seems a bit risky. Ideally it could be baked into ExecTx or something, but nothing's coming to mind at the moment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes NOWAIT here is only because of tests. And yes, I think tests need to be redesigned a little bit, as changing the same auth_request concurrently wouldn't actually harm a login flow, and deadlock which current tests produce is not possible in real life situations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably just update/remove the tests then, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't come to idea how to test concurrency better yet. Maybe later.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the issue with interleaving connectors in the concurrent login flow mentioned in #1356 (comment) ?

Failing one of the transactions seems to be the only way to handle it in current implementation of handler/storage.

//
// NOTE(ericchiang): For some reason using `SET SESSION CHARACTERISTICS AS TRANSACTION` at a
// session level didn't work for some edge cases. Might be something worth exploring.
executeTx: func(db *sql.DB, fn func(sqlTx *sql.Tx) error) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does anything use this anymore? can we nuke it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nuked

storage/sql/crud.go Outdated Show resolved Hide resolved
storage/sql/sql.go Outdated Show resolved Hide resolved
Copy link
Contributor

@srenatus srenatus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather fix the tests to remove the requirement for that NOWAIT (if I understand the situation correctly). I'll try to come up with something. 🤔

return fmt.Errorf("get keys: %v", err)
}

old, err = getKeys(tx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Let's ditch line 389 and go with old, err := here, now that the code is rearranged.

@srenatus
Copy link
Contributor

Wait a moment, I've misunderstood something there, around #1370 (comment).

So, what would be a better set of test cases? Which app-level conditions do we want to avoid, when it comes to concurrent actions?

(To be explicit -- don't wait for me based on #1370 (review), I have nothing up my sleeves)

@bonifaido
Copy link
Member

bonifaido commented Aug 3, 2019

Can we revive this PR? I'm happy to help. We have an issue with the MySQL storage backend where according to the old conformance tests (the ones currently in master) SERIALIZABLE isolation level is needed, which has quite some issues on different MySQL implementations (as in #1511). Percona XtraDB doesn't let you use it by default, AWS Aurora doesn't have it at all, on MariaDB transaction_isolation is called tx_isolation so it is a pain TBH.

FYI: I have rebased this branch on the current master and copied the lockForUpdate method from the Postgres implementation, removed the explicit SERIALIZABLE tx_isolation level setting, all the MySQL conformance tests have passed on MariaDB, in MySQL 5.7 NOWAIT doesn't exist, it got introduced in 8.0, so there it works. But if I understand correctly NOWAIT is something we would like to see removed.

@srenatus
Copy link
Contributor

srenatus commented Aug 3, 2019

💯 I'd be very interested in getting this resolved. I think the main blocker here is that we have no test suite that makes the error appear -- our contributors, however, seem to have that. This would help evaluating the different approaches that have been put forward when this issue had been on the table last winter....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants