
quickfix #1355: only try three times #1356

Closed

Conversation

srenatus
Contributor

cc @vito 😉

Signed-off-by: Stephan Renatus <srenatus@chef.io>
@vito
Contributor

vito commented Nov 22, 2018

Fair enough! I can test this out Monday (on vacation atm) but I'm a little worried this limit will be too low. It was super easy to trigger the failure before the retry logic was added.

I can also try to work on a test to verify concurrent logins.

@srenatus
Contributor Author

@vito Sure, enjoy your time off. We're just not able to cut a release right now, but that's OK.

I'm having a hard time coming up with a way to differentiate legitimate serialization failures from retryable ones that'll eventually succeed... I might try to find out what other folks do there.

@vito
Contributor

vito commented Nov 27, 2018

@srenatus Ok, so here's my theory. I don't think there's really such a thing as legitimate vs. retryable serialization failures - I think that distinction was a symptom of the tests, because they literally force an unsatisfiable concurrency situation, which I would expect is unrealistic compared to actual API usage. I'm gonna go out on a limb here and suggest that all serialization failures should be retryable; if one isn't, that's a potential bug in the code, and part of the point of setting the serializable transaction level is to notice and fix those bugs before they end up in the wild.

The tests passed for me because, as you found, my implementation leaked connections, eventually exhausting the server connection pool and resulting in an error. All the tests checked for was that an error was hit, so, well, they passed.

But I think the real fix here is to fix the connection leak by calling tx.Rollback() in the error case, and to update the tests so they eventually give up on their concurrent update attempts and assert that this limit is hit. This would effectively test the retry logic too, which is nice (since my PR initially had no test changes).

So far I've confirmed the connection leak theory by adding a tx.Rollback() in the error path. With this added, the tests went into an infinite loop (yay!). I'm thinking the next step is to have the tests eventually give up on the concurrent update, and assert that it reached the desired number of attempts before succeeding.
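
Roughly the shape I have in mind - just a sketch with made-up names and an illustrative retryable predicate, not dex's actual code:

```go
package sketch

import (
	"database/sql"
	"fmt"
)

// retryTx begins a transaction, runs fn, and commits. On any failure it
// rolls back first - so the connection goes back to the pool instead of
// leaking - and then retries only if the caller considers the error
// retryable (e.g. a serialization failure).
func retryTx(db *sql.DB, attempts int, retryable func(error) bool, fn func(*sql.Tx) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		tx, beginErr := db.Begin()
		if beginErr != nil {
			return beginErr
		}
		if err = fn(tx); err == nil {
			if err = tx.Commit(); err == nil {
				return nil
			}
		}
		_ = tx.Rollback() // release the connection even when fn or Commit failed
		if !retryable(err) {
			return err
		}
	}
	return fmt.Errorf("giving up after %d attempts: %v", attempts, err)
}
```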

@srenatus
Contributor Author

@vito thanks for looking into this. I'm wondering about this, too,

which I would expect is unrealistic compared to actual API usage.

-- when would this happen? I think it could happen if you "choose" two different connectors at the same time. When you see the connector list, like

[ Log in with Username ]
[ Log in with SAML ]

the first link goes to /dex/auth/local?req=authnreq-id, and the second one to /dex/auth/saml?req=authnreq-id, and the handler code behind these will

  • fetch the authn request with that id
  • attempt to set its ConnectorID to local or saml

Now, if you issued the two requests concurrently, this would lead to a serialization failure that isn't entirely wrong, and that can't succeed on retry (unless I'm missing something). 👉 Pushed a commit here illustrating the situation a little, srenatus@ad2e530. (Also, it would be a condition that can be triggered by an unauthenticated user 🤔)
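
Roughly, both handlers boil down to something like this (simplified stand-ins, not dex's real storage types or handler code):

```go
package sketch

// AuthRequest and Store are simplified stand-ins for dex's storage types.
type AuthRequest struct {
	ID          string
	ConnectorID string
}

type Store interface {
	// UpdateAuthRequest runs the updater inside a (serializable) transaction.
	UpdateAuthRequest(id string, updater func(a AuthRequest) (AuthRequest, error)) error
}

// handleConnectorLogin is roughly what /dex/auth/local?req=<id> and
// /dex/auth/saml?req=<id> both do: fetch the authn request and pin its
// connector. Two concurrent calls with different connector IDs read and
// update the same row, which is exactly the read/write conflict that a
// serializable transaction rejects for one of them.
func handleConnectorLogin(s Store, authReqID, connectorID string) error {
	return s.UpdateAuthRequest(authReqID, func(a AuthRequest) (AuthRequest, error) {
		a.ConnectorID = connectorID
		return a, nil
	})
}
```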

What do you think? I guess there's a few options here -- for example, there has been evidence that restructuring the authn request code could be a good idea (#646, #1292).

@vito
Contributor

vito commented Nov 28, 2018

@srenatus I would expect that one transaction would succeed at a time until they all eventually succeed via their retry logic, in some (non-deterministic) order. The docs seem to indicate this:

If either transaction were running at the Repeatable Read isolation level, both would be allowed to commit; but since there is no serial order of execution consistent with the result, using Serializable transactions will allow one transaction to commit and will roll the other back with this message:

ERROR:  could not serialize access due to read/write dependencies among transactions

(from https://www.postgresql.org/docs/current/transaction-iso.html#XACT-SERIALIZABLE)

If somehow all transactions end up failing, resulting in a retry-loop "live-lock", perhaps there's something wrong with how Dex is using Postgres transactions for this? (I don't know what that would look like - maybe holding them open too long and having multiple points at which a transaction may fail based on different data being referenced/updated?) I haven't run your test change there yet, but I'm guessing you're seeing it fail (or hang)?
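
For reference, spotting that specific error class is just a SQLSTATE check - a sketch assuming the github.com/lib/pq driver, with a made-up helper name:

```go
package sketch

import "github.com/lib/pq"

// isSerializationFailure reports whether err is Postgres SQLSTATE 40001
// ("serialization_failure"), i.e. the "could not serialize access" error
// quoted above - the class of error a retry loop would target.
func isSerializationFailure(err error) bool {
	if pqErr, ok := err.(*pq.Error); ok {
		return pqErr.Code == "40001"
	}
	return false
}
```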

@srenatus
Contributor Author

I haven't run your test change there yet, but I'm guessing you're seeing it fail (or hang)?

Yeah, depending on whether I call tx.Rollback(), it's one or the other. I didn't keep close track; I just tried to come up with a more realistic scenario. I'll dig in again tomorrow.

What's our desired behaviour, though? I'd think that if the two-connectors-selected situation happens, I'd want the first one to succeed (whichever is first) and the second one to fail. In some way we need to ensure there's no cross-over, or mix-up, of connectors. You cannot start an LDAP flow, do something to have the connector ID switch to SAML, and finish the LDAP flow. Or you can do that, but the result must not look like you had logged in using SAML 😉
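
One sketch of what that rule could look like in the updater (reusing the simplified AuthRequest stand-in from my earlier sketch; the helper and error names are made up):

```go
package sketch

import "errors"

var errConnectorConflict = errors.New("auth request already claimed by a different connector")

// pinConnector is a sketch of the "no cross-over" rule: the first connector
// to be pinned wins, and a later attempt to switch the auth request to a
// different connector fails instead of silently overwriting it.
func pinConnector(connectorID string) func(a AuthRequest) (AuthRequest, error) {
	return func(a AuthRequest) (AuthRequest, error) {
		if a.ConnectorID != "" && a.ConnectorID != connectorID {
			return a, errConnectorConflict
		}
		a.ConnectorID = connectorID
		return a, nil
	}
}
```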

@ericchiang Could you chime in here? Do you know what edge cases this was supposed to catch?

@vito
Contributor

vito commented Nov 28, 2018

Fair; I really don't know enough about the full Dex API and auth flow to know what kinds of guarantees are critical for consistency reasons, and if there are good reasons to fail user-initiated requests due to concurrency. So I'd appreciate @ericchiang's input here too. 🙂

I typically expect last-write-wins; to me, two simultaneous requests with different intents have no single "correct" resolution, because the outcome is subject to network jitter. But I could totally see why someone would want to start one login flow and then start a different one soon after, expecting the last one to work - they may have mistakenly initiated the first request via a simple misclick and just gone back to start over. That case is more trivially serializable, of course, but I'd guess supporting it would mean going with last-write-wins if it really comes down to the wire.

@ericchiang
Contributor

Just to clarify about the tests: they just try to make sure that dex does the right thing in the case of concurrency. When they were written, no storage implementations retried, so if there was a racy call, at least one of them should fail.

Since you've added some retries for optimistic concurrency, the test now passes. That's the right behavior now, but the test is still locked into the old assumption. We should probably change the test.

I'm actually more concerned about the tx leak and HEAD having a failing test. To get HEAD back to a good state, can we:

  1. revert "retry on serialization errors" (#1342)
  2. re-merge the vendor changes from #1342
  3. submit a PR that re-implements the #1342 retry on serialization errors while taking into account #1355 (storage/sql/conformance: concurrent access tests fail for the wrong reason), #1363 (postgres: fix tx leak on serialization failure), and this PR?

I'm fine with retrying on any error. Maybe we should bake that into the storage layer so all implementations do that?
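
As a sketch (names made up), that could be as small as a shared helper that every implementation funnels its writes through, so the retry policy lives in one place:

```go
package sketch

// retry is a sketch of a helper the storage layer could share so every
// implementation handles retryable errors the same way - or retries on any
// error, if the predicate always returns true.
func retry(attempts int, retryable func(error) bool, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil || !retryable(err) {
			return err
		}
	}
	return err
}
```

A wrapper type embedding the storage interface could route each method through the same helper.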

srenatus added a commit that referenced this pull request Nov 29, 2018

Revert "retry on serialization errors"

This will come back, as outlined here: #1356 (comment)
@srenatus
Contributor Author

srenatus commented Nov 29, 2018

@srenatus srenatus mentioned this pull request Nov 29, 2018
@srenatus
Contributor Author

The code of this PR is not useful anymore -- I'm keeping it open regardless, to track progress on the plan in #1356 (comment) above.

@srenatus
Contributor Author

This is pretty much out of date right now; the discussion is happening in #1341. The next relevant PR is #1370.

@srenatus srenatus closed this Dec 17, 2018
mmrath pushed a commit to mmrath/dex that referenced this pull request Sep 2, 2019

Revert "retry on serialization errors"

This will come back, as outlined here: dexidp#1356 (comment)