
quickfix #1355: only try three times #1356

Closed

Conversation

srenatus
Contributor

cc @vito 😉

Signed-off-by: Stephan Renatus <srenatus@chef.io>
@vito
Contributor

vito commented Nov 22, 2018

Fair enough! I can test this out Monday (on vacation atm) but I'm a little worried this limit will be too low. It was super easy to trigger the failure before the retry logic was added.

I can also try to work on a test to verify concurrent logins.

@srenatus
Contributor Author

@vito Sure, enjoy your time off. We're just not able to cut a release right now, but that's OK.

I'm having a hard time coming up with a way to differentiate legitimate serialization failures from retryable ones that'll eventually succeed... I might try to find out what other folks do there.

@vito
Contributor

vito commented Nov 27, 2018

@srenatus Ok, so here's my theory. I don't think there's really such a thing as legitimate vs. retryable serialization failures - I think that distinction was a symptom of the tests, because they literally force an unsatisfiable concurrency situation, which I would expect is unrealistic compared to actual API usage. I'm gonna go out on a limb here and suggest that all serialization failures should be retryable; if one isn't, that's a potential bug in the code, and part of the point of setting the serializable transaction level is to notice and fix those bugs before they end up in the wild.

The tests passed for me because, as you found, my implementation leaked connections, eventually exhausting the server connection pool and resulting in an error. All the tests checked for was that an error was hit, so, well, they passed.

But I think the real fix here is to fix the connection leak by calling tx.Rollback() in the error case, and to update the tests so they eventually give up on their concurrent update attempts and assert that this limit is hit. This would effectively test the retry logic too, which is nice (since my PR initially had no test changes).

So far I've confirmed the connection leak theory by adding a tx.Rollback() in the error path. With this added, the tests went into an infinite loop (yay!). I'm thinking the next step is to have the tests eventually give up on the concurrent update, and assert that it reached the desired number of attempts before succeeding.
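
Roughly the shape I have in mind - just a sketch with made-up names and an illustrative retryable predicate, not dex's actual code:

```go
package sketch

import (
	"database/sql"
	"fmt"
)

// retryTx begins a transaction, runs fn, and commits. On any failure it
// rolls back first - so the connection goes back to the pool instead of
// leaking - and then retries only if the caller considers the error
// retryable (e.g. a serialization failure).
func retryTx(db *sql.DB, attempts int, retryable func(error) bool, fn func(*sql.Tx) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		tx, beginErr := db.Begin()
		if beginErr != nil {
			return beginErr
		}
		if err = fn(tx); err == nil {
			if err = tx.Commit(); err == nil {
				return nil
			}
		}
		_ = tx.Rollback() // release the connection even when fn or Commit failed
		if !retryable(err) {
			return err
		}
	}
	return fmt.Errorf("giving up after %d attempts: %v", attempts, err)
}
```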

@srenatus
Contributor Author

@vito thanks for looking into this. I'm wondering about this, too,

which I would expect is unrealistic compared to actual API usage.

-- when would this happen? I think it could happen if you "choose" two different connectors at the same time. When you see the connector list, like

[ Log in with Username ]
[ Log in with SAML ]

the first link goes to /dex/auth/local?req=authnreq-id, and the second one to /dex/auth/saml?req=authnreq-id, and the handler code behind these will

  • fetch the authn request with that id
  • attempt to set its ConnectorID to local or saml

Now, if you issued the two requests concurrently, this would lead to a serialization failure that isn't entirely wrong, and that can't succeed on retry (unless I'm missing something). 👉 Pushed a commit here illustrating the situation a little, srenatus@ad2e530. (Also, it would be a condition that can be triggered by an unauthenticated user 🤔)
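
Roughly, both handlers boil down to something like this (simplified stand-ins, not dex's real storage types or handler code):

```go
package sketch

// AuthRequest and Store are simplified stand-ins for dex's storage types.
type AuthRequest struct {
	ID          string
	ConnectorID string
}

type Store interface {
	// UpdateAuthRequest runs the updater inside a (serializable) transaction.
	UpdateAuthRequest(id string, updater func(a AuthRequest) (AuthRequest, error)) error
}

// handleConnectorLogin is roughly what /dex/auth/local?req=<id> and
// /dex/auth/saml?req=<id> both do: fetch the authn request and pin its
// connector. Two concurrent calls with different connector IDs read and
// update the same row, which is exactly the read/write conflict that a
// serializable transaction rejects for one of them.
func handleConnectorLogin(s Store, authReqID, connectorID string) error {
	return s.UpdateAuthRequest(authReqID, func(a AuthRequest) (AuthRequest, error) {
		a.ConnectorID = connectorID
		return a, nil
	})
}
```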

What do you think? I guess there's a few options here -- for example, there has been evidence that restructuring the authn request code could be a good idea (#646, #1292).

@vito
Contributor

vito commented Nov 28, 2018

@srenatus I would expect that one transaction would succeed at a time until they all eventually succeed via their retry logic, in some (non-deterministic) order. The docs seem to indicate this:

If either transaction were running at the Repeatable Read isolation level, both would be allowed to commit; but since there is no serial order of execution consistent with the result, using Serializable transactions will allow one transaction to commit and will roll the other back with this message:

ERROR:  could not serialize access due to read/write dependencies among transactions

(from https://www.postgresql.org/docs/current/transaction-iso.html#XACT-SERIALIZABLE)

If somehow all transactions end up failing, resulting in a retry-loop "live-lock", perhaps there's something wrong with how Dex is using Postgres transactions for this? (I don't know what that would look like - maybe holding them open too long and having multiple points at which a transaction may fail based on different data being referenced/updated?) I haven't run your test change there yet, but I'm guessing you're seeing it fail (or hang)?
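
For reference, spotting that specific error class is just a SQLSTATE check - a sketch assuming the github.com/lib/pq driver, with a made-up helper name:

```go
package sketch

import "github.com/lib/pq"

// isSerializationFailure reports whether err is Postgres SQLSTATE 40001
// ("serialization_failure"), i.e. the "could not serialize access" error
// quoted above - the class of error a retry loop would target.
func isSerializationFailure(err error) bool {
	if pqErr, ok := err.(*pq.Error); ok {
		return pqErr.Code == "40001"
	}
	return false
}
```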

@srenatus
Contributor Author

I haven't run your test change there yet, but I'm guessing you're seeing it fail (or hang)?

Yeah, depending on whether I call tx.Rollback(), it's one or the other. I didn't keep close track; I just tried to come up with a more realistic scenario. I'll dig in again tomorrow.

What's our desired behaviour, though? I'd think that if the two-connectors-selected situation happens, I'd want the first one to succeed (whichever is first) and the second one to fail. In some way we need to ensure there's no cross-over, or mix-up, of connectors. You cannot start an LDAP flow, do something to have the connector ID switch to SAML, and finish the LDAP flow. Or you can do that, but the result must not look like you had logged in using SAML 😉
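
One sketch of what that rule could look like in the updater (reusing the simplified AuthRequest stand-in from my earlier sketch; the helper and error names are made up):

```go
package sketch

import "errors"

var errConnectorConflict = errors.New("auth request already claimed by a different connector")

// pinConnector is a sketch of the "no cross-over" rule: the first connector
// to be pinned wins, and a later attempt to switch the auth request to a
// different connector fails instead of silently overwriting it.
func pinConnector(connectorID string) func(a AuthRequest) (AuthRequest, error) {
	return func(a AuthRequest) (AuthRequest, error) {
		if a.ConnectorID != "" && a.ConnectorID != connectorID {
			return a, errConnectorConflict
		}
		a.ConnectorID = connectorID
		return a, nil
	}
}
```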

@ericchiang Could you chime in here? Do you know what edge cases this was supposed to catch?

@vito
Contributor

vito commented Nov 28, 2018

Fair; I really don't know enough about the full Dex API and auth flow to know what kinds of guarantees are critical for consistency reasons, and if there are good reasons to fail user-initiated requests due to concurrency. So I'd appreciate @ericchiang's input here too. 🙂

I typically expect last-write-wins; to me, two simultaneous requests with different intents have no single "correct" resolution, because the outcome is subject to network jitter. But I could totally see why someone would want to start one login flow and then start a different one soon after, expecting the last one to work - they may have mistakenly initiated the first request via a simple misclick and just gone back to start over. That case is more trivially serializable, of course, but I'd guess supporting it would mean going with last-write-wins if it really comes down to the wire.

@ericchiang
Contributor

Just to clarify about the tests: they just try to make sure that dex does the right thing in the case of concurrency. When they were written, no storage implementations retried, so if there was a racy call, at least one of them should fail.

Since you've added some retries for optimistic concurrency, the test now passes. That's the right behavior now, but the test is still locked into the old assumption. We should probably change the test.

I'm actually more concerned about the tx leak and HEAD having a failing test. To get HEAD back to a good state, can we:

  1. revert "retry on serialization errors" (#1342)
  2. re-merge the vendor changes from #1342
  3. submit a PR that re-implements the #1342 retry on serialization errors while taking into account #1355 (storage/sql/conformance: concurrent access tests fail for the wrong reason), #1363 (postgres: fix tx leak on serialization failure), and this PR?

I'm fine with retrying on any error. Maybe we should bake that into the storage layer so all implementations do that?
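
As a sketch (names made up), that could be as small as a shared helper that every implementation funnels its writes through, so the retry policy lives in one place:

```go
package sketch

// retry is a sketch of a helper the storage layer could share so every
// implementation handles retryable errors the same way - or retries on any
// error, if the predicate always returns true.
func retry(attempts int, retryable func(error) bool, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil || !retryable(err) {
			return err
		}
	}
	return err
}
```

A wrapper type embedding the storage interface could route each method through the same helper.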

srenatus added a commit that referenced this pull request Nov 29, 2018

Revert "retry on serialization errors"

This will come back, as outlined here: #1356 (comment)
@srenatus
Contributor Author

srenatus commented Nov 29, 2018

@srenatus srenatus mentioned this pull request Nov 29, 2018
@srenatus
Contributor Author

The code of this PR is not useful anymore -- I'm keeping it open regardless, to track progress on the plan in #1356 (comment) above.

@srenatus
Contributor Author

This is pretty much out of date right now; the discussion is happening in #1341. The next relevant PR is #1370.

@srenatus srenatus closed this Dec 17, 2018
mmrath pushed a commit to mmrath/dex that referenced this pull request Sep 2, 2019

Revert "retry on serialization errors"

This will come back, as outlined here: dexidp#1356 (comment)