Consider raising provider-specific concurrency exceptions as DbUpdateConcurrencyException #33068
Per my original issue on the Npgsql repo, I have a concurrency test that is (purposefully) causing the database to report a transaction serialization error (PG error 40001) due to the transaction isolation level. When I run my test at lower isolation levels, I get the expected DbUpdateConcurrencyException. From my perspective as the author of the calling library, I think there are a number of things that could be improved here:
My request is that EF make it possible for providers to flag explicitly when an error should be represented as a concurrency conflict, and to have that bubble up as a DbUpdateConcurrencyException.
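As a minimal sketch of the behavior being described (assuming the Npgsql EF Core provider; MyDbContext, its Entities set, and the entity shape are hypothetical placeholders), a serialization failure at serializable isolation currently surfaces as a plain DbUpdateException wrapping the PostgresException:

```csharp
using System.Data;
using Microsoft.EntityFrameworkCore;
using Npgsql;

await using var db = new MyDbContext();
await using var tx = await db.Database.BeginTransactionAsync(IsolationLevel.Serializable);

var entity = await db.Entities.SingleAsync(e => e.Id == 1);
entity.Value += 1;

try
{
    await db.SaveChangesAsync();
    await tx.CommitAsync();
}
catch (DbUpdateException e) when (
    e.InnerException is PostgresException { SqlState: "40001" })
{
    // A concurrent transaction made this one unserializable. Today this
    // surfaces as a plain DbUpdateException; the request is to surface it
    // as a DbUpdateConcurrencyException instead.
}
```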
How so? A serialization error is very much a result of your transaction happening concurrently with another transaction that is incompatible with it. When a program gets a serialization error, the right action is precisely to retry, so that the other transaction is no longer happening, no? In fact, I'd say concurrency errors (such as serializability errors) are perhaps one of the best examples of a database transient error...
That's a larger question that isn't just about this case or EF; logging code should always report the entire exception, including all nested exceptions. There are a great many .NET libraries and APIs which work this way, with wrapped exceptions.
My only concern here is whether this could cause issues with existing code that assumes DbUpdateConcurrencyException specifically indicates a rows-affected mismatch.
That's interesting... I can't think of a case where the distinction between "unexpected rows affected" (current meaning) and "serialization exception" could be meaningful to a client program; both indicate failure due to a concurrent operation, and in both cases the SaveChanges was rolled back (so we end up in the same state). Do you have a specific thing in mind? I'm actually a bit more worried about the breaking change aspect here - a program not currently using optimistic concurrency won't currently be catching DbUpdateConcurrencyException; if that program is currently written to deal with e.g. serialization failures (by catching PostgresException with the correct code), that program will no longer work.
@roji EF should be wrapping the provider-specific error in a DbUpdateException already, so correctly written code would be inspecting the inner PostgresException rather than relying on the outer exception type.
No, which is why I think this is okay. Just things like the UI being broken because a list that previously always had at least one entry now has none.
That's a good point, but an application retrying for serialization errors is catching DbUpdateException wrapping PostgresException with a specific code; it's not retrying for any DbUpdateException (at least not if it's written correctly). So code specifically identifying serializability exceptions will start to fail, no?
@roji I went and looked at the original issue. It seems that Npgsql is wrapping the DbUpdateException in an InvalidOperationException. That aside, this is the code:

```csharp
try
{
    // unrelated project-specific logic here
}
catch (InvalidOperationException e)
{
    // Work around an error reporting bug in Npgsql.EntityFrameworkCore:
    // For details, see https://github.com/npgsql/efcore.pg/issues/3084
    if (e.InnerException is DbUpdateException
        {
            InnerException: PostgresException { SqlState: "40001" } postgresException
        })
    {
        throw new DbUpdateConcurrencyException(postgresException.Message, postgresException);
    }

    throw;
}
```

This will still work the same regardless of whether the inner exception is a DbUpdateException or a DbUpdateConcurrencyException, since the latter derives from the former.
I see, thanks for looking! I think you're right, so beyond fixing the odd EFCore.PG InvalidOperationException (opened npgsql/efcore.pg#3098 to track), in this issue we'd basically only be swapping DbUpdateException for DbUpdateConcurrencyException, which is a subclass of the former - yeah, seems safe!
This is true if you can safely assume that the precondition for running the failed transaction hasn't been invalidated by the intervening transaction. The failure in my case occurs when two concurrent tasks attempt to modify the same version of a given entity, as my application writes modifications to this entity in an append-only fashion (the primary key is a composite of the entity ID and a version number). It's analogous to the error that git gives you when attempting to push a commit that would cause there to be multiple heads on the branch to which you're pushing. Simply replaying the transaction would be (loosely) equivalent to always automatically force pushing rather than letting the user decide based on their intentions whether to rebase, merge, or force push.
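A hypothetical sketch of the append-only scheme described here (the actual entity and property names aren't given in the thread):

```csharp
// Each modification inserts a new row for the next version of the entity;
// rows are never updated in place. The composite primary key is (Id, Version).
public class DocumentRevision
{
    public Guid Id { get; set; }      // logical identity of the entity
    public int Version { get; set; }  // incremented on every modification
    public string Payload { get; set; } = "";
}

// In DbContext.OnModelCreating:
//     modelBuilder.Entity<DocumentRevision>().HasKey(r => new { r.Id, r.Version });
//
// If two concurrent transactions both read version N and both try to insert
// version N + 1, one of them must fail - which is the conflict the
// serialization error (or a unique-constraint violation) reports.
```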
@roji Without disagreeing with your point about best practices here, I think it's equally valid to say that top-level exception messages should describe the error that caused the exception to be thrown.
It's an interesting discussion, and it depends on the precise definition of transience. The way I understand it, a transient error doesn't necessarily mean that the specific, isolated failed operation can/should be retried, but rather the entire transaction (or rather unit-of-work). To take a concrete example, let's say my UoW loads a row, increments some column, and attempts to save the result back. If a serialization error (or just a write conflict) happens, the program indeed can't just attempt to re-save - the entire UoW needs to be retried (reloading the row, incrementing, and saving).

Note that the EF docs address this in the context of concurrency-token-driven optimistic concurrency (docs) - it's worth comparing that guidance to this discussion, as things are very similar (the main difference is that with concurrency tokens EF generates the concurrency exception, whereas with serialization errors the database does).

Note also that at least in PostgreSQL, retrying a specific failed operation within the transaction isn't even possible; when a serialization error (or any other database error) occurs, the entire transaction transitions to a failed state and can only be rolled back. In that sense, the database enforces that any retry occurs at the level of the transaction (UoW), and not at the level of the specific operation within the transaction which happened to fail.

To address your git analogy:

I don't think that's the right equivalent; retrying means pulling in the latest changes and then redoing the changes, applying the entire original UoW logic on the new commit - definitely not force-pushing the original change. This may mean that the new change is different from the original change, or maybe even that no change needs to occur at all. It's true that there's a distinction here between a user-driven change (where it may make sense to allow the user to manually resolve conflicts, e.g. via a UI) and an automated change (where the program catches the concurrency exception and handles it without user intervention); automatic retry is relevant for the 2nd one, and less for the 1st. But that's a matter of how the concurrency error is resolved, and largely orthogonal to whether we consider the error itself transient.

To summarize, if transience is interpreted as applying to the entire transaction/UoW - as opposed to only the isolated operation within the transaction which generated it - then I think serializability (and other concurrency) errors fit very well under that definition. I'd be interested in hearing your thoughts on this.
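A minimal sketch of what retrying the entire UoW looks like for the counter-increment example above, assuming PostgreSQL/Npgsql and hypothetical MyDbContext/Counters types:

```csharp
using System.Data;
using Microsoft.EntityFrameworkCore;
using Npgsql;

const int maxAttempts = 3;
for (var attempt = 1; ; attempt++)
{
    await using var db = new MyDbContext();
    await using var tx = await db.Database.BeginTransactionAsync(IsolationLevel.Serializable);
    try
    {
        // The whole UoW lives inside the loop: reload, re-apply the change, save.
        var row = await db.Counters.SingleAsync(c => c.Id == 1);
        row.Value += 1;
        await db.SaveChangesAsync();
        await tx.CommitAsync();
        break;
    }
    catch (DbUpdateException e) when (
        e.InnerException is PostgresException { SqlState: "40001" } && attempt < maxAttempts)
    {
        // The failed transaction is rolled back (on dispose) and the UoW reruns
        // from the top - re-querying picks up the other transaction's effects.
    }
}
```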
@roji I'm not sure that I have the expertise to dive deep on formal definitions of transience, but as a developer who consumes these libraries, I'll do my best to at least share my perspective.

Just to clarify, in my previous comments I understood that the UoW that would be repeated by a retrying execution strategy in this case would be the entire transaction that failed to serialize. Apologies if I phrased things poorly there. However, I think that under your definition of transience, the transaction may actually be an incomplete portion of the UoW that would need to be repeated. That is, the full unit of work starts when the user fetches the entity to be modified, and ends when they submit their modifications to the database to be persisted. In my application this doesn't happen in the context of a single transaction, and as a result, the assumption that the transaction encompasses a complete unit of work breaks down.

I think under your more strict definition of transience it's perfectly valid to argue that by not having the full UoW encompassed in a single transaction here, I've violated the key design assumption(s) necessary to make transient errors automatically repeatable. Unfortunately, resolving the misalignment of the UoW boundary with the transaction boundary just isn't practical in my application.

My perspective (which may well be based on incomplete knowledge or false assumptions) is that the strictness of your definition of "transience" is likely causing a misalignment between what you as a library producer intend to convey when you set the IsTransient property, and what library consumers will infer from it. From the linked reference docs, the property has the following definition:

> Indicates whether the error represented by this DbException could be a transient error, i.e. if retrying the triggering operation may succeed without any other change.

The missing nuance is, I think, in the definition of "succeed" there. This is potentially one of the false assumptions that I mentioned above, but I think your perspective is that, in the context of a serialization failure, success occurs on retry if the transaction commits without error. However, as a developer thinking about a broader application-level workflow, I'd actually regard it as a compound failure in my case if the transaction were to fail due to a serialization error, be automatically retried, and then succeed. Functionally speaking, it would appear that setting IsTransient to true has, in effect, endorsed silently overwriting the user's conflicting work.

I think my case demonstrates that whether or not it's safe to retry on a serialization failure is an application detail. As a result, I don't think it's fitting for the provider library to assume on my behalf that retrying should be safe, which is what it appears to be doing by setting IsTransient to true for these errors.
You're right - although I used UoW and transaction somewhat interchangeably above, the UoW is the thing that actually matters (in some cases the two correspond, in others not). So to be precise, the entire UoW is the thing that needs to be retried when a concurrency exception occurs - including the querying. I think that's still fine, as long as all writes in the UoW are encompassed by a single transaction - which seems to be the case in most scenarios.

And it's true that in this sense, in my mind there indeed are different types of transient errors: some may allow for the specific operation within the UoW to be retried (e.g. a network issue), while others may require the entire UoW to be retried (e.g. a concurrency issue). It's worth pointing out that just always retrying the UoW - without making that distinction - is usually acceptable; retrying only the specific operation can be considered an optimization that isn't worth it in the general case. That's valuable because it allows a relatively simple "just retry the UoW on transient errors" approach that works for all cases.

Finally, I think you're reading too much into the meaning of IsTransient: it only indicates that retrying the triggering operation may succeed.
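In EF terms, "just retry the UoW" maps naturally onto an execution strategy, which wraps the complete load-modify-save sequence rather than a single operation. A sketch (MyDbContext/Counters remain placeholders; whether a given error actually triggers a retry depends on how the provider's execution strategy is configured):

```csharp
await using var db = new MyDbContext();
var strategy = db.Database.CreateExecutionStrategy();
await strategy.ExecuteAsync(async () =>
{
    // The retryable delegate is the whole unit of work, not just SaveChanges.
    await using var tx = await db.Database.BeginTransactionAsync(IsolationLevel.Serializable);
    var row = await db.Counters.SingleAsync(c => c.Id == 1);
    row.Value += 1;
    await db.SaveChangesAsync();
    await tx.CommitAsync();
});
```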
I interpreted it to have exactly this meaning. I just don't agree that in my use case it's possible for the operation in question to succeed if retried. As a result, I can't assume that generic library components that may branch on the value of IsTransient will behave correctly for my application. Thankfully I don't think many such components exist, so it's really a secondary concern for me. Otherwise npgsql/efcore.pg#3098 addresses my primary concern. Overall I'm happy with this outcome if you'd like to leave it there.
I'm not sure I understand - the operation (saving the changes) may succeed, although the results are not what you want (because you'd possibly not be taking into account the effects of a concurrent transaction, and overwriting them etc.). In PG specifically you have to roll back the transaction on failure, meaning you (likely!) have to perform the querying again, which would take into account the effects of the concurrent transaction; but that seems orthogonal to the overall definition here.
So again, the result is invalid only because you consider it as such; as long as the operation can succeed (regardless of what state it produces), the retry "succeeded" in the narrow sense. But more importantly, as I wrote above, my expectation is for generic library components to retry the entire UoW upon a transient exception (including a serialization one). What exact problem do you see with that?
It's less about what I want, but rather the impact on the user. The transaction succeeding on retry in this case would very much be viewed by the user as a defect in our application. Users of our application would argue that this isn't succeeding "in a narrow sense," but rather it's failing silently in a way that may go undetected for quite some time. It's the exact opposite of the principle of failing fast and loudly. The application in question is part of a B2B product that informs key business decisions with large financial stakes. If users lose their work, only to discover the loss after said key business decisions have been made, it's not going to do good things for their trust in our application.
Emphasis mine in the first quote. The exact problem that I see with that is that no generic library component would be able to do this for my application. The application is a GraphQL API that is accessed by users via our application's web frontend. The use case is just a fairly basic read-modify-write: the user reads the entity from the API, makes changes, and sends those changes back to the API to be written as an update. Behind the scenes, the update is implemented in an "append only" fashion - a new row is inserted rather than an existing row being updated.

Without the user being included in the system as a decision-maker that the retrying library component can invoke, the retry can't reliably succeed, because the information upon which they based their decisions while making their edits may no longer be valid.
FWIW, it looks like some really smart guy foresaw these concerns back in 2020 when this API was being discussed: dotnet/runtime#34817 (comment). From that comment:
I think that's a perfectly valid response here. My only lingering concern is that this situation forces me to implement my own retry logic early, as otherwise we run the risk that later on someone on the team who's less familiar with our entity versioning scheme will see a transient network failure and add automatic retries across the board.

Side note: Polly looks really interesting. I hadn't seen that before, and I'll definitely have a closer look. Thanks for pointing that out!
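For what it's worth, a Polly-based policy could encode exactly the distinction discussed here: retry the errors the application knows are safe (e.g. transient network failures) while deliberately excluding serialization failures that need user intervention. A sketch against the Polly v7 API, with SaveChangesForUserAsync as a placeholder for the application's UoW:

```csharp
using Npgsql;
using Polly;

var retryPolicy = Policy
    // Retry transient Npgsql errors, but never PG serialization failures
    // (40001), which in this application must be surfaced to the user.
    .Handle<NpgsqlException>(ex =>
        ex.IsTransient && (ex as PostgresException)?.SqlState != "40001")
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(100 * attempt));

await retryPolicy.ExecuteAsync(() => SaveChangesForUserAsync());
```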
To summarize your case, you're basically saying that automated retrying isn't possible for disconnected scenarios, where the (1st) querying is done in a completely separate web request, without state being maintained server-side across the two web requests (query, apply changes). In other words, the conceptual UoW is spread across two web requests, and as such there's no way to automatically retry. You're absolutely right - and IsTransient is indeed of limited value for your scenario; it may indicate that safe retrying is possible in some cases (e.g. a timeout), but you cannot act on it automatically, as some forms of transient errors cannot be retried in isolation (they require the entire UoW to be retried). I agree this limits the usefulness of IsTransient as a general mechanism; there still are specific scenarios where automated retries work quite well based on it (i.e. where the UoW can be retried, in non-disconnected scenarios), but it's not a thing to be systematically used everywhere without having to think about it. /cc @ajcvickers

@benjamincburns, in general, thanks for this conversation - it's very valuable to have this kind of discussion and better understand the problems you're running up against.
To return to the original questions here... I still think it probably makes sense to allow database concurrency/serialization errors to be raised as EF's DbUpdateConcurrencyException - the exception expresses the exact error type, and as with IsTransient, for non-disconnected scenarios it can trigger an automatic retry of the UoW. In other words, how exactly we raise the exception is orthogonal to whether automatic retrying is possible or not. And regardless, I'll need to look at npgsql/efcore.pg#3098.
All good. FWIW, I really appreciate your patience in getting to a point of mutual understanding.

Team decision: we think it makes sense to add a provider hook to allow providers to normalize database concurrency exceptions to DbUpdateConcurrencyException.
DbUpdateConcurrencyException is currently only thrown for an EF concurrency-token related failure, i.e. when the rows-affected from the database don't match the expectation.
There are various other concurrency errors that are raised directly from the database - deadlocks, write conflicts in snapshot isolation mode (PG repeatable read), serialization errors in serializable isolation mode. There's no conceptual difference between these and EF's error; in fact, snapshot mode specifically handles optimistically the exact same case that EF's concurrency token handles - write conflicts. So it could make sense to provide a provider hook that would allow wrapping such exceptions in DbUpdateConcurrencyException; users would then only need to catch that one exception type for retries etc.

On the downside, note that this would be a small breaking change, as applications currently handling e.g. deadlock retries by catching SqlException would suddenly start getting a DbUpdateConcurrencyException. They're probably already doing that if they're using concurrency tokens, but not everyone uses them. In addition, EF has so far refrained from interpreting or transforming provider exceptions, simply letting them bubble up; this can be useful to users, as provider exceptions look and behave the same with or without EF, rather than being caught and rethrown in some other form as is proposed here.

Note also the relationship with DbException.IsTransient, through which a provider can indicate that an exception may be transient and worth retrying (not yet implemented in SqlClient) - though not all transient exceptions represent concurrency errors.
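If the proposed hook existed, user code could treat all of these cases uniformly. A sketch of what that might look like (RetryUnitOfWorkAsync is a placeholder for application-level retry logic):

```csharp
try
{
    await db.SaveChangesAsync();
}
catch (DbUpdateConcurrencyException)
{
    // With the proposed hook, this single catch would cover both EF's
    // rows-affected mismatch and provider-flagged database concurrency errors
    // (e.g. PG serialization failure 40001, SQL Server deadlock 1205).
    await RetryUnitOfWorkAsync();
}
```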
Raised by @benjamincburns in npgsql/efcore.pg#3084