Consider raising provider-specific concurrency exceptions as DbUpdateConcurrencyException #33068

Open
roji opened this issue Feb 12, 2024 · 21 comments

@roji (Member) commented Feb 12, 2024

DbUpdateConcurrencyException is currently thrown only for an EF concurrency-token-related failure, i.e. when the rows affected reported by the database don't match the expected count.

There are various other concurrency errors that are raised directly by the database - deadlocks, write conflicts in snapshot isolation mode (PG repeatable read), serialization errors in serializable isolation mode. There's no conceptual difference between these and EF's error; in fact, snapshot mode optimistically handles the exact same case that EF's concurrency token handles - write conflicts. So it could make sense to provide a provider hook that would allow wrapping such exceptions in DbUpdateConcurrencyException; users would then only need to catch that one exception type for retries etc.

On the downside, note that this would be a small breaking change: applications currently handling e.g. deadlock retries by catching SqlException would suddenly start getting a DbUpdateConcurrencyException. They're probably already catching that if they're using concurrency tokens, but not everyone uses them. In addition, EF has so far refrained from interpreting or transforming provider exceptions, simply bubbling them up; that can be useful, since provider exceptions look and behave the same with or without EF, rather than being caught and rewrapped as proposed here.

Note also the relationship with DbException.IsTransient, through which a provider can indicate that an exception may be transient and a retry makes sense (not yet implemented in SqlClient) - though not all transient exceptions represent concurrency errors.
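
For context, a minimal sketch of what a generic retry keyed on DbException.IsTransient could look like at the ADO.NET level; the helper name and delegate-based shape are illustrative, not an existing EF API:

using System;
using System.Data.Common;
using System.Threading.Tasks;

// Minimal sketch of a provider-agnostic retry keyed on DbException.IsTransient.
static async Task ExecuteWithRetryAsync(Func<Task> unitOfWork, int maxAttempts = 3)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            await unitOfWork(); // run the entire unit of work (query, modify, save)
            return;
        }
        catch (DbException ex) when (ex.IsTransient && attempt < maxAttempts)
        {
            // The provider flagged the error as possibly transient: back off and
            // retry the whole unit of work, not just the failed command.
            await Task.Delay(TimeSpan.FromMilliseconds(100 * attempt));
        }
    }
}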

Raised by @benjamincburns in npgsql/efcore.pg#3084

@benjamincburns

Per my original issue on the Npgsql repo, I have a concurrency test that is (purposefully) causing the database to report a transaction serialization error (PG error 40001) due to the transaction isolation level. When I run my test at lower isolation levels, I get the expected DbUpdateConcurrencyException. When I run it at higher isolation levels, I get a very generic InvalidOperationException with the highly ambiguous message "An exception has been raised that is likely due to a transient failure."
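
For concreteness, a minimal sketch of the kind of read-modify-write that triggers this at serializable isolation; the context, entity, and property names here are illustrative, not my actual code:

using System.Data;
using Microsoft.EntityFrameworkCore;

// Two of these units of work running concurrently at Serializable isolation
// can make PostgreSQL abort one of them with SQLSTATE 40001.
await using var transaction =
    await context.Database.BeginTransactionAsync(IsolationLevel.Serializable);

var document = await context.Documents.SingleAsync(d => d.Id == documentId);
document.Version++;

await context.SaveChangesAsync();
await transaction.CommitAsync();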

From my perspective as the author of the calling library, I think there are a number of things that could be improved here:

  • Exception message is misleading
    • The serialization error that I mentioned in my original ticket is not a transient error - a retry loop can't help it to succeed
  • Exception message and properties (apart from InnerException) don't indicate the nature of the issue
    • A lot of error logging code only reports the top-level exception type and message; I'd expect the nature of the error to be clearer from at least the exception message
  • Exception type specificity
    • The current implementation bubbles up what is conceptually a concurrency error as an InvalidOperationException
  • Unnecessarily complicated error handling code
    • From the perspective of the calling code, the nature of the error is the same regardless of its source (transaction serialization error vs concurrency token error)
      • Users with less rigorous testing discipline are more likely to encounter only one version or the other of this error during testing and assume they've covered the expected failure modes for concurrency failures, only to find out later (perhaps in prod) that they missed one.
      • Isolation level is often set outside of query logic

My request is that EF make it possible for providers to flag explicitly when an error should be represented as a concurrency conflict, and to have that bubble up as a DbUpdateConcurrencyException (as this seems to be the most semantically correct exception type). Ideally the exception message for DbUpdateConcurrencyException would indicate the source of the error, but at that stage the exception type is specific enough that having to determine the source of the error from InnerException doesn't feel quite so out of place.

@roji (Member, Author) commented Feb 12, 2024

The serialization error that I mentioned in my original ticket is not a transient error - a retry loop can't help it to succeed

How so? A serialization error is very much a result of your transaction happening concurrently with another transaction that is incompatible with it. When a program gets a serialization error, the right action is precisely to retry, so that the other transaction is no longer happening, no?

In fact, I'd say concurrency errors (such as serialization errors) are perhaps one of the best examples of a transient database error...

A lot of error logging code only reports top-level exception type and message, I'd expect the nature of the error to be more clear from at least the exception message

That's a larger question that isn't just about this case or EF: logging code should always report the entire exception, including all nested exceptions; a great many .NET libraries and APIs work this way, with wrapped exceptions.
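
As a small sketch of what that looks like in practice (assuming a context and an ILogger are in scope), passing the exception object itself rather than only its Message lets the logger record the whole chain:

using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Logging;

try
{
    await context.SaveChangesAsync();
}
catch (DbUpdateException ex)
{
    // Logging the exception object (not just ex.Message) records the full chain,
    // including any wrapped provider exception such as PostgresException.
    logger.LogError(ex, "SaveChanges failed");
    throw;
}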

@ajcvickers (Member)

My only concern here is whether this could cause issues with existing code that assumes DbUpdateConcurrencyException must mean actual rows affected is different from expected rows affected, since there has always been a 1:1 correlation here, even in legacy EF. That being said, I still think this is probably the right thing to do.

@roji (Member, Author) commented Feb 14, 2024

My only concern here is whether this could cause issues with existing code that assumes DbUpdateConcurrencyException must mean actual rows affected is different from expected rows affected [...]

That's interesting... I can't think of a case where the distinction between "unexpected rows affected" (current meaning) and "serialization exception" could be meaningful to a client program; both indicate failure due to a concurrent operation, and in both cases the SaveChanges was rolled back (so we end up in the same state). Do you have a specific thing in mind?

I'm actually a bit more worried about the breaking change aspect here - a program not currently using optimistic concurrency won't currently be catching DbUpdateConcurrencyException; if that program is currently written to deal with e.g. serialization failures (by catching PostgresException with the correct code), that program will no longer work.

@ajcvickers (Member)

@roji EF should be wrapping the provider-specific error in a DbUpdateException. DbUpdateConcurrencyException derives from DbUpdateException, so if code is currently catching only DbUpdateException, then it should still work the same after this change.

Do you have a specific thing in mind?

No, which is why I think this is okay. Just things like a UI breaking because a list that previously always had at least one entry now has none.

@roji (Member, Author) commented Feb 14, 2024

EF should be wrapping the provider-specific error in a DbUpdateException. DbUpdateConcurrencyException derives from DbUpdateException, so if code is currently catching only DbUpdateException, then it should still work the same after this change.

That's a good point, but an application retrying for serialization errors is catching DbUpdateException wrapping PostgresException with a specific code; it's not retrying for any DbUpdateException (at least not if it's written correctly). So code specifically identifying serializability exceptions will start to fail, no?

@ajcvickers (Member)

@roji I went and looked at the original issue. It seems that Npgsql is wrapping the DbUpdateException in an InvalidOperationException, which is strange for an error that comes from SaveChanges. /cc @AndriySvyryd

That aside, this is the code:

try
{
    // unrelated project-specific logic here
}
catch (InvalidOperationException e)
{
    // Work around an error reporting bug in Npgsql.EntityFrameworkCore:
    // For details, see https://github.com/npgsql/efcore.pg/issues/3084
    if (e.InnerException is DbUpdateException
        {
            InnerException: PostgresException { SqlState: "40001" } postgresException
        })
    {
        throw new DbUpdateConcurrencyException(postgresException.Message, postgresException);
    }

    throw;
}

This will still work the same regardless of whether the inner exception is DbUpdateException or DbUpdateConcurrencyException. This is by-design, so that applications that don't handle concurrency errors specifically still get the expected exception if the update did not succeed.

@roji (Member, Author) commented Feb 15, 2024

I see, thanks for looking! I think you're right, so beyond fixing the odd EFCore.PG InvalidOperationException (opened npgsql/efcore.pg#3098 to track), in this issue we'd basically only be swapping DbUpdateException with DbUpdateConcurrencyException, which is a subclass of the former - yeah, seems safe!

@benjamincburns commented Feb 19, 2024

How so? A serialization error is very much a result of your transaction happening concurrently with another transaction that is incompatible with it. When a program gets a serialization error, the right action is precisely to retry, so that the other transaction is no longer happening, no?

This is true if you can safely assume that the precondition for running the failed transaction hasn't been invalidated by the intervening transaction. The failure in my case occurs when two concurrent tasks attempt to modify the same version of a given entity, as my application writes modifications to this entity in an append-only fashion (primary key is id: uuid, version: int). Simply replaying the failed transaction wouldn't be the correct approach here, as that would have the effect of overwriting the changes that were written first. In this case, the user needs to be made aware of the conflicting change so they can apply their edits in light of that change.
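
For concreteness, a rough sketch of the kind of append-only scheme I'm describing; the type and property names are illustrative, not my actual schema:

using System;
using Microsoft.EntityFrameworkCore;

public class DocumentRevision
{
    public Guid Id { get; set; }       // stable logical entity id
    public int Version { get; set; }   // incremented for every edit; rows are never updated in place
    public string Content { get; set; } = "";
}

public class AppDbContext : DbContext
{
    public DbSet<DocumentRevision> DocumentRevisions => Set<DocumentRevision>();

    protected override void OnModelCreating(ModelBuilder modelBuilder)
        // Composite primary key (Id, Version): every edit inserts a new row.
        => modelBuilder.Entity<DocumentRevision>()
            .HasKey(r => new { r.Id, r.Version });
}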

It's analogous to the error that git gives you when attempting to push a commit that would cause there to be multiple heads on the branch to which you're pushing. Simply replaying the transaction would be (loosely) equivalent to always automatically force pushing rather than letting the user decide based on their intentions whether to rebase, merge, or force push.

@benjamincburns commented Feb 19, 2024

That's a larger question that isn't just about this case or EF; logging code should always report the entire exception - including all nested exceptions - there are a great many .NET libraries and APIs which work this way, with wrapped exceptions.

@roji Without any disagreement to your point about best practices here, I think it's equally valid to say that top-level exception messages should describe the error that caused the exception to be thrown.

@roji (Member, Author) commented Feb 19, 2024

This is true if you can safely assume that the precondition for running the failed transaction hasn't been invalidated by the intervening transaction.

It's an interesting discussion, and it depends on the precise definition of transience.

The way I understand it, a transient error doesn't necessarily mean that the specific, isolated failed operation can/should be retried, but rather the entire transaction (or rather, unit of work). To take a concrete example, let's say my UoW loads a row, increments some column, and attempts to save the result back. If a serialization error (or just a write conflict) happens, the program indeed can't just attempt to re-save - the entire UoW needs to be retried (reloading the row, incrementing and saving). Note that the EF docs address this in the context of concurrency-token-driven optimistic concurrency (docs) - it's worth comparing that guidance to this discussion, as things are very similar (the main difference is that with concurrency tokens EF generates the concurrency exception, whereas with serialization errors the database does).
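
As a minimal sketch of retrying the whole UoW rather than the failed operation, using EF's execution strategy (a retrying strategy is assumed to be configured on the context; the entity and variable names are illustrative):

using Microsoft.EntityFrameworkCore;

var strategy = context.Database.CreateExecutionStrategy();

await strategy.ExecuteAsync(async () =>
{
    // Query, modify and save all inside the delegate, so a transient failure
    // re-runs the complete load-increment-save sequence.
    var counter = await context.Counters.SingleAsync(c => c.Id == counterId);
    counter.Value++;
    await context.SaveChangesAsync();
});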

Note that at least in PostgreSQL, retrying a specific failed operation within the transaction isn't even possible; when a serialization error (or any other database error) occurs, the entire transaction transitions to a failed state and can only be rolled back. In that sense, the database enforces that any retry occurs at the level of the transaction (UoW), and not at the level of the specific operation within the transaction which happened to fail.

To address your git analogy:

Simply replaying the transaction would be (loosely) equivalent to always automatically force pushing rather than letting the user decide based on their intentions whether to rebase, merge, or force push.

I don't think that's the right equivalent; retrying means pulling in the latest changes, and then redoing the changes, applying the entire original UoW logic on the new commit - definitely not force-pushing the original change. This may mean that the new change is different from the original change, or maybe even that no change needs to occur at all.

It's true that there's a distinction here between a user-driven change (where it may make sense to allow the user to manually resolve conflicts, e.g. via a UI) and an automated change (where the program catches the concurrency exception and handles it without user intervention); automatic retry is relevant for the second, and less for the first. But that's a matter of how the concurrency error is resolved, and largely orthogonal to whether we consider the error itself transient.

To summarize, if transience is interpreted as applying to the entire transaction/UoW - as opposed to only the isolated operation within the transaction which generated it - then I think serializability (and other concurrency) errors fit very well under that definition. I'd be interested in hearing your thoughts on this.

@benjamincburns commented Feb 20, 2024

@roji I'm not sure that I have the expertise to dive deep on formal definitions of transience, but as a developer who consumes these libraries, I'll do my best to at least share my perspective.

Just to clarify, in my previous comments I understood that the UoW that would be repeated by a retrying execution strategy in this case would be the entire transaction that failed to serialize. Apologies if I phrased things poorly there.

However I think that under your definition of transience, it may be that the transaction is actually an incomplete portion of the UoW that would need to be repeated. That is, the full unit of work starts when the user fetches the entity to be modified, and ends when they submit their modifications to the database to be persisted. In my application this doesn't happen in the context of a single transaction, and as a result, the assumption that the transaction encompasses a complete unit of work breaks down.

I think under your more strict definition of transience it's perfectly valid to argue that by not having the full UoW encompassed in a single transaction here, I've violated the key design assumption(s) necessary to make transient errors automatically repeatable. Unfortunately, resolving the misalignment of the UoW boundary with the transaction boundary just isn't practical in my application.

My perspective (which may well be based on incomplete knowledge or false assumptions) is that the strictness of your definition of "transience" is likely causing a misalignment between what you as a library producer are intending to convey when you set the DbException.IsTransient property, vs what developers interpret it to mean.

From the linked reference docs, the property has the following definition:

Indicates whether the error represented by this DbException could be a transient error, i.e. if retrying the triggering operation may succeed without any other change.

The missing nuance is, I think, in the definition of "succeed," there. This is potentially one of the false assumptions that I mentioned above, but I think your perspective is that, in the context of a serialization failure, success occurs on retry if the transaction commits without error. However as a developer who is thinking about a broader application-level workflow, I'd actually regard it as a compound failure in my case if the transaction were to fail due to a serialization error, be automatically retried, and then succeed.

Functionally speaking, it would appear that setting IsTransient is equivalent to marking the error condition as safe to retry, and this implication is reflected by the design of NpgsqlRetryingExecutionStrategy in that it has a built-in facility for including additional non-transient error codes for automatic retrying, but no facility for excluding errors that have IsTransient set.
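
To illustrate the asymmetry I mean, a sketch assuming the Npgsql provider's EnableRetryOnFailure overload that accepts additional error codes (services, connection string, and AppDbContext are placeholders):

using System;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;

services.AddDbContext<AppDbContext>(options =>
    options.UseNpgsql(connectionString, npgsql =>
        // You can opt *in* extra, otherwise non-transient, codes for retry...
        npgsql.EnableRetryOnFailure(5, TimeSpan.FromSeconds(10), new[] { "53300" })));
        // ...but there is no corresponding way to opt *out* of codes the strategy
        // already treats as transient, such as 40001.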

I think my case demonstrates that whether or not it's safe to retry on a serialization failure is an application detail. As a result, I don't think it's fitting for the provider library to assume on my behalf that retrying should be safe, which is what it appears to be doing by setting IsTransient here.

@roji (Member, Author) commented Feb 20, 2024

However I think that under your definition of transience, it may be that the transaction is actually an incomplete portion of the UoW that would need to be repeated. That is, the full unit of work starts when the user fetches the entity to be modified, and ends when they submit their modifications to the database to be persisted. In my application this doesn't happen in the context of a single transaction, and as a result, the assumption that the transaction encompasses a complete unit of work breaks down.

You're right - although I used UoW and transaction somewhat interchangeably above, the UoW is the thing that actually matters (in some cases the two correspond, in others not). So to be precise, the entire UoW is the thing that needs to be retried when a concurrency exception occurs - including the querying. I think that's still fine, as long as all writes in the UoW are encompassed by a single transaction - which seems to be the case in most scenarios.

And it's true that in this sense, there are indeed different types of transient errors in my mind: some may allow the specific operation within the UoW to be retried (e.g. a network issue), while others require the entire UoW to be retried (e.g. a concurrency issue). It's worth pointing out that always retrying the UoW - without making that distinction - is usually acceptable; retrying only the specific operation can be considered an optimization that isn't worth it in the general case. That's valuable because it allows a relatively simple "just retry the UoW on transient errors" approach that works for all cases.

Finally, I think you're reading too much into the meaning of DbException.IsTransient - it merely states whether the operation could succeed if retried, and does not say anything about whether actually retrying that specific operation is desirable ("safe") or makes sense in a given application. As written above, I do think retrying the UoW (as opposed to the specific operation) does make sense for concurrency errors, and having IsTransient allows implementing that logic in a general way without resorting to database-specific identification of errors.

@benjamincburns commented Feb 22, 2024

I think you're reading too much into the meaning of DbException.IsTransient - it merely states whether the operation could succeed if retried

I interpreted it to have exactly this meaning. I just don't agree that in my use case it's possible for the operation in question to succeed if retried. As a result, I can't assume that generic library components that may branch on the value of IsTransient won't introduce invalid representations of my entities in my database.

Thankfully I don't think many such components exist, so it's really a secondary concern for me. Otherwise npgsql/efcore.pg#3098 addresses my primary concern. Overall I'm happy with this outcome if you'd like to leave it there.

@roji (Member, Author) commented Feb 22, 2024

I interpreted it to have exactly this meaning. I just don't agree that in my use case it's possible for the operation in question to succeed if retried.

I'm not sure I understand - the operation (saving the changes) may succeed, although the results are not what you want (because you'd possibly not be taking into account the effects of a concurrent transaction, and overwriting them etc.). In PG specifically you have to roll back the transaction on failure, meaning you (likely!) have to perform the querying again, which would take into account the effects of the concurrent transaction; but that seems orthogonal to the overall definition here.

As a result, I can't assume that generic library components that may branch on the value of IsTransient won't introduce invalid representations of my entities in my database.

So again, the representation is invalid because you consider it as such; as long as the operation can succeed (regardless of what state it produces), then the retry "succeeded" in the narrow sense.

But more importantly, as I wrote above, my expectation is for generic library components to retry the entire UoW upon a transient exception (including a serializable one). What exact problem do you see with that?

@benjamincburns commented Feb 23, 2024

the operation (saving the changes) may succeed, although the results are not what you want (because you'd possibly not be taking into account the effects of a concurrent transaction, and overwriting them etc.).

as long as the operation can succeed (regardless of what state it produces), then the retry "succeeded" in the narrow sense.

It's less about what I want and more about the impact on the user. The transaction succeeding on retry in this case would very much be viewed by the user as a defect in our application. Users of our application would argue that this isn't succeeding "in a narrow sense," but rather failing silently in a way that may go undetected for quite some time. It's the exact opposite of the principle of failing fast and loudly.

The application in question is part of a B2B product that informs key business decisions with large financial stakes. If users lose their work, only to discover the loss after said key business decisions have been made, it's not going to do good things for their trust in our application.

you have to roll back the transaction on failure, meaning you (likely!) have to perform the querying again, which would take into account the effects of the concurrent transaction

my expectation is for generic library components to retry the entire UoW upon a transient exception (including a serializable one). What exact problem do you see with that?

Emphasis mine in the first quote.

The exact problem that I see with that is that no generic library component would be able to do this for my application.

The application is a GraphQL API that is accessed by users via our application's web frontend. The use case is just a fairly basic read-modify-write. User reads the entity from the API, makes changes, and sends those changes back to the API to be written as an update. Behind the scenes, the update is implemented in an "append only" fashion. A new row is inserted rather than an UPDATE query, and in order to enforce the invariants of our versioning scheme there's another (more brief) read-modify-write happening there. That second read-modify-write is the transaction that's failing to serialize in my tests. As a result the initial read of the entity to be modified isn't captured within the lifecycle of the update request.

Without the user being included in the system as a decision process that's invokable by the retrying library component, the retry can't reliably succeed because the information upon which they based their decisions while making their edits may no longer be valid.

@benjamincburns

FWIW, it looks like some really smart guy foresaw these concerns back in 2020 when the IsTransient flag was first introduced.

dotnet/runtime#34817 (comment)

From that comment:

consumers are always free to ignore the proposed IsTransient property and define whatever retry policy they want

I think that's a perfectly valid response here. My only lingering concern is that this situation forces me to implement my own retry logic early, as otherwise we run the risk that later on someone on the team who's less familiar with our entity versioning scheme will see a transient network failure and add NpgsqlRetryingExecutionStrategy into the mix as a quick fix.

Side note: Polly looks really interesting. I hadn't seen that before, and I'll definitely have a closer look. Thanks for pointing that out!

@roji (Member, Author) commented Feb 23, 2024

my expectation is for generic library components to retry the entire UoW upon a transient exception (including a serializable one). What exact problem do you see with that?

The exact problem that I see with that is that no generic library component would be able to do this for my application.

To summarize your case, you're basically saying that automated retrying isn't possible for disconnected scenarios, where the (1st) querying is done in a completely separate web request, without state being maintained server-side across the two web requests (query, apply changes). In other words, the conceptual UoW is spread across two web requests, and as such there's no way to automatically retry.

You're absolutely right - and IsTransient is indeed of limited value for your scenario; it may indicate safe retrying is possible in some cases (e.g. timeout), but you cannot do this automatically as some forms of transient errors cannot be retried (as they require the entire UoW to be retried).

I agree this limits the usefulness of IsTransient as a general mechanism; there still are specific scenarios where automated retries work quite well based on it (i.e. where the UoW can be retried, in non-disconnected scenarios), but it's not a thing to be systematically used everywhere without having to think about it.

/cc @ajcvickers

@benjamincburns in general, thanks for this conversation - it's very valuable to have this kind of discussion and better understand the problems you're running up against.

@roji (Member, Author) commented Feb 23, 2024

To return to the original questions here... I still think it probably makes sense to allow database concurrency/serialization errors to be raised as EF DbUpdateConcurrencyException - the exception expresses the exact error type, and as with IsTransient, for non-disconnected scenarios can trigger an automatic retry of the UoW. In other words, how exactly we raise the exception is orthogonal to whether automatic retrying is possible or not.

And regardless, I'll need to look at npgsql/efcore.pg#3098.

@benjamincburns

thanks for this conversation

All good. FWIW, I really appreciate your patience in getting to a point of mutual understanding.

@roji (Member, Author) commented Feb 26, 2024

Team decision: we think it makes sense to add a provider hook to allow providers to normalize database concurrency exceptions to DbUpdateConcurrencyException.
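
For illustration, once such a hook exists, application code in non-disconnected scenarios could cover both the concurrency-token case and provider-reported conflicts with a single catch - a sketch of the intended usage, not a final API:

using Microsoft.EntityFrameworkCore;

try
{
    await context.SaveChangesAsync();
}
catch (DbUpdateConcurrencyException)
{
    // With the proposed hook, this single catch would cover both the rows-affected
    // mismatch EF detects today and provider-reported conflicts such as PostgreSQL
    // serialization failures (40001). Typical handling: reload the affected data and
    // retry the unit of work, or surface the conflict to the user in disconnected scenarios.
}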
