Replies: 4 comments 4 replies
-
(Looks like Discussions doesn't support internal links :( ) Beginning of the discussion: #1708 (comment) Now as I look at it more, I like
-
This is great, thanks for starting the discussion.
The above strikes me as the most Pythonic approach. Nesting
I'm not sure I understand the point in that section. If you didn't enclose your reads and your writes in a transaction, your code is wrong, regardless of whether the statement is repeated. Thus, I don't think this is a good argument against adding an option to automatically repeat single statements. Compare:

```python
await pool.execute('INSERT ...')
```

and

```python
async with pool.atomic() as atomic:
    async for transaction in atomic:
        await transaction.execute('INSERT')
```
-
Thank you, Paul, for summarizing this. It's an excellent start. I'd like us to continue the discussion and then create an RFC out of this (shouldn't be too much work given the amount of content you already have here). A few comments:
Out of all proposed variants I personally like
Note that applying the
While still reading your proposal I've immediately thought of something like this. There are a few fundamental problems with it though:
The reverse API ( So it's basically a choice between the simple-looking:

```python
for tx in db.atomic():  # or 'async for' for better UX
    async with tx:
        await tx.query(...)
```

and a more explicit:

```python
async with db.atomic() as atomic:
    async for tx in atomic:
        async with tx:
            await tx.query(...)
```

(I'm using I like the former variant (proposed by Paul) better, even though it's a hack. Given that this looks like the only sane API for Python, I'd like to propose to use:

```python
async for tx in db.retry():
    async with tx:
        # ... code ...
```
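For concreteness, here is a minimal runnable sketch of how such a `retry()` async iterator could drive that loop. All names here (`Retry`, `Transient`, `_Attempt`) are hypothetical stand-ins, not the actual client implementation; a real client would issue `BEGIN`/`COMMIT` on the wire and apply backoff between attempts:

```python
import asyncio

class Transient(Exception):
    """Stand-in for a transient serialization/deadlock error."""

class Retry:
    """Async iterator that yields transaction attempts until one commits."""
    def __init__(self, attempts=3):
        self.attempts = attempts
        self.done = False

    def __aiter__(self):
        return self

    async def __anext__(self):
        if self.done or self.attempts == 0:
            raise StopAsyncIteration
        self.attempts -= 1
        return _Attempt(self)

class _Attempt:
    """One attempt; an async context manager playing the `tx` role."""
    def __init__(self, retry):
        self._retry = retry

    async def __aenter__(self):
        return self  # a real client would send BEGIN here

    async def __aexit__(self, exc_type, exc, tb):
        if exc_type is None:
            self._retry.done = True   # a real client would send COMMIT here
            return False
        if issubclass(exc_type, Transient) and self._retry.attempts > 0:
            return True               # swallow the error; the loop retries
        return False                  # out of attempts: propagate

async def main():
    calls = 0
    async for tx in Retry():
        async with tx:
            calls += 1
            if calls < 3:             # fail the first two attempts
                raise Transient("serialization failure")
    return calls

print(asyncio.run(main()))  # → 3 (two transient failures, then success)
```

The key trick is that `__aexit__` returning `True` suppresses the transient error, so the `async for` loop simply proceeds to the next attempt.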
Can you show examples of where we'd need multiple inheritance? Using methods would be preferable in JS, where the
Composability is really the last nail in the old (current) API's coffin. I'm -1 on making
-
I like
Much prefer
-
Motivation
Sometimes PostgreSQL can't apply concurrent transactions and errors out with:

It's expected that the transaction may be repeated and succeed when repeated. A common case where this happens is using the `INSERT ... UNLESS CONFLICT ..` statement in EdgeDB. See the Repeating EdgeQL section for an explanation of why Postgres and EdgeDB can't repeat transactions automatically.

EdgeDB has only `REPEATABLE READ` and `SERIALIZABLE` isolation levels. That means errors like the above may happen more often than with the default `READ COMMITTED` isolation level in Postgres.

As we implement retrying connections on "concurrent update" errors, we also want to handle network failures after `BEGIN TRANSACTION` and before `COMMIT` (we can't reliably know whether the commit itself worked if we sent the last `COMMIT` and got no response). There might be a different number of attempts and different timeout settings, though.
Proposal
General idea of the feature:
We need to introduce a new term. We can't use the term "transaction" for a retryable transaction, because users might not expect "transaction" to be a retryable block. Here is a quick brainstorm on the block name:

- `db.atomic(t => t.execute(..))` is my favorite so far, so I'll use it in examples. It reasonably means a block that is applied "atomically" (as seen by an out-of-transaction observer), but not a "transaction", so one will need to look in the docs to find out what it does specifically the first time they see it.
- `db.mutate(transaction => transaction.execute(..))`
- `db.apply(transaction => transaction.execute(..))`
- `db.unit_of_work(t => t.execute(..))`
- `db.block(t => t.execute(..))`
- `for transaction in db.retry():` (see Alternatives)
- `db.try(transaction => transaction.execute(..))`
- `db.retry_transaction(t => t.execute(..))`
Also, we should rename `pool.transaction` so it isn't the first thing people pick when they need transactions:

- `with db.raw_transaction() as t:` is my (@tailhook) favourite; it also requires the user to find out what a raw transaction is
- `with db.try_transaction() as t:` also good, as in `for _ in range(10): with db.try_transaction(): ...`
- `with db.plain_transaction() as t:`
- `with db.unreliable_transaction() as t:`
- `with db.single_transaction() as t:`
JavaScript API
The JavaScript API is quite straightforward:
Settings for how many retries and which timeouts should live either in `poolConfig` or in `connectOptions`, depending on how we decide `atomic()` should work on a connection object, or if at all (see Open Questions).

We may also allow an option form: `pool.atomic({retries: 5}, t => {...})`, or `pool.with_options({retries: 5}).atomic(t => {...})`
Exceptions API
Errors thrown by the client should have three methods:

- `err.isTransient()` -- for concurrent-update and deadlock errors
- `err.isNetworkFailure()` -- basically is `true` when the connection `is_closed()` after the exception
- `err.isEarlyNetworkFailure()` -- is `true` when the connection was closed before the request was sent. This is needed to retry `conn.query()` in a non-transaction context

And we should ensure that all network errors are caught and transformed into proper error classes having these methods.

We don't provide a catch-all method like `isRetriable`, because network failures must not be caught during `COMMIT`, so users have to handle the two errors a little bit differently:

Python API
The Python API is harder. We can't use the `with` statement any more, because `with` blocks can't be retried.
We probably want to support two APIs, as a decorator of an in-place function:
And as a function call:
Note: this form forwards all the arguments to the function. We may want to special-case options: `db.atomic(my_tx, options=TransactionOptions(**opts))`.

Also see Alternatives.
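As an illustration of the function-call form described above, here is a runnable sketch. `FakeDB` and `TransientError` are hypothetical stand-ins (the real signatures are an open question); the decorator form would wrap the same retry loop:

```python
import asyncio

class TransientError(Exception):
    """Stand-in for a retryable (concurrent-update/deadlock) error."""

class FakeDB:
    """Hypothetical stand-in for a client pool."""
    def __init__(self, attempts=3):
        self.attempts = attempts

    async def atomic(self, fn, *args, **kw):
        # Function-call form: forwards all arguments to `fn`,
        # retrying the whole block on transient errors.
        for attempt in range(self.attempts):
            try:
                return await fn(self, *args, **kw)  # `self` plays the tx role
            except TransientError:
                if attempt == self.attempts - 1:
                    raise  # out of attempts: propagate

db = FakeDB()
state = {"tries": 0}

async def transfer(tx, amount):
    state["tries"] += 1
    if state["tries"] < 2:           # fail the first attempt
        raise TransientError
    return amount

print(asyncio.run(db.atomic(transfer, 10)))  # → 10, after one retry
```

Forwarding `*args, **kw` is what makes the special-casing of an `options=` keyword necessary, since it must be peeled off before the remaining arguments reach the user's function.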
Exceptions API
Errors thrown by the client should have three methods:

- `err.is_transient()`
- `err.is_network_failure()`
- `err.is_early_network_failure()`

See the JavaScript API section for descriptions.
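A possible shape for these predicates, sketched with a hypothetical `ClientError` that carries boolean flags (the real client may classify errors differently):

```python
class ClientError(Exception):
    """Hypothetical error class using predicate methods instead of subclasses."""
    def __init__(self, msg, *, transient=False, network=False, early=False):
        super().__init__(msg)
        self._transient = transient
        self._network = network
        self._early = early

    def is_transient(self):
        # Concurrent-update and deadlock errors: safe to retry the block.
        return self._transient

    def is_network_failure(self):
        # The connection is closed after the exception.
        return self._network

    def is_early_network_failure(self):
        # The connection was closed before the request was sent,
        # so even a bare conn.query() can be retried safely.
        return self._early and self._network

err = ClientError("connection reset", network=True, early=True)
print(err.is_transient(), err.is_early_network_failure())  # → False True
```

Using predicates keeps the classification orthogonal: one error can be both a network failure and early, without any class hierarchy.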
Open Questions
Transaction on Specific Connection
Do we want `connection.atomic()`, and how should it work?

a) It may only retry on the same connection and fail on disconnect
b) It may reconnect, replace the underlying socket in the connection object, and retry
c) We may only allow transactions on connection pools
Retry Single Queries on Connection Error
To make changing the EdgeDB address or restarting EdgeDB work nicely, we need to retry simple queries on connection errors too:

But there are a couple of issues:

There are a couple of ideas for differentiating read-only and mutable queries:
- `Prepare` (and it's always safe to retry before `Execute` happens)
- `pool.query_atomic(..)` shortcut for `pool.atomic(lambda t: t.query(..))`
- `pool.read_only_query()` method
- `pool.read_only().query()`
- `pool.with_options({readonly: true}).query()`

Note: methods (2-6) are also helpful for working with primary/replica installations. But probably only the last two would allow full power, as they allow `pool.read_only(primary=true)` (i.e. in case you need a read-only transaction that can't go to a replica).

This issue can be solved by a later RFC.
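For instance, the `query_atomic` shortcut could be a thin wrapper over the retryable block; a runnable sketch with a hypothetical `Pool` (the `atomic` body here is a no-op placeholder where the real retry loop would live):

```python
import asyncio

class Pool:
    """Hypothetical pool illustrating the query_atomic shortcut."""
    async def atomic(self, fn):
        # A real implementation would retry `fn` on transient errors here.
        return await fn(self)

    async def query(self, q):
        return f"result of {q}"

    async def query_atomic(self, q):
        # Shortcut: wrap a single query in a retryable block.
        return await self.atomic(lambda t: t.query(q))

print(asyncio.run(Pool().query_atomic("SELECT 1")))  # → result of SELECT 1
```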
Learning Curve
This steepens the learning curve, but:

So while steepening the learning curve, we fix heisenbugs and simplify operations.
Failure Injection
I propose the following enabled by default:
Collect statistics on how many queries were executed in the previous second, and on each new request trigger a failure with probability `1/n`, where `n` is the number of requests in the previous second. We still need to figure out whether `n` counts queries, transactions, or mutable queries/transactions (and have a list of exceptions, perhaps: dump+restore+migrations).

The idea is that there will be ~1 failure per second. So on a local instance when testing manually, it would hit almost every request (which is fine, as repeating them shouldn't be prohibitively costly). But under a huge load of thousands of requests per second, 1 failure per second doesn't influence anything, so even for production and/or benchmarks this is fine.
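The `1/n` rule can be sketched as follows. `FailureInjector` is hypothetical; a real server would maintain a sliding per-second counter and use its own RNG:

```python
import random

class FailureInjector:
    """Inject roughly one artificial failure per second, regardless of load."""
    def __init__(self, rng=random.random):
        self.prev_second_count = 0   # requests observed in the previous second
        self.rng = rng

    def should_fail(self):
        n = max(self.prev_second_count, 1)
        # Probability 1/n per request → expected ~1 failure per second:
        # n requests/second * (1/n) failures/request = 1 failure/second.
        return self.rng() < 1.0 / n

# Deterministic RNG for the example: rng() = 0.0 always triggers.
inj = FailureInjector(rng=lambda: 0.0)
inj.prev_second_count = 1000
print(inj.should_fail())  # → True (0.0 < 1/1000)
```

The expected-value argument is the whole point: the absolute failure rate stays constant at ~1/second whether the instance serves one request or ten thousand.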
I think it should be disabled by an explicit command-line argument like `--disable-failure-injection`, but might be tweaked with configuration settings?

Alternatives
Python Variable Propagation
The function-call API may not support arbitrary `*args, **kw`, and instead rely on `partial`:

In this case we should make the `transaction` argument a keyword argument, because using a positional argument would make optional arguments impossible (skipping an argument would shift the `transaction` arg), and adding another argument during refactoring is very error-prone.

Python Loop/With API
We may also want a `for`/`with` API:

Or the reverse:

There are a number of pros and cons of this compared to the decorator:
Exception API
Instead of `is_transient` and `is_network_failure` we could have subclasses that are tested with `isinstance`/`instanceof`. But this may eventually need multiple inheritance. Multiple inheritance doesn't work in JavaScript. And multiple inheritance and deep class hierarchies are rarely good. And at some point we may want to introduce other helpers like `is_retriable`; that certainly doesn't work without multiple inheritance. (@tailhook: but maybe I like it because it is the Rust way)

Repeating EdgeQL
One may wonder why we can't collect all the queries in the client (or even on the server) and retry them automatically.

The problem is that sometimes writes depend on previous reads:
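As an illustration (hypothetical schema, with a dict standing in for the database), logic like the following cannot be replayed query-by-query, because the `UPDATE` depends on the balance read earlier:

```python
def withdraw(tx, user_id, amount):
    """Read-then-write logic: the write depends on the earlier read.

    If the server replayed only the recorded queries, the UPDATE would be
    re-sent even when the re-read balance no longer passes the `if` check.
    Retrying the whole client-side block re-evaluates the condition.
    """
    balance = tx["balance"]               # SELECT balance FROM users WHERE ...
    if balance >= amount:                 # decision made in client code
        tx["balance"] = balance - amount  # UPDATE users SET balance = ...
        return True
    return False

tx = {"balance": 50}
print(withdraw(tx, 1, 30), tx["balance"])  # → True 20
print(withdraw(tx, 1, 30), tx["balance"])  # → False 20 (insufficient funds)
```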
If two transactions updating money happen to run concurrently, it's possible that the user ends up with a negative balance, even though the code suggests that can't happen (when replaying recorded queries we don't check the `if` again). But if we retry the whole block of code, it works correctly.

Enabling Retries in Connection Options
At least for JavaScript we could keep the API as is, and then use connection configuration to introduce retries:
There are a few problems with this approach:

- `with db.transaction()` as we allow now.
- `transaction` on a connection object: reconnecting on network failures would be an issue

Keep `transaction` as is, but add a Helper for Retrying

The problem with this approach is that it is hard to teach using `atomic` when raw transactions "work on my laptop". However, this is somewhat alleviated by failure injection.