Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simultaneous inserts are not all being saved to database. #98034

Open
cpaczek opened this issue Mar 4, 2023 · 25 comments
Open

Simultaneous inserts are not all being saved to database. #98034

cpaczek opened this issue Mar 4, 2023 · 25 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) X-blathers-triaged blathers was able to find an owner

Comments

@cpaczek
Copy link

cpaczek commented Mar 4, 2023

Describe the problem

We are currently trying to add Cockroach DB support for Strapi (strapi/strapi#12346) however we are having an issue when doing simultaneous inserts where only some of the inserts are being saved to the database.

This issue happens when we are using Promise.all() in Javascript using Knex. All of the insert statements get executed and committed and non of them get rolled back however only some of them are actually added to the database.

To Reproduce

We are running a single node cluster and running the following code

    console.log("racecheck service");
    let arr = [];
    for (let i = 0; i < 10; i++) {
      arr.push(i.toString());
    }
    Promise.all(
      arr.map((i) =>
        strapi.entityService.create("api::blog.blog", { data: { tester: i, comments: [1]}})
      )
    );

This code does a few things, it first creates new blog entries on the blog table then creates a new entry on comments_blogs_links table which creates the relation between blogs and comments.

Lower down on the code here is where it's inserting the new row on the links table.

https://github.com/strapi/strapi/blob/127c04705751139bb32805e79b6f8619d92b6d34/packages/core/database/lib/entity-manager/index.js#L605-L634

Specfically this line

await this.createQueryBuilder(joinTable.name).insert(insert).transacting(trx).execute();

The weird thing is that all of the queries are hitting the database. You can view the logs here:
https://gist.github.com/cpaczek/5b1d253ba8cde243e074826276c38e4c

If you search (CTRL + F) for INSERT INTO ‹\"\"›.‹\"\"›.‹comments_blogs_links› you can see that we are inserting 10 entries into the database but as you can see below only 2 of them actually get inserted. This behaviour can not be replicated on postgres, MySQL, sqlite, or mariadb.

image

You can also see that the ids are being skipped so it seems like they are being removed right after they are inserted or something.
Expected behavior
All Rows get inserted

Environment:

  • CockroachDB version [22.1.4] (but also tested this on the latest version of CRDB)
  • Server OS: [Windows and Docker(Ubuntu)]

Jira issue: CRDB-25033

@cpaczek cpaczek added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 4, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 4, 2023

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @cockroachdb/sql-sessions (found keywords: Knex)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Mar 4, 2023
@Xiang-Gu
Copy link
Contributor

Xiang-Gu commented Mar 5, 2023

Thanks for reporting the issue and I'm adding it to session's triage;

I'm a complete noob with javascript, but I am wondering what if you try

  • trx('table').insert(...) style, and
  • await this.createQueryBuilder(joinTable.name).insert(insert).execute(); style (without transacting); I wanna try this because we are doing single-statement-transaction anyway so even without transacting, CRDB should treat a single insert statement in a implicit transaction.

@Xiang-Gu Xiang-Gu added this to Triage in SQL Sessions - Deprecated via automation Mar 5, 2023
@blathers-crl blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Mar 5, 2023
@cpaczek
Copy link
Author

cpaczek commented Mar 5, 2023

Thanks for reporting the issue and I'm adding it to session's triage;

I'm a complete noob with javascript, but I am wondering what if you try

  • trx('table').insert(...) style, and
  • await this.createQueryBuilder(joinTable.name).insert(insert).execute(); style (without transacting); I wanna try this because we are doing single-statement-transaction anyway so even without transacting, CRDB should treat a single insert statement in a implicit transaction.

Ya I can try taking it out of a transaction, but what I don't understand is that if you look at the logs all the inserts are hitting the db, not sure why they would be deleted. The transaction never got rolled back either.

@cpaczek
Copy link
Author

cpaczek commented Mar 5, 2023

@Xiang-Gu SO I took out the .transacting and it works so it's something to do with a transaction. I'll do a bit more research

@cpaczek
Copy link
Author

cpaczek commented Mar 5, 2023

So did some debugging and it turns out that using a transaction with the select query makes it so that some of the inserts don't apply. This could either be a cockroach db bug (or maybe its intended) or maybe this is an issue with the implementation with the cockroach dialect on knex?

Here is what is breaking it

          const maxResults = await db
            .getConnection()
            .select(inverseJoinColumn.name)
            .max(inverseOrderColumnName, { as: 'max' })
            .whereIn(inverseJoinColumn.name, relIdsToadd)
            .where(joinTable.on || {})
            .groupBy(inverseJoinColumn.name)
            .from(joinTable.name)
            .transacting(trx); // <-- Using a transaction. Removing this fixes the issue.
        // omitted code
        // insert new relations
        await this.createQueryBuilder(joinTable.name)
             .insert(insert)
             .transacting(trx) // <-- Using the same transaction
             .execute();

Not sure if this is a Strapi issue, knex issue or Cockroach issue. This code works with MySql, postgres, sqlite, and maria so I'm leaning towards it being knex or cockroach but I'm not sure. Either way we just removed the select from the transaction as there is no point in adding a read query inside of a transaction.

cc: @kibertoad Do you think I should open an issue on knex for this?

@kibertoad
Copy link

@cpaczek please do!

@cpaczek
Copy link
Author

cpaczek commented Mar 6, 2023

Removing .transacting is not an option because the pool will fill up. Still unsure if this is a knex issue or cockroach. But here is a minimal reproduction using knex.

https://github.com/cpaczek/knex-test/

@cpaczek
Copy link
Author

cpaczek commented Mar 6, 2023

Here is my theory and if someone with more cockroach experience could let me know if it is valid that would be appreciated.

Basically here is what we are doing

Add Data:
    Begin transaction (Begin Transaction)
    Select a value from a column  (Select Value)
    Do some javascript (JavaScript)
    Insert a row into a table that changes the value that we got from the previous select. (Insert Data)
    Commit Transaction (Commit Transaction)

// We are running this "add data" function simultaneously i.e

Add Data[0]
Add Data[1]

// What I think is happening is the following

(Begin Transaction)[0]
(Begin Transaction)[1]
(Select Value)[0]
(Select Value)[1]
(JavaScript)[0]
(JavaScript)[1]
(Insert Data)[0] <- Using data from (SV)[0]
(Commit Transaction)[0]
(Insert Data)[1] <- Using (SV)[0], This value doesn't get inserted because the data it got from (SV)[1] was changed by (INS)[0]
(Commit Transaction)[1]

Is this how Cockroach works? Will it stop an insert because data within the same transaction changed? Is this an intended feature?

Is there a way to disable this behavior? One solution would to be put the select and insert in different transactions however the way the code is organized would be a very difficult refactor.

@rafiss
Copy link
Collaborator

rafiss commented Mar 6, 2023

In the example you provided in cpaczek/knex-test, many of the transactions are failing. The error is

restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: intent on key /Table/107/1/845630329282461697/0): "sql txn" meta={id=b1fae415 key=/Table/107/1/845630329282887681/0 pri=0.00446797 epo=0 ts=1678136296.386659000,1 min=1678136296.383969000,0 seq=1} lock=true stat=PENDING rts=1678136296.383969000,0 wto=false gul=1678136296.883969000,0

Code: 40001
See: https://www.cockroachlabs.com/docs/v23.1/transaction-retry-error-reference.html#retry_serializable

As the docs at that link indicate, transaction retry errors can occur during normal operation of CockroachDB if there are multiple queries running concurrently that modify the same table. Applications and frameworks should have logic to retry errors with an error code of 40001. (If you run your test against PostgreSQL in SERIALIZABLE isolation mode, you will probably see the same behavior.

Another issue here though is: Why did Strapi not return any error messages to the end-user that say that the transaction failed? I confirmed that CockroachDB itself is returning errors.

@cpaczek
Copy link
Author

cpaczek commented Mar 6, 2023

@rafiss
Where did you get that error from? When I do the test no errors get logged. Because if a transaction fails shouldn't it be triggering the try-catch block?

@rafiss
Copy link
Collaborator

rafiss commented Mar 6, 2023

I got the error by setting up Wireshark to capture all the network traffic that is sent from CockroachDB to the client.

When I do the test no errors get logged. Because if a transaction fails shouldn't it be triggering the try-catch block?

Yes, it should, and I agree that's what the problem is here. An error is returned by CockroachDB, but it is not noticed by the client.

Does the client correctly handle errors that are returned by the COMMIT statement? If the COMMIT fails, that means the transaction was not successfully committed.

@cpaczek
Copy link
Author

cpaczek commented Mar 6, 2023

Does the client correctly handle errors that are returned by the COMMIT statement? If the COMMIT fails, that means the transaction was not successfully committed.

@rafiss

Knex isn't even throwing an error. Are you sure the transaction is failing?

I've updated my test so now it runs both sqlite and cockroach db at the same time. https://github.com/cpaczek/knex-test/tree/test-crdb-and-sqlite

If you remove this query from the addRows function, all the rows get inserted but because this select is in there it fails most of the inserts.

Comment out this

const maxResults = await knex
      .select("comment_id")
      .max("blog_order", { as: "max" })
      .whereIn("comment_id", [1])
      .where({})
      .groupBy("comment_id")
      .from("knex_test")
      .transacting(trx);

    const max = maxResults.length === 0 ? 0 : parseInt(maxResults[0].max) + 1;

Edit This

    .insert({
  blog_id: blog_id,
  blog_order: max, // <- Change this to a constant value like 1 or use blog_id.
  comment_id: 1,
  comment_order: 1,
})

Not sure why knex isn't logging the error though, all other errors get logged perfectly fine.

@Bessonov
Copy link

Bessonov commented Mar 6, 2023

Hey @cpaczek , just to help you a little bit to understand the broader context. CRDB runs transactions in serializable isolation level, which avoids most if not all possible anomalies. The downside of that is that instead of unexpected anomalies, you get this error. You should handle this in your client code. I don't see a good nodejs example, but you can look at the java example (search for 40001). Basically and in most cases, you should just retry your transaction.

To reproduce this error with knex and postgres you can set your transaction level manually:

				.transaction()
				.setIsolationLevel('serializable')

To learn more about the isolation levels and anomalies, see https://jepsen.io/consistency .

@Bessonov
Copy link

Bessonov commented Mar 6, 2023

@cpaczek

Knex isn't even throwing an error. Are you sure the transaction is failing?

Sure, it do. You just ignore it:

Promise.all(arr.map((i) => addRows(i)));

You miss await here:

await Promise.all(arr.map((i) => addRows(i)));

@cpaczek
Copy link
Author

cpaczek commented Mar 6, 2023

You miss await here:

I have another try catch in addRows and it doesn't throw an error. Also adding an await to the promise.all doesn't fix it either.

https://github.com/cpaczek/knex-test/blob/16fc902cebb15b74a52d85b0697e49a09d3f5831/index.js#L53-L81

image

If I can figure out why its not throwing an error I can look into configured retry savepoint to retry each transaction.

@Bessonov
Copy link

Bessonov commented Mar 6, 2023

You still haven't fixed it:
https://github.com/cpaczek/knex-test/blob/16fc902cebb15b74a52d85b0697e49a09d3f5831/index.js#L35-L36

Until you await the right thing, you can't expect proper error handling.

@cpaczek
Copy link
Author

cpaczek commented Mar 6, 2023

@Bessonov
Copy link

Bessonov commented Mar 7, 2023

I don't see any error in the code anymore. I would suggest:

Try pg dialect:

	client: 'pg',
	version: '9.6',

And remove the nested catch here.

@rafiss
Copy link
Collaborator

rafiss commented Mar 7, 2023

Also, could you try the suggestion above to run against PG in serializable mode? That would at least help us narrow down the problem.

@cpaczek
Copy link
Author

cpaczek commented Mar 7, 2023

I don't see any error in the code anymore. I would suggest:

Try pg dialect:

	client: 'pg',
	version: '9.6',

And remove the nested catch here.

I tried using the pg dialect and removed the nested catch, still have the same issue. No errors were logged and only some of the rows get added. Will try Postgres in serializable mode next

@cpaczek
Copy link
Author

cpaczek commented Mar 7, 2023

@rafiss
You are correct, I get an error when running in serializable mode:
https://github.com/cpaczek/knex-test/tree/pg-in-serializable-mode

image

And it works when not in serializable mode:
image

Now I just need to figure out why cockroach db isn't throwing an error to the client whereas postgres is. This could be a knex issue.

Is it bad practice to change the isolation level for cockroach to mimic postgres? i.e SET default_transaction_isolation

@cpaczek
Copy link
Author

cpaczek commented Mar 7, 2023

Was just reading about isolation levels
CockroachDB executes all transactions at the strongest ANSI transaction isolation level: SERIALIZABLE. All other ANSI transaction isolation levels (e.g., SNAPSHOT, READ UNCOMMITTED, READ COMMITTED, and REPEATABLE READ) are automatically upgraded to SERIALIZABLE.

does this mean that it's not possible to use an isolation level other than serializable? i.e SET default_transaction_isolation = "READ COMMITTED"; won't work?

@Bessonov
Copy link

Bessonov commented Mar 7, 2023

Now I just need to figure out why cockroach db isn't throwing an error to the client whereas postgres is. This could be a knex issue.

I have never seen this problem with knex. But I have used the pg dialect.

does this mean that it's not possible to use an isolation level other than serializable?

Yes. I have created #36208 for another one.

@cpaczek
Copy link
Author

cpaczek commented Mar 7, 2023

I have never seen this problem with knex. But I have used the pg dialect.

No error gets logged when using cockroach with PG Dialect or Cockroach Dialect. The error only gets logged when using Postgres.

@Boegie19
Copy link

Boegie19 commented May 8, 2023

Is there a timeline for a fix for this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) X-blathers-triaged blathers was able to find an owner
Projects
No open projects
Development

No branches or pull requests

6 participants