Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

r2d2 doesn't work well with diesel sqlite - register_diesel_sql_functions probably locks database #2365

Open
2 tasks done
kpcyrd opened this issue Apr 15, 2020 · 9 comments
Open
2 tasks done

Comments

@kpcyrd
Copy link
Contributor

kpcyrd commented Apr 15, 2020

Setup

I'm writing a webapp that uses actix, r2d2, diesel and sqlite. r2d2 is used to keep a generic pool of diesel handles that are then handed out to functions handling concurrent requests.

Full code can be found here: https://github.com/diesel-rs/diesel/blob/00f7de703aea02659bddad8d168b9fe6d4aee276/diesel/src/sqlite/connection/mod.rs

Versions

  • Rust: 1.42.0
  • Diesel: 1.4.4
  • Database: sqlite
  • Operating System Linux

Feature Flags

  • diesel: ["sqlite", "r2d2", "chrono"]

Problem Description

Starting the app already causes r2d2 to lock up (which I've worked around with retries), but write queries still fail randomly afterwards.

I'm executing some fluff to make sqlite work well with multiple threads:

        let c = SqliteConnection::establish(database_url)?;

        c.batch_execute("
            PRAGMA journal_mode = WAL;          -- better write-concurrency
            PRAGMA synchronous = NORMAL;        -- fsync only in critical moments
            PRAGMA wal_autocheckpoint = 1000;   -- write WAL changes back every 1000 pages, for an in average 1MB WAL file. May affect readers if number is increased
            PRAGMA wal_checkpoint(TRUNCATE);    -- free some space by truncating possibly massive WAL files from the last run.
            PRAGMA busy_timeout = 250;          -- sleep if the database is busy
            PRAGMA foreign_keys = ON;           -- enforce foreign keys
        ").map_err(ConnectionError::CouldntSetupConfiguration)?;

        Ok(Self(c))

This seems to lock the database somehow. Assuming the PRAGMA doesn't lock I've looked into SqliteConnection::establish:

    fn establish(database_url: &str) -> ConnectionResult<Self> {
        use crate::result::ConnectionError::CouldntSetupConfiguration;

        let raw_connection = RawConnection::establish(database_url)?;
        let conn = Self {
            statement_cache: StatementCache::new(),
            raw_connection,
            transaction_manager: AnsiTransactionManager::new(),
        };
        conn.register_diesel_sql_functions()
            .map_err(CouldntSetupConfiguration)?;
        Ok(conn)
    }

It seems that it's always executing register_diesel_sql_functions before my PRAGMA are executed, which means if this causes a write it's going to lock the database for everybody. It seems It's executing CREATE TRIGGER among other things.

Do you have any suggestions on how to get this to work correctly?

What are you trying to accomplish?

Write a webapp with actix that uses diesel with sqlite.

Are you seeing any additional errors?

No

Steps to reproduce

My full code base is public, but the issue should be trivial to reproduce by copying the relevant r2d2+diesel code.

Checklist

  • I have already looked over the issue tracker for similar issues.
  • This issue can be reproduced on Rust's stable channel. (Your issue will be
    closed if this is not the case)
@weiznich
Copy link
Member

weiznich commented Apr 16, 2020

Can you reproduce this behaviour without any of those PRAGMA statements?

It seems that it's always executing register_diesel_sql_functions before my PRAGMA are executed, which means if this causes a write it's going to lock the database for everybody. It seems It's executing CREATE TRIGGER among other things.

It's right that register_diesel_sql_functions is executed before any user provided code, but this call solely registers a custom sql function via sqlite3_create_function_v2. As far as I know this is something that needs to happen on per connection basis which implies it does not have any shared state, which implies that it cannot create a lock somewhere. The CREATE TRIGGER statements are only executed if you actually use the diesel_manage_updated_at('some_table') function, for example in a migration. So this does not happen at this location.

In general I advice not using sqlite in such a multithreaded context, because it's basically not designed for that use case. Use a full blown dbms like postgresql instead.

@anthraxx
Copy link

In general I advice not using sqlite in such a multithreaded context, because it's basically not designed for that use case. Use a full blown dbms like postgresql instead.

While true for highly multithreaded environments that expect to have concurrent writers to a database with fine graded locking, this is not quite true as a generalized statement but needs to be evaluated per use case. SQLite is perfectly well designed for concurrent systems in WAL mode with lots of readers and just a few writers that can live with being gated to a single writer having append access to the WAL which includes webapps if they fit this prerequisite.

@kpcyrd
Copy link
Contributor Author

kpcyrd commented Apr 16, 2020

Looking into this some more, it seems the batch_execute is failing due to a locked database, and I'm wondering if it's executing an implicit BEGIN or something similar.

@weiznich
Copy link
Member

Looking into this some more, it seems the batch_execute is failing due to a locked database, and I'm wondering if it's executing an implicit BEGIN or something similar.

batch_execute does not execute any implicit BEGIN or any other custom not user provided sql statement. It basically just passes the user provided sql string down to sqlite3_exec.
My debugging strategy here would trying the following things:

  • Try if the block happens without the batch_execute call, if so it's a issue in the diesel setup procedure.
  • Try if the block happens with a trivial query like SELECT 1 as batch_execute query, if so it's caused by sqlite3_exec
  • Try readding the PRAGMA statements one by one. If the block is happening there, it's caused by one of the PRAGMA's. In that case try executing that PRAGMA as last one in a separate call.

While true for highly multithreaded environments that expect to have concurrent writers to a database with fine graded locking, this is not quite true as a generalized statement but needs to be evaluated per use case. SQLite is perfectly well designed for concurrent systems in WAL mode with lots of readers and just a few writers that can live with being gated to a single writer having append access to the WAL which includes webapps if they fit this prerequisite.

I didn't mean that it is not possible to use SQLite in such an environment, only that it is in general easier to just use PostgreSQL here, as SQLite requires some configuration to make this even work and PostgreSQL just works out of the box.

@kpcyrd
Copy link
Contributor Author

kpcyrd commented Oct 29, 2020

I've created a repository that can be used to reproduce this, it's creating a database, spawns 25 threads that increment one field 100 times each and then verifies the field has the expected value. The sql executed is exactly the same, but diesel returns errors for some function calls while rusqlite does not. Removing the PRAGMAs causes both tests to fail.

https://github.com/kpcyrd/rust-diesel-bug-2365

Postgres is only easier for the developer, the user would have to install and configure postgres on their own which would make the setup significantly harder.

@kpcyrd
Copy link
Contributor Author

kpcyrd commented Oct 30, 2020

After some more debugging, rusqlite automatically sets a busy_timeout for 5s. I've prepared a patch to do the same and it fixes the example above. Would you be interested in a patch like that?

@weiznich
Copy link
Member

Thanks for providing the reproducible example. According to the sqlite documentation executing PRAGMA busy_timeout = x; should be equivalent to calling sqlite3_busy_timeout(inner_connection, x); so it is not obviously clear for me how setting the timeout as part of the connection setup will fix the underlying issue. In the end that would be not different than what you are already doing in the provided test case.
To understand this a bit better I played a bit around with your test case and I think I've found another potential fix:

Here you are setting all your pragmas at once. The same thing is done here for the rusqlite connection type. Those look equivalent, but as you've already noticed rusqlite sets a default busy timeout of 5s as part of their connection setup here. As a side note: I find it questionable to set this without documenting it somewhere as this can affect performance in quite a lot of cases or can cause data loose in case of a service crash while some "important" query is waiting for the busy timeout.
So with that all known I just tried to set the busy timeout separately before setting the other pragmas, basically applying the following patch to your example:

diff --git a/src/main.rs b/src/main.rs
index a376adf..10a8ea4 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -101,6 +100,7 @@ struct Diesel {}
 impl Testable for Diesel {
     fn run() -> Result<()> {
         let db = diesel::SqliteConnection::establish(PATH)?;
+        db.execute("PRAGMA busy_timeout = 5000;")?;
         db.batch_execute(SETUP)
             .context("Failed to execute PRAGMAs")?;
         for _ in 0..LOOPS {

to make the code equivalent to the corresponding rusqlite code. Long story short: This fixes the locking issue, now everything is running fine.

So the remaining question is: "Should we do that as part of our connection setup or not?"
I think the answer here is no we shouldn't do that because the busy timeout can affect certain kind of use-cases (for example applications with near real time requirements) or can cause data loose. (I'm happy to hear different opinions here).
As possible actionable consequence of this I propose that the documentation of SqliteConnection should have a paragraph about how to setup sqlite in a multi threaded environment with more than one connection. (Happy to get contributions there)

@mladedav
Copy link

mladedav commented May 8, 2022

Hi, I'm not as knowledgable and this is an old issue I have stumbled upon, so I'll add just a bit.

I have tried the test here and for me simply adding the pragma does not help. Furthermore, it consistently fails for both diesel and rustls for me. I am running it on Windows and am using bundled versions of the crate, but I don't think that should have an effect.

What does work for me is running the setup once before the test begins. I think that the issue is that changing journal mode to WAL needs exclusive lock and it seems to ignore the busy timeout option. With just that one operation in setup both rustls and diesel inetermittenly failed.

So I don't think that this can be solved without retrying. I don't know if that is something that should be handled by diesel.

When I run the setup once before starting the test it acts as described, i.e. rusqlite works everytime and diesel sometimes fails to write pragmas. This is indeed fixed with the busy wait.

I personally view diesel as pretty high level (but I might be wrong here) so I would expect it to set reasonable settings that would prevent errors like this if it can. Maybe 5 seconds is too much, but I think having it set a busy timeout (which could also be configurable) is not a bad thing. But I would rather have it safer by default.

That said I don't understand how setting busy wait should cause data loss. I would rather expect it the other way around when writes fail and users don't check for failure.

@weiznich
Copy link
Member

weiznich commented May 9, 2022

Can you clarify if you refer to the original posted example or to the "fixed" version with the patch from the post above applied. Because I cannot reproduce the issue with the above patch applied. If it's the second case: Please provide an reproducible example that we can use to understand your bug.

I personally view diesel as pretty high level (but I might be wrong here) so I would expect it to set reasonable settings that would prevent errors like this if it can. Maybe 5 seconds is too much, but I think having it set a busy timeout (which could also be configurable) is not a bad thing. But I would rather have it safer by default.

So I don't think that this can be solved without retrying. I don't know if that is something that should be handled by diesel.

Retrying is something that should always be decided on application layer, therefore that's nothing that should be part of diesel itself in my opinion. We try to return meaningful error types there so that you can match on them and retry stuff if that's meaningful for your use case.

That really depends on the use case. It might be fine for your use case, but that's not necessarily true for other use cases, where you really want to know that a query is applied fast or discard the result. If you feel that the default busy timeout should be something different I would argue that the correct place to change that would be in sqlite itself.

That said I don't understand how setting busy wait should cause data loss. I would rather expect it the other way around when writes fail and users don't check for failure.

A to large busy timeout does not necessarily directly cause data loss, but it could block operations longer than expected, which in turn could break other things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants