This repository has been archived by the owner on Mar 9, 2019. It is now read-only.

Clarify transaction concurrency #392

Closed
calmh opened this issue Jun 14, 2015 · 9 comments

Comments

@calmh

calmh commented Jun 14, 2015

The readme says:

Read-only transactions and read-write transactions should not depend on one another and generally shouldn't be opened simultaneously in the same goroutine. This can cause a deadlock as the read-write transaction needs to periodically re-map the data file but it cannot do so while a read-only transaction is open.

However it seems transactions are protected by an RWMutex. That means it's impossible to open a write transaction (from any goroutine) while there is a read transaction still open and vice versa; also opening new read transactions is blocked while a write transaction is waiting to begin. Am I misunderstanding how this works?

@MJDSys
Contributor

MJDSys commented Jun 14, 2015

From what I saw of the discussions about this, there is no locking between read and write transactions, as long as the database file doesn't need to grow. Taking a quick glance at the code, this appears to be the case as the write transactions don't pick up any exclusive locks that the read transactions do.

There is a Mutex (not a RWMutex!) called rwlock, but that is only taken in the write path.

@benbjohnson
Member

There are only a couple places where locks are used:

  1. When beginning a transaction, the DB.metalock is used to add the new Tx to the list of currently open transactions and to copy the current meta page to the Tx.
  2. When closing a transaction, the DB.metalock is used to remove the Tx from the list of currently open transactions.
  3. When a read-write transaction begins it obtains the DB.rwlock to ensure that two read-write transactions can't happen at the same time. It releases the lock when it closes.
  4. When a read-only transaction is open it obtains a read-lock from DB.mmaplock to ensure that the mmap can't be remapped while the transaction is open.
  5. When a read-write transaction fills up the data file, it has to remap the mmap to resize it larger. When this happens, it obtains a write-lock on the DB.mmaplock.
  6. When stats are read or updated, the DB.statslock is obtained.

Most of the time Bolt operates with very little contention because the meta lock is obtained very briefly just to update the list of current transactions and to copy the meta page. Once a read transaction is started it doesn't obtain any additional locks until it closes.
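
To make the meta lock bookkeeping in points 1 and 2 concrete, here is a minimal sketch using only the standard library. It is a toy model, not Bolt's code: `metaDB`, `toyTx`, `begin`, `close`, and `openCount` are hypothetical names, and the `meta` struct stands in for the real meta page.

```go
package main

import (
	"fmt"
	"sync"
)

// meta stands in for Bolt's meta page (hypothetical, heavily simplified).
type meta struct{ txid uint64 }

// toyTx holds its own copy of the meta taken when it began.
type toyTx struct {
	db   *metaDB
	meta meta
}

// metaDB models only the metalock and the list of open transactions.
type metaDB struct {
	metalock sync.Mutex
	meta     meta
	txs      []*toyTx
}

// begin takes the metalock briefly to copy the meta and register the tx.
func (db *metaDB) begin() *toyTx {
	db.metalock.Lock()
	defer db.metalock.Unlock()
	tx := &toyTx{db: db, meta: db.meta}
	db.txs = append(db.txs, tx)
	return tx
}

// close takes the metalock briefly to remove the tx from the open list.
func (tx *toyTx) close() {
	db := tx.db
	db.metalock.Lock()
	defer db.metalock.Unlock()
	for i, t := range db.txs {
		if t == tx {
			db.txs = append(db.txs[:i], db.txs[i+1:]...)
			break
		}
	}
}

// openCount reports how many transactions are currently registered.
func (db *metaDB) openCount() int {
	db.metalock.Lock()
	defer db.metalock.Unlock()
	return len(db.txs)
}

func main() {
	db := &metaDB{meta: meta{txid: 1}}
	a, b := db.begin(), db.begin()
	fmt.Println("open after two begins:", db.openCount())
	a.close()
	b.close()
	fmt.Println("open after closing both:", db.openCount())
}
```

In this model the lock is held only for the copy and the list update, which is the "very briefly" part described above; nothing is held while the transaction body runs.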

However, the reason why you can't safely run a read-only and read-write transaction in the same goroutine is because of mmap remapping. Once the data file fills up, it has to close and reopen the mmap with a new size. This causes the memory addresses of all read transactions to change so we have to wait until all read transactions finish which is why we obtain a write lock on DB.mmaplock. It forces all readers to finish and won't let any more start until the remap is done.

Let me know if that answers your question.

@calmh
Author

calmh commented Jun 14, 2015

I'm clearly missing something. To demonstrate what I'm talking about, here's a test case that tries to start a write transaction on one goroutine while a read transaction is alive on another goroutine. It deadlocks on the RWMutex at https://github.com/boltdb/bolt/blob/master/db.go#L210. How should I be doing this?

https://gist.github.com/8dd3c9548677b12cadc3

@calmh
Author

calmh commented Jun 14, 2015

Actually, your last paragraph

However, the reason why you can't safely run a read-only and read-write transaction in the same goroutine is because of mmap remapping. Once the data file fills up, it has to close and reopen the mmap with a new size. This causes the memory addresses of all read transactions to change so we have to wait until all read transactions finish which is why we obtain a write lock on DB.mmaplock. It forces all readers to finish and won't let any more start until the remap is done.

sounds like what I see, but I don't get the part about "in the same goroutine" - it sounds and seems like no write transactions can start while any read transaction is ongoing?

@calmh
Author

calmh commented Jun 14, 2015

I guess my test case is a case of a read transaction depending on a write transaction, as stated in the README. However this is the case in less trivial situations that I don't know how to avoid. For example, in my actual use case, two devices exchange index information when they connect over the network. Sending this index information is done by iterating over a bucket inside a db.View(...), packetizing that data and sending it over the network.

The receiving side gets that data, batches it up and processes it, and in the end wants to store it with a db.Update(...). Since this happens simultaneously in both directions, we end up needing to perform a db.Update() while the db.View() is running. If the db.Update() blocks, it means we can't receive more data from the network, which means the other side's db.View() stalls, and there is a deadlock...

@benbjohnson
Member

You're correct that it's better to say, "read transaction depending on a write transaction". There have been several issues posted where people had the read and write transactions in the same goroutine so it'd been easier to phrase it in that context.

As for your use case, wouldn't the devices be on separate machines and therefore separate Bolt instances so it wouldn't block?

Also, can the View() finish once it has sent its data? Long-running read transactions can be problematic in Bolt because they can cause the data file to grow, since their dependent pages can't be reused.

@calmh
Author

calmh commented Jun 14, 2015

But that's just the thing - the inability of one goroutine to start an Update() means it can't receive data, which means the other machine can't send more data, which means View() on that machine will never terminate, which means no Update() can run on that machine, which means View() on the first machine will never terminate. :)

So my read transaction finishing depends on a write transaction for that data happening on another machine, and vice versa. The amount of data shuffled can be significant, so just grabbing it all into RAM and then sending isn't feasible.

But disregarding my scenario, there seems to be some confusion here as the first two answers above are "there is no locking between read and write transactions" and "There are only a couple places where locks are used" but it seems we've determined there's no scenario where Update() can run when there's any other kind of transaction running, ever, because starting Update() means grabbing a lock that other read and write transactions also hold (the RWMutex on db.go#210)?

@benbjohnson
Member

there's no scenario where Update() can run when there's any other kind of transaction running, ever, because starting Update() means grabbing a lock that other read and write transactions also hold (the RWMutex on db.go#210)

The mmaplock on db.go#210 is only obtained when the mmap is remapped. That happens the first time the data file is opened and then every time the data file size doubles (up until 1GB and then the DB grows by 1GB increments). Most Update() calls will not have to remap so in general read-only and read-write calls only lock the metalock briefly when they start and finish.
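
The growth schedule described above (double until 1GB, then grow in 1GB steps) can be sketched as a small function. This is an illustration of the description only, not Bolt's actual sizing code: `nextMmapSize` is a hypothetical name and the 32KB starting size is an assumption made for the example.

```go
package main

import "fmt"

const (
	minSize = 1 << 15 // assumed starting size (32KB), for illustration only
	maxStep = 1 << 30 // 1GB: doubling stops here, growth continues in 1GB steps
)

// nextMmapSize returns the smallest size in the schedule that can hold
// need bytes: a power-of-two size up to 1GB, or a multiple of 1GB above that.
func nextMmapSize(need int64) int64 {
	for size := int64(minSize); size <= maxStep; size *= 2 {
		if need <= size {
			return size
		}
	}
	// Above 1GB, round up to the next multiple of 1GB.
	if rem := need % maxStep; rem != 0 {
		need += maxStep - rem
	}
	return need
}

func main() {
	for _, need := range []int64{1, 40_000, 1 << 30, 1<<30 + 1} {
		fmt.Printf("need %d bytes -> map %d bytes\n", need, nextMmapSize(need))
	}
}
```

Under this schedule a remap, and therefore the exclusive mmaplock, is only needed when a write pushes the file past the currently mapped size, which is why most Update() calls never touch it.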

The problem is that we have to specify a size for the mmap when we open it. LMDB requires the caller to specify the size explicitly whereas Bolt chooses to remap incrementally. Bolt could add a fixed size option (similar to LMDB) but it hasn't been an issue in general.

@calmh
Author

calmh commented Jun 15, 2015

I'm not sure I understand what it takes to re-mmap the file then. For example

    // Create a database and create a bucket and a value...

    // Start a long running read

    var wg sync.WaitGroup
    wg.Add(1)

    go func() {
        db.View(func(tx *bolt.Tx) error {
            wg.Done()
            select {} // This read takes a long time... We're sending stuff over the network or something.
        })
    }()

    // Make sure the read has started
    wg.Wait()

    // Perform a nil update
    db.Update(func(tx *bolt.Tx) error {
        return nil
    })

    panic("this is never reached")

never reaches the panic. I can't seem to reproduce a case where even an empty update can succeed without deadlocking when there's a read in progress?

But anyway, this is academic at this point. If anything, I wish this were somewhat clearer up front, as it means the concurrency is at best nondeterministic.
