Two separate bugs are being fixed here. The first was in
The second bug involves a race between deletions and opens, combined with the state of couch_server's mailbox. A very specific set of operations has to happen in a particular order to trigger this bug:
The underlying issue here is that the deletion request clears the
The fix was just to ensure that the opener pid in the
The second of three commits adds a failing test case that is fixed by the third commit.
It's possible for a busy couch_server, given a specific ordering and timing of events, to end up with an open_async message in its mailbox while a new and unrelated open_async process is spawned. This change just ensures that we ignore any old messages we encounter in the mailbox. The underlying issue here is that a delete request clears out the state in our couch_dbs ets table without clearing out the corresponding state in the message queue. In some fairly specific circumstances this leads to the stale message in the mailbox satisfying an ets entry for a newer open_async process. This change just adds a match on the opener process. Anything unmatched came before the current open_async request, which means it should be ignored.
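For readers who want to see the shape of the race without reading the Erlang, here is a toy Python model of it. This is only a sketch under simplifying assumptions, not couch_server's actual code: `ToyServer`, `replay`, the integer opener ids, and the `handle-N` strings are all hypothetical names invented for illustration. A dict stands in for the couch_dbs ets table and a deque stands in for the mailbox.

```python
from collections import deque


class ToyServer:
    """Hypothetical stand-in for a server tracking in-flight async opens."""

    def __init__(self, match_opener):
        self.match_opener = match_opener  # True models the fixed behaviour
        self.entries = {}                 # db name -> id of the current opener
        self.mailbox = deque()            # queued (name, opener, result) messages
        self.opened = {}                  # db name -> result accepted by the server
        self.next_opener = 0

    def open_async(self, name):
        # Spawn a new "opener" and record it as the one we are waiting on.
        self.next_opener += 1
        self.entries[name] = self.next_opener
        return self.next_opener

    def opener_done(self, name, opener, result):
        # The opener reports back; the message waits in the mailbox.
        self.mailbox.append((name, opener, result))

    def delete(self, name):
        # Clears the table state but leaves queued messages untouched.
        self.entries.pop(name, None)
        self.opened.pop(name, None)

    def process_mailbox(self):
        while self.mailbox:
            name, opener, result = self.mailbox.popleft()
            current = self.entries.get(name)
            if current is None:
                continue  # nothing is waiting on this name
            if self.match_opener and current != opener:
                continue  # stale message from an earlier opener: ignore it
            del self.entries[name]
            self.opened[name] = result


def replay(match_opener):
    """Replay the problematic ordering and return the result the server accepts."""
    srv = ToyServer(match_opener)
    first = srv.open_async("db")
    srv.opener_done("db", first, "handle-1")   # queued, not yet processed
    srv.delete("db")                           # table cleared, mailbox not
    second = srv.open_async("db")
    srv.opener_done("db", second, "handle-2")
    srv.process_mailbox()
    return srv.opened.get("db")


if __name__ == "__main__":
    print("without opener match:", replay(False))
    print("with opener match:   ", replay(True))
```

Without the opener match, the queued message from the first opener satisfies the entry created for the second one. With the match, that message is skipped and only the current opener's result is accepted.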
The issue is reproducible with this script (in a remsh):
When run like this for example:
It might help to have a few dbs open and some activity, like say a bunch of replication jobs running as well.
The node would crash and possibly restart, leaving something like this in the log:
nickva left a comment
Awesome find and fix @davisp!
Checked that the fuzzer script that could reproduce the issue can't reproduce it any more.