New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix couch server race condition #1596

Merged
merged 3 commits into from Sep 6, 2018

Conversation

Projects
None yet
4 participants
@davisp
Member

davisp commented Sep 6, 2018

Overview

There are two separate bugs here being fixed. The first was in couch_server:terminate/2. If there was an active open_async process when couch_server died it would throw a function clause when it tried to shutdown the database that wasn't yet open. We just filter out any #entry{} record to avoid the issue. This a simple bug but it prevented couch_server from properly generating a SASL report that ended up covering up the second more complex race condition.

The second bug has to deal with a race between deletions and opens as well as the state of couch_server's mailbox. There's a very specific set of operations that have to happen in a particular order to trigger this bug:

  1. An in-flight open or creation request that will ultimately succeed. The '$gen_call' message must be in couch_server's message queue before Step 2 finishes.
  2. A deletion request
  3. A second open or creation request that is enqueued before couch_server processes the '$gen_call' message from Step 1.
  4. The couch_db_updater pid's 'EXIT' signal is delayed until after the open_result from Step 3 is delivered to couch_server

The underlying issue here is that the deletion request clears the couch_dbs ets table state without removing the possible corresponding state in couch_server's message queue. As things progress couch_server ends up mistaking the open_result message from Step 1 as corresponding to the open_async process spawned in Step 3. Currently this results in the client from Step 3 getting an invalid response from couch_server, and then more importantly, couch_server dies while attempting to process the real open_result message because of the state in the couch_dbs ets table being incorrect (it would throw a function_clause error in a list comprehension because #entry.waiters was undefined).

The fix was just to ensure that the opener pid in the #entry{} record matches the pid of the caller. Anything that doesn't match is ignored since the deletion already cleaned up the server state. Although we do kill the couch_db_updater pid for extra security, Though technically its duplicating work that the deletion request already handled (via killing the open_async pid which is linked to it).

Testing recommendations

The second of three commits adds a failing test case that is fixed by the third commit.

make eunit

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • No documentation to change

davisp added some commits Sep 5, 2018

Fix couch_server:terminate/2
If couch_server terminates while there is an active open_async process
it will throw a function_clause exception because `couch_db:get_pid/1`
will fail due to the `#entry.db` member being undefined. Simple fix is
to just filter those out.
Reproduce race condition in couch_server
A rather uncommon bug found in production. Will write more as this is
just for show and tell.

For now this test case just demonstrates the issue that was discovered.
A fix is still being pondered.
Fix couch_server concurrency error
Its possible that a busy couch_server and a specific ordering and timing
of events can end up with an open_async message in the mailbox while a
new and unrelated open_async process is spawned. This change just ensure
that if we encounter any old messages in the mailbox that we ignore
them.

The underlying issue here is that a delete request clears out the state
in our couch_dbs ets table while not clearing out state in the message
queue. In some fairly specific circumstances this leads to the message
on in the mailbox satisfying an ets entry for a newer open_async
process. This change just includes a match on the opener process.
Anything unmatched came before the current open_async request which
means it should be ignored.
@nickva

This comment has been minimized.

Contributor

nickva commented Sep 6, 2018

The issue is reproducible with this script (in a remsh):

f(FuzzIt), FuzzIt = fun(DbName, Tries, WaitMax) ->
        Ref = erlang:monitor(process, whereis(couch_server)),
        [begin
             [begin
                CallRefs = [Open1Ref, DeleteRef, Open2Ref] = [make_ref(), make_ref(), make_ref()],
                whereis(couch_server) ! {'$gen_call', {self(), Open1Ref}, {create, DbName, []}},
                timer:sleep(W),
                whereis(couch_server) ! {'$gen_call', {self(), DeleteRef}, {delete, DbName, []}},
                whereis(couch_server) ! {'$gen_call', {self(), Open2Ref}, {open, DbName, []}},
                [receive {R, _} -> ok end || R <- CallRefs]
             end || W <- lists:seq(0, WaitMax)]
         end || _ <- lists:seq(1, Tries)],
        receive  {'DOWN', Ref, _, _, Reason} ->  {bam, Reason}
        after 5000 -> nope
        end
end.

When run like this for example:

FuzzIt(<<"blah">>, 100, 15).

It might help to have a few dbs open and some activity, like say a bunch of replication jobs running as well.

The node would crash and possibly restart leaving something like this in the log:

[error] 2018-09-06T17:01:48.085468Z node1@127.0.0.1 <0.243.0> -------- couch_server terminating with {function_clause,[{couch_server,'-handle_call/3-lc$^1/1-1-',[undefined],[{file,"src/couch_server.erl"},{line,434}]},{couch_server,handle_call,3,[{file,"src/couch_server.erl"},
@nickva

nickva approved these changes Sep 6, 2018

Awesome find and fix @davisp!

Checked that the fuzzer script that could reproduce the script can't reproduce it any more.

======================== EUnit ========================
module 'couch_server_tests'
  Test for proper deletion of db file
Application crypto was left running!
    rename
      couch_server_tests:69: should_rename_on_delete...[0.003 s] ok
      [done in 0.006 s]
    delete
      couch_server_tests:84: should_delete...[0.003 s] ok
      [done in 0.006 s]
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
    [done in 0.084 s]
  couch_server_tests: bad_engine_option_test_...ok
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
  couch_server_tests: make_interleaved_requests...[0.327 s] ok
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
  [done in 2.507 s]
=======================================================
  All 4 tests passed.
@rnewson

rnewson approved these changes Sep 6, 2018

@iilyak

iilyak approved these changes Sep 6, 2018

+1. Tests pass locally.
Able to reproduce the issue locally with reproducer without patch and unable to reproduce with the patch. Great work!!

@davisp

This comment has been minimized.

Member

davisp commented Sep 6, 2018

@nickva And to be clear, you didn't manage to reproduce it with that script after this fix right?

@davisp

This comment has been minimized.

Member

davisp commented Sep 6, 2018

Ah, @iilyak confirmed. Taking that as all good.

@davisp davisp merged commit f66c734 into master Sep 6, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment