New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle db deletion in load validation funs #1629

Merged
merged 1 commit into from Nov 22, 2018

Conversation

Projects
None yet
2 participants
@iilyak
Contributor

iilyak commented Sep 27, 2018

Overview

Handle db deletion in couch_db:load_validation_funs

Previously there were quite a few problems with load_validation_funs
in the case when clustered database is deleted.

  • the calls to load_validation_funs were failing with internal_server error [1]
  • the deleted database stayed opened because:
    • the caller of the load_validation_funs (update_doc) stayed alive
    • the main_pid of the deleted database wasn't killed either
  • there was an infinite loop in ddoc_cache_entry trying to recover ddoc from deleted database

The solution is:

  • do not call recover for deleted database
  • close main_pid
  • use erlang:error to crash the caller

[1] - The stack trace was:

{database_does_not_exist,[
    {mem3_shards,load_shards_from_db,"bailey/meta",[
        {file,"src/mem3_shards.erl"},{line,394}]},
    {mem3_shards,load_shards_from_disk,1,[
        {file,"src/mem3_shards.erl"},{line,369}]},
    {mem3_shards,for_db,2,[
        {file,"src/mem3_shards.erl"},{line,54}]},
    {fabric_view_all_docs,go,5,[
        {file,"src/fabric_view_all_docs.erl"},{line,24}]},
    {ddoc_cache_entry_validation_funs,recover,1,[
        {file,"src/ddoc_cache_entry_validation_funs.erl"},{line,33}]},
    {ddoc_cache_entry,do_open,1,[
        {file,"src/ddoc_cache_entry.erl"},{line,294}]}]}

Testing recommendations

make eunit apps=couch tests=clustered_db_test
make eunit apps=ddoc_cache

Related Issues or Pull Requests

N/A

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • Documentation reflects the changes;

@iilyak iilyak changed the title from 84736 handle db deletion in load validation funs to handle db deletion in load validation funs Sep 28, 2018

@iilyak iilyak force-pushed the cloudant:84736-handle-db-deletion-in-load_validation_funs branch 7 times, most recently from 519ab45 to 51a27b3 Sep 28, 2018

@iilyak iilyak force-pushed the cloudant:84736-handle-db-deletion-in-load_validation_funs branch from 51a27b3 to 81aeced Oct 12, 2018

@jaydoane

This comment has been minimized.

Contributor

jaydoane commented Oct 15, 2018

Comparing

$ make eunit skip_tests+=couch_epep apps=couch suites=couchdb_db_tests
module 'couchdb_db_tests'
  Checking clustered db API
    DB deletion
      couchdb_db_tests:53: should_close_deleted_db...[0.029 s] ok
      couchdb_db_tests:68: should_kill_caller_from_load_validation_funs_for_deleted_db...[0.033 s] ok
=======================================================
  2 tests passed.

with

$ make eunit skip_tests+=couch_epep apps=ddoc_cache suites=ddoc_cache_open_test
module 'ddoc_cache_open_test'
  ddoc_cache_open_test: check_open_error_test_...ok
  ddoc_cache_open_test: check_open_error_test_...[0.012 s] ok
  ddoc_cache_open_test: check_open_error_test_...[0.015 s] ok
  ddoc_cache_open_test: check_open_error_test_...ok
=======================================================
  All 4 tests passed.

I know it requires more boiler plate to accomplish, but I much prefer to see the actual names of the tests being run with the couchdb_db_tests module. Just seeing 4 identical lines of ddoc_cache_open_test: check_open_error_test_ isn't as useful, IMO.

@jaydoane

I'd really like to see that timeout lowered so we don't add an additional .5 seconds to every test run for eternity. Other changes are optional.

after 2000 ->
throw(timeout_error)
end
end).

This comment has been minimized.

@jaydoane

jaydoane Oct 15, 2018

Contributor

The test name implies that the shard should be closed by the deletion, so I wonder if we should also add an explicit assertion to that effect? In the absence of an is_open function, we could add something like ?assertEqual([], ets:lookup(couch_dbs, DbName)) at the end.

ddoc_cache_lru:open({?MODULE, {deleted, DbName}})),
?assertError(
timeout,
meck:wait(ddoc_cache_entry_validation_funs, recover, '_', 500)).

This comment has been minimized.

@jaydoane

jaydoane Oct 15, 2018

Contributor

Every test run will hit this 500ms, which IMO is excessive. I lowered it to 10 locally, and the test seemed to run fine, in much less time.

This comment has been minimized.

@iilyak

iilyak Oct 16, 2018

Contributor

Good catch. Apparently the meck:wait function has two slightly different semantics depending on arity:

  • meck:wait(foo, '_', '_', 1000) - wait any number of calls for 1 second
  • meck:wait(5, foo, '_', '_', 1000), - wait for 5 calls and return (within 1 second)

I.e. to avoid waiting we would need to change it to wait for 1 call.

@iilyak iilyak force-pushed the cloudant:84736-handle-db-deletion-in-load_validation_funs branch from 81aeced to c366205 Oct 16, 2018

@iilyak

This comment has been minimized.

Contributor

iilyak commented Oct 16, 2018

@jaydoane I've updated the PR with your recommendations.

ddoc_cache_lru:open({?MODULE, {deleted, DbName}})),
?assertError(
timeout,
meck:wait(1, ddoc_cache_entry_validation_funs, recover, '_', 500)).

This comment has been minimized.

@jaydoane

jaydoane Oct 18, 2018

Contributor

@iilyak I'm still seeing this test always take > 500ms:

$ make eunit skip_tests+=couch_epep apps=ddoc_cache suites=ddoc_cache_open_test
======================== EUnit ========================
module 'ddoc_cache_open_test'
Application crypto was left running!
  ddoc_cache_open_test: check_open_error_test_...[0.002 s] ok
  ddoc_cache_open_test: check_open_error_test_...[0.501 s] ok
  ddoc_cache_open_test: check_open_error_test_...[0.015 s] ok
  ddoc_cache_open_test: check_open_error_test_...ok
  [done in 0.530 s]
=======================================================
  All 4 tests passed.

Do we really need to wait this long for that timeout?

This comment has been minimized.

@iilyak

iilyak Nov 20, 2018

Contributor

I don't understand how it can be related to meck. Here is a result of my investigation:

  • meck:wait/5 is dispatched to meck:wait/6 see here
  • meck:wait/6 call is dispatched to meck_proc:wait/6 see here
  • meck_proc:wait/6 does a gen_server call here
  • the handle_call for wait message has following shape
      case times_called(OptFunc, ArgsMatcher, OptCallerPid, History) of
          CalledSoFar when CalledSoFar >= Times ->
               {reply, ok, S};
    
  • the times_called/4 function is pretty simple and just iterate through History term

The only explanation I have is it takes time to process ddoc_cache_lru:open and detect the call on meck side.

I don't like reducing the timeout since there is a high chance of turning the test into a flaky one. Because shorter timeout could be a problem for slower CI systems.

I'll try to reduce the timeout locally to see if it has any effect on the run time of the test suite. If it does it would mean that I don't read meck code correctly. If it doesn't not it would mean that the test is slow.

This comment has been minimized.

@iilyak

iilyak Nov 20, 2018

Contributor

I ended up reducing the timeout value

@iilyak iilyak force-pushed the cloudant:84736-handle-db-deletion-in-load_validation_funs branch from c366205 to e324cb6 Nov 20, 2018

@iilyak

This comment has been minimized.

Contributor

iilyak commented Nov 20, 2018

Comparing

$ make eunit skip_tests+=couch_epep apps=couch suites=couchdb_db_tests
module 'couchdb_db_tests'
  Checking clustered db API
    DB deletion
      couchdb_db_tests:53: should_close_deleted_db...[0.029 s] ok
      couchdb_db_tests:68: should_kill_caller_from_load_validation_funs_for_deleted_db...[0.033 s] ok
=======================================================
  2 tests passed.

with

$ make eunit skip_tests+=couch_epep apps=ddoc_cache suites=ddoc_cache_open_test
module 'ddoc_cache_open_test'
  ddoc_cache_open_test: check_open_error_test_...ok
  ddoc_cache_open_test: check_open_error_test_...[0.012 s] ok
  ddoc_cache_open_test: check_open_error_test_...[0.015 s] ok
  ddoc_cache_open_test: check_open_error_test_...ok
=======================================================
  All 4 tests passed.

I know it requires more boiler plate to accomplish, but I much prefer to see the actual names of the tests being run with the couchdb_db_tests module. Just seeing 4 identical lines of ddoc_cache_open_test: check_open_error_test_ isn't as useful, IMO.

I don't know how to do it with with. All tests in ddoc_cache application rely on combination of ddoc_cache_tutil:start_couch and {setup, _, _, {with, Tests}} construct. The {with, Tests} construct is implemented via shortcut in https://github.com/erlang/otp/blob/master/lib/eunit/src/eunit_data.erl#L294:L295. The author didn't take into account that the {with, X, Tests} construction rely on the eunit_lib:fun_parent(A). The reason the hack doesn't get correct name of the test is because they are wrapping the function one more time on dispatch.

parse({setup, P, S, C, {with, As}}) when is_list(As) ->
     parse({setup, P, S, C, fun (X) -> {with, X, As} end});

I am trying few things to avoid rewriting of all tests for the whole application.

@iilyak iilyak force-pushed the cloudant:84736-handle-db-deletion-in-load_validation_funs branch from 3062904 to 53222dd Nov 20, 2018

@jaydoane

This comment has been minimized.

Contributor

jaydoane commented Nov 20, 2018

New changes look good:

$ make eunit apps=couch tests=clustered_db_test
  DB deletion
    couchdb_db_tests:53: should_close_deleted_db...[0.028 s] ok
    couchdb_db_tests:75: should_kill_caller_from_load_validation_funs_for_deleted_db...[0.032 s] ok
=======================================================
  2 tests passed.
$ make eunit apps=ddoc_cache
...
=======================================================
  All 51 tests passed.

Just a couple minor quibbles:

  • The second line of the commit says "Previously there were few problems with load_validation_funs". I think you mean "a few problems", since omitting the article "a" makes it sound like there were not that many problems. In short, it inverts the meaning.
  • Some functions in the new test modules don't have 2 lines between them in places.
@jaydoane

informal +1

@iilyak iilyak force-pushed the cloudant:84736-handle-db-deletion-in-load_validation_funs branch 2 times, most recently from 3fefd29 to ee7b8cd Nov 20, 2018

Handle db deletion in couch_db:load_validation_funs
Previously there were quite a few problems with load_validation_funs
in the case when clustered database is deleted.

- the calls to load_validation_funs were failing with `internal_server` error [1]
- the deleted database stayed opened because:
  - the caller of the load_validation_funs (update_doc) stayed alive
  - the main_pid of the deleted database wasn't killed either
- there was an infinite loop in ddoc_cache_entry trying to recover ddoc from deleted database

The solution is:
- do not call `recover` for deleted database
- close `main_pid`
- use `erlang:error` to crash the caller

[1] - The stack trace was:
```
{database_does_not_exist,[
    {mem3_shards,load_shards_from_db,"bailey/meta",[
        {file,"src/mem3_shards.erl"},{line,394}]},
    {mem3_shards,load_shards_from_disk,1,[
        {file,"src/mem3_shards.erl"},{line,369}]},
    {mem3_shards,for_db,2,[
        {file,"src/mem3_shards.erl"},{line,54}]},
    {fabric_view_all_docs,go,5,[
        {file,"src/fabric_view_all_docs.erl"},{line,24}]},
    {ddoc_cache_entry_validation_funs,recover,1,[
        {file,"src/ddoc_cache_entry_validation_funs.erl"},{line,33}]},
    {ddoc_cache_entry,do_open,1,[
        {file,"src/ddoc_cache_entry.erl"},{line,294}]}]}
```

@iilyak iilyak force-pushed the cloudant:84736-handle-db-deletion-in-load_validation_funs branch from ee7b8cd to d35b5c9 Nov 22, 2018

@iilyak iilyak merged commit 4b3531b into apache:master Nov 22, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@iilyak iilyak deleted the cloudant:84736-handle-db-deletion-in-load_validation_funs branch Nov 22, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment