
[Jenkins] timeout triggered by all_dbs_active #549

Closed
wohali opened this issue May 30, 2017 · 4 comments

wohali commented May 30, 2017

Current & Expected Behaviour

During an eunit test (couchdb_mrview_cors_tests) a request is made to retrieve the output of a view. It should succeed. Sometimes, it fails on a timeout.

Possible Solution

In one failure, the attempt to create the database in the eunit test fails, running into an all_dbs_active error. This is unusual because the couch application has just started up and only a single database is created for this test.

Perhaps we have a race condition in startup or in couch_lru?

Your Environment

Jenkins automated build, 2.1.x branch, Debian 8, default Erlang (17), logs uploaded as jenkins-couchdb-1-2017-05-29T02:02:45.043081. Relevant paste here: https://paste.apache.org/sdgk

@wohali wohali modified the milestone: 2.1.0 Jul 3, 2017

wohali commented Jul 4, 2017

I looked at this one a bit more closely. The Makefile log output is here: https://builds.apache.org/blue/organizations/jenkins/CouchDB/detail/2.1.x/1/pipeline/49/

That all_dbs_active error got me wondering: perhaps a prior test set the max_dbs_open value low, and didn't clear it. A few tests earlier, couchdb_views_tests:couchdb_1283 calls:

ok = config:set("couchdb", "max_dbs_open", "3", false),

but I never see a similar config:set call to restore the default. Does our test harness clean these up for us, or could the old setting be hanging around still? If the test harness doesn't clean up config settings for us, there may be a number of places in the eunit tests where this could be biting us in weird ways.
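If the harness did not reset settings between tests, the fix would be for each test to restore what it changed. A minimal, hypothetical teardown sketch (the function name and context argument are assumptions, not the actual test code):

```erlang
%% Hypothetical teardown sketch: restore the default max_dbs_open
%% in case the harness does not reset it for us.
%% (It turns out the harness does; see the EDIT below.)
teardown(_Ctx) ->
    %% final 'false' = do not persist the change to the .ini file
    ok = config:set("couchdb", "max_dbs_open", "500", false).
```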

EDIT: Thanks, @rnewson, for helping me see that the harness calls test_util:stop_couch/0, which does a stop_applications(?DEFAULT_APPS), where -define(DEFAULT_APPS, [inets, ibrowse, ssl, config, couch_epi, couch_event, couch]). Stopping the config app discards all in-memory settings, and the final false in the config:set/4 call prevents the change from being persisted to the ini file. So this is not the issue at hand...yet.


wohali commented Jul 4, 2017

This is a really odd failure. My guess is we've hit a race condition where an attempt to open the database happens before the database has actually been created, but I'm not comfortable enough with the couch_server code to be sure.

We see in the couch.log file that the PUT to create the eunit test database returns a 201, so presumably the database has been created. But just a few hundred milliseconds later, we also see:

[error] 2017-05-29T02:02:35.181959Z nonode@nohost <0.5150.1> -------- Could not open file /tmp/tmp.c3FrN4iLsk/apache-couchdb-2.0.0-79067c9/tmp/data/shards/40000000-5fffffff/eunit-test-db-149623354967440.1496023354.couch: no such file or directory

Question 1: Is it normal for a PUT to get back a 201 before all the shards have been created on disk?

During the open call, we hit couch_server:handle_call({open...}) -> couch_server:make_room/2 -> couch_server:maybe_close_lru_db/1, which is the only place in the code that returns all_dbs_active, so we must be walking this code path. What's especially odd is that, to get all_dbs_active back, we must be failing this guard:

maybe_close_lru_db(#server{dbs_open=NumOpen, max_dbs_open=MaxOpen}=Server)
        when NumOpen < MaxOpen ->
    {ok, Server};

Question 2: Why did we fail that guard?

So we get past the guard and head into couch_lru:close/1, which calls close_int(gb_trees:next(gb_trees:iterator(Tree)), Cache). Inside close_int/2 I see:

close_int(none, _) ->
    false;

In this particular failure, we can't possibly have >100 dbs open already. I wonder if gb_trees:iterator(Tree) is returning an empty iterator (perhaps because nothing is actually open yet), so gb_trees:next/1 returns none and this bubbles back up as false?

Question 3: Should we handle the special case of couch_lru:close/1 being called on an empty LRU, so that it does not return false?
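One way to answer Question 3 would be to check for the empty tree up front. A minimal sketch, assuming an {Tree, Cache} state shape and the empty atom as the new return value (both are assumptions, not the actual couch_lru implementation):

```erlang
%% Hypothetical sketch: have close/1 signal an empty LRU explicitly
%% instead of falling through to close_int/2 and returning false,
%% so callers can distinguish "nothing open yet" from "all dbs active".
close({Tree, Cache}) ->
    case gb_trees:is_empty(Tree) of
        true  -> empty;  %% nothing open yet; caller could retry or wait
        false -> close_int(gb_trees:next(gb_trees:iterator(Tree)), Cache)
    end.
```

The caller in couch_server would then need to treat empty differently from false rather than reporting all_dbs_active.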


Also: I noticed

couch/src/couch_server.erl:-define(MAX_DBS_OPEN, 100).

but default.ini defines max_dbs_open = 500. Should we bump the default in couch_server.erl? 500 seems like a more reasonable default to me. I'll open a new issue for this.
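If we did bump it, the change would be a single line in couch_server.erl (shown here only as a sketch of the proposal, pending that new issue):

```erlang
%% Hypothetical change: align the compiled-in fallback with the
%% max_dbs_open = 500 value shipped in default.ini.
-define(MAX_DBS_OPEN, 500).
```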


wohali commented Jul 4, 2017

16:50 <+rnewson> "Is it normal for a PUT to get back a 201 before all the
                 shards have been created on disk?"
16:50 <+rnewson> no, fabric_db_create waits for all shard file creation before
                 responding


davisp commented Jul 13, 2017

This is a duplicate of #571.

The timeout error is the first error seen when you scroll up in the log. But if you keep going you'll see this bit in the logs:

module 'couchdb_os_daemons_tests'
  OS Daemons tests
    couchdb_os_daemons_tests:107: should_check_daemon...ok
    couchdb_os_daemons_tests:113: should_check_daemon_table_form...ok
    couchdb_os_daemons_tests:120: should_clean_tables_on_daemon_remove...[0.544 s] ok
    couchdb_os_daemons_tests:127: should_spawn_multiple_daemons...[0.618 s] ok
    undefined
    *** context cleanup failed ***
**in function test_util:stop_sync_throw/4 (src/test_util.erl, line 160)
in call from couchdb_os_daemons_tests:teardown/2 (test/couchdb_os_daemons_tests.erl, line 58)
**throw:{timeout,os_daemon_stop}


  undefined
  *** context setup failed ***
**in function meck_proc:start/2 (src/meck_proc.erl, line 96)
  called as start(couch_stats,[passthrough])
in call from test_util:mock/1 (src/test_util.erl, line 244)
in call from test_util:'-mock/1-lc$^0/1-0-'/1 (src/test_util.erl, line 237)
in call from test_util:start/3 (src/test_util.erl, line 225)
in call from couchdb_os_daemons_tests:setup/1 (test/couchdb_os_daemons_tests.erl, line 47)
**error:{already_started,<0.3624.1>}

We failed to shut down couch properly, which left it in a weird state and caused a lot of other failures as well. Given that this was fixed, I'm closing; we can reopen if it happens again.

@davisp davisp closed this as completed Jul 13, 2017
nickva pushed a commit to nickva/couchdb that referenced this issue Sep 7, 2022
Add missing word "key" in "What is a partition?"