
[Jenkins] timeout triggered by all_dbs_active #549

Closed
wohali opened this issue May 30, 2017 · 4 comments

wohali commented May 30, 2017

Current & Expected Behaviour

During an eunit test (couchdb_mrview_cors_tests) a request is made to retrieve the output of a view. It should succeed. Sometimes, it fails on a timeout.

Possible Solution

In one failure, the attempt to create the database in the eunit test fails, running into an all_dbs_active error. This is unusual because the couch application has just started up and only a single database is created for this test.

Perhaps we have a race condition in startup or in couch_lru?

Your Environment

Jenkins automated build, 2.1.x branch, Debian 8, default Erlang (17), logs uploaded as jenkins-couchdb-1-2017-05-29T02:02:45.043081. Relevant paste here: https://paste.apache.org/sdgk

@wohali wohali modified the milestone: 2.1.0 Jul 3, 2017

wohali commented Jul 4, 2017

I looked at this one a bit more closely. The Makefile log output is here: https://builds.apache.org/blue/organizations/jenkins/CouchDB/detail/2.1.x/1/pipeline/49/

That all_dbs_active error got me wondering: perhaps a prior test set the max_dbs_open value low, and didn't clear it. A few tests earlier, couchdb_views_tests:couchdb_1283 calls:

ok = config:set("couchdb", "max_dbs_open", "3", false),

but I never see a similar config:set call to restore the default. Does our test harness clean these up for us, or could the old setting be hanging around still? If the test harness doesn't clean up config settings for us, there may be a number of places in the eunit tests where this could be biting us in weird ways.
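If the harness did not reset settings between tests, the fix would be for each test to restore what it changed. A minimal, hypothetical teardown sketch (the function name and context argument are assumptions, not the actual test code):

```erlang
%% Hypothetical teardown sketch: restore the default max_dbs_open
%% in case the harness does not reset it for us.
%% (It turns out the harness does; see the EDIT below.)
teardown(_Ctx) ->
    %% final 'false' = do not persist the change to the .ini file
    ok = config:set("couchdb", "max_dbs_open", "500", false).
```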

EDIT: Thanks, @rnewson, for helping me see that the harness calls test_util:stop_couch/0, which does a stop_applications(?DEFAULT_APPS), where -define(DEFAULT_APPS, [inets, ibrowse, ssl, config, couch_epi, couch_event, couch]). Stopping the config app discards all in-memory settings, and the final false in the config:set/4 call prevents the change from being persisted to the ini file. So this is not the issue at hand...yet.


wohali commented Jul 4, 2017

This is a really odd failure. My guess is we've hit a race condition where an attempt to open the database happens before the database has actually been created, but I'm not comfortable enough with the couch_server code to be sure.

We see in the couch.log file that the PUT to create the eunit test database returns a 201, so presumably the database has been created. But just a few hundred milliseconds later, we also see:

[error] 2017-05-29T02:02:35.181959Z nonode@nohost <0.5150.1> -------- Could not open file /tmp/tmp.c3FrN4iLsk/apache-couchdb-2.0.0-79067c9/tmp/data/shards/40000000-5fffffff/eunit-test-db-149623354967440.1496023354.couch: no such file or directory

Question 1: Is it normal for a PUT to get back a 201 before all the shards have been created on disk?

During the open call, we hit couch_server:handle_call({open...}) -> couch_server:make_room/2 -> couch_server:maybe_close_lru_db/1, which is the only place in the code that returns all_dbs_active, so we must be walking this code path. What's especially odd is that, to get all_dbs_active back, we must be failing this guard:

maybe_close_lru_db(#server{dbs_open=NumOpen, max_dbs_open=MaxOpen}=Server)
        when NumOpen < MaxOpen ->
    {ok, Server};

Question 2: Why did we fail that guard?

So we get past the guard and head into couch_lru:close/1, which calls close_int(gb_trees:next(gb_trees:iterator(Tree)), Cache). Inside close_int/2 I see:

close_int(none, _) ->
    false;

In this particular failure, we can't possibly have >100 dbs open already. I wonder if gb_trees:iterator(Tree) is returning an empty iterator (perhaps because nothing is actually open yet), so gb_trees:next/1 returns none and this bubbles back up as false?

Question 3: Should we handle the special case of couch_lru:close/1 being called on an empty LRU, so that it does not return false?
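One way to answer Question 3 would be to check for the empty tree up front. A minimal sketch, assuming an {Tree, Cache} state shape and the empty atom as the new return value (both are assumptions, not the actual couch_lru implementation):

```erlang
%% Hypothetical sketch: have close/1 signal an empty LRU explicitly
%% instead of falling through to close_int/2 and returning false,
%% so callers can distinguish "nothing open yet" from "all dbs active".
close({Tree, Cache}) ->
    case gb_trees:is_empty(Tree) of
        true  -> empty;  %% nothing open yet; caller could retry or wait
        false -> close_int(gb_trees:next(gb_trees:iterator(Tree)), Cache)
    end.
```

The caller in couch_server would then need to treat empty differently from false rather than reporting all_dbs_active.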


Also: I noticed

couch/src/couch_server.erl:-define(MAX_DBS_OPEN, 100).

but default.ini defines max_dbs_open = 500. Should we bump the default in couch_server.erl? 500 seems like a more reasonable default to me. I'll open a new issue for this.
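If we did bump it, the change would be a single line in couch_server.erl (shown here only as a sketch of the proposal, pending that new issue):

```erlang
%% Hypothetical change: align the compiled-in fallback with the
%% max_dbs_open = 500 value shipped in default.ini.
-define(MAX_DBS_OPEN, 500).
```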


wohali commented Jul 4, 2017

16:50 <+rnewson> "Is it normal for a PUT to get back a 201 before all the
                 shards have been created on disk?"
16:50 <+rnewson> no, fabric_db_create waits for all shard file creation before
                 responding


davisp commented Jul 13, 2017

This is a duplicate of #571.

The timeout error is the first error seen when you scroll up in the log. But if you keep going you'll see this bit in the logs:

module 'couchdb_os_daemons_tests'
  OS Daemons tests
    couchdb_os_daemons_tests:107: should_check_daemon...ok
    couchdb_os_daemons_tests:113: should_check_daemon_table_form...ok
    couchdb_os_daemons_tests:120: should_clean_tables_on_daemon_remove...[0.544 s] ok
    couchdb_os_daemons_tests:127: should_spawn_multiple_daemons...[0.618 s] ok
    undefined
    *** context cleanup failed ***
**in function test_util:stop_sync_throw/4 (src/test_util.erl, line 160)
in call from couchdb_os_daemons_tests:teardown/2 (test/couchdb_os_daemons_tests.erl, line 58)
**throw:{timeout,os_daemon_stop}


  undefined
  *** context setup failed ***
**in function meck_proc:start/2 (src/meck_proc.erl, line 96)
  called as start(couch_stats,[passthrough])
in call from test_util:mock/1 (src/test_util.erl, line 244)
in call from test_util:'-mock/1-lc$^0/1-0-'/1 (src/test_util.erl, line 237)
in call from test_util:start/3 (src/test_util.erl, line 225)
in call from couchdb_os_daemons_tests:setup/1 (test/couchdb_os_daemons_tests.erl, line 47)
**error:{already_started,<0.3624.1>}

We failed to shut down couch properly, which left it in a weird state and caused a lot of other failures as well. Given that this was fixed, I'm closing; we can reopen if it happens again.

@davisp davisp closed this as completed Jul 13, 2017
nickva pushed a commit to nickva/couchdb that referenced this issue Sep 7, 2022
Add missing word "key" in "What is a partition?"