couch_mrview_changes_since_tests gen_server failing with unknown_info #649

wohali · 2017-07-06T07:51:59Z

Possible recurrence of the #548 issue:

https://builds.apache.org/blue/organizations/jenkins/CouchDB/detail/master/57/pipeline/50

module 'couch_mrview_changes_since_tests'
  changes_since tests
    couch_mrview_changes_since_tests:109: test_basic...ok
    couch_mrview_changes_since_tests:132: test_basic_since...ok
    couch_mrview_changes_since_tests:145: test_basic_count...ok
    couch_mrview_changes_since_tests:154: test_basic_count_since...ok
    undefined
    *** instantiation of subtests failed ***
**in function gen_server:call/3 (gen_server.erl, line 212)
in call from couch_mrview_util:get_view_index_state/5 (src/couch_mrview_util.erl, line 83)
in call from couch_mrview_util:get_view/4 (src/couch_mrview_util.erl, line 45)
in call from couch_mrview:count_view_changes_since/5 (src/couch_mrview.erl, line 304)
in call from couch_mrview_changes_since_tests:test_compact/1 (test/couch_mrview_changes_since_tests.erl, line 163)
**exit:{unknown_info,{gen_server,call,[<0.22793.1>,{get_state,11},infinity]}}

Searching on <0.22793.1> in the couch.log turned up:

[info] 2017-07-06T04:37:28.456116Z nonode@nohost <0.22793.1> -------- Opening index for db: eunit-test-db-1499315848442098 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383"
[info] 2017-07-06T04:37:28.456384Z nonode@nohost <0.22798.1> -------- Compaction started for db: eunit-test-db-1499315848442098 idx: _design/bar
[info] 2017-07-06T04:37:28.456583Z nonode@nohost <0.22796.1> -------- Starting index update for db: eunit-test-db-1499315848442098 idx: _design/bar
[notice] 2017-07-06T04:37:28.461342Z nonode@nohost <0.22793.1> -------- Compaction swap for view /tmp/tmp.VHCxH73tOc/apache-couchdb-2.1.0-858088e/tmp/data/.eunit-test-db-1499315848442098_design/mrview/7d577d5ace56a11c2e7289f38c77f383.view 51 4200
[info] 2017-07-06T04:37:28.461710Z nonode@nohost <0.22796.1> -------- Restarting index update for db: eunit-test-db-1499315848442098 idx: _design/bar
[info] 2017-07-06T04:37:28.461894Z nonode@nohost <0.22798.1> -------- Compaction finished for db: eunit-test-db-1499315848442098 idx: _design/bar
[error] 2017-07-06T04:37:28.462242Z nonode@nohost <0.22796.1> -------- gen_server <0.22796.1> terminated with reason: unknown_info
  last msg: {'EXIT',<0.22799.1>,{updated,<0.22799.1>,{mrst,<<125,87,125,90,206,86,161,28,46,114,137,243,140,119,243,131>>,<0.22795.1>,#Ref<0.0.786433.256464>,<<"eunit-test-db-1499315848442098">>,<<"_design/bar">>,<<"javascript">>,[{<<"seq_indexed">>,true}],true,false,{[]},[{mrview,0,11,0,[<<"baz">>],[],<<"function(doc) {emit(doc.val.toString(), doc.val);}">>,{btree,<0.22795.1>,{363,{10,[]},139},undefined,undefined,#Fun<couch_ejson_compare.less_json_ids.2>,#Fun<couch_mrview_util.17.88980446>,snappy},{btree,<0.22795.1>,{502,10,376},undefined,undefined,undefined,#Fun<couch_db_updater.btree_by_seq_reduce.2>,snappy},nil,true,false,[]}],{btree,<0.22795.1>,{51,[],135},undefined,undefined,undefined,nil,snappy},{btree,<0.22795.1>,{186,[],177},undefined,undefined,undefined,nil,snappy},11,0,undefined,undefined,undefined,undefined,undefined,nil}}}
     state: {st,<0.22793.1>,couch_mrview_index,<0.22807.1>}
[error] 2017-07-06T04:37:28.462545Z nonode@nohost <0.22796.1> -------- CRASH REPORT Process  (<0.22796.1>) with 2 neighbors exited with reason: unknown_info at gen_server:terminate/7(line:826) <= proc_lib:init_p_do_apply/3(line:240); initial_call: {couch_index_updater,init,['Argument__1']}, ancestors: [<0.22793.1>,<0.22792.1>], messages: [], links: [<0.22793.1>], dictionary: [], trap_exit: true, status: running, heap_size: 1598, stack_size: 27, reductions: 1734
[error] 2017-07-06T04:37:28.462718Z nonode@nohost <0.22797.1> -------- gen_server <0.22797.1> terminated with reason: unknown_info
  last msg: {'EXIT',<0.22793.1>,unknown_info}
     state: {st,<0.22793.1>,couch_mrview_index,<0.22798.1>}
[error] 2017-07-06T04:37:28.463037Z nonode@nohost <0.22797.1> -------- CRASH REPORT Process  (<0.22797.1>) with 0 neighbors exited with reason: unknown_info at gen_server:terminate/7(line:826) <= proc_lib:init_p_do_apply/3(line:240); initial_call: {couch_index_compactor,init,['Argument__1']}, ancestors: [<0.22793.1>,<0.22792.1>], messages: [], links: [], dictionary: [], trap_exit: true, status: running, heap_size: 376, stack_size: 27, reductions: 158

@eiri This feels a lot like what we ran into previously.

/cc @kocolosk @janl @davisp

The text was updated successfully, but these errors were encountered:

eiri · 2017-07-06T12:01:18Z

No, this is different, just looks the same, because it's happening during getting view state and the cause is also race condition.

In previous case we've been crashing on old index_server exit or be gone already, so exceptions been normal (for exit reason) or noproc.

Here exception is unknown_info and it is actually couch_index_updater server crashing (and bringing index server with it, because this clause doesn't match. It seems that the updater spawned two updating workers due to restart on compaction, then kept pid of the second, but got exit message from the first one. This is interesting, I'd expect this swap to be sequential, so I wonder if we have a bug in couch_util:shutdown_sync/1 and it is not that "sync".

On the other hand, why do we have a compaction happening during a test? I thought that new compaction daemon supposed to be turned off for eunits?

wohali · 2017-07-08T00:09:23Z

I think this is because of #652 in the eunit test - but do we want to keep this issue to track fixing the race condition? Keep in mind that once the PR is merged, we won't easily reproduce this bug anymore.

wohali · 2017-07-08T06:20:34Z

Closing because PR#654 landed. Will reopen if it recurs.

wohali · 2017-07-08T07:14:00Z

Sadly, it recurred already.

https://builds.apache.org/blue/organizations/jenkins/CouchDB/detail/master/64/pipeline/49/

wohali · 2017-07-08T07:18:09Z

The Makefile shows <0.22474.1> as the pid that is being monitored:

module 'couch_mrview_changes_since_tests'
  changes_since tests
    couch_mrview_changes_since_tests:109: test_basic...ok
    couch_mrview_changes_since_tests:132: test_basic_since...ok
    couch_mrview_changes_since_tests:145: test_basic_count...ok
    couch_mrview_changes_since_tests:154: test_basic_count_since...ok
    undefined
    *** instantiation of subtests failed ***
**in function gen_server:call/3 (gen_server.erl, line 190)
in call from couch_mrview_util:get_view_index_state/5 (src/couch_mrview_util.erl, line 83)
in call from couch_mrview_util:get_view/4 (src/couch_mrview_util.erl, line 45)
in call from couch_mrview:count_view_changes_since/5 (src/couch_mrview.erl, line 304)
in call from couch_mrview_changes_since_tests:test_compact/1 (test/couch_mrview_changes_since_tests.erl, line 163)
**exit:{unknown_info,{gen_server,call,[<0.22474.1>,{get_state,11},infinity]}}

couch.log shows the compaction in the test, but a strange exit:

[info] 2017-07-08T06:34:17.054113Z nonode@nohost <0.7.0> -------- Application couch started on node nonode@nohost
[notice] 2017-07-08T06:34:17.054117Z nonode@nohost <0.22188.1> -------- config: [compactions] _default deleted for reason nil
[info] 2017-07-08T06:34:17.364621Z nonode@nohost <0.22402.1> -------- Opening index for db: eunit-test-db-149949565754465 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383"
[info] 2017-07-08T06:34:17.364889Z nonode@nohost <0.22405.1> -------- Starting index update for db: eunit-test-db-149949565754465 idx: _design/bar
[info] 2017-07-08T06:34:17.370701Z nonode@nohost <0.22405.1> -------- Index update finished for db: eunit-test-db-149949565754465 idx: _design/bar
[info] 2017-07-08T06:34:17.374392Z nonode@nohost <0.22402.1> -------- Index shutdown by monitor notice for db: eunit-test-db-149949565754465 idx: _design/bar
[info] 2017-07-08T06:34:17.374925Z nonode@nohost <0.22402.1> -------- Closing index for db: eunit-test-db-149949565754465 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383" because normal
[info] 2017-07-08T06:34:17.388247Z nonode@nohost <0.22420.1> -------- Opening index for db: eunit-test-db-1499495657375222 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383"
[info] 2017-07-08T06:34:17.388478Z nonode@nohost <0.22423.1> -------- Starting index update for db: eunit-test-db-1499495657375222 idx: _design/bar
[info] 2017-07-08T06:34:17.394489Z nonode@nohost <0.22423.1> -------- Index update finished for db: eunit-test-db-1499495657375222 idx: _design/bar
[info] 2017-07-08T06:34:17.397271Z nonode@nohost <0.22420.1> -------- Index shutdown by monitor notice for db: eunit-test-db-1499495657375222 idx: _design/bar
[info] 2017-07-08T06:34:17.397739Z nonode@nohost <0.22420.1> -------- Closing index for db: eunit-test-db-1499495657375222 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383" because normal
[info] 2017-07-08T06:34:17.410658Z nonode@nohost <0.22438.1> -------- Opening index for db: eunit-test-db-1499495657398013 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383"
[info] 2017-07-08T06:34:17.411090Z nonode@nohost <0.22441.1> -------- Starting index update for db: eunit-test-db-1499495657398013 idx: _design/bar
[info] 2017-07-08T06:34:17.420887Z nonode@nohost <0.22441.1> -------- Index update finished for db: eunit-test-db-1499495657398013 idx: _design/bar
[info] 2017-07-08T06:34:17.424261Z nonode@nohost <0.22438.1> -------- Index shutdown by monitor notice for db: eunit-test-db-1499495657398013 idx: _design/bar
[info] 2017-07-08T06:34:17.424604Z nonode@nohost <0.22438.1> -------- Closing index for db: eunit-test-db-1499495657398013 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383" because normal
[info] 2017-07-08T06:34:17.437426Z nonode@nohost <0.22456.1> -------- Opening index for db: eunit-test-db-1499495657424766 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383"
[info] 2017-07-08T06:34:17.437689Z nonode@nohost <0.22459.1> -------- Starting index update for db: eunit-test-db-1499495657424766 idx: _design/bar
[info] 2017-07-08T06:34:17.443562Z nonode@nohost <0.22459.1> -------- Index update finished for db: eunit-test-db-1499495657424766 idx: _design/bar
[info] 2017-07-08T06:34:17.446443Z nonode@nohost <0.22456.1> -------- Index shutdown by monitor notice for db: eunit-test-db-1499495657424766 idx: _design/bar
[info] 2017-07-08T06:34:17.447092Z nonode@nohost <0.22456.1> -------- Closing index for db: eunit-test-db-1499495657424766 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383" because normal
[info] 2017-07-08T06:34:17.471464Z nonode@nohost <0.22474.1> -------- Opening index for db: eunit-test-db-1499495657447342 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383"
[info] 2017-07-08T06:34:17.471735Z nonode@nohost <0.22479.1> -------- Compaction started for db: eunit-test-db-1499495657447342 idx: _design/bar
[info] 2017-07-08T06:34:17.472199Z nonode@nohost <0.22477.1> -------- Starting index update for db: eunit-test-db-1499495657447342 idx: _design/bar
[notice] 2017-07-08T06:34:17.478663Z nonode@nohost <0.22474.1> -------- Compaction swap for view /tmp/tmp.8hwe87r9Aw/apache-couchdb-2.1.0-da7aa54/tmp/data/.eunit-test-db-1499495657447342_design/mrview/7d577d5ace56a11c2e7289f38c77f383.view 51 4200
[info] 2017-07-08T06:34:17.479292Z nonode@nohost <0.22477.1> -------- Restarting index update for db: eunit-test-db-1499495657447342 idx: _design/bar
[info] 2017-07-08T06:34:17.479623Z nonode@nohost <0.22479.1> -------- Compaction finished for db: eunit-test-db-1499495657447342 idx: _design/bar
[error] 2017-07-08T06:34:17.479875Z nonode@nohost <0.22477.1> -------- gen_server <0.22477.1> terminated with reason: unknown_info
  last msg: {'EXIT',<0.22480.1>,{updated,<0.22480.1>,{mrst,<<125,87,125,90,206,86,161,28,46,114,137,243,140,119,243,131>>,<0.22476.1>,#Ref<0.0.6.164863>,<<"eunit-test-db-1499495657447342">>,<<"_design/bar">>,<<"javascript">>,[{<<"seq_indexed">>,true}],true,false,{[]},[{mrview,0,11,0,[<<"baz">>],[],<<"function(doc) {emit(doc.val.toString(), doc.val);}">>,{btree,<0.22476.1>,{363,{10,[]},139},undefined,undefined,#Fun<couch_ejson_compare.less_json_ids.2>,#Fun<couch_mrview_util.17.34658527>,snappy},{btree,<0.22476.1>,{502,10,376},undefined,undefined,undefined,#Fun<couch_db_updater.btree_by_seq_reduce.2>,snappy},nil,true,false,[]}],{btree,<0.22476.1>,{51,[],135},undefined,undefined,undefined,nil,snappy},{btree,<0.22476.1>,{186,[],177},undefined,undefined,undefined,nil,snappy},11,0,undefined,undefined,undefined,undefined,undefined,nil}}}
     state: {st,<0.22474.1>,couch_mrview_index,<0.22488.1>}
[error] 2017-07-08T06:34:17.480273Z nonode@nohost <0.22477.1> -------- CRASH REPORT Process  (<0.22477.1>) with 2 neighbors exited with reason: unknown_info at gen_server:terminate/6(line:737) <= proc_lib:init_p_do_apply/3(line:237); initial_call: {couch_index_updater,init,['Argument__1']}, ancestors: [<0.22474.1>,<0.22473.1>], messages: [], links: [<0.22474.1>], dictionary: [], trap_exit: true, status: running, heap_size: 1598, stack_size: 27, reductions: 1732
[error] 2017-07-08T06:34:17.480508Z nonode@nohost <0.22478.1> -------- gen_server <0.22478.1> terminated with reason: unknown_info
  last msg: {'EXIT',<0.22474.1>,unknown_info}
     state: {st,<0.22474.1>,couch_mrview_index,<0.22479.1>}
[error] 2017-07-08T06:34:17.480850Z nonode@nohost <0.22478.1> -------- CRASH REPORT Process  (<0.22478.1>) with 0 neighbors exited with reason: unknown_info at gen_server:terminate/6(line:737) <= proc_lib:init_p_do_apply/3(line:237); initial_call: {couch_index_compactor,init,['Argument__1']}, ancestors: [<0.22474.1>,<0.22473.1>], messages: [], links: [], dictionary: [], trap_exit: true, status: running, heap_size: 376, stack_size: 27, reductions: 145

@eiri @nickva are either of you able to help out here?

Looking into #649 I realized there's a pretty terrible race condition if an index is compacted quickly followed by an index update. Since we don't check the index updater message it would be possible for us to swap out a compaction change, followed by immediately resetting to the new state from the index updater. This would be bad as we'd possibly end up with a situation where our long lived index would be operating on a file that no longer existed on disk.

This was encountered during the test suite runs on Travis. It turns out that when we restart the indexer its possible to already have the 'EXIT' message in our mailbox. When we do we'll then crash with an unknown_info error since our updater pid was changed during the restart. This change simple filters any 'EXIT' message from the old updater from the mailbox before restarting thew new index updater. Fixes #649

Looking into #649 I realized there's a pretty terrible race condition if an index is compacted quickly followed by an index update. Since we don't check the index updater message it would be possible for us to swap out a compaction change, followed by immediately resetting to the new state from the index updater. This would be bad as we'd possibly end up with a situation where our long lived index would be operating on a file that no longer existed on disk.

This was encountered during the test suite runs on Travis. It turns out that when we restart the indexer its possible to already have the 'EXIT' message in our mailbox. When we do we'll then crash with an unknown_info error since our updater pid was changed during the restart. This change simple filters any 'EXIT' message from the old updater from the mailbox before restarting thew new index updater. Fixes #649

Looking into #649 I realized there's a pretty terrible race condition if an index is compacted quickly followed by an index update. Since we don't check the index updater message it would be possible for us to swap out a compaction change, followed by immediately resetting to the new state from the index updater. This would be bad as we'd possibly end up with a situation where our long lived index would be operating on a file that no longer existed on disk.

wohali added dbcore viewserver labels Jul 6, 2017

wohali added this to the 2.1.0 milestone Jul 6, 2017

This was referenced Jul 8, 2017

Compaction daemon is interfering with eunit tests (all_dbs_active errors) #652

Closed

Disable compaction daemon on eunit run couch startups #654

Merged

wohali added the testsuite label Jul 8, 2017

wohali closed this as completed Jul 8, 2017

wohali reopened this Jul 8, 2017

davisp mentioned this issue Jul 12, 2017

649 fix mrview changes since test #667

Merged

davisp closed this as completed in 6586102 Jul 13, 2017

davisp marked this as a duplicate of #685 Jul 19, 2017

davisp mentioned this issue Jul 19, 2017

undefined error in couch_mrview_changes_since_tests:test_compact/1 #685

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

couch_mrview_changes_since_tests gen_server failing with unknown_info #649

couch_mrview_changes_since_tests gen_server failing with unknown_info #649

wohali commented Jul 6, 2017

eiri commented Jul 6, 2017

wohali commented Jul 8, 2017 •

edited

Loading

wohali commented Jul 8, 2017

wohali commented Jul 8, 2017

wohali commented Jul 8, 2017 •

edited

Loading

couch_mrview_changes_since_tests gen_server failing with unknown_info #649

couch_mrview_changes_since_tests gen_server failing with unknown_info #649

Comments

wohali commented Jul 6, 2017

eiri commented Jul 6, 2017

wohali commented Jul 8, 2017 • edited Loading

wohali commented Jul 8, 2017

wohali commented Jul 8, 2017

wohali commented Jul 8, 2017 • edited Loading

wohali commented Jul 8, 2017 •

edited

Loading

wohali commented Jul 8, 2017 •

edited

Loading