Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

couch_mrview_changes_since_tests gen_server failing with unknown_info #649

Closed
wohali opened this issue Jul 6, 2017 · 5 comments
Closed

Comments

@wohali
Copy link
Member

wohali commented Jul 6, 2017

Possible recurrence of the #548 issue:

https://builds.apache.org/blue/organizations/jenkins/CouchDB/detail/master/57/pipeline/50

module 'couch_mrview_changes_since_tests'
  changes_since tests
    couch_mrview_changes_since_tests:109: test_basic...ok
    couch_mrview_changes_since_tests:132: test_basic_since...ok
    couch_mrview_changes_since_tests:145: test_basic_count...ok
    couch_mrview_changes_since_tests:154: test_basic_count_since...ok
    undefined
    *** instantiation of subtests failed ***
**in function gen_server:call/3 (gen_server.erl, line 212)
in call from couch_mrview_util:get_view_index_state/5 (src/couch_mrview_util.erl, line 83)
in call from couch_mrview_util:get_view/4 (src/couch_mrview_util.erl, line 45)
in call from couch_mrview:count_view_changes_since/5 (src/couch_mrview.erl, line 304)
in call from couch_mrview_changes_since_tests:test_compact/1 (test/couch_mrview_changes_since_tests.erl, line 163)
**exit:{unknown_info,{gen_server,call,[<0.22793.1>,{get_state,11},infinity]}}

Searching on <0.22793.1> in the couch.log turned up:

[info] 2017-07-06T04:37:28.456116Z nonode@nohost <0.22793.1> -------- Opening index for db: eunit-test-db-1499315848442098 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383"
[info] 2017-07-06T04:37:28.456384Z nonode@nohost <0.22798.1> -------- Compaction started for db: eunit-test-db-1499315848442098 idx: _design/bar
[info] 2017-07-06T04:37:28.456583Z nonode@nohost <0.22796.1> -------- Starting index update for db: eunit-test-db-1499315848442098 idx: _design/bar
[notice] 2017-07-06T04:37:28.461342Z nonode@nohost <0.22793.1> -------- Compaction swap for view /tmp/tmp.VHCxH73tOc/apache-couchdb-2.1.0-858088e/tmp/data/.eunit-test-db-1499315848442098_design/mrview/7d577d5ace56a11c2e7289f38c77f383.view 51 4200
[info] 2017-07-06T04:37:28.461710Z nonode@nohost <0.22796.1> -------- Restarting index update for db: eunit-test-db-1499315848442098 idx: _design/bar
[info] 2017-07-06T04:37:28.461894Z nonode@nohost <0.22798.1> -------- Compaction finished for db: eunit-test-db-1499315848442098 idx: _design/bar
[error] 2017-07-06T04:37:28.462242Z nonode@nohost <0.22796.1> -------- gen_server <0.22796.1> terminated with reason: unknown_info
  last msg: {'EXIT',<0.22799.1>,{updated,<0.22799.1>,{mrst,<<125,87,125,90,206,86,161,28,46,114,137,243,140,119,243,131>>,<0.22795.1>,#Ref<0.0.786433.256464>,<<"eunit-test-db-1499315848442098">>,<<"_design/bar">>,<<"javascript">>,[{<<"seq_indexed">>,true}],true,false,{[]},[{mrview,0,11,0,[<<"baz">>],[],<<"function(doc) {emit(doc.val.toString(), doc.val);}">>,{btree,<0.22795.1>,{363,{10,[]},139},undefined,undefined,#Fun<couch_ejson_compare.less_json_ids.2>,#Fun<couch_mrview_util.17.88980446>,snappy},{btree,<0.22795.1>,{502,10,376},undefined,undefined,undefined,#Fun<couch_db_updater.btree_by_seq_reduce.2>,snappy},nil,true,false,[]}],{btree,<0.22795.1>,{51,[],135},undefined,undefined,undefined,nil,snappy},{btree,<0.22795.1>,{186,[],177},undefined,undefined,undefined,nil,snappy},11,0,undefined,undefined,undefined,undefined,undefined,nil}}}
     state: {st,<0.22793.1>,couch_mrview_index,<0.22807.1>}
[error] 2017-07-06T04:37:28.462545Z nonode@nohost <0.22796.1> -------- CRASH REPORT Process  (<0.22796.1>) with 2 neighbors exited with reason: unknown_info at gen_server:terminate/7(line:826) <= proc_lib:init_p_do_apply/3(line:240); initial_call: {couch_index_updater,init,['Argument__1']}, ancestors: [<0.22793.1>,<0.22792.1>], messages: [], links: [<0.22793.1>], dictionary: [], trap_exit: true, status: running, heap_size: 1598, stack_size: 27, reductions: 1734
[error] 2017-07-06T04:37:28.462718Z nonode@nohost <0.22797.1> -------- gen_server <0.22797.1> terminated with reason: unknown_info
  last msg: {'EXIT',<0.22793.1>,unknown_info}
     state: {st,<0.22793.1>,couch_mrview_index,<0.22798.1>}
[error] 2017-07-06T04:37:28.463037Z nonode@nohost <0.22797.1> -------- CRASH REPORT Process  (<0.22797.1>) with 0 neighbors exited with reason: unknown_info at gen_server:terminate/7(line:826) <= proc_lib:init_p_do_apply/3(line:240); initial_call: {couch_index_compactor,init,['Argument__1']}, ancestors: [<0.22793.1>,<0.22792.1>], messages: [], links: [], dictionary: [], trap_exit: true, status: running, heap_size: 376, stack_size: 27, reductions: 158

@eiri This feels a lot like what we ran into previously.

/cc @kocolosk @janl @davisp

@wohali wohali added this to the 2.1.0 milestone Jul 6, 2017
@eiri
Copy link
Member

eiri commented Jul 6, 2017

No, this is different, just looks the same, because it's happening during getting view state and the cause is also race condition.

In previous case we've been crashing on old index_server exit or be gone already, so exceptions been normal (for exit reason) or noproc.

Here exception is unknown_info and it is actually couch_index_updater server crashing (and bringing index server with it, because this clause doesn't match. It seems that the updater spawned two updating workers due to restart on compaction, then kept pid of the second, but got exit message from the first one. This is interesting, I'd expect this swap to be sequential, so I wonder if we have a bug in couch_util:shutdown_sync/1 and it is not that "sync".

On the other hand, why do we have a compaction happening during a test? I thought that new compaction daemon supposed to be turned off for eunits?

@wohali
Copy link
Member Author

wohali commented Jul 8, 2017

I think this is because of #652 in the eunit test - but do we want to keep this issue to track fixing the race condition? Keep in mind that once the PR is merged, we won't easily reproduce this bug anymore.

@wohali
Copy link
Member Author

wohali commented Jul 8, 2017

Closing because PR#654 landed. Will reopen if it recurs.

@wohali wohali closed this as completed Jul 8, 2017
@wohali
Copy link
Member Author

wohali commented Jul 8, 2017

@wohali wohali reopened this Jul 8, 2017
@wohali
Copy link
Member Author

wohali commented Jul 8, 2017

The Makefile shows <0.22474.1> as the pid that is being monitored:

module 'couch_mrview_changes_since_tests'
  changes_since tests
    couch_mrview_changes_since_tests:109: test_basic...ok
    couch_mrview_changes_since_tests:132: test_basic_since...ok
    couch_mrview_changes_since_tests:145: test_basic_count...ok
    couch_mrview_changes_since_tests:154: test_basic_count_since...ok
    undefined
    *** instantiation of subtests failed ***
**in function gen_server:call/3 (gen_server.erl, line 190)
in call from couch_mrview_util:get_view_index_state/5 (src/couch_mrview_util.erl, line 83)
in call from couch_mrview_util:get_view/4 (src/couch_mrview_util.erl, line 45)
in call from couch_mrview:count_view_changes_since/5 (src/couch_mrview.erl, line 304)
in call from couch_mrview_changes_since_tests:test_compact/1 (test/couch_mrview_changes_since_tests.erl, line 163)
**exit:{unknown_info,{gen_server,call,[<0.22474.1>,{get_state,11},infinity]}}

couch.log shows the compaction in the test, but a strange exit:

[info] 2017-07-08T06:34:17.054113Z nonode@nohost <0.7.0> -------- Application couch started on node nonode@nohost
[notice] 2017-07-08T06:34:17.054117Z nonode@nohost <0.22188.1> -------- config: [compactions] _default deleted for reason nil
[info] 2017-07-08T06:34:17.364621Z nonode@nohost <0.22402.1> -------- Opening index for db: eunit-test-db-149949565754465 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383"
[info] 2017-07-08T06:34:17.364889Z nonode@nohost <0.22405.1> -------- Starting index update for db: eunit-test-db-149949565754465 idx: _design/bar
[info] 2017-07-08T06:34:17.370701Z nonode@nohost <0.22405.1> -------- Index update finished for db: eunit-test-db-149949565754465 idx: _design/bar
[info] 2017-07-08T06:34:17.374392Z nonode@nohost <0.22402.1> -------- Index shutdown by monitor notice for db: eunit-test-db-149949565754465 idx: _design/bar
[info] 2017-07-08T06:34:17.374925Z nonode@nohost <0.22402.1> -------- Closing index for db: eunit-test-db-149949565754465 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383" because normal
[info] 2017-07-08T06:34:17.388247Z nonode@nohost <0.22420.1> -------- Opening index for db: eunit-test-db-1499495657375222 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383"
[info] 2017-07-08T06:34:17.388478Z nonode@nohost <0.22423.1> -------- Starting index update for db: eunit-test-db-1499495657375222 idx: _design/bar
[info] 2017-07-08T06:34:17.394489Z nonode@nohost <0.22423.1> -------- Index update finished for db: eunit-test-db-1499495657375222 idx: _design/bar
[info] 2017-07-08T06:34:17.397271Z nonode@nohost <0.22420.1> -------- Index shutdown by monitor notice for db: eunit-test-db-1499495657375222 idx: _design/bar
[info] 2017-07-08T06:34:17.397739Z nonode@nohost <0.22420.1> -------- Closing index for db: eunit-test-db-1499495657375222 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383" because normal
[info] 2017-07-08T06:34:17.410658Z nonode@nohost <0.22438.1> -------- Opening index for db: eunit-test-db-1499495657398013 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383"
[info] 2017-07-08T06:34:17.411090Z nonode@nohost <0.22441.1> -------- Starting index update for db: eunit-test-db-1499495657398013 idx: _design/bar
[info] 2017-07-08T06:34:17.420887Z nonode@nohost <0.22441.1> -------- Index update finished for db: eunit-test-db-1499495657398013 idx: _design/bar
[info] 2017-07-08T06:34:17.424261Z nonode@nohost <0.22438.1> -------- Index shutdown by monitor notice for db: eunit-test-db-1499495657398013 idx: _design/bar
[info] 2017-07-08T06:34:17.424604Z nonode@nohost <0.22438.1> -------- Closing index for db: eunit-test-db-1499495657398013 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383" because normal
[info] 2017-07-08T06:34:17.437426Z nonode@nohost <0.22456.1> -------- Opening index for db: eunit-test-db-1499495657424766 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383"
[info] 2017-07-08T06:34:17.437689Z nonode@nohost <0.22459.1> -------- Starting index update for db: eunit-test-db-1499495657424766 idx: _design/bar
[info] 2017-07-08T06:34:17.443562Z nonode@nohost <0.22459.1> -------- Index update finished for db: eunit-test-db-1499495657424766 idx: _design/bar
[info] 2017-07-08T06:34:17.446443Z nonode@nohost <0.22456.1> -------- Index shutdown by monitor notice for db: eunit-test-db-1499495657424766 idx: _design/bar
[info] 2017-07-08T06:34:17.447092Z nonode@nohost <0.22456.1> -------- Closing index for db: eunit-test-db-1499495657424766 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383" because normal
[info] 2017-07-08T06:34:17.471464Z nonode@nohost <0.22474.1> -------- Opening index for db: eunit-test-db-1499495657447342 idx: _design/bar sig: "7d577d5ace56a11c2e7289f38c77f383"
[info] 2017-07-08T06:34:17.471735Z nonode@nohost <0.22479.1> -------- Compaction started for db: eunit-test-db-1499495657447342 idx: _design/bar
[info] 2017-07-08T06:34:17.472199Z nonode@nohost <0.22477.1> -------- Starting index update for db: eunit-test-db-1499495657447342 idx: _design/bar
[notice] 2017-07-08T06:34:17.478663Z nonode@nohost <0.22474.1> -------- Compaction swap for view /tmp/tmp.8hwe87r9Aw/apache-couchdb-2.1.0-da7aa54/tmp/data/.eunit-test-db-1499495657447342_design/mrview/7d577d5ace56a11c2e7289f38c77f383.view 51 4200
[info] 2017-07-08T06:34:17.479292Z nonode@nohost <0.22477.1> -------- Restarting index update for db: eunit-test-db-1499495657447342 idx: _design/bar
[info] 2017-07-08T06:34:17.479623Z nonode@nohost <0.22479.1> -------- Compaction finished for db: eunit-test-db-1499495657447342 idx: _design/bar
[error] 2017-07-08T06:34:17.479875Z nonode@nohost <0.22477.1> -------- gen_server <0.22477.1> terminated with reason: unknown_info
  last msg: {'EXIT',<0.22480.1>,{updated,<0.22480.1>,{mrst,<<125,87,125,90,206,86,161,28,46,114,137,243,140,119,243,131>>,<0.22476.1>,#Ref<0.0.6.164863>,<<"eunit-test-db-1499495657447342">>,<<"_design/bar">>,<<"javascript">>,[{<<"seq_indexed">>,true}],true,false,{[]},[{mrview,0,11,0,[<<"baz">>],[],<<"function(doc) {emit(doc.val.toString(), doc.val);}">>,{btree,<0.22476.1>,{363,{10,[]},139},undefined,undefined,#Fun<couch_ejson_compare.less_json_ids.2>,#Fun<couch_mrview_util.17.34658527>,snappy},{btree,<0.22476.1>,{502,10,376},undefined,undefined,undefined,#Fun<couch_db_updater.btree_by_seq_reduce.2>,snappy},nil,true,false,[]}],{btree,<0.22476.1>,{51,[],135},undefined,undefined,undefined,nil,snappy},{btree,<0.22476.1>,{186,[],177},undefined,undefined,undefined,nil,snappy},11,0,undefined,undefined,undefined,undefined,undefined,nil}}}
     state: {st,<0.22474.1>,couch_mrview_index,<0.22488.1>}
[error] 2017-07-08T06:34:17.480273Z nonode@nohost <0.22477.1> -------- CRASH REPORT Process  (<0.22477.1>) with 2 neighbors exited with reason: unknown_info at gen_server:terminate/6(line:737) <= proc_lib:init_p_do_apply/3(line:237); initial_call: {couch_index_updater,init,['Argument__1']}, ancestors: [<0.22474.1>,<0.22473.1>], messages: [], links: [<0.22474.1>], dictionary: [], trap_exit: true, status: running, heap_size: 1598, stack_size: 27, reductions: 1732
[error] 2017-07-08T06:34:17.480508Z nonode@nohost <0.22478.1> -------- gen_server <0.22478.1> terminated with reason: unknown_info
  last msg: {'EXIT',<0.22474.1>,unknown_info}
     state: {st,<0.22474.1>,couch_mrview_index,<0.22479.1>}
[error] 2017-07-08T06:34:17.480850Z nonode@nohost <0.22478.1> -------- CRASH REPORT Process  (<0.22478.1>) with 0 neighbors exited with reason: unknown_info at gen_server:terminate/6(line:737) <= proc_lib:init_p_do_apply/3(line:237); initial_call: {couch_index_compactor,init,['Argument__1']}, ancestors: [<0.22474.1>,<0.22473.1>], messages: [], links: [], dictionary: [], trap_exit: true, status: running, heap_size: 376, stack_size: 27, reductions: 145

@eiri @nickva are either of you able to help out here?

davisp added a commit that referenced this issue Jul 12, 2017
Looking into #649 I realized there's a pretty terrible race condition if
an index is compacted quickly followed by an index update. Since we
don't check the index updater message it would be possible for us to
swap out a compaction change, followed by immediately resetting to the
new state from the index updater. This would be bad as we'd possibly end
up with a situation where our long lived index would be operating on a
file that no longer existed on disk.
davisp added a commit that referenced this issue Jul 12, 2017
This was encountered during the test suite runs on Travis. It turns out
that when we restart the indexer its possible to already have the 'EXIT'
message in our mailbox. When we do we'll then crash with an unknown_info
error since our updater pid was changed during the restart.

This change simple filters any 'EXIT' message from the old updater from
the mailbox before restarting thew new index updater.

Fixes #649
davisp added a commit that referenced this issue Jul 12, 2017
Looking into #649 I realized there's a pretty terrible race condition if
an index is compacted quickly followed by an index update. Since we
don't check the index updater message it would be possible for us to
swap out a compaction change, followed by immediately resetting to the
new state from the index updater. This would be bad as we'd possibly end
up with a situation where our long lived index would be operating on a
file that no longer existed on disk.
davisp added a commit that referenced this issue Jul 12, 2017
This was encountered during the test suite runs on Travis. It turns out
that when we restart the indexer its possible to already have the 'EXIT'
message in our mailbox. When we do we'll then crash with an unknown_info
error since our updater pid was changed during the restart.

This change simple filters any 'EXIT' message from the old updater from
the mailbox before restarting thew new index updater.

Fixes #649
davisp added a commit that referenced this issue Jul 13, 2017
Looking into #649 I realized there's a pretty terrible race condition if
an index is compacted quickly followed by an index update. Since we
don't check the index updater message it would be possible for us to
swap out a compaction change, followed by immediately resetting to the
new state from the index updater. This would be bad as we'd possibly end
up with a situation where our long lived index would be operating on a
file that no longer existed on disk.
@davisp davisp closed this as completed in 6586102 Jul 13, 2017
@davisp davisp marked this as a duplicate of #685 Jul 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants