couch_replicator_scheduler killing "completed normally" replication? #644

wohali · 2017-07-05T18:00:57Z

Current & Expected Behavior

Normally the couch_replicator_use_checkpoints_tests pass fine. However, we have an unusual failure in a Travis run today. The replication scheduler says it "completed normally", but then it appears it proceeds to KILL it?! Odd.

Logs

Travis: https://travis-ci.org/apache/couchdb/jobs/250436888#L5400-L5407
couch.log: https://couchdb-vm2.apache.org/ci_errorlogs/travis-couchdb-250436888-2017-07-05T17%3A16%3A05.960154/couchlog.tar.gz
Useful excerpt: https://gist.github.com/wohali/0bc580a1edf96205a52df30a72e55ddc

Assigning to @nickva because this feels like a scheduler bug.

nickva · 2017-07-11T04:10:41Z

Sprinkled some ?debug macros on that test and made Travis run it 10 times in a row.

cloudant@fb13cf5

One failure:

https://travis-ci.org/cloudant/couchdb/jobs/252251466

test/couch_replicator_use_checkpoints_tests.erl:77:<0.30989.1>: <-

        test/couch_replicator_use_checkpoints_tests.erl:80:<0.30989.1>: <-

[done in 0.659 s]

test/couch_replicator_use_checkpoints_tests.erl:87:<0.30989.1>: <-

test/couch_replicator_use_checkpoints_tests.erl:80:<0.30989.1>: <-

test/couch_replicator_use_checkpoints_tests.erl:89:<0.30989.1>: <-

test/couch_replicator_use_checkpoints_tests.erl:91:<0.30989.1>: <-

test/couch_replicator_use_checkpoints_tests.erl:93:<0.30989.1>: <-

    *unexpected termination of test process*

::killed

test/couch_replicator_use_checkpoints_tests.erl:50:<0.916.0>: <-

  [done in 5.419 s]

[done in 5.419 s]

Because `stop/1` is asynchronous, it only casts a stop message. As a result, the client process could end up getting killed during the termination/cleanup phase if this sequence of events took place:

1. Client calls `stop(ListerPid).`
2. couch_event_sup casts a `stop` message to the couch_event_sup gen_server.
3. The `stop` message is delayed and the client continues executing.
4. Client calls something like `application:stop/1`.
5. `application:stop/1` (couch_replicator) terminates the couch_event_sup gen_server.
6. Termination of the application kills the client process because it is still linked.

Issue apache#644

Because `stop/1` is asynchronous and only casts a stop message, the client process could end up getting killed during the termination/cleanup phase if this sequence of events took place:

1. Client calls `stop(ListerPid).`
2. couch_event_sup casts a `stop` message to the couch_event_sup gen_server.
3. The `stop` message is delayed and the client continues executing.
4. Client calls something like `application:stop/1`.
5. `application:stop/1` terminates the couch_event_sup gen_server.
6. App termination kills the client process because it is still linked.

So this makes the stop synchronous by using call instead of cast.

Issue apache#644
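The difference between the two stop styles can be sketched with a minimal gen_server. This is only an illustration under stated assumptions, not the actual couch_event_sup code: the module name `stop_sketch` and the functions `stop_async/1` and `stop_sync/1` are hypothetical. With `gen_server:cast/2` the caller continues immediately, so the server may still be running (and still linked) when the caller moves on to its cleanup; with `gen_server:call/2` the caller blocks until the server has handled the stop request.

```erlang
%% Hypothetical module for illustration only; not the couch_event_sup source.
-module(stop_sketch).
-behaviour(gen_server).

-export([start_link/0, stop_async/1, stop_sync/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

%% Starts the server linked to the calling (client) process.
start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Asynchronous stop: returns ok as soon as the message is sent.
%% The server may not have processed it yet when this returns.
stop_async(Pid) ->
    gen_server:cast(Pid, stop).

%% Synchronous stop: returns only after the server has handled
%% the stop request and sent back its reply.
stop_sync(Pid) ->
    gen_server:call(Pid, stop).

init([]) ->
    {ok, nostate}.

%% Reply ok to the caller and terminate with reason normal.
handle_call(stop, _From, State) ->
    {stop, normal, ok, State};
handle_call(_Msg, _From, State) ->
    {reply, ignored, State}.

%% Terminate with reason normal; nobody is waiting for a reply.
handle_cast(stop, State) ->
    {stop, normal, State};
handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(_Msg, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.
```

In this sketch a client that calls `stop_sync(Pid)` before something like `application:stop/1` knows the stop request was handled first, which is the ordering the cast version could not guarantee.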


wohali · 2017-07-11T18:37:55Z

Closing because of #662, until we see a recurrence of this.

wohali assigned nickva Jul 5, 2017

nickva mentioned this issue Jul 11, 2017

Make couch_event_sup:stop/1 synchronous #662

Merged

nickva added a commit to cloudant/couchdb that referenced this issue Jul 11, 2017

Temporarily try running more Travis builds

8fa24fa

This is to help to monitor test flakiness progress. Issue apache#644

nickva added a commit to cloudant/couchdb that referenced this issue Jul 11, 2017

Temporarily try running more Travis builds

7fe5053

This is to help to monitor test flakiness progress. Issue apache#644

nickva mentioned this issue Jul 11, 2017

Temporarily try running more Travis builds #663

Closed

wohali closed this as completed Jul 11, 2017

lag-linaro pushed a commit to lag-linaro/couchdb that referenced this issue Oct 25, 2018

Merge pull request apache#644 from tuncer/travis-otp-20.0.4

66c175d

travis-ci: otp 20.0.2 -> 20.0.4

nickva pushed a commit to nickva/couchdb that referenced this issue Sep 7, 2022

Add alternate location for local.ini for macOS, closes apache#575 (ap…

d12790b

…ache#644)
