Jepsen transient failures under network partition conditions #7549

Closed
pilvitaneli opened this Issue Sep 3, 2014 · 5 comments

Hi! The Jepsen tests include five nemeses (test scenarios) that introduce different types of network partitions (see here). The tests add documents to an index before, during, and after these partitions, and verify that the documents acknowledged during the partitions are retrievable afterwards. Sometimes the tests report that a number of documents were indexed but are not retrievable; however, this does not happen on every run of the same scenario. For example, running each scenario 20 times (against 598854d) reported the following :lost-frac values:

isolate-self-primaries-nemesis 244/361, 2/733, 1/607, 1/603, 1/213, 65/216 (and 14 times 0)
nemesis/partition-random-halves 1/355, 1/226, 4/733, 1/433 (and 16 times 0)
nemesis/partition-halves 1/65, 1/438, 4/715, 2/457, 6/731, 1/435, 9/433 (and 13 times 0)
nemesis/partitioner nemesis/bridge 2/415, 3/253, 2/383, 7/754, 1/786, 1/767 (and 14 times 0)
nemesis/partition-random-node does not report any lost documents.

In total, 23 out of 100 runs failed.
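
As a quick sanity check of that tally, the per-scenario figures can be added up in a few lines of Python. This is only an illustrative sketch: the values are transcribed from the lists in this report, and names such as `lost` and `RUNS_PER_SCENARIO` are made up for the example and are not part of Jepsen.

```python
from fractions import Fraction

RUNS_PER_SCENARIO = 20  # each nemesis was run 20 times

# Non-zero :lost-frac values as (numerator, denominator), copied from the report.
lost = {
    "isolate-self-primaries-nemesis": [(244, 361), (2, 733), (1, 607), (1, 603), (1, 213), (65, 216)],
    "nemesis/partition-random-halves": [(1, 355), (1, 226), (4, 733), (1, 433)],
    "nemesis/partition-halves": [(1, 65), (1, 438), (4, 715), (2, 457), (6, 731), (1, 435), (9, 433)],
    "nemesis/partitioner nemesis/bridge": [(2, 415), (3, 253), (2, 383), (7, 754), (1, 786), (1, 767)],
    "nemesis/partition-random-node": [],  # no lost documents reported
}

# A run counts as failed if it reported a non-zero :lost-frac.
failed_runs = sum(len(values) for values in lost.values())
total_runs = RUNS_PER_SCENARIO * len(lost)

# Largest reported fraction across all scenarios.
worst = max(Fraction(n, d) for values in lost.values() for n, d in values)

print(f"{failed_runs} of {total_runs} runs failed")
print(f"worst :lost-frac = {worst} (about {float(worst):.1%})")
```

Exact fractions are used instead of floats so the comparison between runs does not depend on rounding.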

dakrone Sep 3, 2014

Member

Hi @pilvitaneli, thanks for the testing results!

We're actively investigating the Jepsen tests on top of our own tests, which resulted in #7572. The Jepsen tests helped verify that we fixed the split-brain issue (it no longer happens). In all of our runs, though, we couldn't reproduce a result similar to your first run (the isolate-self-primaries-nemesis run where you lost 244/361). We're still trying, but I might circle back with you to figure out how you ended up with those results. We can reproduce the smaller-scale data loss that we believe relates to #7572, but this is also still under investigation.

I'll let you know how our continued testing with Jepsen goes. Thanks again for your results!

dakrone added the resiliency label Sep 3, 2014

pilvitaneli Sep 4, 2014

Running just isolate-self-primaries-nemesis 50 times in succession results in 22 failures:
1/403
404/653
1/583
6/667
287/395
4/583
16/655
3/1037
8/807
1/565
1/555
5/638
1/626
3/784
3/653
2/621
3/632
1/254
1/610
3/307
11/668
1/446
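
To put these numbers in proportion, the reported fractions can be summarized with a short Python sketch. The values are copied from the list above. The 10% threshold and the variable names are arbitrary choices made for this illustration, not something Jepsen defines.

```python
from fractions import Fraction

TOTAL_RUNS = 50  # isolate-self-primaries-nemesis was run 50 times

# :lost-frac values of the failing runs, copied from the list above.
reported = """
1/403 404/653 1/583 6/667 287/395 4/583 16/655 3/1037 8/807 1/565 1/555
5/638 1/626 3/784 3/653 2/621 3/632 1/254 1/610 3/307 11/668 1/446
""".split()

# Share of runs that reported any lost documents.
failure_rate = Fraction(len(reported), TOTAL_RUNS)

# Runs that lost more than 10% (threshold chosen only for illustration).
large = [s for s in reported if Fraction(s) > Fraction(1, 10)]

print(f"{len(reported)} of {TOTAL_RUNS} runs failed ({float(failure_rate):.0%})")
print("runs losing more than 10%:", ", ".join(large))
```

Most failing runs lost only a handful of documents; the two runs above the threshold stand apart from the rest.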

dakrone Oct 21, 2014

Member

@pilvitaneli, circling back to this after a while: do you happen to have the commit SHA of Jepsen that you are using for running your tests? I'd like to make sure we run the same tests.

pilvitaneli Oct 21, 2014

I haven't run the tests in a while, but the last run was with jepsen-io/jepsen@761693b. There don't appear to be any considerable changes since then, but I could try to re-run with current master.

dakrone removed their assignment Feb 21, 2016

dakrone Sep 27, 2016

Member

Going to close this, as it's been almost 2 years and we have a different issue where we are tracking things for the 5.0 release: #20031

dakrone closed this Sep 27, 2016
