Remove publish_capabilities race #230

Merged
merged 1 commit into from Sep 25, 2012

Projects

None yet

3 participants

@reiddraper

See commit message for more detail.

To exhibit the bug, I ran a fresh single-node cluster with the following diff:

diff --git a/src/riak_core_capability.erl b/src/riak_core_capability.erl
index a3a4fe6..e6e371c 100644
--- a/src/riak_core_capability.erl
+++ b/src/riak_core_capability.erl
@@ -378,6 +378,7 @@ publish_supported(State) ->
                 end
         end,
     spawn(fun() ->
+                  timer:sleep(crypto:rand_uniform(1, 1000)),
                   riak_core_ring_manager:ring_trans(F, ok)
           end),
     ok.

After running this a few times, I observed missing capabilities in the ring, as seen by
riak_core_ring_manager:get_my_ring().

All credit goes to @jonmeredith for being suspicious of this call in a spawn.

@reiddraper reiddraper Call ring_trans synchronously, not in a spawn
Calling `add_supported_to_ring` is not threadsafe.
If a process retrieves the member_meta and then
it's concurrently updated by another process,
the original process' changed will be overwritten.

To exhibit the original bug, I added a
timer:sleep(crypto:rand_uniform(1, 1000))
line inside the spawned fun that calls
riak_core_ring_manager:ring_trans(F, ok)
094954d
@jtuple

So, this looks reasonable and makes sense as to how it fixes the race. I do wonder if this was put in originally to avoid some rare deadlock between the ring manager and the capability manager (ie. they both call and wait on each other), but I can't seem to reproduce anything bad like that with manual testing.

I would love to have checked this against the verify_capabilities and rolling_capabilities tests in riak_test, but those tests currently fail on master. Have some ideas on how to fix those, but don't want to hold up getting this merged in and another 1.2.1 RC build made for testing. We can just focus on fixing capabilities test afterwards and make passing capabilities test one of the release criteria.

+1 merge

@jtuple jtuple merged commit 05d8895 into 1.2 Sep 25, 2012

1 check passed

Details default The Travis build passed
@seancribbs seancribbs was assigned Oct 2, 2012
@seancribbs seancribbs deleted the wrd-remove-capability-publish-race branch Apr 1, 2015
@seancribbs seancribbs removed their assignment May 8, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment