Fix ring events monitoring by riak_core_node_watcher #399

Open

wants to merge 8 commits

Conversation

3 participants
Contributor

slfritchie commented Sep 27, 2013

Fixes #343: riak_core_node_watcher should tolerate death of riak_core_ring_events

Before this patch, the riak_core_node_watcher.erl code assumed
that if the riak_core_ring_events proc (a gen_event server) died,
then a {gen_event_EXIT,_,_} message would be sent. However, that
assumption is not correct. If it dies, we get a regular {'EXIT',_,_}
message.

Tested by repeated use of alternating:

  • exit(whereis(riak_core_ring_events), kill).

... and looking at the length of the links list
of process_info(whereis(riak_core_ring_events), links) -- it should
be three, not two.
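
To make the distinction concrete, here is a minimal sketch, not the actual patch: it assumes the watcher traps exits, and reregister_ring_callback/1 is a made-up helper standing in for whatever re-registration the real code does.

%% Sketch of the two messages a process that called gen_event:add_sup_handler/3
%% can receive (helper name is illustrative only).
handle_info({gen_event_EXIT, _Handler, _Reason}, State) ->
    %% Sent only when the *handler* is removed while the riak_core_ring_events
    %% gen_event server keeps running.
    {noreply, reregister_ring_callback(State)};
handle_info({'EXIT', _Pid, _Reason}, State) ->
    %% Sent when the riak_core_ring_events server itself dies, because
    %% gen_event:add_sup_handler/3 links the caller to it. This is the case
    %% the old code missed.
    {noreply, reregister_ring_callback(State)}.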

@slfritchie slfritchie Fixes #343 (commit 6be124b; the commit message repeats the PR description above)

@ghost ghost assigned engelsanchez Sep 30, 2013

@engelsanchez engelsanchez and 1 other commented on an outdated diff Sep 30, 2013

src/riak_core_node_watcher.erl

     Self = self(),
     Fn = fun(R) ->
                  gen_server:cast(Self, {ring_update, R})
          end,
-    riak_core_ring_events:add_sup_callback(Fn).
+    riak_core_ring_events:add_sup_callback(Fn),
+    case riak_core_ring_events:get_pid() of
+        P when P == RingEventsPid ->
+            RingEventsPid;
+        _ ->
@engelsanchez

engelsanchez Sep 30, 2013

Contributor

It is possible although unlikely that at this point the gen_event server died after the call to get_pid and before the call to add_sup_callback. In that case, we are going to re-enter this function and register a second callback. Gen event is dumb and will happily add two, invoking the module twice for each event. It will also send double notifications in the event of a termination, etc. It might be safer, although hacky, to do the following instead:

  • Get list of linked processes
  • Get pid of current event mgr after registering the callback. If in the list above, all is well. Otherwise retry, possibly trying to drain any gen_event_EXIT messages from a fast dying event handler, which would cause registration to blindly happen again, possibly ending up in a double registration.

Or maybe we could protect ourselves on entry: check the pid of the current event handler. Are we linked to it? Then the gen_event_EXIT message is probably stale and we shouldn't re-register. Or go ahead and try to unregister, then register again instead.
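
Roughly something like this sketch (maybe_reregister/0 is just an illustrative name; riak_core_ring_events:get_pid/0 is the existing call used in this PR):

%% Hypothetical entry guard: if we are still linked to the current ring events
%% manager, the gen_event_EXIT we just saw is stale and we should not add a
%% second handler; otherwise register as usual.
maybe_reregister() ->
    EventMgr = riak_core_ring_events:get_pid(),
    {links, Links} = process_info(self(), links),
    case is_pid(EventMgr) andalso lists:member(EventMgr, Links) of
        true  -> already_registered;
        false -> watch_for_ring_events()
    end.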

What do you think?

@slfritchie

slfritchie Oct 1, 2013

Contributor

Hrm. It's probably best if we just unregister if we're going to loop. Except that the riak_core_ring_events:add_sup_callback/1 function is silly and uses make_ref() in a way that makes it impossible to delete the handler later. Hrm.
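
For reference, a sketch of what a deletable registration could look like, assuming the current add_sup_callback/1 wraps the fun in a {riak_core_ring_handler, make_ref()} handler id and throws the ref away (the handler-id details are an assumption, not quoted from the module):

add_sup_callback(Fn) ->
    Ref = make_ref(),
    ok = gen_event:add_sup_handler(riak_core_ring_events,
                                   {riak_core_ring_handler, Ref}, [Fn]),
    %% Returning the ref makes it possible to delete this exact handler later.
    Ref.

del_sup_callback(Ref) ->
    gen_event:delete_handler(riak_core_ring_events,
                             {riak_core_ring_handler, Ref}, []).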

Contributor

slfritchie commented Oct 1, 2013

@engelsanchez How about now?

Minor nitpick: Better to use the ?assertEqual macro for a non-totally-useless error message if something breaks in the future.

Thanks for making these more useful :)

Contributor

engelsanchez commented Oct 1, 2013

@slfritchie this is what is happening now: the death of the ring event handler causes the node watcher to immediately try to re-register, which most likely makes it die because the new event handler is not up yet. I suppose that "fixes" the situation :), but would it make sense to have a less tragic ending with fewer possible side effects? We could make the node watcher wait a bit for a new ring event handler to appear before retrying, and only die if that doesn't happen. Or we could let the two phoenixes come back from the ashes together and let the Erlang way do its thing.

Contributor

jrwest commented Oct 1, 2013

@slfritchie @engelsanchez haven't looked but if there is something subtle going on here it would be nice to leave a comment so others don't spend time figuring it out down the road :).

Contributor

engelsanchez commented Oct 1, 2013

@jrwest My comment above explains the current situation. I'll try to be clearer: with the changes as they stand now, killing the riak_core_ring_events process has the effect of killing the node watcher, because it now detects that death and tries to call riak_core_ring_events to re-register... but it's dead! So down goes the node watcher too, with a noproc error. That is something that could happen once in a while anyway: anybody calling one of our gen servers at the moment it dies, before it respawns, will die too. I'm saying that this PR as it stands makes the node watcher death happen very reliably (every time I tried), so perhaps making it wait a bit for riak_core_ring_events to respawn would be nicer. Is that better?

Contributor

jrwest commented Oct 1, 2013

@engelsanchez I just meant it would be nice to have that comment in the code, if a change is not made to address it. Sorry if I'm totally off context.

Contributor

slfritchie commented Oct 4, 2013

Engel, did you want to see something like this?

diff --git a/src/riak_core_node_watcher.erl b/src/riak_core_node_watcher.erl
index ab5dc12..fd5a6c7 100644
--- a/src/riak_core_node_watcher.erl
+++ b/src/riak_core_node_watcher.erl
@@ -372,7 +372,9 @@ update_avsn(State) ->
     State#state { avsn = State#state.avsn + 1 }.

 watch_for_ring_events() ->
-    RingEventsPid = riak_core_ring_events:get_pid(),
+    %% Polling isn't (and cannot be) perfect, good enough 99% of the time
+    %% Our supervisor takes care of the last 1%.
+    RingEventsPid = poll_for_riak_core_ring_events_pid(10),
     Self = self(),
     Fn = fun(R) ->
                  gen_server:cast(Self, {ring_update, R})
@@ -392,6 +394,17 @@ watch_for_ring_events() ->
             watch_for_ring_events()
     end.

+poll_for_riak_core_ring_events_pid(0) ->
+    undefined;
+poll_for_riak_core_ring_events_pid(N) ->
+    case riak_core_ring_events:get_pid() of
+        Pid when is_pid(Pid) ->
+            Pid;
+        _ ->
+            timer:sleep(100),
+            poll_for_riak_core_ring_events_pid(N-1)
+    end.
+
 delete_service_mref(Id) ->
     %% Cleanup the monitor if one exists
     case erlang:get(Id) of

Contributor

slfritchie commented Oct 7, 2013

Only X more weeks until the next review deadline. Comments?

Contributor

slfritchie commented Oct 11, 2013

Engel or Jordan, any opinions on the diff above?

If this cross-supervisee-link thing isn't present, then an event handler that riak_core_node_watcher requires is missing, which means that it's broken. Their common supervisor is one_for_one, so we can't rely on the supervisor to solve the problem.

HRM. OK, here's a crazy idea. What if I changed the fix drastically?

  1. Remove all the racy/fiddly crap I've submitted so far.
  2. Add a middle supervisor with a one_for_all restart strategy (see the sketch after the trees below).
  3. If either of these fiddly processes dies, the new sup kills the remainder, then restarts both in the required order.

Before:

              |-... many others ...
              |-riak_core_capability----
riak_core_sup-|-riak_core_node_watcher--
              |-riak_core_ring_events---
              |-riak_core_sysmon_minder-
              |-... many others ...

After:

              |-... many others ...
              |-riak_core_capability----------
              |
riak_core_sup-|-riak_core_new_ALL_FOR_ONE_sup-|-riak_core_node_watcher-
              |                               |-riak_core_ring_events--
              |-riak_core_sysmon_minder-------
              |-... many others ...
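
For illustration, a minimal sketch of such an intermediate supervisor (the module name, child specs, and restart intensities are assumptions, not part of this PR):

-module(riak_core_ring_watcher_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_all: if either child dies, the supervisor terminates the other
    %% and restarts both in the declared order (ring_events, then node_watcher).
    RingEvents = {riak_core_ring_events,
                  {riak_core_ring_events, start_link, []},
                  permanent, 2000, worker, [riak_core_ring_events]},
    NodeWatcher = {riak_core_node_watcher,
                   {riak_core_node_watcher, start_link, []},
                   permanent, 2000, worker, [riak_core_node_watcher]},
    {ok, {{one_for_all, 10, 10}, [RingEvents, NodeWatcher]}}.
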
Contributor

jrwest commented Oct 11, 2013

@slfritchie I like it. It's basically implementing the workaround and seems "more OTP". @engelsanchez?

Contributor

engelsanchez commented Oct 11, 2013

@slfritchie this is pretty much what I was about to propose, except that instead of a one_for_all intermediate supervisor, I was thinking of rest_for_one with riak_core_ring_events coming before the node watcher. The death of the node watcher will already be noted by ring_events; we just need the death of ring_events to bring down the node watcher and start them in that sequence again (ring_events, watcher). I would feel a lot better if we remove the hacks built while going down the rabbit hole and implement this.
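
For contrast, the only change in that variant relative to the supervisor sketch above would be the strategy (RingEvents and NodeWatcher being the same hypothetical child specs as before):

init([]) ->
    %% rest_for_one with ring_events listed first: a ring_events death also
    %% restarts the node watcher, but a node watcher death restarts only the
    %% node watcher.
    {ok, {{rest_for_one, 10, 10}, [RingEvents, NodeWatcher]}}.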

Contributor

slfritchie commented Oct 15, 2013

Alright ... so, I'm thinking it's good to keep the riak_core_ring_events.erl changes, because being able to cancel those callbacks is worthwhile, and the old way made it impossible to do that.

The problem is, ha ha, changing the order of startup items makes Riak unstable. Oi oi, do I even computer?

Contributor

slfritchie commented Oct 15, 2013

Background on last comment: the basic ./riak_test -c rtdev -t verify_build_cluster fails quite early:

18:07:31.768 [info] joining Node 2 to the cluster... It takes two to make a thing go right
18:07:31.792 [info] [join] 'dev2@127.0.0.1' to ('dev1@127.0.0.1'): ok
18:07:31.792 [info] Wait until all nodes are ready and there are no pending changes
18:07:31.792 [info] Wait until nodes are ready : ['dev1@127.0.0.1','dev2@127.0.0.1']
18:07:32.795 [info] Wait until all members ['dev1@127.0.0.1','dev2@127.0.0.1'] ['dev1@127.0.0.1','dev2@127.0.0.1']
18:07:32.796 [info] Wait until no pending changes on ['dev1@127.0.0.1','dev2@127.0.0.1']
[... timeout after 10 minutes...]
Contributor

slfritchie commented Oct 16, 2013

Engel, the startup order problem ... my last commit, 6e4f156, still doesn't get it right. If riak_core_ring_events dies & is restarted, things work well. But if riak_core_node_watcher dies & is restarted, things break. Namely, there is no worker process under the riak_core_eventhandler_sup supervisor.

If I use commit 01c0ada, then the riak_core_eventhandler_sup supervisor always has a worker process.

I stared at this for a long time today and didn't understand why the supervisor tree reorganization method doesn't work correctly. So, I recommend using the 01c0ada commit.

Contributor

slfritchie commented Dec 12, 2013

Ping?

Contributor

slfritchie commented Dec 24, 2013

Hrm, we're stuck in code freeze prep logjam. To be revisited soon.

Contributor

jrwest commented Mar 24, 2014

In the end this is an issue that has plagued several previous versions of Riak. Like similar ones, I'm marking this as 2.0.1 (which slips it out of the planned 2.0-RC).

@jrwest jrwest added this to the 2.0.1 milestone Mar 24, 2014
