
health check system #240

Closed
wants to merge 21 commits into from

Conversation

lordnull
Contributor

issue #388

    service_pid :: pid(),
    checking_pid :: pid(),
    health_failures = 0 :: non_neg_integer(),
    callback_failures = 0,


Should this be non_neg_integer() too, like health_failures?
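
For reference, the spec being suggested would make the field in the record excerpt above read:

    callback_failures = 0 :: non_neg_integer(),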

@@ -26,7 +26,11 @@
%% API
-export([start_link/0,
         service_up/2,
         service_up/3,


It doesn't seem like service_up/3 is used by anyone else; is there another reason it should be exported?

Contributor Author


It's just a more concise way to call service_up/4 with default options. As far as I know, the new service_up functions are going to be used for more than what they are currently, so I just tried to be nice and not require the options.
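
A minimal sketch of the delegation described here (argument names and the message sent to the gen_server are illustrative, not taken verbatim from the PR):

%% service_up/3 delegates to service_up/4 with default (empty) options.
service_up(Id, Pid, HealthMFA) ->
    service_up(Id, Pid, HealthMFA, []).

%% The 4-arity version is the one that talks to the node watcher.
service_up(Id, Pid, HealthMFA, Options) ->
    gen_server:call(?MODULE, {service_up, Id, Pid, HealthMFA, Options},
                    infinity).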


Maybe add some tests for service_up/3?

@ghost ghost assigned jtuple Nov 7, 2012
service_down(Id) ->
    gen_server:call(?MODULE, {service_down, Id}, infinity).

service_down(Id, true) ->
    gen_server:call(?MODULE, {service_down, Id, health_check}, infintiy);
Contributor


Typo: infintiy => infinity

@jtuple
Contributor

jtuple commented Nov 7, 2012

The following code sequence (or one similar enough that it could be the same) is used in 4 different places:

    %% Remove health check if any
    case orddict:find(Id, State#state.health_checks) of
        error ->
            ok;
        {ok, Check} ->
            health_fsm(remove, Id, Check)
    end,
    Healths = orddict:erase(Id, S3#state.health_checks),

Much better to make a remove_health_check(Id, State) -> NewState function and reuse.
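
A minimal sketch of the suggested helper, assembled from the quoted snippet (the record fields and health_fsm/3 come from the code above):

%% Remove the health check for Id (if any) and return the updated state.
remove_health_check(Id, State = #state{health_checks = Checks}) ->
    case orddict:find(Id, Checks) of
        error ->
            ok;
        {ok, Check} ->
            health_fsm(remove, Id, Check)
    end,
    State#state{health_checks = orddict:erase(Id, Checks)}.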

@jtuple
Contributor

jtuple commented Nov 7, 2012

I like the use of an FSM to model the states of the health checkers. However, the code could be a lot cleaner. Specifically, the current state should be a lot more obvious and not just one of several #health_check{} pattern matches. Preferably, make it the first function parameter, if possible, to make things easy to scan/read. Doing so also makes Dialyzer work a lot better at checking state transitions. Similarly, making the "next state" more obvious would also help -- rather than it again just being a value in OutCheck.

As an example, the FSM functions could work similarly to gen_fsm and return a 3-tuple {Reply, NextState, OutCheck}, with a simple utility function to extract/update the state value in #health_check{}:

%% Trigger health check FSM transition
health_fsm(Msg, Service, Check=#health_check{state=State}) ->
    {Reply, NextState, Check2} = health_fsm(State, Msg, Service, Check),
    Check3 = Check2#health_check{state=NextState},
    {ok, Check3}.

%% Health FSM
%% Suspended state.
health_fsm(suspended, resume, Service, InCheck) ->
    %% code
    {ok, waiting, OutCheck};
health_fsm(suspended, remove, Service, InCheck) ->
    {ok, suspended, InCheck};

%% Checking state.
health_fsm(checking, suspend, _Service, InCheck) ->
    %% code
    {ok, suspended, OutCheck};
health_fsm(checking, check_health, _Service, InCheck) ->
    {ok, checking, InCheck};

...

@jtuple
Contributor

jtuple commented Nov 8, 2012

All of the health_fsm({'EXIT', Pid, ...}, Service, #health_check{...} = InCheck) -> function clauses for the different cases of N, M, and exit status (normal or false) are hard to read / reason about, largely due to massive code duplication. The first 5 function clauses can easily be rewritten to make the logic clear. Not sure if combining with the 6th clause is possible, but that doesn't matter much as it's the first 5 that are hard to review. Here's an example of what I mean by re-writing. I think this would be correct, but it's just an example and should be double-checked / serve as inspiration:

health_fsm({'EXIT', Pid, Result}, Service, #health_check{checking_pid = Pid,
                                                         health_failures = N,
                                                         max_health_failures = M,
                                                         check_interval=Int} = InCheck) ->
    Tref = next_health_tref(N, Int, Service),
    {Reply, HealthFail, CallbackFail} = handle_failure_response(Result, N, M),
    OutCheck = InCheck#health_check{
        state = waiting,
        checking_pid = undefined,
        health_failures = HealthFail,
        callback_failures = CallbackFail,
        interval_tref = Tref
    },
    {Reply, OutCheck};

handle_failure_response(normal, N, M) when N >= M ->
    {up, 0, 0};
handle_failure_response(normal, N, M) when N < M ->
    {ok, N + 1, 0};

handle_failure_response(false, N, M) when N + 1 == M ->
    {down, N + 1, 0};
handle_failure_response(false, N, M) when N >= M ->
    {ok, N + 1, 0};
handle_failure_response(false, N, M) ->
    {ok, N + 1, 0}.

@jtuple
Contributor

jtuple commented Nov 8, 2012

Similarly, none of the health_fsm({'EXIT', Pid, ...}, Service, #health_check{...} = InCheck) -> function clauses do any pattern matching on state. Is it desired that these are handled the same way in every health check state, or should this only be handled when state == checking? Part of the reason for making the state an explicit parameter, as suggested above, is to make these things more clear.

@jtuple
Contributor

jtuple commented Nov 8, 2012

A few remaining comments.

  1. Any thoughts on the resolution for health checks being milliseconds rather than seconds? It seems better to allow more fine-grained resolution in case we ever need it. But, who knows. It really depends on the types of checks we end up building.
  2. I'm not a huge fan of using spawn_link + trap_exit + exit(Msg) for message passing between the checker pid and the node watcher. The node watcher is a gen_server, and a riak_kv_node_watcher:health_check_result(true/false) that maps to a gen_server:cast seems more idiomatic (see the sketch after this list); obviously we'd still trap the exits to catch when things fail, but that's a special case. In any case, for the purpose of this PR, I'm not going to -1 on this issue. But it is dirty and something we really want to move away from doing. We already take this approach in riak_core_handoff_manager, but future re-write plans include removing message passing via exits.
  3. I'm also not a huge fan of using the process dictionary to map Pid => ServiceId. Why are we doing this? An extra orddict in #state{} seems like the right approach. Yes, I know the node watcher already abuses the process dictionary in places like delete_service_mrefs, but that's no reason to repeat that mistake. Is there something I'm missing that makes the pdict necessary here?
  4. It would be great to write a riak_test that tests health checks and their interaction with the existing net_kernel health check. The desired semantics here are that if a health check fails on a node and that node marks a service as down, all nodes in the cluster eventually see that service as down for that node. The code here should do that, but it's much better to have a test for it. Likewise, it would be interesting to see what happens during net-splits, which trigger existing service-down events on remote nodes. Willing to talk about writing a riak_test for this stuff, and even open to scheduling pair programming to work on a test together if desired.
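
A minimal sketch of the cast-based signaling described in item 2 (the message shape and the apply_check_result/3 helper are assumptions, not code from this PR):

%% Public API used by the checker pid to report a result to the node watcher.
health_check_result(CheckerPid, Result) when is_boolean(Result) ->
    gen_server:cast(?MODULE, {health_check_result, CheckerPid, Result}).

%% In the node watcher's gen_server callbacks; apply_check_result/3 is a
%% hypothetical helper that updates the matching #health_check{}.
handle_cast({health_check_result, Pid, Result}, State) ->
    {noreply, apply_check_result(Pid, Result, State)}.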

@lordnull
Contributor Author

lordnull commented Nov 8, 2012

I've taken most of your comments and implemented them. The health_fsm clauses should be clearer, and I've fixed the typos.

  1. Implemented.

  2. I wanted to avoid sending 2 messages; because the node watcher already traps exits, sending a success/fail message and then exiting means 2 messages that serve the same purpose. While a raw spawn_link is dirty, I'm not certain what makes using the exit message an issue. Though I have more thoughts on this that tie in with 3.

  3. I used the process dictionary to avoid threading the state of the node_watcher through the fsm, or making the return value of the fsms more complex. In the short term, it was simpler to implement; I agree it's not worth keeping in the long term. As I said in 2, I have more thoughts on this.

  4. With a bit of guidance on writing the test, I'll go forth and take a stab at it.

The further thoughts I had on 2+3:

Keeping the fsms in the state of the node_watcher, with the node watcher itself handling the spawns/deaths, is distasteful to me; it was implemented that way to fulfill the requirements with minimal structural change to the system. I think a cleaner implementation would be a supervisor specifically for health_check fsms, with those implemented as gen_fsm. The health_check fsm would then send messages to the node_watcher if a service is down or not. The node_watcher service_up code would have the health_check_sup spawn_link a new fsm, meaning exits are either normal or a callback failure. The above has a much better separation of concerns, and would likely have a cleaner implementation. I'm just not sure how well adding another supervisory process to the mix would do, or if the idea has legs.
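
A minimal sketch of the supervisor being proposed here (the module and child names are hypothetical, not part of this PR):

-module(riak_core_health_check_sup).
-behaviour(supervisor).
-export([start_link/0, start_check/2, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Called from the node watcher's service_up path to start one health-check
%% fsm per service; the fsm module name is also hypothetical.
start_check(Service, Options) ->
    supervisor:start_child(?MODULE, [Service, Options]).

init([]) ->
    %% simple_one_for_one: one child spec, children started on demand.
    {ok, {{simple_one_for_one, 10, 10},
          [{health_fsm, {riak_core_health_fsm, start_link, []},
            temporary, 5000, worker, [riak_core_health_fsm]}]}}.

With a simple_one_for_one supervisor, the node watcher only needs the child pid; any exits it traps would then come either from normal completion or from a genuine callback failure, as described above.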

@reiddraper
Contributor

  3. I used the process dictionary to avoid threading the state of the node_watcher through the fsm, or making the return value of the fsms more complex. In the short term, it was simpler to implement; I agree it's not worth keeping in the long term. As I said in 2, I have more thoughts on this.

IMHO using the pdict is more complicated than just normal functions.

@slfritchie
Contributor

A friendly note to all participants ... today is code freeze day.

Change incorrect use of #health_check.interval_tref to correct
use of #health_check.check_interval.

Fix badmatch error by changing handle_fsm_exit to return a 2-tuple as
expected at the call-site rather than a 3-tuple.

Changed determine_time to return an integer, as Erlang requires for a
timeout value. Previously, the function could return a float and
trigger a badarg.
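
For context on the badarg mentioned here, a minimal illustration (not code from the PR) of why the computed interval must be an integer:

%% erlang:send_after/3 only accepts integer timeouts; a float raises badarg.
Interval = 1500 * 1.5,                                %% 2250.0 -- a float
%% erlang:send_after(Interval, self(), check_health)  %% would raise badarg
erlang:send_after(round(Interval), self(), check_health).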
Fix typo DEFUATL_HEALTH_CHECK_INTERVAL to DEFAULT_HEALTH_CHECK_INTERVAL.

Wrap extraordinarily long lines in riak_core_node_watcher to enable
viewing code / commits on Github without horizontal scrolling. Typically,
we aim to limit lines to 79 characters for optimal terminal viewing as
well, but we've all been guilty of longer lines here and there so I only
modified lines that were easily changed and/or too long for Github.

Changed a few comment-only lines from '%' to '%%' to match Riak convention,
and make Emacs erlang-mode auto-indent happy.
@ghost ghost assigned jrwest and jtuple Dec 7, 2012
jtuple added a commit that referenced this pull request Dec 15, 2012
Extend riak_core_node_watcher to support the registration of health
check logic that monitors a given service, automatically marking the
service as down when unhealthy and back up when healthy.
@jtuple
Contributor

jtuple commented Dec 15, 2012

+1. Going to merge this as part of the larger health check work. Related pull-requests and reviews by @jrwest are at #257 and basho/riak_kv#447.

@jtuple
Contributor

jtuple commented Dec 15, 2012

While this PR was against the 1.2 branch, the code is actually merging into master for Riak 1.3. Merged in commit: 08f1f29

There are two remaining items to address in a future pull-request:

  1. We need to provide the ability to disable the health check functionality, as an escape hatch in production if an issue turns up (see the sketch below).
  2. We need to get rid of the spurious "crash messages" that are printed whenever a health check returns false. These messages occur because exits are used for signaling between checker pids and the node watcher, and Erlang prints errors for non-normal exits. Example message:
14:58:35.514 [error] CRASH REPORT Process <0.2509.0> with 0 neighbours exited with reason: false in riak_core_node_watcher:'-start_health_check/2-fun-0-'/4 line 723

Both should be addressed before the Riak 1.3 code freeze if at all possible.
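
A hypothetical sketch of the escape hatch described in item 1 (the flag name and this wrapper are assumptions, not code from this PR):

%% Hypothetical flag check; app_helper:get_env/3 reads riak_core's
%% application environment with a default value.
health_checks_enabled() ->
    app_helper:get_env(riak_core, enable_health_checks, true) =:= true.

The node watcher could consult such a flag before spawning a checker pid, giving operators a runtime switch for the feature.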

@jtuple jtuple closed this Dec 15, 2012
@seancribbs
Contributor

@jtuple This is closed (and still in the "review" column); are your two remaining items still unaddressed?

@seancribbs seancribbs deleted the issue_388 branch April 1, 2015 23:00