
health check system #240

Closed
wants to merge 21 commits into from

Conversation

lordnull
Contributor

issue #388

    service_pid :: pid(),
    checking_pid :: pid(),
    health_failures = 0 :: non_neg_integer(),
    callback_failures = 0,


Should this be non_neg_integer() too, like health_failures?
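
For reference, the spec being suggested would make the field in the record excerpt above read:

    callback_failures = 0 :: non_neg_integer(),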

@@ -26,7 +26,11 @@
%% API
-export([start_link/0,
         service_up/2,
         service_up/3,


It doesn't seem like service_up/3 is used by anyone else; is there another reason it should be exported?

Contributor Author


It's just a more concise way to call service_up/4 with default options. As far as I know, the new service_up functions are going to be used for more than what they are currently, so I just tried to be nice and not require the options.
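
A minimal sketch of the delegation described here (argument names and the message sent to the gen_server are illustrative, not taken verbatim from the PR):

%% service_up/3 delegates to service_up/4 with default (empty) options.
service_up(Id, Pid, HealthMFA) ->
    service_up(Id, Pid, HealthMFA, []).

%% The 4-arity version is the one that talks to the node watcher.
service_up(Id, Pid, HealthMFA, Options) ->
    gen_server:call(?MODULE, {service_up, Id, Pid, HealthMFA, Options},
                    infinity).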


Maybe add some tests for service_up/3?

@ghost ghost assigned jtuple Nov 7, 2012
service_down(Id) ->
    gen_server:call(?MODULE, {service_down, Id}, infinity).

service_down(Id, true) ->
    gen_server:call(?MODULE, {service_down, Id, health_check}, infintiy);
Contributor


Typo: infintiy => infinity

@jtuple
Contributor

jtuple commented Nov 7, 2012

The following code sequence (or one similar enough that it could be the same) is used in 4 different places:

    %% Remove health check if any
    case orddict:find(Id, State#state.health_checks) of
        error ->
            ok;
        {ok, Check} ->
            health_fsm(remove, Id, Check)
    end,
    Healths = orddict:erase(Id, S3#state.health_checks),

Much better to make a remove_health_check(Id, State) -> NewState function and reuse.
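
A minimal sketch of the suggested helper, assembled from the quoted snippet (the record fields and health_fsm/3 come from the code above):

%% Remove the health check for Id (if any) and return the updated state.
remove_health_check(Id, State = #state{health_checks = Checks}) ->
    case orddict:find(Id, Checks) of
        error ->
            ok;
        {ok, Check} ->
            health_fsm(remove, Id, Check)
    end,
    State#state{health_checks = orddict:erase(Id, Checks)}.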

@jtuple
Contributor

jtuple commented Nov 7, 2012

I like the use of an FSM to model the states of the health checkers. However, the code could be a lot cleaner. Specifically, the current state should be a lot more obvious and not just one of several #health_check{} pattern matches. Preferably, make it the first function parameter, if possible, to make things easy to scan/read. Doing so also makes Dialyzer work a lot better at checking state transitions. Similarly, making the "next state" more obvious would also help -- rather than it again just being a value in OutCheck.

As an example, the FSM functions could work similarly to gen_fsm and return a 3-tuple {Reply, NextState, OutCheck}, with a simple utility function to extract/update the state value in #health_check{}:

%% Trigger health check FSM transition
health_fsm(Msg, Service, Check=#health_check{state=State}) ->
    {Reply, NextState, Check2} = health_fsm(State, Msg, Service, Check),
    Check3 = Check2#health_check{state=NextState},
    {ok, Check3}.

%% Health FSM
%% Suspended state.
health_fsm(suspended, resume, Service, InCheck) ->
    %% code
    {ok, waiting, OutCheck};
health_fsm(suspended, remove, Service, InCheck) ->
    {ok, suspended, InCheck};

%% Checking state.
health_fsm(checking, suspend, _Service, InCheck) ->
    %% code
    {ok, suspended, OutCheck};
health_fsm(checking, check_health, _Service, InCheck) ->
    {ok, checking, InCheck};

...

@jtuple
Contributor

jtuple commented Nov 8, 2012

All of the health_fsm({'EXIT', Pid, ...}, Service, #health_check{...} = InCheck) -> function clauses for the different cases of N, M, and exit status (normal or false) are hard to read / reason about, largely due to massive code duplication. The first 5 function clauses can easily be rewritten to make the logic clear. Not sure if combining with the 6th clause is possible, but that doesn't matter much as it's the first 5 that are hard to review. Here's an example of what I mean by re-writing. I think this would be correct, but it's just an example and should be double-checked / serve as inspiration:

health_fsm({'EXIT', Pid, Result}, Service, #health_check{checking_pid = Pid,
                                                         health_failures = N,
                                                         max_health_failures = M,
                                                         check_interval=Int} = InCheck) ->
    Tref = next_health_tref(N, Int, Service),
    {Reply, HealthFail, CallbackFail} = handle_failure_response(Result, N, M),
    OutCheck = InCheck#health_check{
        state = waiting,
        checking_pid = undefined,
        health_failures = HealthFail,
        callback_failures = CallbackFail,
        interval_tref = Tref
    },
    {Reply, OutCheck};

handle_failure_response(normal, N, M) when N >= M ->
    {up, 0, 0};
handle_failure_response(normal, N, M) when N < M ->
    {ok, N + 1, 0};

handle_failure_response(false, N, M) when N + 1 == M ->
    {down, N + 1, 0};
handle_failure_response(false, N, M) when N >= M ->
    {ok, N + 1, 0};
handle_failure_response(false, N, M) ->
    {ok, N + 1, 0}.

@jtuple
Contributor

jtuple commented Nov 8, 2012

Similarly, none of the health_fsm({'EXIT', Pid, ...}, Service, #health_check{...} = InCheck) -> function clauses do any pattern matching on state. Is it desired that these are handled the same way in every health check state, or should this only be handled when state == checking? Part of the reason for making the state an explicit parameter, as suggested above, is to make these things more clear.

@jtuple
Contributor

jtuple commented Nov 8, 2012

A few remaining comments.

  1. Any thoughts on the resolution for health checks being milliseconds rather than seconds? It seems better to allow more fine-grained resolution in case we ever need it. But, who knows. It really depends on the types of checks we end up building.
  2. I'm not a huge fan of using spawn_link + trap_exit + exit(Msg) for message passing between the checker pid and the node watcher. The node watcher is a gen_server, and a riak_kv_node_watcher:health_check_result(true/false) that maps to a gen_server:cast seems more idiomatic (see the sketch after this list); obviously we'd still trap the exits to catch when things fail, but that's a special case. In any case, for the purpose of this PR, I'm not going to -1 on this issue. But it is dirty and something we really want to move away from doing. We already take this approach in riak_core_handoff_manager, but future re-write plans include removing message passing via exits.
  3. I'm also not a huge fan of using the process dictionary to map Pid => ServiceId. Why are we doing this? An extra orddict in #state{} seems like the right approach. Yes, I know the node watcher already abuses the process dictionary in places like delete_service_mrefs, but that's no reason to repeat that mistake. Is there something I'm missing that makes the pdict necessary here?
  4. It would be great to write a riak_test that tests health checks and their interaction with the existing net_kernel health check. The desired semantics here are that if a health check fails on a node and that node marks a service as down, all nodes in the cluster eventually see that service as down for that node. The code here should do that, but it's much better to have a test for it. Likewise, it would be interesting to see what happens during net-splits, which trigger existing service-down events on remote nodes. Willing to talk about writing a riak_test for this stuff, and even open to scheduling pair programming to work on a test together if desired.
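
A minimal sketch of the cast-based signaling described in item 2 (the message shape and the apply_check_result/3 helper are assumptions, not code from this PR):

%% Public API used by the checker pid to report a result to the node watcher.
health_check_result(CheckerPid, Result) when is_boolean(Result) ->
    gen_server:cast(?MODULE, {health_check_result, CheckerPid, Result}).

%% In the node watcher's gen_server callbacks; apply_check_result/3 is a
%% hypothetical helper that updates the matching #health_check{}.
handle_cast({health_check_result, Pid, Result}, State) ->
    {noreply, apply_check_result(Pid, Result, State)}.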

@lordnull
Contributor Author

lordnull commented Nov 8, 2012

I've taken most of your comments and implemented them. The health_fsm clauses should be clearer, and I've fixed the typos.

  1. Implemented.

  2. I wanted to avoid sending 2 messages; because the node watcher already traps exits, sending a success/fail message and then exiting means 2 messages that serve the same purpose. While a raw spawn_link is dirty, I'm not certain what makes using the exit message an issue. Though I have more thoughts on this that tie in with 3.

  3. I used the process dictionary to avoid threading the state of the node_watcher through the fsm, or making the return value of the fsms more complex. In the short term, it was simpler to implement; I agree it's not worth keeping in the long term. As I said in 2, I have more thoughts on this.

  4. With a bit of guidance on writing the test, I'll go forth and take a stab at it.

The further thoughts I had on 2+3:

Keeping the fsms in the state of the node_watcher, with the node watcher itself handling the spawns/deaths, is distasteful to me; it was implemented that way to fulfill the requirements with minimal structural change to the system. I think a cleaner implementation would be a supervisor specifically for health_check fsms, with those implemented as gen_fsm. The health_check fsm would then send messages to the node_watcher if a service is down or not. The node_watcher service_up code would have the health_check_sup spawn_link a new fsm, meaning exits are either normal or a callback failure. The above has a much better separation of concerns, and would likely have a cleaner implementation. I'm just not sure how well adding another supervisory process to the mix would do, or if the idea has legs.
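
A minimal sketch of the supervisor being proposed here (the module and child names are hypothetical, not part of this PR):

-module(riak_core_health_check_sup).
-behaviour(supervisor).
-export([start_link/0, start_check/2, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Called from the node watcher's service_up path to start one health-check
%% fsm per service; the fsm module name is also hypothetical.
start_check(Service, Options) ->
    supervisor:start_child(?MODULE, [Service, Options]).

init([]) ->
    %% simple_one_for_one: one child spec, children started on demand.
    {ok, {{simple_one_for_one, 10, 10},
          [{health_fsm, {riak_core_health_fsm, start_link, []},
            temporary, 5000, worker, [riak_core_health_fsm]}]}}.

With a simple_one_for_one supervisor, the node watcher only needs the child pid; any exits it traps would then come either from normal completion or from a genuine callback failure, as described above.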

@reiddraper
Contributor

  3. I used the process dictionary to avoid threading the state of the node_watcher through the fsm, or making the return value of the fsms more complex. In the short term, it was simpler to implement; I agree it's not worth keeping in the long term. As I said in 2, I have more thoughts on this.

IMHO using the pdict is more complicated than just normal functions.

@slfritchie
Contributor

A friendly note to all participants ... today is code freeze day.

Change incorrect use of #health_check.interval_tref to correct
use of #health_check.check_interval.

Fix badmatch error by changing handle_fsm_exit to return a 2-tuple as
expected at the call-site rather than a 3-tuple.

Changed determine_time to return an integer, as Erlang requires for a
timeout value. Previously, the function could return a float and
trigger a badarg.
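
For context on the badarg mentioned here, a minimal illustration (not code from the PR) of why the computed interval must be an integer:

%% erlang:send_after/3 only accepts integer timeouts; a float raises badarg.
Interval = 1500 * 1.5,                                %% 2250.0 -- a float
%% erlang:send_after(Interval, self(), check_health)  %% would raise badarg
erlang:send_after(round(Interval), self(), check_health).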
Fix typo DEFUATL_HEALTH_CHECK_INTERVAL to DEFAULT_HEALTH_CHECK_INTERVAL.

Wrap extraordinarily long lines in riak_core_node_watcher to enable
viewing code / commits on Github without horizontal scrolling. Typically,
we aim to limit lines to 79 characters for optimal terminal viewing as
well, but we've all been guilty of longer lines here and there so I only
modified lines that were easily changed and/or too long for Github.

Changed a few comment-only lines from '%' to '%%' to match Riak convention,
and make Emacs erlang-mode auto-indent happy.
@ghost ghost assigned jrwest and jtuple Dec 7, 2012
jtuple added a commit that referenced this pull request Dec 15, 2012
Extend riak_core_node_watcher to support the registration of health
check logic that monitors a given service, automatically marking the
service as down when unhealthy and back up when healthy.
@jtuple
Contributor

jtuple commented Dec 15, 2012

+1. Going to merge this as part of the larger health check work. Related pull-requests and reviews by @jrwest are at #257 and basho/riak_kv#447.

@jtuple
Contributor

jtuple commented Dec 15, 2012

While this PR was against the 1.2 branch, the code is actually merging into master for Riak 1.3. Merged in commit: 08f1f29

There are two remaining items to address in a future pull-request:

  1. We need to provide the ability to disable the health check functionality, as an escape hatch in production if an issue turns up (see the sketch below).
  2. We need to get rid of the spurious "crash messages" that are printed whenever a health check returns false. These messages occur because exits are used for signaling between checker pids and the node watcher, and Erlang prints errors for non-normal exits. Example message:
14:58:35.514 [error] CRASH REPORT Process <0.2509.0> with 0 neighbours exited with reason: false in riak_core_node_watcher:'-start_health_check/2-fun-0-'/4 line 723

Both should be addressed before the Riak 1.3 code freeze if at all possible.
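
A hypothetical sketch of the escape hatch described in item 1 (the flag name and this wrapper are assumptions, not code from this PR):

%% Hypothetical flag check; app_helper:get_env/3 reads riak_core's
%% application environment with a default value.
health_checks_enabled() ->
    app_helper:get_env(riak_core, enable_health_checks, true) =:= true.

The node watcher could consult such a flag before spawning a checker pid, giving operators a runtime switch for the feature.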

@jtuple jtuple closed this Dec 15, 2012
@seancribbs
Contributor

@jtuple This is closed (and still in the "review" column); are your two remaining items still unaddressed?

@seancribbs seancribbs deleted the issue_388 branch April 1, 2015 23:00