health check system #240
Conversation
Tests fail, so not final.
This is done to test things like health check service recovery. Such a recovery would have both a down and an up, so the state progression and corresponding broadcast messages should be checked.
Doesn't play nicely with interleaved node up, service up, service down, and node down events yet.
service_pid :: pid(),
checking_pid :: pid(),
health_failures = 0 :: non_neg_integer(),
callback_failures = 0,
Should this be non_neg_integer() too, like health_failures?
@@ -26,7 +26,11 @@
%% API
-export([start_link/0,
         service_up/2,
         service_up/3,
It doesn't seem like service_up/3 is used by anyone else; is there another reason it should be exported?
It's just a more concise way to declare service_up/4 with default options. As far as I know, the new service_up variants are going to be used for more than the current callers, so I just tried to be nice and not require the options.
Maybe add some tests for service_up/3?
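A hedged eunit sketch of such a test; the argument shapes for service_up/3 (service id, pid, health-check {M, F, A}) are assumptions based on the surrounding discussion, since the diff only shows the export:

```erlang
-include_lib("eunit/include/eunit.hrl").

%% Hypothetical test: registers a service with a health-check callback
%% via service_up/3 and checks it shows up in the service list.
%% The {M, F, A} argument shape is an assumption, not taken from the diff.
service_up_3_test() ->
    {ok, _} = riak_core_node_watcher:start_link(),
    Pid = spawn(fun() -> receive stop -> ok end end),
    ok = riak_core_node_watcher:service_up(my_service, Pid,
                                           {?MODULE, my_health_check, []}),
    ?assert(lists:member(my_service, riak_core_node_watcher:services())),
    Pid ! stop.
```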
service_down(Id) ->
    gen_server:call(?MODULE, {service_down, Id}, infinity).

service_down(Id, true) ->
    gen_server:call(?MODULE, {service_down, Id, health_check}, infintiy);
Typo: infintiy => infinity
The following code sequence (or a similar enough sequence that it could be the same) is used in 4 different places:

%% Remove health check if any
case orddict:find(Id, State#state.health_checks) of
    error ->
        ok;
    {ok, Check} ->
        health_fsm(remove, Id, Check)
end,
Healths = orddict:erase(Id, S3#state.health_checks),

Much better to make a helper function for this.
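As a sketch of the suggested extraction (the name remove_health_check/2 is hypothetical; the record fields come from the quoted snippet):

```erlang
%% Hypothetical helper (name is an assumption, not from the PR).
%% Runs the FSM's remove transition for Id's health check, if any,
%% then erases the entry from the state, returning the new state.
remove_health_check(Id, State = #state{health_checks = Checks}) ->
    case orddict:find(Id, Checks) of
        error ->
            State;
        {ok, Check} ->
            health_fsm(remove, Id, Check),
            State#state{health_checks = orddict:erase(Id, Checks)}
    end.
```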
I like the use of an FSM to model the states of the health checkers. However, the code could be a lot cleaner. Specifically, the current state should be a lot more obvious and not just one of several record fields. As an example, you could make the FSM functions work similarly to gen_fsm callbacks:

%% Trigger health check FSM transition
health_fsm(Msg, Service, Check=#health_check{state=State}) ->
    {Reply, NextState, Check2} = health_fsm(State, Msg, Service, Check),
    Check3 = Check2#health_check{state=NextState},
    {ok, Check3}.

%% Health FSM

%% Suspended state.
health_fsm(suspended, resume, Service, InCheck) ->
    %% code
    {ok, waiting, OutCheck};
health_fsm(suspended, remove, Service, InCheck) ->
    {ok, suspended, InCheck};

%% Checking state.
health_fsm(checking, suspend, _Service, InCheck) ->
    %% code
    {ok, suspended, OutCheck};
health_fsm(checking, check_health, _Service, InCheck) ->
    {ok, checking, InCheck};
...
All of the 'EXIT' handling in health_fsm could likewise be made clearer by extracting the success/failure bookkeeping into a small helper:

health_fsm({'EXIT', Pid, Result}, Service, #health_check{checking_pid = Pid,
                                                         health_failures = N,
                                                         max_health_failures = M,
                                                         check_interval = Int} = InCheck) ->
    Tref = next_health_tref(N, Int, Service),
    {Reply, HealthFail, CallbackFail} = handle_failure_response(Result, N, M),
    OutCheck = InCheck#health_check{
        state = waiting,
        checking_pid = undefined,
        health_failures = HealthFail,
        callback_failures = CallbackFail,
        interval_tref = Tref
    },
    {Reply, OutCheck};

%% The check exited normally: the service is healthy.
handle_failure_response(normal, N, M) when N >= M ->
    {up, 0, 0};                    %% was marked down, report it back up
handle_failure_response(normal, N, M) when N < M ->
    {ok, 0, 0};                    %% still up, reset the failure count
%% The check reported a failure.
handle_failure_response(false, N, M) when N + 1 == M ->
    {down, N + 1, 0};              %% threshold reached, report it down
handle_failure_response(false, N, M) when N >= M ->
    {ok, N + 1, 0};                %% already down, keep counting
handle_failure_response(false, N, M) ->
    {ok, N + 1, 0}.                %% below threshold, keep counting
Similarly for all of the other cases.
A few remaining comments.
Also, the default is now defined with a macro rather than a magic number.
I've taken most of your comments and implemented them. The health_fsm functions should be clearer, and I've fixed the typos.
The further thoughts I had on 2+3: keeping the FSMs in the state of the node_watcher, with the node_watcher itself handling the spawns/deaths, is distasteful to me; it was implemented as such to fulfill the requirements with minimal structural change to the system.

I think a cleaner implementation would be a supervisor specifically for the health_check FSMs, with those implemented as gen_fsm processes. The health_check FSM would then send messages to the node_watcher if a service is down or not. The node_watcher service_up code would have the health_check_sup spawn_link a new FSM, meaning exits are either normal or a callback failure.

The above has a much better separation of concerns, and would likely have a cleaner implementation. I'm just not sure how well adding another supervisory process to the mix would do, or if the idea has legs.
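A minimal sketch of that proposed structure, assuming a simple_one_for_one supervisor that spawns one FSM per registered health check (all module and function names here are hypothetical, not code from this PR):

```erlang
%% Hypothetical supervisor for health-check FSMs (not part of this PR).
-module(health_check_sup).
-behaviour(supervisor).
-export([start_link/0, start_check/2, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Called from the node_watcher's service_up path; each check gets
%% its own linked FSM process, so exits are either normal or a
%% callback failure.
start_check(Service, CheckSpec) ->
    supervisor:start_child(?MODULE, [Service, CheckSpec]).

init([]) ->
    %% simple_one_for_one: one transient child per registered check.
    {ok, {{simple_one_for_one, 5, 10},
          [{health_check_fsm, {health_check_fsm, start_link, []},
            transient, 5000, worker, [health_check_fsm]}]}}.
```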
IMHO using the pdict is more complicated than just normal functions.
A friendly note to all participants ... today is code freeze day.
Change incorrect use of #health_check.interval_tref to correct use of #health_check.check_interval. Fix badmatch error by changing handle_fsm_exit to return a 2-tuple as expected at the call-site rather than a 3-tuple. Change determine_time to return an integer, as Erlang requires for timeout values. Previously, the function could return a float and trigger a badarg.
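For context on the determine_time change: Erlang APIs that take a timeout (receive ... after, erlang:send_after/3, erlang:start_timer/3) require a non-negative integer and raise badarg on a float, so any interval arithmetic needs an explicit trunc or round. A minimal illustration (the backoff formula below is an assumption, not the actual determine_time code):

```erlang
%% Illustrative only: the real determine_time arguments are not shown in
%% this thread. The point is that math:pow/2 always returns a float, so
%% the result must be truncated before use as a timeout value.
next_interval(BaseInterval, Failures) ->
    trunc(BaseInterval * math:pow(2, Failures)).
```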
Fix typo DEFUATL_HEALTH_CHECK_INTERVAL to DEFAULT_HEALTH_CHECK_INTERVAL. Wrap extraordinarily long lines in riak_core_node_watcher to enable viewing code / commits on GitHub without horizontal scrolling. Typically, we aim to limit lines to 79 characters for optimal terminal viewing as well, but we've all been guilty of longer lines here and there, so I only modified lines that were easily changed and/or too long for GitHub. Changed a few comment-only lines from '%' to '%%' to match Riak convention, and to make Emacs erlang-mode auto-indent happy.
Extend riak_core_node_watcher to support the registration of health check logic that monitors a given service, automatically marking the service as down when unhealthy and back-up when healthy.
+1. Going to merge this as part of the larger health check work. Related pull-requests and reviews by @jrwest at #257 and basho/riak_kv#447
While the PR was against the 1.2 branch, this code is actually merging into master for Riak 1.3. Merged in commit 08f1f29. There are two remaining items to address in a future pull-request:
Both should be addressed before the Riak 1.3 code freeze if at all possible.
@jtuple This is closed (and still in the "review" column); are your two remaining items unaddressed?
issue #388