PULSE test & fix riak_pipe_fitting #73

Merged
merged 4 commits into from May 1, 2013

Conversation

Projects
None yet
2 participants
Contributor

beerriot commented Apr 23, 2013

In an attempt to locate the source of noproc messages in issues 48 and 49, I wrote the attached tests to exercise part riak_pipe_fitting under PULSE. The specific part I was targeting was the 'eoi' behavior encountered when no workers are active. To facilitate Riak KV's mapreduce need to evaluate a reduce phase even if no inputs are received, riak_pipe_fitting will evaluate the worker behavior in-process in this case.

The PULSE test found that in addition to a noproc exit when sending synchronous messages to a sink that does not exists, normal (or any other reason) exits may happen if the sink exits after the synchronous message has been delivered. The riak_pipe_sink:send_to_sink_fsm/5 function has been changed to catch these other exit reasons as well.

To see the tests fail, checkout the first (and second, if you wish) commits in this PR, and run ./rebar get-deps && make clean pulse They should fail with a reason like (partial Eunit output):

    {function_clause,
        [{reduce_fitting_pulse,exit_reason_is_normalish,
             [{normal,
                  {pulse_gen_fsm,sync_send_event,
                      [<0.700.0>,{p">>...

And may also include:

Exit Reasons: [{client,after_eoi},
               {sink,normal},
               {fitting,
                   {normal,
                       {pulse_gen_fsm,sync_send_event,
                           [<0.675.0>,
                            {pipe_eoi,#Ref<0.0.0.11121>},
                            infinity]}}},
               {builder,normal}]

And:

::error:{assertion_failed,[{module,reduce_fitting_pulse},
                         {line,299},
                         {expression,"eqc : quickcheck ( eqc : numtests ( 5000 , prop_fitting_dies_normal ( ) ) )"},
                         {expected,true},
                         {value,false}]}
  in function reduce_fitting_pulse:'-death_test_/0-fun-2-'/0 [test/reduce_fitting_pulse.erl:299]

If they instead fail for a reason like timeout, you may be missing the pulse_otp package, or other PULSE components.

To see the tests pass, aadd third (and fourth, if you wish) commit to your checkout, and make pulse again.

Bryan Fink added some commits Apr 18, 2013

Bryan Fink
first fitting pulse test to produce an error
If a fitting has no active workers when eoi arrives, it may evaluate
the worker behavior in-process. This might cause the fitting to
exit non-normal for any number of reasons, but PULSE found specifically
that the sink exiting while a synchronous message is being sent to it
(fsm-type sink, on the sync period), the gen_fsm call will exit
abnormally, but not necessarily in any of the ways already caught.
The specific exit seen here was {normal, {gen...}}.
Bryan Fink
add 'raw' sink time to fitting pulse test
Checking to see whether this can reproduce issue 48 or 49. It didn't,
but the coverage may be useful anyway.
Bryan Fink
catch sink fsm exits for all reasons
The reduce_fitting_pulse test showed that other exit reasons might
be seen here. For example, 'normal' can happen if the sink shuts
down after the sync event has been delivered.

This commit broadens the catch to prevent those other reasons from
killing the current process, and also narrows it to make sure exit
is bubbling from the sync_send_event, and nowhere else.
Bryan Fink
doc & clean reduce_fitting_pulse
Cleanups:
 - compile non-pulse parts even when PULSE is not defined,
   to allow sanity checking with PULSE disabled
 - use eqc:testing_time instead of eqc:numtests to better
   match the eunit timeout expectation

@ghost ghost assigned beerriot Apr 23, 2013

Contributor

beerriot commented Apr 25, 2013

Just a note: this test caught this problem killing the fitting process. However, it is also possible for the same issue to kill the vnode process. Any point in riak_pipe_vnode that says ?T_ERR or ?T(..., [error], ...) might try to send a synchronous message to the sink during a Riak KV MapReduce query. The fix included here fixes the problem for vnodes as well.

@@ -126,7 +134,14 @@ send_to_sink_fsm(Pid, Msg, Timeout, true, _Count) ->
put(sink_sync, 0),
ok
catch
- exit:{timeout,_} -> {error, timeout};
- exit:{noproc,_} -> {error, sink_died}
@coderoshi

coderoshi May 1, 2013

Contributor

I see that there are other reasons for exit, but noproc is no longer possible?

@beerriot

beerriot May 1, 2013

Contributor

Indeed, noproc is still possible. The full spec of the exit that was being caught before was {noproc, {gen_fsm, sync_send_event, Args}}. The new code still catches that noproc as well as other exit reasons now.

@@ -2,6 +2,7 @@ REPO ?= riak_pipe
RIAK_TAG = $(shell git describe --tags)
REVISION ?= $(shell echo $(RIAK_TAG) | sed -e 's/^$(REPO)-//')
PKG_VERSION ?= $(shell echo $(REVISION) | tr - .)
+PULSE_TESTS = reduce_fitting_pulse
@coderoshi

coderoshi May 1, 2013

Contributor

This isn't the sexiest bash, but you could automatically add any test ending with *_pulse.erl:

PULSE_TESTS = $(shell cd test && ls -1 *_pulse.erl|awk 'gsub(".erl","")')

@beerriot

beerriot May 1, 2013

Contributor

I'm going to push this off to the first PR to add a second pulse test. The same fix will need to be made in basho/riak_core's Makefile.

Contributor

coderoshi commented May 1, 2013

Pulsing as advertised +1

beerriot pushed a commit that referenced this pull request May 1, 2013

@beerriot beerriot merged commit 194842a into master May 1, 2013

1 check passed

default The Travis build passed
Details

beerriot pushed a commit that referenced this pull request May 1, 2013

Bryan Fink
cherry-pick commit 041cee5 into 1.3 branch
It was desired to have the sink-sync-send crash fix backported to
the 1.3 branch. This is a cherry-pick of the commit that fixed the
issue from pull request #73. The tests tests that demonstrate the
issue have not been backported, to save merge hassle.

Original commit comment:

catch sink fsm exits for all reasons

The reduce_fitting_pulse test showed that other exit reasons might
be seen here. For example, 'normal' can happen if the sink shuts
down after the sync event has been delivered.

This commit broadens the catch to prevent those other reasons from
killing the current process, and also narrows it to make sure exit
is bubbling from the sync_send_event, and nowhere else.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment