Testing bitcask with QuickCheck and PULSE #41

Closed
wants to merge 17 commits into
from

Conversation

Projects
None yet
5 participants
Contributor

hanssv commented Mar 28, 2012

Jon, Dizzy,

We created a test framework using QuickCheck, PULSE and eqc_temporal to test for race conditions in the bitcask application. Please note that you need QuickCheck version newer than eqc-1.25.2 to run those tests.

We changed the rebar configuration file (in comments) to be able to execute this code.

Regards
Quviq

hanssv and others added some commits Mar 28, 2012

Removing old pulse-test
The old eqc pulse property does not work with the
new pulse/rebar.config.
Changed rebar.config
The new PULSE tests consider a lot more side-effects and uses
pulse_gen_server, pulse_supervisor, etc.

We also need to instrument the C-code, therefore there is a line
in port_envs for setting -D PULSE.
Add pulse_instrument compile options
More bitcask modules are instrumented by PULSE.
All compile options are ifdef:ed with PULSE
Patching bitcask_nifs for PULSE usage
Erlang messages are sent from the C-code (bitcask_nifs.c). This
will not work when scheduled by PULSE. Since it is not possible

Patching bitcask_nifs for PULSE usage

Erlang messages are sent from the C-code (bitcask_nifs.c). This
will not work when scheduled by PULSE. Since it is not possible
to send messages to registered processes from C, quite a bit of
machinery is needed to keep track of the current PULSE process
so that messages can be re-routed properly.

Everything is ifdef:ed, so when compiling normally, no code is
added to bitcask.so.

Usage:
bitcask_nifs:set_pulse_pid(Pid) is used to store the current
PULSE pid, after that messages sent from the C code will be
routed properly.
If no call is made to bitcask_nifs:set_pulse_pid is made, the
behavior is unspecified!!
Added a note to rebar.config
The tests can't be run without the pulsified versions of some OTP
libraries (pulse_gen, pulse_gen_server, etc...) and a new version of
PULSE.
Updated pulse_send
Now uses the correct format required by the improved PULSE. Also turned the
function call into a Macro in order to get access to source location!
No need to explicitly start pulse_application_controller
This is done automatically when calling (pulse_)application:start/1 in
the latest pulse_otp.
added no_side_effect on erlang:now()
also use erlang:now() instead of pulse:seed() to make
quickcheck pick a fresh seed in every shrinking step
erlang:now() doesn't make a good pulse seed
it improves shrinking, but makes recheck not work
Contributor

gburd commented Oct 24, 2012

@hanssv if you rebase this to master I'll review, test and work to approve it.

Contributor

jonmeredith commented Oct 24, 2012

Has master moved significantly since this work was done. I feel really bad this issue has sat for so long.

Contributor

slfritchie commented Oct 25, 2012

@hanssv and @UlfNorell, I too apologize for how this PR has languished. I've made an attempt at rebasing your work onto bitcask's master branch: the result the 'gh41-quviq-pulse-and-qc' branch, https://github.com/basho/bitcask/tree/gh41-quviq-pulse-and-qc ... If one of you could peruse to see that I didn't break anything awful when dealing with a few rebasing conflicts, we'll resume reviewing using the gh41-quviq-pulse-and-qc branch.

I've made it as far as checking that the pre-rebased tests don't fail any more. Bug #60 appears to mask a QuickCheck EUnit test failure, but at least I haven't made things any worse.

Next, I'll try uncommenting all of the PULSE stuff in rebar.config and see how things go.

Contributor

slfritchie commented Oct 25, 2012

@hanssv and @UlfNorell, where would I find the pulsified OTP components? They don't appear to be part of the "Quviq QuickCheck version 1.27.2" distribution zip file. Feel free to drop me an email note at scott@basho.com.

                  %% NOTE2: You'll need a recent (1.25.2+) version of PULSE to run the tests
                  %% and a copy of the pulsified OTP componentes (pulse_gen, etc...)

Attempts by a test to the tests without them do stuff like this:

{'EXIT',
    {{case_clause,
         {badrpc,
             {'EXIT',
                 {pulse,
                     [{undef,
                          [{pulse_application,start,[bitcask],[]},
                           {bitcask_eqc,'-run_commands_on_node/3-fun-9-',2,
                               [{file,"test/pulse/bitcask_eqc.erl"},
                                {line,275}]},
                           {eqc,'-f140_0/4-fun-0-',2,
                               [{file,
                                    "../src/../../pulse-project/src/pulse_scheduler.hrl"},
                                {line,147}]}]}]}}}},
     [{bitcask_eqc,'-prop_pulse/1-fun-4-',3,
          [{file,"test/pulse/bitcask_eqc.erl"},{line,306}]}]}}
Contributor

hanssv commented Oct 25, 2012

I won't have time to look at this during this week I am afraid.

The pulsified versions of the OTP components will be part of the next major QuickCheck release (1.28), until then they can be retreived from (https://github.com/Quviq/pulse_otp).

Contributor

slfritchie commented Oct 25, 2012

Many thanks for the pointer to git://github.com/Quviq/pulse_otp.git ... I've cloned it, compiled, and successfully run the PULSE tests. I believe the 'gh41-quviq-pulse-and-qc' branch is ready for Basho internal review.

Contributor

slfritchie commented Oct 26, 2012

For what it's worth, I see this failure very occasionally. It isn't clear to me where the deadlock atom is coming from.

Failed! Reason: 
{'EXIT',{{case_clause,{badrpc,{'EXIT',deadlock}}},
         [{bitcask_eqc,'-prop_pulse/1-fun-4-',3,
                       [{file,"test/pulse/bitcask_eqc.erl"},{line,306}]}]}}
Contributor

hanssv commented Oct 26, 2012

PULSE can detect deadlocks, and if it does, it exits with the atom deadlock as reason.

It looks like this is not properly handled in prop_pulse. It looks like the run_on_node returns {badrpc, {'EXIT', Err}} rather than the expected {'EXIT', Err}, this is not handled in the case at line 306. I'll try to have a quick look at this later today, but I thought I'd better write this comment now if I fail to get back to this problem later today...

Contributor

slfritchie commented Oct 26, 2012

Hans, I'll cobble up an extra tuple wrapper around line 306 and a cousin on line 307.

I've a theory about a problem with the model with respect to repeated closing & opening: in some races, it seems possible that the internal refcount (upon close) can drop to 0, whereas the model still seems to think that some actor has got the bitcask open. If the refcount drops to zero, then all entries in the memory keydir are dropped, which then plays havoc on the model's expectation of get & fold & fold_keys results ought to be.

Once a counter-example is found, shrinking takes quite a while ... so I don't have enough data to go on yet, alas.

Contributor

slfritchie commented Oct 26, 2012

Unrelated, but interesting:

=ERROR REPORT==== 26-Oct-2012::03:13:01 ===
Unknown command for pulse: 49
Contributor

hanssv commented Oct 26, 2012

Yes, shrinking might be a bit time consuming, there are quite many scheduling decisions in there and I suspect that only a very particular kind of schedules actually provoke the error.

That last bit is puzzling; it means that someone sent 49 to the outermost pulse scheduler loop, which seems very strange, does it happen regularly?

Contributor

slfritchie commented Oct 26, 2012

The 49 command has only happened once. I have seen a variation that appears to halt/deadlock/pause-at-very-least PULSE; I haven't waited long enough to see if work resumes.

=ERROR REPORT==== 26-Oct-2012::03:46:36 ===
Unknown command for pulse: [{'fork.__Result7',yield},
                            {'fork.__Result7',yield},
                            {root,yield},
                            {root,{deliver,'fork.__Result6'}},
                            {spawn_opt1,{deliver,root}},
                            {'handle_call.WorkerPid1',yield},
                            {'handle_call.WorkerPid1',yield},
                            {'start.AC',{deliver,'handle_call.WorkerPid1'}},
                            ....
Contributor

slfritchie commented Nov 16, 2012

Closing: replaced by #68 which is a superset of this PR.

@slfritchie slfritchie closed this Nov 16, 2012

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment