
Problems with commit? #405

Open
rzezeski opened this Issue · 4 comments

3 participants

@rzezeski

Over the weekend (9/28/13) I noticed that the Yokozuna riak tests were
failing very often on the rt:build_cluster call. It was failing
in the wait_until_nodes_ready function, which verifies that all nodes
are members of riak_core_ring:ready_members. Leaving the nodes up
after the test, I was able to run member-status and ring-status and
see that one or more nodes were still considered to be in the
joining state rather than the valid state. Something was going
wrong with join/plan/commit; I wasn't sure which.
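
For context, the check that was failing is roughly of this shape (a
minimal sketch, not the actual rt code; the function names, polling
interval, and rpc details are my approximations): poll each node until
it shows up in riak_core_ring:ready_members/1.

wait_until_nodes_ready(Nodes) ->
    [ok = wait_until_ready(Node) || Node <- Nodes],
    ok.

wait_until_ready(Node) ->
    case is_ready(Node) of
        true  -> ok;
        false -> timer:sleep(500), wait_until_ready(Node)
    end.

is_ready(Node) ->
    case rpc:call(Node, riak_core_ring_manager, get_raw_ring, []) of
        {ok, Ring} ->
            %% ask the node itself which members it considers ready
            Ready = rpc:call(Node, riak_core_ring, ready_members, [Ring]),
            is_list(Ready) andalso lists:member(Node, Ready);
        _ ->
            false
    end.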

My first guess was that calling cluster join (rt:staged_join) too
quickly was causing some kind of race in the joining-node set. I added a
1s sleep to that function and was able to get 15 successful runs in a
row. Previously my longest streak was 4, and in most cases it failed every
other run. At this point I thought I had solved it, but then I decided to
investigate more.
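
Roughly what that change looked like (a sketch under my assumptions,
not the exact rt:staged_join code):

staged_join(Node, ToNode) ->
    %% stage the join of Node to ToNode, as rt:staged_join does
    ok = rpc:call(Node, riak_core, staged_join, [ToNode]),
    %% the 1s pause that took the longest streak from 4 to 15 clean runs
    timer:sleep(1000),
    ok.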

The riak_core_claimant:plan call returns {ok, Changes, NextRings}
in the case of a successful plan. I modified the rt:plan_and_commit
function to print the Changes list. With or without the sleep in
staged_join, this list always included all nodes with the state
join. This leads me to believe it can't possibly be a race in join.
How could all nodes be in the joining list if there was a race?
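
The instrumentation was roughly this (an approximate sketch, not the
exact diff I ran; it assumes riak_test's lager logging):

plan_and_print(Node) ->
    case rpc:call(Node, riak_core_claimant, plan, []) of
        {ok, Changes, NextRings} ->
            %% log the staged changes so we can see which nodes are joining
            lager:info("staged changes: ~p", [Changes]),
            {ok, Changes, NextRings};
        Error ->
            Error
    end.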

Then I tried adding the 1s sleep in the plan_and_commit call instead. I
put it after matching the plan result but before calling commit.

%% excerpt from rt:plan_and_commit -- sleep between plan and commit
{ok, _, _} ->
    timer:sleep(1000),
    ok = rpc:call(Node, riak_core_claimant, commit, [])

With this sleep I'm pretty sure I've never seen a failure. That made me
think there is a race between plan and commit. Talking with @jtuple, I
was told to check the NextRings value. He explained that when plan is
called, the generated plan is stored in the claimant's state. Then when
commit is called, a new plan is generated and verified against the
saved plan in the claimant. If they don't match, the commit bails. Joe
suggested I inspect the NextRings value and verify that all nodes are
included as 'valid' in there. Here are my print-outs below. One case is
from a run where build_cluster passed, the other from a run where it
didn't. Notice they are IDENTICAL.

%%% worked %%%

03:19:29.226 [info] 1 : {[{'dev1@127.0.0.1',valid},{'dev2@127.0.0.1',joining},{'dev3@127.0.0.1',joining},{'dev4@127.0.0.1',joining}], [{'dev1@127.0.0.1',valid},{'dev2@127.0.0.1',valid},{'dev3@127.0.0.1',valid},{'dev4@127.0.0.1',valid}]}
03:19:29.227 [info] 2 : {[{'dev1@127.0.0.1',valid},{'dev2@127.0.0.1',valid},{'dev3@127.0.0.1',valid},{'dev4@127.0.0.1',valid}], [{'dev1@127.0.0.1',valid},{'dev2@127.0.0.1',valid},{'dev3@127.0.0.1',valid},{'dev4@127.0.0.1',valid}]}


%%% no work %%%%

03:22:44.363 [info] 1 : {[{'dev1@127.0.0.1',valid},{'dev2@127.0.0.1',joining},{'dev3@127.0.0.1',joining},{'dev4@127.0.0.1',joining}], [{'dev1@127.0.0.1',valid},{'dev2@127.0.0.1',valid},{'dev3@127.0.0.1',valid},{'dev4@127.0.0.1',valid}]}
03:22:44.363 [info] 2 : {[{'dev1@127.0.0.1',valid},{'dev2@127.0.0.1',valid},{'dev3@127.0.0.1',valid},{'dev4@127.0.0.1',valid}], [{'dev1@127.0.0.1',valid},{'dev2@127.0.0.1',valid},{'dev3@127.0.0.1',valid},{'dev4@127.0.0.1',valid}]}
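
For reference, the check Joe suggested amounts to something like this
(a hypothetical helper, not code from riak_core; it treats each
NextRings entry as a {CurrentMembers, NextMembers} pair of
{Node, Status} lists, which is the form shown in the print-outs above):

all_valid_in_final_ring(NextRings, Nodes) ->
    %% take the membership of the last planned ring transition
    {_CurrentMembers, FinalMembers} = lists:last(NextRings),
    %% every node should appear there with status 'valid'
    lists:all(fun(N) -> lists:member({N, valid}, FinalMembers) end, Nodes).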

At this point Joe and I gave up. It was the weekend, after all. The new
assumption is that something in the commit code path races with
something else and transitions nodes back into a joining state when it
shouldn't.

But for whatever reason, adding a 1s sleep in either rt:staged_join or
rt:plan_and_commit seems to make the problem go away.

Here are excerpts of ring-status and member-status from a run where
build_cluster failed and the cluster got stuck in this semi-valid state.

%%% dev1 ring-status %%%

================================== Claimant ===================================
Claimant:  'dev1@127.0.0.1'
Status:     up
Ring Ready: true

============================== Ownership Handoff ==============================
Owner:      dev1@127.0.0.1
Next Owner: dev2@127.0.0.1

Index: 22835963083295358096932575511191922182123945984
  All transfers complete. Waiting for claimant to change ownership.

Index: 114179815416476790484662877555959610910619729920
  All transfers complete. Waiting for claimant to change ownership.

Index: 205523667749658222872393179600727299639115513856
  All transfers complete. Waiting for claimant to change ownership.

Index: 296867520082839655260123481645494988367611297792
  All transfers complete. Waiting for claimant to change ownership.

Index: 388211372416021087647853783690262677096107081728
  All transfers complete. Waiting for claimant to change ownership.

Index: 479555224749202520035584085735030365824602865664
  All transfers complete. Waiting for claimant to change ownership.

Index: 570899077082383952423314387779798054553098649600
  All transfers complete. Waiting for claimant to change ownership.

Index: 662242929415565384811044689824565743281594433536
  All transfers complete. Waiting for claimant to change ownership.

Index: 753586781748746817198774991869333432010090217472
  All transfers complete. Waiting for claimant to change ownership.

Index: 844930634081928249586505293914101120738586001408
  All transfers complete. Waiting for claimant to change ownership.

Index: 936274486415109681974235595958868809467081785344
  All transfers complete. Waiting for claimant to change ownership.

Index: 1027618338748291114361965898003636498195577569280
  All transfers complete. Waiting for claimant to change ownership.

Index: 1118962191081472546749696200048404186924073353216
  All transfers complete. Waiting for claimant to change ownership.

Index: 1210306043414653979137426502093171875652569137152
  All transfers complete. Waiting for claimant to change ownership.

Index: 1301649895747835411525156804137939564381064921088
  All transfers complete. Waiting for claimant to change ownership.

Index: 1392993748081016843912887106182707253109560705024
  All transfers complete. Waiting for claimant to change ownership.

-------------------------------------------------------------------------------
Owner:      dev1@127.0.0.1
Next Owner: dev3@127.0.0.1

Index: 45671926166590716193865151022383844364247891968
  All transfers complete. Waiting for claimant to change ownership.

Index: 137015778499772148581595453067151533092743675904
  All transfers complete. Waiting for claimant to change ownership.

Index: 228359630832953580969325755111919221821239459840
  All transfers complete. Waiting for claimant to change ownership.

Index: 319703483166135013357056057156686910549735243776
  All transfers complete. Waiting for claimant to change ownership.

Index: 411047335499316445744786359201454599278231027712
  All transfers complete. Waiting for claimant to change ownership.

Index: 502391187832497878132516661246222288006726811648
  All transfers complete. Waiting for claimant to change ownership.

Index: 593735040165679310520246963290989976735222595584
  All transfers complete. Waiting for claimant to change ownership.

Index: 685078892498860742907977265335757665463718379520
  All transfers complete. Waiting for claimant to change ownership.

Index: 776422744832042175295707567380525354192214163456
  All transfers complete. Waiting for claimant to change ownership.

Index: 867766597165223607683437869425293042920709947392
  All transfers complete. Waiting for claimant to change ownership.

Index: 959110449498405040071168171470060731649205731328
  All transfers complete. Waiting for claimant to change ownership.

Index: 1050454301831586472458898473514828420377701515264
  All transfers complete. Waiting for claimant to change ownership.

Index: 1141798154164767904846628775559596109106197299200
  All transfers complete. Waiting for claimant to change ownership.

Index: 1233142006497949337234359077604363797834693083136
  All transfers complete. Waiting for claimant to change ownership.

Index: 1324485858831130769622089379649131486563188867072
  All transfers complete. Waiting for claimant to change ownership.

Index: 1415829711164312202009819681693899175291684651008
  All transfers complete. Waiting for claimant to change ownership.

-------------------------------------------------------------------------------
Owner:      dev1@127.0.0.1
Next Owner: dev4@127.0.0.1

Index: 68507889249886074290797726533575766546371837952
  All transfers complete. Waiting for claimant to change ownership.

Index: 159851741583067506678528028578343455274867621888
  All transfers complete. Waiting for claimant to change ownership.

Index: 251195593916248939066258330623111144003363405824
  All transfers complete. Waiting for claimant to change ownership.

Index: 342539446249430371453988632667878832731859189760
  All transfers complete. Waiting for claimant to change ownership.

Index: 433883298582611803841718934712646521460354973696
  All transfers complete. Waiting for claimant to change ownership.

Index: 525227150915793236229449236757414210188850757632
  All transfers complete. Waiting for claimant to change ownership.

Index: 616571003248974668617179538802181898917346541568
  All transfers complete. Waiting for claimant to change ownership.

Index: 707914855582156101004909840846949587645842325504
  All transfers complete. Waiting for claimant to change ownership.

Index: 799258707915337533392640142891717276374338109440
  All transfers complete. Waiting for claimant to change ownership.

Index: 890602560248518965780370444936484965102833893376
  All transfers complete. Waiting for claimant to change ownership.

Index: 981946412581700398168100746981252653831329677312
  All transfers complete. Waiting for claimant to change ownership.

Index: 1073290264914881830555831049026020342559825461248
  All transfers complete. Waiting for claimant to change ownership.

Index: 1164634117248063262943561351070788031288321245184
  All transfers complete. Waiting for claimant to change ownership.

Index: 1255977969581244695331291653115555720016817029120
  All transfers complete. Waiting for claimant to change ownership.

Index: 1347321821914426127719021955160323408745312813056
  All transfers complete. Waiting for claimant to change ownership.

Index: 1438665674247607560106752257205091097473808596992
  All transfers complete. Waiting for claimant to change ownership.

-------------------------------------------------------------------------------

============================== Unreachable Nodes ==============================
All nodes are up and reachable

%%% dev1 member-status %%%

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
joining     0.0%     25.0%    'dev4@127.0.0.1'
valid     100.0%     25.0%    'dev1@127.0.0.1'
valid       0.0%     25.0%    'dev2@127.0.0.1'
valid       0.0%     25.0%    'dev3@127.0.0.1'
-------------------------------------------------------------------------------
Valid:3 / Leaving:0 / Exiting:0 / Joining:1 / Down:0
@beerriot

Just adding a note to say I've seen this issue on the giddyup runs of pipe_verify_sink_types, both OSS and EE, with YZ disabled:

http://giddyup.basho.com/#/projects/riak/scorecards/51/51-707-pipe_verify_sink_types-centos-6-64/19352

I also ran into it while running the new cluster_meta_basic on my laptop.

@rzezeski
It looks like basho/riak_test#397 might be a fix for this.

EDIT: After some discussion with @Vagabond and @jrwest today it seems we might still be hitting this case even with the basho/riak_test#397 fix. More investigation is still needed.

@jrwest

I just ran into this as well https://gist.github.com/jrwest/c48f714cc72d7dce4e05

cluster was also setup quickly but not by riak_test [1].

[1] https://github.com/jrwest/devrel-mode/blob/master/devrel-mode.el#L155-L165

@jrwest

@jtuple can you confirm this issue was actually a problem in some wait_until_* logic in riak_test (iirc) before I close this.

Otherwise, I'm marking this as 2.1, since whatever bug exists has been present in many previous versions of Riak and won't be fixed this cycle.

@jrwest jrwest added this to the 2.1 milestone