
Put FSM crashes if forwarded node goes down. #300

jonmeredith opened this Issue · 2 comments

Jon Meredith

In 1.0 and above with vnode_vclocks enabled, a node must be a member of the preference list to coordinate a put.
Currently, nodes forward put requests to the first member of the preference list by calling riak_kv_put_fsm_sup on the remote node - code here

If the remote node goes down while the process is being started, the start_put_fsm call raises an uncaught exit instead of returning an {error, Reason} response from the remote supervisor, which crashes the local put FSM.

2012-03-09 06:11:00 =SUPERVISOR REPORT====
     Supervisor: {local,riak_kv_put_fsm_sup}
     Context:    child_terminated
     Reason:     {{nodedown,'riak@host1'},{gen_server,call,[{riak_kv_put_fsm_sup,'riak@host1'},{start_child,[{raw,12339479,<0.4491.12>},{r_object,<<"bucket">>,<<"key">>,[{r_content,{dict,6,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[[<<"Links">>]],[],[],[],[],[],[],[],[[<<"content-type">>,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110],[<<"X-Riak-VTag">>,52,113,74,65,119,52,87,90,51,116,119,86,107,73,112,113,101,65,48,52,73,53]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1331,269199,636663}]],[],[[<<"X-Riak-Meta">>]]}}},<<"body">>}],[{<<54,74,223,68,79,76,172,101>>,{2,63498487778}}],{dict,4,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[[<<"Links">>]],[],[],[],[],[],[],[],[[<<"content-type">>,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110]],[[<<"index">>]],[],[],[],[[<<"X-Riak-Meta">>]]}}},<<"{\"started\":1331268781747}">>},[{w,default},{dw,default},{pw,default},{timeout,60000}]]},infinity]}}
     Offender:   [{pid,<0.4817.12>},{name,undefined},{mfargs,{riak_kv_put_fsm,start_link,undefined}},{restart_type,temporary},{shutdown,5000},{child_type,worker}]
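The failure mode can be sketched as follows (module and argument names here are illustrative, not the actual riak_kv code path). supervisor:start_child/2 on a remotely registered name is a gen_server:call under the hood, so when the node drops mid-call the caller exits with {{nodedown, Node}, ...} rather than receiving an error tuple:

```erlang
%% Illustrative sketch only - not the actual riak_kv forwarding code.
%% A gen_server:call to {Name, Node} exits with {{nodedown, Node}, ...}
%% if Node goes down mid-call, so the exit must be caught explicitly
%% to turn it back into an {error, Reason} the put FSM can handle.
start_remote_fsm(Node, Args) ->
    try
        supervisor:start_child({riak_kv_put_fsm_sup, Node}, Args)
    catch
        exit:{{nodedown, _}, {gen_server, call, _}} ->
            {error, {nodedown, Node}}
    end.
```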

If the node is under heavy load you'll get a crash dump for however many puts ran in the last 4*net_ticktime window (typically 60s), which can be in the thousands and can overload the logging subsystem.

The easiest path seems to be to replace the remote supervisor call with an rpc:call to the remote node, which will then start the FSM under its local supervisor. That way we can supply a timeout that honors the timeout requested for the FSM (which I suppose could worst-case become 2*timeout, if the forwarding node and then the remote node each apply the same timeout - but distributed time is hard).
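A minimal sketch of that approach (the wrapper function name is hypothetical; rpc:call/5 and its {badrpc, Reason} return are standard OTP): rpc:call returns an error tuple instead of raising when the node is down, and accepts an explicit timeout, so the forwarding FSM can bound the call by the put's own timeout:

```erlang
%% Sketch of the proposed fix - forward_put/3 is a hypothetical wrapper.
%% rpc:call/5 returns {badrpc, Reason} on nodedown or timeout instead of
%% exiting, so the local put FSM can handle the failure gracefully.
forward_put(Node, Args, Timeout) ->
    case rpc:call(Node, riak_kv_put_fsm_sup, start_put_fsm,
                  [Node, Args], Timeout) of
        {ok, Pid}        -> {ok, Pid};
        {badrpc, Reason} -> {error, Reason};
        {error, Reason}  -> {error, Reason}
    end.
```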

To prove the issue is resolved, please do a comparative run of the current code against the fixed code.
I'd suggest running basho_bench against a 4-node cluster with N=1 on the test bucket (so that 75% of requests are forwarded), using the memory backend over protobuffs.
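A basho_bench configuration along these lines could drive the comparison; the IPs, duration, concurrency, and generator settings below are placeholders, not a prescribed benchmark. Listing all four nodes means most puts land on a non-coordinating node and get forwarded:

```erlang
%% Hypothetical basho_bench config sketch for the comparison run.
%% IPs, duration, and concurrency are placeholders - adjust to taste.
{mode, max}.
{duration, 10}.
{concurrent, 16}.
{driver, basho_bench_driver_riakc_pb}.
{riakc_pb_ips, [{10,0,0,1}, {10,0,0,2}, {10,0,0,3}, {10,0,0,4}]}.
{key_generator, {int_to_bin_bigendian, {uniform_int, 100000}}}.
{value_generator, {fixed_bin, 1000}}.
{operations, [{put, 1}]}.
```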

If the fixed code's times are within 95% of the current solution's we'll accept the performance trade-off for robustness. This failure can be enough to take out a node by exhausting processes/memory.

Please make sure the fix can be applied against the 1.0 branch as well as 1.1 and master.

Jeffrey Massung

#301 has a fix for this.

Jon Meredith

Merged to 1.0 and 1.1 branches, Jared will sync up master when he merges 1.1 in.
