Skip a partition replication due to race condition on fullsync [JIRA: RIAK-2551] #743

Open
ksauzz opened this issue May 10, 2016 · 0 comments

ksauzz commented May 10, 2016

There is a race condition in riak_repl_keylist_server:bloom_fold during fullsync that can raise a function_clause error, but the fullsync manager treats this as a normal completion of the partition's replication. As a result, a partition can be skipped without the user noticing that some of its keys were never replicated to the sink cluster.

The Cause

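In keylisting fullsync, bloom_fold, the fold function running on the vnode worker, waits for resume_pause after sending a batch to the sink node. However, another fold message somehow arrived at the waiting worker (see "Last message in" in crash.log). The branch that handles unexpected messages then runs, and its ?TRACE macro evaluates to the atom ok, which is returned as the accumulator. The next call to bloom_fold receives ok as its accumulator, matches no function clause, and crashes the vnode worker.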
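
The failure mode described above can be sketched in isolation. This is a minimal illustration, not the riak_repl code: the module name fold_acc_sketch, the functions buggy_fold/2 and safe_fold/2, the {Count, Left} accumulator, and the resume and unexpected messages are all hypothetical stand-ins. It shows how a receive branch whose last expression is a logging call (which returns ok) replaces the accumulator, so the next fold step fails with function_clause. safe_fold/2 is one possible guard (log and keep waiting), not necessarily the fix the maintainers would choose.

```erlang
-module(fold_acc_sketch).
-export([demo/0]).

%% Hypothetical windowed fold: Acc is {Count, Left}, where Left is the
%% number of items that may still be sent before waiting for a resume.
buggy_fold(_Item, {Count, 0}) ->
    receive
        resume ->
            {Count + 1, 0};
        Other ->
            %% io:format/2 returns ok, so ok becomes the accumulator.
            io:format("unexpected message: ~p~n", [Other])
    end;
buggy_fold(_Item, {Count, Left}) ->
    {Count + 1, Left - 1}.

%% Same fold, but an unexpected message is logged and the wait continues,
%% so the accumulator keeps its shape.
safe_fold(Item, {Count, 0} = Acc) ->
    receive
        resume ->
            {Count + 1, 0};
        Other ->
            io:format("ignoring unexpected message: ~p~n", [Other]),
            safe_fold(Item, Acc)
    end;
safe_fold(_Item, {Count, Left}) ->
    {Count + 1, Left - 1}.

%% Drop anything left in the mailbox between the two runs.
flush() ->
    receive _ -> flush() after 0 -> ok end.

demo() ->
    %% An unexpected message is queued ahead of the resume message.
    self() ! unexpected,
    self() ! resume,
    Buggy = try lists:foldl(fun buggy_fold/2, {0, 1}, [a, b, c])
            catch error:function_clause -> crashed
            end,
    %% The buggy run leaves the resume message unread in the mailbox.
    flush(),
    self() ! unexpected,
    self() ! resume,
    self() ! resume,
    Safe = lists:foldl(fun safe_fold/2, {0, 1}, [a, b, c]),
    {Buggy, Safe}.
```

In the buggy run the resume message is still in the mailbox when the crash happens, which is consistent with the `messages: [bloom_resume]` line in the crash report below.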

Reproduction Steps

No reliable reproduction steps have been found yet.

Occurrence Frequency

A customer has observed this occasionally. From their side, it appears to happen at random.

error.log

2016-04-07 23:22:01.094 [error] <0.9942.66> gen_server <0.9942.66> terminated with reason: no function clause matching riak_repl_keylist_server:bloom_fold({<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<121,107,250,67,143,8,73,...>>}, <<53,1,0,0,0,34,131,108,0,0,0,1,104,2,109,0,0,0,8,239,219,125,147,226,32,130,220,104,2,97,1,110,...>>, ok) line 675
2016-04-07 23:22:01.111 [error] <0.9942.66> CRASH REPORT Process <0.9942.66> with 0 neighbours exited with reason: no function clause matching riak_repl_keylist_server:bloom_fold({<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<121,107,250,67,143,8,73,...>>}, <<53,1,0,0,0,34,131,108,0,0,0,1,104,2,109,0,0,0,8,239,219,125,147,226,32,130,220,104,2,97,1,110,...>>, ok) line 675 in gen_server:terminate/6 line 744
2016-04-07 23:22:01.111 [error] <0.1131.0> Supervisor {<0.1131.0>,poolboy_sup} had child riak_core_vnode_worker started with {riak_core_vnode_worker,start_link,undefined} at <0.9942.66> exit with reason no function clause matching riak_repl_keylist_server:bloom_fold({<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<121,107,250,67,143,8,73,...>>}, <<53,1,0,0,0,34,131,108,0,0,0,1,104,2,109,0,0,0,8,239,219,125,147,226,32,130,220,104,2,97,1,110,...>>, ok) line 675 in context child_terminated

crash.log

Binary data was replaced with <<"omitted binary">>.

2016-04-13 10:40:31 =ERROR REPORT====
** Generic server <0.916.0> terminating 
** Last message in was {'$gen_cast',{work,{fold,#Fun<riak_cs_kv_multi_backend.9.110104299>,#Fun<riak_kv_vnode.35.88487897>},{raw,#Ref<0.0.4.162629>,<0.22168.4>},<0.884.0>}}
** When Server state == {state,riak_kv_worker,{state,1118962191081472546749696200048404186924073353216}}
** Reason for termination == 
** {function_clause,[{riak_repl_keylist_server,bloom_fold,[{<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<35,217,96,165,236,160,69,118,179,100,39,110,92,174,247,160,0,0,0,0>>},<<"omitted binary">>,ok],[{file,"src/riak_repl_keylist_server.erl"},{line,675}]},{bitcask_fileops,fold_int_loop,5,[{file,"src/bitcask_fileops.erl"},{line,554}]},{bitcask_fileops,fold_file_loop,8,[{file,"src/bitcask_fileops.erl"},{line,720}]},{bitcask_fileops,fold,3,[{file,"src/bitcask_fileops.erl"},{line,391}]},{bitcask,subfold,3,[{file,"src/bitcask.erl"},{line,506}]},{bitcask_nifs,keydir_frozen,4,[{file,"src/bitcask_nifs.erl"},{line,304}]},{riak_kv_bitcask_backend,'-fold_objects/4-fun-0-',5,[{file,"src/riak_kv_bitcask_backend.erl"},{line,351}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]}]}
2016-04-13 10:40:31 =CRASH REPORT====
  crasher:
    initial call: riak_core_vnode_worker:init/1
    pid: <0.916.0>
    registered_name: []
    exception exit: {{function_clause,[{riak_repl_keylist_server,bloom_fold,[{<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<35,217,96,165,236,160,69,118,179,100,39,110,92,174,247,160,0,0,0,0>>},<<"omitted binary">>,ok],[{file,"src/riak_repl_keylist_server.erl"},{line,675}]},{bitcask_fileops,fold_int_loop,5,[{file,"src/bitcask_fileops.erl"},{line,554}]},{bitcask_fileops,fold_file_loop,8,[{file,"src/bitcask_fileops.erl"},{line,720}]},{bitcask_fileops,fold,3,[{file,"src/bitcask_fileops.erl"},{line,391}]},{bitcask,subfold,3,[{file,"src/bitcask.erl"},{line,506}]},{bitcask_nifs,keydir_frozen,4,[{file,"src/bitcask_nifs.erl"},{line,304}]},{riak_kv_bitcask_backend,'-fold_objects/4-fun-0-',5,[{file,"src/riak_kv_bitcask_backend.erl"},{line,351}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]}]},[{gen_server,terminate,6,[{file,"gen_server.erl"},{line,744}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
    ancestors: [<0.887.0>,<0.885.0>,<0.884.0>,<0.680.0>,riak_core_vnode_sup,riak_core_sup,<0.220.0>]
    messages: [bloom_resume]
    links: [<0.887.0>,<0.885.0>]
    dictionary: [{bitcask_file_mod,bitcask_file},{bitcask_time_fudge,no_testing}]
    trap_exit: false
    status: running
    heap_size: 6772
    stack_size: 27
    reductions: 20747615
  neighbours:
2016-04-13 10:40:31 =SUPERVISOR REPORT====
     Supervisor: {<0.887.0>,poolboy_sup}
     Context:    child_terminated
     Reason:     {function_clause,[{riak_repl_keylist_server,bloom_fold,[{<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<35,217,96,165,236,160,69,118,179,100,39,110,92,174,247,160,0,0,0,0>>},<<"omitted binary">>,ok],[{file,"src/riak_repl_keylist_server.erl"},{line,675}]},{bitcask_fileops,fold_int_loop,5,[{file,"src/bitcask_fileops.erl"},{line,554}]},{bitcask_fileops,fold_file_loop,8,[{file,"src/bitcask_fileops.erl"},{line,720}]},{bitcask_fileops,fold,3,[{file,"src/bitcask_fileops.erl"},{line,391}]},{bitcask,subfold,3,[{file,"src/bitcask.erl"},{line,506}]},{bitcask_nifs,keydir_frozen,4,[{file,"src/bitcask_nifs.erl"},{line,304}]},{riak_kv_bitcask_backend,'-fold_objects/4-fun-0-',5,[{file,"src/riak_kv_bitcask_backend.erl"},{line,351}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]}]}
     Offender:   [{pid,<0.916.0>},{name,riak_core_vnode_worker},{mfargs,{riak_core_vnode_worker,start_link,undefined}},{restart_type,temporary},{shutdown,5000},{child_type,worker}]
Basho-JIRA changed the title from "Skip a partition replication due to race condition on fullsync" to "Skip a partition replication due to race condition on fullsync [JIRA: RIAK-2551]" on May 10, 2016