Map-reduce jobs fail during rolling upgrade to 1.1 #144

Closed
jtuple opened this Issue Feb 24, 2012 · 2 comments

@jtuple
Contributor
jtuple commented Feb 24, 2012

Map-reduce jobs fail against a mixed 1.1/1.0 or 1.1/0.14 cluster. Mixed clusters are common during rolling upgrades, where nodes are upgraded one at a time until the entire cluster is at the new version (1.1 in this case).

In addition to the jobs failing, the Riak 1.1 nodes log several instances of the following error and eventually shut down:

2012-02-22 00:56:16.429 [error] <0.1615.0>
gen_server riak_pipe_vnode_master terminated with reason: no function clause matching
riak_core_vnode_master:handle_call(
  {return_vnode,
    {riak_vnode_req_v1,479555224749202520035584085735030365824602865664,
      {raw,#Ref<6584.0.7263.108930>,...},...}},
  {<6604.7292.421>,#Ref<6604.0.6697.250987>},
  {state,undefined,undefined,riak_pipe_vnode,undefined})

2012-02-22 00:56:16.431 [error] <0.1615.0> CRASH REPORT
Process riak_pipe_vnode_master with 0 neighbours crashed with reason:
no function clause matching riak_core_vnode_master:handle_call(
  {return_vnode,
    {riak_vnode_req_v1,479555224749202520035584085735030365824602865664,
      {raw,#Ref<6584.0.7263.108930>,...},...}},
  {<6604.7292.421>,#Ref<6604.0.6697.250987>},
  {state,undefined,undefined,riak_pipe_vnode,undefined})

The error arises because Riak 1.1 no longer implements riak_core_vnode_master:handle_call({return_vnode, ...}, ...). In a mixed cluster, however, pre-1.1 nodes still send return_vnode messages to Riak 1.1 nodes, and those messages must be handled by this now-missing clause.
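
A minimal Erlang sketch of the failure mode and of a backward-compatible clause, for illustration only. The module name, the state record, and the `{new_style, Req}` message are hypothetical; only the `{return_vnode, Req}` message shape is taken from the log above. This is not the actual riak_core code.

```erlang
-module(legacy_call_sketch).
-export([handle_call/3, demo/0]).

%% Hypothetical stand-in for the real gen_server state.
-record(state, {vnode_mod}).

%% Clause kept so that messages from pre-1.1 senders still match.
%% Assumption: the legacy message has the shape {return_vnode, Req}.
handle_call({return_vnode, Req}, _From, State) ->
    {reply, {ok, {legacy, Req}}, State};
%% Hypothetical clause for whatever message newer nodes send.
handle_call({new_style, Req}, _From, State) ->
    {reply, {ok, {current, Req}}, State}.

demo() ->
    State = #state{vnode_mod = riak_pipe_vnode},
    From = {self(), make_ref()},
    %% The legacy message is handled instead of crashing the server.
    {reply, {ok, {legacy, ping}}, State} =
        handle_call({return_vnode, ping}, From, State),
    {reply, {ok, {current, ping}}, State} =
        handle_call({new_style, ping}, From, State),
    %% A message with no matching clause raises function_clause, which
    %% is the kind of error reported in the crash log.
    {'EXIT', {function_clause, _}} =
        (catch handle_call(unknown_message, From, State)),
    ok.
```

Calling `legacy_call_sketch:demo()` returns `ok` when all three matches succeed. The point of the sketch is that dropping a `handle_call` clause is a wire-protocol change: older nodes that still send the old message will crash the receiving server unless the clause stays in place for the duration of the upgrade.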

@jtuple
Contributor
jtuple commented Feb 24, 2012

There is also a related error. During a map-reduce query against a mixed 1.1 and pre-1.1 cluster, the following errors can also appear:

2012-02-24 14:58:04.501 [error] <0.26014.0> gen_fsm <0.26014.0> in
state wait_for_input terminated with reason: processing_error

2012-02-24 14:58:04.502 [error] <0.26014.0> CRASH REPORT Process
<0.26014.0> with 0 neighbours crashed with reason:
{processing_error,[{gen_fsm,terminate,7},{proc_lib,init_p_do_apply,3}]}

2012-02-24 14:58:04.503 [error] <0.321.0> Supervisor
riak_pipe_vnode_worker_sup had child undefined started with
{riak_pipe_vnode_worker,start_link,undefined} at <0.26014.0> exit with
reason processing_error in context child_terminated

While processing_error is not very descriptive, the map-reduce job returns more detailed information in the query response:

{"phase":0,
  "error":
    "{badmatch, {'EXIT',noproc}}",
  "input": <snip>,
  "type": "error",
  "stack":
    "[{riak_core_vnode_proxy,call,2},
      {riak_pipe_vnode,queue_work_send,4},
      {riak_pipe_vnode,queue_work_erracc,6},
      {riak_kv_mrc_map,'-send_results/2-lc$^0/1-0-',3},
      {riak_kv_mrc_map,send_results,2},
      {riak_kv_mrc_map,process,3},
      {riak_pipe_vnode_worker,process_input,3},
      {riak_pipe_vnode_worker,wait_for_input,2}]"
}

This is another issue with the vnode proxy refactoring in Riak 1.1. This time, it is the 1.1 nodes sending requests to the older 1.0 or 0.14 nodes, which do not run vnode proxy processes.

Both issues need to be fixed to allow map-reduce to work during a rolling upgrade.
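
The second problem can be pictured as a routing decision made on the sending side. The Erlang sketch below is only an illustration under stated assumptions: the module, `route/2`, `proxy_name/1`, the `proxy_<Index>` naming, and the explicit list of legacy nodes are all hypothetical. How Riak actually detects older nodes and names its proxy processes is not shown in this issue.

```erlang
-module(legacy_route_sketch).
-export([route/2, demo/0]).

%% Choose a destination for a request aimed at {Index, Node}.
%% LegacyNodes is an assumed, caller-supplied list of nodes that do
%% not run vnode proxy processes.
route({Index, Node}, LegacyNodes) ->
    case lists:member(Node, LegacyNodes) of
        true ->
            %% Older node: address the registered vnode master directly.
            {via_master, {riak_pipe_vnode_master, Node}};
        false ->
            %% Newer node: address the per-index proxy process.
            {via_proxy, {proxy_name(Index), Node}}
    end.

%% Hypothetical naming scheme for a per-index proxy process.
proxy_name(Index) ->
    list_to_atom("proxy_" ++ integer_to_list(Index)).

demo() ->
    Legacy = ['riak@old'],
    {via_master, {riak_pipe_vnode_master, 'riak@old'}} =
        route({0, 'riak@old'}, Legacy),
    {via_proxy, {proxy_0, 'riak@new'}} =
        route({0, 'riak@new'}, Legacy),
    ok.
```

Sending to a proxy name that is not registered on the remote node is what produces a `noproc` exit like the one in the query response, so the sender has to pick the destination based on what the receiving node actually runs.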

@jtuple jtuple added a commit that referenced this issue Feb 24, 2012
@jtuple jtuple Fix map_reduce during rolling upgrade to 1.1.
Resolve issue #144.

Change riak_core_vnode_master to still handle return_vnode messages
given that pre-1.1 nodes may still send those messages to a 1.1 node.

Add legacy routing logic to riak_core_vnode_master:command_return_vnode
in order to properly send messages to pre-1.0 nodes that do not have
vnode proxy processes.
0a371a7
@jtuple
Contributor
jtuple commented Feb 25, 2012

Fixed by pull-request #145. Will be available in Riak 1.1.1.

@jtuple jtuple closed this Feb 25, 2012
@jtuple jtuple was assigned Feb 25, 2012
@jtuple jtuple was unassigned by ooshlablu Apr 25, 2015