inbound handoffs never cleanup #185

Closed
rzezeski opened this Issue Jun 4, 2012 · 2 comments


rzezeski commented Jun 4, 2012

There are cases where the handoff sender, or its connection, may
disappear without the receiver ever learning of it. In that case the
receiver will sit forever in {active, once} waiting on the sender.
This has been seen in production. If you get the handoff status for
all nodes you'll see only inbound handoffs.

rpc:multicall([node() | nodes()], riak_core_handoff_manager, status, []).

[{'riak1@one.foo.com',[{{undefined,undefined},undefined,inbound,active,[]},
                       {{undefined,undefined},undefined,inbound,active,[]}]},
 {'riak@two.foo.com',[{{undefined,undefined},undefined,inbound,active,[]},
                      {{undefined,undefined},undefined,inbound,active,[]}]},
 {'riak@three.foo.com',[{{undefined,undefined},undefined,inbound,active,[]},
                        {{undefined,undefined},undefined,inbound,active,[]}]},
...

This stalls handoff because each node has reached the concurrency
limit. Since these inbound handoffs will never be reaped, they block
forever. Calling the force-handoff API does not resume handoff
because it still respects the concurrency limit. To resume, you must
kill the stuck handoffs via the following snippet.

%% Forget any previous shell binding of Members.
f(Members).
Members = riak_core_ring:all_members(element(2, riak_core_ring_manager:get_raw_ring())).
%% Dropping the concurrency limit to 0 kills all active handoffs;
%% restoring the default of 2 lets handoff resume.
rp(rpc:multicall(Members, riak_core_handoff_manager, set_concurrency, [0])).
rp(rpc:multicall(Members, riak_core_handoff_manager, set_concurrency, [2])).

This is related to basho/riak_core#153 where the same situation was
fixed for the sender (by setting timeouts on recv) but the handoff
receiver was never fixed.

The easiest way to fix this is to add a timeout to the receiver so
that inactive connections are noticed and reaped. This timeout could
be set to some arbitrary limit like 5-10 minutes, since the receiver
gets a message per object: if a single object takes 10 minutes to
arrive, chances are something is wrong and it's better to restart.
Handoff will still stall, but only for a limited period of time, and
it won't require manual intervention to resume.
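A receiver-side idle timeout along those lines could look roughly like the sketch below. This is illustrative only, not the actual riak_core patch: the module name, state record, and the 10-minute figure are assumptions. It leans on the standard gen_server timeout mechanism, where returning a timeout as the last tuple element causes a `timeout` info message if no traffic arrives within that window.

```erlang
%% Sketch of an idle timeout for a handoff receiver process.
%% Module name, state record, and timeout value are illustrative
%% assumptions, not the riak_core implementation.
-module(handoff_receiver_sketch).
-behaviour(gen_server).

-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-define(RECV_TIMEOUT, timer:minutes(10)).

-record(state, {sock}).

init([Sock]) ->
    ok = inet:setopts(Sock, [{active, once}]),
    %% The trailing ?RECV_TIMEOUT arms the gen_server idle timer.
    {ok, #state{sock = Sock}, ?RECV_TIMEOUT}.

handle_info({tcp, Sock, _Data}, State = #state{sock = Sock}) ->
    %% Received an object: re-arm the socket and the idle timer.
    ok = inet:setopts(Sock, [{active, once}]),
    {noreply, State, ?RECV_TIMEOUT};
handle_info(timeout, State) ->
    %% Nothing from the sender within the window: assume it is gone
    %% and stop, freeing the concurrency slot.
    {stop, {error, recv_timeout}, State};
handle_info({tcp_closed, _Sock}, State) ->
    {stop, normal, State}.

handle_call(_Msg, _From, State) -> {reply, ok, State, ?RECV_TIMEOUT}.
handle_cast(_Msg, State)        -> {noreply, State, ?RECV_TIMEOUT}.
terminate(_Reason, _State)      -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```

Note that every callback return re-supplies the timeout; omitting it in any clause would silently disarm the idle timer until the next message.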


evanmcc commented Oct 5, 2012

Still working on a good way to test this, but I see it so often at customers that I wanted to try and get a patch in before the next minor.

jaredmorrow pushed a commit that referenced this issue Feb 14, 2013


evanmcc commented Aug 9, 2013

I haven't run into this one in a while, so hopefully the partial fix actually worked.

evanmcc closed this Aug 9, 2013
