inbound handoffs never cleanup #185

Closed
rzezeski opened this Issue · 2 comments

2 participants

@rzezeski

There are cases where the handoff sender, or its connection, may
disappear without the receiver knowing about it. In this case the
receiver will sit forever in {active, once} waiting for the sender.
This has been seen in production. If you get the handoff status for
all nodes you'll see all the stuck inbound handoffs.

rpc:multicall([node() | nodes()], riak_core_handoff_manager, status, []).

[{'riak1@one.foo.com',[{{undefined,undefined},undefined,inbound,active,[]},
                       {{undefined,undefined},undefined,inbound,active,[]}]},
 {'riak@two.foo.com',[{{undefined,undefined},undefined,inbound,active,[]},
                      {{undefined,undefined},undefined,inbound,active,[]}]},
 {'riak@three.foo.com',[{{undefined,undefined},undefined,inbound,active,[]},
                        {{undefined,undefined},undefined,inbound,active,[]}]},
...
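
To see just the stuck entries on each node, the status output can be
filtered on direction and state. A shell sketch, assuming the tuple
shape ({_, _, Direction, State, _}) shown in the listing above:

%% Sketch: one rpc:call per node so the node name stays paired with
%% its results; a down node would show up as {badrpc, _} and crash
%% the filter, so this is for interactive use only.
[begin
     Statuses = rpc:call(N, riak_core_handoff_manager, status, []),
     {N, [S || S = {_, _, inbound, active, _} <- Statuses]}
 end || N <- [node() | nodes()]].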

This stalls handoff because each node has reached the handoff
concurrency limit. Since these inbound transfers will never be
reaped, they block handoff forever. Calling the force-handoff API
(riak_core_vnode_manager:force_handoffs/0) will not resume handoff
because it still respects the concurrency limit. To resume, you must
kill the stuck handoffs via the following snippet.

%% Forget any previous shell binding of Members.
f(Members).
%% get_raw_ring() returns {ok, Ring}; element/2 extracts the ring.
Members = riak_core_ring:all_members(element(2, riak_core_ring_manager:get_raw_ring())).
%% Dropping the concurrency limit to 0 makes the handoff manager kill
%% every in-flight transfer, including the stuck inbound receivers.
rp(rpc:multicall(Members, riak_core_handoff_manager, set_concurrency, [0])).
%% Restore the default limit of 2 so handoff can proceed normally.
rp(rpc:multicall(Members, riak_core_handoff_manager, set_concurrency, [2])).

This is related to basho/riak_core#153, where the same situation was
fixed for the sender (by setting timeouts on recv), but the handoff
receiver was never addressed.
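
For context, the sender-side pattern from #153 is to bound every
socket read with an explicit timeout so a vanished peer surfaces as
an error instead of a hang. A minimal sketch of that pattern
(recv_ack and RecvTimeout are illustrative names, not the actual
patch):

%% Sketch only: gen_tcp:recv/3 with a timeout, so a dead peer shows
%% up as {error, timeout} instead of blocking the process forever.
recv_ack(Socket, RecvTimeout) ->
    case gen_tcp:recv(Socket, 0, RecvTimeout) of
        {ok, Data}       -> {ok, Data};
        {error, timeout} -> {error, recv_timeout};
        {error, Reason}  -> {error, Reason}
    end.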

The easiest way to fix this is to add a timeout to the receiver so
that inactive connections are noticed and reaped. This timeout could
be set to some arbitrary limit like 5-10 minutes, since the receiver
gets a message per object; the assumption is that if a single object
takes 10 minutes, chances are something is wrong and it's better to
restart. Handoff would still stall, but only for a limited period of
time, and it wouldn't require manual intervention to resume.
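
A sketch of what that receiver-side timeout could look like, using
the stock gen_server timeout mechanism; the module name and the
10-minute RECV_TIMEOUT are illustrative, not the actual patch (see
a544199 below):

-module(handoff_receiver_sketch).
-behaviour(gen_server).
-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% If no message (i.e. no handoff object) arrives for this long,
%% assume the sender is gone and reap the receiver.
-define(RECV_TIMEOUT, timer:minutes(10)).

start_link(Socket) ->
    gen_server:start_link(?MODULE, [Socket], []).

init([Socket]) ->
    %% The third tuple element arms the gen_server inactivity timeout.
    {ok, #{socket => Socket}, ?RECV_TIMEOUT}.

handle_info({tcp, _Sock, _Data}, State) ->
    %% Every received object rearms the timer, so only a genuinely
    %% idle connection gets reaped.
    {noreply, State, ?RECV_TIMEOUT};
handle_info({tcp_closed, _Sock}, State) ->
    {stop, normal, State};
handle_info(timeout, State) ->
    %% No traffic for RECV_TIMEOUT: stop, freeing a slot under the
    %% handoff concurrency limit.
    {stop, normal, State}.

handle_call(_Msg, _From, State) -> {reply, ok, State, ?RECV_TIMEOUT}.
handle_cast(_Msg, State) -> {noreply, State, ?RECV_TIMEOUT}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

One caveat of the plain gen_server timeout is that any message rearms
it, but for a receiver whose only traffic is handoff data that is
exactly the inactivity measure wanted here.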

@evanmcc referenced this issue from a commit: "potential fix for #185" (a544199)
@evanmcc

Still working on a good way to test this, but I see it so often at customers that I wanted to try and get a patch in before the next minor.

@evanmcc

I haven't run into this one in a while, so hopefully the partial fix actually worked.

@evanmcc closed this