Intermittent hang with handoff sender #153
On a 1.1.1 cluster we've seen multiple cases of handoff sender processes hanging when the TCP socket has gone away.
Polling riak_core_handoff_manager:status() on all the nodes in the cluster, we saw this, with no TCP connections alive:
Digging into the TCP port information:
It looks like the port only output 14 bytes, which would match being stuck at the gen_tcp:recv here: https://github.com/basho/riak_core/blob/1.1/src/riak_core_handoff_sender.erl#L75
If handoff seems stuck on a cluster, you can verify if the issue is present by running this from the Riak console (after
If the output shows more outbound than inbound connections, some are probably stuck. While we work on a patch for this issue, you can unstick the handoffs by disabling and re-enabling handoffs across the cluster (this example allows 2 handoffs per node).
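As a rough sketch of the check and the workaround described above, something like the following could be run from the Riak console. `riak_core_handoff_manager:status/0` is the call named in this issue. `set_concurrency/1` is an assumption about the handoff manager's API in this release, so confirm that it is exported on your version before relying on it.

```erlang
%% Sketch only: run from the Riak console on any one node in the cluster.
Nodes = [node() | nodes()].

%% Collect handoff status from every node; compare outbound vs. inbound.
rpc:multicall(Nodes, riak_core_handoff_manager, status, []).

%% ASSUMPTION: set_concurrency/1 exists in this release.
%% Disable handoffs everywhere...
rpc:multicall(Nodes, riak_core_handoff_manager, set_concurrency, [0]).

%% ...then re-enable them, allowing 2 handoffs per node.
rpc:multicall(Nodes, riak_core_handoff_manager, set_concurrency, [2]).
```

`rpc:multicall/4` returns `{Results, BadNodes}`, so a non-empty second element means some nodes could not be reached and may still have stuck handoffs.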
Any stuck handoff should resume within about a minute, and you should be able to verify with the status call.
Note: also fixes #152
Verified by manually testing as follows:
Lowered riak_core_handoff_timeout to 1000 and added a timer:sleep(2000) to handoff_receiver, first at the initial sync, then at the final sync.
Finally, to test the visit_item timeout, I reduced the ?ACK count to 5 in handoff_sender and added a call_count to the receiver state. Each time the appropriate sync call was processed I incremented that count, and a case block added a timer:sleep(2000) after the first sync call.
Can probably provide code as diffs if it helps.
What this is really guarding against is observed behavior where the socket has gone away at the kernel level but Erlang doesn't notice for some reason, so we sit receiving forever. Each time we've seen this (on repl and now in handoff) it has only been stuck in a recv call.
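To illustrate the general idea of bounding a blocking receive, here is a minimal, self-contained sketch. It is not the actual patch: the module name `recv_timeout_sketch`, the helper `recv_or_timeout/2`, and the 200 ms timeout are all hypothetical. It opens a loopback connection whose peer never sends anything, which stands in for a connection that has silently gone away, and shows `gen_tcp:recv/3` returning instead of hanging.

```erlang
-module(recv_timeout_sketch).
-export([demo/0, recv_or_timeout/2]).

%% Hypothetical helper: receive whatever is available on a passive
%% socket, but give up after TimeoutMs instead of blocking forever.
recv_or_timeout(Socket, TimeoutMs) ->
    case gen_tcp:recv(Socket, 0, TimeoutMs) of
        {ok, Data} ->
            {ok, Data};
        {error, timeout} ->
            %% The peer sent nothing in time; let the caller decide
            %% whether to retry or tear the connection down.
            {error, {recv_timeout, TimeoutMs}};
        {error, Reason} ->
            {error, Reason}
    end.

%% Set up a loopback connection whose server side never sends,
%% then show that the client receive returns after the timeout.
demo() ->
    Opts = [binary, {active, false}],
    {ok, LSock} = gen_tcp:listen(0, [{ip, {127,0,0,1}} | Opts]),
    {ok, Port} = inet:port(LSock),
    {ok, Client} = gen_tcp:connect({127,0,0,1}, Port, Opts),
    {ok, Server} = gen_tcp:accept(LSock),
    Result = recv_or_timeout(Client, 200),
    ok = gen_tcp:close(Client),
    ok = gen_tcp:close(Server),
    ok = gen_tcp:close(LSock),
    Result.
```

Calling `recv_timeout_sketch:demo()` returns `{error, {recv_timeout, 200}}`. With the two-argument `gen_tcp:recv/2` the timeout defaults to `infinity`, which is the behaviour that leaves a process stuck when the socket disappears without Erlang noticing.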