
Intermittent hang with handoff sender #153

Merged
merged 2 commits into master from gh153_rdb_hoff_timeout on Mar 22, 2012

Conversation

Contributor

russelldb commented Mar 16, 2012

On a 1.1.1 cluster we've seen multiple cases of handoff sender processes hanging when the TCP socket has gone away.

Polling riak_core_handoff_manager:status() on all the nodes in the cluster, we saw the following, with no TCP connections alive:

[{'riak@1.2.3.195',[]},
 {'riak@1.2.3.197',[]},
 {'riak@1.2.3.199',[{{riak_kv_vnode,713623846352979940529142984724747568191373312000},
                               'riak@1.2.3.197',outbound,active,[]}]},
 {'riak@1.2.3.201',[{{riak_kv_vnode,359666418561901890026688064301272774368452149248},
                               'riak@1.2.3.197',outbound,active,[]}]},
 {'riak@1.2.3.203',[{{riak_kv_vnode,536645132457440915277915524513010171279912730624},
                               'riak@1.2.3.197',outbound,active,[]}]},
 {'riak@1.2.3.205',[]},
 {'riak@1.2.3.207',[]},
 {'riak@1.2.3.209',[]},
 {'riak@1.2.3.211',[{{riak_kv_vnode,108470824645652950960429733678161630365088743424},
                               'riak@1.2.4.214',outbound,active,[]}]},
 {'riak@1.2.3.213',[]},
 {'riak@1.2.3.215',[]},
 {'riak@1.2.3.217',[{{riak_kv_vnode,171269723124715185726994316333939416365929594880},
                               'riak@1.2.4.254',outbound,active,[]}]},
 {'riak@1.2.3.219',[]},
 {'riak@1.2.4.208',[{{riak_kv_vnode,804967698686161372916873286769515256919869095936},
                              'riak@1.2.4.214',outbound,active,[]}]},
 {'riak@1.2.4.210',[{{riak_kv_vnode,753586781748746817198774991869333432010090217472},
                              'riak@1.2.4.208',outbound,active,[]}]},
 {'riak@1.2.4.212',[{{riak_kv_vnode,576608067853207791947547531657596035098629636096},
                              'riak@1.2.4.208',outbound,active,[]}]},
 {'riak@1.2.4.214',[]},
 {'riak@1.2.4.216',[]},
 {'riak@1.2.4.218',[]},
 {'riak@1.2.4.220',[]},
 {'riak@1.2.4.222',[]},
 {'riak@1.2.4.224',[]},
 {'riak@1.2.4.226',[]},
 {'riak@1.2.4.228',[]},
 {'riak@1.2.4.230',[]},
 {'riak@1.2.4.232',[]},
 {'riak@1.2.4.234',[{{riak_kv_vnode,639406966332270026714112114313373821099470487552},
                              'riak@1.2.3.199',outbound,active,[]}]},
 {'riak@1.2.4.236',[]},
 {'riak@1.2.4.238',[]},
 {'riak@1.2.4.240',[]},
 {'riak@1.2.4.242',[]},
 {'riak@1.2.4.244',[]},
 {'riak@1.2.4.246',[]},
 {'riak@1.2.4.248',[]},
 {'riak@1.2.4.250',[]},
 {'riak@1.2.4.252',[]},
 {'riak@1.2.4.254',[]}]
The backtrace of one of the stuck sender processes:

(riaksearch@10.28.60.208)2> io:format("~s\n", [element(2, erlang:process_info(pid(0,15806,280), backtrace))]).
Program counter: 0x00007f96f70c2bf8 (prim_inet:recv0/3 + 224)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00007f912bd180a0 Return addr 0x00007f96e933cdb0 (riak_core_handoff_sender:start_fold/5 + 1936)
y(0)     4394
y(1)     #Port<0.142646101>

0x00007f912bd180b8 Return addr 0x0000000000870018 (<terminate process normally>)
y(0)     []
y(1)     []
y(2)     []
y(3)     riak_kv_vnode_master
y(4)     #Port<0.142646101>
y(5)     gen_tcp
y(6)     <0.31027.245>
y(7)     804967698686161372916873286769515256919869095936
y(8)     riak_kv_vnode
y(9)     'riaksearch@10.28.60.214'
y(10)    Catch 0x00007f96e933e2e0 (riak_core_handoff_sender:start_fold/5 + 7360)

Digging into the TCP port information:

(riaksearch@10.28.60.208)10> erlang:port_info(Port).
[{name,"tcp_inet"},
 {links,[<0.15806.280>]},
 {id,142646101},
 {connected,<0.15806.280>},
 {input,0},
 {output,14}]

Looks like the port only ever output 14 bytes, which matches being stuck at the gen_tcp:recv here: https://github.com/basho/riak_core/blob/1.1/src/riak_core_handoff_sender.erl#L75. Those 14 bytes are exactly the size of the handshake header:

11> size(<<1:8,(atom_to_binary(riak_kv_vnode, utf8))/binary>>).
14
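To spell out the shape of that handshake, here is a minimal sketch (not the actual riak_core_handoff_sender code; the module and function names below are illustrative):

-module(handoff_handshake_sketch).
-export([handshake/1]).

%% Minimal sketch of the handshake shape implied by the trace above;
%% not the actual riak_core_handoff_sender code. The 14-byte header is
%% the 1-byte tag plus atom_to_binary(riak_kv_vnode, utf8), matching
%% the port's {output,14}.
handshake(Socket) ->
    Header = <<1:8, (atom_to_binary(riak_kv_vnode, utf8))/binary>>,
    ok = gen_tcp:send(Socket, Header),
    %% Pre-patch: recv with no timeout, so if the peer's socket died
    %% without the kernel noticing, this blocks forever ({input,0}).
    gen_tcp:recv(Socket, 0).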

@russelldb russelldb was assigned Mar 15, 2012

Contributor

jonmeredith commented Mar 15, 2012

If handoff seems stuck on a cluster, you can verify whether the issue is present by running this from the Riak console (after riak attach; press ^D to disconnect when done):

f(Members).
Members = riak_core_ring:all_members(element(2, riak_core_ring_manager:get_raw_ring())).
[{N, rpc:call(N, riak_core_handoff_manager, status, [])} || N <- Members].

If the output shows more outbound than inbound connections, then some are probably stuck. While we work on a patch for this issue, you can unstick the handoffs by disabling and re-enabling handoff across the cluster (this example re-enables with a concurrency of 2 handoffs per node):

f(Members).
Members = riak_core_ring:all_members(element(2, riak_core_ring_manager:get_raw_ring())).
rp(rpc:multicall(Members, riak_core_handoff_manager, set_concurrency, [0])). 
rp(rpc:multicall(Members, riak_core_handoff_manager, set_concurrency, [2])). 

Any stuck handoff should resume within about a minute, and you should be able to verify with the status call:

f(Members).
Members = riak_core_ring:all_members(element(2, riak_core_ring_manager:get_raw_ring())).
[{N, rpc:call(N, riak_core_handoff_manager, status, [])} || N <- Members].

Contributor

jonmeredith commented Mar 15, 2012

See also customer issues zd://1091 and zd://1081

@russelldb russelldb Add timeout to all handoff sender's receives
Don't bother sending the final 'sync' message if handoff failed
5247f79
Member

russelldb commented Mar 16, 2012

Note: also fixes #152

Verified by manually testing as follows:

Lowered riak_core_handoff_timeout to 1000 and added a timer:sleep(2000) to the handoff receiver: first at the initial sync, then at the final sync.

Finally, to test the visit_item timeout, I reduced the ?ACK count to 5 in handoff_sender, added a call_count to the receiver state that is incremented when the appropriate sync call is processed, and added a case block that inserts a timer:sleep(2000) after the first sync call.

I can probably provide the code as diffs if it helps.
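For example, the receiver-side delay looked roughly like this (a sketch only; the module name and function shape are assumptions, and call_count is the state field described above):

-module(handoff_receiver_delay_sketch).
-export([handle_sync/1]).

-record(state, {call_count = 0}).

%% Rough sketch of the test instrumentation described above, not the
%% actual diff: count sync calls in the receiver state and sleep on
%% the first one so the sender's new recv timeout fires.
handle_sync(State = #state{call_count = 0}) ->
    timer:sleep(2000),
    State#state{call_count = 1};
handle_sync(State = #state{call_count = N}) ->
    State#state{call_count = N + 1}.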

Contributor

rzezeski commented Mar 21, 2012

What about the send calls? Reading the gen_tcp docs, send also waits for an indefinite amount of time by default.

Contributor

jonmeredith commented Mar 21, 2012

The send calls only block if the TCP buffer is filled. Repl has a back pressure system that should prevent any permanent deadlocking, so I'm not too worried about it for this PR.
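For reference, if send ever did become a concern, it can be bounded with the standard {send_timeout, Ms} inet option (not part of this patch; the module name and the 60000 ms value below are illustrative):

-module(send_timeout_sketch).
-export([bounded_send/2]).

%% Not part of this patch: gen_tcp:send/2 can be bounded with the
%% standard {send_timeout, Ms} inet option, so it returns
%% {error, timeout} instead of blocking when the peer's buffer is full.
bounded_send(Socket, Data) ->
    ok = inet:setopts(Socket, [{send_timeout, 60000}]),
    case gen_tcp:send(Socket, Data) of
        ok               -> ok;
        {error, timeout} -> exit({shutdown, send_timeout});
        {error, Reason}  -> exit({shutdown, {send_error, Reason}})
    end.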

Contributor

jonmeredith commented Mar 21, 2012

The issue this is really guarding against is observed behavior where the socket has gone away at the kernel level but Erlang doesn't notice for some reason, so we sit in recv forever. Each time we've seen this (in repl and now in handoff) it has only been stuck in a recv call.
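Schematically, the fix bounds every recv along these lines (a sketch, not the merged code; the app env key and the 60000 ms fallback are assumptions for illustration):

-module(bounded_recv_sketch).
-export([recv_with_timeout/1]).

%% Sketch of the pattern the patch introduces: bound every recv with
%% the handoff timeout app env. The env key and 60000 ms fallback are
%% assumptions, not the shipped values.
recv_with_timeout(Socket) ->
    Timeout = case application:get_env(riak_core, handoff_timeout) of
                  {ok, TO} -> TO;
                  undefined -> 60000
              end,
    case gen_tcp:recv(Socket, 0, Timeout) of
        {ok, Data}       -> {ok, Data};
        {error, timeout} -> exit({shutdown, timeout});
        {error, Reason}  -> exit({shutdown, {recv_error, Reason}})
    end.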

Member

russelldb commented Mar 21, 2012

Repl?

That's a shame, 'cos I went ahead and made the changes… it's up to you whether we keep them or not.

@russelldb russelldb Correct env var name
Fix typo in timeout var name
cd33460

@rzezeski rzezeski was assigned Mar 21, 2012

Contributor

rzezeski commented Mar 21, 2012

+1 to merge

I've verified timeouts for the initial start, the finish, and during handoff. All three worked. I saw no process leaks. During testing I did find an issue, but it is orthogonal to this change: #154.

@russelldb russelldb added a commit that referenced this pull request Mar 22, 2012

@russelldb russelldb Merge pull request #153 from basho/gh153_rdb_hoff_timeout
Intermittent hang with handoff sender
6617c2a

@russelldb russelldb merged commit 6617c2a into master Mar 22, 2012

@russelldb russelldb was assigned Mar 22, 2012

Contributor

jonmeredith commented Apr 6, 2012

Also applied the commits against the 1.1 branch.
