
Intermittent hang with handoff sender #153

Merged
merged 2 commits into master from gh153_rdb_hoff_timeout on Mar 22, 2012

@russelldb
Member

commented Mar 16, 2012

On a 1.1.1 cluster we've seen multiple cases of handoff sender processes hanging when the TCP socket has gone away.

Polling riak_core_handoff_manager:status() on all the nodes in the cluster we saw the following, with no corresponding TCP connections alive:

[{'riak@1.2.3.195',[]},
 {'riak@1.2.3.197',[]},
 {'riak@1.2.3.199',[{{riak_kv_vnode,713623846352979940529142984724747568191373312000},
                               'riak@1.2.3.197',outbound,active,[]}]},
 {'riak@1.2.3.201',[{{riak_kv_vnode,359666418561901890026688064301272774368452149248},
                               'riak@1.2.3.197',outbound,active,[]}]},
 {'riak@1.2.3.203',[{{riak_kv_vnode,536645132457440915277915524513010171279912730624},
                               'riak@1.2.3.197',outbound,active,[]}]},
 {'riak@1.2.3.205',[]},
 {'riak@1.2.3.207',[]},
 {'riak@1.2.3.209',[]},
 {'riak@1.2.3.211',[{{riak_kv_vnode,108470824645652950960429733678161630365088743424},
                               'riak@1.2.4.214',outbound,active,[]}]},
 {'riak@1.2.3.213',[]},
 {'riak@1.2.3.215',[]},
 {'riak@1.2.3.217',[{{riak_kv_vnode,171269723124715185726994316333939416365929594880},
                               'riak@1.2.4.254',outbound,active,[]}]},
 {'riak@1.2.3.219',[]},
 {'riak@1.2.4.208',[{{riak_kv_vnode,804967698686161372916873286769515256919869095936},
                              'riak@1.2.4.214',outbound,active,[]}]},
 {'riak@1.2.4.210',[{{riak_kv_vnode,753586781748746817198774991869333432010090217472},
                              'riak@1.2.4.208',outbound,active,[]}]},
 {'riak@1.2.4.212',[{{riak_kv_vnode,576608067853207791947547531657596035098629636096},
                              'riak@1.2.4.208',outbound,active,[]}]},
 {'riak@1.2.4.214',[]},
 {'riak@1.2.4.216',[]},
 {'riak@1.2.4.218',[]},
 {'riak@1.2.4.220',[]},
 {'riak@1.2.4.222',[]},
 {'riak@1.2.4.224',[]},
 {'riak@1.2.4.226',[]},
 {'riak@1.2.4.228',[]},
 {'riak@1.2.4.230',[]},
 {'riak@1.2.4.232',[]},
 {'riak@1.2.4.234',[{{riak_kv_vnode,639406966332270026714112114313373821099470487552},
                              'riak@1.2.3.199',outbound,active,[]}]},
 {'riak@1.2.4.236',[]},
 {'riak@1.2.4.238',[]},
 {'riak@1.2.4.240',[]},
 {'riak@1.2.4.242',[]},
 {'riak@1.2.4.244',[]},
 {'riak@1.2.4.246',[]},
 {'riak@1.2.4.248',[]},
 {'riak@1.2.4.250',[]},
 {'riak@1.2.4.252',[]},
 {'riak@1.2.4.254',[]}]
Looking at the backtrace of one of the stuck sender processes:

(riaksearch@10.28.60.208)2> io:format("~s\n", [element(2, erlang:process_info(pid(0,15806,280), backtrace))]).
Program counter: 0x00007f96f70c2bf8 (prim_inet:recv0/3 + 224)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00007f912bd180a0 Return addr 0x00007f96e933cdb0 (riak_core_handoff_sender:start_fold/5 + 1936)
y(0)     4394
y(1)     #Port<0.142646101>

0x00007f912bd180b8 Return addr 0x0000000000870018 (<terminate process normally>)
y(0)     []
y(1)     []
y(2)     []
y(3)     riak_kv_vnode_master
y(4)     #Port<0.142646101>
y(5)     gen_tcp
y(6)     <0.31027.245>
y(7)     804967698686161372916873286769515256919869095936
y(8)     riak_kv_vnode
y(9)     'riaksearch@10.28.60.214'
y(10)    Catch 0x00007f96e933e2e0 (riak_core_handoff_sender:start_fold/5 + 7360)

Digging into the TCP port information:

(riaksearch@10.28.60.208)10> erlang:port_info(Port).
[{name,"tcp_inet"},
 {links,[<0.15806.280>]},
 {id,142646101},
 {connected,<0.15806.280>},
 {input,0},
 {output,14}]

It looks like it only output 14 bytes, which matches being stuck at the gen_tcp:recv here: https://github.com/basho/riak_core/blob/1.1/src/riak_core_handoff_sender.erl#L75

11> size(<<1:8,(atom_to_binary(riak_kv_vnode, utf8))/binary>>).
14
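
For context, that recv sits right after the sender writes its handshake header. A minimal sketch of the pattern, assuming a passive-mode socket (the function name is hypothetical; see the linked source for the real code):

handshake_sketch(Socket, Module) ->
    ModBin = atom_to_binary(Module, utf8),
    %% one-byte message code plus the module name: 14 bytes for riak_kv_vnode
    ok = gen_tcp:send(Socket, <<1:8, ModBin/binary>>),
    %% without a timeout argument this recv waits forever for the reply
    gen_tcp:recv(Socket, 0).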

@ghost assigned russelldb Mar 15, 2012

@jonmeredith

Contributor Author

commented Mar 15, 2012

If handoff seems stuck on a cluster, you can verify whether the issue is present by running this from the Riak console (attach with riak attach; press ^D to disconnect when done):

f(Members).
Members = riak_core_ring:all_members(element(2, riak_core_ring_manager:get_raw_ring())).
[{N, rpc:call(N, riak_core_handoff_manager, status, [])} || N <- Members].

If the output shows more outbound than inbound connections, then some are probably stuck. While we work on a patch for this issue, you can unstick the handoffs by disabling and re-enabling handoff across the cluster (this example restores a concurrency of 2 handoffs per node):

f(Members).
Members = riak_core_ring:all_members(element(2, riak_core_ring_manager:get_raw_ring())).
rp(rpc:multicall(Members, riak_core_handoff_manager, set_concurrency, [0])). 
rp(rpc:multicall(Members, riak_core_handoff_manager, set_concurrency, [2])). 

Any stuck handoff should resume within about a minute, and you can verify with the status call:

f(Members).
Members = riak_core_ring:all_members(element(2, riak_core_ring_manager:get_raw_ring())).
[{N, rpc:call(N, riak_core_handoff_manager, status, [])} || N <- Members].
@jonmeredith

Contributor Author

commented Mar 15, 2012

See also customer issues zd://1091 and zd://1081

Add timeout to all handoff sender's receives
Don't bother sending the final 'sync' message if handoff failed
@russelldb

Member

commented Mar 16, 2012

Note: also fixes #152

Verified by manual testing as follows:

Lowered riak_core_handoff_timeout to 1000 and added a 2000 ms timer:sleep() to the handoff receiver, first at the initial sync, then at the final sync.

Finally, to test the visit_item timeout, I reduced the ?ACK count to 5 in the handoff sender and added a call_count to the receiver state; when the appropriate sync call was processed I incremented that count, and added a case block that inserts a timer:sleep(2000) after the first sync call.

I can probably provide the code as diffs if it helps.
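
For illustration only (the actual test diffs were not posted), the receiver-side change described above would look roughly like this, assuming a call_count field in the receiver state record (the function name and record are hypothetical):

-record(state, {call_count = 0 :: non_neg_integer()}).

handle_sync(State = #state{call_count = Count}) ->
    case Count of
        0 -> timer:sleep(2000);   %% exceed the lowered 1000 ms handoff timeout
        _ -> ok
    end,
    State#state{call_count = Count + 1}.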

@rzezeski

Contributor

commented Mar 21, 2012

What about the send calls? Reading the gen_tcp docs, send also waits for an indefinite amount of time by default.

@jonmeredith

Contributor Author

commented Mar 21, 2012

The send calls only block if the TCP buffer is filled. Repl has a back pressure system that should prevent any permanent deadlocking, so I'm not too worried about it for this PR.
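
For reference, if bounding the sends ever did become necessary, the standard mechanism is the send_timeout socket option; a minimal sketch (illustrative only, not part of this PR; the function name and other options are placeholders):

connect_with_send_timeout(Host, Port) ->
    %% with send_timeout set, a gen_tcp:send/2 that blocks on a full TCP
    %% buffer returns {error, timeout} instead of hanging indefinitely
    gen_tcp:connect(Host, Port, [binary, {active, false},
                                 {send_timeout, 60000}]).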

@jonmeredith

Contributor Author

commented Mar 21, 2012

The issue this is really guarding against is observed behavior where the socket has gone away at the kernel level but Erlang doesn't notice for some reason, so we sit in the receive forever. Each time we've seen this (in repl and now in handoff) it has only been stuck in a recv call.
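
The shape of the guard this PR adds is to bound every blocking receive; a minimal sketch (the function name and exit reason are illustrative, not the exact PR code):

recv_or_fail(Socket, TimeoutMs) ->
    case gen_tcp:recv(Socket, 0, TimeoutMs) of
        {ok, Data}      -> Data;
        {error, Reason} -> exit({shutdown, {handoff_recv_failed, Reason}})
    end.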

@russelldb

Member

commented Mar 21, 2012

Repl?

That's a shame, because I went ahead and made the changes… it's up to you whether we keep them or not.

Correct env var name
Fix typo in timeout var name

@ghost assigned rzezeski Mar 21, 2012

@rzezeski

Contributor

commented Mar 21, 2012

+1 to merge

I've verified timeouts for the initial start, the finish, and during handoff. All three worked. I saw no process leaks. During testing I did find an issue, but it is orthogonal to this change: #154.

russelldb added a commit that referenced this pull request Mar 22, 2012

Merge pull request #153 from basho/gh153_rdb_hoff_timeout
Intermittent hang with handoff sender

@russelldb merged commit 6617c2a into master Mar 22, 2012

@ghost assigned russelldb Mar 22, 2012

@jonmeredith

Contributor Author

commented Apr 6, 2012

Also applied the commits against the 1.1 branch.
