Support keepalive in PB connections [JIRA: RIAK-1666] #88

jonasrichard · 2015-04-02T13:37:49Z

Sometimes a network split can happen between client applications and Riak. In that case, when TCP connections are not closed properly, dead connections remain on the Riak side. As far as I know, if a packet is sent on such a dead connection, the sender won't get any ACK, so the TCP connection will eventually be closed. But on the Riak side the sockets are "server sockets", so they don't send anything until they receive a request.

I am wondering whether it is worth adding keepalive here, or at least reading keepalive from the config keys.

riak_api/src/riak_api_pb_listener.erl

Line 49 in 94a9485

    
           [binary, {packet, raw}, {reuseaddr, true}, {backlog, BackLog}, {nodelay, NoDelay}].

Could this have any drawbacks? Would it solve that situation, i.e. close the zombie connections?

Basho-JIRA · 2015-04-08T22:29:15Z

Sean - let me know if this is a feature request. If so, I can get it to Products.

_[posted via JIRA by Derek Somogyi]_

jonasrichard · 2015-04-09T07:41:48Z

Yes, it is a feature request (although I am not the one who was asked). Sometimes there is a load balancer between the clients and the Riak nodes. When the load balancer kills connections, it doesn't always close them properly, leaving half-dead connections. It is not a Riak bug, but since it is related, we (the support team) usually get it reported as a Riak issue. Keepalive can prevent those half-dead connections.

binarytemple · 2015-04-09T11:41:52Z

@seancribbs @DSomogyi This is a feature request, please contact me directly for more info.

kesslerm · 2015-04-09T14:37:53Z

See also http://lists.basho.com/pipermail/riak-users_lists.basho.com/2015-April/017044.html for the same issue raised on the public riak-users mailing list by a different user

kesslerm · 2015-04-13T08:55:27Z

I've been able to reproduce this easily on a local wireless network. Establish a Riak session between two computers, one running the Python client (non-Linux) and the other a devrel cluster (Mac OSX). After cycling the wifi connection on the client side, both the client and the server still report the connections as ESTABLISHED, but no further data is transmitted. The Python client will eventually time out and establish a new connection. The server side will hang on to the old connection indefinitely.

Without keepalive the pb_listener hangs on to established connections in case of a network partition. This can lead to available sockets being exhausted on servers with a high number of concurrent connections. Fixes basho#88

kesslerm · 2015-04-17T12:46:07Z

I've tested that adding 'keepalive' to the tcp options will allow orphan connections to be reclaimed after the keepalive timeout has expired. I believe it is even safe to default to true here, but have submitted a pull request that allows this setting to be switched off or on.
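For readers who want to see the socket option in isolation, here is a minimal sketch in Python rather than Erlang. It is not the riak_api change itself: `make_listener` and `keepalive_enabled` are hypothetical helper names, and the backlog and address values are placeholders. In `gen_tcp` terms the equivalent is adding `{keepalive, true}` to the listen options quoted in the first comment.

```python
import socket


def make_listener(host="127.0.0.1", port=0, backlog=128, keepalive=True):
    """Create a listening TCP socket, optionally with SO_KEEPALIVE enabled.

    Hypothetical helper for illustration; the `keepalive` flag plays the
    role of the on/off configuration switch described above.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if keepalive:
        # Ask the kernel to probe idle connections so dead peers are noticed.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.bind((host, port))  # port 0 lets the OS pick a free port
    sock.listen(backlog)
    return sock


def keepalive_enabled(sock):
    """Return True if SO_KEEPALIVE is set on the given socket."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    listener = make_listener()
    print("keepalive enabled:", keepalive_enabled(listener))
    listener.close()
```

The sketch only switches keepalive on or off; when the probes are actually sent depends on the timer settings discussed next.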

The keepalive timeout values are configured at the OS level.

Mac OSX defaults (times in milliseconds):

net.inet.tcp.keepidle: 7200000
net.inet.tcp.keepintvl: 75000
net.inet.tcp.keepinit: 75000
net.inet.tcp.keepcnt: 8

Linux defaults (times in seconds):

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
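
To put these defaults in perspective, the following rough sketch estimates how long an idle, dead connection can linger before the kernel gives up on it. It assumes the usual approximation of idle time plus probe count times probe interval; `detection_ms` is a hypothetical helper, and real timing can differ slightly between kernels. `net.inet.tcp.keepinit` is left out because it concerns connection establishment rather than idle probing.

```python
def detection_ms(idle_ms, interval_ms, probes):
    """Approximate time until a dead idle connection is dropped.

    Assumed model: wait `idle_ms`, then send `probes` unanswered
    keepalive probes spaced `interval_ms` apart.
    """
    return idle_ms + interval_ms * probes


# Linux defaults are given in seconds, so convert to milliseconds first.
linux_ms = detection_ms(7200 * 1000, 75 * 1000, 9)

# Mac OSX defaults are already in milliseconds.
macosx_ms = detection_ms(7200000, 75000, 8)

if __name__ == "__main__":
    print("Linux:   %.2f minutes" % (linux_ms / 60000))   # 131.25 minutes
    print("Mac OSX: %.2f minutes" % (macosx_ms / 60000))  # 130.00 minutes
```

With the stock settings, an orphaned connection is therefore reclaimed only after a bit more than two hours, so deployments that exhaust sockets quickly may want to lower these OS values.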

Basho-JIRA · 2015-04-17T19:14:46Z

This fix needs to be backported to 1.4 for a customer. The main difference will be that there is no cuttlefish schema to update. I'm ok with Magnus doing the port, if he has the cycles.

_[posted via JIRA by Sean Cribbs]_

Without keepalive the pb_listener hangs on to established connections in case of a network partition. This can lead to available sockets being exhausted on servers with a high number of concurrent connections. Backport of basho#89 Fixes basho#88 for 1.4.x

kesslerm · 2015-04-20T13:43:31Z

Backported against the 1.4 branch. Please merge if appropriate.

Basho-JIRA · 2015-04-28T12:56:09Z

I see this ticket has been marked as closed. What was the outcome? Is it going to be in 2.x? @kesslerm I didn't see any URL for the P/R.

_[posted via JIRA by Bryan Hunt]_

Basho-JIRA · 2015-04-28T13:06:12Z

AFAICT this is going to be in 2.1.1+. The pull request was against the development branch of riak_api (3f09915), and it has been integrated into the 2.1 branch through basho/riak@d50fefc.

I don't see any signs yet of it entering a future 2.0.x release, though.

_[posted via JIRA by Magnus Kessler]_

seancribbs · 2015-04-28T14:45:18Z

I believe a manual patch was going to be delivered to the customer needing it on the 1.4 series.

binarytemple · 2015-04-28T14:51:09Z

Thank you both for the info. B

Basho-JIRA · 2015-06-01T09:14:00Z

I don't see the patch pulled into the '2.0' branch of riak_api, yet. Is 2.0.6 released from the '2.0' branch or the 'develop' branch?

_[posted via JIRA by Magnus Kessler]_

jonmeredith · 2015-06-01T13:36:42Z

The branches are a little in flux at the moment in a partial transition to
semver. We'll make sure it lands in 2.0.6 and 2.1.2.

Without keepalive the pb_listener hangs on to established connections in case of a network partition. This can lead to available sockets being exhausted on servers with a high number of concurrent connections. Fixes #88 (cherry picked from commit 8d7ede7)

Basho-JIRA · 2015-07-09T20:42:18Z

JIRA references GH PR 88, which is for 1.4. 89 appears to be for 2.0.6. Riak-1737 is "Support keepalive in PB connections".

_[posted via JIRA by Patricia Brewer]_

Basho-JIRA · 2015-08-27T22:39:25Z

The PR for 2.1.2, which was merged, is #89.

_[posted via JIRA by Patricia Brewer]_

Basho-JIRA changed the title ~~Support keepalive in PB connections~~ Support keepalive in PB connections [JIRA: RIAK-1666] Apr 2, 2015

Basho-JIRA added the JIRA: To Do label Apr 2, 2015

Basho-JIRA assigned seancribbs Apr 8, 2015

kesslerm mentioned this issue Apr 17, 2015

[pb_listener] Add TCP keepalive feature #89

Merged

borshop closed this as completed in #89 Apr 17, 2015

Basho-JIRA added JIRA: Needs Review and removed JIRA: To Do labels Apr 17, 2015

kesslerm mentioned this issue Apr 20, 2015

[pb_listener 1.4] enable TCP keepalive #90

Merged

Basho-JIRA assigned tburghart and unassigned seancribbs Apr 20, 2015

Basho-JIRA added JIRA: Closed and removed JIRA: Needs Review labels Apr 21, 2015

Basho-JIRA added JIRA: To Do and removed JIRA: Closed JIRA: To Do labels Apr 28, 2015

Basho-JIRA added JIRA: In Progress JIRA: Done JIRA: Closed and removed JIRA: In Progress labels Apr 28, 2015

lithp mentioned this issue Jan 18, 2017

Citus does not set a read/write timeout when communicating with remote nodes citusdata/citus#1135

Open
