Support keepalive in PB connections [JIRA: RIAK-1666] #88

Closed
jonasrichard opened this issue Apr 2, 2015 · 16 comments · Fixed by #89

Comments

@jonasrichard

Sometimes a network split can happen between client applications and Riak. If the TCP connections are not closed properly in that case, dead connections remain on the Riak side. As far as I know, if a packet is sent on such a dead connection, the sender gets no ACK, so the connection is eventually closed. But the sockets on the Riak side are "server sockets", so they don't send anything until they receive a request.

I am wondering whether it is worth adding keepalive here, or at least reading keepalive from the config keys.

[binary, {packet, raw}, {reuseaddr, true}, {backlog, BackLog}, {nodelay, NoDelay}].

Can it have any drawbacks? Can it solve that situation (closing zombie connections)?

@Basho-JIRA Basho-JIRA changed the title Support keepalive in PB connections Support keepalive in PB connections [JIRA: RIAK-1666] Apr 2, 2015
@Basho-JIRA

Sean - let me know if this is a feature request. If so, I can get it to Products.

_[posted via JIRA by Derek Somogyi]_

@jonasrichard
Author

Yes, it is a feature request (although I am not the one who was asked). Sometimes there is a load balancer between the clients and the Riak nodes. When the load balancer kills connections, it doesn't always close them properly, leaving half-dead connections. It is not a Riak bug, but since it is related, we usually get it as a Riak issue (support team). Keepalive can prevent those half-dead connections.

@binarytemple

@seancribbs @DSomogyi This is a feature request, please contact me directly for more info.

@kesslerm

kesslerm commented Apr 9, 2015

See also http://lists.basho.com/pipermail/riak-users_lists.basho.com/2015-April/017044.html for the same issue raised on the public riak-users mailing list by a different user

@kesslerm

I've been able to reproduce this easily on a local wireless network. Establish a Riak session between two computers, one running the Python client (non-Linux) and the other a devrel cluster (Mac OSX). Then, after cycling the wifi connection on the client side, both the client and the server report the connections as still being ESTABLISHED, but no further data is transmitted. The Python client will eventually time out and establish a new connection. The server side will hang on to the old connection indefinitely.

kesslerm pushed a commit to kesslerm/riak_api that referenced this issue Apr 17, 2015
Without keepalive the pb_listener hangs on to established
connections in case of a network partition. This can lead
to available sockets being exhausted on servers with a high
number of concurrent connections.

Fixes basho#88
@kesslerm

I've tested that adding 'keepalive' to the tcp options will allow orphan connections to be reclaimed after the keepalive timeout has expired. I believe it is even safe to default to true here, but have submitted a pull request that allows this setting to be switched off or on.
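For illustration only, here is a minimal Python sketch of the same idea at the plain socket level. It is not the riak_api change itself, which is Erlang. The function names and the `nodelay` / `keepalive` switches are hypothetical stand-ins for whatever the configuration exposes.

```python
import socket


def apply_listener_options(sock, nodelay=True, keepalive=True):
    """Set the TCP options discussed in this thread on an existing socket.

    `nodelay` and `keepalive` are hypothetical config switches used for
    illustration; they are not names taken from riak_api.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if nodelay else 0)
    # With SO_KEEPALIVE on, the kernel probes an idle peer and eventually
    # drops the connection if the probes go unanswered.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1 if keepalive else 0)
    return sock


def keepalive_enabled(sock):
    """Report whether SO_KEEPALIVE is currently set on the socket."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
```

Keeping the switch as a parameter mirrors the pull request's approach of letting the setting be turned off or on rather than hard-coding it.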

The keepalive timeout values are configured at the OS level.

Mac OSX defaults (times in milliseconds):

net.inet.tcp.keepidle: 7200000
net.inet.tcp.keepintvl: 75000
net.inet.tcp.keepinit: 75000
net.inet.tcp.keepcnt: 8

Linux defaults (times in seconds):

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
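As a rough worked example of what these defaults mean in practice, the time before an orphaned connection is reclaimed is approximately the idle time plus one probe interval per unanswered probe. The helper name below is hypothetical, and the formula is an approximation of kernel behaviour, not an exact guarantee. The Mac OSX values are converted from milliseconds to seconds.

```python
def keepalive_reclaim_seconds(idle_s, interval_s, probes):
    """Approximate seconds until the kernel gives up on a silent peer:
    the idle period before the first probe, plus one interval per probe."""
    return idle_s + interval_s * probes


# Linux defaults quoted above (already in seconds).
linux_s = keepalive_reclaim_seconds(7200, 75, 9)

# Mac OSX defaults quoted above (milliseconds, converted to seconds).
macos_s = keepalive_reclaim_seconds(7200000 // 1000, 75000 // 1000, 8)

print(linux_s, macos_s)  # 7875 7800
```

With the stock settings, a dead connection therefore lingers for a little over two hours (about 2 h 11 min on Linux, 2 h 10 min on Mac OSX) before it is dropped, so busy servers may want lower values at the OS level.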

@Basho-JIRA

This fix needs to be backported to 1.4 for a customer. The main difference will be that there is no cuttlefish schema to update. I'm ok with Magnus doing the port, if he has the cycles.

_[posted via JIRA by Sean Cribbs]_

kesslerm pushed a commit to kesslerm/riak_api that referenced this issue Apr 20, 2015
Without keepalive the pb_listener hangs on to established
connections in case of a network partition. This can lead
to available sockets being exhausted on servers with a high
number of concurrent connections.

Backport of basho#89
Fixes basho#88 for 1.4.x
@kesslerm

Backported against the 1.4 branch. Please merge if appropriate.

@Basho-JIRA

I see this ticket has been marked as closed. What was the outcome? Is it going to be in 2.x? @kesslerm, I didn't see any URL for the PR.

_[posted via JIRA by Bryan Hunt]_

@Basho-JIRA

AFAICT this is going to be in 2.1.1+. The pull request was against the development branch of riak_api (3f09915), and it has been integrated into the 2.1 branch through basho/riak@d50fefc.

I don't see any signs yet of it entering a future 2.0.x release, though.

_[posted via JIRA by Magnus Kessler]_

@seancribbs
Contributor

I believe a manual patch was going to be delivered to the customer needing it on the 1.4 series.

@binarytemple

Thank you both for the info. B

@Basho-JIRA

I don't see the patch pulled into the '2.0' branch of riak_api, yet. Is 2.0.6 released from the '2.0' branch or the 'develop' branch?

_[posted via JIRA by Magnus Kessler]_

@jonmeredith
Contributor

The branches are a little in flux at the moment in a partial transition to
semver. We'll make sure it lands in 2.0.6 and 2.1.2.


kesslerm pushed a commit that referenced this issue Jun 10, 2015
Without keepalive the pb_listener hangs on to established
connections in case of a network partition. This can lead
to available sockets being exhausted on servers with a high
number of concurrent connections.

Fixes #88

(cherry picked from commit 8d7ede7)
@Basho-JIRA

JIRA references GH PR 88, which is for 1.4. 89 appears to be for 2.0.6. Riak-1737 is "Support keepalive in PB connections".

_[posted via JIRA by Patricia Brewer]_

@Basho-JIRA

The PR for 2.1.2, which was merged: #89

_[posted via JIRA by Patricia Brewer]_


7 participants