Investigate excessive socket connections #742

tleyden · 2015-03-16T23:16:45Z

As reported in https://issues.couchbase.com/browse/CBSE-1722

tleyden · 2015-03-17T22:03:00Z

Also might be relevant: https://forums.couchbase.com/t/android-1-0-3-sync-gateway-opens-new-persistent-connection-when-wi-fi-turned-off-and-on/3186

tleyden · 2015-03-17T22:12:25Z

Copy/paste of a Skype conversation:

@snej

If the client closes the socket, it should show up as closed to the gateway too, shouldn't it?
The client would like a longpoll request to stay open indefinitely, and normally it will, but the client will retry if the socket gets closed before the response arrives. AWS servers are infamous for closing these sockets after a brief interval.

@adamcfraser:

I’m trying to understand what the keep-alive setting in nginx is doing in the situation described here: whether the client isn’t closing the socket (when it loses its wifi), or whether it’s getting closed but nginx isn’t routing that through to SG. I don’t know much about how socket management works for clients dropping off the network, honestly.

@snej:

Ah, if the client just loses connectivity then TCP won't notice anything happened until the next time the server tries to send data. That's part of what the "heartbeat" parameter on the changes feed is for. It makes the server send a CRLF periodically.

@adamcfraser:

ah. so the expected result is that those connections would drop when the heartbeat happens

@snej:

By default TCP waits a ridiculously long time (90 min?) before sending an ACK out over an idle socket to see if it's still alive. It'd be something on the order of 15 sec after the heartbeat — the server has to wait for a response to the packet it sent.

tleyden · 2015-03-17T22:16:51Z

Using the following steps:

Start fresh sync gateway on AWS
Run TodoLite-IOS pointed to above sync gw on device
Put device into airplane mode
Force-kill app

I'm able to create ESTABLISHED connections that should have been cleaned up by the heartbeat according to the above comment.

These have all been hanging around for > 30 mins based on repro steps above:

sync_gate 7405 root    9u  IPv6 2944294      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:10980 (ESTABLISHED)
sync_gate 7405 root   10u  IPv6 2971342      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:21809 (ESTABLISHED)
sync_gate 7405 root   11u  IPv6 2944366      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:24292 (ESTABLISHED)
sync_gate 7405 root   12u  IPv6 2944368      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:59175 (ESTABLISHED)
sync_gate 7405 root   13u  IPv6 2954006      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:6864 (ESTABLISHED)
sync_gate 7405 root   14u  IPv6 2959275      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:20755 (ESTABLISHED)
sync_gate 7405 root   15u  IPv6 2962847      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:39919 (ESTABLISHED)
sync_gate 7405 root   16u  IPv6 2949350      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:32259 (ESTABLISHED)
sync_gate 7405 root   17u  IPv6 2949352      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:41819 (ESTABLISHED)
sync_gate 7405 root   18u  IPv6 2954008      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:37212 (ESTABLISHED)
sync_gate 7405 root   19u  IPv6 2962849      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:19406 (ESTABLISHED)
sync_gate 7405 root   20u  IPv6 2995912      0t0     TCP ip-10-182-209-208.ec2.internal:4984->173-228-114-82.dedicated.static.sonic.net:31135 (ESTABLISHED)
sync_gate 7405 root   25u  IPv6 2971344      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:45013 (ESTABLISHED)
sync_gate 7405 root   26u  IPv6 2975626      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:15094 (ESTABLISHED)

tleyden · 2015-03-18T01:22:35Z

Note to self. Possibly relevant:

tleyden · 2015-03-18T20:27:41Z

I noticed that:

when using curl to hit a test endpoint, the connection would disappear from the lsof output immediately after the curl request finished.
when using the android httpclient code snippet, the connection would hang around as ESTABLISHED in the lsof output. I'm not sure how long, but I did notice that if I did a second test, there would only be one ESTABLISHED connection, rather than two. (so there was some re-use going on, or at least the previous one was being destroyed)

This was puzzling, and so I decided to get to the bottom of it.

Here is the network capture when using curl:

and here is the network capture when using android's httpclient:

As it turns out, HTTP/1.1 clients use persistent connections by default. This can be turned off by adding:

...
request.setHeader("Connection", "close");

This has the effect of forcing the server to close the connection after sending a response.

After adding the above, it behaved the same as the curl client. Btw I'm not suggesting that anybody do this, because then you will lose all of the benefits of persistent connections. I'm just writing this down as a piece of the puzzle in terms of figuring out why the connections are lingering.
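For anyone who wants to reproduce the persistent-connection behavior without a device, here is a rough Go sketch (not the Android code above). It assumes a local `httptest` server, and `countConnections` is a hypothetical helper. Setting `req.Close = true` is Go's way of sending `Connection: close`; the server-side `ConnState` hook counts how many TCP connections were actually accepted.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// countConnections (hypothetical helper) issues n sequential GET requests
// against a local test server and returns how many TCP connections the
// server accepted. When closeEach is true, every request carries
// "Connection: close".
func countConnections(n int, closeEach bool) (int64, error) {
	var accepted int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))
	// Count each newly accepted connection on the server side.
	srv.Config.ConnState = func(c net.Conn, state http.ConnState) {
		if state == http.StateNew {
			atomic.AddInt64(&accepted, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	client := &http.Client{Transport: &http.Transport{}}
	defer client.CloseIdleConnections()
	for i := 0; i < n; i++ {
		req, err := http.NewRequest("GET", srv.URL, nil)
		if err != nil {
			return 0, err
		}
		req.Close = closeEach // sends "Connection: close" when true
		resp, err := client.Do(req)
		if err != nil {
			return 0, err
		}
		// Drain and close the body so a persistent connection can be reused.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return atomic.LoadInt64(&accepted), nil
}

func main() {
	persistent, _ := countConnections(3, false)
	closing, _ := countConnections(3, true)
	fmt.Println(persistent) // 1: one connection reused for all three requests
	fmt.Println(closing)    // 3: a fresh connection per request
}
```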

tleyden · 2015-03-18T21:22:03Z

golang-nuts post: Proactively closing longpoll connections for clients that disappear from the network

tleyden · 2015-03-18T21:57:49Z

Good description of the problem: http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html

Article on tcp keepalive with Go: http://felixge.de/2014/08/26/tcp-keepalive-with-golang.html

jchris · 2015-03-18T22:28:51Z

http://www.unix.com/programming/135103-tcp-ip-how-verify-delivery.html

tleyden · 2015-03-18T22:40:15Z

I tried reducing the tcp keepalive time per this guide:

echo 30 > /proc/sys/net/ipv4/tcp_keepalive_time
echo 10 > /proc/sys/net/ipv4/tcp_keepalive_intvl
echo 9 > /proc/sys/net/ipv4/tcp_keepalive_probes

but it didn't seem to take the expected effect. This is on a CentOS7 box running in AWS.

tleyden · 2015-03-18T22:50:19Z

"but it didn't seem to take the expected effect"

It did seem to reduce it to 20 minutes, which may be related to some internal minimum.

tleyden · 2015-03-19T15:28:46Z

I was able to reduce the time down to approximately 3 minutes with:

echo 60 > /proc/sys/net/ipv4/tcp_keepalive_time
echo 6 > /proc/sys/net/ipv4/tcp_keepalive_intvl
echo 5 > /proc/sys/net/ipv4/tcp_keepalive_probes
echo 8 > /proc/sys/net/ipv4/tcp_retries2

jchris · 2015-03-19T17:43:06Z

Is there a way to measure the additional work the TCP stack will have to do? Is that related to dropping connections that aren't actually lost? I guess I'm wondering why the default is so long.

tleyden · 2015-03-19T17:52:12Z

Basically it will cause more packets to be sent and increase overall network traffic.

I think it's set to a high number by default because it's a relatively rare occurrence for a network cable to be unplugged, and the downside isn't that bad (an extra open socket for a while, which will eventually get reaped). Of course, in a mobile world, it's a more common occurrence.

tleyden · 2015-03-19T17:53:27Z

"Is that related to dropping connections that aren't actually lost?"

It's related to the "unplugged" scenario described in section 2.3, "Checking for dead peers", of the TCP keepalive overview.

tleyden · 2015-03-19T21:22:27Z

Too many open files errors

tleyden · 2015-03-19T22:03:53Z

Added two articles to the couchbase-mobile-portal docs under "OS Level tuning", so I'm closing this ticket.

couchbasebrian · 2015-03-23T17:44:40Z

Can you please provide a link to the articles? Thank you, I was looking under

http://developer.couchbase.com/mobile/develop/guides/sync-gateway/index.html

tleyden · 2015-03-23T17:51:35Z

@couchbasebrian sure, both articles are here:

http://tleyden-couchbase.s3.amazonaws.com/mobile-docs/master/develop/guides/sync-gateway/os-level-tuning/index.html

(this will eventually get pushed up to developer.couchbase.com)

tleyden self-assigned this Mar 16, 2015

tleyden added the in progress label Mar 16, 2015

tleyden mentioned this issue Mar 18, 2015

Ignores heartbeat parameter in _changes POST request #745

Closed

tleyden closed this as completed Mar 19, 2015

tleyden removed the in progress label Mar 19, 2015
