Investigate excessive socket connections #742

Closed
tleyden opened this issue Mar 16, 2015 · 18 comments

Comments

@tleyden
Contributor

commented Mar 16, 2015

@tleyden tleyden self-assigned this Mar 16, 2015

@tleyden tleyden added the in progress label Mar 16, 2015

@tleyden

Contributor Author

commented Mar 17, 2015

Copy/paste of a Skype conversation:

@snej:

If the client closes the socket, it should show up as closed to the gateway too, shouldn't it?
The client would like a longpoll request to stay open indefinitely, and normally it will, but the client will retry if the socket gets closed before the response arrives. AWS servers are infamous for closing these sockets after a brief interval.

@adamcfraser:

I’m trying to understand what the keep-alive setting in nginx is doing in the situation described here - whether the client isn’t closing the socket (when it loses its wifi), or whether it’s getting closed but nginx isn’t routing that through to SG. I don’t know much about how socket management works for clients dropping off the network, honestly.

@snej:

Ah, if the client just loses connectivity then TCP won't notice anything happened until the next time the server tries to send data. That's part of what the "heartbeat" parameter on the changes feed is for. It makes the server send a CRLF periodically.

@adamcfraser:

ah. so the expected result is that those connections would drop when the heartbeat happens

@snej:

By default TCP waits a ridiculously long time (90 min?) before sending an ACK out over an idle socket to see if it's still alive. It'd be something on the order of 15 sec after the heartbeat — the server has to wait for a response to the packet it sent.
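To make the heartbeat idea in this conversation concrete, here is a minimal Python sketch. It is not Sync Gateway's code (Sync Gateway is written in Go), and `send_heartbeat` is a hypothetical helper. The demo uses a local socket pair and a cleanly closed peer, so the next write fails immediately on Linux or macOS. With a peer that has silently dropped off the network, the first write typically succeeds locally and the error only surfaces once the kernel gives up retransmitting.

```python
import socket

HEARTBEAT = b"\r\n"  # the CRLF the changes feed sends periodically


def send_heartbeat(conn: socket.socket) -> bool:
    """Write one heartbeat; return False if the write reveals a dead peer."""
    try:
        conn.sendall(HEARTBEAT)
        return True
    except OSError:
        # BrokenPipeError, ConnectionResetError and timeouts are all OSError.
        return False


# Demo on a connected local socket pair (stands in for server <-> client).
server_side, client_side = socket.socketpair()

alive_before = send_heartbeat(server_side)  # peer is still there
received = client_side.recv(16)             # client sees the CRLF

client_side.close()                         # client goes away (cleanly, here)
alive_after = send_heartbeat(server_side)   # the write now fails

server_side.close()
print(alive_before, received, alive_after)
```

A server loop would call something like `send_heartbeat` once per heartbeat interval and release the connection's resources as soon as it returns False.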

@tleyden

Contributor Author

commented Mar 17, 2015

Using the following steps:

  • Start a fresh Sync Gateway on AWS
  • Run TodoLite-iOS on a device, pointed at that Sync Gateway
  • Put device into airplane mode
  • Force-kill app

I'm able to create ESTABLISHED connections that should have been cleaned up by the heartbeat according to the above comment.

These have all been hanging around for > 30 mins based on repro steps above:

sync_gate 7405 root    9u  IPv6 2944294      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:10980 (ESTABLISHED)
sync_gate 7405 root   10u  IPv6 2971342      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:21809 (ESTABLISHED)
sync_gate 7405 root   11u  IPv6 2944366      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:24292 (ESTABLISHED)
sync_gate 7405 root   12u  IPv6 2944368      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:59175 (ESTABLISHED)
sync_gate 7405 root   13u  IPv6 2954006      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:6864 (ESTABLISHED)
sync_gate 7405 root   14u  IPv6 2959275      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:20755 (ESTABLISHED)
sync_gate 7405 root   15u  IPv6 2962847      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:39919 (ESTABLISHED)
sync_gate 7405 root   16u  IPv6 2949350      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:32259 (ESTABLISHED)
sync_gate 7405 root   17u  IPv6 2949352      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:41819 (ESTABLISHED)
sync_gate 7405 root   18u  IPv6 2954008      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:37212 (ESTABLISHED)
sync_gate 7405 root   19u  IPv6 2962849      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:19406 (ESTABLISHED)
sync_gate 7405 root   20u  IPv6 2995912      0t0     TCP ip-10-182-209-208.ec2.internal:4984->173-228-114-82.dedicated.static.sonic.net:31135 (ESTABLISHED)
sync_gate 7405 root   25u  IPv6 2971344      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:45013 (ESTABLISHED)
sync_gate 7405 root   26u  IPv6 2975626      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:15094 (ESTABLISHED)
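
When comparing runs, it can help to count the lingering connections per remote host rather than eyeballing the lsof output. The following is a small Python sketch under the assumption that the lines look like the ones above; `established_by_remote` and the regular expression are illustrative, not part of any existing tool.

```python
import re
from collections import Counter

# Captures the remote host, remote port and TCP state from an lsof TCP line.
LSOF_TCP = re.compile(
    r"TCP\s+\S+->(?P<remote>\S+):(?P<port>\d+)\s+\((?P<state>[A-Z_0-9]+)\)"
)


def established_by_remote(lsof_output: str) -> Counter:
    """Count ESTABLISHED connections per remote host."""
    counts = Counter()
    for line in lsof_output.splitlines():
        match = LSOF_TCP.search(line)
        if match and match.group("state") == "ESTABLISHED":
            counts[match.group("remote")] += 1
    return counts


# Three lines copied from the output above.
SAMPLE = """\
sync_gate 7405 root    9u  IPv6 2944294      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:10980 (ESTABLISHED)
sync_gate 7405 root   10u  IPv6 2971342      0t0     TCP ip-10-182-209-208.ec2.internal:4984->mobile-166-171-248-249.mycingular.net:21809 (ESTABLISHED)
sync_gate 7405 root   20u  IPv6 2995912      0t0     TCP ip-10-182-209-208.ec2.internal:4984->173-228-114-82.dedicated.static.sonic.net:31135 (ESTABLISHED)
"""

counts = established_by_remote(SAMPLE)
for remote, n in counts.most_common():
    print(n, remote)
```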
@tleyden

Contributor Author

commented Mar 18, 2015

I noticed that:

  • When using curl to hit a test endpoint, the connection disappeared from the lsof output immediately after the curl request finished.
  • When using the Android HttpClient code snippet, the connection lingered as ESTABLISHED in the lsof output. I'm not sure for how long, but when I ran a second test there was only one ESTABLISHED connection rather than two (so either the connection was being reused, or the previous one was being destroyed).

This was puzzling, and so I decided to get to the bottom of it.

Here is the network capture when using curl:

[image: curl_net_capture]

and here is the network capture when using android's httpclient:

[image: android_net_capture]

As it turns out, HTTP/1.1 clients use persistent connections by default. This can be turned off by adding:

...
request.setHeader("Connection", "close"); 

This has the effect of forcing the server to close the connection after sending a response.

After adding the above, it behaved the same as the curl client. Btw, I'm not suggesting that anybody do this, because you would lose all of the benefits of persistent connections. I'm just writing this down as a piece of the puzzle in figuring out why the connections are lingering.
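
The same persistent-versus-one-shot behaviour can be reproduced locally. The sketch below is a Python analogue, not the Android HttpClient code from this comment: it starts a throwaway HTTP/1.1 server on localhost and sends one request with and one without `Connection: close`. `Handler` and `server_will_close` are hypothetical names, and the handler echoes the header back only so that the client can observe the server's decision.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # persistent connections unless told otherwise

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        # Echo the client's "Connection: close" so the decision is visible.
        if self.headers.get("Connection", "").lower() == "close":
            self.send_header("Connection", "close")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def server_will_close(port: int, headers: dict) -> bool:
    """Issue one GET and report whether the server announced it will close."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/", headers=headers)
        resp = conn.getresponse()
        resp.read()
        return resp.will_close
    finally:
        conn.close()


server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

persistent = server_will_close(port, {})                     # default HTTP/1.1
one_shot = server_will_close(port, {"Connection": "close"})  # opt out of reuse

server.shutdown()
server.server_close()
print(persistent, one_shot)
```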

@tleyden

Contributor Author

commented Mar 18, 2015

@tleyden

Contributor Author

commented Mar 18, 2015

@jchris

@tleyden

Contributor Author

commented Mar 18, 2015

I tried reducing the TCP keepalive time per this guide:

echo 30 > /proc/sys/net/ipv4/tcp_keepalive_time
echo 10 > /proc/sys/net/ipv4/tcp_keepalive_intvl
echo 9 > /proc/sys/net/ipv4/tcp_keepalive_probes

but it didn't seem to have the expected effect. This is on a CentOS 7 box running in AWS.
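
For reference, the effect these three settings are expected to have can be worked out with simple arithmetic: an idle connection is probed after `tcp_keepalive_time` seconds, and declared dead after `tcp_keepalive_probes` unanswered probes sent `tcp_keepalive_intvl` seconds apart. The Python sketch below uses a hypothetical helper name and compares the usual Linux defaults (7200 / 75 / 9) with the values above. It only models sockets that have SO_KEEPALIVE enabled and that are idle with no unacknowledged data in flight, which may not describe a connection that is receiving heartbeats.

```python
def keepalive_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Idle time before the first probe, plus the time for every probe to go unanswered."""
    return time_s + intvl_s * probes


linux_defaults = keepalive_detection_seconds(7200, 75, 9)
tuned = keepalive_detection_seconds(30, 10, 9)

print(f"defaults: {linux_defaults} s (~{linux_defaults / 60:.0f} min)")
print(f"tuned:    {tuned} s (~{tuned / 60:.0f} min)")
```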

@tleyden

Contributor Author

commented Mar 18, 2015

> but it didn't seem to have the expected effect

It did seem to reduce the time the dead connections linger to 20 minutes, which may be related to some internal minimum.

@tleyden

Contributor Author

commented Mar 19, 2015

I was able to reduce the time down to approximately 3 minutes with:

echo 60 > /proc/sys/net/ipv4/tcp_keepalive_time
echo 6 > /proc/sys/net/ipv4/tcp_keepalive_intvl
echo 5 > /proc/sys/net/ipv4/tcp_keepalive_probes
echo 8 > /proc/sys/net/ipv4/tcp_retries2
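
One possible reason `tcp_retries2` matters here: once the server has written data that the dead peer never acknowledges, the kernel retransmits with exponential backoff, and how long it keeps trying is bounded by `tcp_retries2` rather than by the keepalive settings. The Python sketch below estimates that bound under simplifying assumptions, namely that the retransmission timeout starts at 200 ms, doubles on every attempt, and is capped at 120 s. Real timeouts depend on the measured round-trip time, so treat the results as rough lower bounds; the function name is hypothetical.

```python
RTO_MIN_MS = 200      # assumed initial retransmission timeout
RTO_MAX_MS = 120_000  # assumed cap on a single retransmission timeout


def hypothetical_retransmit_timeout_ms(retries2: int) -> int:
    """Sum the backoff intervals until the kernel gives up on the connection."""
    total = 0
    for attempt in range(retries2 + 1):
        total += min(RTO_MIN_MS * 2 ** attempt, RTO_MAX_MS)
    return total


default_15 = hypothetical_retransmit_timeout_ms(15)  # usual Linux default
tuned_8 = hypothetical_retransmit_timeout_ms(8)      # value used above

print(f"tcp_retries2=15: {default_15 / 1000:.1f} s")
print(f"tcp_retries2=8:  {tuned_8 / 1000:.1f} s")
```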
@jchris

Contributor

commented Mar 19, 2015

Is there a way to measure the additional work the TCP stack will have to do? Is that related to dropping connections that aren't actually lost? I guess I'm wondering why the default is so long.

@tleyden

Contributor Author

commented Mar 19, 2015

Basically it will cause more packets to be sent and increase overall network traffic.

I think it's set to a high number by default because it's a relatively rare occurrence for a network cable to be unplugged, and the downside isn't that bad (an extra open socket for a while, which will eventually get reaped). Of course, in a mobile world, it's a more common occurrence.

@tleyden

Contributor Author

commented Mar 19, 2015

> Is that related to dropping connections that aren't actually lost?

It's related to the "unplugged" scenario described in "2.3. Checking for dead peers" of the TCP keepalive overview.

@tleyden

Contributor Author

commented Mar 19, 2015

@tleyden

Contributor Author

commented Mar 19, 2015

Added two articles to the couchbase-mobile-portal docs under "OS Level tuning", so I'm closing this ticket.

@tleyden tleyden closed this Mar 19, 2015

@tleyden tleyden removed the in progress label Mar 19, 2015

@couchbasebrian


commented Mar 23, 2015

Can you please provide a link to the articles? Thank you, I was looking under

http://developer.couchbase.com/mobile/develop/guides/sync-gateway/index.html

@tleyden

Contributor Author

commented Mar 23, 2015

@couchbasebrian sure, both articles are here:

http://tleyden-couchbase.s3.amazonaws.com/mobile-docs/master/develop/guides/sync-gateway/os-level-tuning/index.html

(this will eventually get pushed up to developer.couchbase.com)
