Sockets remain in TIME_WAIT / CLOSE_WAIT #164

Closed
wereHamster opened this issue Sep 10, 2013 · 5 comments

Comments

@wereHamster
Contributor

I have a simple etcd setup with two machines in a cluster. Both servers eventually run out of free file descriptors (2013/09/10 16:47:32 http: Accept error: accept tcp [::]:7001: too many open files; retrying in 1s).

Etcd fails to properly close the sockets. They remain in TIME_WAIT / CLOSE_WAIT. You can use these two commands to inspect which state the sockets are in:

  • lsof -a -p $(pidof etcd)
  • netstat -ntp

So far one of the etcd servers has leaked 14 file descriptors in about two hours.

@xiang90
Contributor

xiang90 commented Sep 11, 2013

@wereHamster Based on what you reported on IRC, it seems you are not using the newest version of etcd, which has a socket leak problem (I am working on that).
I do not think etcd controls any TCP connections in the previous version; Go's http package maintains all persistent connections. It is also possible that the leak is caused by the client side (based on the stats you gave me, there were about 10K reads). Which client library are you using?

@wereHamster
Contributor Author

The client is written in Haskell and uses conduit-http for the HTTP connection. I don't think the client is the problem though, as the dangling connections were on port 7001 between the two etcd instances.

@philips
Contributor

philips commented Oct 10, 2013

@wereHamster Can you please test with the latest master version? I don't know if this is fixed or how to reproduce.

@wereHamster
Contributor Author

I'm running v0.1.2-29-g3be69f0 now and the problem still persists. The server has about 2k connections in TIME_WAIT. However, etcd still accepts connections on port 4001, which previously failed due to the 'out of sockets' condition. Also, I don't see any error messages in the log file.

So I'm not sure if it's fixed. On one hand etcd works (as in, clients can connect to it and do their job), on the other hand 2k connections in TIME_WAIT still seems suspicious.

@wereHamster
Contributor Author

Hm, only one of the two servers in the cluster was updated. The other one is still running v0.1.1-46-gadbcbef. That could explain why I'm seeing TIME_WAIT but no CLOSE_WAIT on the machine that's running the latest version. I'll update the second machine as well and will get back to you in a day or two.

@xiang90 closed this as completed Nov 14, 2013