
redsocks enters an infinite loop when RLIMIT_NOFILE is reached and a new client connection is attempted #19

Closed
apoikos opened this issue Jan 26, 2012 · 13 comments

@apoikos (Contributor) commented Jan 26, 2012

Hi,

In its main loop, redsocks accept()s new connections and simply retries if accept() returns an error. However, this leads to an infinite loop in the following corner case:

  1. If redsocks has reached its open file limit and
  2. a new connection is being accept()ed,
  3. then accept() fails because it cannot create a new file descriptor for the incoming connection, however
  4. the client connection stays in the listening socket's backlog and thus
  5. the listening socket appears EV_READ-ready and thus we immediately loop back to 2.

This loop causes 100% CPU utilization and floods the logfile with `redsocks[1187]: accept: Too many open files` until a client connection is closed.
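For illustration, here is a minimal sketch (hypothetical names, not the actual redsocks code) of the kind of libevent accept callback that behaves this way: accept() fails with EMFILE, the pending connection is never dequeued, so the listening socket stays readable and the callback is invoked again immediately.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Registered with event_set(..., EV_READ | EV_PERSIST, accept_client, ...). */
static void accept_client(int fd, short what, void *arg)
{
    struct sockaddr_in client_addr;
    socklen_t addrlen = sizeof(client_addr);

    int client_fd = accept(fd, (struct sockaddr *)&client_addr, &addrlen);
    if (client_fd == -1) {
        perror("accept");   /* "Too many open files" is logged on every iteration */
        return;             /* backlog untouched -> EV_READ fires again at once */
    }
    /* ... hand client_fd over to the relay code ... */
}
```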

There are a number of ways to solve this (e.g. check whether accept() returned EMFILE and throttle), although IMHO the best one would be to perform a getrlimit(RLIMIT_NOFILE) on startup and keep track of how many fds are currently in use before accept()ing. setrlimit() support would also be nice, by the way ;-)
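Something along these lines is what I have in mind (a rough sketch with made-up names, assuming the caller keeps its own count of open descriptors):

```c
#include <sys/resource.h>

static rlim_t fd_limit;   /* cached soft limit on open files */

/* Read RLIMIT_NOFILE at startup and optionally raise the soft limit. */
int init_fd_limit(rlim_t wanted)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) == -1)
        return -1;

    if (wanted > rl.rlim_cur && wanted <= rl.rlim_max) {
        rl.rlim_cur = wanted;                      /* the setrlimit() part */
        if (setrlimit(RLIMIT_NOFILE, &rl) == -1)
            return -1;
    }
    fd_limit = rl.rlim_cur;
    return 0;
}

/* Before calling accept(), compare the caller's count of descriptors in
 * use against fd_limit (keeping some headroom) and skip accept() when
 * the limit would be exceeded. */
int have_fd_headroom(unsigned fds_in_use, unsigned headroom)
{
    return fds_in_use + headroom < fd_limit;
}
```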

Thanks,
Apollon

@darkk (Owner) commented Jan 26, 2012

Nice catch! Thanks, I'll try to fix that.

@darkk (Owner) commented Jan 26, 2012

By the way, counting file descriptors is a bad solution for several reasons:

  • libevent may use some file descriptors internally for signal processing and DNS resolving
  • there may be several file descriptors besides 0, 1 and 2 already open when redsocks is launched, and they will be inherited by the process

There are actually two issues here:
1 - busyloop and CPU load
2 - log flooding and possible disk exhaustion

I think that exponential backoff after an accept() error for EMFILE, ENFILE, ENOBUFS & ENOMEM is a solution for (1).
I think that (2) may be ignored with sane exponential backoff defaults. A sane default is to cap the max backoff at the default TCP connection timeout, which equals $(net.ipv4.tcp_syn_retries) * initRTO, where initRTO == 3s per RFC 1122.
So a sane default for max_backoff is 15s on my host. I think that one log message per 15s is not critical.
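Roughly something like this (a sketch with made-up names, not the actual patch): drop the listener from the event loop on those errno values and re-add it from a one-shot timer, doubling the delay up to the cap; the backoff can be reset to the minimum once a client descriptor is closed.

```c
#include <errno.h>
#include <sys/time.h>
#include <event.h>

#define MAX_BACKOFF_MS 15000   /* ~default TCP connection timeout, see above */

/* listener_ev: event_set(&listener_ev, listen_fd, EV_READ | EV_PERSIST, accept_cb, NULL)
 * backoff_ev:  evtimer_set(&backoff_ev, resume_accept, NULL)                            */
static struct event listener_ev, backoff_ev;
static int backoff_ms = 1;

static void resume_accept(int fd, short what, void *arg)
{
    event_add(&listener_ev, NULL);               /* start accept()ing again */
}

static void on_accept_error(int err)
{
    if (err == EMFILE || err == ENFILE || err == ENOBUFS || err == ENOMEM) {
        struct timeval tv = { backoff_ms / 1000, (backoff_ms % 1000) * 1000 };
        event_del(&listener_ev);                 /* stop the EV_READ busy loop */
        evtimer_add(&backoff_ev, &tv);           /* come back after the backoff */
        backoff_ms = backoff_ms * 2 < MAX_BACKOFF_MS ? backoff_ms * 2 : MAX_BACKOFF_MS;
    }
}
```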

Yes, I'm aware of the "Tuning TCP parameters for the 21st century" trend, and I'm aware that tcp_syn_retries has nothing to do with the client side; I just want to delegate responsibility for the defaults to someone else. :)

Comments?

P.S. I wonder, did you find the bug in a real-world situation or was it just a test case?
P.P.S. I've checked with telnet, and tcp_syn_retries * initRTO != the default TCP timeout. I'll re-read the RFC.
For the record: the timeout is ~189 seconds and there were 6 SYNs: 0s, 3s, 9s, 21s, 45s, 93s.
P.P.P.S.: Of course, the backoff can be reset when some socket is closed :)

@ghost assigned darkk Jan 26, 2012
@apoikos (Contributor, Author) commented Jan 26, 2012

You're right, libevent actually does use file descriptors internally, and those cannot be accounted for in a reliable way. The rest of the inherited fds should be closed as part of the daemonization process. Other than that, yes, it's probably not the best way to do things, although it might be simpler than exponential backoff.

IMHO, you don't need to cap things at the exact value the RFC specifies, just make sure you're within a "safe" region (e.g. 5-10 seconds). I think the best approach would be to throttle log messages separately from accept(), i.e. send out a "max open files reached" warning once and then not again before, say, 60 seconds have elapsed. Essentially, what we care about is the fact that the limit has been reached (i.e. that we must raise it), not that it was triggered by a particular client connection. See for example the "MaxClients reached" warning that Apache emits.

To sum things up: any non-zero backoff in accept() will solve the busy-loop issue; personally I'd use something capped at 1-2 seconds in order to guarantee some responsiveness. But IMHO log messages should be throttled independently.
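Something as simple as this would do for the log side (a sketch, names made up): emit the warning at most once per interval, independently of how often accept() fails.

```c
#include <syslog.h>
#include <time.h>

#define LOG_THROTTLE_SEC 60

/* Call this instead of logging unconditionally on EMFILE. */
static void warn_fd_limit_reached(void)
{
    static time_t last_warned;              /* 0 until the first warning */
    time_t now = time(NULL);

    if (last_warned == 0 || now - last_warned >= LOG_THROTTLE_SEC) {
        syslog(LOG_WARNING,
               "accept: too many open files, suppressing this message for %d seconds",
               LOG_THROTTLE_SEC);
        last_warned = now;
    }
}
```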

P.S.: I actually encountered this in a real-world situation, with a rather unorthodox use case (using redsocks to tunnel hundreds of Nagios NRPE checks over SOCKS for NAT traversal), and ended up with 12 GB worth of logs within 6 hours.

@darkk (Owner) commented Jan 26, 2012

Regarding log throttling, are you writing logs to syslog?
Every syslog implementation I've seen does some log throttling internally.

What is your logging configuration?

@apoikos (Contributor, Author) commented Jan 26, 2012

Yes, I'm using rsyslog. It supports per-process throttling, although not in the version shipped with Debian Squeeze. Versions of rsyslog before 5.7.1 only support log queues, which means that they actually delay dequeueing from /dev/log (or wherever), effectively blocking every process that tries to write something to syslog while a log storm is going on. So leaving log throttling up to the syslog implementation may cause side effects.

@darkk (Owner) commented Jan 26, 2012

Hm... Can you copy-paste some lines of the log?
As far as I can see, all your lines should be the same in this busy loop, so it should work like this:

Jan 26 18:06:19 darkk-ya-thinkpad darkk: foo
Jan 26 18:06:20 darkk-ya-thinkpad darkk: foo
Jan 26 18:06:20 darkk-ya-thinkpad darkk: foo
Jan 26 18:06:23  darkk: last message repeated 10 times

@apoikos (Contributor, Author) commented Jan 26, 2012

It goes on like this, forever ;-)

Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files
Jan 26 18:03:59 crowley redsocks[6977]: accept: Too many open files

Duplicate message reduction is a feature that must be explicitly enabled in rsyslog (at least under Debian). Even if it were enabled, rsyslog would still have to process the messages, although it wouldn't write them to disk.

@darkk (Owner) commented Jan 27, 2012

I've added backoff to the master branch, can you test it?

My logs look like this:

1327695004.774753 main.c:152 main(...) redsocks started
1327695006.843186 redsocks.c:607 redsocks_accept_client(...) accept: out of file descriptos, backing off for 1 ms: Too many open files
1327695006.844360 redsocks.c:607 redsocks_accept_client(...) accept: out of file descriptos, backing off for 3 ms: Too many open files
1327695006.847510 redsocks.c:607 redsocks_accept_client(...) accept: out of file descriptos, backing off for 7 ms: Too many open files
1327695006.854692 redsocks.c:607 redsocks_accept_client(...) accept: out of file descriptos, backing off for 12 ms: Too many open files
1327695006.866859 redsocks.c:607 redsocks_accept_client(...) accept: out of file descriptos, backing off for 27 ms: Too many open files
1327695006.894107 redsocks.c:607 redsocks_accept_client(...) accept: out of file descriptos, backing off for 10 ms: Too many open files
1327695006.904281 redsocks.c:607 redsocks_accept_client(...) accept: out of file descriptos, backing off for 101 ms: Too many open files
1327695007.005512 redsocks.c:607 redsocks_accept_client(...) accept: out of file descriptos, backing off for 158 ms: Too many open files
1327695007.163817 redsocks.c:607 redsocks_accept_client(...) accept: out of file descriptos, backing off for 351 ms: Too many open files
1327695007.515323 redsocks.c:607 redsocks_accept_client(...) accept: out of file descriptos, backing off for 836 ms: Too many open files
1327695008.352300 redsocks.c:607 redsocks_accept_client(...) accept: out of file descriptos, backing off for 1864 ms: Too many open files
1327695010.218325 redsocks.c:607 redsocks_accept_client(...) accept: out of file descriptos, backing off for 939 ms: Too many open files
1327695011.158418 redsocks.c:607 redsocks_accept_client(...) accept: out of file descriptos, backing off for 8045 ms: Too many open files

I have two questions:

  1. do you need a min_accept_backoff option? The minimal backoff time is 1 ms now.
  2. do you need a low-latency solution, and will you test it if I implement it? I mean restoring the accept() handler after every close().

@apoikos (Contributor, Author) commented Jan 30, 2012

Hi Leonid,

I've just tested it and it seems to work fine. Just two remarks:

  1. There is a small typo: "descriptos" instead of "descriptors" in the message you log.
  2. I was thinking about what you proposed in (2). As I see it, it shouldn't be too hard to implement and it's probably the event-driven Right Thing To Do (TM). I'd be glad to test it if you implement it.

Thanks for the quick response!
Apollon

P.S.: Will you be releasing a 0.4 version soon, or should I backport the bugfix and upload 0.3 to Debian?

@darkk (Owner) commented Jan 30, 2012

Hi Apollon,

There is one issue with (2): oscillation between can-accept and can't-accept states under a connection flood. Also, redsocks needs two file descriptors to handle a TCP-proxied connection, so accepting a connection and then closing it is not really graceful degradation.
It's not Really Hard to do right, but it's a bit tricky.

I'll think a bit more about (2) and will release 0.4 as soon as this bug is closed.

@darkk (Owner) commented Feb 6, 2012

Ok, I've finally made up my mind.
To close the bug I'm going to implement two features:

  1. min_accept_backoff
  2. restart the accept() loop on close() if and only if min_accept_backoff milliseconds have passed (see the sketch below).
    I really would like to avoid oscillation, and I'm too lazy to preallocate all the required descriptors before accept()-ing.

I'll also set min_accept_backoff to maybe 125 ms to avoid oscillation. Can you suggest a better value? It's the average RTT in my current route cache (among ~100 routes on my laptop), so my suggestion is just a random number.
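In rough C, the close()-side logic for (2) would look something like this (a sketch, not the final code):

```c
#include <sys/time.h>
#include <event.h>

static struct event listener_ev;             /* the accept() handler, EV_READ | EV_PERSIST */
static struct timeval accept_suspended_at;   /* set when the handler was event_del()ed     */
static int accept_suspended;
static long min_accept_backoff_ms = 125;     /* the knob discussed above */

static long elapsed_ms(const struct timeval *since)
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return (now.tv_sec - since->tv_sec) * 1000 + (now.tv_usec - since->tv_usec) / 1000;
}

/* Called whenever a client connection is closed. */
static void maybe_resume_accept(void)
{
    if (accept_suspended && elapsed_ms(&accept_suspended_at) >= min_accept_backoff_ms) {
        event_add(&listener_ev, NULL);       /* restart the accept() loop */
        accept_suspended = 0;
    }
}
```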

@apoikos (Contributor, Author) commented Feb 7, 2012

Hi Leonid,

Your suggestion provides enough dampening, and I think it's a good compromise between simplicity and responsiveness. As for min_accept_backoff, I'd go for 100 ms; I prefer round numbers ;-)

Thanks again for the quick response!

@darkk (Owner) commented Mar 25, 2012

Hi Apollon.

Excuse me for the late reply (so much for the quick response, huh). I've committed the min_accept_backoff feature together with the accept()-after-close() logic.

I've tested it a bit and it seems to work, so I tagged it as release-0.4.

I would be happy if you gave it a try in a really loaded environment - my tests were rather synthetic :)

Leonid.

@darkk darkk closed this as completed Mar 25, 2012