Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

race condition in double fork resulting in no server process #446

Closed
grooverdan opened this Issue Nov 20, 2013 · 11 comments

Comments

Projects
None yet
4 participants
Contributor

grooverdan commented Nov 20, 2013

freebsd 9.0 release .. fail2ban 0.8.9

From: http://windy.fakelag.net/fail2bantrace.txt

35694: write(1,"2013-11-19 20:40:46,581 fail2ban.server : INFO   Starting in daemon mode\n",73) = 73 (0x49)
35694: fork()                    = 35704 (0x8b78)
35704: process exit, rval = 0
 = 35704 (0x8b78)
35694: process exit, rval = 0
 = 0 (0x0)

From code it look like the server exits before the child sets up a signal mask to ignore the parent terminating.

perhaps signal.signal(signal.SIGHUP, signal.SIG_IGN) should be before the fork (assuming signal handling is inherited by child processes(?))

Contributor

grooverdan commented Nov 20, 2013

http://windy.fakelag.net/fail2bantrace-timesleep5.txt

a sleep of 5 seconds before the first fork parent exits

http://windy.fakelag.net/fail2bantrace-movedsignal-nosleep.txt

signal.signal(signal.SIGHUP, signal.SIG_IGN) to before the first pid = os.fork() in the same funcion

http://windy.fakelag.net/fail2bantrace-movedsignal.txt
contains both of the above changes.

All seem to progress to the server having a socket that's connectable, though don't live long enough to respond to even a ping.

All contain:

**Pth** SCHEDULER INTERNAL ERROR: no more thread(s) available to schedule!?!?

which could be an OOM indicator: https://groups.google.com/d/msg/minix3/RTyeeiryxmg/OAr7phfKpLwJ

Contributor

grooverdan commented Nov 20, 2013

OOM - probably not

[10:30:58]  only other question is it likely to be running short of memory?
[10:32:11]  shouldn't.. 2gb memory with 4gb swap. There's nearly 2gb inactive memory right now
Owner

yarikoptic commented Nov 21, 2013

On Tue, 19 Nov 2013, Daniel Black wrote:

freebsd 9.0 release .. fail2ban 0.8.9

From: [1]http://windy.fakelag.net/fail2bantrace.txt

35694: write(1,"2013-11-19 20:40:46,581 fail2ban.server : INFO Starting in daemon mode\n",73) = 73 (0x49)
35694: fork() = 35704 (0x8b78)
35704: process exit, rval = 0
= 35704 (0x8b78)
35694: process exit, rval = 0
= 0 (0x0)

From code it look like the server exits before the child sets up a signal
mask to ignore the parent terminating.

interesting indeed... or could it be that on freebsd it simply doesn't
handle signal.SIG_IGN the same way? is that a single case and otherwise
it started fine on freebsd 9.0? a simple print or debug statement in
the child could clarify if it set the signals by the time ... stop --
looking at pids in above strace -- it seems to be the 1st child
exits first (without calling 2nd .fork()) and only then the server
exits, so logic is correct but something goes hairy in that 1st child...
again -- adding a print statement in handling of exception for the 2nd
fork could have helped

perhaps signal.signal(signal.SIGHUP, signal.SIG_IGN) should be before the
fork (assuming signal handling is inherited by child processes(?))

Yaroslav O. Halchenko, Ph.D.
http://neuro.debian.net http://www.pymvpa.org http://www.fail2ban.org
Senior Research Associate, Psychological and Brain Sciences Dept.
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
WWW: http://www.linkedin.com/in/yarik

Contributor

grooverdan commented Nov 21, 2013

interesting indeed... or could it be that on freebsd it simply doesn't handle signal.SIG_IGN the same way?

Looking at the beginning of the ktrace it seems that python goes to some lengths to normalize the behavior

is that a single case and otherwise it started fine on freebsd 9.0?

What do you mean by single case?

a simple print or debug statement in the child could clarify if it set the signals by the time ... stop

yes this won't help.

-- looking at pids in above strace -- it seems to be the 1st child exits first (without calling 2nd .fork()) and only then the server exits,

yes. The child exits before the signal mask is set.

so logic is correct but something goes hairy in that 1st child... again -- adding a print statement in handling of exception for the 2nd fork could have helped

It doesn't get to the second fork in its current state.

it was report on irc this morning:

[10:30:23] <rootrider> I can get the server to stay running if I manually run fail2ban-client and then fail2ban-server -x

Owner

yarikoptic commented Nov 21, 2013

On Thu, 21 Nov 2013, Daniel Black wrote:

 interesting indeed... or could it be that on freebsd it simply doesn't
 handle signal.SIG_IGN the same way?

Looking at the beginning of the ktrace it seems that python goes to some
lengths to normalize the behavior

 is that a single case and otherwise it started fine on freebsd 9.0?

What do you mean by single case?

does it happen all the time (on every freebsd box on every run) or just
some times (once ever)?

 a simple print or debug statement in the child could clarify if it set
 the signals by the time ... stop

yes this won't help.

 -- looking at pids in above strace -- it seems to be the 1st child exits
 first (without calling 2nd .fork()) and only then the server exits,

yes. The child exits before the signal mask is set.

 so logic is correct but something goes hairy in that 1st child... again
 -- adding a print statement in handling of exception for the 2nd fork
 could have helped

It doesn't get to the second fork in its current state.

or it fails to call it... we don't know, do we?

Yaroslav O. Halchenko, Ph.D.
http://neuro.debian.net http://www.pymvpa.org http://www.fail2ban.org
Senior Research Associate, Psychological and Brain Sciences Dept.
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
WWW: http://www.linkedin.com/in/yarik

Contributor

grooverdan commented Nov 21, 2013

It doesn't get to the second fork in its current state.
or it fails to call it... we don't know, do we?

36126: sigaction(SIGHUP,{ SIG_IGN 0x0 ss_t },{ SIG_DFL 0x0 ss_t }) = 0 (0x0) shows up in the other logs where sleep or it has been done before the fork.

And its the first bit of code after the fork for the child.

I'm still baffled by this. Probably time to pull in the FreeBSD maintainers to help here (or set myself up a vm).

to explain a little more, the issue I'm seeing is with attempting to start fail2ban normally, using fail2ban-client. This is when fail2ban-server starts up, creates the pid and sock files, and then exits for some reason.

I am able to start fail2ban-client and then manually run fail2ban-server -x, and everything seems to work correctly then. fail2ban-server remains running.

Also, over the last day I upgraded to FreeBSD 9.2 RELEASE (the most recent stable release) and updated all ports, so everything is about as current as possible now (from ports). So, python 2.7.5_4 is installed, and fail2ban-0.8.10 is installed. The issues described here remain the same, however.

Let me know if any further information or testing is needed.

actually.. I have been running 'fail2ban-client -c /usr/local/etc/fail2ban/ -v start' and then 'fail2ban-server -f -x'. This has been working as it should.

However, I just tried doing the same, but without the -f option for fail2ban-server (so, just 'fail2ban-server -x') and fail2ban-server fails to remain running, just as when started by fail2ban-client.

So, fail2ban-server remains running when I use the -f -x parameters, but when I use -x only (so the server runs in daemon mode), it exits out quickly. This is repeatable.

(edited with server/server.py modified as mentioned below)

Here are traces (using truss in freebsd) of fail2ban-server -x, and then fail2banserver -f -x being run on my machine. The first exits out almost immediately. The second runs successfully (given that it's run in the foreground).

Both of these have server/server.py modified per #452, so that signal.signal(signal.SIGHUP, signal.SIG_IGN) is moved to before the fork.

http://windy.fakelag.net/fail2banserver-movedsignal-bg.txt
http://windy.fakelag.net/fail2banserver-movedsignal-fg.txt

@grooverdan grooverdan added a commit to grooverdan/fail2ban that referenced this issue Nov 23, 2013

@grooverdan grooverdan ENH: move signal.signal(signal.SIGHUP, signal.SIG_IGN) before fork in…
… server. closes #446
f2c529c
Contributor

chtheis commented Dec 1, 2013

I'm the maintainer of the FreeBSD port. Daniel told me about the ongoing discussion here.
I'm not a specialist in signal and fork, so I don't know if I'd be of great help here.

I'm running fail2ban on my (dual core but not to fast) server without any problems. I'v tried to start the fail2ban server on a virtual machine with an almost fresh installation of FreeBSD via "fail2ban-client -x start" and "fail2ban-server -x" and the server always started. I'm using the current version of FreeBSD, 9.2.
So it may be a race condition. But I'm surprised that it didn't hit us on any other Unix platform.

I'm sorry that I couldn't help you more.

br Christoph

Contributor

grooverdan commented Dec 2, 2013

thanks for testing @chtheis .

@grooverdan grooverdan closed this in #452 Dec 13, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment