Init scripts frequently fail to start their daemons #395

Closed
bitprophet opened this Issue Aug 19, 2011 · 19 comments

Owner

bitprophet commented Aug 19, 2011

Description

I've gotten multiple reports of this on IRC, as well as a comment on #350, and now a mailing list thread.

No clear cause yet, and while it's been reported multiple times I don't expect that it is a constant problem or we'd be hearing much more about it. In some very limited testing on my end so far, I can recreate the problem maybe 30-50% of the time -- but it is reproducible.

The symptom is simply that init-style scripts, which are responsible for starting daemons and then returning immediately, report success (return code 0 and a "success" status message printed to stdout) but do not actually spin up the daemon in question.

My personal test was done via latest master targeting an Ubuntu 10.04 (Lucid) VM and the stock Apache2 package's init script.


Originally submitted by Jeff Forcier (bitprophet) on 2011-07-23 at 07:25pm EDT

Relations

  • Related to #350: fabric hangs up some remote command (for daemon program)

@ghost ghost assigned bitprophet Aug 19, 2011

Owner

bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


Instrumented the init script I'm testing and everything seems to run the same either way (i.e. real success or fake success scenarios), implying the problem is within the apachectl call the script itself makes.

Starting to think about what the cause could be on our end:

  • Since it's semi-random, that makes me think of the past issues with the race-condition-plagued IO subsystem. However, I can't really think of how that could possibly affect something on the remote end in this manner, and the race conditions were all local anyway.
    • One way to test this could be to see if this issue comes up with Fab 0.9.x and pty=True (to match the current default in 1.x).
  • It could also just be pty-related -- I don't recall this being an issue prior to 1.0 and setting pty to True was one of the major changes in default behavior. Though again, I can't see why using SSH's request-a-pty subsystem would cause init scripts to behave this way.
    • A test here would be to use ssh -t <hostname> <command> and see if that also reproduces the problem.

on 2011-07-23 at 07:45pm EDT

Owner

bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


apache2ctl itself is also simply a wrapper Bash script calling /usr/sbin/apache2, which is a symlink to an actual binary executable in the Apache mpm-worker install location. Specifically, in normal start usage it calls /usr/sbin/apache2 -k start. As before, apache2ctl doesn't seem to behave any differently in the two different scenarios, re: return value or which sections are executed.

/usr/sbin/apache2's docs are relatively limited (even on Apache's site), stating only that you should be using apachectl to set up env vars (which is accurate -- running apache2 by itself bails out pretty obviously with errors about those vars not being set.)

Examining the output of env just prior to apache2ctl's invocation of apache2 yields only a few items: user, group, pidfile location and language. These do not change between success and failure situations. I was kind of hoping there would be something in the various sourcings and env var settings in the wrappers which would change sometimes, but no.


So far this isn't going anywhere useful. Time to test the above ideas (pty, ssh) to see what changes there.


on 2011-07-23 at 08:46pm EDT

Owner

bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


With pty=False, it does appear to work much better (as implied by Max's comment in #350). With the default True setting, I was seeing failures roughly 5/10 times, sometimes a few more or less. With False, I've just run it about 15 times in a row with zero failures. Not a statistician but that seems pretty good to me.

Running ssh by hand yields similar results: ssh -t <host> sudo /etc/init.d/apache2 start will silently fail to start Apache approximately 50% of the time. The same with -T (force no pty) and it starts 100% of the time.
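The manual comparison above can be scripted. What follows is only a minimal sketch, not part of Fabric: the function names, the `pgrep` check and the stop/start sequence are assumptions made for illustration, and it presumes key-based login and passwordless sudo on the target. Only `ssh -t`, `ssh -T` and the init script path come from the thread. Note that when the local stdin is not a terminal, ssh needs `-tt` rather than `-t` to force pty allocation.

```python
import subprocess


def ssh_argv(host, command, pty):
    """Build an ssh command line that requests (-t) or disables (-T) a pty."""
    return ["ssh", "-t" if pty else "-T", host, command]


def count_failed_starts(host, pty, attempts=10):
    """Start Apache repeatedly; count runs where no apache2 process exists afterwards.

    Hypothetical helper: assumes pgrep is available on the remote host.
    """
    failures = 0
    for _ in range(attempts):
        # Make sure the daemon is down before each attempt (no pty needed here).
        subprocess.call(ssh_argv(host, "sudo /etc/init.d/apache2 stop", pty=False))
        # The call under test: same init script, with or without a pty.
        subprocess.call(ssh_argv(host, "sudo /etc/init.d/apache2 start", pty=pty))
        # pgrep exits non-zero when no matching process is found.
        if subprocess.call(ssh_argv(host, "pgrep apache2", pty=False)) != 0:
            failures += 1
    return failures
```

Calling `count_failed_starts(host, pty=True)` and then `count_failed_starts(host, pty=False)` against the same machine should show whether the failure rate differs the way it did in the runs described above.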

So this isn't Fabric's fault; it's something deeper where these init scripts misbehave when an SSH style pseudo-tty is in play.


Going to dig a bit deeper for curiosity's sake, but it looks like the "solution" here is a new FAQ stating to use pty=False when this problem is encountered.


on 2011-07-23 at 08:59pm EDT

Owner

bitprophet commented Aug 19, 2011

Jeff Forcier (bitprophet) posted:


Yea, not finding anything that explains this behavior, unfortunately. Given the findings above I think an FAQ is definitely the way to go.


on 2011-07-23 at 10:35pm EDT

Owner

bitprophet commented Aug 19, 2011

Hugo Garza (hiro2k) posted:


Ugh, I just ran into this yesterday. I wish I had seen this bug report; luckily I tried setting pty=False and it worked as well. Thanks for the explanation, at least it's not Fabric's fault. Now you really have me wondering why this fails.


on 2011-08-02 at 01:27pm EDT

cwood commented Oct 19, 2011

Are you sure this isn't just a bash script issue too? I mean with my mailing list thread. They were just bash scripts that started java and weblogic.

yuvadm commented Oct 27, 2011

FWIW, I'm getting this horrible behavior on pretty much every Ubuntu machine I spin up on EC2.

It's also reproducible with tasks launched via a detached screen, e.g. screen -d -m someBackgroundTask.

I should mention that usually pty=False solves the problem, but I've seen instances where that wasn't the case.

Owner

bitprophet commented Oct 27, 2011

@yuvadm -- in those cases where pty=False does not solve the problem, can the problem still be recreated by using a regular ssh command (as mentioned above)? As far as I've seen it's an SSH problem and not a Fabric one, but it would be good to know if there are any situations where it does not match up.

yuvadm commented Oct 27, 2011

That's an interesting angle to check, I'll get back to you on that one...

gonvaled commented Dec 6, 2011

I have reproduced this problem. Client is Ubuntu 10.04.3 LTS, server is "Ubuntu 8.04.4 LTS (server)".
SSH client is "OpenSSH_5.3p1 Debian-3ubuntu7, OpenSSL 0.9.8k 25 Mar 2009", ssh server is "OpenSSH_4.7p1 Debian-8ubuntu1, OpenSSL 0.9.8g 19 Oct 2007". Fabric is "1.3.3 final".

The issue is there 100% with pty = True, and it disappears with pty = False.

Connecting to other servers, the issue is not always there when pty = True.

In my case, for testing, I am running a very simple command: "nohup sleep 100 > /tmp/xxx 2>&1 </dev/null &"

linus commented Jan 24, 2012

I've been bitten by this, only on EC2 as it seems (I haven't seen it on my Linode, but I'm not 100% sure). Setting pty=False seems to fix it.

@bitprophet bitprophet closed this in 611133c Apr 2, 2012

Just ran into this problem.
I had a situation where I can't use pty=False because I run the command with sudo.
Adding >& /dev/null < /dev/null & executes without error, but the process isn't started.

I solved the problem by adding a sleep after the command: nohup java -jar text.jar & sleep 5; exit 0

previa commented Jan 30, 2013

Thanks spodgruskiy,

Your tip works for me.
I had tried writing a fab task to start a Storm cluster with the following commands.

  1. run('nohup ./bin/storm nimbus >& /dev/null < /dev/null &', pty=False)
  2. run('nohup ./bin/storm nimbus >& /dev/null < /dev/null &')
  3. run("screen -d -m './bin/storm nimbus' ", pty=False)
  4. run("||screen -d -m './bin/storm nimbus' ")

But none of them worked; nimbus didn't start at all. I don't understand what happened.
Anyway, thanks.

clayg commented Feb 5, 2013

+1 for the sleep trick

needed to work on systems with requiretty

sudo('start service; sleep .5') and all is well!

Where you are using 'sudo()' and the remote system has RequireTty enabled for sudo access, you can use 'set -m; service start' to prevent the SIGHUP from being sent to the process started by the init script.

See http://stackoverflow.com/a/14866774 for a more detailed explanation of bash interactive versus non-interactive mode and how that affects job control.
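For systems where a tty is required, the two workarounds mentioned in this thread (a trailing sleep, or enabling job control with `set -m`) both amount to small changes to the command string. A minimal sketch, assuming the helper names below, which are hypothetical and not part of Fabric:

```python
def with_job_control(command):
    """Prefix a shell command with `set -m` so bash enables job control."""
    return "set -m; " + command


def with_grace_period(command, seconds=0.5):
    """Append a short sleep so the shell stays alive briefly after starting the daemon."""
    return "%s; sleep %s" % (command, seconds)


# With Fabric 1.x the resulting strings would be passed to sudo() or run(),
# e.g. sudo(with_job_control("service start")), leaving pty at its default.
```

The sleep duration is a guess in either direction: it only needs to outlast the daemon's own startup, which varies by service.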

benjyz commented Jan 12, 2014

I'm curious, what's the ssh issue here?

pty=false works for me

It's not really an SSH problem; it's more the subtle behaviour around Bash non-interactive/interactive modes and signal propagation to process groups.

The following is based on http://stackoverflow.com/questions/14679178/why-does-ssh-wait-for-my-subshells-without-t-and-kill-them-with-t/14866774#14866774 and http://www.itp.uzh.ch/~dpotter/howto/daemonize, with some assumptions not fully validated, although my tests of how this works seem to confirm them.

pty/tty = false

The bash shell launched connects to the stdout/stderr/stdin of the started process and is kept running until there is nothing attached to the sockets and its children have exited. A good daemon process will ensure it doesn't wait for its children to exit: it forks a child process and then exits. In this mode no SIGHUP will be sent to the child process by SSH. I believe this will work correctly for most scripts executing a process that handles daemonizing itself and doesn't need to be backgrounded. Where init scripts use '&' to background a process, it's likely that the main problem will be whether the backgrounded process ever attempts to read from stdin, since that will trigger a SIGHUP if the session has been terminated.

pty/tty = true

If the init script backgrounds the process it starts, the parent bash shell will return an exit code to the SSH connection, which will in turn look to exit immediately since it isn't waiting on a child process to terminate and isn't blocked on stdout/stderr/stdin. This will cause a SIGHUP to be sent to the parent bash shell's process group, which, since job control is disabled in non-interactive mode in bash, will include the child processes just launched. Where a daemon process explicitly starts a new process session when forking, or in the forked process, neither it nor its children will receive the SIGHUP from the bash parent process exiting. Note this is different from suspended jobs, which will see a SIGTERM.

I suspect the reason this only works sometimes is a slight race condition. If you look at the standard approach to daemonizing (http://www.itp.uzh.ch/~dpotter/howto/daemonize), you'll see that the new session is created by the forked process, which may not have run before the parent exits, resulting in the random success/failure behaviour mentioned above. A sleep statement allows enough time for the forked process to create a new session, which is why it works in some cases.
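The race described above can be illustrated with a minimal Python sketch. It is a simplified, single-fork stand-in for the approach in the linked howto, not a full daemonizer, and `spawn_detached` is a hypothetical name. The parent returns from `fork()` right away, while the child only leaves the original session once it reaches `os.setsid()`; a SIGHUP sent to the original process group inside that window still reaches the child.

```python
import os


def spawn_detached(target):
    """Fork once; the child creates its own session, then runs target().

    Returns the child's PID in the parent. Between fork() returning in the
    parent and setsid() running in the child, the child is still a member of
    the parent's session and process group.
    """
    pid = os.fork()
    if pid > 0:
        return pid       # parent: returns immediately, possibly before setsid()
    try:
        os.setsid()      # child: become leader of a new session and process group
        target()
    finally:
        os._exit(0)      # never fall back into the parent's code path
```

If the parent exits (and the SIGHUP is delivered) before the child is scheduled, the daemon dies; if the child wins, it survives. That matches the roughly random success rate reported earlier in the thread, though this sketch does not prove that is what Apache's startup does.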

pty/tty = true and job control is explicitly enabled in bash

SSH won't connect to the stdout/stderr/stdin of the bash shell or any launched child processes, which means it will exit as soon as the parent bash shell has finished executing the requested commands. In this case, with job control explicitly enabled, any processes launched by the bash shell with '&' to background them will be placed into a separate session immediately and will not receive the SIGHUP signal when the parent process of the bash session (the SSH connection in this case) exits.

What's needed to fix

I think the solutions just need to be explicitly mentioned in the run/sudo operations documentation as a special case when working with background processes/services. Basically, either use 'pty=False' or, where that is not possible, explicitly enable job control as the first command, and the behaviour will be correct.

Ichimonji10 added a commit to Ichimonji10/automation-tools that referenced this issue Jan 19, 2015

Disable PTY when working with docker init script
Docker has a non-standard approach to daemonizing: `docker -d` stays in the
foreground, rather than forking in to the background. This, combined with the
naive init script distributed with docker, can cause commands like `ssh -t
user@host service docker restart` to silently fail. For more information, see:

* fabric/fabric#395 (comment)
* fabric/fabric#395 (comment)
* moby/moby#2758

@Ichimonji10 Ichimonji10 referenced this issue in SatelliteQE/automation-tools Jan 19, 2015

Merged

Disable PTY when working with docker init script #117

As I mentioned here in fabrickit (a wrapper around the Fabric libs): https://github.com/HyukjinKwon/fabrickit/commit/cceb8bfb8f960a3ac41b24c64b8358bd6e7a0366

You can easily start a program as a daemon without any specific configuration or settings.
This is, after all, a kind of shell execution, so there should be a way to do whatever the shell can do.

Try this:

run("sh -c '((nohup %s > /dev/null 2> /dev/null) & )'" % cmd, pty=False)

I tried this and it works perfectly fine, even if the program does not implement any additional logic to run as a daemon (even a program that just writes 'Hello' in a while loop works fine).
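One caveat with interpolating `cmd` directly between single quotes is that a command which itself contains a single quote will break the wrapper. A minimal sketch of a quoting-safe variant, assuming Python 3 (for `shlex.quote`) and a hypothetical helper name `detached_command`:

```python
import shlex


def detached_command(cmd):
    """Wrap cmd so it runs under nohup in a backgrounded subshell.

    The inner script is passed through shlex.quote, so single quotes inside
    cmd do not terminate the argument given to `sh -c`.
    """
    inner = "((nohup %s > /dev/null 2> /dev/null) & )" % cmd
    return "sh -c %s" % shlex.quote(inner)


# With Fabric 1.x this would be used as: run(detached_command(cmd), pty=False)
```

For commands without quotes the result is the same string as in the one-liner above.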

@baweaver baweaver referenced this issue in seuros/capistrano-sidekiq Apr 6, 2015

Closed

Not started sidekiq after deploy - if pty is true #23

mattsfuller pushed a commit to prestodb/presto-admin that referenced this issue May 22, 2015

presto-admin: Remove pty=False from running of init.d scripts
Summary:
It doesn't work to run init.d scripts with pty, both via Fabric and via native
ssh.  However, CentOS and some other OSes have requiretty in their
/etc/sudoers file, meaning that you get the error "sudo: sorry, you must
have a tty to run sudo".  The only way to fix this is to remove the requiretty
default in a user's /etc/sudoers file, but we don't want to force them to do
that.

The work-around is to run the init script in job control mode (e.g. set -m),
because it avoids the race condition that is the cause of the daemons not
starting when executing commands with TTY.  See
fabric/fabric#395 (comment) for an
in-depth treatment of the issue.

We also add || true when running tar, because tar can sometimes have a
non-zero exit code even when the files correctly un-tarred.

Task: SWARM-363
Review Url: @@review_url@@

Test Plan: make clean lint test-all; manual

Reviewers: anu, rschlussel

Reviewed By: rschlussel

Subscribers: an186016, mf186042

Differential Revision: https://phabricator.td.teradata.com/D395

cawallin added a commit to prestodb/presto-admin that referenced this issue May 27, 2015


@cgarciaarano cgarciaarano referenced this issue in circus-tent/circus Jun 3, 2015

Closed

Fabric with Circus issue #733

nicosmaris added a commit to nicosmaris/vm that referenced this issue Feb 4, 2017
