Improved Worker Reliability #83

Open
wants to merge 4 commits into
from

Conversation

Projects
None yet
Contributor

danhunsaker commented Jan 24, 2013

Specifically, this improves the multi-worker forking process by having the
parent process stick around and monitor the children. If a child process
dies, and the supervisor has not been instructed to shut them all down, it
will spawn a new child to replace it. It will pass signals down to all
children. Currently, it only passes the signals explicitly watched by the
Worker class. Children will be told that they have a parent process, and
check periodically whether the supervisor is still running. This check is
performed at the beginning of reserve() to ensure orphaned workers handle
their condition properly. That is, if the supervisor is gone, the child will
exit. This is intended to allow users to kill the supervisor and
realistically expect that no new jobs will be processed until a new
supervisor (or single-worker instance) is started.

The other change, not related to worker-supervisor relationships,
nevertheless still involves forking. Specifically, when a job process exits,
the exit code is already being handled if it is greater than zero. However,
if it is zero, the code prior to this commit assumed the exit was performed
by PHP Resque itself. Usually, this is correct. The trouble comes when jobs
exit without specifying an exit status - PHP defaults these to 0, for
whatever reason. In these cases, the job status is not properly updated.
Generally speaking, if you're writing jobs yourself - in their entirety -
they will responsibly not use exit() at any point. But if your jobs use a
framework, the chances of an unexpected exit() causing issues go up immensely.
So, if the exit code is zero, and the job status hasn't been updated from
queued or running, this commit's code causes the status to be updated to
completed. This assumes that such jobs are properly written to only use exit
status 0 for success, so some users might wish to change it to failed, but
that's not best practice in the slightest, so I wouldn't encourage that.

Signed-off-by: Daniel Hunsaker danhunsaker@gmail.com

Any word on the possibility of this getting merged? Seems like a very good patch and could help solve the painful issue of restarting many workers when you deploy...

ShonM commented Feb 25, 2013

The first part seems pretty close to https://github.com/ebernhardson/php-resque-pool

Rebooting all workers would be as simple as kill -HUPps -A|awk '/[r]esque-pool/{print $1}'``

Contributor

danhunsaker commented Feb 27, 2013

@ShonM: It is certainly similar, but there are some differences, some of which are significant. For one, php-resque-pool acts as a wrapper process around the PHP Resque library, while this pull request integrates this kind of functionality into the PHP Resque library itself. This is significant because PHP Resque already did most of this anyway - the changes I've introduced here simply complete this functionality.

This in no way affects the usefulness of php-resque-pool, however, because it still does a couple of things that PHP Resque still doesn't natively handle. Specifically, the ability to use a YAML document to configure a worker pool, and the ability to specify multiple queue groups in the same pool (for example, 1 worker on one queue, another 1 on a second queue, and 4 on every queue - this is three separate queue groups in a single worker pool). Each pool created using only the native support can only operate on one queue group at a time, and the configuration can only be set via the environment.

I could also comment on how variable the definition of "as simple as" tends to be, but that would just be trolling, so I'll refrain from further comment on that. :)

Ultimately, though, the decision of whether to include this patch will probably come down to whether PHP Resque is intended to operate identically to its Ruby counterpart, or if it is seen as independent of that project and should stand on its own. (And probably also on whether or not the Ruby version handles multiple workers natively...) That's a factor I can't speak to - only @chrisboulton can make that call.

jbrower commented Mar 4, 2013

I'd like to see this get merged in. While php-resque-pool works, it's a distinct project that may not be as well supported as this one.

Contributor

danhunsaker commented Mar 7, 2013

Had a bit of a scare earlier today when updating our internal version of PHP-Resque to the latest while still preserving the modifications here. (We also have reasons to forbid the use of Composer, so I had to add the require_once() logic manually, but that part wasn't too difficult...) It seems one of our working branches got corrupted with a prior attempt at what eventually became this branch, and it was still firing and responding to SIGALRM.

This is a problem because PHP doesn't resume execution at the point where the alarm signal was received. At least, it doesn't do so properly. So the child forks created to execute jobs in a safe space ended up never terminating - or, indeed, completing their assigned tasks - and very quickly things were so broken they were useless. Took a while to track down the reversion, but once removed, things resumed their former solid operation.

Just to be safe, I checked that the same reversion hadn't made its way into this PR. Luckily, it hadn't, but I had forgotten to remove an output coloration artifact. I have removed that now, and updated the PR to merge in the latest content of master. Hopefully, this will make approving the merge possible without any conflicts, as it shouldn't be more than a fast-forward merge. 😄

Contributor

danhunsaker commented May 12, 2013

Updated again with latest changes upstream. Had to resolve a merge conflict involving the PREFIX env var addition, so I figured I'd push the merge commit to keep the pull simple, should this PR get merged.

Contributor

danhunsaker commented Aug 14, 2013

Following up again so this doesn't get lost completely in the shuffle of issues and PRs. This code has been in use, in production, for months now - at least as long as the PR has existed, and IIRC, somewhat longer - with no negative effects. So, yeah, just curious what the verdict on this kind of functionality will be.

Dynom commented Aug 30, 2013

I'd like to see this in too, what do you think @chrisboulton ?

👍

Aeon commented Oct 15, 2013

👍 please consider merging this...

A side-benefit of getting this merged would be the possibility of being able to run this via Foreman.

(Foreman, like Monit, thinks the main process has gone away and bails immediately on a foreman start.)

mjsilva commented Apr 7, 2014

Hey guys I just stumble into this problem when I tried to supervise php resque consumers with threads (COUNT=>1).

Is there any plans to merge this? Or any interim solution?

Thanks!

parkan commented Jan 3, 2015

+1 on this, trying to use php-resque on heroku and experiencing the same problem

@parkan parkan referenced this pull request in parkan/php-resque Jan 3, 2015

Merged

Improved forking #1

parkan commented Jan 3, 2015

Hmm I am seeing

PHP Parse error: syntax error, unexpected '{' in [..]/php-resque/bin/resque on line 127

bin/ doesn't seem to be tested, so the ci/unit tests pass even though this fails

problem is at HERE--> file_put_contents($PIDFILE, getmypid()) or HERE-->{...

I don't think "or {..." is valid syntax in any recent php version.

Contributor

danhunsaker commented Jan 4, 2015

Hrm. Not sure what I was thinking, there... Must have recently been working with JavaScript, I suppose.

As to bin/resque not being tested, it was always intended to be an example script upon which to build your own, so testing it does feel a little ... superfluous. Of course, it's almost never been treated as an example by users of the project, so we submit patches to it that make it more robust for most uses, and reduce its example status ever further with each one. So it might be a good idea to have tests for it as well, though it would likely need to be reworked some to be made more (or even actually) testable.

Any update on this? Would really love to see this merged!

dkcwd commented Jul 23, 2015

It would really be great to see this merged.
👍

sleenen commented Sep 19, 2015

@chrisboulton I would love this merged too. Having the ability to correctly monitor multiple spawned workers is just what I need. Can you please have a look?

+1 this would be awesome to merge.

danhunsaker and others added some commits Jan 24, 2013

Improved Worker Reliability
Specifically, this improves the multi-worker forking process by having the
parent process stick around and monitor the children.  If a child process
dies, and the supervisor has not been instructed to shut them all down, it
will spawn a new child to replace it.  It will pass signals down to all
children.  Currently, it only passes the signals explicitly watched by the
Worker class.  Children will be told that they have a parent process, and
check periodically whether the supervisor is still running.  This check is
performed at the beginning of reserve() to ensure orphaned workers handle
their condition properly.  That is, if the supervisor is gone, the child will
exit.  This is intended to allow users to kill the supervisor and
realistically expect that no new jobs will be processed until a new
supervisor (or single-worker instance) is started.

The other change, not related to worker-supervisor relationships,
nevertheless still involves forking.  Specifically, when a job process exits,
the exit code is already being handled if it is greater than zero.  However,
if it is zero, the code prior to this commit assumed the exit was performed
by PHP Resque itself.  Usually, this is correct.  The trouble comes when jobs
exit without specifying an exit status - PHP defaults these to 0, for
whatever reason.  In these cases, the job status is not properly updated.
Generally speaking, if you're writing jobs yourself - in their entirety -
they will responsibly *not* use exit() at any point.  But if your jobs use a
framework, the chances of an unexpected exit() causing issues go up immensely.
So, if the exit code is zero, and the job status hasn't been updated from
queued or running, this commit's code causes the status to be updated to
completed.  This assumes that such jobs are properly written to only use exit
status 0 for success, so some users might wish to change it to failed, but
that's not best practice in the slightest, so I wouldn't encourage that.

Signed-off-by: Daniel Hunsaker <danhunsaker@gmail.com>
Removed output coloration artifact from custom use case implementation.
Signed-off-by: Daniel Hunsaker <danhunsaker@gmail.com>
Cleaner PIDFILE success check
- Use `if` instead of logic operators
- Check for `false` explicitly
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment