
Fix lost jobs in multiple workers with count > 1 (issue #32) #43

Closed
wants to merge 2 commits

Conversation

@salimane (Contributor) commented Feb 2, 2012

Hi,
I tracked down issue #32. It seems to be caused by the main parent process not staying alive to wait for all of its children to exit before exiting itself. I've also registered a shutdown function that kills the child processes on exit.
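In outline, the parent/child lifecycle looks something like this (a sketch only, not the pull request's exact code; the COUNT value, signal choice, and worker startup are illustrative, and it assumes the pcntl and posix extensions):

```php
<?php
// Sketch: fork $count workers, keep the parent alive until every
// child has exited, and signal any survivors from a shutdown function.

$count = 5; // illustrative COUNT value
$children = array();

for ($i = 0; $i < $count; ++$i) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("Could not fork worker $i\n");
    }
    if ($pid === 0) {
        // Child: run the worker loop here, then exit.
        exit(0);
    }
    $children[$pid] = true;
}

// On parent exit, signal any children that are still running.
register_shutdown_function(function () use (&$children) {
    foreach (array_keys($children) as $pid) {
        posix_kill($pid, SIGQUIT);
    }
});

// Parent stays alive until every child has exited.
while (!empty($children)) {
    $exited = pcntl_wait($status);
    if ($exited > 0) {
        unset($children[$exited]);
    }
}
```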
Thanks

@kballenegger

I just read over the code and it looks good. I was hitting the same problem; I'll try to integrate this pull request into my setup and confirm whether it solves the issue.

@roynasser

Great, I will give this a try too... it will make things easier to manage on multiple machines with multiple queues, since I was planning on running with multiple processes...

@roynasser

Hello Salimane,

Despite having possibly fixed some other bug, it doesn't seem to have fixed the issue in #32 from what I see...

I just started a worker process with count=10 and sent 5 jobs to the queue; it processed only 3. I caught
Feb 2 15:40:15 VirtWebTeste php: PHP Warning: json_decode() expects parameter 1 to be string, array given in /xxxx/xxxx/xxx/Resque.php on line 92

in the messages log, so it is still spitting out some non-JSON where it apparently did before (if you look at issue #32, I added some debugging just above line 92, as referenced above, which would print whenever it didn't get what it was supposed to...).

In the Resque control panel (I'm running the original control panel to keep tabs on things, despite only having PHP workers and queuers), it also shows only the number of "successful" jobs in the processed column, and doesn't count the lost ones anywhere... they are really "lost"...

Also, this modification adds a requirement for PHP to have POSIX support. This isn't the default on all systems and caused me a couple of minutes of puzzlement as to why some jobs were queuing and failing. I did install php-posix, but the problem still persists...

I did multiple tests and couldn't find any difference with or without the patch... (jobs are queued correctly and obtain a job ID, they just get lost when popped).

@salimane (Contributor, Author) commented Feb 3, 2012

@RVN-BR that error you got is weird... anyway, I will check on it and report back.

…same socket leading to improper jobs handling in multiple workers with count > 1
@salimane (Contributor, Author) commented Feb 3, 2012

@kballenegger @RVN-BR I've added another commit.
Even though the first commit fixed some other problems, it seems there was another race condition, which commit 2030742 fixes.

Basically, after fork(), the two PHP processes (parent and child) keep using the same shared socket. There is then a race condition between the two processes trying to read from that socket, which sometimes leads to one of them getting the wrong reply from the Redis server (that explains the stray "+OK", ":1", ...).

So the fix is to actually reconnect to the backend Redis server in the child process. I've added a function in Resque.php to handle that, and I call it in Worker.php and resque.php respectively.
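In outline, the child-side fix looks something like this (a sketch only; the setBackend-style helper and server address are illustrative, not the exact code in the commits):

```php
<?php
// Sketch: after fork(), parent and child share the inherited Redis
// socket, so their replies can interleave. The child must open its
// own connection before talking to Redis again.

$pid = pcntl_fork();
if ($pid === -1) {
    die("Unable to fork\n");
}

if ($pid === 0) {
    // Child: abandon the shared socket and reconnect from scratch.
    Resque::setBackend('localhost:6379'); // illustrative helper call
    // ... pop and process jobs over the private connection ...
    exit(0);
}

// Parent: keeps its original connection and waits for the child.
pcntl_waitpid($pid, $status);
```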

Please grab commits ca909ca and 2030742 and check whether they solve the issue.
Thanks

@roynasser

Hey Salimane, thanks for the activity on this! I will retry the tests in a couple of hours. I'll hold off on updating issue #32 until we get some news, just so as not to confuse anyone.

I'll report back later today.

brgds.

@roynasser

CONGRATULATIONS Salimane! :)

It looks like you have a winner!

I've tested this on about 1k jobs with nothing lost or failed... looks good... More testing will ensue in the following weeks, but I'd say this should be considered a final solution to the issue and, IMHO, committed.

Good work!

@salimane (Contributor, Author) commented Feb 6, 2012

@RVN-BR thanks, at least now the issue is solved.

@chrisboulton (Owner)

Hey folks,

I've just committed back a bit of a different implementation to fix this in ebe7665. It's basically applying the same concept as @salimane has in this pull request, except I've implemented it entirely in Resque.php without having to add resetBackend() calls across the code base.
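Conceptually, the centralized version amounts to remembering which PID opened the connection and reconnecting lazily when the caller's PID changes (a sketch of the idea only; the actual code in ebe7665 may differ, and the client class and defaults here are illustrative):

```php
<?php
// Sketch: Resque::redis() reuses the connection only inside the
// process that opened it; a forked child gets a fresh one.

class Resque
{
    protected static $redis = null;       // current connection, if any
    protected static $redisPid = null;    // PID that opened it
    protected static $host = 'localhost'; // illustrative defaults
    protected static $port = 6379;

    public static function redis()
    {
        if (self::$redis !== null && self::$redisPid === getmypid()) {
            return self::$redis; // same process: reuse
        }
        // First call, or we are in a forked child: connect afresh.
        self::$redis = new Redisent(self::$host, self::$port);
        self::$redisPid = getmypid();
        return self::$redis;
    }
}
```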

My testing on this is now passing, so I'm going to close out this pull request. If you have an issue with the implementation or it's not working, let me know.

Thanks for all of the work on this everyone!

@roynasser

I seem to now be having issues with monit... It's just starting resque workers over and over... It ended up bringing my dev server down because I didn't realize it until I had thousands of workers running, oops :)

Anyway, it seems to not be writing the PIDFILE?

I haven't totally isolated it to this particular change, but I'm pretty sure (as I can't see adding jobs breaking the PID functionality... and all I did recently was revert to the changed resque.php and add a few new job definitions...).

any ideas?

@salimane (Contributor, Author)

Yep, I've hit this problem before. I've solved it in my fork with this pull. Please check whether these changes solve your problem. Note that I've modified resque.php and extras/resque.monit.

@roynasser

Not working, no...

Shouldn't monit be watching the pidg file in this case? (It doesn't work either way, but I just thought it would make more sense? Monit always reports execution failures and keeps spawning a sh&tload of resque processes...)

EDIT: Salimane, it turns out you have a typo, which is why it wasn't working, hehehe, my bad... since the monit stuff doesn't cry out errors, it took me a little while to catch it... it's on line 69 of your resque file, IIRC... file_psut instead of file_put...

I also just noticed this in the log: Mar 12 21:50:09 php: PHP Warning: json_decode() expects parameter 1 to be string, array given in I/v1.2/Private/Jobs/lib/class.resque.php on line 117

Not sure why that is happening again... argh... I can't seem to catch a break, really...

@chrisboulton (Owner)

Am I right in saying that you're using Monit with COUNT > 1? A general question I have there is how are you handling the situation where a single worker in the group might die? If I'm thinking correctly, then Monit won't detect that and you've basically lost a worker.

When we run php-resque in production, we always run multiple workers individually, that is, COUNT=1, with Monit watching each of them, primarily for the above reason.

@roynasser

Hi Chris,

Yes, I'm using count > 1 (mainly for my development right now, and to simulate the worker pool I will have across several servers, etc.)... I will probably switch to several monit "instances" with count=1 for each (implied), to prevent exactly what you are talking about...

That also seems to fix the json_decode errors, which I suppose resolves the issue... nonetheless, there is still something that fundamentally seems to upset resque when run at greater counts... :)

@salimane (Contributor, Author)

@RVN-BR sorry for the typo; I've pushed the typo fix.
I'm running count > 1 in production on multiple machines with no problems (well, with my fork).
I think you have to make sure you're running the latest version with all the changes.
This error ("PHP Warning: json_decode() expects parameter 1 to be string, array given in I/v1.2/Private/Jobs/lib/class.resque.php on line 117") means you haven't got the latest changes. Please catch your breath, start from scratch with clean code, add the pull's changes, and report back.

@chrisboulton

> A general question I have there is how are you handling the situation where a single worker in the group might die? If I'm thinking correctly, then Monit won't detect that and you've basically lost a worker.

In my pull, the workers' parent process takes care of monitoring its children; Monit just monitors the parent process. That's why, in the pull, I added code to not let the parent die until all of its children (workers) are dead.
Thanks

@salimane (Contributor, Author)

@RVN-BR

> Shouldn't monit be watching the pidg file in this case?
Monit can't monitor the pidg file, because it doesn't contain the PID of a single process. Monit can monitor the PID of the parent process, but its stop command should not kill only that PID, because that would leave zombie children. That's why I used the process-group PID file in the monit stop command.
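Something along these lines (a sketch only; the service name, paths, and signal are illustrative, not the actual extras/resque.monit):

```
# Sketch: monit watches the parent's PID, but stops via the group PID.
check process resque_workers with pidfile /var/run/resque/parent.pid
    start program = "/bin/sh -c 'QUEUE=* COUNT=5 php resque.php'"
    # Signal the whole process group (note the negative PGID read from
    # the .pidg file) so no child workers are orphaned on stop.
    stop program = "/bin/sh -c 'kill -QUIT -- -$(cat /var/run/resque/parent.pidg)'"
```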

@roynasser

Hmmm, I think you are right, @salimane... I must have done something wrong, because when I was running your proposed fix all was fine... (I had modified all the files and kept backups). When @chrisboulton submitted his updated fix, I thought I'd switch to that (god knows why :p), and that's when things went awry again...

Are there any downsides to @salimane's way of doing it? I wouldn't mind having only one monit entry for multiple counts, even in production...

I shouldn't be entering production for another couple of weeks, but I need something I can work with without breaking for at least that long :s

Looking through the thread, for count > 1 I think salimane's pull is the only way to go, right? (Chris's current version isn't writing the PID file, so it will overflow with children when count > 1)... I guess I'll revert to this for now and leave it until I have a good reason to change...?

@salimane (Contributor, Author)

@RVN-BR in my fork I'm also using @chrisboulton's updated fix for count > 1, but together with this pull. So as long as you're running his latest with my pull, it should be fine.
For me, I can't run count = 1 in prod because of resources, so I have to run count > 1 and fix whatever needs fixing.
