deleting lock for master? #152

Closed
afrozl opened this issue Aug 11, 2016 · 10 comments

@afrozl

afrozl commented Aug 11, 2016

I just noticed that delayed jobs are no longer getting scheduled. Is there a way to check and see if a scheduler lock is 'stuck'? I suspect that clearing the redis db would restart the scheduler polling, but I would like to avoid that, if possible.

@evantahler
Member

I'm going to bet that that is not the issue. If you are running more than one scheduler, yes, one will take the "master" role and lock the others out. However, that lock only exists for 3 minutes (unless you override the default).
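If you do want to override it, that's a scheduler constructor option. Roughly like this (just a sketch; the option is called masterLockTimeout in the node-resque versions I've looked at, and the connection details here are placeholders, so check against your own version and config):

// Sketch: start a scheduler with an explicit master-lock TTL.
const NodeResque = require("node-resque");

async function startScheduler() {
  const scheduler = new NodeResque.Scheduler({
    connection: { host: "127.0.0.1", port: 6379, namespace: "resque" },
    masterLockTimeout: 60 * 3, // seconds before the master lock expires (the 3-minute default)
  });
  await scheduler.connect();
  await scheduler.start();
  return scheduler;
}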

If you want to inspect the key in redis, it takes the form of self.connection.key('resque_scheduler_master_lock'), so with the default options that would be resque:resque_scheduler_master_lock.
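For example, with a plain redis client (this sketch assumes ioredis and the default resque namespace; use whatever client and namespace you actually have):

// Sketch: inspect the scheduler master lock.
const Redis = require("ioredis");

async function inspectMasterLock() {
  const redis = new Redis(); // defaults to 127.0.0.1:6379
  const key = "resque:resque_scheduler_master_lock";
  const holder = await redis.get(key); // which scheduler holds the lock, or null
  const ttl = await redis.ttl(key);    // seconds until it expires; -2 means the key is gone
  console.log({ holder, ttl });
  await redis.quit();
}

(A ttl of -1 means the key has no expiry set at all, which would line up with a lock that never goes away.)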

@afrozl
Author

afrozl commented Aug 11, 2016

I'm not sure how it got 'stuck' but that was indeed the case. I took a look at the redis key and it was locked by a scheduler that no longer existed. As soon as I deleted the key, all the delayed jobs kicked in.
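For anyone else who hits this: clearing it is just a single redis DEL on that key. Something along these lines works (same caveats as above; ioredis and the default namespace are assumptions):

// Sketch: clear a stuck master lock so the next live scheduler can take it.
const Redis = require("ioredis");

async function clearMasterLock() {
  const redis = new Redis();
  await redis.del("resque:resque_scheduler_master_lock");
  await redis.quit();
}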

afrozl closed this as completed Aug 11, 2016
@davbeck

davbeck commented Aug 4, 2017

I also ran into this. The lock had been stuck for a week.

@maxschmeling

My lock gets stuck all the time, and it's a major problem. I'll be digging into the code to see why, but if anyone has any thoughts on what's happening, I would appreciate it.

@evantahler
Member

9 times out of 10 it is improper shutdown behavior. How are you running your workers, and how long do you give them before the SIGKILL signal and a hard shutdown (kill -9)? How long is your average job duration?

@maxschmeling

@evantahler I'm quite certain I have invalid shutdown behavior, but I thought the timeout on the scheduler was to prevent improper shutdown from being an issue.

Some of our jobs are a couple minutes long, but most are less than a minute (and 10% or so are very short).

I'm running on Heroku. I'll look at my shutdown handling and see how it can be improved.

@maxschmeling

maxschmeling commented Jul 12, 2018

I ended up with something like this and it seems to be working OK for now. Just thought I'd share; I'm not at all saying this is the best or most correct way.

// Called from SIGINT and SIGTERM
async function gracefulShutdown(worker, scheduler, queue, librato) {
  // If shutdown hangs longer than shutdownTimeout (ms, defined elsewhere),
  // throw so the process dies instead of lingering.
  const stopProcessTimeout = function() {
    throw new Error("process stop timeout reached. Terminating now.");
  };
  setTimeout(stopProcessTimeout, shutdownTimeout);

  // Exit the process once the worker emits its exit event.
  worker.on("exit", process.exit);

  // Stop the worker, scheduler, and queue connections in parallel;
  // ending the scheduler releases the master lock.
  await Promise.all([worker.end(), scheduler.end(), queue.end()]);
}

@evantahler
Member

That's exactly what you should do if your application is controlled by signals (which it certainly is on Heroku). You should also stop your HTTP server and close any other connections you have open.
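Roughly, the wiring looks like this (an untested sketch; gracefulShutdown is the function from the comment above, and httpServer, worker, scheduler, queue, and librato stand in for whatever your app actually creates):

// Sketch: hook process signals up to the graceful shutdown above.
async function shutdown(signal) {
  console.log(`received ${signal}, shutting down`);
  // Stop accepting new HTTP connections before draining resque.
  await new Promise((resolve) => httpServer.close(resolve));
  await gracefulShutdown(worker, scheduler, queue, librato);
  process.exit(0);
}

process.on("SIGTERM", () => shutdown("SIGTERM")); // what Heroku sends on dyno shutdown
process.on("SIGINT", () => shutdown("SIGINT"));   // Ctrl-C locally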

Actionhero does something similar https://github.com/actionhero/actionhero/blob/master/initializers/resque.js#L160-L166 (where those async stop() methods are in a signal catch: https://github.com/actionhero/actionhero/blob/master/bin/methods/start.js#L133-L135)

@evantahler
Member

... would you mind contributing something to the README about this?

@maxschmeling

Absolutely. I'll send a pull request tomorrow or Monday.
