New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restarting the server may stop the recurring jobs forever #131

Closed
abdelhady opened this Issue Jan 1, 2015 · 24 comments

Comments

Projects
None yet
@abdelhady
Copy link

abdelhady commented Jan 1, 2015

I'm starting my job every 30 minutes, but sometimes when restarting the server, the job stops working forever until I remove it from the agendaJobs collection manually, & I tried smaller periods for testing like 10 seconds, & it is now a consistent behavior, my code is like this:

agenda.start()
graceful = ()->
    agenda.stop ()->
        process.exit 0

process.on 'SIGTERM', graceful
process.on 'SIGINT' , graceful

agenda.define 'my job', my_job_function
agenda.every '10 seconds', 'my job'

And as a workaround, I decided to cancel all recurring jobs when stopping the server like this:

graceful = ()->
    agenda.cancel repeatInterval: { $exists: true, $ne: null }, (err, numRemoved)->
        agenda.stop ()->
            process.exit 0

With this workaround, now restarting the server doesn't affect any of the recurring jobs,

but, I don't know why I have to cancel all the recurring jobs in order to be safe from such a situation?!

@BlakePetersen

This comment has been minimized.

Copy link

BlakePetersen commented Feb 1, 2015

Cool workaround!

Curious, have you tried decoupling the agenda jobs from the server (using workers approach) and seeing if that resolves the problem with having to cancel everything on shutdown for things to work on restart?

@abdelhady

This comment has been minimized.

Copy link

abdelhady commented Feb 1, 2015

Simply, using jobWorkers didn't help.

I started using agenda with the code snippet I wrote here, then I used the decoupling approach "jobWorkers" proposed by agenda documentation, and that didn't help solving my issue either,
so, I was about to use other alternative, but the agenda-ui feature made me try more until I figured out the workaround here :) .

@BlakePetersen

This comment has been minimized.

Copy link

BlakePetersen commented Feb 1, 2015

Very interesting, thanks for the insights! I am going to add this in this evening and will report back if I uncover anything worth noting. Thanks again @abdelhady

@abdelhady

This comment has been minimized.

Copy link

abdelhady commented Feb 1, 2015

You are welcome, & I hope it works well with you.

@rschmukler

This comment has been minimized.

Copy link
Collaborator

rschmukler commented Feb 3, 2015

@abdelhady thank you for reporting this. Can you show the state of your job collection (ie. db.agendaJobs.find().toArray()) to see if they are stuck in a lock state?

@abdelhady

This comment has been minimized.

Copy link

abdelhady commented Feb 3, 2015

I have many non-recurring jobs in the agendaJobs, so I'll fetch only the recurring ones here like this:

db.agendaJobs.find({repeatInterval: { $exists: true, $ne: null }}).pretty()

To reproduce the issue, I've commented out the workaround mentioned earlier and downed the interval from 30 min to only 4 seconds to save my time :) (but don't worry, the job makes an almost trivial query on local db which should take only milliseconds), & here is the recurring job after reproducing the issue:

{
    "_id" : ObjectId("54d0c194e569d8a528e8560d"),
    "data" : null,
    "lastFinishedAt" : ISODate("2015-02-03T12:41:55.785Z"),
    "lastModifiedBy" : null,
    "lastRunAt" : ISODate("2015-02-03T12:41:55.781Z"),
    "lockedAt" : ISODate("2015-02-03T12:41:56.132Z"),
    "name" : "Example Job",
    "nextRunAt" : ISODate("2015-02-03T12:41:59.781Z"),
    "priority" : 0,
    "repeatInterval" : "4 seconds",
    "type" : "single"
}

I think you are right, it is stuck in a lock state.

@kosmikko

This comment has been minimized.

Copy link

kosmikko commented Feb 4, 2015

I'm having the same issue. I'm also using worker process to process the jobs, but also that process needs to be restarted when code is updated. Is the above workaround the only way to solve the issue?

@droppedoncaprica

This comment has been minimized.

Copy link
Collaborator

droppedoncaprica commented Feb 4, 2015

Unfortunately yes, for the most part. You can also unlock all locked jobs on startup, which seems to fix the issue.

Temporary Workaround:

var Agenda = require('agenda');
var agenda = new Agenda({db: { address: 'mongodb://localhost:27017/my-test-db'}});

agenda._db.update({ lockedAt : { $exists : true } }, { $set : { lockedAt : null } }, function(e, numUnlocked) {
  if(e) {
    throw e;
  }

  console.log("Unlocked " + numUnlocked + " jobs.");

  agenda.define('say hi', function(job, done) {
    console.log('Hi!!!!');
    done();
  });
  agenda.every('1 second', 'say hi');
  agenda.start();
});

Unfortunately I'm not going to have time to look at this until next week sometime.

@BlakePetersen

This comment has been minimized.

Copy link

BlakePetersen commented Feb 4, 2015

Haven't found time to investigate all that's going on, but for what it's worth, using the worker approach with forever -w to restart server on changes appears to be working perfectly for me.

@abdelhady

This comment has been minimized.

Copy link

abdelhady commented Feb 6, 2015

@MikkoLehtinen I've summed up the 2 possible workarounds till now in this blog post

@droppedoncaprica

This comment has been minimized.

Copy link
Collaborator

droppedoncaprica commented Feb 11, 2015

@abdelhady For these jobs that are permanently locked, you're not setting an extremely long lock lifetime when initializing Agenda or the job definitions right?

I started by trying to figure out how / where these lock times were set and how long they were good for until they get invalidated.

The main piece that handles how long they should be locking for is the lockLifetime. These values are set on an Agenda level, not actually stored in the data (here when being initialized, here when setting up a job definition, as well as here to set it manually later).

I figured that it was a situation where these lock lifetimes weren't being respected, so I dove a bit deeper. I started looking at the main "loop" that gets fired when processing jobs. This method basically runs when agenda.start() starts and then is set to run on an interval based on the processEvery property.

This method gets called in one of two ways. The first way this happens is when you save a job and the nextRunAt gets calculated to be before the current time. This will pass the job in so it gets put on top of the job queue.

The other way is the normal setInterval method. This won't move the job to the top of the queue and begin processing, instead what it does is loop through the available job definitions and runs Agenda._findAndLockNextJob on each definition.

Sorry for the braindump here, I've got a bit more code to read in this area. Just wanted to make sure that we're not running with extremely long lock times on your end for the time being.

...and right before I posted this, it does look like this gets stuck with this code (aka multiple .every() jobs being set). This supports something I noticed earlier with _findAndLockNextJob possibly returning multiple jobs (which processJobs doesn't necessarily support at the moment). I'm not 100% sure on that though, but after playing with it it seems like it might have fixed it.

Actual breaking example:

var Agenda = require('./lib/agenda.js');
var agenda = new Agenda({db: { address: 'mongodb://localhost:27017/my-test-db'}});

agenda.define('say hi', function(job, done) {
  console.log('Hi!!!!');
  setTimeout(done, 5000)
});
agenda.every('10 seconds', 'say hi');
agenda.every('10 seconds', 'say hi');
agenda.every('10 seconds', 'say hi');
agenda.every('10 seconds', 'say hi');

agenda.start();

droppedoncaprica added a commit to droppedoncaprica/agenda that referenced this issue Feb 11, 2015

Fix when _findAndLockNextJob returns multiple jobs.
After working out how Agenda pulls in jobs for processing, I noticed
that the _findAndLockNextJob method could return multiple jobs, which
ends up breaking jobs with the same definition.  The query grabs and
updates the list of jobs whose lock time is passed or should be ran,
meaning if you have multiple of the same jobs running and shut
down the server, it will end up never running those jobs again.

I do need to pick @rschmukler's brain about how to properly get tests
for these methods, as it seems that the methods in question are
private and unreachable using Mocha.

Fixes agenda#131

droppedoncaprica added a commit to droppedoncaprica/agenda that referenced this issue Feb 11, 2015

Fix for when _findAndLockNextJob returns multiple jobs.
After working out how Agenda pulls in jobs for processing, I noticed
that the _findAndLockNextJob method could return multiple jobs, which
ends up breaking jobs with the same definition.  The query grabs and
updates the list of jobs whose lock time is passed or should be ran,
meaning if you have multiple of the same jobs running and shut
down the server, it will end up never running those jobs again.

I do need to pick @rschmukler's brain about how to properly get tests
for these methods, as it seems that the methods in question are
private and unreachable using Mocha.

Fixes agenda#131
@droppedoncaprica

This comment has been minimized.

Copy link
Collaborator

droppedoncaprica commented Feb 12, 2015

@abdelhady @BlakePetersen Hey, could you guys possibly check out the newest version of Agenda on the master branch and see if it fixed these issues for you?

I'm adding additional tests for this specific case tonight / tomorrow, but master should fix the issues you've been having.

@abdelhady

This comment has been minimized.

Copy link

abdelhady commented Feb 15, 2015

I've updated my packages to use the latest version Release 0.6.28 but the issue still exists,

the case of "multiple .every() jobs being set" is not my case here, because I have only 1 recurring job, & even when I was having only this job (without the single jobs) the issue was still there too.

I think we should re-open it again or something.

@droppedoncaprica

This comment has been minimized.

Copy link
Collaborator

droppedoncaprica commented Feb 15, 2015

Whelp, fixing bugs while not fixing bugs!

agenda.start()
graceful = ()->
    agenda.stop ()->
        process.exit 0

process.on 'SIGTERM', graceful
process.on 'SIGINT' , graceful

agenda.define 'my job', (job, done) ->
    console.log 'Job running!  Will finish in 5 seconds.'
    setTimeout(done, 5000)

agenda.every '10 seconds', 'my job'

Would this be a fair use-case for you that triggers the bug? When I tried something like this it didn't get stuck for me (pre-patch). I tried to restart the server while the job was mid-run, as I'd assumed it was a lock time issue. Do you know if you're restarting the server while a job is in a finished state?

@JulieWang2014

This comment has been minimized.

Copy link

JulieWang2014 commented Jun 17, 2015

@rschmukler I found the same problem. when server stop. it calls the unlockJobs. sometimes the lockedAt is set to null, sometime not. And I found the root cause is in jobProcessing(). it pop up the job in the jobQueue, but it does not add it to the _runningJobs immediately as the nextRunAt > now. if server stop at this time. it will not unlock this job.(it is not in jobQueue or _runningJobs but it is locked).

var job = jobQueue.pop(),
if (job.attrs.nextRunAt < now) {
      runOrRetry();
    } else {
      setTimeout(runOrRetry, job.attrs.nextRunAt - now);
    }
@elssar

This comment has been minimized.

Copy link

elssar commented Jul 3, 2015

I have an odd problem. None of my recurring tasks are locked, but still they haven't been executed after the server restarted. lockedAt is false for all of them.

@bonesoul

This comment has been minimized.

Copy link

bonesoul commented Jan 22, 2016

is this already fixed? i'm also having the issue with 0.7 series.

@shaunymca

This comment has been minimized.

Copy link

shaunymca commented Feb 2, 2016

I'm also having issues with the latest build.

@mirkods

This comment has been minimized.

Copy link

mirkods commented Mar 9, 2016

👍

@pcorey

This comment has been minimized.

Copy link

pcorey commented May 24, 2016

I'm currently having this issue as well.

Re: @droppedoncaprica's fix. It looks like _db has been renamed to _collection. Also, this should probably be a multi Mongo update, in case you have multiple recurring jobs that need to be unlocked. Lastly, _collection isn't available until the "ready" event fires:

function removeStaleJobs(callback) {
    agenda._collection.update({
        lockedAt: {
            $exists: true
        }
    }, {
        $set: {
            lockedAt: null
        }
    }, {
        multi: true
    }, callback);
}

agenda.on('ready', function() {
  removeStaleJobs((e, r) => {
      if (e) {
          console.error("Unable to remove stale jobs. Starting anyways.");
      }
      agenda.start();
  });
});
@ashutosh-akss

This comment has been minimized.

Copy link

ashutosh-akss commented Aug 29, 2016

@pcorey Thanks , I tried your method but it still does not work for me .

I am on node 4.5.0

function removeStaleJobs(callback) {
    agenda._collection.update({
        lockedAt: {
            $exists: true
        }
    }, {
        $set: {
            lockedAt: null
        }
    }, {
        multi: true
    }, callback);
}

agenda.on('ready', function() {

    console.log("AGENDA Service starting");
     removeStaleJobs((e, r) => {
        console.log("REMOVED STALE JOBSSSS");

          if (e) {
              console.error("Unable to remove stale jobs. Starting anyways.");
          }
          agenda.start();
      });

});
@kfiroo

This comment has been minimized.

Copy link

kfiroo commented Nov 18, 2016

Issue seems to be resolved on latest release

@ghost

This comment has been minimized.

Copy link

ghost commented Feb 15, 2017

I have the same problem. agenda@0.9.0, node@7.5.0

My workaround of this problem is deleting of agenda collection at start time. It's not important for me the persistence, but I'm afraid to rely on graceful exit.

@simison

This comment has been minimized.

Copy link
Member

simison commented Jun 22, 2017

I'm trying to triage issues so I'll close this and follow ups here please: #410

This might still be an issue. If anyone still experiences this (or has solutions!), please just even give +1 at #410 so that we'll know better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment