
Ideal error handling spec and proposal for improvement #172

Open
owenallenaz opened this issue May 22, 2015 · 5 comments

@owenallenaz

It's not clear from the docs or the conversation in github issues how agenda is supposed to handle errors.

Here are the current behaviors from my debugging:

  1. A job cb's an error: the job stops forever, while other jobs keep running.
  2. A job throws an asynchronous error. Right now this results in the entirety of agenda stopping. Even if we were to catch this error in our user-space code (such as domain-wrapping our call to agenda.start()), it still results in the destruction of the setInterval, which means agenda still stops.
  3. A job throws a synchronous error: this is caught, the job is stopped forever, and other jobs keep running.
  4. The query to get a lock on the job cb's an error, which results in a thrown error. See #2 (Support for cron like jobs) for the challenges there.
  5. Mongoskin cb's an error while saving a job. This gets munged into the same err var as if the job itself had cb'd an error.
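
To make these concrete, here's roughly how the first three failure modes surface in user-space job definitions (the job names, connection string, and error messages are just illustrative):

```js
var Agenda = require('agenda');
var agenda = new Agenda({ db: { address: 'localhost:27017/agenda-test' } });

// Case 1: the job passes an error to its callback.
agenda.define('cb-error', function (job, done) {
  done(new Error('callback error'));
});

// Case 3: the job throws synchronously; today this is caught by a try/catch
// and the job stops forever.
agenda.define('sync-throw', function (job, done) {
  throw new Error('synchronous error');
});

// Case 2: the job throws asynchronously; today this escapes the try/catch,
// destroys agenda's setInterval, and stops agenda entirely.
agenda.define('async-throw', function (job, done) {
  setTimeout(function () {
    throw new Error('asynchronous error');
  }, 0);
});
```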

The issue we're having using agenda is that whenever our MongoDB replica set switches primaries, it often results in agenda going haywire due to the eccentricities called out above. From looking at the code I don't believe it's too difficult to get everything straightened up, and I would be willing to do so and create a pull request, but I want to make sure the spec aligns with the vision of the package.

Generally, I think agenda should try to function as much like cron as possible: everything keeps running on schedule forever, successes get logged, errors get logged. In this way, regardless of how terrible the user-space code utilizing agenda is, we want agenda to keep plugging along, calling jobs, and logging results. To accomplish this we want to make it very durable. This is similar to how crontab will keep running jobs even if they return a non-zero status code.

So here's my idea:

  1. Job cb's an error. Error written to log, job will run again next time it's up.
  2. Job throws an error whether synchronous or asynchronous. Error caught by a domain within Agenda, job will run again next time it's scheduled to run.
  3. Mongoskin cb's an error: agenda emits an error event (see the sketch after this list). As with most error emitters, if nothing is listening to the error event, it will result in a thrown error (the same behavior we have now). If something is listening to the error event then it delegates to that handler to reconcile. This occurs for all of the cases where mongoskin attempts to acquire locks, remove jobs, add jobs, or do its various inner workings with its own mechanics. I feel these errors should be separated from the errors thrown by the job fn's declared by users of agenda.
  4. If we still want to support the idea of stopping a job on failure (be it sync thrown, async thrown or cb'd), then I propose we add a setting when declaring the job allowing us to opt-out of that behavior.
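
For point 3, the user-side contract I have in mind would look roughly like this (the error event is what's being proposed here, not something agenda documents today):

```js
// Proposed: persistence-layer (mongoskin) failures surface as an 'error'
// event. Per Node's EventEmitter semantics, emitting 'error' with no
// listener attached throws, which preserves today's behavior by default.
agenda.on('error', function (err) {
  console.error('agenda persistence error:', err);
  // e.g. wait out a replica set primary election, alert someone, etc.
});
```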

If we add the setting, then what's outlined above can be accomplished without any loss of backward compatibility, except for versions of Node before domains were introduced, which I believe was Node 0.8.
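
A rough sketch of the domain wrapping I'm proposing inside agenda (runJobSafely and the surrounding shape are invented for illustration, not agenda's actual internals):

```js
var domain = require('domain');

// Hypothetical internal helper: run a single job inside its own domain so
// that sync and async throws are funneled to one handler instead of
// killing agenda's setInterval.
function runJobSafely(job, done) {
  var d = domain.create();

  d.on('error', function (err) {
    // Proposed behavior: log the failure and let the job run again at its
    // next scheduled time.
    console.error('job "' + job.attrs.name + '" threw:', err);
    done(err);
  });

  d.run(function () {
    // job.run(cb) is how agenda invokes the user's definition fn.
    job.run(done);
  });
}
```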

@Albert-IV
Collaborator

To be perfectly honest, I don't know what the intended behavior is when a job fails. Personally I'm +1 on the idea that Agenda should keep calling these jobs, regardless of their last run status.

Encapsulating the jobs being run in their own domain is problematic, however. Even the documentation for domains says to exit the process if there is an uncaught error. Catching an error with a domain and not exiting the process leads to problems down the line.

Overall I support these changes, with the exclusion of the domain addition. I think it's up to @rschmukler on how he wants job failures to be handled, as I've not seen anything from him that suggests how he wants to handle it, one way or the other.

@owenallenaz
Author

Regarding the domain issue: right now the call to the processor is wrapped in a try/catch, which means that synchronous errors are caught and handled while async errors are not. I see this inconsistency as problematic. All thrown errors within a code block should be handled the same way, whether async or sync in origin. So if we don't do the domain, then we should remove the try/catch.
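
To illustrate the inconsistency, a try/catch around the processor only helps in the synchronous case:

```js
try {
  // A synchronous throw is caught by the surrounding try/catch.
  throw new Error('sync failure');
} catch (e) {
  console.error('handled:', e.message);
}

try {
  setTimeout(function () {
    // By the time this callback runs, the try/catch frame is gone, so this
    // becomes an uncaughtException and takes the process down.
    throw new Error('async failure');
  }, 0);
} catch (e) {
  // never reached for the async throw
}
```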

Another approach would be to have a setting where users could choose how it should proceed. If you init agenda with logThrownError, then any error thrown by your handler is logged and the job will keep running on schedule. If logThrownError is false, then the domain will rethrow the error, causing it to bubble up to your application's error handling code and crash the process, if that is how your app is set up.
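
Roughly like this (logThrownError is just the name I'm proposing; it doesn't exist in agenda today):

```js
// Proposed usage, not an existing option:
var agenda = new Agenda({
  db: { address: 'localhost:27017/agenda' },
  logThrownError: true   // log thrown job errors and keep the job scheduled
});
```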

@Albert-IV
Collaborator

I don't understand how this would be problematic. From my point of view, async errors would normally occur in user-side code, the exception being if the MongoDB connection somehow fails (feel free to correct me if I'm wrong there).

If someone was worried about asynchronous errors, wouldn't they already have a domain solution set up for themselves? Even if hidden behind a flag, I really don't think the library should practice things that the core docs explicitly say not to do, even with an opt-in flag.

As-is, it should be fairly easy to set up domains user-side, since Agenda inherits from Node's EventEmitter: something like domain.add(agenda) (though untested). My home PC environment isn't cooperating with me right now.
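
Something along these lines is what I mean, again untested (agenda here is an Agenda instance, which is the EventEmitter that domain.add() needs):

```js
var domain = require('domain');
var Agenda = require('agenda');

var agenda = new Agenda({ db: { address: 'localhost:27017/agenda' } });
var d = domain.create();

// Because Agenda inherits from EventEmitter, the instance can be bound to
// a domain directly; errors it raises route to the domain's handler.
d.add(agenda);

d.on('error', function (err) {
  console.error('agenda error caught by domain:', err);
  // Per the domain docs, the safest move is still to shut down gracefully.
});

d.run(function () {
  agenda.start();
});
```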

@windmaomao

What is "logThrownError" and where do I set it? I can't find documentation on this.

@ppsreejith

Would adding support for maximum retries mitigate these errors to some degree?

In cases 1, 2, and 4, we could retry the job n times, where n is the maximum number of retries specified on the job, and then fail it.
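
Something roughly like this could approximate it user-side today, assuming agenda's 'fail' event and job.attrs.failCount behave as documented:

```js
var MAX_RETRIES = 3; // hypothetical per-deployment cap

agenda.on('fail', function (err, job) {
  if (job.attrs.failCount <= MAX_RETRIES) {
    // Back off and try again instead of leaving the job dead.
    job.schedule('in 5 minutes');
    job.save();
  } else {
    console.error('giving up on', job.attrs.name, 'after',
      job.attrs.failCount, 'failures:', err);
  }
});
```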
