
Ideal error handling spec and proposal for improvement #172

Open
owenallenaz opened this issue May 22, 2015 · 5 comments

@owenallenaz

It's not clear from the docs or the conversation in github issues how agenda is supposed to handle errors.

Here are the current behaviors from my debugging:

  1. A job cb's an error: the job stops forever, while other jobs keep running.
  2. A job throws an asynchronous error. Right now this results in the entirety of agenda stopping. Even if we were to catch this error in our user-space code (such as domain-wrapping our call to agenda.start()), it still results in the destruction of the setInterval, which means agenda still stops.
  3. A job throws a synchronous error: this is caught, the job is stopped forever, and other jobs keep running.
  4. The query to get a lock on the job cb's an error, which results in a thrown error. See #2 (Support for cron like jobs) for the challenges there.
  5. Mongoskin cb's an error while saving a job. This gets munged into the same err var as if the job itself had cb'd an error.
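
To make these concrete, here's roughly how the first three failure modes surface in user-space job definitions (the job names, connection string, and error messages are just illustrative):

```js
var Agenda = require('agenda');
var agenda = new Agenda({ db: { address: 'localhost:27017/agenda-test' } });

// Case 1: the job passes an error to its callback.
agenda.define('cb-error', function (job, done) {
  done(new Error('callback error'));
});

// Case 3: the job throws synchronously; today this is caught by a try/catch
// and the job stops forever.
agenda.define('sync-throw', function (job, done) {
  throw new Error('synchronous error');
});

// Case 2: the job throws asynchronously; today this escapes the try/catch,
// destroys agenda's setInterval, and stops agenda entirely.
agenda.define('async-throw', function (job, done) {
  setTimeout(function () {
    throw new Error('asynchronous error');
  }, 0);
});
```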

The issue we're having using agenda is that whenever our MongoDB replica set switches primaries, it often results in agenda going haywire due to the eccentricities called out above. From looking at the code I don't believe it's too difficult to get everything straightened up, and I would be willing to do so and create a pull request, but I want to make sure the spec aligns with the vision of the package.

Generally, I think agenda should try to function as much like cron as possible: everything keeps running on schedule forever, successes get logged, errors get logged. In this way, regardless of how terrible the user-space code utilizing agenda is, we want agenda to keep plugging along, calling jobs, and logging results. To accomplish this we want to make it very durable. This is similar to how crontab will keep running jobs even if they return a non-zero status code.

So here's my idea:

  1. Job cb's an error. Error written to log, job will run again next time it's up.
  2. Job throws an error whether synchronous or asynchronous. Error caught by a domain within Agenda, job will run again next time it's scheduled to run.
  3. Mongoskin cb's an error: agenda emits an error event (see the sketch after this list). As with most error emitters, if nothing is listening to the error event, it will result in a thrown error (the same behavior we have now). If something is listening to the error event then it delegates to that handler to reconcile. This occurs for all of the cases where mongoskin attempts to acquire locks, remove jobs, add jobs, or do its various inner workings with its own mechanics. I feel these errors should be separated from the errors thrown by the job fn's declared by users of agenda.
  4. If we still want to support the idea of stopping a job on failure (be it sync thrown, async thrown or cb'd), then I propose we add a setting when declaring the job allowing us to opt-out of that behavior.
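
For point 3, the user-side contract I have in mind would look roughly like this (the error event is what's being proposed here, not something agenda documents today):

```js
// Proposed: persistence-layer (mongoskin) failures surface as an 'error'
// event. Per Node's EventEmitter semantics, emitting 'error' with no
// listener attached throws, which preserves today's behavior by default.
agenda.on('error', function (err) {
  console.error('agenda persistence error:', err);
  // e.g. wait out a replica set primary election, alert someone, etc.
});
```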

If we add the setting, then what's outlined above can be accomplished without any loss of backward compatibility, except for versions of Node before domains were introduced, which I believe was Node 0.8.
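
A rough sketch of the domain wrapping I'm proposing inside agenda (runJobSafely and the surrounding shape are invented for illustration, not agenda's actual internals):

```js
var domain = require('domain');

// Hypothetical internal helper: run a single job inside its own domain so
// that sync and async throws are funneled to one handler instead of
// killing agenda's setInterval.
function runJobSafely(job, done) {
  var d = domain.create();

  d.on('error', function (err) {
    // Proposed behavior: log the failure and let the job run again at its
    // next scheduled time.
    console.error('job "' + job.attrs.name + '" threw:', err);
    done(err);
  });

  d.run(function () {
    // job.run(cb) is how agenda invokes the user's definition fn.
    job.run(done);
  });
}
```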

@Albert-IV
Collaborator

To be perfectly honest, I don't know what the intended behavior is when a job fails. Personally I'm +1 on the idea that Agenda should keep calling these jobs, regardless of their last run status.

Encapsulating the jobs being run in their own domain is problematic, however. Even the documentation for domains says to exit the process if there is an uncaught error. Catching an error with a domain and not exiting the process leads to problems down the line.

Overall I support these changes, with the exclusion of the domain addition. I think it's up to @rschmukler on how he wants job failures to be handled, as I've not seen anything from him that suggests how he wants to handle it, one way or the other.

@owenallenaz
Author

Regarding the domain issue: right now the call to the processor is wrapped in a try/catch, which means that synchronous errors are caught and handled while async errors are not. I see this inconsistency as problematic. All thrown errors within a code block should be handled the same way, whether async or sync in origin. So if we don't do the domain, then we should remove the try/catch.
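
To illustrate the inconsistency, a try/catch around the processor only helps in the synchronous case:

```js
try {
  // A synchronous throw is caught by the surrounding try/catch.
  throw new Error('sync failure');
} catch (e) {
  console.error('handled:', e.message);
}

try {
  setTimeout(function () {
    // By the time this callback runs, the try/catch frame is gone, so this
    // becomes an uncaughtException and takes the process down.
    throw new Error('async failure');
  }, 0);
} catch (e) {
  // never reached for the async throw
}
```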

Another approach would be to have a setting where users could choose how it should proceed. If you init agenda with logThrownError, then any error thrown by your handler is logged and the job will keep running on schedule. If logThrownError is false, then the domain will rethrow the error, causing it to bubble up to your application's error handling code and crash the process, if that is how your app is set up.
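
Roughly like this (logThrownError is just the name I'm proposing; it doesn't exist in agenda today):

```js
// Proposed usage, not an existing option:
var agenda = new Agenda({
  db: { address: 'localhost:27017/agenda' },
  logThrownError: true   // log thrown job errors and keep the job scheduled
});
```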

@Albert-IV
Collaborator

I don't understand how this would be problematic. From my point of view, async errors would normally occur in user-side code, the exception being if the MongoDB connection somehow fails (feel free to correct me if I'm wrong there).

If someone was worried about asynchronous errors, wouldn't they already have a domain solution set up for themselves? Even if hidden behind a flag, I really don't think the library should practice things that the core docs explicitly say not to do, even with an opt-in flag.

As-is, it should be fairly easy to set up domains user-side, since Agenda inherits from Node's EventEmitter: something like domain.add(agenda) (though untested). My home PC environment isn't cooperating with me right now.
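
Something along these lines is what I mean, again untested (agenda here is an Agenda instance, which is the EventEmitter that domain.add() needs):

```js
var domain = require('domain');
var Agenda = require('agenda');

var agenda = new Agenda({ db: { address: 'localhost:27017/agenda' } });
var d = domain.create();

// Because Agenda inherits from EventEmitter, the instance can be bound to
// a domain directly; errors it raises route to the domain's handler.
d.add(agenda);

d.on('error', function (err) {
  console.error('agenda error caught by domain:', err);
  // Per the domain docs, the safest move is still to shut down gracefully.
});

d.run(function () {
  agenda.start();
});
```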

@windmaomao

What is "logThrownError" and where do I set it? I can't find documentation on this.

@ppsreejith

Would adding support for maximum retries mitigate these errors to some degree?

In cases 1, 2, and 4, we could retry the job n times, where n is the maximum number of retries specified on the job, and then fail it.
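
Something roughly like this could approximate it user-side today, assuming agenda's 'fail' event and job.attrs.failCount behave as documented:

```js
var MAX_RETRIES = 3; // hypothetical per-deployment cap

agenda.on('fail', function (err, job) {
  if (job.attrs.failCount <= MAX_RETRIES) {
    // Back off and try again instead of leaving the job dead.
    job.schedule('in 5 minutes');
    job.save();
  } else {
    console.error('giving up on', job.attrs.name, 'after',
      job.attrs.failCount, 'failures:', err);
  }
});
```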
