User's Guide 04:02 Pro: Safer Queueing

Ed Ropple edited this page May 22, 2018 · 1 revision

This is a feature of TaskBotJS Pro.

TaskBotJS, by default, uses Redis's BRPOP command to fetch jobs from all configured queues. This is a simple, high-performance operation that can watch multiple queues with a single call. However, it comes with a downside: BRPOP removes a job from Redis as soon as it is fetched, so in the (unlikely, but definitely not impossible) case where TaskBotJS crashes in an unrecoverable manner, any job it was holding is irretrievably lost. A second concern involves shutdown. When TaskBotJS receives a SIGTERM or a SIGINT, it attempts to return its jobs to the queue, but that attempt can fail: for example, a network interruption may break the connection to Redis, or the server running TaskBotJS may shut down before the service can do its cleanup tasks.

For most applications, this is an unfortunate but tolerable edge case; it's very unlikely that TaskBotJS will lose jobs in a healthy environment. But, sometimes, one in a million is next Tuesday, and users' tolerance of this possibility can vary. To that end, TaskBotJS Pro supports safer queueing, which uses the Redis RPOPLPUSH command to stash a dequeued job in a separate list owned by the worker doing the dequeueing. When the job is resolved (successfully or not), it is removed from this list. If a worker shuts down uncleanly, its unresolved jobs remain in this list, where other workers can find them and pick up the pieces.
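The stash-and-remove flow described above can be sketched without a Redis server. This is a minimal illustration, not TaskBotJS Pro's implementation: `FakeRedis` is an in-memory stand-in that mimics the list semantics of LPUSH, RPOPLPUSH, and LREM, and the key names, `fetchJob`, and `resolveJob` are hypothetical.

```javascript
// In-memory stand-in for a few Redis list commands, for illustration only.
// Each list is an array: index 0 is the head (left), the last element is the tail (right).
class FakeRedis {
  constructor() {
    this.lists = new Map();
  }
  list(key) {
    if (!this.lists.has(key)) this.lists.set(key, []);
    return this.lists.get(key);
  }
  // LPUSH: add a value at the head of the list.
  lpush(key, value) {
    this.list(key).unshift(value);
  }
  // RPOPLPUSH: pop the tail of `source` and push it onto the head of `destination`
  // as one step, so the job is never absent from both lists.
  rpoplpush(source, destination) {
    const src = this.list(source);
    if (src.length === 0) return null;
    const value = src.pop();
    this.list(destination).unshift(value);
    return value;
  }
  // LREM with a count of 1: remove the first occurrence of `value`.
  lrem(key, value) {
    const items = this.list(key);
    const index = items.indexOf(value);
    if (index === -1) return 0;
    items.splice(index, 1);
    return 1;
  }
  lrange(key) {
    return [...this.list(key)];
  }
}

// Hypothetical key names; the keys TaskBotJS Pro actually uses may differ.
const queueKey = "queue:default";
const inFlightKey = (workerId) => `worker:${workerId}:in-flight`;

// Dequeue a job and stash it in the worker's own in-flight list.
function fetchJob(redis, workerId) {
  return redis.rpoplpush(queueKey, inFlightKey(workerId));
}

// Called when a job is resolved, successfully or not.
function resolveJob(redis, workerId, job) {
  return redis.lrem(inFlightKey(workerId), job);
}

const redis = new FakeRedis();
redis.lpush(queueKey, "job-1");
redis.lpush(queueKey, "job-2");

const first = fetchJob(redis, "worker-a"); // worker-a takes job-1 ...
resolveJob(redis, "worker-a", first);      // ... and resolves it
const second = fetchJob(redis, "worker-a"); // worker-a takes job-2, then "crashes"

// job-2 was never resolved, so it is still visible to other workers.
const orphaned = redis.lrange(inFlightKey("worker-a"));
```

After the simulated crash, `orphaned` still contains the unresolved job, which is what lets another worker recover it.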

It's important to realize that this is not safe queueing, but safer. When jobs are orphaned, we need to prioritize getting them done ASAP, so we want to put an orphaned job at the head of its queue, where it will be the next job fetched. However, Redis lacks a command to do this atomically, so we must pop the job and then RPUSH it. It is thus possible, though considerably less likely than without safer queueing, to lose a job in the milliseconds during which the job is out of Redis. It is also, and separately, possible for the dying worker to have completed a job before it could acknowledge that in Redis, in which case another worker might redo the same work. This is why it's important to design jobs to be idempotent.
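As a rough illustration of what idempotency means for a job, the sketch below records completed work and turns a repeat run into a no-op. Everything here is hypothetical: `sendReceiptJob`, the in-memory `sentReceipts` set, and the counter standing in for a real side effect are not part of TaskBotJS.

```javascript
// Hypothetical record of completed work. A real job would keep this in a durable
// store (for example, behind a unique constraint in the application's database).
const sentReceipts = new Set();
let emailsSent = 0;

// Hypothetical job body: send a receipt for an order at most once.
function sendReceiptJob(orderId) {
  if (sentReceipts.has(orderId)) {
    return "skipped"; // an earlier run already did the work
  }
  emailsSent += 1;           // stands in for the real side effect
  sentReceipts.add(orderId); // record completion so a re-run becomes a no-op
  return "sent";
}

const firstRun = sendReceiptJob("order-42");
const secondRun = sendReceiptJob("order-42"); // e.g. a re-run after the job was orphaned
```

Running the job twice performs the side effect once. In a real system the check and the record should happen atomically in the durable store, otherwise the same gap exists between doing the work and recording it.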

Safer queueing also incurs a performance penalty and puts additional load on the Redis server. If you need this feature, be prepared to provision for it.

Enabling Safer Queueing

(These bits of code are extracted from the example project's configuration, which is worth reading for this and all Pro features.)

Two steps are necessary to enable safer queueing. First, we need to set the reliable option on the intake.

config.intake = {
  type: "weighted",
  reliable: true,
  timeoutSeconds: 1,
  queues: [
    { name: "critical", weight: 5 },
    { name: "default", weight: 3 },
    { name: "low", weight: 2 }
  ]
};

Once that's done, we need to activate the orphan plugin, which periodically checks for workers that are no longer reporting a heartbeat to the datastore, whether because of a crash, a network partition, or the like. You can check a worker's last heartbeat in the control panel or by invoking Client.getWorkerInfo().

config.orphan.enabled = true;
config.orphan.polling.interval = { seconds: 3 };
config.orphan.polling.splay = { seconds: 1 };
config.orphan.requeueAge = { seconds: 30 };

Note that the example project uses heavily accelerated timings to make it easier to show off the features of TaskBotJS Pro. The default polling interval for the orphan plugin ranges between 30 and 34 minutes, with a default requeueAge of 30 minutes. Those defaults are almost certainly fine for the vast majority of projects, but you can tune them if necessary.
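To make the effect of requeueAge more concrete, here is a small sketch of a staleness check. It assumes that requeueAge is compared against the age of a worker's last heartbeat, which is one reading of the option and not confirmed behavior; `durationToMs` and `isOrphaned` are hypothetical helpers, and only the `{ seconds: n }` duration shape shown above is handled.

```javascript
// Convert a duration object such as { seconds: 30 } to milliseconds.
// Only the `seconds` field is handled; other units are outside this sketch.
function durationToMs(duration) {
  return (duration.seconds || 0) * 1000;
}

// Hypothetical check: treat a worker as orphaned once its last heartbeat
// is older than requeueAge.
function isOrphaned(lastHeartbeatMs, nowMs, requeueAge) {
  return nowMs - lastHeartbeatMs > durationToMs(requeueAge);
}

const now = Date.parse("2018-05-22T12:00:00Z");
const requeueAge = { seconds: 30 }; // the example project's accelerated value

// Heartbeat 10 seconds ago: still considered alive.
const recentWorker = isOrphaned(now - 10 * 1000, now, requeueAge);
// Heartbeat 45 seconds ago: past requeueAge, so its jobs would be requeued.
const silentWorker = isOrphaned(now - 45 * 1000, now, requeueAge);
```

Under this reading, a shorter requeueAge recovers orphaned jobs sooner but raises the chance of requeueing work from a worker that is merely slow to report.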
