
multi-hop / dependent jobs? #25

Closed
davisford opened this issue Dec 8, 2015 · 10 comments

@davisford

Hey, do you see any potential issue with jobs that have multiple hops / dependencies? For example:

producer1 --> consumer1 which then creates producer2 --> consumer2

The producer1 job isn't successful until the whole chain returns successful.

I also want to make use of retries and timeouts for the upstream job, but not the downstream. I'm looking to get rid of kue; I've had nothing but problems with it, and I'd like to give yours a try. I only have one particular job with a multi-hop structure like that, but I thought I'd ask before I spent the time to migrate the code and try it out.

@LewisJEllis
Member

It feels a bit strange to me, but it should be possible. I've never had or tried a use case like this myself, and it's not one I've tried to design for, but the necessary pieces are there to pull it off. I've focused mostly on small jobs that aren't large enough to be worth breaking up into pieces. I'm curious how much processing time one of these producer1 jobs would take in total?

To clarify the sequence of events:

  1. producer1 produces job1
  2. consumer1 begins job1 and produces job2 (taking the role of producer2)
  3. consumer2 begins and finishes job2
  4. consumer1, in its role as producer2, hears that job2 finished and then, in its role as consumer1, finishes job1

This could even work if consumer1 and consumer2 are the same, as long as it has a concurrency > 1.

If you end up doing this and find that some small changes could make the use case easier/cleaner, please let me know.

@davisford
Author

Yes, that's the gist of it. There are design reasons it's structured this way: some jobs are useful both standalone and as steps of other jobs depending on the context, so they're split up, and some producers should only ever run from a single node, while we may have any number of consumers.

I'll let you know if I run into any particular issue. Thanks! Feel free to close.

@davisford
Author

This is working without issue, FWIW.

@LewisJEllis
Member

Great to hear! I was kind of concerned about the failure model of such a scheme, but after giving it some thought I think it should be reasonably robust. With bee-queue generally following at-least-once delivery, retries and stalled job resets should ensure that everything gets done in these "chained" jobs, just with a bit more risk of some things happening more than once.

@davisford
Author

Yea, I'm not quite out of the woods yet as far as failures, timeouts, and retries are concerned; I think I still have some tweaking to do.

NOTE: apologies in advance for this novel-size comment. I'm trying to write it out clearly mostly for myself, to try to reason it out, and hopefully you may have a recommendation or two after understanding the problem frame better.

Here's what these jobs do. We have a reporting site for restaurant POS data. It's heavy on charts and tables. To print, we use phantomjs. From the webpage, the user can click the print button, which triggers a websocket message to the server; the server puts a job in the Redis queue, which a server running phantom picks up. That server creates a pdf of the page and uploads it to s3, and then the user can open the pdf, download it, or email it.

So, that's one producer/consumer job. There's a createpdf queue for that.

But some people don't want to log into the website every day and run the report; some reports they just want mailed to them. So we have the concept of subscriptions, where the user can set up pdf reports to be emailed to them on a given day or schedule. That's the second queue, called sub. A producer node periodically pulls subscription definitions from the database and determines whether it is time to send one; if so, it puts a job on the sub queue. The webapp server consumes that and puts a job on the createpdf queue; when that returns, it sends the email and calls the final done() callback.
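The producer's "is it time to send?" decision might look something like this sketch; the field names (sendDay, lastSentAt) are hypothetical, since the real subscription schema isn't shown in the thread:

```javascript
// Hypothetical due-check for a weekly subscription. sendDay and lastSentAt
// are illustrative field names, not the real schema.
var DAY_MS = 24 * 60 * 60 * 1000;

function isDue(sub, now) {
  // sendDay: 0 = Sunday .. 6 = Saturday
  if (now.getUTCDay() !== sub.sendDay) return false;
  // don't enqueue twice for the same scheduled day
  if (sub.lastSentAt && now - sub.lastSentAt < DAY_MS) return false;
  return true;
}
```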

Here's how I've defined the queues / producers / consumers:

deliverSubQueue = new Queue(name, {
  redis: config.redis,
  isWorker: false, // not a worker, only a producer of jobs
  removeOnSuccess: true
});

// jobs are created like this; phantomjs sometimes fails, so I do want retries
// also, we need at least 60s, some reports can take a while for various reasons
var job = deliverSubQueue.createJob({sub: sub});
job.timeout(1000 * 60).retries(4).save(function (err, job) {
  if (err) {
    return error('Error creating subscription job in Redis', err);
  }
  log('job.%s[%d] created for subscription %s', job.queue.name, job.id, sub._id);
});

There's a separate node that runs the consumer code for this job. The consumer code generates a second job to create a pdf (it re-uses the same queue/job definition from above that lets the user print directly from the page). I want zero retries on this, because if it fails, I'd rather the original (parent) job retry and just treat the child job as a failure. The timeout needs to be somewhat less than the subscription job's timeout.

// on the consumer node, the same sub queue is opened as a worker
deliverSubQueue = new Queue(name, {
  redis: config.redis,
  isWorker: true,
  removeOnSuccess: true
});

// worker code
deliverSubQueue.process(function (job, done) {
  // the processor for this job creates the second job:
  var pdfjob = createPdfQueue.createJob({ data: data });
  pdfjob.timeout(1000 * 50).retries(0).save(function (err) {
    if (err) { return done(err); } // bail out if the child job can't be saved
  });
  pdfjob.on('succeeded', function (result) {
     // etc. (eventually calls done() to finish the parent job)
  });
});

The createPdfQueue is defined as below. It is not a worker, since it just fires off requests to have a pdf generated.

createPdfQueue = new Queue(name, {
  redis: config.redis,
  isWorker: false,
  removeOnSuccess: true
});

I'll spare you the queue / worker setup for generating the pdfs, as I don't think it's relevant here. So, here's the problem I'm facing. I'm trying to test and instrument various failure scenarios. Under normal conditions, this setup works just fine: I have a concurrency setting of 2, and I've chugged through 50 or so jobs repeatedly without failure. It's the retries and timeouts in the face of failure that I need to resolve.

So, if I bootstrap all this, and then kill the createpdf worker (so it is not running), this is what I see happening:

  1. A new job is put on bq:sub:job with id 1
  2. Job is consumed by worker, which creates a new job bq:createpdf:job with id 2
  3. Since that worker is down, job is waiting
  4. The timeout passes, and the original consumer appears to retry the original bq:sub:job (the same one with id 1); the job is re-queued, but no new sub job is created
  5. Consumer picks up job, and now generates a new job bq:createpdf:job with id 3

This will continue, and I end up generating only a single sub job, but multiple duplicate createpdf jobs. If I start the createpdf worker, it picks them all up and executes them, and I actually risk sending duplicate emails.

I'm trying to figure out a configuration that will retry the sub job in the face of failure, but not continue to pile up similar createpdf jobs.
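One way to stop the pile-up (assuming the node creating the pdf jobs is a single process, in line with the single-node producers mentioned earlier) is to guard child-job creation with an in-memory pending set keyed by subscription id. This is an illustrative sketch, not a bee-queue feature:

```javascript
// Guard against duplicate createpdf jobs for the same subscription.
// Only safe if a single node creates these child jobs.
var pendingPdf = new Set();

function maybeCreatePdfJob(subId, createJob) {
  if (pendingPdf.has(subId)) return false; // a pdf job is already in flight
  pendingPdf.add(subId);
  // createJob must invoke the callback when the child job settles
  // (succeeds, fails for good, or times out)
  createJob(subId, function () { pendingPdf.delete(subId); });
  return true;
}
```

With this in place, a retried sub job would see the still-pending pdf job and skip creating another one, instead of enqueuing a duplicate on each retry.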

@davisford
Author

Here's an example of the state of the keys:

keys *
 1) "bq:createpdf:waiting"
 2) "bq:sub:stallTime"
 3) "bq:sub:active"
 4) "bq:sub:jobs"
 5) "bq:sub:stalling"
 6) "bq:sub:id"
 7) "bq:createpdf:jobs"
 8) "bq:createpdf:stallTime"
 9) "bq:createpdf:id"

// this has three child jobs waiting; I really only want one
127.0.0.1:6379> lrange bq:createpdf:waiting 0 -1
1) "3"
2) "2"
3) "1"

// only a single parent job
127.0.0.1:6379> lrange bq:sub:active 0 -1
1) "1"

@davisford
Author

So, it works as expected if I propagate the subjob failure back to the parent, i.e.:

pdfjob.on('failed', function (err) {
   done(err);  // calls parent job's done with error
});

This causes the parent to kick in and retry. The only problem is when the pdfjob consumer is completely down; then jobs just get queued up and repeated.
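One refinement worth noting on this pattern: when wiring both the succeeded and failed handlers of the child job to the parent's done(), it helps to guard against done() firing twice. A sketch of that wiring (linkChildToParent is a hypothetical helper name):

```javascript
// Sketch: wire a child job's outcome to the parent's done(), guaranteeing
// done() fires at most once even if multiple events arrive.
function linkChildToParent(childJob, done) {
  var settled = false;
  function finish(err, result) {
    if (settled) return; // never call the parent's done() twice
    settled = true;
    done(err, result);
  }
  childJob.on('succeeded', function (result) { finish(null, result); });
  childJob.on('failed', function (err) { finish(err); });
  return finish; // can also be called directly for save() errors
}
```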

@LewisJEllis
Member

Good news: this all sounds like expected behavior to me, and it also sounds like you have a strong handle on everything that's going on as far as the library and its behavior is concerned.

Bad news: the risk of sending multiple emails is somewhat inherent in the at-least-once-delivery model.

Good news: I think it's possible to move some parts around and end up with something with a bit less risk of sending multiple emails. It should at the very least be possible to avoid queueing up email after email if the pdf consumer is down. I'm not immediately sure what the best structure is exactly, though; I'll have to give it some thought.

@LewisJEllis
Member

Oh, missed your most recent comment; yes, that seems like it would work out better. I think it might be possible to make it work without propagating all the way back to the parent, but I suppose that's not such a bad thing to be doing.

@davisford
Author

Yea, let me know if you have some ideas on what to do in the scenario where the pdf consumer is offline (that has happened before, although there are other ways to mitigate it).

I'm not certain I have a clear solution. By the very nature of pub/sub, the publisher isn't supposed to care who the subscriber is or whether they're available; the broker just queues up messages in a mailbox for them. If there's a way I can query the queue to figure out whether a pdfjob has already been submitted and is pending, though, that would do the trick.
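Lacking a query API in the library at the time, one could fetch the raw lists from Redis directly (LRANGE bq:createpdf:waiting 0 -1, plus the stored job data) and run a pure check over them. The subId field here is hypothetical; the pdf job's data would need to carry it:

```javascript
// Pure check over data fetched from Redis: waitingIds comes from
// LRANGE bq:createpdf:waiting 0 -1, jobDataById from the jobs hash.
// subId is an illustrative field the job data would need to carry.
function hasPendingPdfJob(waitingIds, jobDataById, subId) {
  return waitingIds.some(function (id) {
    var data = jobDataById[id];
    return Boolean(data) && data.subId === subId;
  });
}
```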

An alternative solution might be to give the parent sub job .retries(0) and have the createpdf subjob handle the retries, e.g. .retries(3).
