'bury' on StandardError in Process! needs checking #28

Open · pagojo opened this issue Aug 27, 2013 · 8 comments

Comments
@pagojo commented Aug 27, 2013

In collection.rb (https://github.com/beanstalkd/beaneater/blob/master/lib/beaneater/job/collection.rb#L98), when a StandardError is caught, the job needs to be checked before job.bury is called, as is done in the ensure clause; otherwise it may try to bury a job which has already been deleted.

This may occur when

@beanstalk.jobs.register('whatever', :retry_on => [Timeout::Error]) do |job|
  process(job)
  screw_up_here_and_raise_some_exception()
end

in which case a confusing Beaneater::NotFoundError: Response failed with: NOT_FOUND percolates up from parse_response(cmd, res) when it tries to send the bury command.
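
Something like the following guard is what I have in mind (a rough, untested sketch; safe_bury is just a hypothetical helper name, while exists? and reserved? are the existing Beaneater job methods):

# Hypothetical replacement for the bare job.bury in the rescue StandardError
# branch of Jobs#process!: only bury if the server still knows the job and
# this connection still holds the reservation.
def safe_bury(job)
  job.bury if job.exists? && job.reserved?
rescue Beaneater::NotFoundError
  # the job was already deleted, or timed out and was re-reserved elsewhere;
  # there is nothing left to bury
end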

@nesquena (Member)

Thanks for the bug report, we'll see what we can do.

@pagojo (Author) commented Aug 28, 2013

Happy to do testing for you if you wish. I revisited the bug many times yesterday; it is always repeatable, but looking at your code I can't see exactly how it occurs after all. I'll try to dig out more facts today.

@pagojo (Author) commented Sep 6, 2013

I just observed that this happens because job.delete fails after a job processor has completed, in https://github.com/beanstalkd/beaneater/blob/master/lib/beaneater/job/collection.rb#L92, in which case the rescue StandardError clause takes effect.

Before that happens, the exception raised by delete has the following backtrace:

/home/user/.rvm/gems/ruby-2.0.0-p247@project/gems/beaneater-0.3.1/lib/beaneater/connection.rb:116:in `parse_response'
/home/user/.rvm/gems/ruby-2.0.0-p247@project/gems/beaneater-0.3.1/lib/beaneater/connection.rb:55:in `block in transmit'
/home/user/.rvm/gems/ruby-2.0.0-p247@project/gems/beaneater-0.3.1/lib/beaneater/connection.rb:49:in `synchronize'
/home/user/.rvm/gems/ruby-2.0.0-p247@project/gems/beaneater-0.3.1/lib/beaneater/connection.rb:49:in `transmit'
/home/user/.rvm/gems/ruby-2.0.0-p247@project/gems/beaneater-0.3.1/lib/beaneater/job/record.rb:193:in `transmit'
/home/user/.rvm/gems/ruby-2.0.0-p247@project/gems/beaneater-0.3.1/lib/beaneater/job/record.rb:76:in `delete'
/home/user/.rvm/gems/ruby-2.0.0-p247@project/gems/beaneater-0.3.1/lib/beaneater/job/collection.rb:92:in `block in process!'
/home/user/.rvm/gems/ruby-2.0.0-p247@missum-api/gems/beaneater-0.3.1/lib/beaneater/job/collection.rb:87:in `loop'
/home/user/.rvm/gems/ruby-2.0.0-p247@missum-api/gems/beaneater-0.3.1/lib/beaneater/job/collection.rb:87:in `process!'
/home/user/project/releases/20130906111440/problem/my_main_prog.rb:357:in `<top (required)>'

So, to correct my previous report: the Beaneater::NotFoundError: Response failed with: NOT_FOUND comes first, from the inability to delete the job.

I will try to find a way to make this 100% reproducible, though I think it is deep in the protocol.

@pagojo (Author) commented Sep 6, 2013

Problem found. This is now 100% reproducible for me.

It is down to this part of the protocol: "If the worker does not delete, release, or bury the job within <ttr> seconds, the job will time out and the server will release the job."

Unfortunately this is NOT reflected in the state transition diagram of the protocol.

In my case the jobs are long running, so the TTR may be exceeded. I also have many workers listening on the same queue for such jobs. As a result, job.delete fails with Beaneater::NotFoundError: Response failed with: NOT_FOUND when a long-running job that exceeds the TTR is put back into the ready state (by the server) and subsequently picked up by another worker. In that case the first worker, upon completing its instance of the job, fails inside Collection::Jobs::process! when it tries to delete the job (which the other worker has now reserved).

Having said this, the documentation for the delete protocol command states:
"NOT_FOUND\r\n if the job does not exist or is not either reserved by the client, ready, or buried. This could happen if the job timed out before the client sent the delete command."

A more appropriate error, one that Collection::Jobs::process! could handle sanely, is needed; at the moment the process! error handling is broken because of this. When a job exceeds its TTR, it could either go to a TTR_exceeded queue, or perhaps its session's handle should simply indicate the timeout so that the client of the job is given a chance to find out about it gracefully, something Collection::Jobs::process! cannot really do at the moment.

I'm now wondering whether there is a safe patch for Collection::Jobs::process!; a sketch of one option follows.
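
For example (a simplified, untested sketch of the shape of the per-job handling, not the actual process! source; handler stands for the block registered for the job's tube):

begin
  handler.call(job)
  begin
    job.delete
  rescue Beaneater::NotFoundError
    # TTR expired while we were processing: the server released the job and
    # another worker may already have reserved it, so there is nothing for
    # us to delete (or bury); just move on to the next job.
  end
rescue StandardError
  # guard the bury in the same way instead of burying unconditionally
  job.bury if job.exists? && job.reserved?
end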

@jenrzzz commented Sep 10, 2013

👍

I worked around this by just increasing the TTR, but it does seem like a more appropriate error is in order if possible.
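
For reference, the workaround is just a larger :ttr when the job is put; a sketch assuming the usual Beaneater tube API, where payload stands for the job body and 600 is an arbitrary value well above our worst-case processing time:

tube = @beanstalk.tubes.find('whatever')
# give the worker plenty of time before the server times the job out
tube.put(payload, :ttr => 600)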

@pagojo (Author) commented Sep 10, 2013

Yup, that is my workaround as well. However, a) the error handling should be fixed, and b) for clients that do wish to utilise the TTR, the error handling is broken.

I recommend that, until this is fixed, we at least correct the documentation.

BTW, who is responsible for fixing this (as it is shared between Beaneater and the beanstalkd server)?

@nesquena (Member)

@pagojo Thanks, I'm not exactly sure what the right approach to fixing this is yet. Would you be willing to send a PR for the updated docs? I'd be happy to pull that in.

If anyone has a particular recommendation as to the best way to patch this from the beaneater side, let me know or send a PR.

@thesmart

I just ran into this myself.

What if reserved? checked against a known TTR value rather than relying on a potentially stale self.stats.state value?
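
Roughly something like this (a purely hypothetical sketch; mark_reserved! and reserved_within_ttr? are made-up names, not Beaneater API):

require 'beaneater'

# Idea: remember when the job was reserved and with what TTR, and stop
# trusting the reservation once the TTR has elapsed locally, even if
# stats.state would still report "reserved".
class Beaneater::Job
  def mark_reserved!(ttr)
    @reserved_at = Time.now
    @local_ttr   = ttr
  end

  def reserved_within_ttr?
    return false unless @reserved_at && @local_ttr
    (Time.now - @reserved_at) < @local_ttr
  end
end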
