exception_executions not counted correctly? #215

Closed
Timmitry opened this issue Feb 12, 2021 · 5 comments

@Timmitry
Contributor

Hello, and thank you for this great gem! I'm using it for the first time, but am already very satisfied, as the integration with ActiveJob feels very good 👌

I was just testing my application in production. I have a clockwork process which enqueues a DummyJob, which is then executed by a separate good_job process. The dummy job looks like this:

class DummyJob < ApplicationJob
  retry_on KeyError, wait: :exponentially_longer, attempts: 3

  def perform
    sleep(10)
    {}.fetch(:foo) if rand > 0.2
  end
end

I.e. it fails randomly with a KeyError and is retried up to 3 times (for this error).

Now I was watching the good_job dashboard, and saw the three executions of the job (newest first):

37cf4067-1678-4de3-85a5-93ae53325981 	be09bce1-16d7-4614-9f08-92a97db36e18 	DummyJob 	default 	2021-02-12 13:22:56 UTC 	KeyError: key not found: :foo 	
{
  "job_id": "be09bce1-16d7-4614-9f08-92a97db36e18",
  "locale": "en",
  "priority": null,
  "timezone": "UTC",
  "arguments": [],
  "job_class": "DummyJob",
  "executions": 2,
  "queue_name": "default",
  "enqueued_at": "2021-02-12T13:22:36Z",
  "provider_job_id": "1a01e6c5-973a-443c-86e0-7c8a274d49ef",
  "exception_executions": {
    "[KeyError]": 12
  }
}

1a01e6c5-973a-443c-86e0-7c8a274d49ef 	be09bce1-16d7-4614-9f08-92a97db36e18 	DummyJob 	default 	2021-02-12 13:22:26 UTC 	KeyError: key not found: :foo 	
{
  "job_id": "be09bce1-16d7-4614-9f08-92a97db36e18",
  "locale": "en",
  "priority": null,
  "timezone": "UTC",
  "arguments": [],
  "job_class": "DummyJob",
  "executions": 1,
  "queue_name": "default",
  "enqueued_at": "2021-02-12T13:22:23Z",
  "provider_job_id": "82974df4-b046-4a9b-9a80-625eb1060144",
  "exception_executions": {
    "[KeyError]": 2
  }
}

82974df4-b046-4a9b-9a80-625eb1060144 	be09bce1-16d7-4614-9f08-92a97db36e18 	DummyJob 	default 	2021-02-12 13:22:12 UTC 	KeyError: key not found: :foo 	
{
  "job_id": "be09bce1-16d7-4614-9f08-92a97db36e18",
  "locale": "en",
  "priority": null,
  "timezone": "UTC",
  "arguments": [],
  "job_class": "DummyJob",
  "executions": 0,
  "queue_name": "default",
  "enqueued_at": "2021-02-12T13:22:12Z",
  "provider_job_id": null,
  "exception_executions": {
    "[KeyError]": 1
  }
}

As you can see, the executions go from 0 to 2, which seems correct. The exception_executions for KeyError, however, go from 1 to 2 to 12? For the next dummy job, which was also retried 3 times, I had 1, 2, and 20.

I'm not sure if I misunderstand some behaviour, or if this might be an ActiveJob bug, but I thought I'd ask here first.

Thank you!

@bensheldon
Owner

bensheldon commented Feb 12, 2021

@Timmitry Great question! I think you're experiencing the default behavior:

  • ActiveJob's retry_on, after the number of attempts is exceeded, will re-raise the exception up to the adapter (GoodJob)
  • GoodJob, by default, will re-run a job when an exception is raised during job execution.

You can change this in either of two ways (a sketch of both follows the list below):

  • Pass a block to retry_on; it is called instead of re-raising the exception once the attempts are exceeded: retry_on(KeyError, wait: :exponentially_longer, attempts: 3) { |_job, _error| nil }
  • Change the GoodJob default so that jobs are not re-run when an unhandled exception is raised: GoodJob.retry_on_unhandled_error = false
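
A minimal sketch of both options together, assuming the DummyJob from above (the initializer path and the logging call are just examples):

# config/initializers/good_job.rb
# Option 2: when an unhandled exception still reaches GoodJob,
# mark the job as finished instead of re-running it.
GoodJob.retry_on_unhandled_error = false

# app/jobs/dummy_job.rb
class DummyJob < ApplicationJob
  # Option 1: handle the exhausted-retries case in a block so the
  # exception is not re-raised up to the adapter. The block receives
  # the job and the error.
  retry_on KeyError, wait: :exponentially_longer, attempts: 3 do |job, error|
    Rails.logger.warn("#{job.class.name} gave up after #{job.executions} executions: #{error.message}")
  end

  def perform
    sleep(10)
    {}.fetch(:foo) if rand > 0.2
  end
end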

Please take a look at this section in the Readme. Any ideas on how to better document/explain this behavior?

https://github.com/bensheldon/good_job#retries

@Timmitry
Contributor Author

Hey @bensheldon, thanks for the quick answer - I'll be glad to help improve the documentation, but first, some more questions 🙂

Let's start with the following setup for a dummy job:

# initializer
GoodJob.on_thread_error = ->(exception) do
  binding.pry
  Raven.capture_exception(exception)
end
GoodJob.preserve_job_records = true


# dummy_job.rb
class DummyJob < ApplicationJob
  around_perform do |_job, block|
    block.call
  rescue StandardError => e
    Raven.capture_exception(e)
    binding.pry # around_perform
    raise
  end

  retry_on KeyError, wait: 1.second, attempts: 3 do |_job, _exception|
    binding.pry # retry_on
    raise _exception
  end

  def perform
    binding.pry # perform
    {}.fetch(:foo)
  end
end

Now, if I enqueue one dummy job, I'll hit the binding.pry breakpoints in the following order:

perform -> around_perform (Execution 1)
perform -> around_perform (Execution 2)
perform -> around_perform (Execution 3)
retry_on (We specified attempts: 3, and the job failed 3 times, so now in ActiveJob, the job is failed)
retry_on (Why is this being called again?)

And from then on, it always goes like this, in a seemingly infinite loop (at least until exception_executions reached 40, at which point I stopped trying):
perform -> around_perform -> retry_on -> retry_on

If I call (_job.)exception_executions at the various binding.pry breakpoints, it is always an even value in perform and around_perform, i.e. it increases by 2 between executions, but it increases by 1 in the retry_on block.

The on_thread_error handler is only reached when I use !!! to exit the binding.pry.

It seems like GoodJob retries infinitely when GoodJob.retry_on_unhandled_error is not set (i.e. it defaults to true); however, in production it seemed to stop randomly at 12 or 20 executions.

Maybe you can shed some light on this behaviour?

Overall, my current thinking would be that retry_on_unhandled_error should always be set to false, and that people should use ActiveJob's retry mechanism instead - especially as I do not see a way to specify a wait time between retries in GoodJob's own retry handling.
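
For illustration, something like this is what I have in mind (a rough sketch, not tested; the 5-second wait is arbitrary):

# config/initializers/good_job.rb
GoodJob.retry_on_unhandled_error = false # let ActiveJob own the retry behaviour

# app/jobs/dummy_job.rb
class DummyJob < ApplicationJob
  # ActiveJob lets me choose the wait between attempts, which I don't
  # see a way to do with GoodJob's own retry-on-unhandled-error behaviour.
  retry_on KeyError, wait: 5.seconds, attempts: 3

  def perform
    {}.fetch(:foo)
  end
end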

@bensheldon
Owner

@Timmitry I'll dig more into the examples you gave. I just wanted to respond to this:

Overall, my current thinking would be that retry_on_unhandled_error should always be set to false, and that people should use ActiveJob's retry mechanism instead - especially as I do not see a way to specify a wait time between retries in GoodJob's own retry handling.

I largely agree with that. But as a default, I am opinionated that GoodJob should preserve jobs on unhandled exceptions. This is especially true if someone is migrating from another job adapter and is likely not using ActiveJob for error handling at the time they swap in GoodJob. Safe by default is my opinion here, even if it's a bit rough.

@bensheldon
Owner

Thanks for documenting all of these things. There's a lot to cover, so let me know if you want me to dive deeper into something:

  • I'm not able to reproduce the even/odd execution behavior or the stopping at 12 or 20 executions (we can hop on a Zoom call and try some joint debugging). There should be infinite retries when GoodJob.retry_on_unhandled_error = true.
  • Seeing exception_executions counts increase after the retry_on attempts are exceeded is a bug. In GoodJob's current model, the job params should be immutable, and that value should not change at all when an exception is raised to GoodJob. GoodJob will re-execute the job but shouldn't modify any of the job's params. This is fixed in "Ensure Job#serialized_params are immutable" #218.
  • I was able to reproduce that on_thread_error is not called when an exception bubbles up to GoodJob. I think that's intentional (because the exception is caught by GoodJob), but I'm thinking that maybe it should trigger on any job errors that bubble up.

@bensheldon
Owner

@Timmitry I am going to close this issue because I believe all of the problems have either been addressed or are tracked in targeted issues.
