Job results not being reported properly. #25

Open
MrJoy opened this issue Aug 21, 2019 · 3 comments

Comments

@MrJoy

MrJoy commented Aug 21, 2019

I'm seeing fairly frequent instances in which a job appears to be running for 30 minutes -- but, as far as I can tell, it is not.

One of the job types where I see this records a metric to DataDog when it completes successfully; that metric never goes above 90 seconds or so. I am seeing some job failures here and there, but they are all below 30 minutes.

Instead, I suspect this is related to another problem we're having: we're seeing bursts of dial errors, with workers and other clients failing to connect to Faktory. What would happen if a job completed (success or failure), but the worker was unable to report that status back because it couldn't get a connection to the server?

Looking at faktory_worker_go:

https://github.com/contribsys/faktory_worker_go/blob/master/runner.go#L260

    mgr.with(func(c *faktory.Client) error {
        if err != nil {
            return c.Fail(job.Jid, err, nil)
        } else {
            return c.Ack(job.Jid)
        }
    })

It appears that .with can return an error, but that this error is routinely ignored.
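
To make that concrete, here is a rough sketch of the kind of handling I mean. The names (withFunc, reportResult) are hypothetical, withFunc just mirrors the shape of the manager's internal with helper, and I'm using the standard library log package where FWG's own logger would presumably be used instead:

    package example

    import (
        "log"

        faktory "github.com/contribsys/faktory/client"
    )

    // withFunc mirrors the shape of the manager's internal `with` helper:
    // it hands the callback a live client connection and returns any
    // connection/protocol error.
    type withFunc func(fn func(c *faktory.Client) error) error

    // reportResult acks or fails a job and logs the error returned by `with`
    // instead of silently discarding it. (Hypothetical helper, not the
    // library's actual code.)
    func reportResult(with withFunc, jid string, jobErr error) {
        err := with(func(c *faktory.Client) error {
            if jobErr != nil {
                return c.Fail(jid, jobErr, nil)
            }
            return c.Ack(jid)
        })
        if err != nil {
            log.Printf("faktory: could not report result for job %s: %v", jid, err)
        }
    }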

I don't think calling panic is appropriate for this situation, but perhaps some combination of the following might be helpful:

  1. Getting and holding a connection to both retrieve the job, and report the result (and maybe as a side-effect, make the connection available to the worker via context so it can fire jobs without having to establish its own connection).
  2. Introducing a pluggable logger mechanism so we can at least record that these failures are happening.
  3. Having some sort of retry loop, specifically around reporting the results of a job (see the sketch at the end of this comment).

The first option would have some risks/challenges of its own, of course. You'd need to ensure the connection didn't time out, handle reconnecting if it did go away (either due to timeout or a server failure), etc. I'm sure you have more insight into how it would impact operational concerns in general, so forgive me if it's an Obviously Stupid Idea. That said, for situations involving a relatively high job volume (mid hundreds to low thousands of jobs per second), the many-transient-connections approach has proven to be a bit of a challenge (I'm paying attention to #219 / #222 for this reason, and we've had to be careful about tuning things like FIN_WAIT in our server configuration).
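
And to illustrate the retry idea in option 3, reusing the hypothetical withFunc shape from the sketch above (the attempt count and backoff are arbitrary, and this assumes the time and log imports):

    // reportWithRetry retries reporting a job's result a few times with a
    // short, growing backoff before giving up and logging the last error.
    // (Hypothetical sketch, not the library's actual code.)
    func reportWithRetry(with withFunc, jid string, jobErr error) error {
        var last error
        for attempt := 1; attempt <= 5; attempt++ {
            last = with(func(c *faktory.Client) error {
                if jobErr != nil {
                    return c.Fail(jid, jobErr, nil)
                }
                return c.Ack(jid)
            })
            if last == nil {
                return nil
            }
            // dial errors seem to come in bursts, so wait a bit before retrying
            time.Sleep(time.Duration(attempt) * time.Second)
        }
        log.Printf("faktory: giving up on reporting result for job %s: %v", jid, last)
        return last
    }

Even a couple of attempts with a short backoff would probably be enough to ride out the bursts of dial errors we're seeing.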

@mperham
Contributor

mperham commented Aug 27, 2019

Good catch. We should definitely be running golint or something similar to find the places in the code where we aren't handling errors.

Logging improvements are also welcome -- today FWG uses the Faktory logger infrastructure in faktory/util.

@mperham
Contributor

mperham commented Aug 27, 2019

And thanks for your patience on all these issues -- I left for vacation right as you entered them all. Wonderful timing. 😂

@MrJoy
Author

MrJoy commented Aug 27, 2019

No worries at all! Thank you for your patience with me, as I blunder my way through the learning curve!
