Make jobs run a little safer on Heroku #403

Merged
Mr0grog merged 2 commits into master from make-jobs-a-little-safer-on-heroku on Oct 8, 2018

Conversation


@Mr0grog Mr0grog commented Oct 3, 2018

While trying to get Wayback imports working reliably, I’ve noticed that we frequently wind up with job progress stuck hanging for hours and hours and hours. Obviously we need to give up at some point on the ETL script side, but it’s not like any import jobs are actually taking that long.

Instead, what’s really happening is that Heroku has some Resque-incompatible behavior when it randomly kills dynos every so often. Specifically, it sends SIGTERM to both the Resque process and to the process’s children, but Resque’s children aren’t prepared for that — the children simply die and the parent never tracks the fact that the job didn’t finish and so never re-enqueues it to run later. The resque-heroku-signals gem should solve that issue.
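To make the failure mode concrete, here is a minimal sketch of the bookkeeping that goes missing: a parent that forks a child for a job, checks how the child exited, and puts the job back on the queue if it died early. This is not how Resque or resque-heroku-signals is implemented; `run_with_requeue` and the plain-array `queue` are hypothetical names used only for illustration.

```ruby
# Illustrative sketch only: not Resque's or resque-heroku-signals' real code.
# `run_with_requeue` and `queue` are hypothetical names.
def run_with_requeue(queue, job)
  # Run the job in a forked child, roughly as a forking worker would.
  pid = fork do
    job.call
    exit!(0) # exit cleanly without running at_exit hooks inherited from the parent
  end

  # Optional hook so a caller (or a test) can act while the child is running.
  yield pid if block_given?

  # wait2 returns [pid, Process::Status]; success? is true only for exit code 0,
  # so a child killed by a signal or exiting non-zero counts as unfinished.
  _, status = Process.wait2(pid)

  if status.success?
    :completed
  else
    queue.push(job) # remember the job so it can run again later
    :requeued
  end
end
```

Sending `SIGTERM` to the child from the block (`run_with_requeue(queue, job) { |pid| Process.kill('TERM', pid) }`) simulates the platform killing the child directly; the parent should then see an unsuccessful exit status and re-enqueue.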

Additionally, since the Wayback import jobs are so big, it would be nice to have some sense of progress in the API and the database. This now persists updates to the Import record every 5 seconds (so as to not cause a lot of overhead that slows down processing an import).
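As a rough sketch of the throttling idea (not the code in this PR), something like the following would save a record at most once per interval and flush once at the end. `ThrottledSaver`, `touch`, and `flush` are hypothetical names, and the record is assumed to respond to `save`, as an ActiveRecord model does.

```ruby
# Illustrative sketch only; the class and method names are assumptions,
# not taken from the PR.
class ThrottledSaver
  def initialize(record, interval: 5,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @record = record
    @interval = interval
    @clock = clock
    # Start the timer now, so jobs shorter than the interval never write
    # in-progress data.
    @last_saved_at = @clock.call
    @dirty = false
  end

  # Call after every change to the record. Writes at most once per interval.
  # Returns true if the record was saved, false otherwise.
  def touch
    @dirty = true
    now = @clock.call
    return false if (now - @last_saved_at) < @interval

    flush(now)
  end

  # Call once when the job ends so the final state is always persisted.
  def flush(now = @clock.call)
    return false unless @dirty

    @record.save
    @last_saved_at = now
    @dirty = false
    true
  end
end
```

A job would call `saver.touch` after each progress, warning, or error update and `saver.flush` when it finishes. Using a monotonic clock avoids surprises if the system time changes mid-job.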

It turns out Heroku sends kill signals directly to all child processes of a Resque worker, which Resque is not designed to handle. That causes jobs to get killed without getting logged as killed, and so they never get retried later. I'm pretty sure this is one of the causes of the hanging Internet Archive import jobs I'm seeing. See this GitHub issue for more about why the gem exists: resque/resque#1559
Since import jobs can go for a long time, it's useful to periodically persist any updates to the `Import` record so progress, warnings, and errors are visible in the API. It would be unnecessary overhead to persist every update immediately, so this opts for updating every 5 seconds (I'm thinking that if the job is shorter than that, we don't really need in-progress data anyway).

Mr0grog commented Oct 8, 2018

Alright, I’m going to go ahead and merge this, since nobody has reviewed.

@Mr0grog Mr0grog merged commit 320efe1 into master Oct 8, 2018
@Mr0grog Mr0grog deleted the make-jobs-a-little-safer-on-heroku branch October 8, 2018 17:05