Make jobs run a little safer on Heroku #403

Mr0grog · 2018-10-03T19:28:39Z

While trying to get Wayback imports working reliably, I’ve noticed that we frequently wind up with job progress stuck hanging for hours and hours and hours. Obviously we need to give up at some point on the ETL script side, but it’s not like any import jobs are actually taking that long.

Instead, what’s really happening is that Heroku has some Resque-incompatible behavior when it randomly kills dynos every so often. Specifically, it sends SIGTERM to both the Resque process and to the process’s children, but Resque’s children aren’t prepared for that — the children simply die and the parent never tracks the fact that the job didn’t finish and so never re-enqueues it to run later. The resque-heroku-signals gem should solve that issue.
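
To make the failure mode concrete, here is a minimal, generic sketch of a forked child that turns SIGTERM into an exception and exits with a distinctive status, so the parent can tell the job did not finish. This is not Resque's code or the gem's implementation; `JobTerminated`, `fork_job`, `wait_for_job`, and the exit status 75 are all hypothetical names and values chosen for illustration.

```ruby
# Hypothetical error raised inside the child when SIGTERM arrives.
class JobTerminated < StandardError; end

# Arbitrary (assumed) exit status meaning "interrupted, please retry".
INTERRUPTED_STATUS = 75

# Fork a child that runs the given job block. Returns the child's pid once
# the child has installed its SIGTERM handler.
def fork_job(&job)
  ready_r, ready_w = IO.pipe
  pid = fork do
    ready_r.close
    # Convert SIGTERM into an exception instead of dying silently.
    Signal.trap('TERM') { raise JobTerminated, 'SIGTERM received' }
    begin
      ready_w.puts('ready') # tell the parent the handler is in place
      ready_w.close
      job.call
      exit!(0)
    rescue JobTerminated
      # A real worker would clean up and record the failure here.
      exit!(INTERRUPTED_STATUS)
    end
  end
  ready_w.close
  ready_r.gets # block until the child reports it is ready
  ready_r.close
  pid
end

# Wait for the child and report whether the job should be run again.
def wait_for_job(pid)
  _, status = Process.wait2(pid)
  status.success? ? :completed : :needs_retry
end
```

With this shape, a parent that calls `wait_for_job` after a dyno shutdown sees `:needs_retry` rather than losing track of the job, which is the bookkeeping that goes missing when the child simply dies.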

Additionally, since the Wayback import jobs are so big, it would be nice to have some sense of progress in the API and the database. This PR now persists updates to the `Import` record every 5 seconds (so as not to add overhead that slows down processing an import).

It turns out Heroku sends kill signals directly to all child processes of a Resque worker, which Resque is not designed to handle. That causes jobs to get killed without getting logged as killed, and so they never get retried later. I'm pretty sure this is one of the causes of the hanging import jobs I'm seeing for Internet Archive. See this GitHub issue for more about why the gem exists: resque/resque#1559

Since import jobs can go for a long time, it's useful to periodically persist any updates to the `Import` record so progress, warnings, and errors are visible in the API. It would be unnecessary overhead to persist every update immediately, so this opts for updating every 5 seconds (I'm thinking that if the job is shorter than that, we don't really need in-progress data anyway).
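
As a rough illustration of the throttling idea, the sketch below saves a record at most once per interval and flushes any pending changes at the end of the job. It is not the code in this PR: `ThrottledSaver`, `touch`, and `flush` are hypothetical names, and the record is assumed only to respond to `save`.

```ruby
# Hypothetical helper: persists a record at most once per `interval` seconds.
class ThrottledSaver
  def initialize(record, interval: 5,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @record = record
    @interval = interval
    @clock = clock
    # Start the timer at creation, so jobs shorter than `interval`
    # never write in-progress data.
    @last_saved_at = @clock.call
    @dirty = false
  end

  # Call after each change to the record. Saves only if the interval has
  # elapsed since the last save; returns true when a save happened.
  def touch
    @dirty = true
    now = @clock.call
    return false if (now - @last_saved_at) < @interval

    flush(now)
  end

  # Save unconditionally if there are unsaved changes (e.g. when the job ends).
  def flush(now = @clock.call)
    return false unless @dirty

    @record.save
    @last_saved_at = now
    @dirty = false
    true
  end
end
```

A job would call `touch` each time it appends a warning or bumps a progress counter, then call `flush` once when it finishes so the final state is always written.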

Mr0grog · 2018-10-08T17:05:04Z

Alright, I’m going to go ahead and merge this, since nobody has reviewed.

Mr0grog added 2 commits October 3, 2018 12:16

Mr0grog requested review from danielballan and jsnshrmn October 3, 2018 19:28

Mr0grog added the in progress label Oct 3, 2018

Mr0grog merged commit 320efe1 into master Oct 8, 2018

Mr0grog removed the in progress label Oct 8, 2018

Mr0grog deleted the make-jobs-a-little-safer-on-heroku branch October 8, 2018 17:05
