Recover lost batches #7

Open
sleekweasel opened this issue Mar 4, 2016 · 0 comments

If a batch-run times out or otherwise fails to return results for some or all of the tests in the batch, we should consider re-queueing those tests. (Implies maintaining the currently running tests in the database.)

A test should not be re-queued endlessly: if it keeps failing, it's probably genuinely timing out or killing its agent. (Implies multiple queues.)

Re-queued tests are re-run in isolation, to separate bad tests from innocent batch-mates. (Implies workers know about secondary queuing.)

We should recover even (especially) if the agent is killed with extreme prejudice. (Implies worker-tracking.)

Workers should not terminate until the queue is empty and all workers are idle. (Implies coordination.)

Proposal:

  1. Workers should use transactions (http://redis.io/topics/transactions) to pull 'n' tests off the primary queue (or only 1 from the requeue) and into their own set, and then run them. Once the run is finished, any tests from the primary queue that weren't executed for any reason are added to the requeue.
  2. The worker-controller maintains a set listing each worker, removing a worker from the set when it terminates. If a worker terminates with tests still in its set, the worker-controller adds those tests to the requeue.
  3. The worker-controller polls for an empty queue and requeue, and for all worker sets to be empty, whereupon the worker-controller puts a 'tests complete' marker in a controller set and workers terminate in response.
  4. The various queues and sets have names based on that of the primary queue - e.g. queue, queue_requeue, queue_worker0, queue_control. The processes using them will empty them at start-up (in case a previous run failed catastrophically), but they should be naturally empty by the end of a normally completed run.
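
To make the bookkeeping in steps 1 to 4 concrete, here is a minimal in-memory sketch in Python. It is only a model of the proposal, not an implementation: plain deques and sets stand in for the Redis queues and sets, and `BatchState`, `key_names`, and every method name are hypothetical. In a real store the move in `claim` would have to be atomic (for example inside a MULTI/EXEC transaction). The issue does not say how the controller should tell first-run tests from already-requeued ones when a worker dies, so `worker_died` simply requeues everything the worker held.

```python
from collections import deque


def key_names(queue, worker_ids):
    """Step 4: derive related key names from the primary queue name."""
    return {
        "primary": queue,
        "requeue": f"{queue}_requeue",
        "control": f"{queue}_control",
        "workers": {w: f"{queue}_worker{w}" for w in worker_ids},
    }


class BatchState:
    """In-memory stand-in for the shared store (hypothetical, not a Redis client)."""

    def __init__(self, tests, worker_ids):
        self.primary = deque(tests)
        self.requeue = deque()
        self.claimed = {w: set() for w in worker_ids}  # one set per live worker
        self.control = set()

    def claim(self, worker, n):
        """Step 1: take one requeued test, else up to n primary tests."""
        if self.requeue:
            batch, isolated = [self.requeue.popleft()], True
        else:
            count = min(n, len(self.primary))
            batch, isolated = [self.primary.popleft() for _ in range(count)], False
        self.claimed[worker].update(batch)
        return batch, isolated

    def report(self, worker, batch, isolated, finished):
        """Step 1, end of run: release the batch and requeue unexecuted primary tests."""
        for test in batch:
            self.claimed[worker].discard(test)
            # Tests that came from the requeue are not requeued a second time.
            if test not in finished and not isolated:
                self.requeue.append(test)

    def worker_died(self, worker):
        """Step 2: drop the worker and requeue whatever it still held."""
        self.requeue.extend(sorted(self.claimed.pop(worker)))

    def check_complete(self):
        """Step 3: set the marker once both queues and all worker sets are empty."""
        if not self.primary and not self.requeue and not any(self.claimed.values()):
            self.control.add("tests complete")
        return "tests complete" in self.control


state = BatchState(["t1", "t2", "t3"], worker_ids=[0, 1])
batch, isolated = state.claim(0, n=2)               # worker 0 takes t1 and t2
state.report(0, batch, isolated, finished={"t1"})   # t2 never ran, so it is requeued
```

After this run, the next call to `claim` by any worker returns `t2` on its own, which is the "re-run in isolation" behaviour described above.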