BUG - aaq_batch does not remove_from_jobqueue on inactive/failed/error job, leading to infinite loop

For jobs with status 'failed', 'inactive' or 'error', the behaviour differs depending on whether aaworkermaximumretry has been reached.
Before the maximum number of retries is reached, remove_from_jobqueue() is called with the retry input set to true. This is the correct behaviour.
Once the maximum number of retries is reached, aa throws an error (because it calls aas_log with the error input set to true) but does NOT call remove_from_jobqueue() first.
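
The two branches can be illustrated with a small toy model. This is a Python sketch, not aa's MATLAB code: `ToyQueue`, `handle_failed_job` and `max_retry` are hypothetical stand-ins for aaq_batch, its failure handling and aaworkermaximumretry. It shows one possible ordering for a fix, in which the job is removed from the queue before the error is raised.

```python
class ToyQueue:
    """Hypothetical stand-in for the aaq_batch job queue (not aa's real API)."""

    def __init__(self, max_retry):
        self.max_retry = max_retry  # plays the role of aaworkermaximumretry
        self.jobs = {}              # job id -> number of retries so far

    def add(self, job_id):
        self.jobs[job_id] = 0

    def remove_from_jobqueue(self, job_id, retry):
        """Remove the job; if retry is True, resubmit it with one more retry counted."""
        retries = self.jobs.pop(job_id)
        if retry:
            self.jobs[job_id] = retries + 1

    def handle_failed_job(self, job_id):
        """Handle a job whose state is 'failed', 'inactive' or 'error'."""
        if self.jobs[job_id] < self.max_retry:
            # Below the limit: remove and resubmit (the branch that already works).
            self.remove_from_jobqueue(job_id, retry=True)
        else:
            # Limit reached: remove the job FIRST, then raise, so nothing is left behind.
            self.remove_from_jobqueue(job_id, retry=False)
            raise RuntimeError(f"job {job_id} failed after {self.max_retry} retries")
```

In this model, the reported bug corresponds to raising the error in the `else` branch without the preceding `remove_from_jobqueue` call, which would leave the job in `jobs` after the error.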

At least when Parallel Server and MJS are used (rather than parcluster('local')), the cluster is persistent, so these jobs are never removed and stay in the pool forever. They also fill up the queue: if an aa script that produces such a job is run multiple times, nfreeworkers in runall() in aaq_batch drops to zero while there are no entries in obj.jobinfo, which leads to an infinite loop.
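
The stall described above can be reproduced in a toy scheduling loop. Again this is a Python sketch under assumptions, not the actual runall() code: `run_all`, `stale_jobs` and `max_idle_polls` are hypothetical names. The `max_idle_polls` guard exists only so that the example terminates. Without such a guard, the last branch would spin forever.

```python
def run_all(pending_tasks, pool_size, stale_jobs, max_idle_polls=3):
    """Toy scheduling loop; returns 'done' or 'stalled'.

    stale_jobs: number of jobs left on the persistent cluster by earlier runs.
    """
    pending = list(pending_tasks)  # copy so the caller's list is not modified
    tracked = []                   # plays the role of obj.jobinfo
    idle_polls = 0
    while pending or tracked:
        n_free_workers = pool_size - stale_jobs - len(tracked)
        if n_free_workers > 0 and pending:
            tracked.append(pending.pop())  # submit one task
            idle_polls = 0
        elif tracked:
            tracked.pop()                  # pretend one tracked job finished
            idle_polls = 0
        else:
            # No free workers and nothing tracked: no progress is possible.
            idle_polls += 1
            if idle_polls >= max_idle_polls:
                return "stalled"
    return "done"
```

With an empty pool the loop completes, while a pool already filled by leftover jobs never gets a free worker: `run_all(["a"], pool_size=2, stale_jobs=2)` reaches the guard and reports a stall.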



BUG - aaq_batch does not remove_from_jobqueue on inactive/failed/error job, leading to infinite loop #314
