BUG - aaq_batch does not remove_from_jobqueue on inactive/failed/error job, leading to infinite loop #314

@AljenU

For jobs with status 'failed', 'inactive' or 'error', aaq_batch behaves differently before and after aaworkermaximumretry is reached.
Before the maximum number of retries is reached, remove_from_jobqueue() is called with the retry input set to true. This behaviour is fine.
Once the maximum number of retries is reached, aa throws an error (because it calls aas_log with the error input set to true), but does NOT call remove_from_jobqueue() first.
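
The ordering problem can be illustrated with a small toy model. This is a sketch in Python, not aa code: the JobQueue class, handle_bad_job, and the exact retry comparison are assumptions made for illustration. Only the name remove_from_jobqueue and its retry input come from the description above. The sketch shows one possible fix, which is to remove the job from the queue before raising the error.

```python
class JobQueue:
    """Toy stand-in for a persistent scheduler job pool (hypothetical, not aa code)."""

    def __init__(self):
        self.jobs = {}        # job id -> number of retries used so far
        self.requeued = []    # (job id, new retry count) pairs queued for another attempt

    def remove_from_jobqueue(self, job_id, retry):
        # Always drop the job from the pool; optionally schedule a retry.
        retries = self.jobs.pop(job_id)
        if retry:
            self.requeued.append((job_id, retries + 1))


def handle_bad_job(queue, job_id, max_retry):
    """Handle a job whose state is 'failed', 'inactive' or 'error'."""
    if queue.jobs[job_id] < max_retry:
        # Below the retry limit: remove and retry (the case that already works).
        queue.remove_from_jobqueue(job_id, retry=True)
        return
    # Retry limit reached: remove the job FIRST, then raise.
    # The reported bug corresponds to raising without this removal.
    queue.remove_from_jobqueue(job_id, retry=False)
    raise RuntimeError(f"job {job_id} failed after {max_retry} retries")
```

With this ordering, the pool no longer holds the job when the error propagates, so a persistent scheduler is left clean.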

At least when Parallel Server and MJS are used (instead of a parcluster('local')), the job pool is persistent, so these jobs stay in the pool forever because they are never removed. They also fill up the queue: if an aa script that produces such a job is run multiple times, nfreeworkers in runall() in aaq_batch drops to zero while obj.jobinfo is empty, which leads to an infinite loop.
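
The stall can be reproduced in a simplified model. The Python sketch below is not the actual runall() loop: the function name run_all, its parameters, and the iteration cap are assumptions for illustration. It only models the two quantities named above, nfreeworkers and jobinfo, plus a count of stale jobs left behind by earlier runs.

```python
def run_all(pool_size, stale_jobs, pending_tasks, max_iterations=1000):
    """Return True if all tasks complete, False if the loop stalls.

    pool_size:     total workers in the persistent pool
    stale_jobs:    jobs left in the pool by earlier runs (never removed)
    pending_tasks: list of tasks still to be submitted (consumed by this function)
    """
    jobinfo = []  # jobs submitted and tracked by this run
    for _ in range(max_iterations):
        if not pending_tasks and not jobinfo:
            return True
        # Stale jobs occupy workers even though this run does not track them.
        nfreeworkers = pool_size - stale_jobs - len(jobinfo)
        if nfreeworkers > 0 and pending_tasks:
            jobinfo.append(pending_tasks.pop())
        elif jobinfo:
            jobinfo.pop(0)  # a tracked job finishes and frees its worker
        # Otherwise: no free workers and nothing tracked, so no state can change.
    return False  # the real loop has no iteration cap and would spin forever
```

In this model, run_all(4, 4, ["a"]) never makes progress because every worker is held by a stale job and jobinfo stays empty, while run_all(4, 3, ["a", "b"]) completes using the one remaining worker.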
