
Tasks stuck in queued state when using Triggerer process along with max_active_tis_per_dag #34624

Open
MichailKaramanos opened this issue Sep 26, 2023 · 4 comments
Labels
affected_version:2.7 Issues Reported for 2.7 area:core kind:bug This is a clearly a bug

Comments


MichailKaramanos commented Sep 26, 2023

Apache Airflow version

2.7.1

What happened

In our company, after we started leveraging the new triggerer process, one specific Airflow instance began having tasks constantly stuck in the queued state at random intervals, almost every day.
The only difference between this instance and the other ones is that it has a DAG with max_active_tis_per_dag=1 configured on a deferrable task, to avoid having multiple instances of the same task running across all DAG runs.

  • The first strange behaviour we noticed was that this setting was not being respected: when some delay was introduced between DAG runs, multiple task instances were spawned and deferred.
  • The second behaviour we noticed was that tasks started getting stuck in the queued state almost every day, at random intervals.
  • The third behaviour was that our alerting started bouncing constantly between tasks in the QUEUED <-> SCHEDULED states.
  • Last but not least, the scheduler logs started looping, indicating that the concurrency limit had been reached, such as: Not executing <TaskInstance: ... [scheduled]> since the task concurrency for this task has been reached.

What you think should happen instead

Scheduler logs started being flooded with the following:

[2023-09-25T17:07:23.698+0000] {scheduler_job_runner.py:527} INFO - Not executing <TaskInstance: ... scheduled__2023-09-25T08:00:00+00:00 [scheduled]> since the task concurrency for this task has been reached.
[2023-09-25T17:07:23.698+0000] {scheduler_job_runner.py:479} INFO - DAG ... has 31/180 running and queued tasks

How to reproduce

A DAG with the following characteristics (a minimal sketch follows the list):

  • max_active_runs=4
  • an hourly schedule_interval
  • A deferrable task with max_active_tis_per_dag=1
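
For illustration, a minimal DAG sketch matching these characteristics (the DAG id, dates, and the choice of TimeDeltaSensorAsync as the deferrable task are assumptions, not taken from the report):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.time_delta import TimeDeltaSensorAsync

with DAG(
    dag_id="example_deferrable_concurrency",  # illustrative name
    start_date=datetime(2023, 9, 1),
    schedule="@hourly",
    max_active_runs=4,
) as dag:
    # Deferrable task limited to one active task instance across all DAG runs.
    wait = TimeDeltaSensorAsync(
        task_id="wait_one_hour",
        delta=timedelta(hours=1),
        max_active_tis_per_dag=1,
    )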

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

AKS

Anything else

After digging into the code, it seems that the following is happening:

To sum up, the fix would be to add the DEFERRED state to the EXECUTION_STATES structure:

EXECUTION_STATES = {
    TaskInstanceState.RUNNING,
    TaskInstanceState.QUEUED,
    TaskInstanceState.DEFERRED,
}

so that it behaves like pools do.
On the other hand, the pool logic should follow the same pattern and read from that EXECUTION_STATES structure, instead of building the set with a union (see the sketch after the snippet below):

allowed_execution_states = EXECUTION_STATES | {
    TaskInstanceState.DEFERRED,
}
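
For illustration only, a minimal sketch of how a slot-counting query could then reuse the shared constant directly (the helper function, its signature, and the session handling are assumptions, not Airflow's actual pool code, which also weighs pool_slots):

from sqlalchemy.orm import Session

from airflow.models.taskinstance import TaskInstance
from airflow.utils.state import TaskInstanceState

# Assumed shared constant, with DEFERRED included as proposed above.
EXECUTION_STATES = {
    TaskInstanceState.RUNNING,
    TaskInstanceState.QUEUED,
    TaskInstanceState.DEFERRED,
}

def occupied_slots(session: Session, pool_name: str) -> int:
    # Hypothetical helper: count task instances currently holding a slot in
    # the given pool. Deferred task instances count against the limit because
    # DEFERRED is part of EXECUTION_STATES, so no per-call set union is needed.
    return (
        session.query(TaskInstance)
        .filter(
            TaskInstance.pool == pool_name,
            TaskInstance.state.in_(EXECUTION_STATES),
        )
        .count()
    )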

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@MichailKaramanos MichailKaramanos added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Sep 26, 2023

boring-cyborg bot commented Sep 26, 2023

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@hussein-awala hussein-awala added affected_version:2.7 Issues Reported for 2.7 and removed needs-triage label for new issues that we didn't triage yet labels Sep 26, 2023
@hussein-awala
Member

Thank you for reporting this issue. Indeed, most of the concurrency limits don't consider the deferred state as an execution state; we recently fixed that in the pools (#32709), but we still have some problems with the other limits.

I will look into all the problems you described and check whether there is anything else to fix.

@hussein-awala hussein-awala self-assigned this Sep 26, 2023
@MichailKaramanos
Author

Hi @hussein-awala,
First, thanks for the reply.
Just to give you some feedback on this: in production we applied the patch to test the proposed solution, and we ended up removing the max_active_tis_per_dag limits and setting the DAG's max_active_runs=1! We only left the triggerer process to deal with the deferrable tasks.
There is something bigger around this problem. I mean, the tasks are still stuck, now bouncing between QUEUED -> SCHEDULED, without letting the DAG progress any further!

Here are the logs:

[2023-09-27T11:40:06.975+0000] {base_executor.py:280} ERROR - could not queue task TaskInstanceKey(dag_id='...', task_id='...', run_id='scheduled__2023-09-27T01:00:00+00:00', try_number=1, map_index=-1) (still running after 12 attempts)
[2023-09-27T11:40:06.975+0000] {base_executor.py:280} ERROR - could not queue task TaskInstanceKey(dag_id='...', task_id='...', run_id='scheduled__2023-09-27T01:00:00+00:00', try_number=1, map_index=-1) (still running after 12 attempts)
[2023-09-27T11:40:06.975+0000] {base_executor.py:280} ERROR - could not queue task TaskInstanceKey(dag_id='...', task_id='...', run_id='scheduled__2023-09-27T01:00:00+00:00', try_number=1, map_index=-1) (still running after 12 attempts)

What is weird is that the deferrable tasks completed successfully and yet remain in the queued state (the same happened before the patch, by the way).

@MichailKaramanos
Author

MichailKaramanos commented Oct 12, 2023

Hi again @hussein-awala,
Just to give you some additional and final feedback about this:
We ended up patching our production Airflow instances using the proposals from my initial comment, and everything is stable and running at full steam in our production environments. So in the end it was a simple, easy fix. The approach taken in your PR #34700 seems a bit different and more complex.
I am wondering whether all of that is needed, given that our patched production achieved stability… What do you think?

Regarding my previous comment, it seems that those errors were caused by other problems, such as Kubernetes client timeouts in conjunction with the new deployment (with the patch applied) being rolled out while tasks were already running. Once these were resolved, everything has been stable, as mentioned earlier…
