
Tasks stuck in queued state when using Triggerer process along with max_active_tis_per_dag #34624

Open
MichailKaramanos opened this issue Sep 26, 2023 · 4 comments
Labels
affected_version:2.7 Issues Reported for 2.7 area:core kind:bug This is a clearly a bug

Comments


MichailKaramanos commented Sep 26, 2023

Apache Airflow version

2.7.1

What happened

In our company, after we started leveraging the new triggerer process, one specific Airflow instance began having tasks constantly stuck in the queued state at random intervals, almost every day.
The only difference between this instance and the other ones is that it has a DAG with max_active_tis_per_dag=1 configured on a deferrable task, to avoid having multiple instances of the same task running across all DAG runs.

  • The first strange behaviour we noticed was that this setting was not being respected: when some delay was introduced between DAG runs, multiple task instances were spawned and deferred.
  • The second behaviour we noticed was that tasks started getting stuck in the queued state almost every day, at random intervals.
  • The third behaviour was that our alerting started bouncing constantly between tasks in the QUEUED <-> SCHEDULED states.
  • Last but not least, the scheduler logs started looping, indicating that the concurrency limit had been reached, such as: Not executing <TaskInstance: ... [scheduled]> since the task concurrency for this task has been reached.

What you think should happen instead

Scheduler logs started being flooded with the following:

[2023-09-25T17:07:23.698+0000] {scheduler_job_runner.py:527} INFO - Not executing <TaskInstance: ... scheduled__2023-09-25T08:00:00+00:00 [scheduled]> since the task concurrency for this task has been reached.
[2023-09-25T17:07:23.698+0000] {scheduler_job_runner.py:479} INFO - DAG ... has 31/180 running and queued tasks

How to reproduce

A DAG with the following characteristics (a minimal sketch follows the list):

  • max_active_runs=4
  • an hourly schedule_interval
  • A deferrable task with max_active_tis_per_dag=1
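
For illustration, a minimal DAG sketch matching these characteristics (the DAG id, dates, and the choice of TimeDeltaSensorAsync as the deferrable task are assumptions, not taken from the report):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.time_delta import TimeDeltaSensorAsync

with DAG(
    dag_id="example_deferrable_concurrency",  # illustrative name
    start_date=datetime(2023, 9, 1),
    schedule="@hourly",
    max_active_runs=4,
) as dag:
    # Deferrable task limited to one active task instance across all DAG runs.
    wait = TimeDeltaSensorAsync(
        task_id="wait_one_hour",
        delta=timedelta(hours=1),
        max_active_tis_per_dag=1,
    )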

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

AKS

Anything else

After digging into the code, it seems that the following is happening:

To sum up, the fix would be to add the DEFERRED state to the EXECUTION_STATES structure:

EXECUTION_STATES = {
    TaskInstanceState.RUNNING,
    TaskInstanceState.QUEUED,
    TaskInstanceState.DEFERRED,
}

so that it behaves like pools do.
On the other hand, the pool logic should follow the same pattern and read from that EXECUTION_STATES structure, instead of building the set with a union (see the sketch after the snippet below):

allowed_execution_states = EXECUTION_STATES | {
    TaskInstanceState.DEFERRED,
}
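
For illustration only, a minimal sketch of how a slot-counting query could then reuse the shared constant directly (the helper function, its signature, and the session handling are assumptions, not Airflow's actual pool code, which also weighs pool_slots):

from sqlalchemy.orm import Session

from airflow.models.taskinstance import TaskInstance
from airflow.utils.state import TaskInstanceState

# Assumed shared constant, with DEFERRED included as proposed above.
EXECUTION_STATES = {
    TaskInstanceState.RUNNING,
    TaskInstanceState.QUEUED,
    TaskInstanceState.DEFERRED,
}

def occupied_slots(session: Session, pool_name: str) -> int:
    # Hypothetical helper: count task instances currently holding a slot in
    # the given pool. Deferred task instances count against the limit because
    # DEFERRED is part of EXECUTION_STATES, so no per-call set union is needed.
    return (
        session.query(TaskInstance)
        .filter(
            TaskInstance.pool == pool_name,
            TaskInstance.state.in_(EXECUTION_STATES),
        )
        .count()
    )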

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@MichailKaramanos MichailKaramanos added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Sep 26, 2023

boring-cyborg bot commented Sep 26, 2023

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@hussein-awala hussein-awala added affected_version:2.7 Issues Reported for 2.7 and removed needs-triage label for new issues that we didn't triage yet labels Sep 26, 2023
@hussein-awala
Member

Thank you for reporting this issue. Indeed, most of the concurrency limits don't consider the deferred state as an execution state; we recently fixed that in the pools (#32709), but we still have some problems with the other limits.

I will look into all the problems you described and check whether there is anything else to fix.

@hussein-awala hussein-awala self-assigned this Sep 26, 2023
@MichailKaramanos
Author

Hi @hussein-awala,
First, thanks for the reply.
Just to give you some feedback on this: in production we applied the patch to test the proposed solution, and we ended up removing the max_active_tis_per_dag limits and setting the DAG's max_active_runs=1! We only left the triggerer process to deal with the deferrable tasks.
There is something bigger around this problem. I mean, the tasks are still stuck, now bouncing between QUEUED -> SCHEDULED, without letting the DAG progress any further!

Here are the logs:

[2023-09-27T11:40:06.975+0000] {base_executor.py:280} ERROR - could not queue task TaskInstanceKey(dag_id='...', task_id='...', run_id='scheduled__2023-09-27T01:00:00+00:00', try_number=1, map_index=-1) (still running after 12 attempts)
[2023-09-27T11:40:06.975+0000] {base_executor.py:280} ERROR - could not queue task TaskInstanceKey(dag_id='...', task_id='...', run_id='scheduled__2023-09-27T01:00:00+00:00', try_number=1, map_index=-1) (still running after 12 attempts)
[2023-09-27T11:40:06.975+0000] {base_executor.py:280} ERROR - could not queue task TaskInstanceKey(dag_id='...', task_id='...', run_id='scheduled__2023-09-27T01:00:00+00:00', try_number=1, map_index=-1) (still running after 12 attempts)

What is weird is that the deferrable tasks completed successfully and yet remain in the queued state (the same happened before the patch, by the way).

@MichailKaramanos
Author

MichailKaramanos commented Oct 12, 2023

Hi again @hussein-awala,
Just to give you some additional and final feedback about this:
We ended up patching our production Airflow instances using the proposals from my initial comment, and everything is stable and running at full steam in our production environments. So in the end it was a simple, easy fix. The approach taken in your PR #34700 seems a bit different and more complex.
I am wondering whether all of that is needed, given that our patched production achieved stability… What do you think?

Regarding my previous comment, it seems that those errors were caused by other problems, such as Kubernetes client timeouts in conjunction with the new deployment (with the patch applied) being rolled out while tasks were already running. Once these were resolved, everything has been stable, as mentioned earlier…
