Dynamic DAGs that disappear end up stuck in queued state. #15488

Closed · ashb opened this issue Apr 22, 2021 · 4 comments
Labels: affected_version:2.0 (Issues Reported for 2.0) · area:Scheduler (Scheduler or DAG parsing issues) · kind:bug (This is clearly a bug) · priority:medium (Bug that should be fixed before next release but would not block a release)

ashb (Member) commented Apr 22, 2021

I can observe the same problem with version 2.0.2:

  • Tasks fail because a DAG/task has gone missing (we use dynamically created DAGs, and they can go missing)
  • The scheduler keeps those tasks queued
  • The pool gradually fills up with these queued tasks
  • The whole operation stops because of this behaviour

My current remedy:

  • Manually remove those queued tasks (a sketch of one way to do this follows this comment)

My desired solution:

When a DAG/task goes missing while it is queued, it should end up in a failed state.

Originally posted by @lukas-at-harren in #13542 (comment)
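
For anyone hitting this before a fixed release, here is a minimal sketch of the manual remedy mentioned above (my own illustration, not official Airflow tooling): mark the queued task instances of the vanished DAG as failed directly in the metadata database, which frees their pool slots. It assumes Airflow 2.x; the DAG id is a placeholder for whichever DAG disappeared.

from airflow.models import TaskInstance
from airflow.utils.session import create_session
from airflow.utils.state import State

MISSING_DAG_ID = "example_bash_operator"  # placeholder: the DAG that went missing

with create_session() as session:
    # Fail every queued task instance of the missing DAG so its
    # pool slots are released.
    num = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == MISSING_DAG_ID,
            TaskInstance.state == State.QUEUED,
        )
        .update({TaskInstance.state: State.FAILED}, synchronize_session=False)
    )
print(f"Marked {num} queued task instance(s) of {MISSING_DAG_ID} as failed")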

@ashb ashb added this to the Airflow 2.1 milestone Apr 22, 2021
@ephraimbuddy ephraimbuddy self-assigned this Apr 22, 2021
@alokgarg5

We are facing the same issue. We use the Celery executor; after upgrading to version 2.0.2, tasks do not get executed.

@vikramkoka vikramkoka added the kind:bug (This is clearly a bug), priority:medium (Bug that should be fixed before next release but would not block a release), affected_version:2.0 (Issues Reported for 2.0), and area:Scheduler (Scheduler or DAG parsing issues) labels May 3, 2021
@jhtimmins jhtimmins modified the milestones: Airflow 2.1, Airflow 2.1.1 May 11, 2021
ephraimbuddy (Contributor) commented:

I reproduced this in 2.1.0 by removing a DAG while its tasks were queued, with the default pool set to 9 slots.
Below is the DAG I used; to reproduce the issue, trigger the DAG multiple times and then remove it (a reproduction sketch follows the DAG code).

from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago

args = {
    'owner': 'airflow',
}

with DAG(
    dag_id='example_bash_operator',
    default_args=args,
    schedule_interval='0 0 * * *',
    start_date=days_ago(2),
    dagrun_timeout=timedelta(minutes=60),
    params={"example_key": "example_value"},
) as dag:

    run_this_last = DummyOperator(
        task_id='run_this_last',
    )

    # [START howto_operator_bash]
    run_this = BashOperator(
        task_id='run_after_loop',
        bash_command='echo 1',
    )
    # [END howto_operator_bash]

    run_this >> run_this_last

    for i in range(7):
        task = BashOperator(
            task_id='runme_' + str(i),
            bash_command='echo "{{ task_instance_key_str }}" && sleep 30',
        )
        task >> run_this

    # [START howto_operator_bash_template]
    also_run_this = BashOperator(
        task_id='also_run_this',
        bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
    )
    # [END howto_operator_bash_template]
    also_run_this >> run_this_last

# [START howto_operator_bash_skip]
this_will_skip = BashOperator(
    task_id='this_will_skip',
    bash_command='echo "hello world"; exit 99;',
    dag=dag,
)
# [END howto_operator_bash_skip]
this_will_skip >> run_this_last
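
To make the reproduction steps concrete, here is a rough helper (my own sketch, not part of the original report) that triggers the DAG a few times via the Airflow CLI and then deletes the DAG file, using the file path visible in the logs below:

import os
import subprocess
import time

DAG_ID = "example_bash_operator"
DAG_FILE = "/files/dags/example_bash.py"  # path taken from the scheduler logs

# Trigger several runs so that many task instances end up queued.
for _ in range(3):
    subprocess.run(["airflow", "dags", "trigger", DAG_ID], check=True)
    time.sleep(2)

time.sleep(10)       # give the scheduler time to queue task instances
os.remove(DAG_FILE)  # the DAG now "disappears" while tasks are still queued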

Notice that when you remove the DAG, queued tasks remain queued and the executor fails.
Logs:

[2021-05-27 21:22:48,902] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_4 execution_date=2021-05-26 00:00:00+00:00 exited with status failed for try_number 1
[2021-05-27 21:22:48,902] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_6 execution_date=2021-05-26 00:00:00+00:00 exited with status failed for try_number 1
[2021-05-27 21:22:48,902] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_2 execution_date=2021-05-26 00:00:00+00:00 exited with status failed for try_number 1
[2021-05-27 21:22:48,902] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_3 execution_date=2021-05-26 00:00:00+00:00 exited with status failed for try_number 1
[2021-05-27 21:22:48,902] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.also_run_this execution_date=2021-05-26 00:00:00+00:00 exited with status failed for try_number 1
[2021-05-27 21:22:48,902] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.run_after_loop execution_date=2021-05-25 00:00:00+00:00 exited with status failed for try_number 1
[2021-05-27 21:22:48,902] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_5 execution_date=2021-05-26 00:00:00+00:00 exited with status failed for try_number 1
[2021-05-27 21:22:48,903] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_0 execution_date=2021-05-26 00:00:00+00:00 exited with status success for try_number 1
[2021-05-27 21:22:48,903] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_1 execution_date=2021-05-26 00:00:00+00:00 exited with status success for try_number 1
[2021-05-27 21:22:48,909] {scheduler_job.py:1244} ERROR - Executor reports task instance <TaskInstance: example_bash_operator.run_after_loop 2021-05-25 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2021-05-27 21:22:48,910] {scheduler_job.py:1244} ERROR - Executor reports task instance <TaskInstance: example_bash_operator.runme_2 2021-05-26 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2021-05-27 21:22:48,910] {scheduler_job.py:1244} ERROR - Executor reports task instance <TaskInstance: example_bash_operator.runme_3 2021-05-26 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2021-05-27 21:22:48,910] {scheduler_job.py:1244} ERROR - Executor reports task instance <TaskInstance: example_bash_operator.runme_4 2021-05-26 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2021-05-27 21:22:48,910] {scheduler_job.py:1244} ERROR - Executor reports task instance <TaskInstance: example_bash_operator.runme_5 2021-05-26 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2021-05-27 21:22:48,911] {scheduler_job.py:1244} ERROR - Executor reports task instance <TaskInstance: example_bash_operator.runme_6 2021-05-26 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2021-05-27 21:22:48,911] {scheduler_job.py:1244} ERROR - Executor reports task instance <TaskInstance: example_bash_operator.also_run_this 2021-05-26 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2021-05-27 21:22:48,922] {file_processor_handler.py:123} WARNING - /root/airflow/logs/scheduler/latest already exists as a dir/file. Skip creating symlink.
[2021-05-27 21:22:48,966] {dagbag.py:487} INFO - Filling up the DagBag from /files/dags/example_bash.py
[2021-05-27 21:22:48,968] {local_executor.py:127} ERROR - Failed to execute task dag_id could not be found: example_bash_operator. Either the dag did not exist or it failed to parse..
[2021-05-27 21:22:48,975] {dagbag.py:487} INFO - Filling up the DagBag from /files/dags/example_bash.py
[2021-05-27 21:22:48,978] {local_executor.py:127} ERROR - Failed to execute task dag_id could not be found: example_bash_operator. Either the dag did not exist or it failed to parse..
[2021-05-27 21:22:49,976] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_1 execution_date=2021-05-27 21:22:21.964648+00:00 exited with status failed for try_number 1
[2021-05-27 21:22:49,976] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.this_will_skip execution_date=2021-05-26 00:00:00+00:00 exited with status failed for try_number 1
[2021-05-27 21:22:49,980] {scheduler_job.py:1244} ERROR - Executor reports task instance <TaskInstance: example_bash_operator.this_will_skip 2021-05-26 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2021-05-27 21:22:49,980] {scheduler_job.py:1244} ERROR - Executor reports task instance <TaskInstance: example_bash_operator.runme_1 2021-05-27 21:22:21.964648+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?

[Two screenshots attached in the original issue]

I reproduced this with the LocalExecutor as well; just make sure you trigger the DAG multiple times before removing it.

If you then trigger a different DAG, nothing executes, and you will see this in the log (a sketch for checking the occupied pool slots follows the log):

[2021-05-27 21:27:01,338] {scheduler_job.py:1837} INFO - Resetting orphaned tasks for active dag runs
[2021-05-27 21:27:01,374] {scheduler_job.py:1904} INFO - Reset the following 35 orphaned TaskInstances:
      <TaskInstance: example_bash_operator.runme_0 2021-05-27 21:22:21.964648+00:00 [scheduled]>
      <TaskInstance: example_bash_operator.runme_3 2021-05-27 21:22:23.739773+00:00 [scheduled]>
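
You can confirm that it is the stuck queued task instances that are holding the pool slots. A rough check against the metadata database (again only a sketch, assuming the tasks run in default_pool):

from airflow.models import TaskInstance
from airflow.utils.session import create_session
from airflow.utils.state import State

with create_session() as session:
    # Count queued task instances occupying slots in the default pool.
    queued = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.pool == "default_pool",
            TaskInstance.state == State.QUEUED,
        )
        .count()
    )
print(f"{queued} queued task instance(s) occupying default_pool slots")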

This no longer happens in master because of the fix in #15929.
This is the log in master:

[2021-05-27 21:15:20,964] {local_executor.py:127} ERROR - Failed to execute task dag_id could not be found: example_bash_operator. Either the dag did not exist or it failed to parse..
[2021-05-27 21:15:21,896] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_6 execution_date=2021-05-27 21:14:49.905160+00:00 exited with status failed for try_number 1
[2021-05-27 21:15:21,896] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_5 execution_date=2021-05-27 21:14:49.905160+00:00 exited with status failed for try_number 1
[2021-05-27 21:15:21,896] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_4 execution_date=2021-05-27 21:14:49.905160+00:00 exited with status failed for try_number 1
[2021-05-27 21:15:21,896] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.this_will_skip execution_date=2021-05-27 21:14:47.796058+00:00 exited with status failed for try_number 1
[2021-05-27 21:15:21,897] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_3 execution_date=2021-05-27 21:14:49.905160+00:00 exited with status failed for try_number 1
[2021-05-27 21:15:21,897] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_2 execution_date=2021-05-27 21:14:49.905160+00:00 exited with status failed for try_number 1
[2021-05-27 21:15:21,897] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_0 execution_date=2021-05-27 21:14:49.905160+00:00 exited with status failed for try_number 1
[2021-05-27 21:15:21,897] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.also_run_this execution_date=2021-05-27 21:14:49.905160+00:00 exited with status failed for try_number 1
[2021-05-27 21:15:21,897] {scheduler_job.py:1215} INFO - Executor reports execution of example_bash_operator.runme_1 execution_date=2021-05-27 21:14:49.905160+00:00 exited with status failed for try_number 1
[2021-05-27 21:15:21,903] {scheduler_job.py:1244} ERROR - Executor reports task instance <TaskInstance: example_bash_operator.runme_6 2021-05-27 21:14:49.905160+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2021-05-27 21:15:21,903] {scheduler_job.py:1251} INFO - Setting task instance <TaskInstance: example_bash_operator.runme_6 2021-05-27 21:14:49.905160+00:00 [queued]> state to failed as reported by executor

[Screenshot attached in the original issue]

And the executor continues to execute other DAGs when triggered.
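
For context, the difference boils down to what the scheduler does when the executor reports a failure for a task instance it still considers queued. A paraphrased sketch of the new behaviour (my own simplification, not the actual Airflow source):

def handle_executor_report(ti, executor_state):
    # ti: a TaskInstance-like object with a `state` attribute;
    # executor_state: the terminal state reported by the executor.
    if executor_state == "failed" and ti.state == "queued":
        # Before the fix, the scheduler only logged the "Was the task
        # killed externally?" error here, leaving the task instance
        # queued and its pool slot occupied indefinitely. With the fix
        # it fails the task instance, as the master log above shows.
        ti.state = "failed"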

kaxil (Member) commented May 28, 2021

@lukas-at-harren @alokgarg5 Can you try it with master as suggested by @ephraimbuddy and report back if possible?

kaxil (Member) commented Jul 27, 2021

Closed by #15929 and will be released in 2.1.3

@kaxil kaxil closed this as completed Jul 27, 2021