
Scheduler crashlooping when dag with task_concurrency is deleted #20099

Closed
1 of 2 tasks
nclaeys opened this issue Dec 7, 2021 · 3 comments · Fixed by #20349
Assignees
Labels
area:core kind:bug This is clearly a bug

Comments

@nclaeys
Contributor

nclaeys commented Dec 7, 2021

Apache Airflow version

2.2.2 (latest released)

What happened

After deleting the dag, the scheduler starts crash-looping and cannot recover. This means an issue with a single dag brings down the whole environment.

The stacktrace is as follows:

```
airflow-scheduler [2021-12-07 09:30:07,483] {kubernetes_executor.py:791} INFO - Shutting down Kubernetes executor
airflow-scheduler [2021-12-07 09:30:08,509] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 1472
airflow-scheduler [2021-12-07 09:30:08,681] {process_utils.py:66} INFO - Process psutil.Process(pid=1472, status='terminated', exitcode=0, started='09:28:37') (1472) terminated with exit
airflow-scheduler [2021-12-07 09:30:08,681] {scheduler_job.py:655} INFO - Exited execute loop
airflow-scheduler Traceback (most recent call last):
airflow-scheduler   File "/home/airflow/.local/bin/airflow", line 8, in <module>
airflow-scheduler     sys.exit(main())
airflow-scheduler   File "/home/airflow/.local/lib/python3.9/site-packages/airflow/__main__.py", line 48, in main
airflow-scheduler     args.func(args)
airflow-scheduler   File "/home/airflow/.local/lib/python3.9/site-packages/airflow/cli/cli_parser.py", line 48, in command
airflow-scheduler     return func(*args, **kwargs)
airflow-scheduler   File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/cli.py", line 92, in wrapper
airflow-scheduler     return f(*args, **kwargs)
airflow-scheduler   File "/home/airflow/.local/lib/python3.9/site-packages/airflow/cli/commands/scheduler_command.py", line 75, in scheduler
airflow-scheduler     _run_scheduler_job(args=args)
airflow-scheduler   File "/home/airflow/.local/lib/python3.9/site-packages/airflow/cli/commands/scheduler_command.py", line 46, in _run_scheduler_job
airflow-scheduler     job.run()
airflow-scheduler   File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/base_job.py", line 245, in run
airflow-scheduler     self._execute()
airflow-scheduler   File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 628, in _execute
airflow-scheduler     self._run_scheduler_loop()
airflow-scheduler   File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 709, in _run_scheduler_loop
airflow-scheduler     num_queued_tis = self._do_scheduling(session)
airflow-scheduler   File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 820, in _do_scheduling
airflow-scheduler     num_queued_tis = self._critical_section_execute_task_instances(session=session)
airflow-scheduler   File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 483, in _critical_section_execute_task_instances
airflow-scheduler     queued_tis = self._executable_task_instances_to_queued(max_tis, session=session)
airflow-scheduler   File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/session.py", line 67, in wrapper
airflow-scheduler     return func(*args, **kwargs)
airflow-scheduler   File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 366, in _executable_task_instances_to_queued
airflow-scheduler     if serialized_dag.has_task(task_instance.task_id):
airflow-scheduler AttributeError: 'NoneType' object has no attribute 'has_task'
```
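The crash boils down to calling a method on `None`: once the dag is deleted, the lookup for its serialized representation returns `None`, and the subsequent `has_task` call raises. A minimal illustration of the failure mode (standalone, not Airflow internals):

```python
# What the scheduler effectively does once the dag has been deleted:
# the serialized-dag lookup returns None, but has_task is still called on it.
serialized_dag = None

try:
    serialized_dag.has_task("sample")
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'has_task'
```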

What you expected to happen

I expect the scheduler not to crash because a dag was deleted. The bigger issue, however, is that the whole environment goes down: it would be acceptable for the scheduler to have problems with that one dag (it is deleted, after all), but this should not affect all other dags in the environment.

How to reproduce

  1. I created the following dag:

```python
from datetime import datetime, timedelta

from airflow import DAG
from datafy.operators import DatafyContainerOperatorV2

default_args = {
    "owner": "Datafy",
    "depends_on_past": False,
    "start_date": datetime(year=2021, month=12, day=1),
    "task_concurrency": 4,
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "testnielsdev",
    default_args=default_args,
    max_active_runs=default_args["task_concurrency"] + 1,
    schedule_interval="0 1 * * *",
)

DatafyContainerOperatorV2(
    dag=dag,
    task_id="sample",
    cmds=["python"],
    arguments=["-m", "testnielsdev.sample", "--date", "{{ ds }}", "--env", "{{ macros.datafy.env() }}"],
    instance_type="mx_small",
    instance_life_cycle="spot",
)
```
Looking at the Airflow code, the important setting here, apart from the defaults, is `task_concurrency`.
2. I enable the dag.
3. I delete it. When the file is removed, the scheduler starts crash-looping.

Operating System

We use the default Airflow Docker image.

Versions of Apache Airflow Providers

Not relevant

Deployment

Other Docker-based deployment

Deployment details

Not relevant

Anything else

It occurred at one of our customers, and I was quickly able to reproduce the issue.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@nclaeys nclaeys added area:core kind:bug This is clearly a bug labels Dec 7, 2021
@boring-cyborg

boring-cyborg bot commented Dec 7, 2021

Thanks for opening your first issue here! Be sure to follow the issue template!

@nclaeys nclaeys changed the title Scheduler crashlooping when dag is deleted Scheduler crashlooping when dag with task_concurrency is deleted Dec 7, 2021
@sushi30

sushi30 commented Dec 8, 2021

How did you solve this instance? I am having the same issue but I cannot trace which DAG was deleted.

@nclaeys
Contributor Author

nclaeys commented Dec 8, 2021

We knew which dag had been deleted; re-adding the dag, disabling it, and then deleting it again did the trick for us.

@ephraimbuddy ephraimbuddy self-assigned this Dec 16, 2021
kaxil pushed a commit that referenced this issue Jan 13, 2022

When executing task instances, we do not check if the dag is missing in
the dagbag. This PR fixes it by ignoring task instances if we can't find
the dag in serialized dag table

Closes: #20099
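The guard described in the commit message can be sketched with simplified stand-ins (assumption: `SerializedDag` and `is_task_schedulable` below are illustrative names, not actual Airflow internals): check for a missing serialized dag before calling `has_task`, and skip the task instance instead of crashing.

```python
from typing import Iterable, Optional


class SerializedDag:
    """Minimal stand-in for a row in the serialized dag table (assumption)."""

    def __init__(self, task_ids: Iterable[str]):
        self._task_ids = set(task_ids)

    def has_task(self, task_id: str) -> bool:
        return task_id in self._task_ids


def is_task_schedulable(serialized_dag: Optional[SerializedDag], task_id: str) -> bool:
    # Before the fix: serialized_dag.has_task(...) raised AttributeError when
    # the dag had been deleted (the lookup returned None).
    # After the fix: a missing dag means the task instance is ignored.
    if serialized_dag is None:
        return False
    return serialized_dag.has_task(task_id)
```

With this guard, a task instance whose dag no longer exists is simply not queued, and scheduling of all other dags continues.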
ephraimbuddy added a commit that referenced this issue Jan 31, 2022

(cherry picked from commit 9871576)
jedcunningham pushed a commit that referenced this issue Feb 10, 2022

(cherry picked from commit 9871576)
jedcunningham pushed a commit that referenced this issue Feb 17, 2022

(cherry picked from commit 9871576)