Apache Airflow version
2.2.3
What happened
We are using Airflow with the KubernetesExecutor. During the execution of a job, the Airflow worker pod appears to have been restarted or terminated, which caused the running job to be marked as failed with a SIGTERM error.
Below is the Airflow task log:
[2022-03-25, 19:09:45 IST] {local_task_job.py:82} ERROR - Received SIGTERM. Terminating subprocesses
[2022-03-25, 19:09:45 IST] {process_utils.py:120} INFO - Sending Signals.SIGTERM to group 121. PIDs of all processes in the group: [122, 121]
[2022-03-25, 19:09:45 IST] {process_utils.py:75} INFO - Sending the signal Signals.SIGTERM to group 121
[2022-03-25, 19:09:45 IST] {taskinstance.py:1408} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-03-25, 19:09:45 IST] {spark_submit.py:623} INFO - Sending kill signal to spark-submit
[2022-03-25, 19:09:45 IST] {taskinstance.py:1700} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1329, in _run_raw_task
self._execute_task_with_callbacks(context)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1455, in _execute_task_with_callbacks
result = self._execute_task(context, self.task)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1511, in _execute_task
result = execute_callable(context=context)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 157, in execute
self._hook.submit(self._application)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 407, in submit
self._process_spark_submit_log(iter(self._submit_sp.stdout)) # type: ignore
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 456, in _process_spark_submit_log
for line in itr:
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1410, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2022-03-25, 19:09:45 IST] {taskinstance.py:1267} INFO - Marking task as FAILED. dag_id=kda_create_model_alpha, task_id=create_model, execution_date=20220325T124433, start_date=20220325T124451, end_date=20220325T133945
[2022-03-25, 19:09:46 IST] {standard_task_runner.py:89} ERROR - Failed to execute job 2451 for task create_model
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/task/task_runner/standard_task_runner.py", line 85, in _start_by_fork
args.func(args, dag=self.dag)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 92, in wrapper
return f(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 298, in task_run
_run_task_by_selected_method(args, dag, ti)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 107, in _run_task_by_selected_method
_run_raw_task(args, ti)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 180, in _run_raw_task
ti._run_raw_task(
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/session.py", line 70, in wrapper
return func(*args, session=session, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1329, in _run_raw_task
self._execute_task_with_callbacks(context)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1455, in _execute_task_with_callbacks
result = self._execute_task(context, self.task)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1511, in _execute_task
result = execute_callable(context=context)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 157, in execute
self._hook.submit(self._application)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 407, in submit
self._process_spark_submit_log(iter(self._submit_sp.stdout)) # type: ignore
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 456, in _process_spark_submit_log
for line in itr:
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1410, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
Below is the log within the Airflow worker pod:
Running <TaskInstance: b8455e69-ad99-4721-8a40-f0a7fe877389_623db928e9c8b434fa742404_24c566dc-f77a-4606-b38a-3f33f9199819 [queued]> on host 1a1606ebb2314870b3b2bea7daf32547
Below is the log from the scheduler pod around the same time:
Fast evaluation: node ip-XX-XX-XX-XXX.ec2.internal cannot be removed: airflow/1a1606ebb2314870b3b2bea7daf32547 is not replicated
Running <TaskInstance: b8455e69-ad99-4721-8a40-f0a7fe877389_623db928e9c8b434fa742404_24c566dc-f77a-4606-b38a-3f33f9199819 [queued]> on host 1a1606ebb2314870b3b2bea7daf32547
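For reference, the failing task is a SparkSubmitOperator (per the traceback above). A minimal sketch of how such a task is typically defined follows; the application path and Spark connection id are assumptions, since the actual DAG code is not included in this report:

# Minimal sketch of the kind of task that produced the traceback above.
# The application path and conn_id are assumed; the real DAG is not shown here.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="kda_create_model_alpha",
    start_date=datetime(2022, 3, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_model = SparkSubmitOperator(
        task_id="create_model",
        application="/path/to/create_model.py",  # assumed application path
        conn_id="spark_default",                 # assumed Spark connection id
    )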
What you think should happen instead
The worker pod should not have been terminated or restarted until the job completed.
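If the restart was driven by cluster-autoscaler node scale-down (the "Fast evaluation" line in the scheduler-side log above suggests the autoscaler was at least evaluating that node), one possible workaround is to mark the worker pod as not safe to evict via executor_config. This is only a sketch under that assumption, not a confirmed fix:

# Sketch of a possible workaround, assuming the pod was terminated by
# cluster-autoscaler scale-down (not confirmed in this report).
from kubernetes.client import models as k8s

keep_worker_pod = {
    "pod_override": k8s.V1Pod(
        metadata=k8s.V1ObjectMeta(
            annotations={
                # Standard cluster-autoscaler annotation: do not evict this pod
                # when considering its node for scale-down.
                "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
            }
        )
    )
}

# The task sketched above would then be defined with executor_config=keep_worker_pod.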
How to reproduce
No response
Operating System
Debian GNU/Linux
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct