Airflow worker pod got terminated causing the jobs to be marked as Failed #22528

@bipin2295

Apache Airflow version

2.2.3

What happened

We are running Airflow with the KubernetesExecutor. While a job was executing, the Airflow worker pod appears to have been restarted or terminated, which caused the running task to be marked as failed with a SIGTERM error.

Below is the task log shown in Airflow:

[2022-03-25, 19:09:45 IST] {local_task_job.py:82} ERROR - Received SIGTERM. Terminating subprocesses
[2022-03-25, 19:09:45 IST] {process_utils.py:120} INFO - Sending Signals.SIGTERM to group 121. PIDs of all processes in the group: [122, 121]
[2022-03-25, 19:09:45 IST] {process_utils.py:75} INFO - Sending the signal Signals.SIGTERM to group 121
[2022-03-25, 19:09:45 IST] {taskinstance.py:1408} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-03-25, 19:09:45 IST] {spark_submit.py:623} INFO - Sending kill signal to spark-submit
[2022-03-25, 19:09:45 IST] {taskinstance.py:1700} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1329, in _run_raw_task
    self._execute_task_with_callbacks(context)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1455, in _execute_task_with_callbacks
    result = self._execute_task(context, self.task)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1511, in _execute_task
    result = execute_callable(context=context)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 157, in execute
    self._hook.submit(self._application)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 407, in submit
    self._process_spark_submit_log(iter(self._submit_sp.stdout))  # type: ignore
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 456, in _process_spark_submit_log
    for line in itr:
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1410, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2022-03-25, 19:09:45 IST] {taskinstance.py:1267} INFO - Marking task as FAILED. dag_id=kda_create_model_alpha, task_id=create_model, execution_date=20220325T124433, start_date=20220325T124451, end_date=20220325T133945
[2022-03-25, 19:09:46 IST] {standard_task_runner.py:89} ERROR - Failed to execute job 2451 for task create_model
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/task/task_runner/standard_task_runner.py", line 85, in _start_by_fork
    args.func(args, dag=self.dag)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 92, in wrapper
    return f(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 298, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 107, in _run_task_by_selected_method
    _run_raw_task(args, ti)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 180, in _run_raw_task
    ti._run_raw_task(
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/session.py", line 70, in wrapper
    return func(*args, session=session, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1329, in _run_raw_task
    self._execute_task_with_callbacks(context)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1455, in _execute_task_with_callbacks
    result = self._execute_task(context, self.task)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1511, in _execute_task
    result = execute_callable(context=context)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 157, in execute
    self._hook.submit(self._application)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 407, in submit
    self._process_spark_submit_log(iter(self._submit_sp.stdout))  # type: ignore
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 456, in _process_spark_submit_log
    for line in itr:
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1410, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
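
Both tracebacks end in `signal_handler`, raised from inside the `for line in itr:` loop. The following is a minimal sketch of that mechanism, not Airflow's actual code: a SIGTERM handler that raises an exception interrupts whatever the main thread is doing at that moment. The names `TaskTerminated` and `run_task` are hypothetical, and the process sends SIGTERM to itself to stand in for the pod being shut down.

```python
import signal
import time


class TaskTerminated(Exception):
    """Hypothetical stand-in for airflow.exceptions.AirflowException."""


def signal_handler(signum, frame):
    # Raising here interrupts whatever the main thread is executing.
    raise TaskTerminated("Task received SIGTERM signal")


def run_task(on_start):
    """Run a long loop under a SIGTERM handler and report how it ended."""
    previous = signal.signal(signal.SIGTERM, signal_handler)
    try:
        on_start()
        # Stands in for the blocking `for line in itr:` loop in the traceback.
        for _ in range(500):
            time.sleep(0.01)
        return "success"
    except TaskTerminated as exc:
        return f"failed: {exc}"
    finally:
        # Restore the handler that was installed before this run.
        signal.signal(signal.SIGTERM, previous)


# Simulate the pod shutdown by raising SIGTERM in this process (Python 3.8+).
result = run_task(lambda: signal.raise_signal(signal.SIGTERM))
print(result)  # prints "failed: Task received SIGTERM signal"
```

The sketch must run in the main thread, since Python only allows signal handlers to be installed there. It illustrates why the task fails at the exact line it was executing when the signal arrived, regardless of how far the Spark job had progressed.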

Below is the log within the Airflow worker pod:

Running <TaskInstance: b8455e69-ad99-4721-8a40-f0a7fe877389_623db928e9c8b434fa742404_24c566dc-f77a-4606-b38a-3f33f9199819 [queued]> on host 1a1606ebb2314870b3b2bea7daf32547

Below is the log within the scheduler pod at that time:

Fast evaluation: node ip-XX-XX-XX-XXX.ec2.internal cannot be removed: airflow/1a1606ebb2314870b3b2bea7daf32547 is not replicated


Running <TaskInstance:  b8455e69-ad99-4721-8a40-f0a7fe877389_623db928e9c8b434fa742404_24c566dc-f77a-4606-b38a-3f33f9199819 [queued]> on host 1a1606ebb2314870b3b2bea7daf32547 

What you think should happen instead

The worker pod should not be terminated or restarted until the job completes.

How to reproduce

No response

Operating System

Debian GNU/Linux

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
