Airflow 2.1.0 doesn't retry a task if it externally killed #16285

Closed
waleedsamy opened this issue Jun 6, 2021 · 1 comment · Fixed by #16301
Labels
  • affected_version:2.1 (Issues Reported for 2.1)
  • kind:bug (This is clearly a bug)
  • priority:high (High priority bug that should be patched quickly but does not require immediate new release)

Milestone
Airflow 2.1.3

Comments

waleedsamy commented Jun 6, 2021

Apache Airflow version: 2.1.0

Kubernetes version (if you are using kubernetes) (use kubectl version):

Environment:

  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.5 LTS
  • Kernel (e.g. uname -a): Linux 4.15.0-143-generic #147-Ubuntu SMP Wed Apr 14 16:10:11 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: pip
  • Others:

What happened:
When a task gets externally killed, it is marked as FAILED even though it could have been retried.

What you expected to happen:
When a task gets externally killed (kill -9 <pid>), it should be put back up for retry if its retries have not yet run out.
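
To make the expectation concrete, here is a minimal sketch of the retry budget as I understand it (illustrative names only, not Airflow's implementation):

# Illustration only: the state an externally killed task should end up in,
# given how many attempts have already run and the configured retries.
def expected_state(attempts_so_far: int, retries: int) -> str:
    total_allowed_attempts = retries + 1  # the first run plus `retries` retries
    if attempts_so_far < total_allowed_attempts:
        return "up_for_retry"  # budget left: schedule another attempt
    return "failed"            # budget exhausted: terminal failure

# With retries=10, a SIGKILL on the 6th attempt should lead to another attempt:
assert expected_state(6, 10) == "up_for_retry"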

How to reproduce it:
I'm using Celery as the executor and I have a cluster of ~250 machines.
I have a task defined as follows. When the task starts to execute and it gets killed externally by sending SIGKILL to it (or to the executor process and its children), it gets marked as FAILED and is not put up for retry (even though retries is set to 10). A sketch of how the signal can be sent is shown after the snippet.

import time

from airflow.operators.python import PythonOperator

def _task1(ts_nodash, dag_run, ti, **context):
    # Sleep long enough that the running process can be killed from outside.
    time.sleep(300)

task1 = PythonOperator(
    task_id='task1',
    python_callable=_task1,
    retries=10,
    dag=dag1,  # dag1 is the DAG object defined elsewhere in the file
)
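
For the kill itself, something like the following on the worker should do it (a sketch; the pgrep pattern is only an example and depends on the DAG and task being reproduced):

import os
import signal
import subprocess

# Find the raw task process on the Celery worker and send SIGKILL to it.
# The pattern is an example; adjust it to the task you are reproducing with.
pattern = "airflow tasks run convert_manager download_rtv_file"
pids = subprocess.run(
    ["pgrep", "-f", pattern], capture_output=True, text=True
).stdout.split()

for pid in pids:
    os.kill(int(pid), signal.SIGKILL)  # equivalent to `kill -9 <pid>`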

Anything else we need to know: This bug was introduced by #15537, as far as I know.

Below is the task log after sending SIGKILL to it.

[2021-06-06 18:50:07,897] {taskinstance.py:876} INFO - Dependencies all met for <TaskInstance: convert_manager.download_rtv_file 2021-06-06T11:26:37+00:00 [queued]>
[2021-06-06 18:50:07,916] {taskinstance.py:876} INFO - Dependencies all met for <TaskInstance: convert_manager.download_rtv_file 2021-06-06T11:26:37+00:00 [queued]>
[2021-06-06 18:50:07,918] {taskinstance.py:1067} INFO - 
--------------------------------------------------------------------------------
[2021-06-06 18:50:07,919] {taskinstance.py:1068} INFO - Starting attempt 6 of 16
[2021-06-06 18:50:07,921] {taskinstance.py:1069} INFO - 
--------------------------------------------------------------------------------
[2021-06-06 18:50:07,930] {taskinstance.py:1087} INFO - Executing <Task(PythonOperator): download_rtv_file> on 2021-06-06T11:26:37+00:00
[2021-06-06 18:50:07,937] {standard_task_runner.py:52} INFO - Started process 267 to run task
[2021-06-06 18:50:07,942] {standard_task_runner.py:76} INFO - Running: ['airflow', 'tasks', 'run', 'convert_manager', 'download_rtv_file', '2021-06-06T11:26:37+00:00', '--job-id', '75', '--pool', 'lane_xs', '--raw', '--subdir', 'DAGS_FOLDER/convert_manager.py', '--cfg-path', '/tmp/tmp35oxqliw', '--error-file', '/tmp/tmp3eme_cq7']
[2021-06-06 18:50:07,948] {standard_task_runner.py:77} INFO - Job 75: Subtask download_rtv_file
[2021-06-06 18:50:07,999] {logging_mixin.py:104} INFO - Running <TaskInstance: convert_manager.download_rtv_file 2021-06-06T11:26:37+00:00 [running]> on host 172.29.29.11
[2021-06-06 18:50:08,052] {taskinstance.py:1282} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=traffics
AIRFLOW_CTX_DAG_ID=convert_manager
AIRFLOW_CTX_TASK_ID=download_rtv_file
AIRFLOW_CTX_EXECUTION_DATE=2021-06-06T11:26:37+00:00
AIRFLOW_CTX_DAG_RUN_ID=dev_triggered_lane_31_itf-30_201208213_run_2021-06-06T13:26:37.135821+02:00
[2021-06-06 18:50:08,087] {convert_manager.py:377} INFO - downloading to /var/spool/central/airflow/data/ftp/***/ITF_RTV.xml.zip/rtv/ITF_RTV.xml.zip_20210606184921
[2021-06-06 18:50:08,094] {ftp.py:187} INFO - Retrieving file from FTP: /rtv/ITF_RTV.xml.zip
[2021-06-06 18:50:38,699] {local_task_job.py:151} INFO - Task exited with return code Negsignal.SIGKILL
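
To confirm what ended up recorded for the task instance, something along these lines should work (a sketch, assuming the airflow CLI is available on a machine that can reach the metadata DB):

import subprocess

# Query the recorded state of the killed task instance via the Airflow CLI.
# With retries remaining the expected output is "up_for_retry"; on 2.1.0 it
# comes back as "failed" instead.
result = subprocess.run(
    ["airflow", "tasks", "state",
     "convert_manager", "download_rtv_file", "2021-06-06T11:26:37+00:00"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
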
waleedsamy added the kind:bug label on Jun 6, 2021
waleedsamy (Author) commented:

CC @ashb @ephraimbuddy

waleedsamy changed the title from "Airflow >= 2.3 doesn't retry a task if it eternally killed" to "Airflow >= 2.3 doesn't retry a task if it externally killed" on Jun 6, 2021
waleedsamy changed the title from "Airflow >= 2.3 doesn't retry a task if it externally killed" to "Airflow >= 2.0.3 doesn't retry a task if it externally killed" on Jun 6, 2021
ashb added this to the Airflow 2.1.1 milestone on Jun 6, 2021
ashb added the priority:high label on Jun 6, 2021
waleedsamy changed the title from "Airflow >= 2.0.3 doesn't retry a task if it externally killed" to "Airflow 2.1.0 doesn't retry a task if it externally killed" on Jun 6, 2021
eladkal added the affected_version:2.1 label on Jun 7, 2021
ashb changed the milestone from Airflow 2.1.1 to Airflow 2.1.2 on Jun 22, 2021
ashb changed the milestone from Airflow 2.1.2 to Airflow 2.1.3 on Jul 7, 2021