Tasks are being relaunched before the first task finishes its execution #28853
-
We can't see the DAG, and it's hard to reason about this without it. I believe you might have made a mistake on the dag_id/task_id, but it's hard to tell.
-
I had (and still have) a similar problem. In my case it was a database connection issue: if the connection is down for more than 50 seconds, Airflow decides the task's database connection is broken and restarts the task. Then the database comes back and the system ends up with two tasks running in parallel. I increased stale_dag_threshold, dag_file_processor_timeout, and scheduler_health_check_threshold in airflow.cfg, but I don't know whether that is a proper solution; IMHO it's a dirty workaround.
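For reference, a minimal sketch of the airflow.cfg tweaks described above, assuming Airflow 2.x. The 120-second values are illustrative only, not recommendations, and the section each option lives in can differ between Airflow versions, so check the configuration reference for your release:

```ini
# airflow.cfg -- illustrative values, tune for your deployment

[core]
# How long the DAG file processor may run before it is killed (default is 50s,
# which matches the ~50-second database outage window described above)
dag_file_processor_timeout = 120

[scheduler]
# How long without a scheduler heartbeat before the scheduler is
# considered unhealthy
scheduler_health_check_threshold = 120
# How long since a DAG was last parsed before it is treated as stale
stale_dag_threshold = 120
```

Note that raising these thresholds only widens the window Airflow tolerates before declaring things broken; it does not fix the underlying database connectivity problem.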
-
I've also had a similar issue, with scheduled tasks being marked as zombies even though the pod (KubernetesPodOperator) was actually still running fine. In my case, scaling up the cluster helped, especially worker CPU and memory. Running on MWAA.
-
Hello,
Apache Airflow version
The versions we currently run are Airflow 2.5.0 and PostgreSQL 15.1.
What happened
The problem we are encountering is the following: for some unknown reason, a task retry is being launched before the first attempt has finished, either successfully or unsuccessfully.
Consider the following example:
We have 4 retries:
The last logs from attempt 1 are at 04:42:02, and it finished correctly.
TIME: 04:42:02 when it finished
By that point the task had already been relaunched (retry 2), without waiting for attempt 1 to finish. As you can see, the second retry was launched before the first attempt ended.
Here are the logs from retry 2:
TIME: 04:41:12 when it started
The same thing happens between retries 1 and 2, 2 and 3, and 3 and 4: every retry starts before the previous attempt has finished.
We encountered this problem after upgrading Airflow from 2.3.0 to 2.5.0 and PostgreSQL from 13.0 to 15.1. In addition, due to external constraints, we migrated our infrastructure from Kubernetes to OpenShift.
How to reproduce
It occurs randomly in the executions of our tasks; we do not observe a logical pattern.
Deployment details
What you think should happen instead
Airflow should wait for the first attempt to complete, successfully or unsuccessfully, before launching a retry.
Versions of Apache Airflow Providers
Also note that for all these jobs we are using the operator from 'airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator'.
Are you willing to submit PR?
Yes I am willing to submit a PR!