
Tasks marked as "UP_FOR_RESCHEDULE" get stuck in Executor.running and never reschedule #25728

Closed · notatallshaw-gts opened this issue Aug 15, 2022 · 6 comments

Labels: area:core, kind:bug


notatallshaw-gts commented Aug 15, 2022

Apache Airflow version

2.3.3

What happened

Upon upgrading from Airflow 2.1.3 to Airflow 2.3.3, we have an issue with our sensors that have mode='reschedule'. Using TimeSensor as an example:

  1. It executes as normal on the first run
  2. It detects that it is not the correct time yet and marks itself "UP_FOR_RESCHEDULE" (usually to be rescheduled 5 minutes in the future)
  3. When the time comes to be rescheduled, it just gets marked as "QUEUED" and never actually runs again, with this error in the log:
    [2022-08-15 00:01:11,027] {base_executor.py:215} ERROR - could not queue task TaskInstanceKey(dag_id='TestDAG', task_id='testTASK', run_id='scheduled__2022-08-12T04:00:00+00:00', try_number=1, map_index=-1) (still running after 4 attempts)

Looking at the relevant code (https://github.com/apache/airflow/blob/2.3.3/airflow/executors/base_executor.py#L215), it seems the task key was never removed from self.running after the task initially rescheduled itself.
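A simplified sketch of the kind of guard that produces the error above (a paraphrase for illustration only, not the actual Airflow source): if a task's key is still present in the executor's running set when the scheduler tries to queue it again, the executor refuses, and after a few attempts logs the "still running" error.

```python
# Illustrative paraphrase only -- names and structure are simplified,
# not copied from airflow/executors/base_executor.py.
MAX_ATTEMPTS = 4

class ExecutorSketch:
    def __init__(self):
        self.running = set()    # keys of task instances believed to be running
        self.queued_tasks = {}  # keys waiting to be handed to the workers
        self.attempts = {}      # re-queue attempts for keys still marked as running

    def queue_task(self, key, command):
        if key in self.running:
            # If the key was never removed after the sensor rescheduled itself,
            # every re-queue attempt lands here until the attempt budget runs out.
            self.attempts[key] = self.attempts.get(key, 0) + 1
            if self.attempts[key] >= MAX_ATTEMPTS:
                print(f"ERROR - could not queue task {key} "
                      f"(still running after {self.attempts[key]} attempts)")
            return
        self.queued_tasks[key] = command
```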

What you think should happen instead

Rescheduled tasks should actually be re-queued and run again when their reschedule time arrives.

How to reproduce

  1. Airflow 2.3.3 from Docker
  2. Celery 5.2.7 with a Redis backend
  3. MySQL 8
  4. Airflow timezone set to America/New_York
  5. Have a normal (non-async) sensor that uses mode='reschedule' and needs to reschedule itself (a minimal example is sketched below)
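A minimal DAG along these lines (a hedged sketch; the DAG/task IDs, target time, and poke interval are illustrative values, not taken from the report):

```python
# Minimal reschedule-mode sensor sketch for Airflow 2.3.x.
from datetime import datetime, time

from airflow import DAG
from airflow.sensors.time_sensor import TimeSensor

with DAG(
    dag_id="reschedule_sensor_repro",
    start_date=datetime(2022, 8, 12),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_until_evening = TimeSensor(
        task_id="wait_until_evening",
        target_time=time(18, 0),   # sensor keeps rescheduling until 18:00
        mode="reschedule",         # free the worker slot between pokes
        poke_interval=300,         # re-check every 5 minutes
    )
```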

Operating System

Fedora 29

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

No response

Anything else

The symptoms described in this discussion sound the same, but no one has replied to it yet: #25651

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

notatallshaw-gts added the area:core and kind:bug labels on Aug 15, 2022

boring-cyborg bot commented Aug 15, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

notatallshaw-gts (author) commented:

It appears this was never an issue before 2.3.0 because the CeleryExecutor implemented its own trigger_tasks logic, until this PR landed: #23016

notatallshaw-gts (author) commented:

Enabling debug logs, I see something very interesting: on Airflow 2.1.3 I see the "Changing state:" debug message quite often (https://github.com/apache/airflow/blob/2.1.3/airflow/executors/base_executor.py#L198), but I never see the equivalent message in the Airflow 2.3.3 debug logs, even though it is still there: https://github.com/apache/airflow/blob/2.3.3/airflow/executors/base_executor.py#L238

Similarly, on Airflow 2.1.3 I often see the "running task instances" debug message go down to 0, but on Airflow 2.3.3 I never see it go down to 0.

@potiuk @malthe sorry to ping you directly, but I'm really starting to think this is a bug in the change to the CeleryExecutor rather than just our environment being broken. Are there any hints you can give that would help us pin down what the problem might be?

In the meantime I am going to try to reproduce the issue at home so I can post a reproducible example here that others can follow.


malthe commented Aug 17, 2022

It would be useful to have logs to see what exactly is going on.

There's more background on the original change in this issue: #21316.


notatallshaw-gts commented Aug 17, 2022

Thanks, I already read through that. I'll see what I can do about the logs (they're big, so I'll need to cut them down to the relevant part, and I'd also need management sign-off; if I'm able to reproduce this outside my company, it will make the process a lot simpler).

notatallshaw-gts (author) commented:

Looks like it was our fault!

It seems the issue was that our scheduler's Celery result backend was pointing to a different database than our workers' Celery result backend 🤦‍♂️.

Thanks for responding earlier, sorry it was on our side.
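For anyone debugging a similar mismatch, a quick sanity check is to confirm that the scheduler and every Celery worker resolve the same broker and result backend. A hedged sketch (run the same snippet on each host and compare the output; the print format is illustrative):

```python
# Print the Celery broker and result backend this Airflow installation resolves.
# Run on the scheduler host and on each worker host -- the values should match.
from airflow.configuration import conf

print("broker_url     =", conf.get("celery", "broker_url"))
print("result_backend =", conf.get("celery", "result_backend"))
```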
