
Tasks marked as "UP_FOR_RESCHEDULE" get stuck in Executor.running and never reschedule #25728

Closed · notatallshaw-gts opened this issue Aug 15, 2022 · 6 comments

Labels: area:core, kind:bug


notatallshaw-gts commented Aug 15, 2022

Apache Airflow version

2.3.3

What happened

Upon upgrading from Airflow 2.1.3 to Airflow 2.3.3, we have an issue with our sensors that have mode='reschedule'. Using TimeSensor as an example:

  1. It executes as normal on the first run
  2. It detects that it is not the correct time yet and marks itself "UP_FOR_RESCHEDULE" (usually to be rescheduled 5 minutes in the future)
  3. When the time comes to be rescheduled, it just gets marked as "QUEUED" and never actually runs again, with this error in the log:
    [2022-08-15 00:01:11,027] {base_executor.py:215} ERROR - could not queue task TaskInstanceKey(dag_id='TestDAG', task_id='testTASK', run_id='scheduled__2022-08-12T04:00:00+00:00', try_number=1, map_index=-1) (still running after 4 attempts)

Looking at the relevant code (https://github.com/apache/airflow/blob/2.3.3/airflow/executors/base_executor.py#L215), it seems the task key was never removed from self.running after the task initially rescheduled itself.
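A simplified sketch of the kind of guard that produces the error above (a paraphrase for illustration only, not the actual Airflow source): if a task's key is still present in the executor's running set when the scheduler tries to queue it again, the executor refuses, and after a few attempts logs the "still running" error.

```python
# Illustrative paraphrase only -- names and structure are simplified,
# not copied from airflow/executors/base_executor.py.
MAX_ATTEMPTS = 4

class ExecutorSketch:
    def __init__(self):
        self.running = set()    # keys of task instances believed to be running
        self.queued_tasks = {}  # keys waiting to be handed to the workers
        self.attempts = {}      # re-queue attempts for keys still marked as running

    def queue_task(self, key, command):
        if key in self.running:
            # If the key was never removed after the sensor rescheduled itself,
            # every re-queue attempt lands here until the attempt budget runs out.
            self.attempts[key] = self.attempts.get(key, 0) + 1
            if self.attempts[key] >= MAX_ATTEMPTS:
                print(f"ERROR - could not queue task {key} "
                      f"(still running after {self.attempts[key]} attempts)")
            return
        self.queued_tasks[key] = command
```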

What you think should happen instead

Rescheduled tasks should actually be re-queued and run again when their reschedule time arrives.

How to reproduce

  1. Airflow 2.3.3 from Docker
  2. Celery 5.2.7 with a Redis backend
  3. MySQL 8
  4. Airflow timezone set to America/New_York
  5. Have a normal (non-async) sensor that uses mode='reschedule' and needs to reschedule itself (a minimal example is sketched below)
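A minimal DAG along these lines (a hedged sketch; the DAG/task IDs, target time, and poke interval are illustrative values, not taken from the report):

```python
# Minimal reschedule-mode sensor sketch for Airflow 2.3.x.
from datetime import datetime, time

from airflow import DAG
from airflow.sensors.time_sensor import TimeSensor

with DAG(
    dag_id="reschedule_sensor_repro",
    start_date=datetime(2022, 8, 12),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_until_evening = TimeSensor(
        task_id="wait_until_evening",
        target_time=time(18, 0),   # sensor keeps rescheduling until 18:00
        mode="reschedule",         # free the worker slot between pokes
        poke_interval=300,         # re-check every 5 minutes
    )
```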

Operating System

Fedora 29

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

No response

Anything else

The symptoms described in this discussion sound the same, but no one has replied to it yet: #25651

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

notatallshaw-gts added the area:core and kind:bug labels on Aug 15, 2022

boring-cyborg bot commented Aug 15, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

notatallshaw-gts (author) commented:

It appears this was never an issue before 2.3.0 because the CeleryExecutor implemented its own trigger_tasks logic, until this PR landed: #23016

notatallshaw-gts (author) commented:

Enabling debug logs, I see something very interesting: on Airflow 2.1.3 I see the "Changing state:" debug message quite often (https://github.com/apache/airflow/blob/2.1.3/airflow/executors/base_executor.py#L198), but I never see the equivalent message in the Airflow 2.3.3 debug logs, even though it is still there: https://github.com/apache/airflow/blob/2.3.3/airflow/executors/base_executor.py#L238

Similarly, on Airflow 2.1.3 I often see the "running task instances" debug message go down to 0, but on Airflow 2.3.3 I never see it go down to 0.

@potiuk @malthe sorry to ping you directly, but I'm really starting to think this is a bug in the change to the CeleryExecutor rather than just our environment being broken. Are there any hints you can give that would help us pin down what the problem might be?

In the meantime I am going to try to reproduce the issue at home so I can post a reproducible example here that others can follow.


malthe commented Aug 17, 2022

It would be useful to have logs to see what exactly is going on.

There's more background on the original change in this issue: #21316.


notatallshaw-gts commented Aug 17, 2022

Thanks, I already read through that. I'll see what I can do about the logs (they're big, so I'll need to cut them down to the relevant part, and I'd also need management sign-off; if I'm able to reproduce this outside my company, it will make the process a lot simpler).

notatallshaw-gts (author) commented:

Looks like it was our fault!

It seems the issue was that our scheduler's Celery result backend was pointing to a different database than our workers' Celery result backend 🤦‍♂️.

Thanks for responding earlier, sorry it was on our side.
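For anyone debugging a similar mismatch, a quick sanity check is to confirm that the scheduler and every Celery worker resolve the same broker and result backend. A hedged sketch (run the same snippet on each host and compare the output; the print format is illustrative):

```python
# Print the Celery broker and result backend this Airflow installation resolves.
# Run on the scheduler host and on each worker host -- the values should match.
from airflow.configuration import conf

print("broker_url     =", conf.get("celery", "broker_url"))
print("result_backend =", conf.get("celery", "result_backend"))
```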
