
FIX Orphaned tasks stuck in executor as running #16550

Merged
merged 1 commit into from Jun 22, 2021

Conversation

@Jorricks Jorricks (Contributor) commented Jun 20, 2021

Related: #13542

The issue discussed there was caused by several things.
One of them is that when the scheduler picks up tasks it assumes to be orphaned, those tasks may never have made it to Celery.
When a task's execution never happens, it is cleaned up automatically, but only partially.
Once the scheduler then retries queuing the task, it cannot, because a reference to the task is still present in the executor's `running` set.
This PR fixes the described issue.
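The failure mode above can be illustrated with a small sketch. This is a hypothetical, simplified model (class and method names are illustrative, not Airflow's actual executor code): a stale entry left in the `running` set after a partial cleanup blocks any later attempt to queue the same task.

```python
class ToyExecutor:
    """Toy model of an executor tracking queued and running task keys."""

    def __init__(self):
        self.queued_tasks = {}
        self.running = set()

    def queue_task(self, key):
        # The scheduler refuses to queue a task it believes is running.
        if key in self.running:
            return False
        self.queued_tasks[key] = "airflow tasks run ..."
        return True

    def send_to_broker(self, key):
        # Task moves from queued to running; in the bug scenario it
        # never actually arrives at Celery.
        self.queued_tasks.pop(key)
        self.running.add(key)

    def cleanup_lost_task(self, key, remove_from_running):
        # Partial cleanup (the bug) leaves the key in `running`;
        # full cleanup (the fix) removes it as well.
        if remove_from_running:
            self.running.discard(key)


executor = ToyExecutor()
key = ("example_dag", "example_task", "2021-06-20")

executor.queue_task(key)
executor.send_to_broker(key)                        # never reaches Celery
executor.cleanup_lost_task(key, remove_from_running=False)
print(executor.queue_task(key))                     # False: stuck forever

executor.cleanup_lost_task(key, remove_from_running=True)
print(executor.queue_task(key))                     # True: retry succeeds
```

With the partial cleanup, every retry hits the stale `running` entry and returns `False`, which matches the "stuck as running" symptom; removing the key as part of cleanup lets the scheduler queue the task again.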

@boring-cyborg boring-cyborg bot commented Jun 20, 2021

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst).
Here are some useful points:

  • Pay attention to the quality of your code (flake8, pylint and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide and consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it is a heavy Docker environment, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@Jorricks Jorricks force-pushed the fix-tasks-stuck-in-executor-running branch from 21a176f to 28236f4 Jun 21, 2021
ashb
ashb approved these changes Jun 21, 2021
@ashb ashb merged commit 90f0088 into apache:main Jun 22, 2021
33 checks passed
@boring-cyborg boring-cyborg bot commented Jun 22, 2021

Awesome work, congrats on your first merged pull request!

@jhtimmins jhtimmins added this to the Airflow 2.1.1 milestone Jun 22, 2021
jhtimmins added a commit to astronomer/airflow that referenced this issue Jun 22, 2021
ashb added a commit that referenced this issue Jun 22, 2021
kaxil added a commit to astronomer/airflow that referenced this issue Jun 22, 2021
kaxil added a commit to astronomer/airflow that referenced this issue Jun 23, 2021
(cherry picked from commit 90f0088)
(cherry picked from commit 83fb4bf)
kaxil added a commit to astronomer/airflow that referenced this issue Jun 23, 2021
(cherry picked from commit 90f0088)
(cherry picked from commit 83fb4bf)
(cherry picked from commit d44c223)
kaxil pushed a commit that referenced this issue Jul 2, 2021
The Celery executor currently adopts any task that has ever run before and has been cleared since then.

**Example of the issue:**
We have a DAG that runs over 150 sensor tasks and 50 ETL tasks with a concurrency of 3 and max_active_runs of 16. This setup is required because we want to divide resources and do not want this DAG to take all of them. As a result, many tasks stay in the scheduled state for a while, because they cannot be queued due to the concurrency of 3. However, with the current implementation, if these tasks have ever run before, they get adopted by the scheduler's executor instance and become stuck forever [without this PR](#16550). They should never have been adopted in the first place.

**Contents of the PR**:
1. Tasks that are in the scheduled state should never have arrived at an executor. Hence, we remove the scheduled task state from the states eligible for adoption.
2. Since a task instance's `external_executor_id` is important in deciding whether it is adopted, we also reset it when we reset the state of the TaskInstance.
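The two changes above can be sketched together. This is an illustrative, simplified model of the adoption filter (not the actual CeleryExecutor code; names and the exact state set are assumptions): scheduled tasks are excluded from adoption, and resetting a task instance's state also clears its `external_executor_id` so it cannot be wrongly adopted later.

```python
# "scheduled" is deliberately NOT in the adoptable states (point 1).
STATES_ELIGIBLE_FOR_ADOPTION = {"queued", "running"}


class ToyTaskInstance:
    def __init__(self, task_id, state, external_executor_id=None):
        self.task_id = task_id
        self.state = state
        self.external_executor_id = external_executor_id

    def reset_state(self):
        # Point 2: resetting the state also clears the executor id,
        # so a stale id cannot trigger adoption later.
        self.state = "scheduled"
        self.external_executor_id = None


def try_adopt(task_instances):
    """Split task instances into (adopted, not_adopted)."""
    adopted, not_adopted = [], []
    for ti in task_instances:
        if ti.state in STATES_ELIGIBLE_FOR_ADOPTION and ti.external_executor_id:
            adopted.append(ti)
        else:
            not_adopted.append(ti)
    return adopted, not_adopted


tis = [
    ToyTaskInstance("etl_1", "queued", external_executor_id="celery-abc"),
    # Ran before, was cleared back to scheduled, still has a stale id:
    ToyTaskInstance("sensor_1", "scheduled", external_executor_id="celery-old"),
]
adopted, not_adopted = try_adopt(tis)
print([ti.task_id for ti in adopted])      # ['etl_1']
print([ti.task_id for ti in not_adopted])  # ['sensor_1']
```

Before the change, `sensor_1` would have matched on its leftover `external_executor_id` alone and been adopted while still merely scheduled; excluding the scheduled state and clearing the id on reset both close that hole.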
ferruzzi added a commit to ferruzzi/airflow that referenced this issue Jul 19, 2021
…6718)

jhtimmins added a commit that referenced this issue Aug 9, 2021
(cherry picked from commit 554a239)
jhtimmins added a commit that referenced this issue Aug 13, 2021
(cherry picked from commit 554a239)
kaxil added a commit that referenced this issue Aug 17, 2021
(cherry picked from commit 554a239)
jhtimmins added a commit that referenced this issue Aug 17, 2021
(cherry picked from commit 554a239)
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this issue Nov 27, 2021
(cherry picked from commit 554a23928efb4ff1d87d115ae2664edec3a9408c)

GitOrigin-RevId: 4d436182194b6d79d5a0c040d4dd07310ba74faf