


Executor reports task instance <TaskInstance: XXX [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally #41276

Closed
1 of 2 tasks
trlopes1974 opened this issue Aug 6, 2024 · 9 comments
Labels
area:core Can't Reproduce The problem cannot be reproduced kind:bug This is a clearly a bug

Comments

@trlopes1974

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.9.2

What happened?

Case 2 of issue #39717

Task fails with the message:
Executor reports task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally.

Although the Airflow task has an external_executor_id, it is not registered in Flower (Celery).

Dagrun Running: Task instance's dagrun was not in the 'running' state but in the state 'failed'.
Task Instance State: Task is in the 'failed' state.
The external_executor_id a01b358f-ad34-4b16-81b9-fd69218f613e does not exist in Flower / Celery.
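A minimal way to double-check this against Flower's HTTP API (a sketch, assuming Flower is reachable on localhost:5555; the UUID is the external_executor_id above):

curl -s http://localhost:5555/api/task/info/a01b358f-ad34-4b16-81b9-fd69218f613e
# Flower answers "Unknown task" (HTTP 404) when the broker never saw the task id,
# which is consistent with what is described above.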

Look at the timestamps in the logs (screenshots of the log timestamps attached to the issue):

There is a gap of 10 minutes between the Start (dummy) task and the dispatch_restores task, and this behaviour is recurrent (the 10-minute gap).

And in the task log:
'attempt=1.log.SchedulerJob.log'
(tkapp) ttauto@slautop02$ cat attempt=1.log.SchedulerJob.log
[2024-08-05T21:02:15.585+0100] {event_scheduler.py:40} WARNING - Marking task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> stuck in queued as failed. If the task instance has available retries, it will be retried.
[2024-08-05T21:02:16.750+0100] {scheduler_job_runner.py:843} ERROR - Executor reports task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?

What you think should happen instead?

The external task should exist in Celery, and the reason for the task being marked as failed should be explicit.
There is no other information that justifies the task failing.

How to reproduce

It just happens.

Operating System

NAME="Red Hat Enterprise Linux" VERSION="8.9 (Ootpa)" ID="rhel" ID_LIKE="fedora" VERSION_ID="8.9" PLATFORM_ID="platform:el8" PRETTY_NAME="Red Hat Enterprise Linux 8.9 (Ootpa)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos" HOME_URL="https://www.redhat.com/" DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8" BUG_REPORT_URL="https://bugzilla.redhat.com/" REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8" REDHAT_BUGZILLA_PRODUCT_VERSION=8.9 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="8.9"

Versions of Apache Airflow Providers

apache-airflow==2.9.2
apache-airflow-providers-celery==3.7.2
apache-airflow-providers-common-io==1.3.2
apache-airflow-providers-common-sql==1.14.0
apache-airflow-providers-fab==1.1.1
apache-airflow-providers-ftp==3.9.1
apache-airflow-providers-hashicorp==3.7.1
apache-airflow-providers-http==4.11.1
apache-airflow-providers-imap==3.6.1
apache-airflow-providers-postgres==5.11.1
apache-airflow-providers-sftp==4.10.1
apache-airflow-providers-smtp==1.7.1
apache-airflow-providers-sqlite==3.8.1
apache-airflow-providers-ssh==3.11.1

Deployment

Other

Deployment details

Just a simple Airflow install on a Linux box.
Using PostgreSQL, RabbitMQ and the Celery executor.

Anything else?

It happens randomly.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@scaoupgrade

That 10-minute gap could be because of this config:

task_queued_timeout:
  description: |
    Amount of time a task can be in the queued state before being retried or set to failed.
  version_added: 2.6.0
  type: float
  example: ~
  default: "600.0"

@trlopes1974
Author

trlopes1974 commented Aug 6, 2024

Maybe; it makes some sense, as we do not have that setting in our configuration.
But what is bothering me is WHY: why did it time out after being queued? We have no exhaustion anywhere, not in CPU, memory, pool slots, or concurrency. I'd say that at the current time we have a very light system...

So, I believe that the real question is, why is the task queued but never started?

+/- 1 year of history, with some cleanups (screenshot attached to the issue).

All timeouts in our config:
dagbag_import_timeout = 120.0
dag_file_processor_timeout = 180
default_task_execution_timeout =
web_server_master_timeout = 120
web_server_worker_timeout = 120
smtp_timeout = 30
operation_timeout = 2
stalled_task_timeout = 0
default_timeout = 604800
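
Since task_queued_timeout is not among them, the 600-second default applies. A quick way to confirm the effective value (a sketch, run on the scheduler host):

airflow config get-value scheduler task_queued_timeout
# prints 600.0 if the default is in effect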

@scaoupgrade

So, I believe that the real question is, why is the task queued but never started?

It's the same case for me when the incident happened the other day. Workers are all online, but no task gets executed. I noticed something abnormal in the scheduler log when this happens: all tasks in a DAG were repeatedly queued thousands of times in one second. It looks like the scheduler gets into a strange state.

@trlopes1974
Author

trlopes1974 commented Aug 6, 2024 via email

@shahar1 added Can't Reproduce (The problem cannot be reproduced) and removed needs-triage (label for new issues that we didn't triage yet) labels Aug 24, 2024

@EvertonSA
Contributor

EvertonSA commented Sep 30, 2024

Workers are all online, but no task gets executed.

I had the same; we noticed it was closely aligned with Redis instabilities. After we migrated to a SaaS Redis, we did not see it anymore. Also, allowing KEDA to autoscale to 0 naturally restarted the workers.

The mini scheduler was also disabled, so I'm unsure which change fixed it.

@xudalei1977

I use Amazon Managed Workflows for Apache Airflow 2.9.2 and hit the issue when executing a simple DAG. The task log is:
[2024-10-07T03:09:30.278+0000] {scheduler_job_runner.py:769} ERROR - Executor reports task instance <TaskInstance: example_dag.example_task_1 manual__2024-10-07T03:09:07.236750+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?

The scheduler log shows:

[2024-10-07T03:09:30.239+0000] {celery_executor.py:290} ERROR - Error sending Celery task: Timeout, PID: 21
Celery Task ID: TaskInstanceKey(dag_id='example_dag', task_id='example_task_1', run_id='manual__2024-10-07T03:09:07.236750+00:00', try_number=1, map_index=-1)
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 220, in send_task_to_executor
    result = task_to_run.apply_async(args=[command], queue=queue)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/celery/app/task.py", line 594, in apply_async
    return app.send_task(
           ^^^^^^^^^^^^^^
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/celery/app/base.py", line 801, in send_task
    amqp.send_task_message(P, name, message, **options)
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/celery/app/amqp.py", line 518, in send_task_message
    ret = producer.publish(
          ^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/kombu/messaging.py", line 186, in publish
@vaibhavnsingh

Please check the property AIRFLOW__CELERY__OPERATION_TIMEOUT. In our environment we were getting an AirflowTaskTimeout error; we had to increase the timeout from 1.0 sec.
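
A minimal sketch of the change described above (the 3.0-second value is only an example, not a recommendation):

# in airflow.cfg:
#   [celery]
#   operation_timeout = 3.0
# or as an environment variable on the scheduler:
export AIRFLOW__CELERY__OPERATION_TIMEOUT=3.0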

@xudalei1977

Please check the property AIRFLOW__CELERY__OPERATION_TIMEOUT. In our environment we were getting an AirflowTaskTimeout error; we had to increase the timeout from 1.0 sec.

Thanks very much, it works.

@apache locked and limited conversation to collaborators Oct 12, 2024
@potiuk converted this issue into discussion #42966 Oct 12, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
