Running tasks marked as skipped on DagRun timeout #30264
The DAG run timed out, so the DAG run status is marked as failed. See airflow/jobs/scheduler_job.py, lines 1302 to 1315 at commit 3239720:
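For context, that block behaves roughly like the following paraphrased sketch (not the exact code at that commit): the run is failed and every unfinished task instance is flipped straight to SKIPPED, so no task-level failure handling ever runs.

```python
from airflow.utils import timezone
from airflow.utils.state import DagRunState, State, TaskInstanceState


def fail_timed_out_dag_run(dag, dag_run, session):
    # Paraphrased sketch of the scheduler's dagrun_timeout handling:
    # fail the run, then mark every unfinished task instance as SKIPPED.
    if dag.dagrun_timeout and dag_run.start_date < timezone.utcnow() - dag.dagrun_timeout:
        dag_run.set_state(DagRunState.FAILED)
        for ti in dag_run.get_task_instances(state=State.unfinished, session=session):
            ti.set_state(TaskInstanceState.SKIPPED, session=session)
```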
@eladkal Maybe they should not be set to failed, but they should also not be set to skipped. The task was not skipped, technically speaking. An issue that has come up is that a user wants an alert for a specific task failure, so they don't want to set the callback at the DAG level. I believe it's worth discussing either marking these tasks that are stopped mid-run as FAILED or introducing a new state into the task instance states. I'm curious whether a new state could fit alongside the existing ones here:
https://github.com/apache/airflow/blob/main/airflow/utils/state.py#L42
Thoughts?
Instead of setting the state to SKIPPED, I propose calling handle_failure such that the callbacks are executed.
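A minimal sketch of that proposal, assuming it is safe for the scheduler to call handle_failure on unfinished task instances (illustrative, not an actual patch):

```python
from airflow.utils.state import DagRunState, State


def fail_timed_out_dag_run_with_callbacks(dag, dag_run, session):
    # Rather than flipping unfinished task instances to SKIPPED, route them
    # through handle_failure so on_failure_callback and the usual failure
    # bookkeeping (emails, retries) get a chance to run.
    dag_run.set_state(DagRunState.FAILED)
    for ti in dag_run.get_task_instances(state=State.unfinished, session=session):
        ti.handle_failure(
            error=f"DagRun {dag_run.run_id} exceeded dagrun_timeout {dag.dagrun_timeout}",
            session=session,
        )
```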
Hey Alan, did more digging. So in short, adding the …
Just to clarify: if the goal is to change the current behavior from skipped to failed, this is a breaking change and cannot happen before Airflow 3. Before discussing how to get it done, I suggest first discussing whether it should be done. I'm not convinced that setting tasks to failed when the DAG run times out is the desired behavior.
The goal is to change the behavior, but not necessarily from skipped to failed; just something that triggers the task's callback.
Since the task didn't fail, I don't see the need to run the failure callback for every stopped task. The dag failure callback is enough to handle this case: we can check whether the run failed due to timeout and select the skipped tasks from the metadata DB to do what we need to do. WDYT?
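A minimal sketch of that approach using the DAG-level on_failure_callback; the dag id, task ids, timings, and alert body are illustrative:

```python
import datetime

import pendulum
from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.state import TaskInstanceState


def dag_failure_alert(context):
    # Runs once when the dag run is marked failed, which includes dagrun_timeout.
    dag_run = context["dag_run"]
    skipped = [
        ti.task_id
        for ti in dag_run.get_task_instances(state=[TaskInstanceState.SKIPPED])
    ]
    print(f"Run {dag_run.run_id} failed; task instances left as skipped: {skipped}")


with DAG(
    dag_id="timeout_alert_example",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
    dagrun_timeout=datetime.timedelta(minutes=5),
    on_failure_callback=dag_failure_alert,
):
    BashOperator(task_id="long_task", bash_command="sleep 600")
```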
Well, this user wants a callback if this specific task fails, so not at the dag level. Could be that we need an on-skipped callback. Thoughts?
I'm OK with adding on_skipped_callback.
Should we scope this issue to adding on_skipped_callback?
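If scoped that way, usage could look roughly like this. on_skipped_callback is the proposed operator argument discussed above and is not available in 2.5.x, so treat this purely as a sketch:

```python
import pendulum
from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator


def notify_skipped(context):
    # Would run when the task instance ends up in SKIPPED, e.g. after the
    # dag run timed out while this task was still pending or running.
    print(f"{context['task_instance'].task_id} was skipped in run {context['run_id']}")


with DAG(
    dag_id="skipped_callback_sketch",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
):
    BashOperator(
        task_id="long_task",
        bash_command="sleep 600",
        on_skipped_callback=notify_skipped,  # proposed argument, not in 2.5.x
    )
```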
@erdos2n would you have an update on the last question from Elad? |
Experiencing the same issue, and it is my opinion that because the Airflow Scheduler is SIGTERM'ing running tasks, that is a legitimate reason to mark them as failed. The task was running, now it is not, and it did not complete successfully; that is a task failure, not a skipped task.
Hello, |
I believe the question is what it means when a dagrun times out. If dagrun timeout means "I need everything to stop, including the task instances", then forcing task termination is appropriate. I don't agree with setting the ending state to skipped if a task was in the running state, since the task was in the middle of execution. Looking at @RNHTTR's PR, I see the logic is to mark all unfinished tasks as skipped.

Instead, I think it should be more refined. The scheduled and queued states should be set to skipped IF that was their first attempt (checking the try number), though this may not work for sensors that are rescheduled and are in the middle of being scheduled / queued. The rest of the states should be set to failed because they imply the task instance was attempted; tasks that were attempted should be failed. A sketch of this mapping follows below.

It is worth noting that the PR was written and released for an Airflow version (2.0.0) where the active dagruns were determined by the task instances instead of the dagrun state, as pointed out in issues/13407. In Airflow 2.6.x, active dagruns are determined by the state of the dagrun and not the task instance states, which means it does not matter which state the running task ends up in: skipped, failed, or even running. Referring back to the question I posed earlier, depending on what it means when the dagrun times out, the state of a running task should reflect that definition.
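A rough sketch of that mapping (illustrative only; try-number and sensor-reschedule handling would need more care than shown here, and prev_attempted_tries is just one assumed way to detect a never-attempted task):

```python
from airflow.utils.state import TaskInstanceState

# Task instances that have not started executing yet.
PENDING_STATES = {TaskInstanceState.SCHEDULED, TaskInstanceState.QUEUED}


def state_after_dagrun_timeout(ti) -> TaskInstanceState:
    # Never-attempted, still-pending task instances are genuinely skipped;
    # anything that was attempted (running, up_for_retry, deferred, ...) has failed.
    never_attempted = ti.prev_attempted_tries == 0
    if ti.state in PENDING_STATES and never_attempted:
        return TaskInstanceState.SKIPPED
    return TaskInstanceState.FAILED
```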
I agree with @wolfier. If a task was running, I feel it could then proceed to FAILED, looking at our definitions for the states in airflow/utils/state.py.
This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author. |
This issue has been closed because it has not received response from the issue author. |
Can this be re-opened?

First, that behavior seems wrong: if a task is taking too long and hits the dagrun_timeout, marking it as skipped is misleading.

Second, @hussein-awala wrote,
But what is the "dag failure callback"? I don't see a callback like that in the callback docs. (Do you mean the task-level on_failure_callback?)

A DAG-level failure callback would be very nice to have. Thank you.
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author. |
This issue has been closed because it has not received response from the issue author. |
I agree with @pankajkoti
I think we should re-open this issue |
Apache Airflow version
2.5.2
What happened
Users are experiencing the following: on_failure_callback never gets invoked. Here are some example logs:
What you think should happen instead
Once a DagRun times out, tasks that are currently in RUNNING should be marked as FAILED and downstream tasks should be marked as UPSTREAM_FAILED
How to reproduce
The following DAG will cause this intermittently
Downstream tasks are left with no status; see screenshot.
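A minimal sketch of the kind of DAG described, not the original reproduction; the dag id, task names, and timings are illustrative. The dagrun_timeout is shorter than the task's runtime, so the scheduler fails the run, the running task ends up SKIPPED, and its on_failure_callback never fires:

```python
import datetime

import pendulum
from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator


def task_failed_alert(context):
    # The alert users expect when the long-running task is killed by the timeout.
    print(f"{context['task_instance'].task_id} failed in run {context['run_id']}")


with DAG(
    dag_id="dagrun_timeout_repro",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
    dagrun_timeout=datetime.timedelta(minutes=1),  # shorter than the task below
):
    long_running = BashOperator(
        task_id="long_running",
        bash_command="sleep 300",
        on_failure_callback=task_failed_alert,  # reported as never being invoked
    )
    downstream = BashOperator(task_id="downstream", bash_command="echo done")
    long_running >> downstream
```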
Operating System
MacOS
Versions of Apache Airflow Providers
N/A
Deployment
Astronomer
Deployment details
Airflow Version 2.5.2
Anything else
Every time a DagRun times out
Are you willing to submit PR?
Code of Conduct