Running tasks marked as skipped on DagRun timeout #30264

erdos2n · 2023-03-23T21:27:17Z

Apache Airflow version

2.5.2

What happened

Users are experiencing the following:

A DAG begins to run
Task(s) go into running state, as expected
The DagRun times out, marking any currently running task as SKIPPED
Because the tasks are not marked as failed, the on_failure_callback never gets invoked

Here are some example logs:

[2023-03-22, 16:30:02 PDT] {local_task_job.py:266} WARNING - DagRun timed out after 4:00:02.394287.
[2023-03-22, 16:30:07 PDT] {local_task_job.py:266} WARNING - DagRun timed out after 4:00:07.447373.
[2023-03-22, 16:30:07 PDT] {local_task_job.py:272} WARNING - State of this instance has been externally set to skipped. Terminating instance.
[2023-03-22, 16:30:07 PDT] {process_utils.py:129} INFO - Sending Signals.SIGTERM to group 8515. PIDs of all processes in the group: [8515]

What you think should happen instead

Once a DagRun times out, tasks that are currently in RUNNING should be marked as FAILED and downstream tasks should be marked as UPSTREAM_FAILED

How to reproduce

The following DAG will cause this intermittently

import logging
import time
from datetime import datetime, timedelta

from airflow.decorators import dag, task



@task
def task_1():
    import random
    pulses = random.randint(5, 10)
    for i in range(pulses):
        logging.info(f"pulsing: pulse...{i}")
        time.sleep(4)


@task
def task_2():
    import random
    pulses = random.randint(10, 20)
    for i in range(pulses):
        logging.info(f"pulsing: pulse...{i}")
        time.sleep(5)

@task
def downstream_finished_task():
    logging.info("task finished")
    time.sleep(20)

@dag(dag_id="dagrun_interval_test",
     schedule_interval="*/5 * * * *",
     start_date=datetime(2023, 3, 23),
     dagrun_timeout=timedelta(seconds=30),
     catchup=False)
def my_dag():
    return [task_1(), task_2()] >> downstream_finished_task()


dag = my_dag()

Running tasks marked skipped
Downstream left with no status

See screenshot

Operating System

MacOS

Versions of Apache Airflow Providers

N/A

Deployment

Astronomer

Deployment details

Airflow Version 2.5.2

Anything else

Every time a DagRun times out

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct


eladkal · 2023-03-23T23:11:01Z

The DAG timed out, so the DAG status is marked as failed.
The tasks did not finish, so why should they be set to failed?

airflow/airflow/jobs/scheduler_job.py

Lines 1302 to 1315 in 3239720

        if (
            dag_run.start_date
            and dag.dagrun_timeout
            and dag_run.start_date < timezone.utcnow() - dag.dagrun_timeout
        ):
            dag_run.set_state(DagRunState.FAILED)
            unfinished_task_instances = (
                session.query(TI)
                .filter(TI.dag_id == dag_run.dag_id)
                .filter(TI.run_id == dag_run.run_id)
                .filter(TI.state.in_(State.unfinished))
            )
            for task_instance in unfinished_task_instances:
                task_instance.state = TaskInstanceState.SKIPPED
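To make the effect of the quoted check easier to follow, here is a small standalone model of it. This is only a sketch: `FakeTaskInstance`, `apply_dagrun_timeout`, and the reduced `UNFINISHED` set are hypothetical stand-ins, not Airflow classes, and the real scheduler works against the metadata database rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class State(str, Enum):
    SCHEDULED = "scheduled"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"
    SKIPPED = "skipped"


# Assumption: a reduced set of "unfinished" states, enough for this toy model.
UNFINISHED = {State.SCHEDULED, State.QUEUED, State.RUNNING}


@dataclass
class FakeTaskInstance:
    """Hypothetical stand-in for a task instance row."""
    task_id: str
    state: State


def apply_dagrun_timeout(start_date, dagrun_timeout, task_instances, now):
    """Mimic the quoted check: on timeout, mark unfinished tasks SKIPPED.

    Returns True when the run is considered timed out.
    """
    if start_date and dagrun_timeout and start_date < now - dagrun_timeout:
        for ti in task_instances:
            if ti.state in UNFINISHED:
                ti.state = State.SKIPPED
        return True
    return False


start = datetime(2023, 3, 22, 12, 0, tzinfo=timezone.utc)
tis = [
    FakeTaskInstance("task_1", State.SUCCESS),
    FakeTaskInstance("task_2", State.RUNNING),
]
timed_out = apply_dagrun_timeout(
    start, timedelta(seconds=30), tis, now=start + timedelta(seconds=31)
)
final_states = {ti.task_id: ti.state.value for ti in tis}
```

In this model the task that was still running ends up as skipped rather than failed, which is the behavior reported in the issue.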

erdos2n · 2023-03-24T12:38:06Z

@eladkal Maybe they should not be set to fail, but they should also not be set to skipped. The task was not skipped, technically speaking.

An issue that has come up is that a user wants an alert for a specific task failure, so they don't want to set the on_failure_callback on the dag level. That specific task gets marked skipped on a dagrun_timeout and the on_failure_callback isn't triggered.

I believe it's worth discussing either marking these tasks that are stopped mid-run as FAILED or introducing a new state into the task instance.

I'm curious if SHUTDOWN makes more sense in this instance. It seems to fit what is occurring more than skipped.

SHUTDOWN # External request to shut down (e.g. marked failed when running)

https://github.com/apache/airflow/blob/main/airflow/utils/state.py#L42

Thoughts?

wolfier · 2023-03-24T16:07:51Z

Instead of setting the state to SKIPPED, I propose calling handle_failure such that the callbacks are executed.

erdos2n · 2023-03-24T16:24:34Z

Hey Alan, I did more digging. handle_failure is only called on failures, which means that my initial proposal of SHUTDOWN would not work anyway. But if the remaining tasks are marked as upstream_failed, then they would trigger the handle_failure callback.

So in short, adding the handle_failure callback would work, or marking downstream tasks as upstream_failed would work. Thoughts?

eladkal · 2023-03-24T16:50:22Z

Just to clarify: if the goal is to change the current behavior from skipped to failed, this is a breaking change and cannot happen before Airflow 3.

Before discussing how to get it done, I suggest first discussing whether it should be done. I'm not convinced that setting tasks to failed when the DAG times out is the desired behavior.

erdos2n · 2023-03-24T20:02:31Z

The goal is to change the behavior, but not necessarily from skipped to fail, just something to trigger the handle_failure method so callbacks can exhibit (more) expected behavior from users.

hussein-awala · 2023-03-25T14:43:01Z

The goal is to change the behavior, but not necessarily from skipped to fail, just something to trigger the handle_failure method so callbacks can exhibit (more) expected behavior from users.

Since the task didn't fail, I don't see the need to run the failure callback in every stopped task, the dag failure callback is enough to handle this case, where we can check if the run failed due to timeout, and select skipped tasks in metadata to do what we need to do. WDYT?
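The DAG-level approach suggested here could look roughly like the sketch below. It assumes that the callback context carries a `reason` key set to "timed_out" for a timeout and a `dag_run` object whose `get_task_instances(state=...)` filters by state; both are assumptions to verify against the Airflow version in use. `FakeTI`, `FakeDagRun`, and `collect_tasks_stopped_by_timeout` are hypothetical names used only so the example runs on its own.

```python
def collect_tasks_stopped_by_timeout(context):
    """Return the task_ids left in 'skipped' after a DagRun timeout.

    Assumption: context["reason"] == "timed_out" identifies a timeout.
    """
    if context.get("reason") != "timed_out":
        return []
    dag_run = context["dag_run"]
    skipped = dag_run.get_task_instances(state=["skipped"])
    return sorted(ti.task_id for ti in skipped)


class FakeTI:
    """Hypothetical stand-in for a task instance."""
    def __init__(self, task_id, state):
        self.task_id = task_id
        self.state = state


class FakeDagRun:
    """Hypothetical stand-in exposing only get_task_instances."""
    def __init__(self, task_instances):
        self._task_instances = task_instances

    def get_task_instances(self, state=None):
        if state is None:
            return list(self._task_instances)
        return [ti for ti in self._task_instances if ti.state in state]


dag_run = FakeDagRun([FakeTI("task_1", "success"), FakeTI("task_2", "skipped")])
stopped = collect_tasks_stopped_by_timeout({"reason": "timed_out", "dag_run": dag_run})
not_a_timeout = collect_tasks_stopped_by_timeout({"reason": "task_failure", "dag_run": dag_run})
```

A function like this would be called from the DAG's own failure callback, with per-task alerting decided from the returned list. Note that it cannot tell a task stopped by the timeout apart from one that was skipped for another reason earlier in the same run.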

erdos2n · 2023-03-25T14:52:16Z

Well this user wants a callback if this specific task fails, so not on the dag level. Could be that we need a on skipped callback. Thoughts?

eladkal · 2023-04-12T18:44:30Z

Well this user wants a callback if this specific task fails, so not on the dag level. Could be that we need a on skipped callback. Thoughts?

I'm OK with adding on_skipped_callback (regardless of what we discuss here, this is probably something we should add)
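As a rough illustration of why a separate skipped callback matters for this use case, the toy dispatcher below picks a callback by the task's final state. It is a simplified model and not Airflow's dispatch code; `CALLBACK_FOR_STATE` and `fire_callback` are hypothetical names, and whether a real on_skipped_callback would fire when the scheduler sets the state externally is not settled by this sketch.

```python
# Assumed mapping from a task's final state to the callback that would run.
CALLBACK_FOR_STATE = {
    "success": "on_success_callback",
    "failed": "on_failure_callback",
    "skipped": "on_skipped_callback",
}


def fire_callback(final_state, callbacks, context):
    """Run the callback registered for final_state, if there is one.

    Returns the name of the callback that ran, or None if nothing ran.
    """
    name = CALLBACK_FOR_STATE.get(final_state)
    callback = callbacks.get(name)
    if callback is None:
        return None
    callback(context)
    return name


alerts = []
context = {"task_id": "task_2"}

# Only a failure callback is registered: a skipped task triggers nothing.
only_failure = {"on_failure_callback": lambda ctx: alerts.append(f"{ctx['task_id']} failed")}
fired_without = fire_callback("skipped", only_failure, context)

# With a skipped callback registered, the same final state raises an alert.
with_skipped = dict(only_failure)
with_skipped["on_skipped_callback"] = lambda ctx: alerts.append(f"{ctx['task_id']} was skipped")
fired_with = fire_callback("skipped", with_skipped, context)
```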

eladkal · 2023-04-27T11:22:16Z

Should we scope this issue to adding on_skipped_callback ?
@erdos2n is that a suitable solution for your use case?

pankajkoti · 2023-07-21T20:03:54Z

@erdos2n would you have an update on the last question from Elad?

seanmuth · 2023-07-21T20:09:07Z

Experiencing the same issue, and it is my opinion that because the Airflow Scheduler is SIGTERM'ing running tasks, that is a legitimate reason to mark them as failed. The task was running and now it is not, and it did not complete successfully; that is a task failure, not a skipped task.

erdos2n · 2023-07-21T20:52:21Z

Hello,
I'm of the opinion that an on skipped callback would be a good addition.

wolfier · 2023-07-21T22:35:34Z

I believe the question is what does it mean when a dagrun times out.

If dagrun timeout means "I need everything to stop including the task instances" then forcing task termination is appropriate. I don't agree with setting the ending state as skipped if a task was in the running state, since the task was in the middle of execution.

Looking at @RNHTTR's PR, I see the logic is to mark all tasks that are unfinished to skipped.

            TaskInstanceState.SCHEDULED,
            TaskInstanceState.QUEUED,
            TaskInstanceState.RUNNING,
            TaskInstanceState.SHUTDOWN,
            TaskInstanceState.RESTARTING,
            TaskInstanceState.UP_FOR_RETRY,
            TaskInstanceState.UP_FOR_RESCHEDULE,
            TaskInstanceState.DEFERRED,

Instead, I think it should be more refined.

The scheduled and queued state should be set to skipped IF that was their first attempt (checking try number). Though this may not work for sensors that are rescheduled and are in the middle of being scheduled / queued.

The rest of the states should be set to failed because they imply the task instance was attempted. Tasks that are attempted should be failed.
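The refinement proposed above can be written down as a small decision function. This is a sketch of the proposal only, not of any existing Airflow behavior: `state_after_timeout` is a hypothetical helper, states are plain strings, and treating `try_number <= 1` as "first attempt" is an assumption that would need checking against how Airflow actually counts tries. The proposal does not say what to do with scheduled or queued tasks on a later attempt; the sketch assumes they fail.

```python
NOT_YET_STARTED = {"scheduled", "queued"}
ATTEMPTED = {
    "running",
    "shutdown",
    "restarting",
    "up_for_retry",
    "up_for_reschedule",
    "deferred",
}


def state_after_timeout(state, try_number):
    """Return the state a task instance would get when its DagRun times out."""
    if state in NOT_YET_STARTED:
        # Never attempted: skipped. Assumption: a later attempt counts as failed.
        return "skipped" if try_number <= 1 else "failed"
    if state in ATTEMPTED:
        # The task instance was attempted, so the timeout counts as a failure.
        return "failed"
    # Finished states (success, failed, skipped, ...) are left untouched.
    return state


examples = {
    ("queued", 1): state_after_timeout("queued", 1),
    ("queued", 2): state_after_timeout("queued", 2),
    ("running", 1): state_after_timeout("running", 1),
    ("success", 1): state_after_timeout("success", 1),
}
```

As noted above, rescheduled sensors that are back in scheduled or queued are a case this simple try-number check may not handle well.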

It is worth noting that the PR was written and released for an Airflow version (see 2.0.0) where the active dagruns were determined by the task instance states instead of the dagrun state, as pointed out by issues/13407. In Airflow 2.6.x, active dagruns are determined by the state of the dagrun and not by the task instance states. This means that it does not matter which state the running task ends up in: skipped, failed, or even running.

Referring back to the question I posed earlier, depending on what it means when the dagrun times out, the state of a running task should reflect that definition.

pankajkoti · 2023-07-26T06:15:45Z

I agree with @wolfier . If a task was running, I feel, then it could proceed to Failed / Shutdown instead of Skipped.
Wouldn't Skipped mean that it was never attempted or went to Running state at all?

Looking at our definitions for the states:
https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html#task-instances

Failed / Shutdown sounds more reasonable than the Skipped state.

github-actions · 2023-08-26T00:09:58Z

This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

github-actions · 2023-09-02T06:43:19Z

This issue has been closed because it has not received response from the issue author.

benbuckman · 2023-09-07T20:48:46Z

Can this be re-opened?
We also encountered this, and were very surprised that the on_failure_callback was not fired, because it only runs on task failure, but the task that was running when the timeout was hit was skipped not failed.

First, that behavior seems wrong: if a task is taking too long and hits the dagrun_timeout, I would expect that task (as well as the DAG) to fail.

Second, @hussein-awala wrote,

I don't see the need to run the failure callback in every stopped task, the dag failure callback is enough to handle this case

But what is the "dag failure callback"? I don't see a callback like that in these docs:
https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/callbacks.html

(Do you mean the sla_miss_callback? i.e. set the DAG's SLA to the same as the DAG's dagrun_timeout?)

A DAG-level failure callback would be very nice to have.

Thank you.

github-actions · 2023-09-24T00:11:50Z

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

github-actions · 2023-10-10T00:11:10Z

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

github-actions · 2023-10-17T00:11:31Z

This issue has been closed because it has not received response from the issue author.

raphaelauv · 2024-02-05T09:57:13Z

I agree with @pankajkoti

Failed / Shutdown sounds more reasonable than the Skipped state.

I think we should re-open this issue

erdos2n added the area:core, kind:bug, and needs-triage labels Mar 23, 2023

pankajkoti added the pending-response label Jul 21, 2023

github-actions bot added the stale label Aug 26, 2023

github-actions bot closed this as not planned Sep 2, 2023

pankajkoti reopened this Sep 8, 2023

github-actions bot removed the stale label Sep 10, 2023

github-actions bot added the stale label Sep 24, 2023

hussein-awala removed the stale label Sep 24, 2023

github-actions bot added the stale label Oct 10, 2023

github-actions bot closed this as not planned Oct 17, 2023

RNHTTR mentioned this issue Nov 28, 2023: Add an on_skipped_callback #35936 (Closed)

hussein-awala mentioned this issue Dec 25, 2023: Add on_skipped_callback in to BaseOperator #36374 (Merged)
