max_active_runs = 1 can still create multiple active execution runs #9975

Closed
match-gabeflores opened this issue Jul 24, 2020 · 71 comments · Fixed by #16401

match-gabeflores commented Jul 24, 2020

Edit: There is a separate issue affecting max_active_runs in 1.10.14. That regression is fixed in 1.10.15.

Edit2: Version 2.1.3 contains some fixes but also contains bad regressions involving max_active_runs. Use 2.1.4 for the complete fixes to this issue.

Edit3: Version 2.2.0 contains a fix for max_active_runs when using the dags trigger command or the TriggerDagRunOperator. #18583

--

Apache Airflow version: 1.10.11, LocalExecutor

What happened:

I have max_active_runs = 1 in my DAG file (which consists of multiple tasks) and I manually triggered the DAG. While that first execution was still running, a second execution began at its scheduled time.

I should note that the second execution is initially queued. It's only when the DAG's first execution moves to the next task that the second execution actually starts.

My DAG definition. The DAG just contains tasks using PythonOperator.

from datetime import timedelta

from airflow import DAG

# default_args is defined elsewhere in the file
dag = DAG(
    'dag1',
    default_args=default_args,
    description='xyz',
    schedule_interval=timedelta(hours=1),
    catchup=False,
    max_active_runs=1
)

What you expected to happen:

Only one execution should run. A second execution should be queued but not begin executing.

How to reproduce it:
In my scenario:

  1. Manually trigger a DAG with multiple tasks (Execution1). Have task1 take longer than the start of the next scheduled execution; for example, if the schedule interval is 1 hour, have task1 take longer than 1 hour so that the second execution (Execution2) queues up.
  2. When task1 of Execution1 finishes, just before task2 starts, the second execution (Execution2, which is already queued) begins running.
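For illustration, a minimal sketch of a DAG along these lines (hypothetical task names and 1.10-style import paths; not the reporter's actual DAG - the sleep only needs to outlast the schedule_interval):

import time
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def long_task():
    # Outlast the 1-hour schedule_interval so the next run queues up.
    time.sleep(2 * 60 * 60)


def quick_task():
    print("second task")


with DAG('repro_max_active_runs',
         start_date=datetime(2020, 7, 1),
         schedule_interval=timedelta(hours=1),
         catchup=False,
         max_active_runs=1) as dag:
    task1 = PythonOperator(task_id='task1', python_callable=long_task)
    task2 = PythonOperator(task_id='task2', python_callable=quick_task)
    task1 >> task2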


Anything else we need to know:
I think the second execution begins in between task1 and task2 of Execution1. I think there's a few-second delay there, and maybe that's when Airflow thinks there's no active DAG execution? That's just a guess.

By the way, this can have potentially disastrous effects (errors, incomplete data without errors, etc.).

match-gabeflores added the kind:bug label on Jul 24, 2020


mik-laj commented Aug 20, 2020

The problem is, we don't have a state that describes a DAG Run that is saved but not running. All DAG Runs initially have the running state. If we want to fix this bug we have to add a new DAG Run state.
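For context, a rough sketch (not the scheduler's actual code) of the kind of check this implies, using the DagRun.find API that exists in 1.10.x and 2.x - the helper name here is hypothetical:

from airflow.models import DagRun
from airflow.utils.state import State


def can_start_new_run(dag):
    # Count DAG runs currently in the "running" state for this DAG.
    running = DagRun.find(dag_id=dag.dag_id, state=State.RUNNING)
    # Because there is no "queued" DagRun state, a freshly created run is
    # already "running", so a check like this can pass in the gap between
    # task1 finishing and task2 starting.
    return len(running) < dag.max_active_runs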

@dinesh-procore

I am running into the exact same issue.

@alechheb

The same issue here


ashb commented Oct 16, 2020

Would someone be able to test if this specific case still happens on Airflow 2.0.0alpha1? (A few things about how we create DagRuns changed, so this might have been fixed, but I didn't specifically set out to fix this.)


ashb commented Oct 16, 2020

Read the reproduction steps, and this bit sounds bang on:

I think the second execution begins in between the task1 and task2 of execution1. I think there's a few second delay there and maybe that's when Airflow thinks there's no dag execution? That's just a guess.

Yes, looking at the code that sounds right, and also hasn't changed in 2.0.0alpha1, the same logic is used.

@natejenkins21

Same issue here. Causing a lot of issues for my backfills...


kaxil commented Dec 14, 2020

@natejenkins21 Can you provide reproduction steps please?

@nathadfield

For what it's worth, this doesn't seem to be an issue on master but it certainly is on 2.0 installed from pip.

I created this DAG, switched it on and allowed it to catch up, and then cleared the status of all tasks.

from airflow import models
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2020, 12, 10),
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

dag_name = 'test_dag'

with models.DAG(dag_name,
                default_args=default_args,
                schedule_interval='0 0 * * *',
                catchup=True,
                max_active_runs=1
                ) as dag:

    start = BashOperator(
        task_id='start',
        bash_command='echo "Starting"; echo "Sleeping for 30 seconds"; sleep 30; echo "Finished"'
    )

As you can see, all the tasks are now running at the same time.


@nathadfield

Here's the same DAG running against master via breeze.



kaxil commented Dec 22, 2020

@nathadfield Does it occur without clearing task instances?

@nathadfield

@nathadfield Does it occur without clearing task instances?

Yes. I removed catchup and triggered the DAG several times via the CLI and all of the tasks were running at the same time.



kaxil commented Dec 25, 2020

@nathadfield What metadata DB do you use?

Just wondering if it is related to #13278

@nathadfield

#13278

@kaxil We're using Postgres so it's probably not related.


kaxil commented Jan 4, 2021

Cool, I am looking at this today and tomorrow


kaxil commented Jan 4, 2021

This happens on master too, but only when you manually trigger DAGs multiple times.


kaxil added this to the Airflow 2.0.1 milestone on Jan 4, 2021

kamac commented Jan 5, 2021

This also happens when triggering DAGs manually with create_dagrun. If the DagRun is created with the RUNNING state, max_active_runs is not respected (the run isn't counted if you browse the DAG's details page).

I've also tried creating the run with the SCHEDULED state, but it's never run by the scheduler. (Perhaps I'm doing it the wrong way - I'm still investigating that.)

Here's the code for two DAGs to reproduce this: https://gist.github.com/kamac/7112af78f1a9004142903d4fe6e387d4
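For reference, a minimal sketch of that kind of manual trigger (hypothetical dag_id; assumes the 1.10.x create_dagrun API rather than the gist's exact code):

from airflow.models import DagBag
from airflow.utils import timezone
from airflow.utils.state import State

dag = DagBag().get_dag('dag1')
now = timezone.utcnow()
# Creating the run directly in the RUNNING state bypasses the
# max_active_runs accounting described above; State.NONE leaves it
# for the scheduler to pick up instead.
dag.create_dagrun(
    run_id='manual__' + now.isoformat(),
    execution_date=now,
    state=State.RUNNING,
    external_trigger=True,
)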


ashb commented Jan 5, 2021

I've also tried creating the run with state SCHEDULED, but it's never run by the scheduler. (perhaps I'm doing it the wrong way - I'm still investigating that)

Right now the only states for Dag Runs are None, "running", or "failed" -- that's why the scheduler is never picking up that dag run.


kamac commented Jan 5, 2021

@ashb I've tried creating DagRuns with state NONE. The tasks get run now (the last two instances), but max_active_runs is not respected.


I'm on airflow 1.10.13, SequentialExecutor


kaxil commented Jan 5, 2021

@ashb I've tried creating DagRuns with state NONE. The tasks get run now (the last two instances), but max_active_runs is not respected.


I'm on airflow 1.10.13, SequentialExecutor

🤔 1.10.13?? Just noticed the original issue creator was on 1.10.11, so this is not a regression in 2.0 -- I thought I read somewhere that it worked correctly for you @nathadfield on 1.10.x -- can you confirm, please?

@nathadfield

@kaxil It is definitely an issue on 1.10.14 as that is what we're running on our production system at the moment.

It does also seem to be an issue on 2.0 that I've got running on a locally running dev instance of our setup, but I could not replicate the problem on master via Breeze.


kaxil commented Jan 5, 2021

@kaxil It is definitely an issue on 1.10.14 as that is what we're running on our production system at the moment.

It does also seem to be an issue on 2.0 that I've got running on a locally running dev instance of our setup, but I could not replicate the problem on master via Breeze.

I might ping you on Slack for more ad-hoc discussion. I was able to replicate it on master too: #9975 (comment)

kaxil removed this from the Airflow 2.0.1 milestone on Jan 5, 2021
ashb added this to the Airflow 2.1 milestone on Jan 5, 2021
@nathadfield

@kaxil Is this likely to only be fixed in 2.1 or might we also see it in a new 1.10 version for those people who are affected by this issue but cannot (or don't want to) move to 2.0 yet?

@Adam-Leyshon

This issue is affecting 1.10.14.

Upgrading to 2.0 is not possible for us at this time since we have a number of internal blockers on plugins that we wrote that require refactoring.

We cleared a DAG run for 2019-01-01 and selected the 'all future tasks' option; now it wants to process over 200 runs at the same time.

We did not experience this issue in 1.10.9.

Instead of waiting for the first run to complete, it is now trying to run a single task for each of the active runs in parallel.

What I expect to happen is for it to clear the tasks and then complete each run sequentially. Our system requires data to be loaded in that order, so having multiple parallel runs is causing huge issues for us.



kaxil commented Jan 21, 2021

Looks like max_active_runs was broken further in 1.10.13 -- this will be fixed in 1.10.15 by #13803. However, like I mentioned in #9975 (comment), not all use cases will be fixed.

A complete fix will require adding a new state to DagRun -- 'queued' -- similar to TaskInstance.
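Schematically, the eventual fix works along these lines (a simplified sketch, not Airflow's actual scheduler code): new runs are created as queued, and the scheduler only promotes them to running while the DAG is under its max_active_runs limit.

def promote_queued_runs(dag, queued_runs, running_runs):
    # Promote queued DagRuns to running only while capacity remains.
    slots = max(dag.max_active_runs - len(running_runs), 0)
    for run in queued_runs[:slots]:
        run.state = 'running'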

kaxil pushed a commit that referenced this issue Aug 13, 2021
This change adds queued state to DagRun. Newly created DagRuns
start in the queued state, are then moved to the running state after
satisfying the DAG's max_active_runs. If the Dag doesn't have
max_active_runs, the DagRuns are moved to running state immediately

Clearing a DagRun sets the state to queued state

Closes: #9975, #16366
(cherry picked from commit 6611ffd)
(The same commit was cherry-picked to other branches on Aug 14 and Aug 17, 2021, with an identical message.)
@himabindu07

Verified by QA.

abhishekbafna pushed a commit to twitter-forks/airflow that referenced this issue Aug 25, 2021
…runs

* EWT-797: Fixing max_active_runs to control the number of runs running simultaneously

* Bumped the version

Co-authored-by: Shruti Mantri <smantri@twitter.com>

I referred to the GitHub issue apache#9975, which notes in its first line that the max_active_runs problem in 1.10.14 is a separate issue, discussed in apache#13802.

This issue was resolved but the fix went in the 1.10.15 version. I took the same fix: https://github.com/apache/airflow/pull/13803/files and have applied it in this commit (5e87232).

The resolution worked and I did the testing on gke devel cluster: https://airflow-smantri--devel--etl-workflow-users.service.qus1.twitter.biz/admin/airflow/tree?dag_id=bq_dal_integration_dag

Prior to the fix, if max_active_runs = 1 and you start more than one run simultaneously, all of them will start their tasks and will not honour the max_active_runs configuration.

With this fix, if max_active_runs = 1 and a second run is triggered while one run is in progress, the second run gets a start_date of the time it was triggered, but it won't start any of its tasks until the prior run is complete.

aran3 commented Aug 31, 2021

I've been testing version 2.1.3 for a few days now, and while adding the queued state seems to help most cases, I think this bug is not fully solved.
When manually triggering a DAG that has max_active_runs=1 many times, it does happen that the DAG ends up with more than one DAG run in the "running" state at the same time.
In our case the dag has two tasks, a trigger and a sensor:

I will try to gather more specific information and update.


kamushadenes commented Sep 5, 2021

Observing the same behavior on 2.1.2 with catchup=False, this has been blowing through my quotas.

@ephraimbuddy

@aran3, I think your case has been fixed in #17786, where tasks could start running while the dagruns were still queued, which would lead to the queued dagrun entering the running state.


vumdao commented Sep 25, 2021

Issue does not happen on 2.1.4

@argemiront

Is there a way to prevent the scheduler from queuing new runs if there's an active run? I have a DAG that now and then overruns, and because of this behaviour I'm seeing DAG runs piling up in the new UI over time.

@ephraimbuddy

@argemiront, if you are on 2.1.4 you can change this:
AIRFLOW__SCHEDULER__MAX_QUEUED_RUNS_PER_DAG=16
to a lower number.
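For example (a sketch; the environment variable maps to the [scheduler] max_queued_runs_per_dag option in airflow.cfg, assuming Airflow 2.1.4):

# environment variable form
AIRFLOW__SCHEDULER__MAX_QUEUED_RUNS_PER_DAG=1

# or equivalently in airflow.cfg
[scheduler]
max_queued_runs_per_dag = 1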

@argemiront

AIRFLOW__SCHEDULER__MAX_QUEUED_RUNS_PER_DAG=16

thank you so much!

@dave-martinez

Issue does not happen on 2.1.4

I tested this one with Docker Airflow 2.1.4, Python 3.7.
It only works when triggered via HTTP (UI/API).

However, when using the TriggerDagRunOperator it doesn't work - or is this the intended behavior for TriggerDagRun?

@ephraimbuddy

What you see there is a queued run. The currently active run count is 1, but there's also a queued run, which doesn't count as an active run.


aran3 commented Oct 21, 2021

It is worth noting for anyone (such as us) who heavily relies on max_active_runs=1 that this still happens in 2.1.4 when using the dags trigger CLI command or the TriggerDagRunOperator; it was supposedly fixed in #18583 (version 2.2.0).
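For context, a minimal sketch of the TriggerDagRunOperator pattern in question (hypothetical dag_ids; assumes the Airflow 2.x import path):

from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG('trigger_parent',
         start_date=datetime(2021, 1, 1),
         schedule_interval='@hourly',
         catchup=False) as dag:
    # Each parent run fires the child DAG, which is assumed to set
    # max_active_runs=1; before #18583 runs triggered this way could
    # bypass that limit.
    trigger = TriggerDagRunOperator(
        task_id='trigger_child',
        trigger_dag_id='child_with_max_active_runs_1',
    )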


@ephraimbuddy

It is worth noting for anyone (such as us) who heavily relies on max_active_runs=1 that this still happens in 2.1.4 when using the dags trigger CLI command or the TriggerDagRunOperator; it was supposedly fixed in #18583 (version 2.2.0).

This is now fixed


DanielMorales9 commented Nov 8, 2021

How can this be solved on a fully managed setup such as MWAA? MWAA only supports 1.10.12 and 2.0.2.
I am looking for a workaround here, any help will be appreciated.


ashb commented Nov 9, 2021

@DanielMorales9 Not easily, I'm afraid - either by asking AWS to provide a more recent version, or by using a different method than MWAA that provides quicker update cycles.


stroykova commented Nov 19, 2021

Same problem in 2.1.3 with manually triggered dags. All of them run simultaneously. I will try 2.2.0


ashb commented Nov 19, 2021

@stroykova Please let us know. (I'd try 2.2.2 rather than 2.2.0.)

@stroykova

2.2.2 is fine with this 🥳


stas-snow commented Mar 11, 2022

I'm seeing this issue in 2.2.3.
catchup=True and max_active_runs=1.
DAG is triggered multiple times and multiple instances are running in parallel.


@GabeChurch

If you are on an older version of Airflow that has this problem, you can add the concurrency setting to your DAG (i.e. concurrency = some_num), or set depends_on_past = True.
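A minimal sketch of that workaround (hypothetical DAG, Airflow 2.x import path assumed; concurrency caps concurrently running tasks per DAG, while depends_on_past serializes runs at the task level):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG('workaround_dag',
         start_date=datetime(2021, 1, 1),
         schedule_interval='@daily',
         catchup=True,
         max_active_runs=1,
         concurrency=1,  # at most one task instance runs at a time for this DAG
         default_args={'depends_on_past': True}) as dag:
    step = BashOperator(task_id='step', bash_command='sleep 30')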


nathadfield commented Dec 19, 2023

@alexstrimbeanu Coming in here with an attitude like that is unacceptable and is not going to help your cause, but I'll give you the decency of replying.

You may notice that this particular issue is closed and, afaik, there isn't currently an open issue that documents this as a problem.

Maybe you would like to open one and provide it with all the necessary information so that someone can replicate the scenario?
