Improve support for "concurrency pools" in Elasticsearch DAGs #3891
Comments
Looking back into this, I don't think using pools is that simple. Technically tasks are assigned to pools, not entire DAGs -- and there is no way (that I have found) to simply filter a list of DAGs with tasks assigned to a particular pool. We could hard-code the list, and rely on the same `test_dag_parsing` test to check that any DAG that uses the list includes itself in the list. An advantage to using tags, though, is that you can also filter these DAGs in the Airflow UI (and you avoid a huge mess of circular imports with the dag ids).

It might also be worth re-stating why we can't just use pools in the normal way, by having a `staging_elasticsearch` and a `production_elasticsearch` pool with one slot each. Again, tasks run in pools, not entire DAGs. Once a task finishes, the slot is freed, and the next task that runs may not be from the same DAG. You can manipulate this with priority weights, but it still does not guarantee what we want. For example, imagine we set `staging_database_restore` as higher priority than `create_new_staging_es_index`, but `staging_database_restore` began while `create_new_staging_es_index` was already underway. It would take over the slot instead of waiting as we'd prefer.
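The failure mode in that example can be illustrated with a toy priority queue standing in for a one-slot pool. This is a deliberately simplified model (not Airflow's actual scheduler), and the task names are just the example DAGs from above with hypothetical step suffixes:

```python
import heapq

# Toy model of a one-slot Airflow pool: individual *tasks* compete for the
# slot, ordered by priority weight. Simplified sketch, not the real scheduler.
slot_queue = []  # (negative priority, task name) so higher weight pops first

def submit(task, priority):
    heapq.heappush(slot_queue, (-priority, task))

def free_slot():
    """The slot was freed; hand it to the highest-priority queued task."""
    return heapq.heappop(slot_queue)[1]

# create_new_staging_es_index is mid-run: its first task just released the
# slot, and its (hypothetical) second task is queued to continue...
submit("create_new_staging_es_index.step_2", priority=1)
# ...but staging_database_restore started meanwhile, with a higher weight.
submit("staging_database_restore.step_1", priority=10)

winner = free_slot()
print(winner)  # staging_database_restore.step_1
```

The restore DAG takes over the slot mid-run instead of waiting for the index DAG to finish, which is exactly the behavior the pool was supposed to prevent.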
I'm missing the relationship between the two DAGs 😅 My reservation is that we could be doing extra work and creating an unnecessary dependency (with a risk of longer waiting times and/or circular dependencies).
I think it would be fair to mark those as an exception, which is easy to do with the implementation.
FWIW this was actually really easy to implement once I had the idea for how to do it, and has in fact already been done 😅 I'm just waiting on the `point_es_alias` DAGs to get approved & merged so I can rebase and do a final pass.

The idea is that rather than maintaining a bunch of almost-the-same lists of dependencies for each DAG, we only ever need to specify exceptions to the rule. It makes the DAGs themselves cleaner, makes implementing new ones easier, and reduces risk. This rests on the idea that it is worse to accidentally exclude a necessary dependency (which could manifest as an error or performance issue in prod, potentially only occurring long after the DAGs were introduced) than to unintentionally introduce a dependency that wasn't 100% necessary (which just means some time gets wasted waiting).
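The "specify only exceptions" rule could look something like the sketch below (the helper name and call shape are hypothetical, not the actual merged implementation): by default a DAG depends on every other DAG sharing the tag, and an omission can at worst add an unnecessary wait, never silently skip a needed dependency.

```python
def external_dag_ids(tagged_dag_ids, current_dag_id, excluded=()):
    """Hypothetical helper: default rule is "depend on every other DAG
    with the shared tag"; DAGs only ever list exceptions via `excluded`.
    """
    return sorted(
        dag_id
        for dag_id in tagged_dag_ids
        if dag_id != current_dag_id and dag_id not in excluded
    )

# Example peer set, using DAG ids mentioned in this issue.
staging_dags = {
    "staging_database_restore",
    "create_new_staging_es_index",
    "create_proportional_by_source_staging_index",
}

deps = external_dag_ids(staging_dags, "staging_database_restore")
print(deps)
# ['create_new_staging_es_index', 'create_proportional_by_source_staging_index']
```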
Problem
We have a large number of DAGs that operate on Elasticsearch or the underlying databases. These DAGs must not run concurrently, so we have two reusable tasks (`prevent_concurrency_with_dags` and `wait_for_external_dags`) that each accept a list of dag_ids and either (a) immediately fail if any of those DAGs are running, or (b) wait until none of those DAGs are running, respectively.

The problem is that every time a new Elasticsearch DAG is added, all the other DAGs that affect that environment must also be updated to fail fast on/wait for it. It is easy to miss dependencies, and indeed a few were missed already:

- the `create_new_staging_es_index` DAG needs to prevent concurrency with `create_proportional_by_source_staging_index`
- `staging_database_restore` needs to wait on `create_proportional_by_source_staging_index`
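Gaps like these two are the kind of thing a parse-time check could catch. A sketch of such a check (plain data model and a hypothetical function name, not the project's actual test) that flags any DAG whose dependency list does not cover all of its peers:

```python
def find_missing_dependencies(dependency_lists):
    """dependency_lists: dag_id -> list of external dag_ids it waits on
    (or fails fast on). Returns (dag_id, missed_peer) pairs where a peer
    DAG in the same group is absent from the list.
    """
    all_dags = set(dependency_lists)
    return sorted(
        (dag_id, peer)
        for dag_id, deps in dependency_lists.items()
        for peer in all_dags - {dag_id} - set(deps)
    )

# Modeling the two misses described in this issue: neither of the first two
# DAGs lists create_proportional_by_source_staging_index.
lists = {
    "create_new_staging_es_index": ["staging_database_restore"],
    "staging_database_restore": ["create_new_staging_es_index"],
    "create_proportional_by_source_staging_index": [
        "create_new_staging_es_index",
        "staging_database_restore",
    ],
}
missed = find_missing_dependencies(lists)
print(missed)
```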
Description
I've prototyped some of this and think this would work:

- Collapse the `prevent_concurrency_with_dags` and `wait_for_external_dags` task groups into a single task (rather than dynamically mapping a task for each external dag id).
- Instead of taking a list of `external_dag_ids`, the task takes the tag to check for, and uses `session.query(DagModel).filter(DagModel.tags.any(DagTag.name == tag)).all()` to get the list of external DAGs.
- It excludes the DAG from which the task was called.

(Note: the actual implementation ended up being very similar to this idea, but not quite.)
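A rough sketch of that tag-based lookup, using a plain dict as a stand-in for the `DagModel`/`DagTag` tables so it runs without a metadata database (in Airflow itself this would be roughly the `session.query(DagModel).filter(DagModel.tags.any(DagTag.name == tag)).all()` call from above, inside a `@provide_session` function; the tag name here is a made-up example):

```python
def dags_with_tag(dag_tags, tag, current_dag_id):
    """dag_tags: dag_id -> set of tags, standing in for the DagModel table.

    Returns every other DAG carrying `tag`, excluding the calling DAG so
    a DAG never waits on (or fails fast because of) itself.
    """
    return sorted(
        dag_id
        for dag_id, tags in dag_tags.items()
        if tag in tags and dag_id != current_dag_id
    )

# Hypothetical tag name, for illustration only.
table = {
    "staging_database_restore": {"staging_es_concurrency"},
    "create_new_staging_es_index": {"staging_es_concurrency"},
    "recreate_full_production_index": {"production_es_concurrency"},
}

externals = dags_with_tag(table, "staging_es_concurrency", "staging_database_restore")
print(externals)  # ['create_new_staging_es_index']
```

New DAGs then only need to add the tag to opt in; no existing DAG has to be edited.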
Alternatives
I prototyped this using tags, but as I write it out it seems obvious that we should literally use Airflow pools instead. I think that would work with the exact same approach.
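If pools were used instead of tags, the lookup could scan each parsed DAG's tasks for a pool assignment, since (as noted in the comments) there is no direct DAG-by-pool filter. A sketch with a plain mapping standing in for a parsed DagBag (pool and DAG names are illustrative):

```python
def dags_using_pool(dag_task_pools, pool):
    """dag_task_pools: dag_id -> list of each task's pool name, standing in
    for iterating a DagBag and reading every task's `pool` attribute.
    """
    return sorted(
        dag_id for dag_id, pools in dag_task_pools.items() if pool in pools
    )

bag = {
    "staging_database_restore": ["staging_elasticsearch", "default_pool"],
    "create_new_staging_es_index": ["staging_elasticsearch"],
    "some_unrelated_dag": ["default_pool"],
}

users = dags_using_pool(bag, "staging_elasticsearch")
print(users)  # ['create_new_staging_es_index', 'staging_database_restore']
```

The pool then serves purely as a membership marker for the fail-fast/wait tasks, rather than relying on pool slots for mutual exclusion (which, per the discussion above, cannot guarantee DAG-level exclusivity).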