Hey guys!
I'm trying to fit my tasks into the Airflow world, but I'm hitting some walls...
I have normal "daily" tasks that should run daily, and then I have "range" tasks that should run only once I have the results of the last 30 days of those daily tasks. Only one of those daily/range tasks can be running at a time: I'm using a SequentialExecutor.
Right now, I'm especially interested in doing this in backfill mode. I'd like to say: run these DAGs over 6 months of data, and I want it to understand that:
- it can run all those daily tasks in any order (depends_on_past is false)
- it can run any range task whenever all of its dependencies are satisfied (depends_on_past is also false here); a minimal sketch of the daily side of this setup follows the list
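For concreteness, here is a minimal sketch of the daily side, assuming a recent (2.x) Airflow; the DAG/task names, dates, and the callable body are placeholders:

```python
# Minimal sketch of the daily side (placeholder names/dates), assuming Airflow 2.x.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def compute_daily(ds, **kwargs):
    """Produce the result for the single day `ds` (the execution date)."""
    ...


with DAG(
    dag_id="daily_dag",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
    catchup=True,
) as dag:
    PythonOperator(
        task_id="compute_daily",
        python_callable=compute_daily,
        depends_on_past=False,  # any day can be (back)filled independently
    )
```

Backfilling 6 months of that would then be something like `airflow dags backfill -s 2016-01-01 -e 2016-06-30 daily_dag` with the 2.x CLI.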
To some extent, the dependencies that I want to express for my range task are a mix of what I can do with a normal set_downstream and an extended depends_on_past where I could specify a number of dates rather than being limited to "the previous date".
I've tried splitting the daily tasks and the range tasks into two DAGs, with sensors in the range DAG that wait for the (external) daily tasks to succeed (one sensor per date the range task depends on). This works until one of the sensors whose dependencies aren't all satisfied yet blocks the SequentialExecutor.
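The range DAG in that attempt looks roughly like this minimal sketch (again assuming a recent 2.x Airflow and the placeholder `daily_dag` / `compute_daily` names from above):

```python
# Minimal sketch of the two-DAG / sensor attempt, assuming Airflow 2.x and the
# placeholder names used in the daily sketch above.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor


def compute_range(ds, **kwargs):
    """Aggregate the last 30 daily results ending at `ds`."""
    ...


with DAG(
    dag_id="range_dag",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
    catchup=True,
) as dag:
    range_task = PythonOperator(task_id="compute_range", python_callable=compute_range)

    # One sensor per date the range task depends on: the sensor with offset i
    # waits for compute_daily at execution_date - i days.
    for i in range(30):
        ExternalTaskSensor(
            task_id=f"wait_daily_minus_{i:02d}",
            external_dag_id="daily_dag",
            external_task_id="compute_daily",
            execution_delta=timedelta(days=i),
            # default mode="poke": while it waits, the sensor occupies the one
            # slot the SequentialExecutor has, which is the blocking issue above
        ) >> range_task
```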
I've tried putting everything in the same DAG, using priorities to tell it what to run first, but priorities have no effect on backfill jobs.
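A minimal sketch of that single-DAG variant, assuming the "priorities" are the task-level priority_weight and reusing the placeholder callables from above:

```python
# Minimal sketch of the single-DAG + priority_weight variant, assuming Airflow 2.x.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def compute_daily(ds, **kwargs): ...   # placeholder daily work
def compute_range(ds, **kwargs): ...   # placeholder 30-day aggregation


with DAG(
    dag_id="daily_and_range_dag",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
    catchup=True,
) as dag:
    daily = PythonOperator(
        task_id="compute_daily",
        python_callable=compute_daily,
        priority_weight=10,  # hint: run the daily work first...
    )
    range_task = PythonOperator(
        task_id="compute_range",
        python_callable=compute_range,
        priority_weight=1,   # ...and the range aggregation afterwards
    )

    # This only expresses "same execution_date" ordering, not "the previous
    # 30 execution_dates", and as said above the weights are ignored on backfill.
    daily >> range_task
```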
Am I missing another possible solution?
If not, which extension would best fit Airflow:
- Implementing priorities in the BackfillJob class?
- Extending set_downstream with relative dates (similar to the ExternalTaskSensor or depends_on_past)?
Happy to contribute either of these if they make sense ;-)
Cheers,
Thoralf