feat(airflow): add the "parallel" mode #966
dlt/helpers/airflow_helper.py
Outdated
end = DummyOperator(task_id=f"{group_name}_end")

start >> tasks >> end
return [start] + tasks + [end]
Not 100% sure that we should return `start` and `end` too. On the other hand, they mark the start and the end of the parallel group. What do you think?
Hmm, so here is one trick I asked to implement:
the `start` should be the first source component. When you run the pipeline, it does a lot of setup work (i.e. creating schemas and initial state) that should not happen in parallel.
Full parallelism happens in `parallel-isolated` mode.
The end task makes sense. I think it is used with a condition that all predecessors succeeded. Look at the examples in the discussion in the original tickets.
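To make the requested shape concrete, here is a minimal sketch of the dependency edges, written in plain Python so it runs without Airflow. The function name `plan_parallel_group` and the component names are hypothetical and not part of the helper; this only illustrates "first component runs alone, the rest fan out, the end task joins them".

```python
from typing import List, Tuple


def plan_parallel_group(components: List[str], group_name: str) -> List[Tuple[str, str]]:
    """Return (upstream, downstream) edges for a hypothetical parallel group.

    Assumption: the first component does the setup work (schemas, initial
    state), so it runs alone before the remaining components fan out.
    """
    if not components:
        raise ValueError("at least one source component is required")

    first, rest = components[0], components[1:]
    end = f"{group_name}_end"

    if not rest:
        # Single component: nothing to parallelize, link it straight to the end.
        return [(first, end)]

    # The first component gates every other component...
    edges = [(first, component) for component in rest]
    # ...and the end task joins them, so it waits for all predecessors.
    edges += [(component, end) for component in rest]
    return edges


edges = plan_parallel_group(["users", "orders", "items"], "shop")
```

In a real DAG the join would rely on the end task's trigger rule (Airflow's default requires all upstream tasks to succeed); that part is not modeled here.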
Hmm, cool! But what if `tasks` is an empty list? Maybe you should test a pipeline with just one component (resource).
Had no idea that `>>` works with lists!
It appears it does not work. `start` and `end` are not set upstream/downstream of each other if the middle list is empty, which was a bit unexpected for me. Fixed.
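For reference, a small sketch of why the empty list breaks the chain. The `Task` class below is a hypothetical stand-in that only mimics how the `>>` overloads return their right-hand operand; it is not Airflow code, and the guard in `chain_group` is one possible fix, not necessarily the one applied in this PR.

```python
from typing import List, Set


class Task:
    """Hypothetical stand-in mimicking the `>>` overloads of an operator."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.downstream: Set[str] = set()

    def _link(self, other) -> None:
        # Accept either a single task or a list of tasks.
        for task in (other if isinstance(other, list) else [other]):
            self.downstream.add(task.task_id)

    def __rshift__(self, other):
        # task >> x: register x downstream, then return x for further chaining.
        self._link(other)
        return other

    def __rrshift__(self, other):
        # [tasks] >> task: every list element gets this task downstream.
        for task in other:
            task._link(self)
        return self


def chain_group(start: Task, tasks: List[Task], end: Task) -> List[Task]:
    if tasks:
        start >> tasks >> end
    else:
        # With an empty list nothing carries the link through, so link directly.
        start >> end
    return [start] + tasks + [end]


# Naive chaining through an empty list leaves start and end unconnected.
naive_start, naive_end = Task("start"), Task("end")
naive_start >> [] >> naive_end

# The guarded version links them explicitly.
fixed_start, fixed_end = Task("start"), Task("end")
chain_group(fixed_start, [], fixed_end)
```

The point is that `start >> []` evaluates to the empty list, and `[] >> end` then has no elements to attach `end` to.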
This is really good! I have one request for the `start` task.
just small details
LGTM!
Towards: #931
Implements the `parallel` decomposition mode in the Airflow `dlt` helper.