
feat(airflow): add the "parallel" mode #966

Merged
merged 5 commits into devel on Feb 18, 2024

Conversation

IlyaFaer (Collaborator):

Towards: #931

Implements the parallel decomposition mode in the Airflow dlt helper.


netlify bot commented Feb 14, 2024

Deploy Preview for dlt-hub-docs ready!

🔨 Latest commit: 51874d3
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/65cdb61b37a5980008df3921
😎 Deploy Preview: https://deploy-preview-966--dlt-hub-docs.netlify.app

end = DummyOperator(task_id=f"{group_name}_end")

start >> tasks >> end
return [start] + tasks + [end]
IlyaFaer (Collaborator, Author):

Not 100% sure that we should return start and end too. On the other hand, they mark the start and the end of the parallel group. What do you think?

rudolfix (Collaborator):

hmmm, so here is one trick I asked to implement:
the start should be the first source component. When you run the pipeline, it will do a lot of setup work (i.e. create schemas and initial state) that should not happen in parallel.

Full parallelism happens in parallel-isolated mode.

The end task makes sense. I think it is used with a condition that all predecessors succeeded; look at the examples in the discussion in the original tickets.
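The layout suggested above can be sketched with a minimal stand-in for Airflow operators (the `Op` class and the task names below are hypothetical illustrations, not the actual dlt helper code): the first source task runs alone so the setup work stays serial, the remaining tasks fan out in parallel, and the end task depends on all of them.

```python
class Op:
    """Tiny stand-in for an Airflow operator, just enough to model edges."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # op >> [a, b] fans out to each item; op >> other chains one edge.
        targets = other if isinstance(other, list) else [other]
        self.downstream.extend(targets)
        return other

    def __rrshift__(self, other):
        # [a, b] >> op: every list item gains an edge to op.
        for t in other:
            t.downstream.append(self)
        return self


# Suggested shape: first task serial, the rest parallel, one gate at the end.
tasks = [Op("users"), Op("orders"), Op("events")]
first, rest = tasks[0], tasks[1:]
end = Op("group_end")  # in Airflow this gates on its trigger rule

first >> rest >> end   # first fans out to rest; every rest task feeds end
```

With this wiring, `first` has no edge to `end` directly; the gate is reachable only through the parallel middle layer.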

IlyaFaer (Collaborator, Author):

A-ah, I think I got it about start. So no dummy task is needed for the start, just the first element of the decomposition; when it's finished, the other tasks run in parallel.

About the end: by default, the all_success trigger is used, so it should already work like that:
(screenshot of the Airflow task settings omitted)

rudolfix (Collaborator):

hmmm, cool! But what if tasks is an empty list? Maybe you should test a pipeline with just one component (resource).

I had no idea that >> works with lists!

IlyaFaer (Collaborator, Author):

It appears it will not work correctly: start and end are not set as upstream/downstream of each other if the middle list is empty, which was a bit unexpected for me. Fixed.
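The pitfall and the fix can be reproduced with a minimal stand-in for Airflow operators (the `Op` class below is a hypothetical model, not the helper's real code): chaining through an empty list links nothing, because the fan-out has no targets and the empty list is what then gets chained to the end task.

```python
class Op:
    """Tiny stand-in for an Airflow operator, just enough to model edges."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # op >> [a, b] fans out to each item; op >> other chains one edge.
        targets = other if isinstance(other, list) else [other]
        self.downstream.extend(targets)
        return other

    def __rrshift__(self, other):
        # [a, b] >> op: every list item gains an edge to op.
        for t in other:
            t.downstream.append(self)
        return self


start, end = Op("start"), Op("end")
start >> [] >> end                      # empty middle: no edge is created
broken = end not in start.downstream    # start and end stay unconnected

# The fix, conceptually: chain start and end directly when there
# are no middle tasks.
start2, end2 = Op("start"), Op("end")
tasks = []
if tasks:
    start2 >> tasks >> end2
else:
    start2 >> end2
fixed = end2 in start2.downstream
```

This is why a pipeline with a single component needed the special case: with one task there is no parallel middle layer to carry the dependency.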

@rudolfix (Collaborator) left a comment:

This is really good! I have one request for the start task.

dlt/helpers/airflow_helper.py (review threads resolved)

@IlyaFaer IlyaFaer marked this pull request as ready for review February 14, 2024 10:54
@rudolfix (Collaborator) left a comment:

Just small details.


dlt/helpers/airflow_helper.py (outdated, resolved)
@rudolfix (Collaborator) left a comment:

LGTM!

@rudolfix rudolfix merged commit cf2e886 into devel Feb 18, 2024
62 checks passed
@rudolfix rudolfix deleted the airflow_parallel branch February 18, 2024 20:48