
feat: Add DbtCloudRetryJobOperator to retry failed dbt job #38001

Closed

Conversation

@andyguwc (Contributor) commented Mar 8, 2024

This PR adds a new operator, DbtCloudRetryJobOperator, to the dbt provider. The operator calls the dbt Cloud API to retry a job from its point of failure.

Note that after making the retry call, the operator behaves very much like DbtCloudRunJobOperator in how it polls the job status; it therefore reuses the same operator link, DbtCloudRunJobOperatorLink, and the OpenLineage code is also the same.

This PR also makes a minor fix to DbtCloudRunJobOperator and the related docs.
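For context, a minimal usage sketch of how the proposed operator could sit next to the existing one in a DAG. The import path and constructor arguments for DbtCloudRetryJobOperator are assumptions modeled on DbtCloudRunJobOperator and are not confirmed by this PR:

from datetime import datetime

from airflow.models.dag import DAG
from airflow.providers.dbt.cloud.operators.dbt import (
    DbtCloudRetryJobOperator,  # added by this PR; import path assumed
    DbtCloudRunJobOperator,
)

with DAG(dag_id="dbt_cloud_retry_example", start_date=datetime(2024, 3, 1), schedule=None) as dag:
    run_job = DbtCloudRunJobOperator(
        task_id="run_dbt_job",
        dbt_cloud_conn_id="dbt_cloud_default",
        job_id=12345,
        wait_for_termination=True,
    )

    # Hypothetical arguments: retry the same dbt Cloud job from its point of
    # failure, only when the upstream trigger task failed.
    retry_job = DbtCloudRetryJobOperator(
        task_id="retry_dbt_job",
        dbt_cloud_conn_id="dbt_cloud_default",
        job_id=12345,
        trigger_rule="all_failed",
    )

    run_job >> retry_job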

Tested locally with an example dbt DAG.
[Screenshot: example DAG run, 2024-03-08 9:23 AM]

closes: #35772



@andyguwc (Contributor, Author) commented:

Hey @josh-fell, do you still plan to review this PR? If not, could you assign other reviewers? Thanks.

@andyguwc andyguwc force-pushed the tianyou/dbt-cloud-retry-operator branch 3 times, most recently from 2271132 to e42e524 Compare March 20, 2024 03:30
@andyguwc (Contributor, Author) commented:

cc @Taragolis @hussein-awala @eladkal, can anyone help review? Thanks!

@andyguwc andyguwc force-pushed the tianyou/dbt-cloud-retry-operator branch from e42e524 to 4129dc7 Compare March 21, 2024 02:32
elif job_run_status == DbtCloudJobRunStatus.SUCCESS.value:
    self.log.info("Job run %s has completed successfully.", self.run_id)
    return self.run_id
elif job_run_status in (
A Contributor commented on this snippet:

@andyguwc is there any reason to check CANCELLED and ERROR in the same condition? The raised exception could be more specific if we put them in different conditions, right?

@andyguwc (Contributor, Author) replied:

This logic came from the existing dbt operator. I think the rationale is that we mostly care about success, and both CANCELLED and ERROR need manual triaging anyway.

That said, I also see the other comment about why create a new operator that duplicates a lot of code. Let me think about that approach as well.
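For illustration, splitting the two terminal statuses as the reviewer suggests could look roughly like the sketch below. This is not code from the PR; the helper name is hypothetical, and DbtCloudJobRunStatus and DbtCloudJobRunException are assumed to be the enum and exception the provider already uses for this check:

from airflow.providers.dbt.cloud.hooks.dbt import DbtCloudJobRunException, DbtCloudJobRunStatus


def handle_terminal_status(job_run_status: int, run_id: int) -> int:
    # Hypothetical helper: raise a status-specific error instead of grouping
    # CANCELLED and ERROR into a single branch.
    if job_run_status == DbtCloudJobRunStatus.SUCCESS.value:
        return run_id
    if job_run_status == DbtCloudJobRunStatus.CANCELLED.value:
        raise DbtCloudJobRunException(f"Job run {run_id} has been cancelled.")
    if job_run_status == DbtCloudJobRunStatus.ERROR.value:
        raise DbtCloudJobRunException(f"Job run {run_id} has failed.")
    raise DbtCloudJobRunException(f"Job run {run_id} ended with unexpected status {job_run_status}.")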

@josh-fell (Contributor) left a comment:

Why not create a flag on the DbtCloudRunJobOperator to retry from failure rather than create a new operator that reuses a lot of the same code?

@andyguwc (Contributor, Author) commented:

Why not create a flag on the DbtCloudRunJobOperator to retry from failure rather than create a new operator that reuses a lot of the same code?

@josh-fell thanks. I was actually going back and forth between these approaches. I thought a separate operator was cleaner because it only does one thing, and the user doesn't need to reason about both "retry" from the Airflow task perspective and "retry" via the dbt Cloud API call that resumes from the point of failure.

Let me think more about this.

@andyguwc (Contributor, Author) commented:

@josh-fell as I think about adding a flag for retry from failure, I can't wrap my head around the retry logic in a deferrable operator. Would you mind sharing your thoughts on how this could work when the operator is deferrable?

@josh-fell (Contributor) commented:

@josh-fell as I think about adding a flag for retry from failure, I can't wrap my head around the retry logic in a deferrable operator. Would you mind sharing your thoughts on how this could work when the operator is deferrable?

@andyguwc Thanks for your patience here with my responses!

How I think about this feature, from a UX perspective, is really about which endpoint is used in the trigger_job_run method of the DbtCloudHook, driven by a DbtCloudRunJobOperator parameter. The key is the behavior change when triggering the job, rather than handling what to do based on its run status, which is what the deferrable functionality checks. Also, let Airflow continue with its own retry functionality; I don't think anything new is needed in the provider for that.

From the documentation of this new endpoint, it seems as though it could be used for retries as well as regular job execution.

Use this endpoint to retry a failed run for a job from the point of failure, if the run failed. Otherwise trigger a new run. When this endpoint returns a successful response, a new run will be enqueued for the account.

What this suggests is that the dbt API handles looking up the last run for a given job and decides whether or not to start a new instance of the job, which is great! So I would think the implementation in Airflow would be: add a retry_from_failure parameter (or similarly named, as you wish of course) to DbtCloudRunJobOperator, propagate that value down to the trigger_job_run method in the hook, and then choose which endpoint to use based on that value. This way users simply need to set a value on an existing operator and that's it.

I could be misinterpreting the robustness of this new endpoint. If I am, the check of the last job run's status could be implemented in the hook before carrying on with the endpoint decision.

I hope that helps!
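To make the suggestion above concrete, here is a standalone sketch of the endpoint selection it describes, written against the raw dbt Cloud API with requests. The retry_from_failure name and the /rerun/ path are assumptions taken from this discussion and should be checked against the dbt Cloud API docs; this is not the provider's hook code:

import requests


def trigger_job_run(
    base_url: str,
    token: str,
    account_id: int,
    job_id: int,
    cause: str,
    retry_from_failure: bool = False,
) -> dict:
    # Pick the endpoint based on the flag; the rest of the call stays the same.
    headers = {"Authorization": f"Token {token}", "Content-Type": "application/json"}
    if retry_from_failure:
        # Assumed retry-style endpoint: rerun a failed run from its point of
        # failure, or enqueue a fresh run if the last run succeeded.
        endpoint = f"{base_url}/api/v2/accounts/{account_id}/jobs/{job_id}/rerun/"
        payload: dict = {}
    else:
        # Regular trigger endpoint used today.
        endpoint = f"{base_url}/api/v2/accounts/{account_id}/jobs/{job_id}/run/"
        payload = {"cause": cause}
    response = requests.post(endpoint, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()

In the provider, the same decision would live inside DbtCloudHook.trigger_job_run, with DbtCloudRunJobOperator passing the flag straight through, so neither the deferrable path nor Airflow's own task retries would need to change.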


This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label May 12, 2024
@github-actions github-actions bot closed this May 19, 2024
Development

Successfully merging this pull request may close these issues.

dbt Cloud provider to support retry from point of failure (#35772)
3 participants