
Fix Redshift DAGs to catch appropriate exceptions #348

Merged
merged 3 commits into main from 279-fix-redshift-dags on May 16, 2022

Conversation

Collaborator

@pankajkoti pankajkoti commented May 13, 2022

The cluster delete and snapshot delete tasks keep waiting while the status is in the 'deleting' state. Because the status check runs inside a while loop, the final iteration, after the resource has actually been deleted, raises an exception saying the corresponding resource is not found, and that marks the task as failed. We fix this by catching the relevant error code; all other exceptions are re-raised.

Additionally, we quite often observe DAG tasks failing without any logs. @ephraimbuddy suggested that this could be due to DAG processing timeouts caused by the time spent importing heavy libraries (although we could not find the DAG processing logs in Astro cloud). Following this guess and the guidance in
https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
we delay the import of boto3 to the task execution stage instead of importing it at the top of the module.

We also rename the operators' Python reference variable names to be consistent with their task_id.

Closes: #279
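To illustrate the deferred import (a minimal sketch only, not the DAG code from this PR; the task name, decorator style, and cluster identifier are placeholders), moving the boto3 import inside the callable keeps it out of top-level DAG parsing:

```python
from airflow.decorators import task


@task
def delete_redshift_cluster(cluster_identifier: str) -> None:
    # Import boto3 at task execution time so DAG file parsing stays fast,
    # per the Airflow top-level-code best practice linked above.
    import boto3

    client = boto3.client("redshift")
    client.delete_cluster(
        ClusterIdentifier=cluster_identifier,
        SkipFinalClusterSnapshot=True,
    )
```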


codecov bot commented May 13, 2022

Codecov Report

Merging #348 (ee2216a) into main (a33125c) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #348   +/-   ##
=======================================
  Coverage   96.78%   96.78%           
=======================================
  Files          56       56           
  Lines        2925     2925           
=======================================
  Hits         2831     2831           
  Misses         94       94           

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a33125c...ee2216a.

except ClientError as exception:
    if exception.response.get("Error", {}).get("Code", "") == "ClusterNotFound":
        # The cluster is already gone, which is exactly what the wait loop is
        # waiting for, so swallow the error instead of failing the task.
        pass
    else:
        logging.exception("Error deleting redshift cluster")
        raise exception
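For reference, the surrounding wait loop looks roughly like this (a minimal sketch, assuming boto3's describe_clusters is used for the status check; the function name and poll interval are placeholders):

```python
import logging
import time

import boto3
from botocore.exceptions import ClientError


def wait_for_cluster_deletion(cluster_identifier: str, poll_interval: int = 30) -> None:
    # Poll describe_clusters until the cluster leaves the 'deleting' state; the
    # final call raises ClusterNotFound once deletion completes, which is success.
    client = boto3.client("redshift")
    while True:
        try:
            response = client.describe_clusters(ClusterIdentifier=cluster_identifier)
            if response["Clusters"][0]["ClusterStatus"] != "deleting":
                break
        except ClientError as exception:
            if exception.response.get("Error", {}).get("Code", "") == "ClusterNotFound":
                break
            logging.exception("Error deleting redshift cluster")
            raise exception
        time.sleep(poll_interval)
```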
Collaborator
Should we add a check on L55 to see if the cluster exists or not? try..except is good too :) but just thinking out loud

Collaborator Author

@pankajkoti pankajkoti May 16, 2022

Yes, I like the idea of proactively checking rather than reacting with a try/except. We have the except mainly for the while loop on L60, where the cluster stays in the 'deleting' state for a while and then throws this error. In my opinion the check makes sense for L55, since it runs only once; but for L60 it would mean two API calls on every loop iteration before the cluster is finally deleted.
However, I could not find a relevant method to check beforehand whether the cluster exists: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift.html
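For context, a pre-check would itself have to wrap a describe call in a try/except, since boto3 exposes no dedicated existence check for Redshift clusters (the helper below is hypothetical, not part of this PR), which is why it adds an extra API call on every loop iteration:

```python
from botocore.exceptions import ClientError


def cluster_exists(client, cluster_identifier: str) -> bool:
    # There is no dedicated "exists" API, so the pre-check is itself a
    # describe_clusters call plus exception handling.
    try:
        client.describe_clusters(ClusterIdentifier=cluster_identifier)
        return True
    except ClientError as exception:
        if exception.response.get("Error", {}).get("Code", "") == "ClusterNotFound":
            return False
        raise
```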

Contributor

@bharanidharan14 bharanidharan14 left a comment

Overall looks good to me

@pankajastro
Contributor

LGTM

@phanikumv phanikumv merged commit ed4d0bb into main May 16, 2022
@phanikumv phanikumv deleted the 279-fix-redshift-dags branch May 16, 2022 10:33
Development

Successfully merging this pull request may close these issues.

Fix Redshift example DAG