Fix Redshift DAGs to catch appropriate exceptions #348
Conversation
Cluster delete and snapshot delete tasks keep waiting while the status is in the 'deleting' state. Since the status check is part of a while loop, the last iteration after the resource is deleted raises an exception that the corresponding resource is not found, which marks the task as failed. We fix this by catching the relevant status code; all other exceptions are re-raised.

Additionally, it is observed quite often that the DAG tasks fail without any logs. @ephraimbuddy suggested that this could be due to DAG processing timeouts caused by the time spent importing heavy libraries. Following https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code we delay the import of 'boto' to the task execution stage and avoid importing it at the top module level.

We also rename the operators' Python reference variable names to be consistent with their 'task_id'.
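As a rough illustration of the pattern described above (the function name, polling interval, and delete_cluster arguments below are illustrative, not the exact DAG code): boto3 is imported inside the callable rather than at module level, and only the 'ClusterNotFound' error code is treated as successful deletion; any other ClientError is logged and re-raised.

import logging
import time


def delete_redshift_cluster(cluster_identifier: str) -> None:
    # Illustrative sketch, not the exact PR code.
    # Deferred import: keeping boto3 out of top-level DAG code avoids slow DAG parsing
    # (see the Airflow best-practices link above).
    import boto3
    from botocore.exceptions import ClientError

    client = boto3.client("redshift")
    client.delete_cluster(ClusterIdentifier=cluster_identifier, SkipFinalClusterSnapshot=True)

    while True:
        try:
            cluster = client.describe_clusters(ClusterIdentifier=cluster_identifier)["Clusters"][0]
        except ClientError as exception:
            # The final poll after deletion reports the cluster as missing; treat that as success.
            if exception.response.get("Error", {}).get("Code", "") == "ClusterNotFound":
                break
            logging.exception("Error deleting redshift cluster")
            raise exception
        if cluster["ClusterStatus"] != "deleting":
            break
        time.sleep(30)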
Codecov Report
@@ Coverage Diff @@
## main #348 +/- ##
=======================================
Coverage 96.78% 96.78%
=======================================
Files 56 56
Lines 2925 2925
=======================================
Hits 2831 2831
Misses 94 94
Continue to review full report at Codecov.
except ClientError as exception:
    logging.exception("Error deleting redshift cluster")
    raise exception
if exception.response.get("Error", {}).get("Code", "") == "ClusterNotFound":
Should we add a check on L55 to see if the cluster exists or not? try..except is good too :) but just thinking out loud
Yes, I like the idea of proactively checking rather than reacting with a try-catch. We have the except more for the while loop on L60, where the cluster remains in the 'deleting' state for a while and then throws this error. In my opinion it makes sense for L55 to have that check, since it will run only once; but for L60 it would then make two API calls on each iteration of the loop before the cluster is finally deleted.
However, I cannot find a relevant method to check beforehand whether the cluster exists: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift.html
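For context, the closest the boto3 Redshift client offers is describe_clusters, which itself raises a 'ClusterNotFound' error for a missing cluster, so a proactive check would still be a try/except under the hood. A hypothetical helper (not part of this PR) might look like:

def redshift_cluster_exists(cluster_identifier: str) -> bool:
    # Hypothetical helper, not part of this PR: there is no dedicated
    # "cluster exists" API, so this wraps describe_clusters in try/except.
    import boto3
    from botocore.exceptions import ClientError

    client = boto3.client("redshift")
    try:
        client.describe_clusters(ClusterIdentifier=cluster_identifier)
        return True
    except ClientError as exception:
        if exception.response.get("Error", {}).get("Code", "") == "ClusterNotFound":
            return False
        raise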
Overall looks good to me
LGTM
Cluster delete and snapshot delete tasks keep waiting while
the status is in the 'deleting' state. Since the status check is part
of a while loop, the last iteration after the resource is deleted raises
an exception that the corresponding resource is not found, which
marks the task as failed. We fix this by catching the relevant
status code; all other exceptions are re-raised.
Additionally, it is observed quite often that the DAG tasks fail
without any logs. @ephraimbuddy suggested that this could be due
to DAG processing timeouts occurring due to time spent importing
heavy libraries (although we could not find the DAG processing logs in Astro cloud).
So, taking this guess and the reference of
https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
we delay the import of 'boto' to the task execution stage and
avoid importing it at the top module level.
We also rename the operators' Python reference variable
names to be consistent with their 'task_id'.
Closes: #279