
Increase timeout while waiting for RDS database rename #2961

Closed
AetherUnbound opened this issue Sep 1, 2023 · 5 comments
Labels
💻 aspect: code (Concerns the software code in the repository)
🛠 goal: fix (Bug fix)
🟧 priority: high (Stalls work on the project or its dependents)
🧱 stack: catalog (Related to the catalog and Airflow DAGs)
⛔ status: blocked (Blocked & therefore, not ready for work)
🔧 tech: airflow (Involves Apache Airflow)
🐍 tech: python (Involves Python)

Comments

@AetherUnbound
Contributor

Description

The last several runs of the staging database restore DAG have all had the await_<rename> steps time out before the new database became available. On retry, the database rename had completed and all steps proceeded as normal.

From the logs, it looks like the failure actually occurs because the database does not exist yet, rather than the request returning an empty response. The current DAG is set up to poke for up to an hour, but every request fails outright, which falls back on Airflow's retry mechanism, whose window is much shorter than one hour:

def make_rds_sensor(task_id: str, db_identifier: str, retries: int = 0) -> RdsDbSensor:
    return RdsDbSensor(
        task_id=task_id,
        db_identifier=db_identifier,
        target_statuses=["available"],
        aws_conn_id=AWS_RDS_CONN_ID,
        mode="reschedule",
        timeout=60 * 60,  # 1 hour
        retries=retries,
        retry_delay=timedelta(minutes=1),
    )

Logs for the failure:

[2023-09-01, 00:14:45 UTC] {base.py:73} INFO - Using connection ID 'aws_default' for task execution.
[2023-09-01, 00:14:45 UTC] {connection_wrapper.py:340} INFO - AWS Connection (conn_id='aws_default', conn_type='aws') credentials retrieved from login and password.
[2023-09-01, 00:14:45 UTC] {taskinstance.py:1943} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/sensors/base.py", line 251, in execute
    raise e
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/sensors/base.py", line 238, in execute
    poke_return = self.poke(context)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/sensors/rds.py", line 181, in poke
    state = self.hook.get_db_instance_state(self.db_identifier)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/rds.py", line 245, in get_db_instance_state
    raise e
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/rds.py", line 241, in get_db_instance_state
    response = self.conn.describe_db_instances(DBInstanceIdentifier=db_instance_id)
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/client.py", line 980, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.DBInstanceNotFoundFault: An error occurred (DBInstanceNotFound) when calling the DescribeDBInstances operation: DBInstance dev-old-openverse-db not found.

We should investigate whether there is a way to keep the sensor retrying even on DBInstanceNotFound errors. Otherwise, we should lean on Airflow's retry mechanism but extend either the retry count or the retry delay so the task allows up to one hour for the database rename.
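A minimal sketch of the first option, assuming we subclass the sensor ourselves (the class name is hypothetical; it relies on the fact that the botocore fault in the traceback above is a ClientError subclass):

from botocore.exceptions import ClientError

from airflow.providers.amazon.aws.sensors.rds import RdsDbSensor


class RenameTolerantRdsDbSensor(RdsDbSensor):
    """Hypothetical subclass: treat a missing instance as "not ready yet"."""

    def poke(self, context) -> bool:
        try:
            return super().poke(context)
        except ClientError as error:
            # Mid-rename, DescribeDBInstances raises DBInstanceNotFound until
            # the new identifier exists; keep poking in that case only.
            if error.response["Error"]["Code"] == "DBInstanceNotFound":
                return False
            raise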

@AetherUnbound AetherUnbound added 🐍 tech: python Involves Python 💻 aspect: code Concerns the software code in the repository 🔧 tech: airflow Involves Apache Airflow 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Sep 1, 2023
@rwidom rwidom self-assigned this Sep 9, 2023
@rwidom
Collaborator

rwidom commented Sep 9, 2023

@AetherUnbound , I did a little poking around (ha ha, sorry, pun intended) and am making some notes on things to try. If any of this looks totally misguided, please do let me know!

  • It looks like the sensor's poke returns False and moves on if Airflow throws an error because it doesn't know where to look for the DB, but it errors out completely if Airflow knows where to look and botocore throws an error because it can't find the database. Is that right?
  • The high-level sensor docs have failure and success parameters in their example, but I wonder if those are only for SQL sensors, or if we could add those options so that a poke failing with DBInstanceNotFound stays on the path that returns False and keeps checking.
  • This is a little counterintuitive to me, but it does seem like the basic idea is that sensors are supposed to return False when the goal is to wait and check again. Maybe we just need to use the silent_fail option to say "ignore the error and keep trying as if it were returning a nice False"? Would that be too broad? It doesn't matter either way, because silent_fail is part of the Python sensor but not the base sensor, which is what this sensor uses.
  • Or is it just as simple as changing the default retries to 60 rather than 0 and/or adding exponential_backoff? (A sketch of this option follows the list.)
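For reference, the last bullet as a sketch (values are illustrative; with the factory's existing one-minute retry_delay, 60 retries covers roughly the one-hour window):

# Illustrative only: reuse the existing make_rds_sensor factory with enough
# retries that Airflow's own retry loop spans the rename window.
await_rename = make_rds_sensor(
    task_id="await_staging_rename",       # hypothetical task id
    db_identifier="dev-old-openverse-db",
    retries=60,                           # 60 x 1-minute retry_delay ≈ 1 hour
)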

Oh, oops! I just realized that I don't have the access I would need to test the DAG. Fun thought experiment though, and I'll be interested to see what happens!

@rwidom rwidom removed their assignment Sep 9, 2023
@AetherUnbound
Contributor Author

Sorry it's taken me so long to get back to you @rwidom! Your assessment looks correct: it should just return False, but it's currently raising an exception instead. And we can't use failure, since this isn't a SQL sensor.

What looks buggy to me is this bit of the hook's call:

https://github.com/apache/airflow/blob/0c8e30e43b70e9d033e1686b327eb00aab82479c/airflow/providers/amazon/aws/hooks/rds.py#L229-L246

The DBInstanceNotFoundFault error check makes me think the poke call should be handling the AirflowNotFoundException, but it's clearly not! I'm going to see if I can replicate this with the RDS hook; this definitely seems like a bug in the way everything is currently implemented.
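A minimal repro sketch with the hook alone (the instance identifier comes from the logs above, and aws_default is the connection shown there):

from airflow.providers.amazon.aws.hooks.rds import RdsHook

hook = RdsHook(aws_conn_id="aws_default")
# Expected: the poke path handles a missing instance and keeps waiting.
# Observed: the raw DBInstanceNotFound fault escapes from this call instead.
hook.get_db_instance_state("dev-old-openverse-db")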

@AetherUnbound
Contributor Author

Made an upstream PR to try and address this, since the current behavior seems counter to what the docs describe! 😄 apache/airflow#34773

@AetherUnbound AetherUnbound added the ⛔ status: blocked Blocked & therefore, not ready for work label Oct 10, 2023
@AetherUnbound
Contributor Author

The upstream PR was merged! 🚀 Now we just need to wait for it to be released 😄

@AetherUnbound
Contributor Author

This should be fixed; we just need to upgrade Airflow, so I'm going to go ahead and close it!
