AIRFLOW-2511 - Investigation to mitigate Deadlocking #4807
Closed
Make sure you have checked all steps below.
Jira
Description
In 1.10.2, when running SubDAGs with the k8s executor (and possibly
others), a common problem is a deadlock in the transactional database
update that sets the task state.
To mitigate this, we catch and swallow the OperationalError (if raised)
and automatically perform a session.rollback(). This lets the SubDAG
continue.
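As a rough illustration of the catch-and-rollback approach described above, here is a minimal sketch. It is not the code from this PR: `update_state_or_rollback` and `update_fn` are hypothetical names, and `OperationalError` is a local stand-in for `sqlalchemy.exc.OperationalError` so the snippet runs without SQLAlchemy installed. The `session` argument is assumed to behave like a SQLAlchemy session (`commit()` and `rollback()`).

```python
import logging

log = logging.getLogger(__name__)


class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError (assumption for this sketch)."""


def update_state_or_rollback(session, update_fn):
    """Run update_fn(session) and commit.

    If an OperationalError is raised (for example by a deadlock), log it,
    roll the session back, and return False instead of propagating the error.
    Returns True when the update was committed.
    """
    try:
        update_fn(session)
        session.commit()
        return True
    except OperationalError:
        # Deliberately broad: any OperationalError is swallowed, not only deadlocks.
        log.warning("OperationalError while updating task state; rolling back",
                    exc_info=True)
        session.rollback()
        return False
```

The boolean return value is an addition for the sketch, so that a caller can tell whether the state update actually took effect.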
I'm aware that this isn't the ideal solution: rolling back on any
OperationalError is quite broad. However, SQLAlchemy doesn't expose a
backend-independent error code for a deadlock, and PostgreSQL and MySQL
return different error codes and messages for the same condition, so a
narrower check would mean fairly database-specific exception handling
in the code.
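To make the trade-off concrete, a database-specific check might look something like the sketch below. It relies on two documented codes: PostgreSQL reports deadlocks with SQLSTATE `40P01` (`deadlock_detected`) and MySQL with error number `1213` (`ER_LOCK_DEADLOCK`). It also assumes the wrapped driver exception is reachable via `.orig` (as on SQLAlchemy's `DBAPIError`), that psycopg2 exposes the SQLSTATE as `pgcode`, and that the MySQL driver puts the error number in `args[0]`. `is_deadlock` is a hypothetical helper, not part of Airflow or SQLAlchemy.

```python
POSTGRES_DEADLOCK_SQLSTATE = "40P01"  # deadlock_detected
MYSQL_DEADLOCK_ERRNO = 1213           # ER_LOCK_DEADLOCK


def is_deadlock(exc):
    """Best-effort check whether an exception represents a database deadlock.

    Looks at the underlying driver exception (exc.orig when present) and
    compares it against the PostgreSQL and MySQL deadlock codes.
    """
    orig = getattr(exc, "orig", None) or exc

    # psycopg2 style: SQLSTATE string on the exception.
    if getattr(orig, "pgcode", None) == POSTGRES_DEADLOCK_SQLSTATE:
        return True

    # MySQL driver style (assumed): numeric error code as the first argument.
    args = getattr(orig, "args", ())
    return bool(args) and args[0] == MYSQL_DEADLOCK_ERRNO
```

Every additional backend or driver would need its own branch here, which is the kind of db-specific handling the blanket rollback avoids.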
We also toyed with the idea of a retry loop that uses
session.begin_nested() to create a SAVEPOINT, but this is not supported
by all setups, so it may lead to further unwanted complications
depending on setup.
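For reference, the retry-loop alternative could be sketched roughly as follows. This is not something the PR implements. `run_with_savepoint_retry`, `max_attempts` and `backoff_seconds` are hypothetical names, `OperationalError` is again a local stand-in for `sqlalchemy.exc.OperationalError`, and `session.begin_nested()` is assumed to return a context manager that releases the SAVEPOINT on success and rolls back to it on error, as it does in SQLAlchemy.

```python
import time


class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError (assumption for this sketch)."""


def run_with_savepoint_retry(session, update_fn, max_attempts=3, backoff_seconds=0.1):
    """Run update_fn(session) inside a SAVEPOINT, retrying on OperationalError.

    Returns the attempt number that succeeded. Re-raises the last
    OperationalError once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # begin_nested() issues a SAVEPOINT; leaving the block on an
            # exception rolls back to it rather than the whole transaction.
            with session.begin_nested():
                update_fn(session)
            return attempt
        except OperationalError:
            if attempt == max_attempts:
                raise
            # Simple linear backoff before the next attempt.
            time.sleep(backoff_seconds * attempt)
```

This only helps where the backend and driver support SAVEPOINT and leave the outer transaction usable after the error, which is the setup dependency mentioned above.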
I'm open to ideas/suggestions on further investigation of this issue,
and on whether we should be handling this error elsewhere.
Tests
Short of manually creating a deadlock in the Airflow metadata database, and given how infrequently the issue occurs, we have not implemented a way to test this.
Commits
Documentation
Code Quality
flake8