AIRFLOW-2511 - Investigation to mitigate Deadlocking #4807

Closed
wants to merge 1 commit into from

Conversation

PaulW
Contributor

@PaulW PaulW commented Mar 1, 2019

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-2511] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-2511
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal(AIP).

Description

  • Here are some details about my PR, including screenshots of any UI changes:

In 1.10.2, when running Subdags using the k8s executor (and possibly
others), a common problem is deadlocks around the transactional
database update that sets the task state.

As a measure to mitigate this, we have implemented a catch to swallow
the OperationalError (if raised) and automatically perform a
session.rollback(). By doing this, we can continue with the Subdag.
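
A minimal sketch of the catch-and-rollback pattern described above. This
is not the code from this PR: `commit_or_rollback` is a hypothetical
helper name, and the only real API assumed is
`sqlalchemy.exc.OperationalError` plus the session's `commit()` and
`rollback()` methods.

```python
import logging

from sqlalchemy.exc import OperationalError

log = logging.getLogger(__name__)


def commit_or_rollback(session):
    """Commit the session; on OperationalError, roll back instead of raising.

    Hypothetical helper for illustration. Returns True if the commit
    succeeded and False if it was rolled back.
    """
    try:
        session.commit()
        return True
    except OperationalError:
        # Broad by design: this catches every OperationalError, not only
        # deadlocks, which is the trade-off the description acknowledges.
        log.warning("Commit failed, rolling back", exc_info=True)
        session.rollback()
        return False
```

Returning a flag rather than swallowing silently lets the caller decide
whether to re-attempt the state update on a later loop.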

I'm aware that this isn't the ideal solution to this issue, and that
rolling back on any OperationalError exception is quite broad. However,
sqlalchemy doesn't give a specific error code for a deadlock, and
Postgresql and MySQL return different error codes and strings for the
same error, so a narrower check could lead to some fairly db-specific
exception catches in the code.
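
To illustrate what such db-specific checks could look like, here is a
hedged sketch. `is_deadlock` is a hypothetical helper, not part of this
PR. It relies on SQLAlchemy exposing the driver exception as `.orig`, on
psycopg2 exposing the SQLSTATE as `pgcode` (deadlock is `40P01`), and on
MySQL drivers such as mysqlclient and PyMySQL putting the numeric error
code first in `args` (deadlock is `1213`). Other drivers may differ.

```python
POSTGRES_DEADLOCK_SQLSTATE = "40P01"  # deadlock_detected
MYSQL_DEADLOCK_ERRNO = 1213  # ER_LOCK_DEADLOCK


def is_deadlock(exc):
    """Best-effort check whether a wrapped DBAPI error is a deadlock.

    Hypothetical helper for illustration; only covers the two backends
    named in the description.
    """
    orig = getattr(exc, "orig", None)  # driver-level exception, if any
    if orig is None:
        return False
    # psycopg2 errors carry the five-character SQLSTATE in `pgcode`.
    if getattr(orig, "pgcode", None) == POSTGRES_DEADLOCK_SQLSTATE:
        return True
    # Assumed MySQL driver convention: args[0] is the server error number.
    args = getattr(orig, "args", ())
    return bool(args) and args[0] == MYSQL_DEADLOCK_ERRNO
```

This is exactly the kind of per-backend branching the paragraph above
warns about, which is why the PR keeps the broader catch.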

We did also toy with the idea of using a retry loop, utilising the
session.begin_nested() function to create a SAVEPOINT, but this is not
supported by all setups, so it may lead to further unwanted
complications depending on the setup.
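
For reference, the retry-loop idea could be sketched roughly as follows.
This is an assumption about how it might look, not code we tried in this
PR: `run_with_savepoint_retry`, `work`, `max_attempts` and
`backoff_seconds` are hypothetical names. It uses
`session.begin_nested()` as a context manager, which rolls back to the
SAVEPOINT if the block raises.

```python
import time

from sqlalchemy.exc import OperationalError


def run_with_savepoint_retry(session, work, max_attempts=3, backoff_seconds=0.1):
    """Run work(session) inside a SAVEPOINT, retrying on OperationalError.

    Hypothetical helper for illustration. Re-raises the last error once
    max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # begin_nested() emits SAVEPOINT; leaving the block with an
            # exception rolls back to it, leaving normally releases it.
            with session.begin_nested():
                return work(session)
        except OperationalError:
            if attempt == max_attempts:
                raise
            # Simple linear backoff before the next attempt.
            time.sleep(backoff_seconds * attempt)
```

Whether the SAVEPOINT is still usable after a deadlock depends on the
backend, which is the portability concern raised above.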

I'm open to ideas/suggestions on further investigations of this issue,
and whether we should be handling this error elsewhere.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    We have not implemented a test for this: the issue occurs infrequently, and the only way we found to reproduce it is to manually create a deadlock within the airflow meta database.
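
One possible way to exercise the rollback path without a real deadlock
is to mock the session so that `commit()` raises. This is only a sketch
under stated assumptions: `update_task_state` is a hypothetical stand-in
for the code under test, not the actual change in `airflow/jobs.py`.

```python
from unittest import mock

from sqlalchemy.exc import OperationalError


def update_task_state(session):
    """Stand-in for the code under test (hypothetical, not from this PR)."""
    try:
        session.commit()
    except OperationalError:
        session.rollback()


def test_rollback_on_operational_error():
    session = mock.Mock()
    # Simulate the database reporting a deadlock when the commit runs.
    session.commit.side_effect = OperationalError(
        "UPDATE task_instance SET state = ?", {}, Exception("deadlock detected")
    )
    update_task_state(session)
    session.rollback.assert_called_once_with()
```

This only checks the error-handling branch; it does not show that real
deadlocks are resolved by the rollback.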

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.
    • All the public functions and the classes in the PR contain docstrings that explain what it does

Code Quality

  • Passes flake8

@codecov-io

Codecov Report

Merging #4807 into master will decrease coverage by 0.01%.
The diff coverage is 40%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4807      +/-   ##
==========================================
- Coverage   74.44%   74.43%   -0.02%     
==========================================
  Files         450      450              
  Lines       28971    28975       +4     
==========================================
  Hits        21568    21568              
- Misses       7403     7407       +4
Impacted Files Coverage Δ
airflow/jobs.py 76.18% <40%> (-0.28%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b51712c...01fc822. Read the comment docs.

@fenglu-db
Contributor

@PaulW I made a similar PR a few days back. How about we consolidate and move the discussion to #4769?

@PaulW
Contributor Author

PaulW commented Mar 1, 2019

@fenglu-g Yes, it looks that way. There was no PR attached to the Jira ticket, so I assumed that nothing had been raised.

@ashb
Member

ashb commented Mar 1, 2019

Closing in favour of #4769

@ashb ashb closed this Mar 1, 2019