Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: retry loop for exception caused by deadlock on badge node #404

Merged
merged 11 commits into from Dec 3, 2020

Conversation

allisonsuarez
Copy link
Contributor

Summary of Changes

Explanation from @dikshathakur3119 :
We index Hive tables using index_table_mateadata and to expedite indexing all the hive table we are leveraging multiprocessing.
When a relationship is created, its endpoint nodes are write-locked. If multiple threads attempt to create relationships involving the same set of endpoint nodes, deadlocks can occur.
Earlier this deadlock could not take place since we are creating threads on the basis of shema name, so never used to process the same table node from different threads.
But we have recently added a new model - Badge, and we create a single badge for unique badge names, whether it is table level badge or column level badge. E.g there will be only one node with badge name ‘deprecated’ or ‘partition column’. We have n:n relationship between badge-table and badge-column. Now when we try to index table metadata, our task try to access the same badge node from different threads which causes a deadlock.

Tests

N/A

Documentation

N/A

CheckList

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • PR includes a summary of changes.
  • PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
  • PR passes make test

… with badge node, still a draft

Signed-off-by: Allison Suarez Miranda <asuarezmiranda@lyft.com>
Signed-off-by: Allison Suarez Miranda <asuarezmiranda@lyft.com>
Signed-off-by: Allison Suarez Miranda <asuarezmiranda@lyft.com>
Signed-off-by: Allison Suarez Miranda <asuarezmiranda@lyft.com>
Signed-off-by: Allison Suarez Miranda <asuarezmiranda@lyft.com>
@allisonsuarez
Copy link
Contributor Author

I have some open questions on the code comments that I need help with so please leave some comments with your thoughts @dikshathakur3119 @feng-tao @jinhyukchang

Copy link
Contributor

@dikshathakur3119 dikshathakur3119 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good to me, and it solves the issue we are facing currently when we try to index hive tables in multiple processes. Do you think we should add note in our badge documents to make sure people understand that why sometimes a task of ingesting nodes may fail even after retries if they are trying to add same Badge node in parallel tasks?

databuilder/publisher/neo4j_csv_publisher.py Outdated Show resolved Hide resolved
@stale
Copy link

stale bot commented Nov 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale stalebot believes this issue/PR is no longer active label Nov 26, 2020
Signed-off-by: Allison Suarez Miranda <asuarezmiranda@lyft.com>
@stale stale bot removed the stale stalebot believes this issue/PR is no longer active label Nov 30, 2020
Signed-off-by: Allison Suarez Miranda <asuarezmiranda@lyft.com>
Signed-off-by: Allison Suarez Miranda <asuarezmiranda@lyft.com>
Signed-off-by: Allison Suarez Miranda <asuarezmiranda@lyft.com>
@feng-tao
Copy link
Member

feng-tao commented Dec 2, 2020

CI fails?

@feng-tao feng-tao added the keep fresh Disables stalebot from closing an issue label Dec 2, 2020
Signed-off-by: Allison Suarez Miranda <asuarezmiranda@lyft.com>
@allisonsuarez allisonsuarez merged commit 9fd1513 into master Dec 3, 2020
@allisonsuarez allisonsuarez deleted the asm-deadlock-fix branch December 3, 2020 20:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
keep fresh Disables stalebot from closing an issue
Projects
None yet
3 participants