[YUNIKORN-2629] Adding a node can result in a deadlock #849
base: master
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #849 +/- ##
==========================================
- Coverage 67.33% 67.23% -0.11%
==========================================
Files 70 70
Lines 7598 7611 +13
==========================================
+ Hits 5116 5117 +1
- Misses 2271 2280 +9
- Partials 211 214 +3
Hi @pbacsko, temporarily releasing the context lock makes sense to me. However, I suspect the root cause of the deadlock lies in other tests: the EventHandler gets cleaned up, so the wait group never reaches its count and the context lock is never released.
This problem also occurs in real environments when adding a node; it is not just test code that fails. BTW, our idea is that this fix is good enough for 1.5.2 (it has been validated by Jacob Salway) and even for 1.6.0. We can do a more thorough review of
I am OK with the change as it is for 1.5.2. We need to have a proper look at the context lock for the 1.6.0 release and we should try to prevent this change from becoming the final solution.
OK. Putting it back to draft. I'll commit it directly to branch-1.5.
What is this PR for?
Prevent deadlock in registerNodes() by releasing/re-acquiring the write lock.
What type of PR is it?
Todos
What is the Jira issue?
https://issues.apache.org/jira/browse/YUNIKORN-2629
How should this be tested?
Screenshots (if appropriate)
Questions: