[YUNIKORN-2629] Adding a node can result in a deadlock #849

Draft
wants to merge 1 commit into
base: master
from

Conversation

pbacsko
Contributor

@pbacsko pbacsko commented May 23, 2024

What is this PR for?

Prevent deadlock in registerNodes() by releasing/re-acquiring the write lock.
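
A minimal sketch of the release/re-acquire pattern described above, not the actual patch. The type `nodeContext` and the methods `addNodes` and `handleNodeAccepted` are hypothetical names chosen for illustration, and the sketch assumes that the asynchronous handlers need the same lock the caller is holding.

```go
package main

import "sync"

// nodeContext is a hypothetical stand-in for the shim's Context; only the
// locking pattern is illustrated here.
type nodeContext struct {
	lock  sync.RWMutex
	nodes map[string]bool
}

// handleNodeAccepted simulates an asynchronous event handler. It needs the
// context lock (an assumption of this sketch) before it can signal completion.
func (c *nodeContext) handleNodeAccepted(name string, wg *sync.WaitGroup) {
	defer wg.Done()
	c.lock.Lock()
	defer c.lock.Unlock()
	c.nodes[name] = true
}

// registerNodes must be called with c.lock held for writing. It starts one
// handler per node, then drops the write lock while waiting so that the
// handlers can make progress, and re-acquires it before returning.
func (c *nodeContext) registerNodes(names []string) {
	var wg sync.WaitGroup
	for _, name := range names {
		wg.Add(1)
		go c.handleNodeAccepted(name, &wg)
	}

	c.lock.Unlock() // without this, wg.Wait() below would block forever
	wg.Wait()
	c.lock.Lock() // restore the caller's expectation that the lock is held
}

// addNodes is the hypothetical caller that owns the write lock.
func (c *nodeContext) addNodes(names []string) int {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.registerNodes(names)
	return len(c.nodes)
}
```

Calling `addNodes` on a `nodeContext` with an initialized `nodes` map returns once every handler has run. The trade-off is that other goroutines can modify the shared state while the lock is released, so any state read before the wait has to be treated as stale afterwards.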

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-2629

How should this be tested?

Screenshots (if appropriate)

Questions:

  • - The license files need an update.
  • - There are breaking changes for older versions.
  • - It needs documentation.

@pbacsko pbacsko self-assigned this May 23, 2024
@pbacsko pbacsko marked this pull request as draft May 23, 2024 09:48
@pbacsko pbacsko requested review from craigcondit, wilfred-s, chia7712, brandboat and chenyulin0719 and removed request for craigcondit and wilfred-s May 23, 2024 09:49

codecov bot commented May 23, 2024

Codecov Report

Attention: Patch coverage is 58.33333%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 67.23%. Comparing base (5f80f49) to head (4853d08).
Report is 3 commits behind head on master.

Files                  Patch %   Lines
pkg/cache/context.go   58.33%    6 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #849      +/-   ##
==========================================
- Coverage   67.33%   67.23%   -0.11%     
==========================================
  Files          70       70              
  Lines        7598     7611      +13     
==========================================
+ Hits         5116     5117       +1     
- Misses       2271     2280       +9     
- Partials      211      214       +3     


@chenyulin0719
Contributor

chenyulin0719 commented May 24, 2024

Hi @pbacsko,

Temporarily releasing the context lock makes sense to me.

However, I suspect the root cause of the deadlock lies in other tests. When another test calls 'dispatcher.UnregisterAllEventHandlers()', the EventHandler will be cleaned and the wait group will never meet the count, so the context lock is not released.
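
The failure mode described in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not YuniKorn code: `miniDispatcher`, `registerAndWait`, `waitWithTimeout` and `defaultTimeout` are hypothetical names, and the timeout exists only so the example terminates, whereas a plain `wg.Wait()` would block forever.

```go
package main

import (
	"sync"
	"time"
)

// defaultTimeout is an arbitrary value chosen for this sketch.
const defaultTimeout = 100 * time.Millisecond

// miniDispatcher is a hypothetical, simplified event dispatcher; it is not
// the YuniKorn dispatcher API.
type miniDispatcher struct {
	mu      sync.Mutex
	handler func(event string, ack func())
}

func (d *miniDispatcher) register(h func(event string, ack func())) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.handler = h
}

func (d *miniDispatcher) unregisterAll() {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.handler = nil
}

// dispatch delivers the event asynchronously, or silently drops it when no
// handler is registered. In the dropped case ack is never called.
func (d *miniDispatcher) dispatch(event string, ack func()) {
	d.mu.Lock()
	h := d.handler
	d.mu.Unlock()
	if h != nil {
		go h(event, ack)
	}
}

// waitWithTimeout reports whether wg completed before the timeout elapsed.
// On timeout the helper goroutine stays blocked, which is acceptable for a
// short-lived example but would be a leak in long-running code.
func waitWithTimeout(wg *sync.WaitGroup, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

// registerAndWait dispatches one event per node and waits for every
// acknowledgement, returning false if the count is never reached.
func registerAndWait(d *miniDispatcher, nodes []string, timeout time.Duration) bool {
	var wg sync.WaitGroup
	wg.Add(len(nodes))
	for _, n := range nodes {
		d.dispatch(n, wg.Done)
	}
	return waitWithTimeout(&wg, timeout)
}
```

With a handler registered that calls `ack()`, `registerAndWait` returns true. After `unregisterAll()`, the events are dropped, the wait group count is never reached, and the call returns false once the timeout expires. If the caller held a lock across an unbounded wait, that lock would never be released.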

@pbacsko
Contributor Author

pbacsko commented May 28, 2024

> the EventHandler will be cleaned and the wait group will never meet the count.

This problem also occurs in real environments when adding a node. It's not just test code which fails.

BTW our idea is that this fix is good enough for 1.5.2 (it has been validated by Jacob Salway) and even 1.6.0. We can do a more thorough review of Context later in a separate JIRA.

@pbacsko pbacsko marked this pull request as ready for review May 28, 2024 19:29
@wilfred-s
Contributor

I am OK with the change as it is for 1.5.2. We need to have a proper look at the context lock for the 1.6.0 release and we should try to prevent this change from becoming the final solution.

@pbacsko
Contributor Author

pbacsko commented May 30, 2024

> I am OK with the change as it is for 1.5.2. We need to have a proper look at the context lock for the 1.6.0 release and we should try to prevent this change from becoming the final solution.

OK. Putting it back to draft. I'll commit it directly to branch-1.5.

@pbacsko pbacsko marked this pull request as draft May 30, 2024 07:26