New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YUNIKORN-1328] Handle application state changes and trigger tracker interfaces #441
Conversation
09ae825
to
7728234
Compare
Notes: Why increase resource usage in app#AddAllocationInternal instead of enter_starting app transition code block?
Hence, app#AddAllocationInternal is the right place to increase the resource usage. However, With this change, queue metrics and scheduler metrics to increase app running would be moved from enter_running app transition code block to enter_starting app transition code block. Similarly, app#removeAllocationInternal would be used to decrease the resource usage but not remove the app by calling the decrease method with removeApp as false. Finally, when the app enters “completing” state, the app would be removed by calling the decrease method with removeApp as true. With this change, even queue metrics and scheduler metrics to do decrease operations would be taken care too. Even in case of "resuming" (gang app start to work as non gang app because all ph timeouts and gang scheduling fallback style is soft) state, app would be removed as anyway it would be added again when it start to function as non gang app). This activity happens in enter_resuming state transition code block. In case "hard" gang scheduling style, app would be moved to failed state when all ph's timeout happens, hence app would be removed by calling the decrease method with removeApp as true. This activity happens in enter_failing state transition code block. With this change, even queue metrics and scheduler metrics to do decrease operations would be taken care too. Stack traces of app#AddAllocationInternal Node Registration p#addAllocation -> app#AddAllocation -> app#AddAllocationInternal Alloc releases context#processAllocationReleases -> p#removeAllocation -> app#ReplaceAllocation -> app#AddAllocationInternal Node Removal p#removeNodeAllocations -> app#ReplaceAllocation -> app#AddAllocationInternal Regular scheduling core path app#tryAllocate or app#tryNodes or app#tryNodesNoReserve or app#tryReservedAllocate -> app#AddAllocationInternal Stack traces of app#removeAllocationInternal Normal Alloc releases context#processAllocationReleases -> p#removeAllocation -> app#RemoveAllocation -> app#removeAllocationInternal Normal Alloc removal during Node Removal p#removeNodeAllocations -> app#RemoveAllocation -> app#removeAllocationInternal ph Alloc releases context#processAllocationReleases -> p#removeAllocation -> app#ReplaceAllocation -> Ph Alloc removal during Node Removal p#removeNodeAllocations -> app#ReplaceAllocation -> User resource usage enforcement and update User quota would be enforced in app#tryAllocate just before the queue headroom check to ensure user quota is not exceeded. if !headRoom.FitInMaxUndef(request.GetAllocatedResource()) { After these above check, app#tryAllocate calls app#tryNode to Node pre allocate checks, adding allocation into node, incrementing queue allocated resource for resource updates etc and finally calls app#addAllocationInternal. As described above, app#addAllocationInternal does the actual user’s resource usage updates. |
pkg/scheduler/ugm/manager.go
Outdated
|
||
// getGroup Based on the current limitations, username and group name is same. Groups[0] is always set and same as username. | ||
// It would be changed in future based on user group resolution, limit configuration processing etc | ||
func (m *Manager) getGroup(user security.UserGroup) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: what if, for whatever reason, we don't have groups for a certain user? Can that happen? Eg. we disable the user-group retrieval in the AC, what happens here? Should we really return an error if there are no groups?
c3c715f
to
ec06e03
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the discussion: we need to be careful when we remove a placeholder as it could be a timed out placeholder on an app that already has other allocations. So going to zero for placeholders can never (?) trigger an app usage tracking removal. The other thing that we need to check is the size difference between placeholder and allocation. If the placeholder is larger than the real one do we account correctly?
pkg/scheduler/objects/application.go
Outdated
sa.allocatedPlaceholder = resources.Sub(sa.allocatedPlaceholder, alloc.GetAllocatedResource()) | ||
if resources.IsZero(sa.allocatedPlaceholder) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same check is happening below, please make merge the change into one check
pkg/scheduler/objects/application.go
Outdated
sa.allocatedPlaceholder = resources.Sub(sa.allocatedPlaceholder, alloc.GetAllocatedResource()) | ||
if resources.IsZero(sa.allocatedPlaceholder) { | ||
removeApp = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure this is correct: if we call removeAllocationInternal()
when we time out the placeholders that have not been allocated yet we might remove the app from tracking which is incorrect.
We need to account for the same kind of cases as below with Fail and Run event.
49a73ec
to
1d009bd
Compare
We seem to have a deadlock in the code. One of the tests times out which normally means we have locked ourselves out. |
Codecov Report
@@ Coverage Diff @@
## master #441 +/- ##
==========================================
+ Coverage 72.68% 72.81% +0.12%
==========================================
Files 67 67
Lines 9933 9990 +57
==========================================
+ Hits 7220 7274 +54
- Misses 2471 2474 +3
Partials 242 242
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
3204614
to
0f8a106
Compare
@manirajv06 this PR seems to have quite a bit of unnecessary bits from other JIRAs included. Can you clean this up and rebase / squash? |
aa64876
to
95727ae
Compare
@craigcondit @wilfred-s Cleaned up unwanted old commits (included as part of dependent jira rebase). |
d1a3965
to
3855c00
Compare
@wilfred-s and I had one more round of review mostly around simplifying test related helper methods, clean up etc. Incorporated the changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
thanks for sticking with me through the long review process.
What is this PR for?
Integrated ugm module with application state transition change blocks and other places.
What type of PR is it?
Todos
What is the Jira issue?
https://issues.apache.org/jira/browse/YUNIKORN-1328
How should this be tested?
Screenshots (if appropriate)
Questions: