Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YUNIKORN-1328] Handle application state changes and trigger tracker interfaces #441

Closed
wants to merge 8 commits into from

Conversation

manirajv06
Copy link
Contributor

What is this PR for?

Integrated ugm module with application state transition change blocks and other places.

What type of PR is it?

  • - Feature

Todos

  • - Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-1328

How should this be tested?

Screenshots (if appropriate)

Questions:

  • - The licenses files need update.
  • - There is breaking changes for older versions.
  • - It needs documentation.

@manirajv06 manirajv06 self-assigned this Oct 6, 2022
@manirajv06 manirajv06 force-pushed the YUNIKORN-1328 branch 3 times, most recently from 09ae825 to 7728234 Compare October 16, 2022 07:47
@manirajv06
Copy link
Contributor Author

Notes:

Why increase resource usage in app#AddAllocationInternal instead of enter_starting app transition code block?

  1. In case of a gang app, app state would be changed to "starting" only after all ph's have been allocated. But we need to update the user's resource usage during this ph's allocation stage itself.
  2. In case of a non - gang app, after 1st allocation (starting->running), the app would be in running state for future allocations (from 2nd allocation onwards) too. So these future allocations need to be accounted for in the user's resource usage.

Hence, app#AddAllocationInternal is the right place to increase the resource usage. However, With this change, queue metrics and scheduler metrics to increase app running would be moved from enter_running app transition code block to enter_starting app transition code block.

Similarly, app#removeAllocationInternal would be used to decrease the resource usage but not remove the app by calling the decrease method with removeApp as false.

Finally, when the app enters “completing” state, the app would be removed by calling the decrease method with removeApp as true. With this change, even queue metrics and scheduler metrics to do decrease operations would be taken care too.

Even in case of "resuming" (gang app start to work as non gang app because all ph timeouts and gang scheduling fallback style is soft) state, app would be removed as anyway it would be added again when it start to function as non gang app). This activity happens in enter_resuming state transition code block. In case "hard" gang scheduling style, app would be moved to failed state when all ph's timeout happens, hence app would be removed by calling the decrease method with removeApp as true. This activity happens in enter_failing state transition code block. With this change, even queue metrics and scheduler metrics to do decrease operations would be taken care too.

Stack traces of app#AddAllocationInternal

Node Registration

p#addAllocation -> app#AddAllocation -> app#AddAllocationInternal

Alloc releases

context#processAllocationReleases -> p#removeAllocation -> app#ReplaceAllocation -> app#AddAllocationInternal

Node Removal

p#removeNodeAllocations -> app#ReplaceAllocation -> app#AddAllocationInternal

Regular scheduling core path

app#tryAllocate or app#tryNodes or app#tryNodesNoReserve or app#tryReservedAllocate -> app#AddAllocationInternal

Stack traces of app#removeAllocationInternal

Normal Alloc releases

context#processAllocationReleases -> p#removeAllocation -> app#RemoveAllocation -> app#removeAllocationInternal

Normal Alloc removal during Node Removal

p#removeNodeAllocations -> app#RemoveAllocation -> app#removeAllocationInternal

ph Alloc releases

context#processAllocationReleases -> p#removeAllocation -> app#ReplaceAllocation ->
app#removeAllocationInternal

Ph Alloc removal during Node Removal

p#removeNodeAllocations -> app#ReplaceAllocation ->
app#removeAllocationInternal

User resource usage enforcement and update

User quota would be enforced in app#tryAllocate just before the queue headroom check to ensure user quota is not exceeded.

if !headRoom.FitInMaxUndef(request.GetAllocatedResource()) {
}

After these above check, app#tryAllocate calls app#tryNode to Node pre allocate checks, adding allocation into node, incrementing queue allocated resource for resource updates etc and finally calls app#addAllocationInternal. As described above, app#addAllocationInternal does the actual user’s resource usage updates.


// getGroup Based on the current limitations, username and group name is same. Groups[0] is always set and same as username.
// It would be changed in future based on user group resolution, limit configuration processing etc
func (m *Manager) getGroup(user security.UserGroup) (string, error) {
Copy link
Contributor

@pbacsko pbacsko Oct 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: what if, for whatever reason, we don't have groups for a certain user? Can that happen? Eg. we disable the user-group retrieval in the AC, what happens here? Should we really return an error if there are no groups?

Copy link
Contributor

@wilfred-s wilfred-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the discussion: we need to be careful when we remove a placeholder as it could be a timed out placeholder on an app that already has other allocations. So going to zero for placeholders can never (?) trigger an app usage tracking removal. The other thing that we need to check is the size difference between placeholder and allocation. If the placeholder is larger than the real one do we account correctly?

sa.allocatedPlaceholder = resources.Sub(sa.allocatedPlaceholder, alloc.GetAllocatedResource())
if resources.IsZero(sa.allocatedPlaceholder) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same check is happening below, please make merge the change into one check

sa.allocatedPlaceholder = resources.Sub(sa.allocatedPlaceholder, alloc.GetAllocatedResource())
if resources.IsZero(sa.allocatedPlaceholder) {
removeApp = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure this is correct: if we call removeAllocationInternal() when we time out the placeholders that have not been allocated yet we might remove the app from tracking which is incorrect.
We need to account for the same kind of cases as below with Fail and Run event.

@wilfred-s
Copy link
Contributor

We seem to have a deadlock in the code. One of the tests times out which normally means we have locked ourselves out.
Can you please check?

@codecov
Copy link

codecov bot commented Nov 14, 2022

Codecov Report

Merging #441 (c855af2) into master (2247f62) will increase coverage by 0.12%.
The diff coverage is 87.32%.

@@            Coverage Diff             @@
##           master     #441      +/-   ##
==========================================
+ Coverage   72.68%   72.81%   +0.12%     
==========================================
  Files          67       67              
  Lines        9933     9990      +57     
==========================================
+ Hits         7220     7274      +54     
- Misses       2471     2474       +3     
  Partials      242      242              
Impacted Files Coverage Δ
pkg/scheduler/partition.go 75.21% <ø> (ø)
pkg/webservice/handlers.go 78.63% <ø> (ø)
pkg/scheduler/ugm/manager.go 59.37% <75.00%> (+4.32%) ⬆️
pkg/scheduler/objects/application.go 58.32% <96.42%> (+0.87%) ⬆️
pkg/scheduler/objects/application_state.go 100.00% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@craigcondit
Copy link
Contributor

@manirajv06 this PR seems to have quite a bit of unnecessary bits from other JIRAs included. Can you clean this up and rebase / squash?

@manirajv06 manirajv06 force-pushed the YUNIKORN-1328 branch 2 times, most recently from aa64876 to 95727ae Compare November 21, 2022 11:12
@manirajv06
Copy link
Contributor Author

@craigcondit @wilfred-s Cleaned up unwanted old commits (included as part of dependent jira rebase).

@manirajv06 manirajv06 mentioned this pull request Nov 22, 2022
5 tasks
pkg/scheduler/objects/application.go Show resolved Hide resolved
pkg/scheduler/objects/application.go Outdated Show resolved Hide resolved
pkg/scheduler/objects/application.go Outdated Show resolved Hide resolved
pkg/scheduler/objects/application.go Show resolved Hide resolved
pkg/scheduler/objects/application_test.go Outdated Show resolved Hide resolved
pkg/scheduler/ugm/manager.go Outdated Show resolved Hide resolved
pkg/scheduler/ugm/manager.go Outdated Show resolved Hide resolved
pkg/scheduler/ugm/manager.go Show resolved Hide resolved
pkg/scheduler/ugm/tracker.go Outdated Show resolved Hide resolved
pkg/scheduler/partition_test.go Outdated Show resolved Hide resolved
@manirajv06
Copy link
Contributor Author

@wilfred-s and I had one more round of review mostly around simplifying test related helper methods, clean up etc. Incorporated the changes.

Copy link
Contributor

@wilfred-s wilfred-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
thanks for sticking with me through the long review process.

@wilfred-s wilfred-s closed this in 9b90e8a Nov 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants