[YUNIKORN-169] panic in ask release #152
Conversation
When an ask is released while an allocation for that ask is processed the scheduler panics. The panic is caused by a race between two event handling routines updating the same app. The changes remove the need for the ask to be accessed on confirmation. The inflight allocation is now also correctly removed from the cache when the confirmation fails. A race condition in reservation removal was fixed. A sync issue between cache and scheduling node for available resources was fixed.
Travis tests have failed. Hey @wilfred-s, 1st Build:
curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b $(go env GOPATH)/bin v1.22.2
make lint
TravisBuddy Request Identifier: 8109fa20-9a74-11ea-ae6f-3d1c6927e18c
When an ask is released while an allocation for that ask is processed the scheduler panics. The panic is caused by a race between two event handling routines updating the same app. The changes remove the need for the ask to be accessed on confirmation.
Hey @wilfred-s, TravisBuddy Request Identifier: 56ef3f40-9a81-11ea-ae6f-3d1c6927e18c
Codecov Report
@@            Coverage Diff             @@
##           master     #152      +/-   ##
==========================================
- Coverage   58.62%   58.57%   -0.06%
==========================================
  Files          62       62
  Lines        6096     6160      +64
==========================================
+ Hits         3574     3608      +34
- Misses       2381     2411      +30
  Partials      141      141
==========================================
Continue to review full report at Codecov.
When an ask is released while an allocation for that ask is processed the scheduler panics. The panic is caused by a race between two event handling routines updating the same app. The changes remove the need for the ask to be accessed on confirmation.
Hey @wilfred-s, TravisBuddy Request Identifier: d2fc0660-9a98-11ea-ae6f-3d1c6927e18c
pkg/cache/application_info.go
Outdated
// Return the allocation for the allocation key. This will return the first allocation for the
// allocation key. It will only work for the case that the repeat is 1.
func (ai *ApplicationInfo) GetAllocationUUID(allocKey string) string {
	ai.lock.RLock()
	defer ai.lock.RUnlock()

	// just get the first one, ignore multiple repeats for now.
	// we cannot detect multiple repeats easily here.
	for uuid, alloc := range ai.allocations {
		if alloc.AllocationProto.AllocationKey == allocKey {
			return uuid
		}
	}
	// not found return an empty string
	return ""
}
This is less efficient, and it assumes a proposal only has 1 allocation.
To fix the uuid issue, can we use the approach we used in #151? That is, define https://github.com/yangwwei/incubator-yunikorn-core/blob/8f2ffe4cbd54ad2e49bb193a938de0a0bd048009/pkg/scheduler/schedulerevent/events.go#L39 and use it like this: https://github.com/yangwwei/incubator-yunikorn-core/blob/8f2ffe4cbd54ad2e49bb193a938de0a0bd048009/pkg/scheduler/scheduler.go#L326-L337.
We have no proper support for multiple repeats at the moment.
Let me explain what I think the problem is with multiple repeats, and with what you are proposing in code:
If the ask has multiple repeats we might have allocated one or more of those repeats. We currently just link the ask to the allocation: one ask, multiple allocations. Allocations are added and removed based on UUID. So far all is good.
In this case we are removing an ask. One of the repeats of that ask is in the process of being allocated. That ask can have a number of confirmed allocations and has at least one allocation in flight, otherwise we would not have to clean up where we are. There might also be unallocated repeats, but they are not important.
We need to clean up the in-flight allocation; that is one allocation. We cannot just remove all allocations that have been added and confirmed. They should not be impacted as they are not related to the current change. If those allocations need to be removed then that is an allocation removal request.
Those earlier allocated ask repeats might have been running for a while, and just because an outstanding ask gets deleted I cannot and may not remove confirmed allocations.
For the case that there are multiple allocations in flight there is a whole other complexity that needs to be looked at.
In the proposed code of PR #151 you are not only cancelling the in-flight allocation but also all confirmed allocations; that would be unexpected behaviour when cancelling an ask.
We have no proper support for multiple repeats at the moment.
I agree. But we should break the semantics as little as possible.
In the proposed code of PR #151 you are not only cancelling the in-flight allocation but also all confirmed allocations; that would be unexpected behaviour when cancelling an ask.
To get more consistency, we can do the following:
From the cache, it first generates the allocation and then sends the exact same message to the scheduler for confirmation, along with the proposal itself. Each time it handles only one allocation for one proposal.
m.EventHandlers.SchedulerEventHandler.HandleEvent(&schedulerevent.SchedulerAllocationUpdatesEvent{
	AcceptedAllocations: []*schedulerevent.AcceptedAllocationProposal{
		{
			AllocationProposal: proposal,
			Allocation:         allocInfo.AllocationProto,
		},
	},
})
In the scheduler, it processes the proposal; if that fails, it notifies the cache to release the allocation:
alloc := ev.AcceptedAllocations[0]
// Update pending resource
if err := s.confirmAllocationProposal(alloc.AllocationProposal); err != nil {
	log.Logger().Error("failed to confirm allocation proposal", zap.Error(err))
	// if the scheduler could not confirm the proposal,
	// we need to notify the cache to revert the allocations
	s.eventHandlers.CacheEventHandler.HandleEvent(&cacheevent.ReleaseAllocationsEvent{
		AllocationsToRelease: []*commonevents.ReleaseAllocation{
			{
				UUID:          alloc.Allocation.UUID,
				ApplicationID: alloc.AllocationProposal.ApplicationID,
				PartitionName: alloc.AllocationProposal.PartitionName,
				Message:       "scheduler is unable to confirm the allocation",
				ReleaseType:   si.AllocationReleaseResponse_STOPPED_BY_RM,
			},
		},
	})
}
The new commit has a simple fix: passing back the UUID of the allocation in the proposal, instead of the whole allocation with all the things that hang off it.
@@ -382,6 +383,86 @@ func TestAddNewNode(t *testing.T) {
	assert.Equal(t, 0, ms.getPartitionReservations()[appID], "reservations not removed from partition")
}

// simple reservation one app one queue
func TestUnReservationAndDeletion(t *testing.T) {
This needs to run multiple times, one time is not enough to reproduce the issue.
It crashed for me, and the logs show that the ask is removed before the allocation is processed. If that does not happen the allocation will be confirmed. The checks at the end of the test case would then find allocations on the nodes, the nodes would not be in the correct state, etc.
That would cause the test case to fail.
I usually only get this reproduced after a few runs. I don't think it crashes every time.
If that does not happen the allocation will be confirmed. The checks at the end of the test case would then find allocations on the nodes, the nodes would not be in the correct state, etc.
If that does not happen, then the allocation proposal was never made because the ask is already removed. That doesn't cover the case we are expecting it to cover.
Running it multiple times only lowers the chance of that happening, but it still does not guarantee that the events come in the right order. The only way to guarantee that would be to stop part of the handling or to manipulate the objects directly.
simplify passing back the UUID of the allocation in the proposal
Hey @wilfred-s, TravisBuddy Request Identifier: dc8839e0-a1b1-11ea-b485-534d9f3e4954
The latest commit only adds the UUID to the |