[YUNIKORN-169] panic in ask release #152

Merged
yangwwei merged 5 commits into apache:master from yunikorn-169 on May 29, 2020

Conversation

wilfred-s
Contributor

When an ask is released while an allocation for that ask is processed,
the scheduler panics. The panic is caused by a race between two event
handling routines updating the same app. The changes remove the need for
the ask to be accessed on confirmation.
The in-flight allocation is now also correctly removed from the cache
when the confirmation fails.

A race condition in reservation removal was fixed.
A sync issue between the cache and the scheduling node for available
resources was fixed.

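For illustration only (a minimal standalone sketch, not YuniKorn code): the class of panic described above comes down to two event handling routines touching the same application state without coordination, roughly:

package main

import (
	"fmt"
	"sync"
)

// simplified stand-ins; the real scheduler structures differ
type ask struct{ key string }

type app struct {
	asks map[string]*ask // deliberately unprotected to show the hazard
}

func main() {
	a := &app{asks: map[string]*ask{"ask-1": {key: "ask-1"}}}
	var wg sync.WaitGroup
	wg.Add(2)

	// event handling routine 1: the RM releases the ask
	go func() {
		defer wg.Done()
		delete(a.asks, "ask-1")
	}()

	// event handling routine 2: an allocation for that ask is being
	// confirmed and still needs to read the ask
	go func() {
		defer wg.Done()
		pending := a.asks["ask-1"] // nil if the release ran first
		fmt.Println("confirming", pending.key)
	}()

	// with unlucky timing this program panics, either on a nil pointer
	// dereference or on the concurrent map access; not touching the ask on
	// confirmation, as this change does, removes the hazard
	wg.Wait()
}
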
@wilfred-s wilfred-s self-assigned this May 20, 2020
@TravisBuddy

Travis tests have failed

Hey @wilfred-s,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

1st Build

View build log

curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b $(go env GOPATH)/bin v1.22.2
golangci/golangci-lint info checking GitHub for tag 'v1.22.2'
golangci/golangci-lint info found version: 1.22.2 for v1.22.2/linux/amd64
golangci/golangci-lint info installed /home/travis/gopath/bin/golangci-lint
make lint
running golangci-lint
pkg/scheduler/scheduling_node.go:59: File is not `goimports`-ed with -local github.com/apache/incubator-yunikorn (goimports)
		cachedAvailable:			 resources.NewResource(),
Makefile:45: recipe for target 'lint' failed
make: *** [lint] Error 1
TravisBuddy Request Identifier: 8109fa20-9a74-11ea-ae6f-3d1c6927e18c
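
For context (not part of this PR): golangci-lint's goimports check applies gofmt formatting as well as a separate import group for the project's own packages when -local is set. The flagged line above appears to be a whitespace issue; the expected import layout looks roughly like this (the import paths below are illustrative, not copied from the file):

import (
	// standard library imports first
	"fmt"

	// third-party imports next
	"go.uber.org/zap"

	// -local github.com/apache/incubator-yunikorn keeps the project's own
	// packages in their own, final group
	"github.com/apache/incubator-yunikorn-core/pkg/common/resources"
)

// the flagged struct literal line only needs regular spacing:
//     cachedAvailable: resources.NewResource(),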

When an ask is released while an allocation for that ask is processed
the scheduler panics. The panic is caused by a race between two event
handling routines updating the same app. The changes remove the need for
the ask to be accessed on confirmation.
@TravisBuddy

Hey @wilfred-s,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: 56ef3f40-9a81-11ea-ae6f-3d1c6927e18c

@codecov-commenter

codecov-commenter commented May 20, 2020

Codecov Report

Merging #152 into master will decrease coverage by 0.05%.
The diff coverage is 63.39%.

@@            Coverage Diff             @@
##           master     #152      +/-   ##
==========================================
- Coverage   58.62%   58.57%   -0.06%     
==========================================
  Files          62       62              
  Lines        6096     6160      +64     
==========================================
+ Hits         3574     3608      +34     
- Misses       2381     2411      +30     
  Partials      141      141              
Impacted Files Coverage Δ
pkg/cache/cluster_info.go 2.43% <0.00%> (-0.01%) ⬇️
pkg/scheduler/scheduler.go 0.00% <0.00%> (ø)
pkg/scheduler/scheduling_partition.go 41.40% <15.78%> (+0.30%) ⬆️
pkg/cache/node_info.go 72.47% <72.72%> (-9.35%) ⬇️
pkg/scheduler/scheduling_node.go 81.50% <100.00%> (+0.09%) ⬆️
pkg/scheduler/scheduling_queue.go 88.21% <100.00%> (+0.32%) ⬆️
pkg/scheduler/tests/mock_scheduler.go 100.00% <100.00%> (ø)
pkg/cache/application_info.go 71.96% <0.00%> (-3.53%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ba56cc...5709c98. Read the comment docs.

When an ask is released while an allocation for that ask is processed
the scheduler panics. The panic is caused by a race between two event
handling routines updating the same app. The changes remove the need for
the ask to be accessed on confirmation.
@TravisBuddy

Hey @wilfred-s,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: d2fc0660-9a98-11ea-ae6f-3d1c6927e18c

Comment on lines 76 to 92
// Return the allocation UUID for the allocation key. This will return the first
// allocation found for the allocation key. It will only work for the case that the repeat is 1.
func (ai *ApplicationInfo) GetAllocationUUID(allocKey string) string {
	ai.lock.RLock()
	defer ai.lock.RUnlock()

	// just get the first one, ignore multiple repeats for now:
	// we cannot detect multiple repeats easily here.
	for uuid, alloc := range ai.allocations {
		if alloc.AllocationProto.AllocationKey == allocKey {
			return uuid
		}
	}
	// not found, return an empty string
	return ""
}

wilfred-s (Contributor Author)

We have no proper support for multiple repeats at the moment.

Let me explain what I think is the problem with multiple repeats and with what you are proposing in code:
If the ask has multiple repeats we might have allocated one or more of those repeats. We currently just link the ask to the allocation: one ask, multiple allocations. Allocations are added and removed based on UUID. So far all is good.

In this case we are removing an ask. One of the repeats of that ask is in the process of being allocated. That ask can have a number of confirmed allocations and has at least one allocation in flight, otherwise we would not have to clean up where we are. There might be unallocated repeats also, but they are not important.
We need to clean up the in-flight allocation, which is one allocation. We cannot just remove all allocations that have been added and confirmed. They should not be impacted as they are not related to the current change. If those allocations need to be removed then that is an allocation removal request.

Those earlier allocated ask repeats might have been running for a while, and just because an outstanding ask gets deleted I cannot and may not remove confirmed allocations.
For the case that there are multiple allocations in flight there is a whole other complexity that needs to be looked at.

In the proposed code of PR #151 you are not only cancelling the in-flight allocation but also all confirmed allocations, which would be unexpected behaviour when cancelling an ask.
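
A minimal sketch of the semantics described above (illustration only, not the actual YuniKorn structures): the cache tracks an app's allocations by UUID, the in-flight one included, and releasing the ask should only revert that single in-flight entry:

type allocation struct {
	askKey string // links back to the ask; shared by all repeats of that ask
}

// revertInFlight removes only the allocation that was never confirmed.
// Deleting every allocation that shares the ask's allocation key would also
// cancel confirmed, already running repeats, which is the unexpected
// behaviour described above.
func revertInFlight(allocs map[string]*allocation, inFlightUUID string) {
	delete(allocs, inFlightUUID)
}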

yangwwei (Contributor), May 21, 2020

> We have no proper support for multiple repeats at the moment.

I agree. But we should avoid breaking the semantics as much as possible.

> In the proposed code of PR #151 you are not only cancelling the in-flight allocation but also all confirmed allocations, which would be unexpected behaviour when cancelling an ask.

To get more consistency, we can do the following:

From the cache, it first generates the allocation, and then sends exactly the same message to the scheduler for confirmation, along with the proposal itself. Each time it handles only one allocation for one proposal.

m.EventHandlers.SchedulerEventHandler.HandleEvent(&schedulerevent.SchedulerAllocationUpdatesEvent{
  AcceptedAllocations: []*schedulerevent.AcceptedAllocationProposal{
    {
      AllocationProposal: proposal,
      Allocation: allocInfo.AllocationProto,
    },
  },
})

In the scheduler, it processes the proposal; if that fails, it notifies the cache to release the allocation:

alloc := ev.AcceptedAllocations[0]
// Update pending resource
if err := s.confirmAllocationProposal(alloc.AllocationProposal); err != nil {
	log.Logger().Error("failed to confirm allocation proposal", zap.Error(err))
	// if the scheduler could not confirm the proposal,
	// we need to notify the cache to revert the allocations
	s.eventHandlers.CacheEventHandler.HandleEvent(&cacheevent.ReleaseAllocationsEvent{
		AllocationsToRelease: []*commonevents.ReleaseAllocation{
			{
				UUID:          alloc.Allocation.UUID,
				ApplicationID: alloc.AllocationProposal.ApplicationID,
				PartitionName: alloc.AllocationProposal.PartitionName,
				Message:       "scheduler is unable to confirm the allocation",
				ReleaseType:   si.AllocationReleaseResponse_STOPPED_BY_RM,
			},
		},
	})
}

wilfred-s (Contributor Author)

The new commit has a simple fix: it passes back the UUID of the allocation in the proposal instead of the whole allocation with all the things that hang off it.
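
A rough sketch of that idea (reusing the event types from the snippet above; the UUID field on the proposal is the assumption here, the rest mirrors the earlier snippet):

if err := s.confirmAllocationProposal(proposal); err != nil {
	log.Logger().Error("failed to confirm allocation proposal", zap.Error(err))
	// the proposal itself now carries the UUID, so the revert can be built
	// without reading the ask or the cached allocation object
	s.eventHandlers.CacheEventHandler.HandleEvent(&cacheevent.ReleaseAllocationsEvent{
		AllocationsToRelease: []*commonevents.ReleaseAllocation{
			{
				UUID:          proposal.UUID, // assumed new field on the proposal
				ApplicationID: proposal.ApplicationID,
				PartitionName: proposal.PartitionName,
				Message:       "scheduler is unable to confirm the allocation",
				ReleaseType:   si.AllocationReleaseResponse_STOPPED_BY_RM,
			},
		},
	})
}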

@@ -382,6 +383,86 @@ func TestAddNewNode(t *testing.T) {
assert.Equal(t, 0, ms.getPartitionReservations()[appID], "reservations not removed from partition")
}

// simple reservation one app one queue
func TestUnReservationAndDeletion(t *testing.T) {

yangwwei (Contributor)

This needs to run multiple times; one time is not enough to reproduce the issue.

wilfred-s (Contributor Author)

It crashed for me, and the logs show that the ask is removed before the allocation is processed. If that does not happen the allocation will be confirmed, the checks at the end of the test case will then find allocations left on the nodes, the nodes will not be in the correct state, etc.
That would cause the test case to fail.

yangwwei (Contributor)

I usually only get this reproduced after a few runs. I don't think it crashes every time.

> If that does not happen the allocation will be confirmed, the checks at the end of the test case will then find allocations left on the nodes, the nodes will not be in the correct state, etc.

If that does not happen, then the allocation proposal was never made because the ask was already removed. That doesn't cover the case we expect it to cover.

wilfred-s (Contributor Author)

Running it multiple times only lowers the chance of that happening, but it still does not guarantee that the events come in correctly. The only way that could be guaranteed is to stop part of the handling or to manipulate the objects directly.
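
As a side note (not from the thread): running the test repeatedly with the race detector raises the odds of hitting the window locally, assuming the test lives under pkg/scheduler/tests:

go test -race -count=50 -run TestUnReservationAndDeletion ./pkg/scheduler/tests/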

simplify passing back the UUID of the allocation in the proposal
@TravisBuddy

Hey @wilfred-s,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: dc8839e0-a1b1-11ea-b485-534d9f3e4954

@yangwwei
Contributor

The latest commit only adds the UUID to the AllocationProposal; this should be OK for now.
+1, let's merge this.
Thanks @wilfred-s

@yangwwei yangwwei merged commit 7415412 into apache:master May 29, 2020
@wilfred-s wilfred-s deleted the yunikorn-169 branch June 1, 2020 03:21