[YUNIKORN-169] panic in ask release #152

Merged
yangwwei merged 5 commits into apache:master from yunikorn-169 on May 29, 2020

Conversation

wilfred-s
Contributor

When an ask is released while an allocation for that ask is processed,
the scheduler panics. The panic is caused by a race between two event
handling routines updating the same app. The changes remove the need for
the ask to be accessed on confirmation.
The in-flight allocation is now also correctly removed from the cache
when the confirmation fails.

A race condition in reservation removal was fixed.
A sync issue between the cache and the scheduling node for available
resources was fixed.

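For illustration only (a minimal standalone sketch, not YuniKorn code): the class of panic described above comes down to two event handling routines touching the same application state without coordination, roughly:

package main

import (
	"fmt"
	"sync"
)

// simplified stand-ins; the real scheduler structures differ
type ask struct{ key string }

type app struct {
	asks map[string]*ask // deliberately unprotected to show the hazard
}

func main() {
	a := &app{asks: map[string]*ask{"ask-1": {key: "ask-1"}}}
	var wg sync.WaitGroup
	wg.Add(2)

	// event handling routine 1: the RM releases the ask
	go func() {
		defer wg.Done()
		delete(a.asks, "ask-1")
	}()

	// event handling routine 2: an allocation for that ask is being
	// confirmed and still needs to read the ask
	go func() {
		defer wg.Done()
		pending := a.asks["ask-1"] // nil if the release ran first
		fmt.Println("confirming", pending.key)
	}()

	// with unlucky timing this program panics, either on a nil pointer
	// dereference or on the concurrent map access; not touching the ask on
	// confirmation, as this change does, removes the hazard
	wg.Wait()
}
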
@wilfred-s wilfred-s self-assigned this May 20, 2020
@TravisBuddy

Travis tests have failed

Hey @wilfred-s,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

1st Build

View build log

curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b $(go env GOPATH)/bin v1.22.2
golangci/golangci-lint info checking GitHub for tag 'v1.22.2'
golangci/golangci-lint info found version: 1.22.2 for v1.22.2/linux/amd64
golangci/golangci-lint info installed /home/travis/gopath/bin/golangci-lint
make lint
running golangci-lint
pkg/scheduler/scheduling_node.go:59: File is not `goimports`-ed with -local github.com/apache/incubator-yunikorn (goimports)
		cachedAvailable:			 resources.NewResource(),
Makefile:45: recipe for target 'lint' failed
make: *** [lint] Error 1
TravisBuddy Request Identifier: 8109fa20-9a74-11ea-ae6f-3d1c6927e18c
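
For context (not part of this PR): golangci-lint's goimports check applies gofmt formatting as well as a separate import group for the project's own packages when -local is set. The flagged line above appears to be a whitespace issue; the expected import layout looks roughly like this (the import paths below are illustrative, not copied from the file):

import (
	// standard library imports first
	"fmt"

	// third-party imports next
	"go.uber.org/zap"

	// -local github.com/apache/incubator-yunikorn keeps the project's own
	// packages in their own, final group
	"github.com/apache/incubator-yunikorn-core/pkg/common/resources"
)

// the flagged struct literal line only needs regular spacing:
//     cachedAvailable: resources.NewResource(),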

When an ask is released while an allocation for that ask is processed
the scheduler panics. The panic is caused by a race between two event
handling routines updating the same app. The changes remove the need for
the ask to be accessed on confirmation.
@TravisBuddy

Hey @wilfred-s,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: 56ef3f40-9a81-11ea-ae6f-3d1c6927e18c

@codecov-commenter

codecov-commenter commented May 20, 2020

Codecov Report

Merging #152 into master will decrease coverage by 0.05%.
The diff coverage is 63.39%.

@@            Coverage Diff             @@
##           master     #152      +/-   ##
==========================================
- Coverage   58.62%   58.57%   -0.06%     
==========================================
  Files          62       62              
  Lines        6096     6160      +64     
==========================================
+ Hits         3574     3608      +34     
- Misses       2381     2411      +30     
  Partials      141      141              
Impacted Files Coverage Δ
pkg/cache/cluster_info.go 2.43% <0.00%> (-0.01%) ⬇️
pkg/scheduler/scheduler.go 0.00% <0.00%> (ø)
pkg/scheduler/scheduling_partition.go 41.40% <15.78%> (+0.30%) ⬆️
pkg/cache/node_info.go 72.47% <72.72%> (-9.35%) ⬇️
pkg/scheduler/scheduling_node.go 81.50% <100.00%> (+0.09%) ⬆️
pkg/scheduler/scheduling_queue.go 88.21% <100.00%> (+0.32%) ⬆️
pkg/scheduler/tests/mock_scheduler.go 100.00% <100.00%> (ø)
pkg/cache/application_info.go 71.96% <0.00%> (-3.53%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ba56cc...5709c98. Read the comment docs.

When an ask is released while an allocation for that ask is processed
the scheduler panics. The panic is caused by a race between two event
handling routines updating the same app. The changes remove the need for
the ask to be accessed on confirmation.
@TravisBuddy

Hey @wilfred-s,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: d2fc0660-9a98-11ea-ae6f-3d1c6927e18c

Comment on lines 76 to 92
// Return the allocation UUID for the allocation key. This will return the first
// allocation found for the allocation key. It will only work for the case that the repeat is 1.
func (ai *ApplicationInfo) GetAllocationUUID(allocKey string) string {
	ai.lock.RLock()
	defer ai.lock.RUnlock()

	// just get the first one, ignore multiple repeats for now:
	// we cannot detect multiple repeats easily here.
	for uuid, alloc := range ai.allocations {
		if alloc.AllocationProto.AllocationKey == allocKey {
			return uuid
		}
	}
	// not found, return an empty string
	return ""
}

wilfred-s (Contributor Author)

We have no proper support for multiple repeats at the moment.

Let me explain what I think is the problem with multiple repeats and with what you are proposing in code:
If the ask has multiple repeats we might have allocated one or more of those repeats. We currently just link the ask to the allocation: one ask, multiple allocations. Allocations are added and removed based on UUID. So far all is good.

In this case we are removing an ask. One of the repeats of that ask is in the process of being allocated. That ask can have a number of confirmed allocations and has at least one allocation in flight, otherwise we would not have to clean up where we are. There might be unallocated repeats also, but they are not important.
We need to clean up the in-flight allocation, which is one allocation. We cannot just remove all allocations that have been added and confirmed. They should not be impacted as they are not related to the current change. If those allocations need to be removed then that is an allocation removal request.

Those earlier allocated ask repeats might have been running for a while, and just because an outstanding ask gets deleted I cannot and may not remove confirmed allocations.
For the case that there are multiple allocations in flight there is a whole other complexity that needs to be looked at.

In the proposed code of PR #151 you are not only cancelling the in-flight allocation but also all confirmed allocations, which would be unexpected behaviour when cancelling an ask.
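
A minimal sketch of the semantics described above (illustration only, not the actual YuniKorn structures): the cache tracks an app's allocations by UUID, the in-flight one included, and releasing the ask should only revert that single in-flight entry:

type allocation struct {
	askKey string // links back to the ask; shared by all repeats of that ask
}

// revertInFlight removes only the allocation that was never confirmed.
// Deleting every allocation that shares the ask's allocation key would also
// cancel confirmed, already running repeats, which is the unexpected
// behaviour described above.
func revertInFlight(allocs map[string]*allocation, inFlightUUID string) {
	delete(allocs, inFlightUUID)
}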

yangwwei (Contributor), May 21, 2020

> We have no proper support for multiple repeats at the moment.

I agree. But we should avoid breaking the semantics as much as possible.

> In the proposed code of PR #151 you are not only cancelling the in-flight allocation but also all confirmed allocations, which would be unexpected behaviour when cancelling an ask.

To get more consistency, we can do the following:

From the cache, it first generates the allocation, and then sends exactly the same message to the scheduler for confirmation, along with the proposal itself. Each time it handles only one allocation for one proposal.

m.EventHandlers.SchedulerEventHandler.HandleEvent(&schedulerevent.SchedulerAllocationUpdatesEvent{
  AcceptedAllocations: []*schedulerevent.AcceptedAllocationProposal{
    {
      AllocationProposal: proposal,
      Allocation: allocInfo.AllocationProto,
    },
  },
})

In the scheduler, it processes the proposal; if that fails, it notifies the cache to release the allocation:

alloc := ev.AcceptedAllocations[0]
// Update pending resource
if err := s.confirmAllocationProposal(alloc.AllocationProposal); err != nil {
	log.Logger().Error("failed to confirm allocation proposal", zap.Error(err))
	// if the scheduler could not confirm the proposal,
	// we need to notify the cache to revert the allocations
	s.eventHandlers.CacheEventHandler.HandleEvent(&cacheevent.ReleaseAllocationsEvent{
		AllocationsToRelease: []*commonevents.ReleaseAllocation{
			{
				UUID:          alloc.Allocation.UUID,
				ApplicationID: alloc.AllocationProposal.ApplicationID,
				PartitionName: alloc.AllocationProposal.PartitionName,
				Message:       "scheduler is unable to confirm the allocation",
				ReleaseType:   si.AllocationReleaseResponse_STOPPED_BY_RM,
			},
		},
	})
}

wilfred-s (Contributor Author)

The new commit has a simple fix: it passes back the UUID of the allocation in the proposal instead of the whole allocation with all the things that hang off it.
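
A rough sketch of that idea (reusing the event types from the snippet above; the UUID field on the proposal is the assumption here, the rest mirrors the earlier snippet):

if err := s.confirmAllocationProposal(proposal); err != nil {
	log.Logger().Error("failed to confirm allocation proposal", zap.Error(err))
	// the proposal itself now carries the UUID, so the revert can be built
	// without reading the ask or the cached allocation object
	s.eventHandlers.CacheEventHandler.HandleEvent(&cacheevent.ReleaseAllocationsEvent{
		AllocationsToRelease: []*commonevents.ReleaseAllocation{
			{
				UUID:          proposal.UUID, // assumed new field on the proposal
				ApplicationID: proposal.ApplicationID,
				PartitionName: proposal.PartitionName,
				Message:       "scheduler is unable to confirm the allocation",
				ReleaseType:   si.AllocationReleaseResponse_STOPPED_BY_RM,
			},
		},
	})
}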

@@ -382,6 +383,86 @@ func TestAddNewNode(t *testing.T) {
assert.Equal(t, 0, ms.getPartitionReservations()[appID], "reservations not removed from partition")
}

// simple reservation one app one queue
func TestUnReservationAndDeletion(t *testing.T) {

yangwwei (Contributor)

This needs to run multiple times; one time is not enough to reproduce the issue.

wilfred-s (Contributor Author)

It crashed for me, and the logs show that the ask is removed before the allocation is processed. If that does not happen the allocation will be confirmed, the checks at the end of the test case will then find allocations left on the nodes, the nodes will not be in the correct state, etc.
That would cause the test case to fail.

yangwwei (Contributor)

I usually only get this reproduced after a few runs. I don't think it crashes every time.

> If that does not happen the allocation will be confirmed, the checks at the end of the test case will then find allocations left on the nodes, the nodes will not be in the correct state, etc.

If that does not happen, then the allocation proposal was never made because the ask was already removed. That doesn't cover the case we expect it to cover.

wilfred-s (Contributor Author)

Running it multiple times only lowers the chance of that happening, but it still does not guarantee that the events come in correctly. The only way that could be guaranteed is to stop part of the handling or to manipulate the objects directly.
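
As a side note (not from the thread): running the test repeatedly with the race detector raises the odds of hitting the window locally, assuming the test lives under pkg/scheduler/tests:

go test -race -count=50 -run TestUnReservationAndDeletion ./pkg/scheduler/tests/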

simplify passing back the UUID of the allocation in the proposal
@TravisBuddy

Hey @wilfred-s,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: dc8839e0-a1b1-11ea-b485-534d9f3e4954

@yangwwei
Contributor

The latest commit only adds the UUID to the AllocationProposal; this should be OK for now.
+1, let's merge this.
Thanks @wilfred-s

@yangwwei yangwwei merged commit 7415412 into apache:master May 29, 2020
@wilfred-s wilfred-s deleted the yunikorn-169 branch June 1, 2020 03:21