-
Notifications
You must be signed in to change notification settings - Fork 234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YUNIKORN-1717] cache sortedQueues #567
Conversation
d790a53
to
af776a4
Compare
Signed-off-by: Frank Yang <yangpoan@gmail.com>
af776a4
to
5bf1a7e
Compare
var object string | ||
switch v := event.Args[0].(type) { | ||
case *Queue: | ||
object = v.QueuePath | ||
case string: | ||
object = v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is it "Queue" and when is it "string"? It seems to me that we only handle events from a single place so event.Args[0]
will always be a "Queue" from this point on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PartitionContext
also uses the same state machine. It gives a partition name.
yunikorn-core/pkg/scheduler/partition.go
Line 88 in 68a8282
stateMachine: objects.NewObjectState(), |
yunikorn-core/pkg/scheduler/partition.go
Line 268 in 68a8282
err := pc.stateMachine.Event(context.Background(), event.String(), pc.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then simply introduce a new state machine for Queue.
@@ -1823,3 +1853,10 @@ func (sq *Queue) recalculatePriority() int32 { | |||
sq.currentPriority = curr | |||
return priorityValueByPolicy(sq.priorityPolicy, sq.priorityOffset, curr) | |||
} | |||
|
|||
// clearSortedQueues remove sortedQueue property. This is used for child queue get into stopped state. | |||
func (sq *Queue) clearSortedQueues() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that this should be called all the way up to root. If a leaf is stopped, then it affects the entire hierarchy which is above, not just the immediate parent.
if !reflect.DeepEqual(parent.sortedQueues, parent.getSortedQueues()) { | ||
t.Errorf("parent queue did not use cached sorted queues: %v", parent) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's based on existing code, but let's use assert.Equal()
or assert.Assert()
, there's no need for t.Errorf()
. It's much cleaner.
if sq.IsLeafQueue() { | ||
return nil | ||
} | ||
|
||
sq.RLock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extract this to a separate internal method, that defer
right before the return doesn't make any sense.
Like:
func (sq *Queue) getCachedSortedQueues() []*Queue {
sq.RLock()
defer sq.RUnlock()
return sq.sortedQueues
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR.
There are some things that need to be addressed.
Also, I'd be happy to see some sort of profiling with a slightly more complicated queue hierarchy (not just "'root.default", at least some parents and 15-20 leafs) to see how this performs.
Thanks for the review. I will do some profiling and update my code tomorrow. |
Codecov Report
@@ Coverage Diff @@
## master #567 +/- ##
==========================================
+ Coverage 75.01% 75.22% +0.20%
==========================================
Files 72 72
Lines 11555 11680 +125
==========================================
+ Hits 8668 8786 +118
- Misses 2582 2587 +5
- Partials 305 307 +2
... and 5 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
AFAIK this solution won't give too much speedup. It will eliminate sorting when no allocation occurs, however, all allocation will invalidate the cache. We want to optimize for the busy case, so I don't expect to see earth shattering results. Two things we can do:
We have to check if But the thing is, eliminating request and application sorting (I have some local changes related to this) already gives a substantial performance boost. So we might as well just ignore it. We don't usually have thousands of queues. Performing quicksort (actually it's insertion sort if "n" less than 12) on 20-30 queues is not something which kills performance. |
I have some concerns about this. Queue sorting has never really showed up as a hot spot for several reasons. For one thing, there are generally far fewer queues than outstanding requests (if not, the cluster is relatively idle and performance isn't an issue). Sorting of queues also happens only once per scheduling cycle, and if we are able to schedule a task, the sorting needs to be re-done anyway, so the only thing caching the sorted queues gets us is a slight speedup when there is nothing available to schedule anyway. Some of the changes needed here to make this work (including potentially a new FSM for Queue?) strike me as a bridge too far, given the likely negligible impact on performance. Additionally, changes this invasive are going to make implementing additional features more difficult -- things like multiple partition support come to mind. In short, I think we should table this JIRA. If at some point in the future this becomes a bottleneck, we can readdress it. |
Hi @pbacsko and @craigcondit, I agree with you. Most of the time, we don't have many queues, and cached results may be stale after allocated/pending resources are updated. Also, including cached result makes the code more complex. I will close the PR and see what we can do in the future. Thanks! |
What is this PR for?
Only need sorting queues if the following occurred:
What type of PR is it?
Todos
What is the Jira issue?
https://issues.apache.org/jira/browse/YUNIKORN-1717
How should this be tested?
Screenshots (if appropriate)
Questions: