-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimize scheduling decision when there are stale activations #5344
optimize scheduling decision when there are stale activations #5344
Conversation
I think an additional optimization would be to do the stale calculation and normal available message case in the same decision such that we take this code:
and consolidate it to this:
such that if you do need more containers for available msg, you don't need to wait an additional scheduling interval to create them. So take the same example from the description and say that there are 6 existing containers. The second code block will create 1 container for the stale activations and 5 containers for the 100 available messages that are not stale bringing us to the correct 11 capacity in one scheduling interval. If there were already 10 containers available like in the original example and we just need 1 additional container for the stale activations it would enter the |
Codecov Report
@@ Coverage Diff @@
## master #5344 +/- ##
===========================================
- Coverage 81.62% 66.41% -15.21%
===========================================
Files 240 240
Lines 14313 14317 +4
Branches 610 608 -2
===========================================
- Hits 11683 9509 -2174
- Misses 2630 4808 +2178
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
// if it tries to create more containers than existing messages, we just create shortage | ||
val actualNum = (if (num > availableMsg) availableMsg else num) - inProgress | ||
val actualNum = (if (num > staleActivationNum) staleActivationNum else num) - inProgress |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still believe we need to consider all available messages.
This is because there can be both stale activations and non-stale(but about to be staled) activations.
It takes some time to provision more containers. Even if we can add more containers based on stale activations but we would easily need to add more containers if the container throughput itself is not enough.
To avoid too much over-provision, I think we can just substitute existing
as well.
val actualNum = (if (num > availableMsg) availableMsg else num) - inProgress - existing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code in my comment above covers this. I just pushed it so it's the code in the pr now. In the same decision interval, it will both make the calculation for additional containers needed for activations that are stale and increase containers based on # of available messages to meet the needed transactions per second. I think this is the most optimal case.
val actualNum = | ||
if (num + totalContainers > availableNonStaleActivations) availableNonStaleActivations - totalContainers | ||
else num | ||
addServersIfPossible( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a unit test to cover this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added two test cases to cover 1. when need to add containers both for stale messages and increase tps for non-stale available messages 2. when only need to add container to cover stale messages and tps is still met for non-stale messages.
Would appreciate approval on this one too when you have some time @style95
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a minor nit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a minor nit.
...test/scala/org/apache/openwhisk/core/scheduler/queue/test/SchedulingDecisionMakerTests.scala
Show resolved
Hide resolved
…#5344) * optimize scheduling decision when there are stale activations * further optimization * scalafmt * add new test cases Co-authored-by: Brendan Doyle <brendand@qualtrics.com>
Description
I believe this optimizes scheduling decisions and reduces overprovisioning without sacrificing any performance. The reasoning here is if a queue has stale activations, you risk over-provisioning if you make the calculation based on all available messages, not just stale ones. Here is an example:
t0ms: 5 activation arrives and for whatever reason it takes 100ms before the activation is processed even if there are containers existing (random network latency or whatever the brief blip may be). Let's say container throughput is 10 (10ms duration) and there's already 10 containers.
t50ms: 100 activations arrive and are available
t100ms: 5 activation is stale and enters the decision case when there is a stale activation. However it uses all 105 activations available to make a scheduling decision. 105 / 10 will create 11 containers. But it's not yet certain that the available messages can't be processed before going stale. You've therefore doubled containers for the action to 21 unnecessarily when 11 would be satisfactory. What this pr does is use the 5 stale activations at decision time to determine how much additional capacity is needed instead of total available activations, in this case it would be ceil(5 / 10) = 1.
Related issue and scope
My changes affect the following components
Types of changes
Checklist: