
optimize scheduling decision when there are stale activations #5344

Merged: 4 commits, Nov 1, 2022

Conversation

bdoyle0182 (Contributor) commented Oct 26, 2022

Description

I believe this optimizes scheduling decisions and reduces over-provisioning without sacrificing any performance. The reasoning is that if a queue has stale activations, you risk over-provisioning by basing the calculation on all available messages rather than just the stale ones. Here is an example:

t0ms: 5 activations arrive, and for whatever reason it takes 100ms before they are processed even though containers exist (random network latency or whatever the brief blip may be). Say container throughput is 10 (10ms duration each) and there are already 10 containers.
t50ms: 100 activations arrive and are available.
t100ms: the 5 activations are now stale, and the decision enters the stale-activation case. However, it uses all 105 available activations to make the scheduling decision: ceil(105 / 10) will create 11 containers. But it is not yet certain that the available messages can't be processed before going stale, so you have unnecessarily doubled the containers for the action to 21 when 11 would be satisfactory. What this PR does is use the 5 stale activations at decision time to determine how much additional capacity is needed, instead of the total available activations; in this case that is ceil(5 / 10) = 1.
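The arithmetic in this example can be sketched in a few lines of Scala. This is a minimal standalone sketch, not the PR's code; `neededContainers` is a hypothetical helper:

```scala
object StaleProvisionSketch {
  // Containers needed to drain `stale` activations, given each container can
  // absorb `throughput` activations before they would go stale.
  def neededContainers(stale: Int, throughput: Double): Int =
    math.ceil(stale / throughput).toInt

  def main(args: Array[String]): Unit = {
    // Example from the description: 5 stale activations, throughput 10.
    // Sizing on stale messages only adds 1 container (10 existing + 1 = 11),
    // while sizing on all 105 available messages adds 11 (21 total).
    println(neededContainers(5, 10.0))   // stale-only sizing
    println(neededContainers(105, 10.0)) // sizing on all available messages
  }
}
```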

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

bdoyle0182 (Contributor, Author) commented Oct 26, 2022

I think an additional optimization would be to do the stale calculation and the normal available-message case in the same decision, so that we take this code:

          case (Running, Some(duration)) if staleActivationNum > 0 =>
            // we can safely get the value as we already checked the existence
            val containerThroughput = staleThreshold / duration
            val num = ceiling(staleActivationNum.toDouble / containerThroughput)
            // if it tries to create more containers than existing messages, we just create shortage
            val actualNum = (if (num > staleActivationNum) staleActivationNum else num) - inProgress
            addServersIfPossible(
              existing,
              inProgress,
              containerThroughput,
              availableMsg,
              capacity,
              actualNum,
              staleActivationNum,
              duration,
              Running)

          // need more containers and a message is already processed
          case (Running, Some(duration)) =>
            // we can safely get the value as we already checked the existence
            val containerThroughput = staleThreshold / duration
            val expectedTps = containerThroughput * (existing + inProgress)

            if (availableMsg >= expectedTps && existing + inProgress < availableMsg) {
              val num = ceiling((availableMsg / containerThroughput) - existing - inProgress)
              // if it tries to create more containers than existing messages, we just create shortage
              val actualNum = if (num + totalContainers > availableMsg) availableMsg - totalContainers else num
              addServersIfPossible(
                existing,
                inProgress,
                containerThroughput,
                availableMsg,
                capacity,
                actualNum,
                staleActivationNum,
                duration,
                Running)
            } else {
              Future.successful(DecisionResults(Skip, 0))
            }

and consolidate it to this:

          // need more containers and a message is already processed
          case (Running, Some(duration)) =>
            // we can safely get the value as we already checked the existence
            val containerThroughput = staleThreshold / duration
            val expectedTps = containerThroughput * (existing + inProgress)
            val availableNonStaleActivations = availableMsg - staleActivationNum
            
            var staleContainerProvision = 0
            if (staleActivationNum > 0) {
              val num = ceiling(staleActivationNum.toDouble / containerThroughput)
              // if it tries to create more containers than existing messages, we just create shortage
              staleContainerProvision = (if (num > staleActivationNum) staleActivationNum else num) - inProgress
            }
            
            if (availableNonStaleActivations >= expectedTps && existing + inProgress < availableMsg) {
              val num = ceiling((availableNonStaleActivations / containerThroughput) - existing - inProgress)
              // if it tries to create more containers than existing messages, we just create shortage
              val actualNum = if (num + totalContainers > availableNonStaleActivations) availableNonStaleActivations - totalContainers else num
              addServersIfPossible(
                existing,
                inProgress,
                containerThroughput,
                availableMsg,
                capacity,
                actualNum + staleContainerProvision,
                staleActivationNum,
                duration,
                Running)
            } else if (staleContainerProvision > 0) {
              addServersIfPossible(
                existing,
                inProgress,
                containerThroughput,
                availableMsg,
                capacity,
                staleContainerProvision,
                staleActivationNum,
                duration,
                Running)
            } else {
              Future.successful(DecisionResults(Skip, 0))
            }

so that if you do need more containers for the available messages, you don't have to wait an additional scheduling interval to create them. Take the same example from the description, but say there are 6 existing containers. The second code block will add 5 containers in one scheduling interval: 1 for the stale activations and 4 to raise throughput for the 100 non-stale available messages, bringing us to the correct capacity of 11. If there were already 10 containers, as in the original example, and we just need 1 additional container for the stale activations, it would enter the else-if case and add just that 1 container, bringing it to 11.
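The consolidated branch's arithmetic can be checked with a standalone sketch. `provision` is a hypothetical helper mirroring the logic above (it omits the `totalContainers` cap, which does not affect these examples), assuming `addServersIfPossible` simply creates the requested number of containers:

```scala
object CombinedDecisionSketch {
  // Returns (containers for stale messages, containers to raise throughput
  // for non-stale messages), mirroring the consolidated branch.
  def provision(availableMsg: Int,
                staleActivationNum: Int,
                existing: Int,
                inProgress: Int,
                containerThroughput: Double): (Int, Int) = {
    val staleProvision =
      if (staleActivationNum > 0) {
        val num = math.ceil(staleActivationNum / containerThroughput).toInt
        // never create more containers than there are stale messages
        math.min(num, staleActivationNum) - inProgress
      } else 0
    val nonStale = availableMsg - staleActivationNum
    val expectedTps = containerThroughput * (existing + inProgress)
    val tpsProvision =
      if (nonStale >= expectedTps && existing + inProgress < availableMsg)
        math.ceil(nonStale / containerThroughput - existing - inProgress).toInt
      else 0
    (staleProvision, tpsProvision)
  }

  def main(args: Array[String]): Unit = {
    // 6 existing containers: 1 for stale + 4 for throughput = 11 total capacity
    println(provision(105, 5, 6, 0, 10.0))
    // 10 existing containers: throughput already met, only the stale container
    println(provision(105, 5, 10, 0, 10.0))
  }
}
```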

codecov-commenter commented Oct 27, 2022

Codecov Report

Merging #5344 (3199cd0) into master (ef725a6) will decrease coverage by 15.20%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           master    #5344       +/-   ##
===========================================
- Coverage   81.62%   66.41%   -15.21%     
===========================================
  Files         240      240               
  Lines       14313    14317        +4     
  Branches      610      608        -2     
===========================================
- Hits        11683     9509     -2174     
- Misses       2630     4808     +2178     
Impacted Files Coverage Δ
...ntainerpool/v2/FunctionPullingContainerProxy.scala 78.07% <ø> (-0.34%) ⬇️
...core/scheduler/queue/SchedulingDecisionMaker.scala 0.00% <0.00%> (-97.60%) ⬇️
...scala/org/apache/openwhisk/common/time/Clock.scala 0.00% <0.00%> (-100.00%) ⬇️
...nwhisk/core/scheduler/queue/ContainerCounter.scala 0.00% <0.00%> (-100.00%) ⬇️
...hisk/core/scheduler/message/ContainerMessage.scala 0.00% <0.00%> (-100.00%) ⬇️
...core/database/cosmosdb/RxObservableImplicits.scala 0.00% <0.00%> (-100.00%) ⬇️
...ore/database/cosmosdb/cache/CacheInvalidator.scala 0.00% <0.00%> (-100.00%) ⬇️
...e/database/cosmosdb/cache/ChangeFeedConsumer.scala 0.00% <0.00%> (-100.00%) ⬇️
...sk/core/scheduler/container/ContainerManager.scala 0.00% <0.00%> (-95.91%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0.00% <0.00%> (-95.85%) ⬇️
... and 46 more


  // if it tries to create more containers than existing messages, we just create shortage
- val actualNum = (if (num > availableMsg) availableMsg else num) - inProgress
+ val actualNum = (if (num > staleActivationNum) staleActivationNum else num) - inProgress
Member:
I still believe we need to consider all available messages, because there can be both stale activations and non-stale (but about-to-become-stale) activations. It takes some time to provision more containers; even if we add containers based on stale activations alone, we would soon need more if the container throughput itself is not enough. To avoid too much over-provisioning, I think we can just subtract existing as well:

val actualNum = (if (num > availableMsg) availableMsg else num) - inProgress - existing
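Plugging the description's numbers into this suggestion shows it also avoids the over-provision in that example. This is a hedged sketch, assuming `num = ceil(availableMsg / containerThroughput)` as in the pre-PR branch; `actualNum` here is a standalone helper, not the project's code:

```scala
object ReviewerFormulaSketch {
  // Reviewer's variant: size on all available messages, but subtract both
  // in-progress and existing containers to avoid over-provisioning.
  def actualNum(availableMsg: Int,
                inProgress: Int,
                existing: Int,
                containerThroughput: Double): Int = {
    val num = math.ceil(availableMsg / containerThroughput).toInt
    (if (num > availableMsg) availableMsg else num) - inProgress - existing
  }

  def main(args: Array[String]): Unit = {
    // 105 available messages, throughput 10, 10 existing containers:
    // ceil(105 / 10) = 11, minus 10 existing = 1 additional container.
    println(actualNum(105, 0, 10, 10.0))
  }
}
```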

Contributor (Author):
I think the code in my comment above covers this. I just pushed it, so it's the code in the PR now. In the same decision interval, it both calculates the additional containers needed for stale activations and increases containers based on the number of available messages to meet the needed transactions per second. I think this is the most optimal case.

val actualNum =
if (num + totalContainers > availableNonStaleActivations) availableNonStaleActivations - totalContainers
else num
addServersIfPossible(
Member:
Can we add a unit test to cover this case?

Contributor (Author):
Added two test cases: 1. when containers must be added both for stale messages and to increase TPS for non-stale available messages; 2. when a container is needed only to cover stale messages and TPS is still met for the non-stale messages.

Would appreciate approval on this one too when you have some time @style95

style95 (Member) left a comment:

LGTM with a minor nit.


@bdoyle0182 bdoyle0182 merged commit 07c9202 into apache:master Nov 1, 2022
msciabarra pushed a commit to nuvolaris/openwhisk that referenced this pull request Nov 23, 2022
…#5344)

* optimize scheduling decision when there are stale activations

* further optimization

* scalafmt

* add new test cases

Co-authored-by: Brendan Doyle <brendand@qualtrics.com>