
optimize scheduling decision when there are stale activations #5344

Merged: 4 commits, Nov 1, 2022

Conversation

bdoyle0182 (Contributor) commented Oct 26, 2022

Description

I believe this optimizes scheduling decisions and reduces over-provisioning without sacrificing any performance. The reasoning is that if a queue has stale activations, you risk over-provisioning by basing the calculation on all available messages rather than just the stale ones. Here is an example:

t0ms: 5 activations arrive, and for whatever reason it takes 100ms before they are processed even though containers exist (random network latency or whatever the brief blip may be). Say container throughput is 10 (10ms duration each) and there are already 10 containers.
t50ms: 100 activations arrive and are available.
t100ms: the 5 activations are now stale, and the decision enters the stale-activation case. However, it uses all 105 available activations to make the scheduling decision: ceil(105 / 10) will create 11 containers. But it is not yet certain that the available messages can't be processed before going stale, so you have unnecessarily doubled the containers for the action to 21 when 11 would be satisfactory. What this PR does is use the 5 stale activations at decision time to determine how much additional capacity is needed, instead of the total available activations; in this case that is ceil(5 / 10) = 1.
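The arithmetic in this example can be sketched in a few lines of Scala. This is a minimal standalone sketch, not the PR's code; `neededContainers` is a hypothetical helper:

```scala
object StaleProvisionSketch {
  // Containers needed to drain `stale` activations, given each container can
  // absorb `throughput` activations before they would go stale.
  def neededContainers(stale: Int, throughput: Double): Int =
    math.ceil(stale / throughput).toInt

  def main(args: Array[String]): Unit = {
    // Example from the description: 5 stale activations, throughput 10.
    // Sizing on stale messages only adds 1 container (10 existing + 1 = 11),
    // while sizing on all 105 available messages adds 11 (21 total).
    println(neededContainers(5, 10.0))   // stale-only sizing
    println(neededContainers(105, 10.0)) // sizing on all available messages
  }
}
```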

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

bdoyle0182 (Contributor, Author) commented Oct 26, 2022

I think an additional optimization would be to do the stale calculation and the normal available-message case in the same decision, so that we take this code:

          case (Running, Some(duration)) if staleActivationNum > 0 =>
            // we can safely get the value as we already checked the existence
            val containerThroughput = staleThreshold / duration
            val num = ceiling(staleActivationNum.toDouble / containerThroughput)
            // if it tries to create more containers than existing messages, we just create shortage
            val actualNum = (if (num > staleActivationNum) staleActivationNum else num) - inProgress
            addServersIfPossible(
              existing,
              inProgress,
              containerThroughput,
              availableMsg,
              capacity,
              actualNum,
              staleActivationNum,
              duration,
              Running)

          // need more containers and a message is already processed
          case (Running, Some(duration)) =>
            // we can safely get the value as we already checked the existence
            val containerThroughput = staleThreshold / duration
            val expectedTps = containerThroughput * (existing + inProgress)

            if (availableMsg >= expectedTps && existing + inProgress < availableMsg) {
              val num = ceiling((availableMsg / containerThroughput) - existing - inProgress)
              // if it tries to create more containers than existing messages, we just create shortage
              val actualNum = if (num + totalContainers > availableMsg) availableMsg - totalContainers else num
              addServersIfPossible(
                existing,
                inProgress,
                containerThroughput,
                availableMsg,
                capacity,
                actualNum,
                staleActivationNum,
                duration,
                Running)
            } else {
              Future.successful(DecisionResults(Skip, 0))
            }

and consolidate it to this:

          // need more containers and a message is already processed
          case (Running, Some(duration)) =>
            // we can safely get the value as we already checked the existence
            val containerThroughput = staleThreshold / duration
            val expectedTps = containerThroughput * (existing + inProgress)
            val availableNonStaleActivations = availableMsg - staleActivationNum
            
            var staleContainerProvision = 0
            if (staleActivationNum > 0) {
              val num = ceiling(staleActivationNum.toDouble / containerThroughput)
              // if it tries to create more containers than existing messages, we just create shortage
              staleContainerProvision = (if (num > staleActivationNum) staleActivationNum else num) - inProgress
            }
            
            if (availableNonStaleActivations >= expectedTps && existing + inProgress < availableMsg) {
              val num = ceiling((availableNonStaleActivations / containerThroughput) - existing - inProgress)
              // if it tries to create more containers than existing messages, we just create shortage
              val actualNum = if (num + totalContainers > availableNonStaleActivations) availableNonStaleActivations - totalContainers else num
              addServersIfPossible(
                existing,
                inProgress,
                containerThroughput,
                availableMsg,
                capacity,
                actualNum + staleContainerProvision,
                staleActivationNum,
                duration,
                Running)
            } else if (staleContainerProvision > 0) {
              addServersIfPossible(
                existing,
                inProgress,
                containerThroughput,
                availableMsg,
                capacity,
                staleContainerProvision,
                staleActivationNum,
                duration,
                Running)
            } else {
              Future.successful(DecisionResults(Skip, 0))
            }

so that if you do need more containers for the available messages, you don't have to wait an additional scheduling interval to create them. Take the same example from the description, but say there are 6 existing containers. The second code block will add 5 containers in one scheduling interval: 1 for the stale activations and 4 to raise throughput for the 100 non-stale available messages, bringing us to the correct capacity of 11. If there were already 10 containers, as in the original example, and we just need 1 additional container for the stale activations, it would enter the else-if case and add just that 1 container, bringing it to 11.
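The consolidated branch's arithmetic can be checked with a standalone sketch. `provision` is a hypothetical helper mirroring the logic above (it omits the `totalContainers` cap, which does not affect these examples), assuming `addServersIfPossible` simply creates the requested number of containers:

```scala
object CombinedDecisionSketch {
  // Returns (containers for stale messages, containers to raise throughput
  // for non-stale messages), mirroring the consolidated branch.
  def provision(availableMsg: Int,
                staleActivationNum: Int,
                existing: Int,
                inProgress: Int,
                containerThroughput: Double): (Int, Int) = {
    val staleProvision =
      if (staleActivationNum > 0) {
        val num = math.ceil(staleActivationNum / containerThroughput).toInt
        // never create more containers than there are stale messages
        math.min(num, staleActivationNum) - inProgress
      } else 0
    val nonStale = availableMsg - staleActivationNum
    val expectedTps = containerThroughput * (existing + inProgress)
    val tpsProvision =
      if (nonStale >= expectedTps && existing + inProgress < availableMsg)
        math.ceil(nonStale / containerThroughput - existing - inProgress).toInt
      else 0
    (staleProvision, tpsProvision)
  }

  def main(args: Array[String]): Unit = {
    // 6 existing containers: 1 for stale + 4 for throughput = 11 total capacity
    println(provision(105, 5, 6, 0, 10.0))
    // 10 existing containers: throughput already met, only the stale container
    println(provision(105, 5, 10, 0, 10.0))
  }
}
```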

codecov-commenter commented Oct 27, 2022

Codecov Report

Merging #5344 (3199cd0) into master (ef725a6) will decrease coverage by 15.20%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           master    #5344       +/-   ##
===========================================
- Coverage   81.62%   66.41%   -15.21%     
===========================================
  Files         240      240               
  Lines       14313    14317        +4     
  Branches      610      608        -2     
===========================================
- Hits        11683     9509     -2174     
- Misses       2630     4808     +2178     
Impacted Files Coverage Δ
...ntainerpool/v2/FunctionPullingContainerProxy.scala 78.07% <ø> (-0.34%) ⬇️
...core/scheduler/queue/SchedulingDecisionMaker.scala 0.00% <0.00%> (-97.60%) ⬇️
...scala/org/apache/openwhisk/common/time/Clock.scala 0.00% <0.00%> (-100.00%) ⬇️
...nwhisk/core/scheduler/queue/ContainerCounter.scala 0.00% <0.00%> (-100.00%) ⬇️
...hisk/core/scheduler/message/ContainerMessage.scala 0.00% <0.00%> (-100.00%) ⬇️
...core/database/cosmosdb/RxObservableImplicits.scala 0.00% <0.00%> (-100.00%) ⬇️
...ore/database/cosmosdb/cache/CacheInvalidator.scala 0.00% <0.00%> (-100.00%) ⬇️
...e/database/cosmosdb/cache/ChangeFeedConsumer.scala 0.00% <0.00%> (-100.00%) ⬇️
...sk/core/scheduler/container/ContainerManager.scala 0.00% <0.00%> (-95.91%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0.00% <0.00%> (-95.85%) ⬇️
... and 46 more


  // if it tries to create more containers than existing messages, we just create shortage
- val actualNum = (if (num > availableMsg) availableMsg else num) - inProgress
+ val actualNum = (if (num > staleActivationNum) staleActivationNum else num) - inProgress
Member:
I still believe we need to consider all available messages, because there can be both stale activations and non-stale (but about-to-become-stale) activations. It takes some time to provision more containers; even if we add containers based on stale activations alone, we would soon need more if the container throughput itself is not enough. To avoid too much over-provisioning, I think we can just subtract existing as well:

val actualNum = (if (num > availableMsg) availableMsg else num) - inProgress - existing
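Plugging the description's numbers into this suggestion shows it also avoids the over-provision in that example. This is a hedged sketch, assuming `num = ceil(availableMsg / containerThroughput)` as in the pre-PR branch; `actualNum` here is a standalone helper, not the project's code:

```scala
object ReviewerFormulaSketch {
  // Reviewer's variant: size on all available messages, but subtract both
  // in-progress and existing containers to avoid over-provisioning.
  def actualNum(availableMsg: Int,
                inProgress: Int,
                existing: Int,
                containerThroughput: Double): Int = {
    val num = math.ceil(availableMsg / containerThroughput).toInt
    (if (num > availableMsg) availableMsg else num) - inProgress - existing
  }

  def main(args: Array[String]): Unit = {
    // 105 available messages, throughput 10, 10 existing containers:
    // ceil(105 / 10) = 11, minus 10 existing = 1 additional container.
    println(actualNum(105, 0, 10, 10.0))
  }
}
```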

Contributor (Author):
I think the code in my comment above covers this. I just pushed it, so it's the code in the PR now. In the same decision interval, it both calculates the additional containers needed for stale activations and increases containers based on the number of available messages to meet the needed transactions per second. I think this is the most optimal case.

val actualNum =
if (num + totalContainers > availableNonStaleActivations) availableNonStaleActivations - totalContainers
else num
addServersIfPossible(
Member:
Can we add a unit test to cover this case?

Contributor (Author):
Added two test cases: 1. when containers must be added both for stale messages and to increase TPS for non-stale available messages; 2. when a container is needed only to cover stale messages and TPS is still met for the non-stale messages.

Would appreciate approval on this one too when you have some time @style95

style95 (Member) left a comment:

LGTM with a minor nit.


@bdoyle0182 bdoyle0182 merged commit 07c9202 into apache:master Nov 1, 2022
msciabarra pushed a commit to nuvolaris/openwhisk that referenced this pull request Nov 23, 2022
…#5344)

* optimize scheduling decision when there are stale activations

* further optimization

* scalafmt

* add new test cases

Co-authored-by: Brendan Doyle <brendand@qualtrics.com>