
Adjust prewarm container dynamically #4871

Merged

Conversation

ningyougang
Contributor

@ningyougang ningyougang commented Apr 2, 2020

Description

When the invoker starts up, it creates some prewarmed containers in the prewarmed pool in advance.

Currently, if those prewarmed containers are never used, they stay around forever,
which can waste resources.

So we need some mechanism to save memory resources, e.g.:

  • If a prewarmed container is not used for some time, it should be deleted automatically.
  • When a cold start happens, we should supplement a prewarmed container matching that kind/memory at runtime.

Note: I haven't added test cases yet; I will add them after some reviews.
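For illustration, a stemcell entry in runtimes.json could then carry the extra knobs discussed below (field names are taken from the diffs in this thread; the exact layout changes during review):

  "stemCells": [
    {
      "count": 2,
      "memory": "256 MB",
      "ttlMinutes": 10,
      "threshold": 5,
      "increment": 1
    }
  ]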

Related issue and scope

#4725 (comment)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@@ -58,7 +58,8 @@
"stemCells": [
{
"count": 2,
"memory": "256 MB"
"memory": "256 MB",
"ttlMinutes": 10
Member

How about adding finer granularity by changing the unit to seconds?

Contributor Author

@ningyougang ningyougang Apr 3, 2020

It seems the unit of seconds is very small and the unit of hours is too big,
so I chose minutes (second < minute < hour).

Does it make sense if we use seconds instead?

Contributor

Why not FiniteDuration? So that operators can make the value whatever is appropriate for them.
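For example (a minimal sketch, not code from this PR), a duration string from the config maps onto one type regardless of the unit the operator picks:

  import scala.concurrent.duration._

  // "10 minutes", "600 seconds" and "1 hour" all parse into the same Duration type,
  // so the operator chooses the granularity rather than the schema.
  val ttl: Duration = Duration("10 minutes")
  assert(ttl == 10.minutes)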

Contributor Author

@tysonnorris, got it.

Contributor Author

Already changed to Duration

}
}
for ((container, data) <- containers) {
if (Duration.between(data.lastUsed, Instant.now).compareTo(Duration.ofMinutes(ttlMinutes)) > 0) {
Member

If there is no activation for a prewarm kind within the given TTL, the next invocation will always be a cold start.
Also, even if we can add more prewarm containers when cold starts happen, I think it may not fit well for some workloads. If load spikes for some runtime kind, it will cause a lot of cold starts, and the pool will try to create many prewarm containers at the same time. As we know, Docker operations are quite expensive, so it will end up with a huge number of container creations along with long wait times.

Contributor Author

@ningyougang ningyougang Apr 3, 2020

Yes, I agree with you.

I introduced a new mechanism to avoid those cold starts:
added threshold and increment to runtimes.json.
After prewarmed containers have been deleted because they were unused,
if the number of activations for that kind/memory in the previous minute is >= threshold, create increment prewarmed containers.

Refer to the newly added commit: 8c36f33

Contributor Author

@style95 already fixed

@ningyougang ningyougang force-pushed the adjust-prewarm-container-dynamically branch 3 times, most recently from 2fd4dd1 to 8c36f33 Compare April 8, 2020 05:48
.foreach { _ =>
val time = new SimpleDateFormat("yyyy-MM-dd-HH:mm").format(Calendar.getInstance().getTime)
val key = s"${kind},${memoryLimit.toMB},${time}"
coldStartCount.getOrElseUpdate(key, new AtomicInteger(0)).getAndIncrement()
Contributor

Does this need AtomicInteger? Container allocation should happen in actor message processing, which is single threaded

Contributor Author

Already fixed

@@ -80,8 +85,49 @@ class ContainerPool(childFactory: ActorRefFactory => ActorRef,
//periodically emit metrics (don't need to do this for each message!)
context.system.scheduler.schedule(30.seconds, 10.seconds, self, EmitMetrics)

// Key is `kind,memory,time`, value is the number of cold starts in that minute
val coldStartCount = TrieMap[String, AtomicInteger]()
Contributor

This can be a var immutable.Map instead, I think?

Contributor

@tysonnorris tysonnorris Apr 8, 2020

Actually, I would probably not create a separate map here, but rather add expiration: Option[Deadline] to the PreWarmedData class. Then you can track expiration AND more easily filter to find expired prewarms.
EDIT: sorry, to be clear: the expiration is required to remove unused ones, but there is a separate issue that needs to be tracked, which is a prewarm "miss", where the action matches a prewarm config but no prewarm was available. For the miss, you can use an immutable.Map that is reset every minute (or a configurable time period), without tracking timestamps or expirations there.
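A minimal sketch of that idea (simplified fields; the real PreWarmedData also carries the container, kind and memory limit):

  import scala.concurrent.duration._

  case class PreWarmedData(kind: String,
                           memoryMB: Long,
                           expires: Option[Deadline] = None) {
    // a prewarm without a configured TTL never expires
    def isExpired(): Boolean = expires.exists(_.isOverdue())
  }

  // a prewarm created with a 10 minute TTL
  val data = PreWarmedData("nodejs:10", 256, Some(10.minutes.fromNow))
  // expired prewarms can then be filtered out directly, without recomputing durations
  val expired = List(data).filter(_.isExpired())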

Contributor Author

Already fixed

@tysonnorris
Contributor

Please also update the ContainerPool.emitMetrics() function to include count per stemcell config, instead of just aggregate count of all stemcells - since with fluctuating numbers, these values will get very confusing.
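A rough sketch of the per-config counting (stand-in types; the real pool maps ActorRef -> PreWarmedData and memory is a ByteSize):

  case class PreWarm(kind: String, memoryMB: Long)

  val prewarmedPool = Map(
    "c1" -> PreWarm("nodejs:10", 256),
    "c2" -> PreWarm("nodejs:10", 256),
    "c3" -> PreWarm("python:3", 256))

  // one count per stemcell config instead of a single aggregate count,
  // which emitMetrics() can then report per (kind, memory) pair
  val countsPerConfig: Map[(String, Long), Int] =
    prewarmedPool.values
      .groupBy(d => (d.kind, d.memoryMB))
      .map { case (cfg, entries) => cfg -> entries.size }
  // Map((nodejs:10,256) -> 2, (python:3,256) -> 1)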

"memory": "256 MB",
"ttlMinutes": 10,
"threshold": 5,
"increment": 1
Contributor

In this scenario, I think "count" behaves like "initialCount", right? So now I think we should also have a "maxCount" so that additional prewarms cannot be created past the maxCount limit.

Contributor

We could consider renaming "count" to "initialCount", but I'm not sure this is an issue for backwards compatibility.

Contributor Author

Already fixed

@ningyougang ningyougang force-pushed the adjust-prewarm-container-dynamically branch 3 times, most recently from 98e06dd to cf30f7c Compare April 11, 2020 09:51
@ningyougang ningyougang reopened this Apr 12, 2020
@ningyougang
Contributor Author

@style95 @tysonnorris
Do you have any other review comments?

backfillPrewarms(true)

// check periodically every 1 minute, delete prewarmed container if unused for sometime
context.system.scheduler.schedule(10.seconds, 1.minute, self, DeleteUnusedPrewarmedContainer)
Member

So the main idea is removing unused prewarm containers every minute and creating more prewarm containers based on the usage during the last minute.

I think it is a good starting point.
This is a kind of heuristic approach, predicting the future based on history.

One thing to consider is that we can't exactly predict the required prewarm types.
Any kind of activation can come at any time.
If an action is sparsely invoked, or invoked with a bigger interval than the prewarm TTL, it will always trigger a cold start.
So how about introducing a minimum limit as well?
We can configure the minimum number of prewarm containers while controlling the ratio of them based on the usage.

Contributor Author

Regarding "So how about introducing a minimum limit as well?"
You mean adding minCount to runtimes.json?

Regarding "We can configure the minimum number of prewarm containers while controlling the ratio of them based on the usage."
What does that mean? Can the minimum number be changed in some cases?

Contributor

It is true that without a minimum, the prewarms may all go to zero, which is what you need on blackbox-dedicated invokers if you don't want any prewarms running there.
So the stemcell config could have:
initialCount: this should probably be whatever value is used today for the number of prewarms of that config.
minCount: 0 is required if you want to prevent the case where unused prewarms waste resources, but it implies that you will get a cold start for that config on that invoker.
maxCount: this is an upper limit on the max number of prewarms for that config.

Contributor

For future work, I would also consider:

  • Does this invoker support only blackbox? If so, it can safely scale down to 0 prewarms, since blackbox are never prewarmed.
  • Can new prewarm configs be completely inferred? So that if lots of actions of a particular config arrive, they eventually get benefits of having prewarms.
  • Can blackbox then be allowed to have prewarms, if they receive enough traffic at this invoker?

Contributor Author

Already added minCount in runtimes.json

calendar.add(Calendar.MINUTE, -1)
val lastDate = new SimpleDateFormat("yyyy-MM-dd-HH:mm").format(calendar.getTime)
val key = s"${kind},${memory.toMB},${lastDate}"
coldStartCount.get(key) match {
Member

Can we use coldStartCount.remove(key)?

Contributor Author

If we changed from immutable.Map to TrieMap, we could use coldStartCount.remove.
But here coldStartCount is only accessed from a single thread because of the actor, so I think using immutable.Map is OK.

Contributor

+1 for immutable.Map

val calendar = Calendar.getInstance()
calendar.add(Calendar.MINUTE, -1)
val lastDate = new SimpleDateFormat("yyyy-MM-dd-HH:mm").format(calendar.getTime)
val key = s"${kind},${memory.toMB},${lastDate}"
Member

I think it would be better to use a typed key such as a case class rather than a String key with multiple fields.
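For example (a simplified sketch; the PR's version uses ByteSize for memory and eventually drops the time component):

  case class ColdStartKey(kind: String, memoryMB: Long)

  var coldStartCount = Map.empty[ColdStartKey, Int]

  def incrementColdStartCount(kind: String, memoryMB: Long): Unit = {
    val key = ColdStartKey(kind, memoryMB)
    coldStartCount = coldStartCount + (key -> (coldStartCount.getOrElse(key, 0) + 1))
  }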

Contributor Author

ok

@@ -80,8 +86,59 @@ class ContainerPool(childFactory: ActorRefFactory => ActorRef,
//periodically emit metrics (don't need to do this for each message!)
context.system.scheduler.schedule(30.seconds, 10.seconds, self, EmitMetrics)

// Key is `kind,memory,time`, value is the number of cold starts in that minute
var coldStartCount = immutable.Map.empty[String, Int]
Member

It seems this map can be used from multiple threads; should we use a concurrent map instead?

Contributor Author

Oh, really? I think coldStartCount is only accessed from a single thread.

Contributor

immutable.Map should be fine if it is only modified during actor message processing

Member

Oh, OK. I thought it was being accessed by a scheduler thread and the actor thread at the same time,
but it is not.

// prewarmed container got removed
case PreWarmedContainerRemoved =>
prewarmedPool.get(sender()).foreach { _ =>
logging.info(this, "prewarmed container is deleted due to unused for long time")
Member

Would be great to include the type of prewarm container.

Contributor Author

Already added

@@ -313,12 +391,12 @@ class ContainerPool(childFactory: ActorRefFactory => ActorRef,
}
val startingCount = prewarmStartingPool.count(p => p._2._1 == kind && p._2._2 == memory)
val containerCount = currentCount + startingCount
if (containerCount < config.count) {
if (containerCount < config.initialCount) {
Member

config.initialCount is being used multiple times.
How about introducing a local variable for this?

Contributor Author

Already added

case None => coldStartCount = coldStartCount + (key -> 1)
}
for ((k, v) <- coldStartCount) {
logging.info(this, s"===statistics the cold start, k: ${k}, v: ${v}")
Member

Removing ===?

Contributor Author

Already removed

}

/** statistics the cold start */
def countColdStart(kind: String, memoryLimit: ByteSize): Unit = {
Member

It seems this is to increase a count for the number of cold-starts.
How about renaming this to something more intuitive such as incrementColdStartCount?

Contributor Author

Already changed to incrementColdStartCount

}
.foreach { _ =>
val time = new SimpleDateFormat("yyyy-MM-dd-HH:mm").format(Calendar.getInstance().getTime)
val key = s"${kind},${memoryLimit.toMB},${time}"
Member

Same with the above.
A typed key is preferred.

Contributor Author

Changed

context.system.scheduler.schedule(10.seconds, 1.minute, self, DeleteUnusedPrewarmedContainer)

// check periodically for the cold start and create some increment containers automatically if activation >= threshold
context.system.scheduler.schedule(1.minute, 1.minute, self, SupplementPrewarmedContainer)
Contributor

Use same scheduled message to do both delete + supplement

Contributor Author

Already used same message.

@@ -146,7 +146,7 @@ case class PreWarmedData(override val container: Container,
kind: String,
override val memoryLimit: ByteSize,
override val activeActivationCount: Int = 0)
extends ContainerStarted(container, Instant.EPOCH, memoryLimit, activeActivationCount)
extends ContainerStarted(container, Instant.now(), memoryLimit, activeActivationCount)
Contributor

Overloading the lastUsed property is confusing - I would move that property to the ContainerInUse trait, and create a dedicated property in PreWarmedData for tracking an expiration Instant or Deadline, so that every scheduled period can easily check (without calculating the expiration) whether the expiration is in the past.

Contributor Author

Your approach ("I would move that property to the ContainerInUse trait, and create a dedicated property in PreWarmedData for tracking an expiration Instant or Deadline") sounds good,

but when I tried to write it that way, it seems I couldn't get it to work well. Can you check it locally?

Contributor

I think it takes refactoring some or most of the uses of "ContainerData" generic type. For now you can just:

  • change prewarmedPool to type var prewarmedPool = immutable.Map.empty[ActorRef, PreWarmedData]
  • add a val created = Instant.now() to PrewarmedData type
    Then leave the ContainerData.lastUsed property for use when the container is actually allocated to an action.

Contributor Author

Updated accordingly

for ((container, data) <- containers) {
if (JDuration.between(data.lastUsed, Instant.now).compareTo(JDuration.ofSeconds(ttlSeconds)) > 0) {
// Don't recover a new one under this situation
container ! RemovePreWarmedContainer
Contributor

Can you just use Remove message here?

Contributor Author

@ningyougang ningyougang Apr 23, 2020

If we use the Remove message to remove an unused prewarmed container, it leads to a new prewarmed container being created after the unused one is deleted.

The logic is as follows:
first, the code here is executed: https://github.com/apache/openwhisk/blob/master/core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/ContainerProxy.scala#L376; then, after context.parent ! ContainerRemoved is executed in destroyContainer, the code here is executed: https://github.com/apache/openwhisk/blob/master/core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/ContainerPool.scala#L272

So I used another message, RemovePreWarmedContainer, to distinguish the two cases. It has another benefit: after the prewarmed container is deleted, we can print the container's details via the prewarmedData field of PreWarmedContainerRemoved(prewarmedData: PreWarmedData).

Contributor

I see. In that case, I suggest changing case object ContainerRemoved to case class ContainerRemoved(replacePrewarm: Boolean) so that we can limit the duplicated logic in ContainerProxy - would that work?
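A small sketch of that shape (the method name onContainerRemoved is just for illustration; in the PR the ContainerPool actor handles the message in its receive block):

  case class ContainerRemoved(replacePrewarm: Boolean)

  def onContainerRemoved(msg: ContainerRemoved): Unit =
    if (msg.replacePrewarm) {
      // normal container removal: top the prewarm pool back up
      println("backfill prewarms")
    } else {
      // removal of an expired prewarm: skip backfilling so the pool can actually shrink
      println("skip backfill")
    }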

Contributor Author

Yes, that works; updated accordingly.

@@ -375,6 +377,8 @@ class ContainerProxy(factory: (TransactionId,

case Event(Remove, data: PreWarmedData) => destroyContainer(data)

case Event(RemovePreWarmedContainer, data: PreWarmedData) => destoryPreWarmedContainer(data)
Contributor

I think using Remove message and existing destroyContainer() function is simpler?

Contributor Author

@ningyougang ningyougang Apr 23, 2020

Updated accordingly

@ningyougang ningyougang force-pushed the adjust-prewarm-container-dynamically branch from cf30f7c to d6a6181 Compare April 23, 2020 09:10
val currentCount = prewarmedPool.count {
case (_, PreWarmedData(_, `kind`, `memory`, _)) => true //done starting
case _ => false //started but not finished starting
}
val startingCount = prewarmStartingPool.count(p => p._2._1 == kind && p._2._2 == memory)
val containerCount = currentCount + startingCount
if (containerCount < config.count) {
if (containerCount < initialCount) {
Contributor

@tysonnorris tysonnorris Apr 23, 2020

I think this should be something like:

val requiredCount = if (init) { initialCount } else {minCount}
if (containerCount < requiredCount) {

Otherwise if min is 0 and initial is 2, we can never scale down to 0.

Contributor Author

Updated accordingly

// check periodically every 1 minute, adjust prewarmed container(delete if unused for some time and create some increment containers)
context.system.scheduler.schedule(1.minute, 1.minute, self, AdjustPrewarmedContainer)

def adjustPrewarmedContainer(): Unit = {
Contributor

Can you DRY up the code in adjustPrewarmedContainer and backfillPrewarms, possibly using the same function for both?

Contributor Author

Hmm... it seems adjustPrewarmedContainer's logic is not the same as backfillPrewarms, and internally it doesn't seem possible to extract the common code either.

Contributor

There is a lot of duplicated, delicate logic - it needs to be consolidated, I think.

Contributor

Here is a proposed version that:

  • consolidates backfillPrewarms() and adjustPrewarmedContainer()
  • removes prewarmContainerIfpossible()
  def adjustPrewarmedContainer(init: Boolean, scheduled: Boolean): Unit = {
    prewarmConfig.foreach { config =>
      // Delete unused prewarmed container until minCount is reached
      val kind = config.exec.kind
      val memory = config.memoryLimit

      val runningCount = prewarmedPool.count {
        //done starting, and not expired
        case (_, p @ PreWarmedData(_, `kind`, `memory`, _, _)) if !p.isExpired() => true
        //started but not finished starting (or expired)
        case _ => false
      }
      val startingCount = prewarmStartingPool.count(p => p._2._1 == kind && p._2._2 == memory)
      val currentCount = runningCount + startingCount
      //determine how many are needed
      val desiredCount: Int =
        if (init) config.initialCount
        else {
          if (scheduled) {
            //scheduled/reactive config backfill
            config.reactive
              .map(c => getReactiveCold(c, kind, memory).getOrElse(c.minCount)) //reactive -> desired is either cold start driven, or minCount
              .getOrElse(config.initialCount) //not reactive -> desired is always initial count
          } else {
            //normal backfill after removal - make sure at least minCount or initialCount is started
            config.reactive.map(_.minCount).getOrElse(config.initialCount)
          }
        }
      if (currentCount < desiredCount) {
        logging.info(
          this,
          s"found ${currentCount} started and ${startingCount} starting; ${if (init) "initing" else "backfilling"} ${desiredCount - currentCount} pre-warms to desired count: ${desiredCount} for kind:${config.exec.kind} mem:${config.memoryLimit.toString}")(
          TransactionId.invokerWarmup)
        (currentCount until desiredCount).foreach { _ =>
          prewarmContainer(config.exec, config.memoryLimit, config.reactive.map(_.ttl))
        }

      }

      //remove expired
      config.reactive.foreach { reactiveValue =>
        prewarmedPool
          .filter { warmInfo =>
            warmInfo match {
              case (_, p @ PreWarmedData(_, `kind`, `memory`, _, _)) if p.isExpired() => true
              case _                                                                  => false
            }
          }
          .drop(reactiveValue.minCount) //keep minCount even if expired
          .map(_._1 ! Remove)
      }
    }
    if (scheduled) {
      //clear coldStartCounts each time scheduled event is processed to reset counts
      coldStartCount = immutable.Map.empty[ColdStartKey, Int]
    }
  }

  def getReactiveCold(config: ReactivePrewarmingConfig, kind: String, memory: ByteSize): Option[Int] = {
    coldStartCount.get(ColdStartKey(kind, memory)).map { value =>
      // Let's assume that threshold is `2`, increment is `1` in runtime.json
      // if cold start number in previous minute is `2`, requireCount is `2/2 * 1 = 1`
      // if cold start number in previous minute is `4`, requireCount is `4/2 * 1 = 2`
      math.min(math.max(config.minCount, (value / config.threshold) * config.increment), config.maxCount)
    }
  }

Contributor Author

Great! The code is much clearer now.
Updated accordingly.

@@ -32,10 +36,14 @@ sealed trait WorkerState
case object Busy extends WorkerState
case object Free extends WorkerState

case class ColdStartKey(kind: String, memory: Long, date: String)
Contributor

What is date used for?
Memory should be ByteSize for consistency elsewhere.

Contributor Author

Updated accordingly

}
case None =>
}
coldStartCount = coldStartCount - coldStartKey
Contributor

Should this just reset the map to empty? Otherwise this map may grow without bound?

Contributor Author

Yes, it resets the map's entry for that kind/memory.

Contributor

I mean that it might be good to simply reset the map all at once, instead of removing each key separately.

Contributor Author

Updated accordingly


/** statistics the cold start */
def incrementColdStartCount(kind: String, memoryLimit: ByteSize): Unit = {
prewarmConfig
Contributor

I would clarify that this is only for cold start of prewarm configs, e.g. not blackbox or other configs. Maybe just a comment is fine, but this is very important.

Contributor Author

Updated accordingly

kind == config.exec.kind && memoryLimit == config.memoryLimit
}
.foreach { _ =>
val time = new SimpleDateFormat("yyyy-MM-dd-HH:mm").format(Calendar.getInstance().getTime)
Contributor

We don't need this time field in the key, I think.

Contributor Author

Yes, you are right. Because our schedule interval is every minute, we can remove it.

prewarmContainerIfpossible(kind, memory, count)
} else {
// at lease create 1 prewarmed container
prewarmContainerIfpossible(kind, memory, 1)
Contributor

I don't think this else should be here - it is not clear from configuration that when value==5 and threshold==5, you get 5/increment created, but if value== 1,2,3,or4, you get 1 created. It should just always use math.max(1, value/increment)

Contributor Author

Updated accordingly, you can check it again ^^

memoryLimit: ByteSize,
ttl: Duration,
threshold: Int,
increment: Int)
Contributor

I think it may be good to separate the dynamic effects into a separate config, which is disabled by default so that operators need to opt in to this new behavior, e.g.

case class PrewarmingConfig(initialCount: Int,
                            exec: CodeExec[_],
                            memoryLimit: ByteSize,
                            reactiveConfig: Option[ReactivePrewarmingConfig]=None)

case class ReactivePrewarmingConfig(minCount: Int,
                            maxCount: Int,
                            ttl: Duration,
                            threshold: Int,
                            increment: Int)

Add some comments to indicate what the relationship between threshold and increment is, i.e. how they determine the number of prewarms created and when.
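For illustration, a stemcell entry opting in to the reactive behavior could then look roughly like this (values are made up; field names follow the ReactivePrewarmingConfig above and the "reactive" JSON key used later in this thread):

  "stemCells": [
    {
      "initialCount": 2,
      "memory": "256 MB",
      "reactive": {
        "minCount": 1,
        "maxCount": 4,
        "ttl": "2 minutes",
        "threshold": 1,
        "increment": 1
      }
    }
  ]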

Contributor Author

Great suggestion, updated accordingly.

val memory = config.memoryLimit
val ttlSeconds = config.ttl.toSeconds
val minCount = config.minCount
val containers = prewarmedPool.filter { warmInfo =>
Contributor

suggest:

  • change PreWarmedData.created: Instant to PreWarmedData.expires: Option[Deadline]
  • add to PreWarmedData: def isExpired(): Boolean = expires.exists(_.isOverdue())
  • simplifying this removal logic to
      prewarmedPool
        .filter { warmInfo =>
          warmInfo match {
            case (_, p @ PreWarmedData(_, `kind`, `memory`, _, _)) if p.isExpired() => true
            case _                                                                  => false
          }
        }
        .drop(config.minCount)
        .foreach(_._1 ! Remove)

Contributor Author

updated accordingly.

@tysonnorris
Contributor

tysonnorris commented Apr 24, 2020

BTW I think takePrewarmContainer needs enhancement as well, so that the "oldest" prewarm is always used - otherwise, it could end up that a random but recent prewarm is taken, and then all the others suddenly expire and are removed, leaving only the min. But if we always take the oldest, then there will be more left when expiring ones are removed.
e.g. (assuming PrewarmedData.expires is a Option[Deadline] as commented elsewhere)

    val now = Deadline.now
    prewarmedPool
      .toSeq
      .sortBy(_._2.expires.getOrElse(now))
      .find {
      ...

@ningyougang ningyougang force-pushed the adjust-prewarm-container-dynamically branch from 3d21308 to 7b761ce Compare May 14, 2020 09:47
val coldStartValue = coldStart._2
MetricEmitter.emitHistogramMetric(
LoggingMarkers.CONTAINER_POOL_COLDSTART(coldStartKey.memory.toString, coldStartKey.kind),
coldStartValue)
Contributor

I would name this metric "prewarmColdstart" - to not confuse it with containerStart (which has a tag for containerState). These values are not just cold starts - but cold starts that matched prewarm configs (and did not have a prewarm available)

Contributor Author

updated accordingly

}
// emit expired container counter metric with memory + kind
val removedCount = expiredPrewarmedContainer.size
MetricEmitter.emitHistogramMetric(LoggingMarkers.CONTAINER_POOL_EXPIRED(memory.toString, kind), removedCount)
Contributor

I would name this metric "prewarmExpired"

Contributor

Also, these metrics should be counters not histograms, I think.

Contributor Author

updated accordingly

stream.reset()

// Make sure prewarmed containers can be deleted from prewarmedPool due to unused
Thread.sleep(3.seconds.toMillis)
Contributor

should sleep an amount based on ttl (e.g. ttl + 1.millis)

Contributor Author

updated accordingly

@@ -67,6 +68,9 @@ class ContainerPoolTests
// the values is done properly.
val exec = CodeExecAsString(RuntimeManifest("actionKind", ImageName("testImage")), "testCode", None)
val memoryLimit = 256.MB
val ttl = FiniteDuration(2, TimeUnit.SECONDS)
Contributor

I would make ttl shorter - so that test runs don't take longer than necessary. Maybe 500.millis?

Contributor Author

Great! updated accordingly

containers(1).send(pool, ContainerRemoved(false))

// Make sure prewarmed containers are deleted
Thread.sleep(2.seconds.toMillis)
Contributor

I would leave out this sleep (and the log check) - receiving the Remove message should be enough to make sure it is removed

Contributor Author

updated accordingly


pool ! AdjustPrewarmedContainer
// Make sure adjustPrewarmContainer run finished
Thread.sleep(2.seconds.toMillis)
Contributor

I think you can avoid most of the Thread.sleep usage - e.g. use eventually, or similar

eventually { 
  stream.toString should include(s"currentCount: 0 prewarmed container")
}

Contributor Author

Great! updated accordingly

Thread.sleep(ttl.toMillis)
pool ! AdjustPrewarmedContainer
// Make sure adjustPrewarmContainer run finished
Thread.sleep(2.seconds.toMillis)
Contributor

can you remove this sleep?

Contributor Author

@ningyougang ningyougang May 15, 2020

removed accordingly

// Make sure the created prewarmed containers are expired
stream.reset()
Thread.sleep(ttl.toMillis)
pool ! AdjustPrewarmedContainer
Contributor

shouldn't pool send AdjustPrewarmedContainer to itself eventually after the poolConfig.prewarmExpiredCheckPeriod?

Contributor Author

I think it is needed, because the prewarmExpirationCheckInterval is 1 minute:

  def poolConfig(userMemory: ByteSize) =
    ContainerPoolConfig(userMemory, 0.5, false, FiniteDuration(1, TimeUnit.MINUTES))

so we need to send AdjustPrewarmedContainer to the pool manually in the test code.
The reason I didn't change the value above to something smaller is that I wanted readers of the test code to see the details clearly.

Contributor

But the test is written in a way suggesting that the AdjustPrewarmedContainer message is explicitly sent by logic, but it never is - it is only ever sent via schedule. So it would be more clear to rely on the timing (with a shorter value) and see that the results of the message occur without interference of the test logic.

Contributor Author

Updated accordingly.

@@ -278,6 +278,7 @@
"CONFIG_whisk_invoker_https_keystorePassword": "{{ invoker.ssl.keystore.password }}"
"CONFIG_whisk_invoker_https_keystoreFlavor": "{{ invoker.ssl.storeFlavor }}"
"CONFIG_whisk_invoker_https_clientAuth": "{{ invoker.ssl.clientAuth }}"
"CONFIG_whisk_containerPool_prewarmExpiredCheckPeriod": "{{ container_pool_prewarmExpiredCheckPeriod | default('1 minute') }}"
Member

It's a bit picky but how about changing the name to a more intuitive one such as container_pool_prewarm_expirationCheckInterval?

Contributor Author

Great, updated accordingly.

.orElse(fields.get("count"))
.map(_.convertTo[Int])
.get
val memory = fields.get("memory").map(_.convertTo[ByteSize]).get
Member

I would avoid using get on Option.

Since these configurations are required, I think we can do something similar to this.

Suggested change
val memory = fields.get("memory").map(_.convertTo[ByteSize]).get
override def read(value: JsValue): StemCell = {
val fields = value.asJsObject.fields
val initialCount: Option[Int] =
fields
.get("initialCount")
.orElse(fields.get("count"))
.map(_.convertTo[Int])
val memory: Option[ByteSize] = fields.get("memory").map(_.convertTo[ByteSize])
val config = fields.get("reactive").map(_.convertTo[ReactivePrewarmingConfig])
(initialCount, memory) match {
case (Some(c), Some(m)) => StemCell(c, m, config)
case (Some(c), None) => throw new IllegalArgumentException(s"blah blah: $c")
case (None, Some(m)) => throw new IllegalArgumentException(s"blah blah: $m")
case _ => throw new IllegalArgumentException(s"blah blah")
}

Contributor Author

updated accordingly.

(py.kind, py.image, 2, 256.MB))
}

it should "parse manifest with reactive from JSON string" in {
Contributor Author

Added another test case for parsing a manifest with reactive config from a JSON string.

@tysonnorris
Contributor

@ningyougang Thanks for all your work on this! I think this PR is close to done - do you want to solicit feedback on the dev list once more to let other people chime in on this before it gets merged?

@ningyougang ningyougang force-pushed the adjust-prewarm-container-dynamically branch from 2db7514 to 5ae5fa4 Compare May 19, 2020 03:35
@codecov-commenter

codecov-commenter commented May 19, 2020

Codecov Report

Merging #4871 into master will decrease coverage by 5.94%.
The diff coverage is 93.12%.


@@            Coverage Diff             @@
##           master    #4871      +/-   ##
==========================================
- Coverage   83.19%   77.25%   -5.95%     
==========================================
  Files         201      201              
  Lines        9351     9441      +90     
  Branches      397      396       -1     
==========================================
- Hits         7780     7294     -486     
- Misses       1571     2147     +576     
Impacted Files Coverage Δ
...rg/apache/openwhisk/core/entity/ExecManifest.scala 91.83% <78.57%> (-5.39%) ⬇️
.../openwhisk/core/containerpool/ContainerProxy.scala 93.85% <85.00%> (+0.30%) ⬆️
...in/scala/org/apache/openwhisk/common/Logging.scala 76.41% <100.00%> (-8.41%) ⬇️
...penwhisk/core/containerpool/ContainerFactory.scala 87.50% <100.00%> (+0.83%) ⬆️
...e/openwhisk/core/containerpool/ContainerPool.scala 97.45% <100.00%> (+1.80%) ⬆️
...pache/openwhisk/core/invoker/InvokerReactive.scala 79.64% <100.00%> (ø)
...core/database/cosmosdb/RxObservableImplicits.scala 0.00% <0.00%> (-100.00%) ⬇️
...ore/database/cosmosdb/cache/CacheInvalidator.scala 0.00% <0.00%> (-100.00%) ⬇️
...e/database/cosmosdb/cache/ChangeFeedConsumer.scala 0.00% <0.00%> (-100.00%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0.00% <0.00%> (-96.23%) ⬇️
... and 19 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a4122fd...5ae5fa4. Read the comment docs.

Contributor

@tysonnorris tysonnorris left a comment

LGTM 🎉 Thanks! Will wait a day for any additional feedback, and merge tomorrow if nobody else does it first.
