
Throttle the system based on active-ack timeouts. #3875

Merged · 8 commits into apache:master · Jul 26, 2018

Conversation

@markusthoemmes (Contributor)

This is a fairly big change, but bear with me!

Today, we have an arbitrary system-wide limit on maximum concurrent connections. In general that is fine, but it doesn't correlate directly with what's actually happening in the system.

This adds a new state to each monitored invoker: Overloaded. An invoker goes into the overloaded state if active-acks start to time out. Eventually, if the system is really overloaded, all invokers will be in the overloaded state, which causes the loadbalancer to return a failure. This failure now results in a "503 - System overloaded" message back to the user.
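To make the mechanism concrete, here is a minimal sketch of the detection idea, assuming the loadbalancer tracks the outcome of the last N active-acks per invoker and flips it to Overloaded once timeouts dominate. All class names and thresholds below are illustrative, not the PR's actual code:

```scala
import scala.collection.mutable

// Illustrative sketch only; names and thresholds are assumptions.
sealed trait AckOutcome
case object AckSuccess extends AckOutcome
case object AckSystemError extends AckOutcome
case object AckTimeout extends AckOutcome

/** Tracks the outcomes of the last `size` active-acks of one invoker. */
class AckWindow(size: Int, timeoutThreshold: Int) {
  private val window = mutable.Queue.empty[AckOutcome]

  def record(outcome: AckOutcome): Unit = {
    window.enqueue(outcome)
    if (window.size > size) window.dequeue()
  }

  /** The invoker counts as overloaded once enough recent acks timed out. */
  def isOverloaded: Boolean = window.count(_ == AckTimeout) >= timeoutThreshold
}
```

If every invoker's window reports overloaded, the loadbalancer has nowhere left to schedule, fails the request, and the controller surfaces that failure as the 503.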

My changes affect the following components:

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@markusthoemmes added the "review" label (Review for this PR has been requested and yet needs to be done.) on Jul 12, 2018
codecov-io commented Jul 12, 2018

Codecov Report

Merging #3875 into master will decrease coverage by 4.89%.
The diff coverage is 66.21%.


@@            Coverage Diff            @@
##           master    #3875     +/-   ##
=========================================
- Coverage   75.82%   70.92%   -4.9%     
=========================================
  Files         145      145             
  Lines        6924     6930      +6     
  Branches      421      423      +2     
=========================================
- Hits         5250     4915    -335     
- Misses       1674     2015    +341
Impacted Files | Coverage Δ
--- | ---
.../scala/src/main/scala/whisk/core/WhiskConfig.scala | 91.86% <ø> (-0.14%) ⬇️
...a/whisk/core/entitlement/ActivationThrottler.scala | 80% <ø> (ø) ⬆️
...src/main/scala/whisk/core/controller/Actions.scala | 90.35% <0%> (-0.93%) ⬇️
.../main/scala/whisk/core/controller/Controller.scala | 0% <0%> (ø) ⬆️
.../main/scala/whisk/core/controller/WebActions.scala | 87.89% <0%> (-0.7%) ⬇️
...scala/src/main/scala/whisk/common/RingBuffer.scala | 75% <100%> (ø) ⬆️
...on/scala/src/main/scala/whisk/common/Logging.scala | 86.66% <100%> (-0.15%) ⬇️
...e/loadBalancer/ShardingContainerPoolBalancer.scala | 29.41% <11.11%> (-2.02%) ⬇️
...ain/scala/whisk/core/entitlement/Entitlement.scala | 78.07% <66.66%> (+1.45%) ⬆️
...a/whisk/core/loadBalancer/InvokerSupervision.scala | 84.55% <93.47%> (+1.74%) ⬆️
... and 6 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rabbah (Member) commented on Jul 13, 2018

I haven’t looked at the changes yet; based on the description:

  • Is this heuristic affected by a network partition which results in no active-acks?
  • Timing out active-acks can also happen on the controller side if it can’t drain them fast enough (too slow, too many, GC); is the heuristic affected by that?

@markusthoemmes (Contributor, Author)

@rabbah, in theory, yes!

To 1.: If a network partition causes active-acks to stop arriving, it will also either prevent the controller from writing to Kafka or prevent the invoker from reading from/writing to Kafka. These seem like valid cases for considering an invoker unusable, no?

To 2.: Again, I think this is a valid case of general overload. If there are too many active-acks to drain for whatever reason, it’s safer to declare the system overloaded than to continue processing into ever more cryptic errors (timeouts, crashes, etc.).

WDYT?

@markusthoemmes (Contributor, Author)

Summarizing a discussion with @rabbah: We agree that this is not optimal and remains somewhat heuristic. It is, however, better than what we have today, and it serves as a catch-all around the invoker for whatever errors do occur.

@markusthoemmes (Contributor, Author)

PG4 2003 🔵

@cbickel (Contributor) left a comment:

In general the PR looks good to me, but I see a problem with sending already-timed-out active-acks to the invoker pool. This can cause a flapping state in the invoker.

(1 to InvokerActor.bufferSize).foreach { _ =>
  invoker ! InvocationFinishedMessage(InvokerInstanceId(0), InvocationFinishedResult.Success)
}
pool.expectMsg(Transition(invoker, Unhealthy, Healthy))

// Fill buffer with errors
@cbickel (Contributor):

... with timeouts

// Pings are arriving fine, the invoker returns system errors though
case object Unhealthy extends Unusable { val asString = "unhealthy" }
// Pings are arriving fine, the invoker does not respond with active-acks in the expected time though
case object Overloaded extends Unusable { val asString = "overloaded" }
@cbickel (Contributor):
Does it make sense to call the state Unresponsive or something like that?
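For context, the two quoted states presumably sit in a hierarchy along these lines; everything except the two quoted case objects is an assumption about the surrounding code, sketched here only for orientation:

```scala
// Sketch of the surrounding state hierarchy; only Unhealthy and
// Overloaded are quoted from the diff, the rest is assumed.
sealed trait InvokerState { val asString: String }
sealed trait Usable extends InvokerState
sealed trait Unusable extends InvokerState

// Pings are arriving and active-acks complete in time
case object Healthy extends Usable { val asString = "up" }
// Pings are arriving fine, the invoker returns system errors though
case object Unhealthy extends Unusable { val asString = "unhealthy" }
// Pings are arriving fine, the invoker does not respond with active-acks in the expected time though
case object Overloaded extends Unusable { val asString = "overloaded" }
// No pings are arriving from the invoker at all
case object Offline extends Unusable { val asString = "down" }
```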

  case None if !forced =>
    // the entry has already been removed but we receive an active ack for this activation Id.
    // This happens for health actions, because they don't have an entry in Loadbalancerdata or
    // for activations that already timed out.
-   invokerPool ! InvocationFinishedMessage(invoker, isSuccess)
+   invokerPool ! InvocationFinishedMessage(invoker, invocationResult)
@cbickel (Contributor):

Sending active-acks to the invokerPool after they have already timed out will recover the invoker too early, as it might still have a queue that is too large.
But this case also handles the active-acks of the health actions, which still need to be sent to the invokerPool.

@markusthoemmes (Contributor, Author):

Brilliant find! Indeed this kinda breaks the protocol because in an overloaded scenario, the invokers will swap between Healthy and Overloaded continuously.

As discussed in person, we now no longer send to the invokerPool the result of an activation that finished after it was forcefully completed 👍
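A self-contained sketch of the agreed behavior; every name below is an illustrative stand-in, not the PR's exact code. In particular, `isHealthAction` is an assumed way to tell health-action acks apart (the real code may distinguish them differently, e.g. via the transaction id):

```scala
object ActiveAckHandling {
  sealed trait AckResult
  case object Success extends AckResult
  case object SystemError extends AckResult

  final case class ActivationEntry(invokerId: Int)

  def onActiveAck(entry: Option[ActivationEntry],
                  isHealthAction: Boolean,
                  invokerId: Int,
                  result: AckResult,
                  notifyPool: (Int, AckResult) => Unit): Unit =
    entry match {
      case Some(e) =>
        // Normal case: the activation completed before its timeout.
        notifyPool(e.invokerId, result)
      case None if isHealthAction =>
        // Health actions never have a loadbalancer entry, but their acks
        // must still feed the invoker pool's health state.
        notifyPool(invokerId, result)
      case None =>
        // Late ack for an activation that was already forcefully completed
        // (timed out): drop it so the invoker is not flipped back to
        // Healthy too early (the flapping described above).
    }
}
```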

@cbickel (Contributor) left a comment:

LGTM

@markusthoemmes (Contributor, Author)

PG2 3408 🔵

@cbickel cbickel merged commit 9dd34f2 into apache:master Jul 26, 2018
BillZong pushed a commit to BillZong/openwhisk that referenced this pull request Nov 18, 2019