
Throttle the system based on active-ack timeouts. #3875

Merged · 8 commits into apache:master · Jul 26, 2018

Conversation

@markusthoemmes (Contributor)

This is a fairly big change, but bear with me!

Today, we have an arbitrary system-wide limit on maximum concurrent connections. In general that is fine, but it doesn't correlate directly with what's actually happening in the system.

This adds a new state to each monitored invoker: Overloaded. An invoker goes into the overloaded state if active-acks start to time out. Eventually, if the system is really overloaded, all invokers will be in the overloaded state, which causes the loadbalancer to return a failure. This failure now results in a "503 - System overloaded" message back to the user.
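To make the mechanism concrete, here is a minimal sketch of the detection idea, assuming the loadbalancer tracks the outcome of the last N active-acks per invoker and flips it to Overloaded once timeouts dominate. All class names and thresholds below are illustrative, not the PR's actual code:

```scala
import scala.collection.mutable

// Illustrative sketch only; names and thresholds are assumptions.
sealed trait AckOutcome
case object AckSuccess extends AckOutcome
case object AckSystemError extends AckOutcome
case object AckTimeout extends AckOutcome

/** Tracks the outcomes of the last `size` active-acks of one invoker. */
class AckWindow(size: Int, timeoutThreshold: Int) {
  private val window = mutable.Queue.empty[AckOutcome]

  def record(outcome: AckOutcome): Unit = {
    window.enqueue(outcome)
    if (window.size > size) window.dequeue()
  }

  /** The invoker counts as overloaded once enough recent acks timed out. */
  def isOverloaded: Boolean = window.count(_ == AckTimeout) >= timeoutThreshold
}
```

If every invoker's window reports overloaded, the loadbalancer has nowhere left to schedule, fails the request, and the controller surfaces that failure as the 503.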

My changes affect the following components:

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@markusthoemmes added the "review" label (Review for this PR has been requested and yet needs to be done.) on Jul 12, 2018
codecov-io commented Jul 12, 2018

Codecov Report

Merging #3875 into master will decrease coverage by 4.89%.
The diff coverage is 66.21%.


@@            Coverage Diff            @@
##           master    #3875     +/-   ##
=========================================
- Coverage   75.82%   70.92%   -4.9%     
=========================================
  Files         145      145             
  Lines        6924     6930      +6     
  Branches      421      423      +2     
=========================================
- Hits         5250     4915    -335     
- Misses       1674     2015    +341
Impacted Files | Coverage Δ
--- | ---
.../scala/src/main/scala/whisk/core/WhiskConfig.scala | 91.86% <ø> (-0.14%) ⬇️
...a/whisk/core/entitlement/ActivationThrottler.scala | 80% <ø> (ø) ⬆️
...src/main/scala/whisk/core/controller/Actions.scala | 90.35% <0%> (-0.93%) ⬇️
.../main/scala/whisk/core/controller/Controller.scala | 0% <0%> (ø) ⬆️
.../main/scala/whisk/core/controller/WebActions.scala | 87.89% <0%> (-0.7%) ⬇️
...scala/src/main/scala/whisk/common/RingBuffer.scala | 75% <100%> (ø) ⬆️
...on/scala/src/main/scala/whisk/common/Logging.scala | 86.66% <100%> (-0.15%) ⬇️
...e/loadBalancer/ShardingContainerPoolBalancer.scala | 29.41% <11.11%> (-2.02%) ⬇️
...ain/scala/whisk/core/entitlement/Entitlement.scala | 78.07% <66.66%> (+1.45%) ⬆️
...a/whisk/core/loadBalancer/InvokerSupervision.scala | 84.55% <93.47%> (+1.74%) ⬆️
... and 6 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rabbah (Member) commented on Jul 13, 2018

I haven’t looked at the changes yet; based on the description:

  • Is this heuristic affected by a network partition which results in no active-acks?
  • Timing out active-acks can also happen on the controller side if it can’t drain them fast enough (too slow, too many, GC); is the heuristic affected by that?

@markusthoemmes (Contributor, Author)

@rabbah, in theory, yes!

To 1.: If a network partition causes active-acks to stop arriving, it will also either prevent the controller from writing to Kafka or prevent the invoker from reading from/writing to Kafka. These seem like valid cases for considering an invoker unusable, no?

To 2.: Again, I think this is a valid case of general overload. If there are too many active-acks to drain for whatever reason, it’s safer to declare the system overloaded than to continue processing into ever more cryptic errors (timeouts, crashes, etc.).

WDYT?

@markusthoemmes (Contributor, Author)

Summarizing a discussion with @rabbah: We agree that this is not optimal and remains somewhat heuristic. It is, however, better than what we have today, and it serves as a catch-all around the invoker for whatever errors do occur.

@markusthoemmes (Contributor, Author)

PG4 2003 🔵

@cbickel (Contributor) left a comment:

In general the PR looks good to me, but I see a problem with sending already-timed-out active-acks to the invoker pool. This can cause a flapping state in the invoker.

(1 to InvokerActor.bufferSize).foreach { _ =>
  invoker ! InvocationFinishedMessage(InvokerInstanceId(0), InvocationFinishedResult.Success)
}
pool.expectMsg(Transition(invoker, Unhealthy, Healthy))

// Fill buffer with errors
@cbickel (Contributor):

... with timeouts

// Pings are arriving fine, the invoker returns system errors though
case object Unhealthy extends Unusable { val asString = "unhealthy" }
// Pings are arriving fine, the invoker does not respond with active-acks in the expected time though
case object Overloaded extends Unusable { val asString = "overloaded" }
@cbickel (Contributor):
Does it make sense to call the state Unresponsive or something like that?
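For context, the two quoted states presumably sit in a hierarchy along these lines; everything except the two quoted case objects is an assumption about the surrounding code, sketched here only for orientation:

```scala
// Sketch of the surrounding state hierarchy; only Unhealthy and
// Overloaded are quoted from the diff, the rest is assumed.
sealed trait InvokerState { val asString: String }
sealed trait Usable extends InvokerState
sealed trait Unusable extends InvokerState

// Pings are arriving and active-acks complete in time
case object Healthy extends Usable { val asString = "up" }
// Pings are arriving fine, the invoker returns system errors though
case object Unhealthy extends Unusable { val asString = "unhealthy" }
// Pings are arriving fine, the invoker does not respond with active-acks in the expected time though
case object Overloaded extends Unusable { val asString = "overloaded" }
// No pings are arriving from the invoker at all
case object Offline extends Unusable { val asString = "down" }
```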

  case None if !forced =>
    // the entry has already been removed but we receive an active ack for this activation Id.
    // This happens for health actions, because they don't have an entry in Loadbalancerdata or
    // for activations that already timed out.
-   invokerPool ! InvocationFinishedMessage(invoker, isSuccess)
+   invokerPool ! InvocationFinishedMessage(invoker, invocationResult)
@cbickel (Contributor):

Sending active-acks to the invokerPool after they have already timed out will recover the invoker too early, as it might still have a queue that is too large.
But this case also handles the active-acks of the health actions, which still need to be sent to the invokerPool.

@markusthoemmes (Contributor, Author):

Brilliant find! Indeed this kinda breaks the protocol because in an overloaded scenario, the invokers will swap between Healthy and Overloaded continuously.

As discussed in person, we now no longer send to the invokerPool the result of an activation that finished after it was forcefully completed 👍
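A self-contained sketch of the agreed behavior; every name below is an illustrative stand-in, not the PR's exact code. In particular, `isHealthAction` is an assumed way to tell health-action acks apart (the real code may distinguish them differently, e.g. via the transaction id):

```scala
object ActiveAckHandling {
  sealed trait AckResult
  case object Success extends AckResult
  case object SystemError extends AckResult

  final case class ActivationEntry(invokerId: Int)

  def onActiveAck(entry: Option[ActivationEntry],
                  isHealthAction: Boolean,
                  invokerId: Int,
                  result: AckResult,
                  notifyPool: (Int, AckResult) => Unit): Unit =
    entry match {
      case Some(e) =>
        // Normal case: the activation completed before its timeout.
        notifyPool(e.invokerId, result)
      case None if isHealthAction =>
        // Health actions never have a loadbalancer entry, but their acks
        // must still feed the invoker pool's health state.
        notifyPool(invokerId, result)
      case None =>
        // Late ack for an activation that was already forcefully completed
        // (timed out): drop it so the invoker is not flipped back to
        // Healthy too early (the flapping described above).
    }
}
```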

@cbickel (Contributor) left a comment:

LGTM

@markusthoemmes (Contributor, Author)

PG2 3408 🔵

@cbickel cbickel merged commit 9dd34f2 into apache:master Jul 26, 2018
BillZong pushed a commit to BillZong/openwhisk that referenced this pull request Nov 18, 2019