Throttle the system based on active-ack timeouts. #3875
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #3875      +/-  ##
==========================================
- Coverage   75.82%   70.92%    -4.9%
==========================================
  Files         145      145
  Lines        6924     6930       +6
  Branches      421      423       +2
==========================================
- Hits         5250     4915     -335
- Misses       1674     2015     +341
==========================================
Continue to review full report at Codecov.
I haven't looked at the changes yet, but based on the description:
@rabbah, in theory, yes! On 1.: If a network partition causes no active-acks, that same partition will also either prevent the controller from writing to Kafka or prevent the invoker from reading from/writing to Kafka. Aren't these valid cases to consider an invoker unusable? On 2.: Again, I think this is a valid case of general overload. If too many active-acks time out for whatever reason, it's safer to call the system done than to continue processing with more cryptic errors (timeouts, crashes, etc.). WDYT?
Summarizing a discussion with @rabbah: We agree that this is not optimal and might be subject to heuristics. It is, however, better than what we have today, and it serves as a catch-all around the invoker to catch any possible errors.
In general the PR looks good to me, but I see a problem with sending already timed-out active-acks to the invoker pool. This can cause a flapping state in the invoker.
(1 to InvokerActor.bufferSize).foreach { _ =>
  invoker ! InvocationFinishedMessage(InvokerInstanceId(0), InvocationFinishedResult.Success)
}
pool.expectMsg(Transition(invoker, Unhealthy, Healthy))

// Fill buffer with errors
... with timeouts
// Pings are arriving fine, the invoker returns system errors though
case object Unhealthy extends Unusable { val asString = "unhealthy" }
// Pings are arriving fine, the invoker does not respond with active-acks in the expected time though
case object Overloaded extends Unusable { val asString = "overloaded" }
Does it make sense to call the state Unresponsive or something like that?
case None if !forced =>
  // the entry has already been removed but we receive an active ack for this activation Id.
  // This happens for health actions, because they don't have an entry in Loadbalancerdata or
  // for activations that already timed out.
- invokerPool ! InvocationFinishedMessage(invoker, isSuccess)
+ invokerPool ! InvocationFinishedMessage(invoker, invocationResult)
Sending active-acks to the invokerPool after they have already timed out will recover the invoker too early, as it might still have a queue that is too large. But this case also handles the active-acks of the health actions, which still need to be sent to the invokerPool.
Brilliant find! Indeed, this somewhat breaks the protocol: in an overloaded scenario, the invokers will swap between Healthy and Overloaded continuously.
As discussed in person, we now no longer send to the invokerPool the result of an activation that finished after it was forced 👍
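As a hedged sketch, this is roughly how the agreed behavior slots into the match from the diff above (isHealthAction and aid are hypothetical names introduced here for illustration, not the PR's actual code):

case None if !forced =>
  // A real active-ack arrived, but the entry is already gone. This happens
  // for health actions (they have no entry in the loadbalancer data) and
  // for activations whose completion was already forced after a timeout.
  if (isHealthAction(aid)) {
    // Health-action acks must still reach the pool so invoker monitoring works.
    invokerPool ! InvocationFinishedMessage(invoker, invocationResult)
  }
  // Late acks of already-forced activations are dropped here: forwarding them
  // would flip an Overloaded invoker back to Healthy too early (flapping).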
LGTM
This is a fairly big change but bear with me!
Today, we have an arbitrary system-wide limit of maximum concurrent connections. In general that is fine, but it doesn't have a direct correlation to what's actually happening in the system.
This adds a new state to each monitored invoker: Overloaded. An invoker will go into the overloaded state if active-acks start to time out. Eventually, if the system is really overloaded, all invokers will be in the overloaded state, which will cause the loadbalancer to return a failure. This failure now results in a
503 - System overloaded
message back to the user.
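A self-contained sketch of the heuristic described above; the names, window mechanism, and threshold are illustrative assumptions, not the PR's exact implementation:

object HealthSketch {
  sealed trait Result
  case object Success extends Result
  case object SystemError extends Result
  case object Timeout extends Result // active-ack did not arrive in time

  sealed trait State
  case object Healthy extends State
  case object Unhealthy extends State  // system errors dominate the window
  case object Overloaded extends State // active-ack timeouts dominate the window

  // Decide the state from a sliding window of recent invocation results.
  def stateOf(window: Seq[Result], maxFailures: Int = 3): State =
    if (window.count(_ == Timeout) > maxFailures) Overloaded
    else if (window.count(_ == SystemError) > maxFailures) Unhealthy
    else Healthy
}

Once every invoker's window evaluates to Overloaded, the loadbalancer has no invoker left to schedule on, and the controller translates that failure into the 503 - System overloaded response.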