-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add error handling to container manager when invoker query fails #5320
add error handling to container manager when invoker query fails #5320
Conversation
case t: Throwable => | ||
logging.error(this, s"Unable to get available invokers: ${t.getMessage}.") | ||
List.empty[InvokerHealth] | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- container manager swallows the message
You mean if Throwable happens, there has no ack message? - the memory queue will never hear back the status of the container creation
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A request is done to etcd to get the list of healthy invokers with .getAvailableInvokers
. If the future for the request to etcd fails for any reason, there was no failure handing on that call prior to this. This function is just a synchronous unit
so if there's no failure handling on that future it just completes and never makes an acknowledgement back to the memory queue that the creation message has been processed either successfully or failed for the memory queue to properly decrement creationIds
. Also should clarify if it silently fails at this point, it hasn't yet registered the creation job so the akka timer to timeout the creation job to make the call back to MemoryQueue on timeout is not created. The memory queue indefinitely thinks that there is a container creation in progress. If the action never needs more than one container, it will never be able to execute because the memory queue thinks one is in progress. Also the memory queue can not be stopped on timeout in this case because of the creationIds
not being 0.
else {
logging.info(
this,
s"[$invocationNamespace:$action:$stateName] The queue is timed out but there are still ${queue.size} activation messages or (running: ${containers.size}, in-progress: ${creationIds.size}) containers")
stay
}
So while I think this covers the only edge case I know of, we really need an additional safeguard in MemoryQueue
to eventually clear out knowledge of in progress containers if things get out of sync as there's no way to guarantee creationIds is perfectly in sync when it's dependent essentially on a fire and forget successfully making the callback at some point which is prone to introducing bugs even if this pr covers every case for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One solution could be to make the request from MemoryQueue
to ContainerManager
an ask rather than a tell and make the timeout of the ask the value of CONFIG_whisk_scheduler_inProgressJobRetention
plus one second for buffer. That would probably significantly reduce the complexity of the MemoryQueue as well for message cases you need to account for
(I think the CreationJobManager
actually gets the responsibility of responding to the MemoryQueue
in most cases, but you should be able to just forward the ref of the ask as a param to CreationJobManager
from ContainerManager
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got your point 👍
...rc/test/scala/org/apache/openwhisk/core/scheduler/container/test/ContainerManagerTests.scala
Outdated
Show resolved
Hide resolved
...rc/test/scala/org/apache/openwhisk/core/scheduler/container/test/ContainerManagerTests.scala
Outdated
Show resolved
Hide resolved
Codecov Report
@@ Coverage Diff @@
## master #5320 +/- ##
==========================================
- Coverage 80.94% 76.54% -4.40%
==========================================
Files 239 239
Lines 14242 14245 +3
Branches 604 594 -10
==========================================
- Hits 11528 10904 -624
- Misses 2714 3341 +627
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
…che#5320) * add error handling to container manager when invoker query fails * fix tests Co-authored-by: Brendan Doyle <brendand@qualtrics.com>
Description
There is no failure handling if the query to etcd for list of healthy invokers fails. The container manager swallows the message and the memory queue will never hear back the status of the container creation.
Related issue and scope
My changes affect the following components
Types of changes
Checklist: