
add error handling to container manager when invoker query fails #5320

Merged

Conversation

@bdoyle0182 (Contributor) commented Aug 31, 2022

Description

There is no failure handling when the query to etcd for the list of healthy invokers fails. The container manager swallows the creation message, and the memory queue never hears back about the status of the container creation.
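To make that failure mode concrete, here is a minimal self-contained sketch (the types, names, and bodies are stand-ins, not the actual ContainerManager code): the etcd query returns a Future, and without a recover branch a failed Future simply completes the call with no ack ever being sent.

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object CreationSketch {
  // Stand-ins for the real OpenWhisk types, only so the sketch is self-contained.
  final case class InvokerHealth(id: Int)
  final case class ContainerCreationMessage(creationId: String)

  // Simulates the etcd query failing, e.g. because etcd is unreachable.
  def getAvailableInvokers(): Future[List[InvokerHealth]] =
    Future.failed(new RuntimeException("etcd unavailable"))

  def createContainers(msg: ContainerCreationMessage): Unit = {
    getAvailableInvokers().map { invokers =>
      // Scheduling and the ack back to the MemoryQueue happen only when the query succeeds.
      println(s"scheduling ${msg.creationId} onto ${invokers.size} invokers")
    }
    // No recover branch: when the Future fails, nothing is reported back to the
    // MemoryQueue, so its creationIds bookkeeping for this creation is never cleared.
  }
}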

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

Review thread on the failure-handling branch added in ContainerManager.scala:

case t: Throwable =>
  logging.error(this, s"Unable to get available invokers: ${t.getMessage}.")
  List.empty[InvokerHealth]
})
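Reusing the stand-ins from the sketch in the description, roughly how the recovered call behaves; how the empty list is handled downstream is an assumption for illustration, not the literal ContainerManager code.

import scala.concurrent.ExecutionContext.Implicits.global
import CreationSketch._

object CreationSketchWithRecover {
  def createContainers(msg: ContainerCreationMessage): Unit = {
    getAvailableInvokers()
      .recover {
        case t: Throwable =>
          // Log and fall back to "no invokers" instead of letting the Future fail silently.
          println(s"Unable to get available invokers: ${t.getMessage}.")
          List.empty[InvokerHealth]
      }
      .map { invokers =>
        if (invokers.isEmpty) {
          // Assumption: the empty list flows into the existing "no available invokers"
          // handling, so the MemoryQueue still receives a failure ack and can decrement
          // creationIds for this job instead of waiting forever.
          println(s"no invokers available for ${msg.creationId}, acking failure")
        } else {
          println(s"scheduling ${msg.creationId} onto ${invokers.size} invokers")
        }
      }
  }
}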
@ningyougang (Contributor) commented Aug 31, 2022

  • container manager swallows the message
    You mean that if a Throwable happens, there is no ack message?
  • the memory queue will never hear back the status of the container creation
    Why?

@bdoyle0182 (Contributor, Author) commented Aug 31, 2022

A request is made to etcd to get the list of healthy invokers via .getAvailableInvokers. Prior to this change there was no failure handling on that call, so if the future for the etcd request fails for any reason, the function (which just returns Unit) simply completes without ever acknowledging back to the memory queue that the creation message was processed, either successfully or as a failure, which is what the memory queue needs in order to decrement creationIds.

I should also clarify that when it fails silently at this point, the creation job has not yet been registered, so the akka timer that would time out the creation job and call back to MemoryQueue is never created. The memory queue therefore believes indefinitely that a container creation is in progress. If the action never needs more than one container, it can never execute, because the memory queue thinks one is already being created. The memory queue also cannot be stopped on timeout in this case, because creationIds is not 0:

else {
        logging.info(
          this,
          s"[$invocationNamespace:$action:$stateName] The queue is timed out but there are still ${queue.size} activation messages or (running: ${containers.size}, in-progress: ${creationIds.size}) containers")
        stay
      }

So while I think this covers the only edge case I know of, we really need an additional safeguard in MemoryQueue to eventually clear out its knowledge of in-progress containers if things get out of sync. There is no way to guarantee creationIds stays perfectly in sync when it essentially depends on a fire-and-forget request successfully making the callback at some point, and that is prone to introducing bugs even if this PR covers every case for now.
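One shape such a safeguard could take (purely illustrative, not code from this PR or from MemoryQueue): record when each creation was registered and periodically evict entries that outlive a retention window.

import java.time.Instant
import scala.collection.mutable
import scala.concurrent.duration._

// Illustrative only: a stale-entry sweep for the queue's in-progress creation bookkeeping.
class CreationIdTracker(retention: FiniteDuration) {
  private val inProgress = mutable.Map.empty[String, Instant] // creationId -> registered at

  def register(creationId: String): Unit = inProgress.update(creationId, Instant.now())

  def complete(creationId: String): Unit = inProgress.remove(creationId)

  // Run from a periodic timer: forget creations that never received an ack so a
  // lost callback cannot block the queue (or its shutdown) forever.
  def evictStale(): Seq[String] = {
    val cutoff = Instant.now().minusMillis(retention.toMillis)
    val stale = inProgress.collect { case (id, registeredAt) if registeredAt.isBefore(cutoff) => id }.toSeq
    stale.foreach(inProgress.remove)
    stale
  }
}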

@bdoyle0182 (Contributor, Author) commented:

One solution could be to make the request from MemoryQueue to ContainerManager an ask rather than a tell, with the ask timeout set to the value of CONFIG_whisk_scheduler_inProgressJobRetention plus one second of buffer. That would probably also significantly reduce the complexity of the MemoryQueue, since there would be fewer message cases to account for.

(I think the CreationJobManager actually gets the responsibility of responding to the MemoryQueue in most cases, but you should be able to just forward the ref of the ask as a param to CreationJobManager from ContainerManager)
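Roughly what that could look like with Akka's ask pattern (an illustrative sketch only: the actor refs, message types, and config value here are placeholders, and the real MemoryQueue/ContainerManager protocol differs):

import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object AskSketch {
  // Assumed to come from CONFIG_whisk_scheduler_inProgressJobRetention; the value here is made up.
  val inProgressJobRetention: FiniteDuration = 20.seconds

  def requestContainer(containerManager: ActorRef, msg: Any)(implicit ec: ExecutionContext): Future[Any] = {
    // Ask instead of tell: if no creation ack arrives within the retention window plus a
    // one-second buffer, the future fails with an AskTimeoutException and the caller can
    // treat the creation as failed and clean up its creationIds bookkeeping.
    implicit val timeout: Timeout = Timeout(inProgressJobRetention + 1.second)
    (containerManager ? msg).andThen {
      case Success(_) => // creation acked: decrement creationIds normally
      case Failure(_) => // timed out or failed: treat as a failed creation and clean up
    }
  }
}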

@ningyougang (Contributor) commented:

Got your point 👍

@codecov-commenter commented Aug 31, 2022

Codecov Report

Merging #5320 (ad0cc0f) into master (740c907) will decrease coverage by 4.39%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #5320      +/-   ##
==========================================
- Coverage   80.94%   76.54%   -4.40%     
==========================================
  Files         239      239              
  Lines       14242    14245       +3     
  Branches      604      594      -10     
==========================================
- Hits        11528    10904     -624     
- Misses       2714     3341     +627     
Impacted Files Coverage Δ
...sk/core/scheduler/container/ContainerManager.scala 95.90% <100.00%> (+0.05%) ⬆️
...core/database/cosmosdb/RxObservableImplicits.scala 0.00% <0.00%> (-100.00%) ⬇️
...ore/database/cosmosdb/cache/CacheInvalidator.scala 0.00% <0.00%> (-100.00%) ⬇️
...e/database/cosmosdb/cache/ChangeFeedConsumer.scala 0.00% <0.00%> (-100.00%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0.00% <0.00%> (-95.85%) ⬇️
...sk/core/database/cosmosdb/CosmosDBViewMapper.scala 0.00% <0.00%> (-93.90%) ⬇️
...tabase/cosmosdb/cache/CacheInvalidatorConfig.scala 0.00% <0.00%> (-92.31%) ⬇️
...enwhisk/connector/kafka/KamonMetricsReporter.scala 0.00% <0.00%> (-83.34%) ⬇️
...e/database/cosmosdb/cache/KafkaEventProducer.scala 0.00% <0.00%> (-78.58%) ⬇️
...whisk/core/database/cosmosdb/CosmosDBSupport.scala 0.00% <0.00%> (-74.08%) ⬇️
... and 18 more


@bdoyle0182 bdoyle0182 merged commit 138f3d9 into apache:master Aug 31, 2022
@bdoyle0182 bdoyle0182 deleted the add-error-handling-to-container-manager branch August 31, 2022 23:43
msciabarra pushed a commit to nuvolaris/openwhisk that referenced this pull request Nov 23, 2022
…che#5320)

* add error handling to container manager when invoker query fails

* fix tests

Co-authored-by: Brendan Doyle <brendand@qualtrics.com>