
add error handling to container manager when invoker query fails #5320

Merged

Conversation

@bdoyle0182 (Contributor) commented Aug 31, 2022

Description

There is no failure handling when the query to etcd for the list of healthy invokers fails. The container manager swallows the creation message, and the memory queue never hears back about the status of the container creation.
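To make that failure mode concrete, here is a minimal self-contained sketch (the types, names, and bodies are stand-ins, not the actual ContainerManager code): the etcd query returns a Future, and without a recover branch a failed Future simply completes the call with no ack ever being sent.

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object CreationSketch {
  // Stand-ins for the real OpenWhisk types, only so the sketch is self-contained.
  final case class InvokerHealth(id: Int)
  final case class ContainerCreationMessage(creationId: String)

  // Simulates the etcd query failing, e.g. because etcd is unreachable.
  def getAvailableInvokers(): Future[List[InvokerHealth]] =
    Future.failed(new RuntimeException("etcd unavailable"))

  def createContainers(msg: ContainerCreationMessage): Unit = {
    getAvailableInvokers().map { invokers =>
      // Scheduling and the ack back to the MemoryQueue happen only when the query succeeds.
      println(s"scheduling ${msg.creationId} onto ${invokers.size} invokers")
    }
    // No recover branch: when the Future fails, nothing is reported back to the
    // MemoryQueue, so its creationIds bookkeeping for this creation is never cleared.
  }
}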

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

Review thread on the failure-handling branch added in ContainerManager.scala:

case t: Throwable =>
  logging.error(this, s"Unable to get available invokers: ${t.getMessage}.")
  List.empty[InvokerHealth]
})
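Reusing the stand-ins from the sketch in the description, roughly how the recovered call behaves; how the empty list is handled downstream is an assumption for illustration, not the literal ContainerManager code.

import scala.concurrent.ExecutionContext.Implicits.global
import CreationSketch._

object CreationSketchWithRecover {
  def createContainers(msg: ContainerCreationMessage): Unit = {
    getAvailableInvokers()
      .recover {
        case t: Throwable =>
          // Log and fall back to "no invokers" instead of letting the Future fail silently.
          println(s"Unable to get available invokers: ${t.getMessage}.")
          List.empty[InvokerHealth]
      }
      .map { invokers =>
        if (invokers.isEmpty) {
          // Assumption: the empty list flows into the existing "no available invokers"
          // handling, so the MemoryQueue still receives a failure ack and can decrement
          // creationIds for this job instead of waiting forever.
          println(s"no invokers available for ${msg.creationId}, acking failure")
        } else {
          println(s"scheduling ${msg.creationId} onto ${invokers.size} invokers")
        }
      }
  }
}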
@ningyougang (Contributor) commented Aug 31, 2022

  • container manager swallows the message
    You mean that if a Throwable happens, there is no ack message?
  • the memory queue will never hear back the status of the container creation
    Why?

@bdoyle0182 (Contributor, Author) commented Aug 31, 2022

A request is made to etcd to get the list of healthy invokers via .getAvailableInvokers. Prior to this change there was no failure handling on that call, so if the future for the etcd request fails for any reason, the function (which just returns Unit) simply completes without ever acknowledging back to the memory queue that the creation message was processed, either successfully or as a failure, which is what the memory queue needs in order to decrement creationIds.

I should also clarify that when it fails silently at this point, the creation job has not yet been registered, so the akka timer that would time out the creation job and call back to MemoryQueue is never created. The memory queue therefore believes indefinitely that a container creation is in progress. If the action never needs more than one container, it can never execute, because the memory queue thinks one is already being created. The memory queue also cannot be stopped on timeout in this case, because creationIds is not 0:

else {
        logging.info(
          this,
          s"[$invocationNamespace:$action:$stateName] The queue is timed out but there are still ${queue.size} activation messages or (running: ${containers.size}, in-progress: ${creationIds.size}) containers")
        stay
      }

So while I think this covers the only edge case I know of, we really need an additional safeguard in MemoryQueue to eventually clear out its knowledge of in-progress containers if things get out of sync. There is no way to guarantee creationIds stays perfectly in sync when it essentially depends on a fire-and-forget request successfully making the callback at some point, and that is prone to introducing bugs even if this PR covers every case for now.
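One shape such a safeguard could take (purely illustrative, not code from this PR or from MemoryQueue): record when each creation was registered and periodically evict entries that outlive a retention window.

import java.time.Instant
import scala.collection.mutable
import scala.concurrent.duration._

// Illustrative only: a stale-entry sweep for the queue's in-progress creation bookkeeping.
class CreationIdTracker(retention: FiniteDuration) {
  private val inProgress = mutable.Map.empty[String, Instant] // creationId -> registered at

  def register(creationId: String): Unit = inProgress.update(creationId, Instant.now())

  def complete(creationId: String): Unit = inProgress.remove(creationId)

  // Run from a periodic timer: forget creations that never received an ack so a
  // lost callback cannot block the queue (or its shutdown) forever.
  def evictStale(): Seq[String] = {
    val cutoff = Instant.now().minusMillis(retention.toMillis)
    val stale = inProgress.collect { case (id, registeredAt) if registeredAt.isBefore(cutoff) => id }.toSeq
    stale.foreach(inProgress.remove)
    stale
  }
}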

@bdoyle0182 (Contributor, Author) commented:

One solution could be to make the request from MemoryQueue to ContainerManager an ask rather than a tell, with the ask timeout set to the value of CONFIG_whisk_scheduler_inProgressJobRetention plus one second of buffer. That would probably also significantly reduce the complexity of the MemoryQueue, since there would be fewer message cases to account for.

(I think the CreationJobManager actually gets the responsibility of responding to the MemoryQueue in most cases, but you should be able to just forward the ref of the ask as a param to CreationJobManager from ContainerManager)
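Roughly what that could look like with Akka's ask pattern (an illustrative sketch only: the actor refs, message types, and config value here are placeholders, and the real MemoryQueue/ContainerManager protocol differs):

import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object AskSketch {
  // Assumed to come from CONFIG_whisk_scheduler_inProgressJobRetention; the value here is made up.
  val inProgressJobRetention: FiniteDuration = 20.seconds

  def requestContainer(containerManager: ActorRef, msg: Any)(implicit ec: ExecutionContext): Future[Any] = {
    // Ask instead of tell: if no creation ack arrives within the retention window plus a
    // one-second buffer, the future fails with an AskTimeoutException and the caller can
    // treat the creation as failed and clean up its creationIds bookkeeping.
    implicit val timeout: Timeout = Timeout(inProgressJobRetention + 1.second)
    (containerManager ? msg).andThen {
      case Success(_) => // creation acked: decrement creationIds normally
      case Failure(_) => // timed out or failed: treat as a failed creation and clean up
    }
  }
}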

@ningyougang (Contributor) commented:

Got your point 👍

@codecov-commenter commented Aug 31, 2022

Codecov Report

Merging #5320 (ad0cc0f) into master (740c907) will decrease coverage by 4.39%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #5320      +/-   ##
==========================================
- Coverage   80.94%   76.54%   -4.40%     
==========================================
  Files         239      239              
  Lines       14242    14245       +3     
  Branches      604      594      -10     
==========================================
- Hits        11528    10904     -624     
- Misses       2714     3341     +627     
Impacted Files Coverage Δ
...sk/core/scheduler/container/ContainerManager.scala 95.90% <100.00%> (+0.05%) ⬆️
...core/database/cosmosdb/RxObservableImplicits.scala 0.00% <0.00%> (-100.00%) ⬇️
...ore/database/cosmosdb/cache/CacheInvalidator.scala 0.00% <0.00%> (-100.00%) ⬇️
...e/database/cosmosdb/cache/ChangeFeedConsumer.scala 0.00% <0.00%> (-100.00%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0.00% <0.00%> (-95.85%) ⬇️
...sk/core/database/cosmosdb/CosmosDBViewMapper.scala 0.00% <0.00%> (-93.90%) ⬇️
...tabase/cosmosdb/cache/CacheInvalidatorConfig.scala 0.00% <0.00%> (-92.31%) ⬇️
...enwhisk/connector/kafka/KamonMetricsReporter.scala 0.00% <0.00%> (-83.34%) ⬇️
...e/database/cosmosdb/cache/KafkaEventProducer.scala 0.00% <0.00%> (-78.58%) ⬇️
...whisk/core/database/cosmosdb/CosmosDBSupport.scala 0.00% <0.00%> (-74.08%) ⬇️
... and 18 more


@bdoyle0182 bdoyle0182 merged commit 138f3d9 into apache:master Aug 31, 2022
@bdoyle0182 bdoyle0182 deleted the add-error-handling-to-container-manager branch August 31, 2022 23:43
msciabarra pushed a commit to nuvolaris/openwhisk that referenced this pull request Nov 23, 2022
…che#5320)

* add error handling to container manager when invoker query fails

* fix tests

Co-authored-by: Brendan Doyle <brendand@qualtrics.com>