Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add enable/disable invoker support to old scheduler #5205

Merged

Conversation

bdoyle0182
Copy link
Contributor

@bdoyle0182 bdoyle0182 commented Mar 24, 2022

add enable/disable invoker support to old scheduler

Description

Adds enable / disabling of original invoker implementation. Disabling will turn off the scheduler of health pings so the controller will view the invoker as offline. Enabling will reschedule the health pings if it was disabled. Also added a route for both invoker types to check if an invoker is disabled so that the state of a running invoker does not get lost.

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

Copy link
Member

@style95 style95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me.
Are you planning to automate this procedure in Ansible scripts?

complete("not supported")
if (healthScheduler.nonEmpty) {
actorSystem.stop(healthScheduler.get)
healthScheduler = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JFYI.
While it would be effective, invokers in my downstream are sending a certain message to controllers so that they can make invokers unhealthy immediately to avoid a long-running activation coming to a disabled invoker during the timeout.

But I think this is just a minor tweak.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that on the old scheduler for your downstream or a part of the new scheduler?

If on the old scheduler, I could do that as a follow up since it would require a change to the message bus and logic in the controller to read different health pings off the health topic. But think it's more of an optimization since the controller should recognize it as offline within ten seconds of disabling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the new scheduler relies on ETCD for health, the logic resides in the old scheduler.

I looked into the downstream code and realize that we just did a workaround to minimize the change by using an instance ID that is less than 0.
I think this would not be an official way but just a workaround.

@bdoyle0182
Copy link
Contributor Author

We don't use ansible so I wasn't planning on to here, but I'm sure it could be added easily with a config of how long you want to wait after disabling before stopping the invoker.

Copy link
Member

@style95 style95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rabbah
Copy link
Member

rabbah commented Mar 24, 2022

I have the same question as Dominic. When you disable an invoker (invoker reactive) what happens with the active acks it’s still sending since it could still be draining activations off the topic. Also is the idea to then take down the invoker? Should it shut down?

@bdoyle0182
Copy link
Contributor Author

bdoyle0182 commented Mar 24, 2022

The active acks would continue to still be sent back to the controller until there's no activations left to process off of kafka. This is where it's not perfect since you would configure a sleep time between disabling the invoker and then shutting it down. You may also not want to shut down but stop processing on that invoker to debug some issue.

InvokerSupervision in the controller will transition to Offline and stay in that state until it gets another ping back even if it continues to get completion messages.

Dominic did bring up a good point that I could add a disabled boolean field to the Ping message and send that as a final message on the health topic so the controller sees immediately that it is disabled to transition to Offline rather than waiting up to 10 seconds (the hardcoded time in InvokerSupervision to transition to Offline if no ping has been received. I think this could be done as an optimization separately.

The new scheduler's invoker enable / disable mechanism has more available insight on whether an invoker is safe to stop because of the scheduler metrics you can get through the http routes it has

@bdoyle0182
Copy link
Contributor Author

@style95 @rr I updated the pr to change the PingMessage format (the format change is backwards compatible for rolling restarts of components) to include a disabled flag so the controller can recognize it's disabled immediately! Just working on unit tests and then it should be good to go.

@codecov-commenter
Copy link

codecov-commenter commented Mar 25, 2022

Codecov Report

Merging #5205 (94d37af) into master (5332e6d) will increase coverage by 28.33%.
The diff coverage is 56.66%.

@@             Coverage Diff             @@
##           master    #5205       +/-   ##
===========================================
+ Coverage   44.65%   72.98%   +28.33%     
===========================================
  Files         238      238               
  Lines       13937    13957       +20     
  Branches      581      570       -11     
===========================================
+ Hits         6224    10187     +3963     
+ Misses       7713     3770     -3943     
Impacted Files Coverage Δ
...he/openwhisk/core/invoker/FPCInvokerReactive.scala 0.00% <0.00%> (ø)
...pache/openwhisk/core/invoker/InvokerReactive.scala 54.33% <21.42%> (-24.31%) ⬇️
...enwhisk/core/loadBalancer/InvokerSupervision.scala 90.27% <83.33%> (+90.27%) ⬆️
.../org/apache/openwhisk/core/connector/Message.scala 76.28% <100.00%> (+12.55%) ⬆️
.../openwhisk/core/invoker/DefaultInvokerServer.scala 100.00% <100.00%> (+54.54%) ⬆️
...ache/openwhisk/core/invoker/FPCInvokerServer.scala 76.92% <100.00%> (+76.92%) ⬆️
...la/org/apache/openwhisk/core/invoker/Invoker.scala 70.76% <100.00%> (+1.41%) ⬆️
...whisk/connector/kafka/KafkaProducerConnector.scala 60.00% <0.00%> (-20.00%) ⬇️
...whisk/connector/kafka/KafkaConsumerConnector.scala 81.69% <0.00%> (-5.64%) ⬇️
... and 135 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5332e6d...94d37af. Read the comment docs.

@bdoyle0182 bdoyle0182 merged commit 3b6d07a into apache:master Mar 29, 2022
@bdoyle0182 bdoyle0182 deleted the add-disable-support-for-old-scheduler branch March 29, 2022 07:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants