Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve http graceful termination #4957

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

tysonnorris
Copy link
Contributor

Description

Deploying controller to kubernetes, we see some case where using /ping as livenessProbe and readinessProbe does not provide great resilience when deleting a controller pod while under load. A typical case for controller unexpected shutdown under load is eviction process for cluster autoscaler, or node maintenance.

This PR provides:

  • a separate /ready endpoint that will return 503 for a configurable amount of time upon SIGTERM
  • better logging for http service shutdown procedure
  • use of CoordinatedShutdown (instead of system hook) to terminate akka http service

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@@ -156,24 +163,50 @@ trait BasicHttpService extends Directives {
}

object BasicHttpService {
implicit val tid = TransactionId(systemPrefix + "http_service")
//start with ready true
protected[http] var ready = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this used for anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ready var? It is used to alter the response to /ready endpoint defined in BasicRasService.
I'll try to refactor this so that we're not modifying the object var from different places.

Copy link
Member

@rabbah rabbah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks generally ok to me.

whisk {
http {
shutdown-unready-delay = 15 seconds # /ready will return 500 for this amount of time before connection draining starts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it would wait for 15 seconds, and start the termination phase.
But if there are still client connections connected, are they getting connection closed error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only allows an unready stage, which only affects the added /ready endpoint to return 503 (no other endpoints affected). This does not affect how akka http handles existing client connections during shutdown, so if akka allows gracefully terminating in flight requests, then no connection closed errors, but no new connections allowed. The purpose here is to allow external systems (e.g. kubernetes or reverse proxies) to probe the service and stop routing to this instance for some time before akka shutdown initiates.

if (readyState) {
complete("ok")
} else {
logging.warn(this, "not ready...")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this just be a debug? I think ready route returning 503 should be self explanatory

@bdoyle0182
Copy link
Contributor

@tysonnorris would you still like to see this pr merged?

@style95 style95 added the stale old issue which needs to validate label May 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale old issue which needs to validate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants