
Invoke restart drain for failed healthcheck #195

Merged (4 commits, May 13, 2021)

Conversation

philippthun (Member) commented Apr 30, 2021

Thanks for contributing to capi-release. To speed up the process of reviewing your pull request, please provide us with:

  • A short explanation of the proposed change:
    A failing health check stops the Cloud Controller without invoking the restart_drain script. BPM then terminates the process with a grace period of 20s; any long-running requests are cancelled and lead to a 502 Bad Gateway error at the client. With this change, the restart_drain script is also called for failing health checks, giving long-running requests a higher (configurable) timeout.

  • An explanation of the use cases your change solves
    On large foundations (i.e. many orgs, spaces, apps, services), we experience slow performance of Cloud Controllers. Sometimes Cloud Controllers don't even respond to new requests due to high load. While it makes sense that these Cloud Controllers get deregistered from the Gorouters, the timeout for the internal health check should be higher, and a failing health check should trigger a restart with a graceful shutdown.

  • Links to any other associated PRs

  • I have viewed, signed, and submitted the Contributor License Agreement

  • I have made this pull request to the develop branch

  • I have run CF Acceptance Tests on bosh lite

philippthun and others added 2 commits May 3, 2021 14:17
The current dependency chain in monit is:

  nginx_cc -> cloud_controller_ng -> ccng_monit_http_healthcheck

The health check is defined as the base process so that a failure triggers a restart of the dependent processes.

This is now changed to the logical dependency chain, i.e. the order in which the processes should be started:

  ccng_monit_http_healthcheck -> nginx_cc -> cloud_controller_ng

This removes the implicit restart trigger for the other processes in case the health check fails, but we can add a directive that explicitly invokes the 'restart_drain' script instead.
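
For illustration, the new chain maps onto monit directives roughly like this (a minimal sketch; pidfile paths and start/stop programs are elided, so treat it as illustrative rather than as the exact monit file):

  check process cloud_controller_ng ...
    # base process, started first

  check process nginx_cc ...
    depends on cloud_controller_ng

  check process ccng_monit_http_healthcheck ...
    depends on nginx_cc
    # explicit trigger replacing the old implicit restart behavior:
    if 1 restart within 2 cycles then exec "/var/vcap/jobs/cloud_controller_ng/bin/restart_drain"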
- The wait_for_server_to_become_healthy function is no longer needed, as the health check is started after cc and nginx.

- The printed status code was always 0 - it was the exit code of the 'if' statement, not the 'curl' command.

Co-authored-by: Andy Paine <andy.paine@engineerbetter.com>
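
For context, the status-code pitfall fixed here looks like this in shell (an illustrative snippet with URL assumed to be set, not the actual template code):

  # Buggy: inside the branch, '$?' is the exit code of the negated 'if'
  # condition, which is 0 whenever curl fails - so it always prints 0.
  if ! curl -sf "${URL}"; then
    echo "health check failed with status code $?"
  fi

  # Fixed: capture curl's exit code before testing it.
  curl -sf "${URL}"
  status=$?
  if [ "${status}" -ne 0 ]; then
    echo "health check failed with exit code ${status}"
  fi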
philippthun marked this pull request as ready for review May 3, 2021 15:14
start program "/var/vcap/jobs/bpm/bin/bpm start cloud_controller_ng -p ccng_monit_http_healthcheck"
stop program "/var/vcap/jobs/bpm/bin/bpm stop cloud_controller_ng -p ccng_monit_http_healthcheck"
if 1 restart within 2 cycles then exec "/var/vcap/jobs/cloud_controller_ng/bin/restart_drain"
depends on nginx_cc
Member

isn't this a circular dependency? cloud_controller_ng goes down (say totalmem > threshold), monit restarts nginx_cc, which causes this to call the restart_drain script?

And does monit end up running restart_drain twice in that case? Once at line L#9 and once at L#14?

Member Author

The restart_drain script does two things to prevent any interference with monit:

  1. It calls monit unmonitor here. This prevents monit from taking any actions in the following cycles. The same is done when BOSH stops a VM; it first calls monit unmonitor for all jobs and then the jobs' drain scripts (if provided).

  2. It uses a PID guard to ensure that there is no parallel execution of this script (here); a minimal sketch of both mechanisms follows this list.
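
For illustration, the two safeguards look roughly like this in shell (a minimal sketch; the pidfile path and job name are hypothetical, and the actual restart_drain script may differ):

  PIDFILE=/tmp/restart_drain.pid  # hypothetical path

  # 2. PID guard: if a previous invocation is still alive, just return.
  if [ -f "${PIDFILE}" ] && kill -0 "$(cat "${PIDFILE}")" 2>/dev/null; then
    exit 0
  fi
  echo $$ > "${PIDFILE}"
  trap 'rm -f "${PIDFILE}"' EXIT

  # 1. Keep monit from acting on the processes while draining.
  /var/vcap/bosh/bin/monit unmonitor cloud_controller_ng

  # ... graceful shutdown of the Cloud Controller happens here ...

  /var/vcap/bosh/bin/monit monitor cloud_controller_ng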

So the scenario you described (totalmem > threshold) will result in the following:

  1. monit calls the restart_drain script,
  2. restart_drain calls monit unmonitor,
  3. graceful shutdown,
  4. restart_drain calls monit monitor,
  5. monit restarts the processes (as they are stopped).

There might still be an edge case that could lead to restart_drain being called twice: the totalmem (cloud_controller_ng) and the restart (ccng_monit_http_healthcheck) checks could be triggered independently at the same time (i.e. both conditions are true in the same monit cycle). Then the PID guard mentioned above should ensure that one invocation triggers the graceful shutdown and the other one just returns.

Member

Got it. Thanks for the thorough explanation.

Instead of re-using config options from the route registration health check, add dedicated options to configure timeout and retries for the ccng_monit_http_healthcheck process.

There are situations where the Cloud Controller is busy processing multiple long-running requests; although it makes sense to temporarily deregister its route to prevent additional load on the instance, the (internal) health check should not restart the instance at the same time, but only when it stays busy for too long.
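
A health check process with dedicated options might look roughly like this (a sketch; RETRIES, TIMEOUT, INTERVAL and URL are hypothetical stand-ins for the new properties, and the actual ERB template may differ):

  failures=0
  while true; do
    if curl -sf --max-time "${TIMEOUT}" "${URL}" > /dev/null; then
      failures=0
    else
      failures=$((failures + 1))
      if [ "${failures}" -ge "${RETRIES}" ]; then
        # Exiting makes monit register a restart, which in turn fires
        # the 'restart_drain' directive shown earlier.
        echo "$(date --rfc-3339=ns) health check failed ${failures} times; exiting"
        exit 1
      fi
    fi
    sleep "${INTERVAL}"
  done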
- bin/shutdown_drain does not need to echo '0'; this is already done by bin/drain (https://github.com/cloudfoundry/capi-release/blob/2b52d846f82c7dbe893ffc21f879c15a3fa1da36/jobs/cloud_controller_ng/templates/drain.sh.erb#L9).

- As the bin/shutdown_drain script is never invoked with arguments, logging the invocation is superfluous.
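
For context, the division of labor between the two scripts is roughly this (simplified, with illustrative paths; see the linked drain.sh.erb):

  # bin/drain (the script BOSH invokes)
  /var/vcap/jobs/cloud_controller_ng/bin/shutdown_drain  # does the actual drain work
  echo 0  # reports success to BOSH, so shutdown_drain need not echo '0' itself
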
@@ -6,8 +6,5 @@ $LOAD_PATH.unshift('/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/l
require 'cloud_controller/drain'

@drain = VCAP::CloudController::Drain.new('/var/vcap/sys/log/cloud_controller_ng')
@drain.log_invocation(ARGV)
Member

is there a particular reason to remove this log?

Member Author

The bin/shutdown_drain script is never invoked with any arguments, so the log contained only useless "Drain invoked with " messages.


echo $(date --rfc-3339=ns) 'Waiting for Cloud Controller to initially become healthy at'

wait_for_server_to_become_healthy "${URL}" "<%= p("cc.api_post_start_healthcheck_timeout_in_seconds") %>"
Member

Is the removal of this initial wait period done with the idea that the health-check logic below should allow for sufficient time for CCNG to start up? We have done some digging and couldn't find a recorded need for this initial wait period, but were curious if you did any investigation around this.

Member Author

This initial wait was required before, when the health check was started as the first process in the dependency chain. As it has now been moved to the end of the chain, both cc and nginx are already started by the time it runs.

Member

Just now saw that you already answered this (and our other questions) in the individual commit descriptions 😅 . Thanks.

sethboyles (Member)

Thanks for the PR! We tried this out and successfully witnessed a graceful restart after a failed healthcheck. This seems like a valuable change to include; we just had one last question (commented above). We will merge after that question is resolved.
