
Invoke restart drain for failed healthcheck #195

Merged (4 commits, May 13, 2021)

Conversation

philippthun (Member) commented Apr 30, 2021

Thanks for contributing to capi-release. To speed up the process of reviewing your pull request, please provide us with:

  • A short explanation of the proposed change:
    A failing health check stops the Cloud Controller without invoking the restart_drain script. BPM then terminates the process with a grace period of 20s; any long-running requests are cancelled and lead to a 502 Bad Gateway error at the client. With this change, the restart_drain script is also called for failing health checks, giving long-running requests a higher (configurable) timeout.

  • An explanation of the use cases your change solves
    On large foundations (i.e. many orgs, spaces, apps, services), we experience slow performance of Cloud Controllers. Sometimes Cloud Controllers don't even respond to new requests due to high load. While it makes sense that these Cloud Controllers get deregistered from the Gorouters, the timeout for the internal health check should be higher, and a failing health check should trigger a restart with a graceful shutdown.

  • Links to any other associated PRs

  • I have viewed, signed, and submitted the Contributor License Agreement

  • I have made this pull request to the develop branch

  • I have run CF Acceptance Tests on bosh lite

philippthun and others added 2 commits May 3, 2021 14:17
The current dependency chain in monit is:

  nginx_cc -> cloud_controller_ng -> ccng_monit_http_healthcheck

The health check is defined as the base process so that a failure triggers a restart of the dependent processes.

This is now changed to the logical dependency chain, i.e. the order in which the processes should be started:

  ccng_monit_http_healthcheck -> nginx_cc -> cloud_controller_ng

This removes the implicit restart trigger for the other processes in case the health check fails, but we can add a directive that explicitly invokes the 'restart_drain' script instead.
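
For illustration, the new chain maps onto monit directives roughly like this (a minimal sketch; pidfile paths and start/stop programs are elided, so treat it as illustrative rather than as the exact monit file):

  check process cloud_controller_ng ...
    # base process, started first

  check process nginx_cc ...
    depends on cloud_controller_ng

  check process ccng_monit_http_healthcheck ...
    depends on nginx_cc
    # explicit trigger replacing the old implicit restart behavior:
    if 1 restart within 2 cycles then exec "/var/vcap/jobs/cloud_controller_ng/bin/restart_drain"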
- The wait_for_server_to_become_healthy function is no longer needed, as the health check is started after cc and nginx.

- The printed status code was always 0 - it was the exit code of the 'if' statement, not the 'curl' command.

Co-authored-by: Andy Paine <andy.paine@engineerbetter.com>
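
For context, the status-code pitfall fixed here looks like this in shell (an illustrative snippet with URL assumed to be set, not the actual template code):

  # Buggy: inside the branch, '$?' is the exit code of the negated 'if'
  # condition, which is 0 whenever curl fails - so it always prints 0.
  if ! curl -sf "${URL}"; then
    echo "health check failed with status code $?"
  fi

  # Fixed: capture curl's exit code before testing it.
  curl -sf "${URL}"
  status=$?
  if [ "${status}" -ne 0 ]; then
    echo "health check failed with exit code ${status}"
  fi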
philippthun marked this pull request as ready for review May 3, 2021 15:14
start program "/var/vcap/jobs/bpm/bin/bpm start cloud_controller_ng -p ccng_monit_http_healthcheck"
stop program "/var/vcap/jobs/bpm/bin/bpm stop cloud_controller_ng -p ccng_monit_http_healthcheck"
if 1 restart within 2 cycles then exec "/var/vcap/jobs/cloud_controller_ng/bin/restart_drain"
depends on nginx_cc
Member

isn't this a circular dependency? cloud_controller_ng goes down (say totalmem > threshold), monit restarts nginx_cc, which causes this to call the restart_drain script?

And does monit end up running restart_drain twice in that case? Once at line L#9 and once at L#14?

Member Author

The restart_drain script does two things to prevent any interference with monit:

  1. It calls monit unmonitor here. This prevents monit from taking any actions in the following cycles. The same is done when BOSH stops a VM; it first calls monit unmonitor for all jobs and then the jobs' drain scripts (if provided).

  2. It uses a PID guard to ensure that there is no parallel execution of this script (here); a minimal sketch of both mechanisms follows this list.
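
For illustration, the two safeguards look roughly like this in shell (a minimal sketch; the pidfile path and job name are hypothetical, and the actual restart_drain script may differ):

  PIDFILE=/tmp/restart_drain.pid  # hypothetical path

  # 2. PID guard: if a previous invocation is still alive, just return.
  if [ -f "${PIDFILE}" ] && kill -0 "$(cat "${PIDFILE}")" 2>/dev/null; then
    exit 0
  fi
  echo $$ > "${PIDFILE}"
  trap 'rm -f "${PIDFILE}"' EXIT

  # 1. Keep monit from acting on the processes while draining.
  /var/vcap/bosh/bin/monit unmonitor cloud_controller_ng

  # ... graceful shutdown of the Cloud Controller happens here ...

  /var/vcap/bosh/bin/monit monitor cloud_controller_ng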

So the scenario you described (totalmem > threshold) will result in the following:

  1. monit calls the restart_drain script,
  2. restart_drain calls monit unmonitor,
  3. graceful shutdown,
  4. restart_drain calls monit monitor,
  5. monit restarts the processes (as they are stopped).

There might still be an edge case that could lead to restart_drain being called twice: the totalmem (cloud_controller_ng) and the restart (ccng_monit_http_healthcheck) checks could be triggered independently at the same time (i.e. both conditions are true in the same monit cycle). Then the PID guard mentioned above should ensure that one invocation triggers the graceful shutdown and the other one just returns.

Member

Got it. Thanks for the thorough explanation.

Instead of re-using config options from the route registration health check, add dedicated options to configure timeout and retries for the ccng_monit_http_healthcheck process.

There are situations where the Cloud Controller is busy processing multiple long-running requests; although it makes sense to temporarily deregister its route to prevent additional load on the instance, the (internal) health check should not restart the instance at the same time, but only when it stays busy for too long.
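
A health check process with dedicated options might look roughly like this (a sketch; RETRIES, TIMEOUT, INTERVAL and URL are hypothetical stand-ins for the new properties, and the actual ERB template may differ):

  failures=0
  while true; do
    if curl -sf --max-time "${TIMEOUT}" "${URL}" > /dev/null; then
      failures=0
    else
      failures=$((failures + 1))
      if [ "${failures}" -ge "${RETRIES}" ]; then
        # Exiting makes monit register a restart, which in turn fires
        # the 'restart_drain' directive shown earlier.
        echo "$(date --rfc-3339=ns) health check failed ${failures} times; exiting"
        exit 1
      fi
    fi
    sleep "${INTERVAL}"
  done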
- bin/shutdown_drain does not need to echo '0'; this is already done by bin/drain (https://github.com/cloudfoundry/capi-release/blob/2b52d846f82c7dbe893ffc21f879c15a3fa1da36/jobs/cloud_controller_ng/templates/drain.sh.erb#L9).

- As the bin/shutdown_drain script is never invoked with arguments, logging the invocation is superfluous.
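
For context, the division of labor between the two scripts is roughly this (simplified, with illustrative paths; see the linked drain.sh.erb):

  # bin/drain (the script BOSH invokes)
  /var/vcap/jobs/cloud_controller_ng/bin/shutdown_drain  # does the actual drain work
  echo 0  # reports success to BOSH, so shutdown_drain need not echo '0' itself
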
@@ -6,8 +6,5 @@ $LOAD_PATH.unshift('/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/l
require 'cloud_controller/drain'

@drain = VCAP::CloudController::Drain.new('/var/vcap/sys/log/cloud_controller_ng')
@drain.log_invocation(ARGV)
Member

is there a particular reason to remove this log?

Member Author

The bin/shutdown_drain script is never invoked with any arguments, so the log contained only useless "Drain invoked with " messages.


echo $(date --rfc-3339=ns) 'Waiting for Cloud Controller to initially become healthy at'

wait_for_server_to_become_healthy "${URL}" "<%= p("cc.api_post_start_healthcheck_timeout_in_seconds") %>"
Member

Is the removal of this initial wait period done with the idea that the health-check logic below should allow for sufficient time for CCNG to start up? We have done some digging and couldn't find a recorded need for this initial wait period, but were curious if you did any investigation around this.

Member Author

This initial wait was required before, when the health check was started as the first process in the dependency chain. As it has now been moved to the end of the chain, both cc and nginx are already started by the time it runs.

Member

Just now saw that you already answered this (and our other questions) in the individual commit descriptions 😅 . Thanks.

sethboyles (Member)

Thanks for the PR! We tried this out and successfully witnessed a graceful restart after a failed healthcheck. This seems like a valuable change to include; we just had one last question (commented above). We will merge after that question is resolved.
