
Remove the healthcheck step #740

Merged
merged 9 commits into from
May 21, 2024
Conversation

djmb
Collaborator

@djmb djmb commented Mar 25, 2024

To speed up deployments, we'll remove the healthcheck step.

This adds some risk to deployments for non-web roles - if they don't have a Docker healthcheck configured, the only check we do is whether the container is running.

If there is a bad image, we might see the container running before it exits and deploy it. Previously the healthcheck step would have avoided this by ensuring a web container could boot and serve traffic first.

To mitigate this, we'll add a web barrier. Non-web containers will wait until at least one web container has passed its healthcheck before shutting down the old containers.

If the web container fails its healthcheck, we'll close the barrier and shut down the new containers on the non-web roles.

We also have a new integration test to check we correctly handle a broken image. This highlighted that SSHKit's default runner will stop at the first error it encounters. We'll now have a custom runner that waits for all threads to finish, allowing them to clean up.

That means that if a deployment completes on some hosts but not others, we can run kamal app version --quiet to see which version is running on each host.
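The web barrier described above can be sketched with a mutex and condition variable. This is a minimal illustration under an assumed class name (DeployBarrier), not Kamal's actual implementation:

```ruby
# Sketch of the web barrier idea (hypothetical class name; Kamal's real
# implementation differs). A web role opens the barrier once a container
# passes its healthcheck, or closes it on failure; non-web roles block
# on it before shutting down their old containers.
class DeployBarrier
  def initialize
    @mutex = Mutex.new
    @condition = ConditionVariable.new
    @state = nil # becomes :opened or :closed; the first signal wins
  end

  # Called by the web role after a successful healthcheck.
  def open
    signal(:opened)
  end

  # Called by the web role when the healthcheck fails.
  def close
    signal(:closed)
  end

  # Non-web roles wait here; true means it is safe to proceed.
  def wait
    @mutex.synchronize do
      @condition.wait(@mutex) while @state.nil?
      @state == :opened
    end
  end

  private

  def signal(state)
    @mutex.synchronize do
      @state ||= state
      @condition.broadcast
    end
  end
end
```

A non-web role would call wait before stopping its old containers, and roll back the new ones if it returns false.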

@djmb djmb force-pushed the remove-healthcheck-step branch 2 times, most recently from 117d4aa to 6d57012 Compare April 3, 2024 08:21
@djmb djmb force-pushed the remove-healthcheck-step branch 2 times, most recently from b8b38c6 to f1ea2c7 Compare April 3, 2024 09:53
Comment on lines 68 to 79
if barrier
  if barrier_role?
    barrier.open
  else
    wait_for_barrier
  end
end
Contributor

Early return suggestion:

Suggested change
    if barrier
      if barrier_role?
        barrier.open
      else
        wait_for_barrier
      end
    end
becomes:
    return unless barrier
    if barrier_role?
      barrier.open
    else
      wait_for_barrier
    end

@@ -282,4 +285,8 @@ def current_running_version(host: KAMAL.primary_host)
def version_or_latest
  options[:version] || KAMAL.config.latest_tag
end

def web_and_non_web_roles?
Contributor

This became unused after recent changes.

@djmb djmb force-pushed the remove-healthcheck-step branch 2 times, most recently from 5890b14 to 494c205 Compare April 3, 2024 15:33
@djmb
Collaborator Author

djmb commented Apr 4, 2024

I'm hoping to get capistrano/sshkit#535 added to SSHKit, so we don't need SSHKit::Runner::ParallelCompleteAll. It works as the default runner, but a rolling deploy uses the group runner, which internally uses SSHKit::Runner::Parallel.
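The fail-slow behaviour can be illustrated with a small runner that waits for every host's thread before raising. This is a simplified sketch under assumed names, not SSHKit's or Kamal's actual API:

```ruby
# Sketch of a "fail slow" parallel runner in the spirit of the
# ParallelCompleteAll runner mentioned above (class and method names are
# assumptions). Instead of aborting on the first failing host, it waits
# for every thread to finish, then reports all the errors at once so
# each host gets a chance to clean up.
class FailSlowRunner
  def initialize(hosts, &block)
    @hosts = hosts
    @block = block
  end

  def execute
    errors = @hosts.map { |host|
      Thread.new do
        @block.call(host)
        nil
      rescue StandardError => e
        [host, e] # capture rather than re-raise, so other hosts continue
      end
    }.map(&:value).compact # Thread#value joins each thread

    return if errors.empty?

    messages = errors.map { |host, e| "#{host}: #{e.message}" }
    raise "Failed on #{errors.size} host(s): #{messages.join(', ')}"
  end
end
```

The key difference from a fail-fast runner is that the join (`map(&:value)`) happens before any error is raised, so a broken image on one host can't strand half-finished work on the others.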

@argia-andreas

@djmb
Hi!
If I understand correctly, the initial healthcheck step will be removed by this.
(But the container/Docker healthchecks will remain - the /up endpoint, or a cmd for non-web roles.)

At the moment, we are using the fact that the healthcheck step runs before everything else to run our migrations, which works nicely as it ensures they run successfully before the new web containers start serving traffic.

Any recommendations for achieving a similar outcome - making sure commands run successfully on one container before new containers are booted and traffic is served - once the healthcheck step is removed?

Thanks.

@argia-andreas

argia-andreas commented Apr 6, 2024

Also, would this PR resolve the issue with not being able to deploy to a specific role on a host that doesn't have the primary role?

As mentioned in this issue: #709

For additional context:
This seems to be due to the initial container healthcheck step, which is performed with the primary role.

kamal deploy -r secondary-role

Ensure app can pass healthcheck...
  INFO [703ee48c] Running docker run --detach --name healthcheck-primary-role --publish 3999:80 --label service=healthcheck-primary-role -e KAMAL_CONTAINER_NAME="healthcheck-primary-role" --env-file .kamal/env/roles/primary-role.env --health-cmd "curl -f http://localhost/up || exit 1"
  INFO [a8ccc7f3] Finished in 0.075 seconds with exit status 123 (failed).
Releasing the deploy lock...
  Finished all in 22.6 seconds
  ERROR (SSHKit::Command::Failed): Exception while executing on host 192.168.0.11: docker exit status: 125
docker stdout: Nothing written
docker stderr: docker: open .kamal/env/roles/primary-role.env: no such file or directory.

@djmb
Collaborator Author

djmb commented Apr 22, 2024

@argia-andreas - you should be able to run migrations before anything is booted with a pre-deploy hook.

You'd want to run something like:

kamal app exec -p -q -d $KAMAL_DESTINATION --version $KAMAL_VERSION "rails db:migrate"

Regarding #709 - this is intended to fix that though I'll need to double check that it does!
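Put together, such a pre-deploy hook might be sketched as follows. The migrate_command helper is hypothetical; KAMAL_DESTINATION and KAMAL_VERSION are the environment variables from the command above, which Kamal sets when invoking hooks:

```ruby
#!/usr/bin/env ruby
# Sketch of a .kamal/hooks/pre-deploy hook that runs migrations on one
# container before the deploy proceeds. migrate_command is a hypothetical
# helper; KAMAL_DESTINATION and KAMAL_VERSION come from Kamal.

def migrate_command(destination: ENV["KAMAL_DESTINATION"], version: ENV["KAMAL_VERSION"])
  cmd = ["kamal", "app", "exec", "-p", "-q"]
  cmd += ["-d", destination] if destination && !destination.empty?
  cmd += ["--version", version] if version
  cmd << "rails db:migrate"
end

# Only invoke kamal when Kamal itself is calling the hook (it sets
# KAMAL_VERSION); this keeps the file loadable on its own.
if ENV["KAMAL_VERSION"]
  system(*migrate_command) || abort("Migrations failed - aborting deploy")
end
```

A non-zero exit from a hook aborts the deploy, so a failed migration stops the new containers from being booted.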

djmb added 6 commits May 20, 2024 12:18
To speed up deployments, we'll remove the healthcheck step.

This adds some risk to deployments for non-web roles - if they don't
have a Docker healthcheck configured, the only check we do is whether
the container is running.

If there is a bad image we might see the container running before it
exits and deploy it. Previously the healthcheck step would have avoided
this by ensuring a web container could boot and serve traffic first.

To mitigate this, we'll add a deployment barrier. Until one of the
primary role containers passes its healthcheck, we'll keep the barrier
up and avoid stopping the containers on the non-primary roles.

If the primary role container fails its healthcheck, we'll close the
barrier and shut down the new containers on the waiting roles.

We also have a new integration test to check we correctly handle a
broken image. This highlighted that SSHKit's default runner will
stop at the first error it encounters. We'll now have a custom runner
that waits for all threads to finish, allowing them to clean up.
@djmb djmb force-pushed the remove-healthcheck-step branch from 39d0ab5 to ee758d9 Compare May 20, 2024 11:18
djmb added 3 commits May 21, 2024 08:37
If a primary role container is unhealthy, we might take a while to
time out the health check poller. In the meantime, if we have started
the other roles, they'll be running two containers.

This could be a problem, especially if they run jobs, as that doubles
the worker capacity and could cause excessive load.

We'll wait for the first primary role container to boot successfully
before starting the other containers from other roles.
Using a different SSHKit runner doesn't work well, because the group
runner uses the Parallel runner internally. So instead we'll patch its
behaviour to fail slow.

We'll also get it to return all the errors so we can report on all the
hosts that failed.
@djmb djmb merged commit 2c2d94c into main May 21, 2024
8 checks passed
@djmb djmb deleted the remove-healthcheck-step branch May 21, 2024 11:00