
Remove the healthcheck step #740

Merged
merged 9 commits into from
May 21, 2024
Conversation

djmb
Collaborator

@djmb djmb commented Mar 25, 2024

To speed up deployments, we'll remove the healthcheck step.

This adds some risk to deployments for non-web roles - if they don't have a Docker healthcheck configured, the only check we do is whether the container is running.

If there is a bad image, we might see the container running before it exits and deploy it. Previously the healthcheck step would have avoided this by ensuring a web container could boot and serve traffic first.

To mitigate this, we'll add a web barrier. Non-web containers will wait until at least one web container has passed its healthcheck before shutting down the old containers.

If the web container fails its healthcheck, we'll close the barrier and shut down the new containers on the non-web roles.

We also have a new integration test to check we correctly handle a broken image. This highlighted that SSHKit's default runner will stop at the first error it encounters. We'll now have a custom runner that waits for all threads to finish, allowing them to clean up.

That means that if a deployment completes on some hosts but not others, we can run kamal app version --quiet to see which version is running on each host.
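The web barrier described above can be sketched with a mutex and condition variable. This is a minimal illustration under an assumed class name (DeployBarrier), not Kamal's actual implementation:

```ruby
# Sketch of the web barrier idea (hypothetical class name; Kamal's real
# implementation differs). A web role opens the barrier once a container
# passes its healthcheck, or closes it on failure; non-web roles block
# on it before shutting down their old containers.
class DeployBarrier
  def initialize
    @mutex = Mutex.new
    @condition = ConditionVariable.new
    @state = nil # becomes :opened or :closed; the first signal wins
  end

  # Called by the web role after a successful healthcheck.
  def open
    signal(:opened)
  end

  # Called by the web role when the healthcheck fails.
  def close
    signal(:closed)
  end

  # Non-web roles wait here; true means it is safe to proceed.
  def wait
    @mutex.synchronize do
      @condition.wait(@mutex) while @state.nil?
      @state == :opened
    end
  end

  private

  def signal(state)
    @mutex.synchronize do
      @state ||= state
      @condition.broadcast
    end
  end
end
```

A non-web role would call wait before stopping its old containers, and roll back the new ones if it returns false.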

@djmb djmb force-pushed the remove-healthcheck-step branch 2 times, most recently from 117d4aa to 6d57012 Compare April 3, 2024 08:21
@djmb djmb force-pushed the remove-healthcheck-step branch 2 times, most recently from b8b38c6 to f1ea2c7 Compare April 3, 2024 09:53
Comment on lines 68 to 79
if barrier
  if barrier_role?
    barrier.open
  else
    wait_for_barrier
  end
end
Contributor

Early return suggestion:

Suggested change
    if barrier
      if barrier_role?
        barrier.open
      else
        wait_for_barrier
      end
    end
becomes:
    return unless barrier
    if barrier_role?
      barrier.open
    else
      wait_for_barrier
    end

@@ -282,4 +285,8 @@ def current_running_version(host: KAMAL.primary_host)
def version_or_latest
  options[:version] || KAMAL.config.latest_tag
end

def web_and_non_web_roles?
Contributor

This became unused after recent changes.

@djmb djmb force-pushed the remove-healthcheck-step branch 2 times, most recently from 5890b14 to 494c205 Compare April 3, 2024 15:33
@djmb
Collaborator Author

djmb commented Apr 4, 2024

I'm hoping to get capistrano/sshkit#535 added to SSHKit, so we don't need SSHKit::Runner::ParallelCompleteAll. It works as the default runner, but a rolling deploy uses the group runner, which internally uses SSHKit::Runner::Parallel.
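The fail-slow behaviour can be illustrated with a small runner that waits for every host's thread before raising. This is a simplified sketch under assumed names, not SSHKit's or Kamal's actual API:

```ruby
# Sketch of a "fail slow" parallel runner in the spirit of the
# ParallelCompleteAll runner mentioned above (class and method names are
# assumptions). Instead of aborting on the first failing host, it waits
# for every thread to finish, then reports all the errors at once so
# each host gets a chance to clean up.
class FailSlowRunner
  def initialize(hosts, &block)
    @hosts = hosts
    @block = block
  end

  def execute
    errors = @hosts.map { |host|
      Thread.new do
        @block.call(host)
        nil
      rescue StandardError => e
        [host, e] # capture rather than re-raise, so other hosts continue
      end
    }.map(&:value).compact # Thread#value joins each thread

    return if errors.empty?

    messages = errors.map { |host, e| "#{host}: #{e.message}" }
    raise "Failed on #{errors.size} host(s): #{messages.join(', ')}"
  end
end
```

The key difference from a fail-fast runner is that the join (`map(&:value)`) happens before any error is raised, so a broken image on one host can't strand half-finished work on the others.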

@argia-andreas

@djmb
Hi!
If I understand correctly, the initial healthcheck step will be removed by this.
(But the container/Docker healthchecks will remain - the /up endpoint, or a cmd for non-web roles.)

At the moment, we are using the fact that the healthcheck step runs before everything else to run our migrations, which works nicely as it ensures they run successfully before the new web containers start serving traffic.

Any recommendations for achieving a similar outcome - making sure commands run successfully on one container before new containers are booted and traffic is served - once the healthcheck step is removed?

Thanks.

@argia-andreas

argia-andreas commented Apr 6, 2024

Also, would this PR resolve the issue with not being able to deploy to a specific role on a host that doesn't have the primary role?

As mentioned in this issue: #709

For additional context:
This seems to be due to the initial container healthcheck step, which is performed with the primary role.

kamal deploy -r secondary-role

Ensure app can pass healthcheck...
  INFO [703ee48c] Running docker run --detach --name healthcheck-primary-role --publish 3999:80 --label service=healthcheck-primary-role -e KAMAL_CONTAINER_NAME="healthcheck-primary-role" --env-file .kamal/env/roles/primary-role.env --health-cmd "curl -f http://localhost/up || exit 1"
  INFO [a8ccc7f3] Finished in 0.075 seconds with exit status 123 (failed).
Releasing the deploy lock...
  Finished all in 22.6 seconds
  ERROR (SSHKit::Command::Failed): Exception while executing on host 192.168.0.11: docker exit status: 125
docker stdout: Nothing written
docker stderr: docker: open .kamal/env/roles/primary-role.env: no such file or directory.

@djmb
Collaborator Author

djmb commented Apr 22, 2024

@argia-andreas - you should be able to run migrations before anything is booted with a pre-deploy hook.

You'd want to run something like:

kamal app exec -p -q -d $KAMAL_DESTINATION --version $KAMAL_VERSION "rails db:migrate"

Regarding #709 - this is intended to fix that though I'll need to double check that it does!
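Put together, such a pre-deploy hook might be sketched as follows. The migrate_command helper is hypothetical; KAMAL_DESTINATION and KAMAL_VERSION are the environment variables from the command above, which Kamal sets when invoking hooks:

```ruby
#!/usr/bin/env ruby
# Sketch of a .kamal/hooks/pre-deploy hook that runs migrations on one
# container before the deploy proceeds. migrate_command is a hypothetical
# helper; KAMAL_DESTINATION and KAMAL_VERSION come from Kamal.

def migrate_command(destination: ENV["KAMAL_DESTINATION"], version: ENV["KAMAL_VERSION"])
  cmd = ["kamal", "app", "exec", "-p", "-q"]
  cmd += ["-d", destination] if destination && !destination.empty?
  cmd += ["--version", version] if version
  cmd << "rails db:migrate"
end

# Only invoke kamal when Kamal itself is calling the hook (it sets
# KAMAL_VERSION); this keeps the file loadable on its own.
if ENV["KAMAL_VERSION"]
  system(*migrate_command) || abort("Migrations failed - aborting deploy")
end
```

A non-zero exit from a hook aborts the deploy, so a failed migration stops the new containers from being booted.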

djmb added 6 commits May 20, 2024 12:18
To speed up deployments, we'll remove the healthcheck step.

This adds some risk to deployments for non-web roles - if they don't
have a Docker healthcheck configured, the only check we do is whether
the container is running.

If there is a bad image we might see the container running before it
exits and deploy it. Previously the healthcheck step would have avoided
this by ensuring a web container could boot and serve traffic first.

To mitigate this, we'll add a deployment barrier. Until one of the
primary role containers passes its healthcheck, we'll keep the barrier
up and avoid stopping the containers on the non-primary roles.

If the primary role container fails its healthcheck, we'll close the
barrier and shut down the new containers on the waiting roles.

We also have a new integration test to check we correctly handle a
broken image. This highlighted that SSHKit's default runner will
stop at the first error it encounters. We'll now have a custom runner
that waits for all threads to finish, allowing them to clean up.
@djmb djmb force-pushed the remove-healthcheck-step branch from 39d0ab5 to ee758d9 Compare May 20, 2024 11:18
djmb added 3 commits May 21, 2024 08:37
If a primary role container is unhealthy, we might take a while to
time out the health check poller. In the meantime, if we have started
the other roles, they'll be running two containers.

This could be a problem, especially if they run jobs, as that doubles
the worker capacity and could cause excessive load.

We'll wait for the first primary role container to boot successfully
before starting the other containers from other roles.
Using a different SSHKit runner doesn't work well, because the group
runner uses the Parallel runner internally. So instead we'll patch its
behaviour to fail slow.

We'll also get it to return all the errors so we can report on all the
hosts that failed.
@djmb djmb merged commit 2c2d94c into main May 21, 2024
8 checks passed
@djmb djmb deleted the remove-healthcheck-step branch May 21, 2024 11:00