kubernetes_health_check

Health check Plug with Kubernetes semantics.

Kubernetes has well-defined semantics for how health checks should behave, distinguishing between startup, liveness, and readiness:

Liveness is the core health check. It determines whether the app is alive and able to respond to requests. It should be relatively fast, as it is called frequently, but should include checks for dependencies, e.g. whether the app can connect to a database or back end service. If the liveness check fails for a specified period, Kubernetes kills and replaces the instance.

Startup checks whether the app has finished booting up. It is useful when the app may take significant time to start, e.g. because it is loading data from a cache. Separating this from liveness allows us to use different timeouts, rather than making the liveness timeout long enough to support startup. Once startup has completed successfully, Kubernetes does not call it again; it uses the liveness check instead.

Readiness checks whether the app should receive requests. Kubernetes uses it to decide whether to route traffic to the instance. If the readiness probe fails, Kubernetes doesn't kill and restart the container. Instead, it marks the pod as "unready" and stops sending traffic to it, e.g. in the ingress. It is useful to temporarily stop serving traffic, e.g. when the instance is overloaded or it has transient problems connecting to a back end service.

See this blog post for more background: https://www.cogini.com/blog/kubernetes-health-checks-for-elixir-apps/

The following is an example Kubernetes deployment YAML configuration:

  startupProbe:
    httpGet:
      path: /healthz/startup
      port: http
    periodSeconds: 3
    failureThreshold: 5

  livenessProbe:
    httpGet:
      path: /healthz/liveness
      port: http
    periodSeconds: 10
    failureThreshold: 6

  readinessProbe:
    httpGet:
      path: /healthz/readiness
      port: http
    periodSeconds: 10
    failureThreshold: 1
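
As a rough guide, a probe tolerates failure for about `periodSeconds` times `failureThreshold` seconds before Kubernetes acts on it (ignoring `initialDelaySeconds` and per-request timeouts). The following sketch works through that arithmetic for the values above. The map keys are illustrative names only, not part of any API.

```elixir
# Probe settings copied from the YAML above; key names are illustrative only.
probes = [
  startup: %{period_seconds: 3, failure_threshold: 5},
  liveness: %{period_seconds: 10, failure_threshold: 6},
  readiness: %{period_seconds: 10, failure_threshold: 1}
]

# Approximate seconds of consecutive failures before Kubernetes acts.
windows =
  Map.new(probes, fn {name, %{period_seconds: period, failure_threshold: threshold}} ->
    {name, period * threshold}
  end)

IO.inspect(windows)
```

With these settings the app has roughly 15 seconds to start, is restarted after about 60 seconds of failed liveness checks, and is taken out of rotation after a single failed readiness check. An app with a slow boot would typically raise the startup `failureThreshold` rather than loosen the liveness settings.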

Installation

Add the package to your list of dependencies in mix.exs:

def deps do
  [
    {:kubernetes_health_check, "~> 0.7.0"}
  ]
end

Usage

Add KubernetesHealthCheck.Plug to your endpoint or router. Place it at the very top to avoid noise in your logs from health checks.

plug KubernetesHealthCheck.Plug,
  mod: Foo.Health,
  base_path: "/healthz"

Options:

  • :mod - Callback module which implements the health checks for the app, default KubernetesHealthCheck
  • :base_path - Base request_path for health checks, default "/healthz"
  • :startup_path - Path for startup check, default "<base_path>/startup"
  • :liveness_path - Path for liveness check, default "<base_path>/liveness"
  • :readiness_path - Path for readiness check, default "<base_path>/readiness"
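
As an illustration of the path options, the sketch below mounts the plug in a plain `Plug.Builder` pipeline with a custom base path and an explicit liveness path. The module name `Example.HealthPipeline` and the path values are hypothetical, and the sketch assumes that an explicitly set path takes precedence over the default derived from `:base_path`.

```elixir
defmodule Example.HealthPipeline do
  # Hypothetical pipeline module; in a Phoenix app the plug call would
  # normally sit at the top of the endpoint instead.
  use Plug.Builder

  plug KubernetesHealthCheck.Plug,
    mod: Example.Health,
    # Startup and readiness fall back to "/health/startup" and "/health/readiness".
    base_path: "/health",
    # Assumed to override the default "/health/liveness".
    liveness_path: "/health/live"
end
```

Whatever paths are chosen, the `path` values in the Kubernetes probe configuration need to match them.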

Add a module which provides the app-specific health checks. The following is an example:

defmodule Example.Health do
  @moduledoc """
  Collect app status for Kubernetes health checks.
  """
  alias Example.Repo

  @app :example
  @repos Application.compile_env(@app, :ecto_repos) || []

  @doc """
  Check if the app has finished booting up.

  This returns app status for the Kubernetes `startupProbe`.
  Kubernetes checks this probe repeatedly until it returns a successful
  response. After that Kubernetes switches to executing the other two probes.
  If the app fails to successfully start before the `failureThreshold` time is
  reached, Kubernetes kills the container and restarts it.

  For example, this check might return OK when the app has started the
  web-server, connected to a DB, connected to external services, and performed
  initial setup tasks such as loading a large cache.
  """
  @spec startup ::
          :ok
          | {:error, {status_code :: non_neg_integer(), reason :: binary()}}
          | {:error, reason :: binary()}
  def startup do
    # Return error if there are available migrations which have not been executed.
    # This supports deployment to AWS ECS using the following strategy:
    # https://engineering.instawork.com/elegant-database-migrations-on-ecs-74f3487da99f
    #
    # By default Elixir migrations lock the database migration table, so they
    # will only run from a single instance.
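    # Ecto.Migrator.migrations/1 returns every migration with its status,
    # so keep only the ones which have not been run yet (status :down).
    pending =
      @repos
      |> Enum.flat_map(&Ecto.Migrator.migrations/1)
      |> Enum.filter(fn {status, _version, _name} -> status == :down end)

    if Enum.empty?(pending) do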
      liveness()
    else
      {:error, "Database not migrated"}
    end
  end

  @doc """
  Check if the app is alive and working properly.

  This returns app status for the Kubernetes `livenessProbe`.
  Kubernetes continuously checks if the app is alive and working as expected.
  If it crashes or becomes unresponsive for a specified period of time,
  Kubernetes kills and replaces the container.

  This check should be lightweight, only determining if the server is
  responding to requests and can connect to the DB.
  """
  @spec liveness ::
          :ok
          | {:error, {status_code :: non_neg_integer(), reason :: binary()}}
          | {:error, reason :: binary()}
  def liveness do
    case Ecto.Adapters.SQL.query(Repo, "SELECT 1") do
      {:ok, %{num_rows: 1, rows: [[1]]}} ->
        :ok

      {:error, reason} ->
        {:error, inspect(reason)}
    end
  rescue
    e ->
      {:error, inspect(e)}
  end

  @doc """
  Check if app should be serving public traffic.

  This returns app status for the Kubernetes `readinessProbe`.
  Kubernetes continuously checks if the app should serve traffic. If the
  readiness probe fails, Kubernetes doesn't kill and restart the container.
  Instead, it marks the pod as "unready" and stops sending traffic to it, e.g.
  in the ingress.

  This is useful to temporarily stop serving requests. For example, if the app
  gets a timeout connecting to a back end service, it might return an error for
  the readiness probe. After multiple failed attempts, it would switch to
  returning an error for the `livenessProbe`, triggering a restart.

  Similarly, the app might return an error if it is overloaded, shedding
  traffic until it has caught up.
  """
  @spec readiness ::
          :ok
          | {:error, {status_code :: non_neg_integer(), reason :: binary()}}
          | {:error, reason :: binary()}
  def readiness do
    liveness()
  end

  @spec basic ::
          :ok
  # | {:error, {status_code :: non_neg_integer(), reason :: binary()}}
  # | {:error, reason :: binary()}
  def basic do
    :ok
  end
end
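
In the example above, `readiness/0` simply delegates to `liveness/0`. To illustrate the load-shedding case described in its docs, here is a minimal sketch of a helper that reports an error when a worker process has too many queued messages. The module name `Example.ReadinessSketch`, the function `check_worker/2`, the default threshold, and the 503 status code are all assumptions made for illustration; they are not part of the library.

```elixir
defmodule Example.ReadinessSketch do
  @moduledoc """
  Hypothetical readiness helper which sheds traffic when a worker falls behind.
  """

  # Assumed default threshold; tune it for the app.
  @max_queue_len 1_000

  @spec check_worker(pid() | atom(), non_neg_integer()) ::
          :ok | {:error, {status_code :: non_neg_integer(), reason :: binary()}}
  def check_worker(worker, max_queue_len \\ @max_queue_len) do
    case queue_len(worker) do
      nil ->
        {:error, {503, "worker not running"}}

      len when len > max_queue_len ->
        {:error, {503, "worker overloaded: #{len} messages queued"}}

      _len ->
        :ok
    end
  end

  # Resolve a registered name to a pid, returning nil if it is not registered.
  defp queue_len(name) when is_atom(name) do
    case Process.whereis(name) do
      nil -> nil
      pid -> queue_len(pid)
    end
  end

  # Process.info/2 returns nil when the process is no longer alive.
  defp queue_len(pid) when is_pid(pid) do
    case Process.info(pid, :message_queue_len) do
      {:message_queue_len, len} -> len
      nil -> nil
    end
  end
end
```

A `readiness/0` callback could then combine this with the database check, for example by calling `check_worker(Example.Worker)` (a hypothetical registered process) before falling through to `liveness()`.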

Docs can be found at https://hexdocs.pm/kubernetes_health_check.
