
Improve the liveness probe for the Helm Chart's Worker StatefulSet. #2753

Closed
topherbullock opened this issue Nov 1, 2018 · 8 comments

@topherbullock (Member) commented Nov 1, 2018

By default the workers will retire (and then be pruned) if the following are logged by the worker:

guardian.api.garden-server.create.failed
baggageclaim.api.volume-server.create-volume-async.failed-to-create

This means that any failure to create a container will retire the worker. Creating a container can fail for any old reason, even a poorly written task, so this is less than ideal.
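In effect, the probe's behavior amounts to a substring match against the worker's log stream. A minimal sketch of that matching logic (the function name and structure here are hypothetical, not the chart's actual implementation):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// retirementMessages are the log lines that, per the issue description,
// currently cause the worker to be retired (and then pruned).
var retirementMessages = []string{
	"guardian.api.garden-server.create.failed",
	"baggageclaim.api.volume-server.create-volume-async.failed-to-create",
}

// shouldRetire reports whether any retirement-triggering message
// appears anywhere in the worker's log output.
func shouldRetire(logs string) bool {
	scanner := bufio.NewScanner(strings.NewReader(logs))
	for scanner.Scan() {
		for _, msg := range retirementMessages {
			if strings.Contains(scanner.Text(), msg) {
				return true
			}
		}
	}
	return false
}

func main() {
	logs := "worker started\nguardian.api.garden-server.create.failed\n"
	fmt.Println(shouldRetire(logs)) // prints "true"
}
```

The problem described above follows directly from this shape: the match fires on any occurrence of the message, with no way to tell a transient, task-induced failure from a genuinely broken worker.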

@ralekseenkov (Contributor) commented Nov 2, 2018

We filed this a while ago, with one specific example: #2656

@william-tran (Contributor) commented Nov 7, 2018

Lifecycle related: #2757

Should this be changed? A worker coming back up after not having cleaned itself up yields this error: guardian.attach.lookup-failed

What about btrfs errors? Disk-out-of-space errors? We really need to disambiguate any old error from ones that require a worker restart to fix.

@topherbullock (Member, Author) commented Nov 7, 2018

@ralekseenkov Thanks for linking that! Definitely sums up the exact case we figured would cause the liveness probe to retire the workers.

@william-tran

We really need to disambiguate any old error from ones that require a worker restart to fix.

I'm thinking we'll likely want a health check of some sort on the worker itself rather than relying on logs. We don't have much control over Garden's logs, so differentiating the root cause behind a failure to create purely by reading them would be a pain.

@topherbullock (Member, Author) commented Nov 12, 2018

@cirocosta and I are spiking on a worker component to ping Garden (via the /ping endpoint) and Baggageclaim (which has no /ping, but we'll figure something out).

Will add more details on what we discover.

cirocosta added a commit that referenced this issue Nov 13, 2018
Adds a server process to the worker command so that
we have an endpoint that can be reached to health-check
the worker.

Internally, the endpoint makes a request to both
`garden` and `baggageclaim`, ensuring that both of
them are up.

#2753
topherbullock pushed a commit that referenced this issue Nov 15, 2018
Adds a server process to the worker command so that
we have an endpoint that can be reached to health-check
the worker.

Internally, the endpoint makes a request to both
`garden` and `baggageclaim`, ensuring that both of
them are up.

It also includes a k8s deployment script for testing.

#2753

Signed-off-by: Topher Bullock <cbullock@pivotal.io>
@jama22 jama22 added the enhancement label Nov 19, 2018
@cirocosta (Member) commented Dec 10, 2018

Hey @topherbullock ,

I just submitted to concourse/charts a branch that, like concourse/dex, keeps a bunch of PRs/branches merged into a single view.

That allows us to reference concourse/charts#merged and move forward with the work.

Do you think that is enough to consider this issue not blocked?

Thx!

@cirocosta (Member) commented Dec 24, 2018

Hey,

Last Friday, I started working on making those health checks a bit more in-depth, creating a minimal load on the worker to inspect whether it's still capable of serving its purpose (creating containers and volumes): https://github.com/cirocosta/concourse-worker-health-checker.

Here's a quick walkthrough of the code:


main.go

	aggregate := &checkers.Aggregate{
		Checkers: []checkers.Checker{
			&checkers.Baggageclaim{Address: *baggageclaimUrl},
			&checkers.Garden{Address: *gardenUrl},
		},
	}

where Checker is an interface:

checkers/checker.go

type Checker interface {
	// Check performs a health check with its execution time
	// limited by a context that might be canceled at any
	// time.
	Check(ctx context.Context) (err error)
}

So that a Baggageclaim checker looks like:

checkers/baggageclaim.go

func (b *Baggageclaim) Check(ctx context.Context) (err error) {
	handle := mustCreatedHandle()

	err = b.createVolume(ctx, handle)
	if err != nil {
		err = errors.Wrapf(err,
			"failed to create volume %s", handle)
		return
	}

	err = b.destroyVolume(ctx, handle)
	if err != nil {
		err = errors.Wrapf(err,
			"failed to delete volume %s", handle)
		return
	}

	return
}

and Aggregate provides a concurrent execution step that can be timed out by a parent context:

checkers/aggregate.go

func (h *Aggregate) Check(ctx context.Context) (err error) {
	var group *errgroup.Group

	group, ctx = errgroup.WithContext(ctx)
	for _, checker := range h.Checkers {
		checker := checker // goroutine closure

		group.Go(func() error {
			return checker.Check(ctx)
		})
	}

	err = group.Wait()
	return
}

This way we can, in the future, extend this with either even more in-depth checkers or others that look into things like btrfs metadata usage or whatever we see fit.


I started that repo as something separate that gets deployed as a sidecar to the worker, as a way of detecting any problems that might arise with it, but the intention is to get it upstream, with the code being executed on every readinessProbe.

Wdyt?

thx!

@cirocosta cirocosta removed the blocked label Dec 24, 2018
cirocosta added a commit to cirocosta/concourse that referenced this issue Jan 1, 2019
As a way of improving the first iteration we did in terms of
making the worker more "healthcheck-able", this commit goes further
than performing a request to `garden.Ping` and `bc.ListVolumes` and
performs what would be the minimum workload that those components
should be able to handle:

- creating an empty volume; and
- creating a container.

By providing an endpoint for the healthchecking to occur we can
allow BOSH, k8s, plain Docker, etc. to perform the checks and
determine whether the worker is in good shape or not.

By providing a minimal interface we should *in theory* be able
to improve the health checks even more in the future.

concourse#2753

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
@cirocosta (Member) commented Jan 14, 2019

Hey,

Just a quick update: #3025 adds the functionality I described above.

Thanks!

@cirocosta (Member) commented Jan 21, 2019

Hey, I'll consider this one closed, as master (see a7e13ce) already has the healthcheck endpoint, which is used by the release-candidate chart.

Work goes on in #3025 👍

@cirocosta cirocosta closed this Jan 21, 2019