Improve the liveness probe for the Helm Chart's Worker StatefulSet. #2753
Comments
We filed this a while ago, with one specific example.
Lifecycle related: #2757. Should this be changed? A worker coming back up after not having cleaned itself up yields this error: What about btrfs errors? Disk out of space errors? We really need to disambiguate any old error from ones that require a worker restart to fix.
@ralekseenkov Thanks for linking that! Definitely sums up the exact case we figured would cause the liveness probe to retire the workers.
I'm thinking we'll likely want a health check of some sort on the worker itself rather than relying on logs. We don't have much control over Garden's logs, so differentiating the root cause behind a failure to create purely by reading logs would be a pain.
@cirocosta and I are spiking on a worker component to ping Garden (via the /ping endpoint) and Baggageclaim (no /ping, but we'll figure something out). Will add more details on what we discover.
Adds a server process to the worker command so that there is an endpoint that can be reached to health-check the worker. Internally, this endpoint makes a request to both `garden` and `baggageclaim`, ensuring that both of them are up. It also includes a k8s deployment script for testing.

#2753

Signed-off-by: Topher Bullock <cbullock@pivotal.io>
Hey @topherbullock, I just submitted to That allows us to reference Do you think that is enough to consider this issue not blocked? Thx!
Hey,

Last Friday I started working on making those health checks a bit more in-depth, creating a minimal load on the worker to inspect whether it's still capable of serving its purpose (creating containers and volumes): https://github.com/cirocosta/concourse-worker-health-checker.

Here's a quick walk through the code:

```go
aggregate := &checkers.Aggregate{
	Checkers: []checkers.Checker{
		&checkers.Baggageclaim{Address: *baggageclaimUrl},
		&checkers.Garden{Address: *gardenUrl},
	},
}
```

where

```go
type Checker interface {
	// Check performs a health check with its execution time
	// limited by a context that might be canceled at any
	// time.
	Check(ctx context.Context) (err error)
}
```

So that a Baggageclaim checker looks like:

```go
func (b *Baggageclaim) Check(ctx context.Context) (err error) {
	handle := mustCreatedHandle()

	// Create a throwaway volume to exercise baggageclaim.
	err = b.createVolume(ctx, handle)
	if err != nil {
		err = errors.Wrapf(err,
			"failed to create volume %s", handle)
		return
	}

	// Clean up so the check leaves nothing behind.
	err = b.destroyVolume(ctx, handle)
	if err != nil {
		err = errors.Wrapf(err,
			"failed to delete volume %s", handle)
		return
	}

	return
}
```

and

```go
func (h *Aggregate) Check(ctx context.Context) (err error) {
	var group *errgroup.Group

	// ctx is canceled as soon as any checker returns an error.
	group, ctx = errgroup.WithContext(ctx)

	for _, checker := range h.Checkers {
		checker := checker // goroutine closure
		group.Go(func() error {
			return checker.Check(ctx)
		})
	}

	// Wait returns the first non-nil error, if any.
	err = group.Wait()
	return
}
```

This way we can, in the future, extend this with either even more in-depth checkers or some others that could look into things like

I started that repo as something separate that gets deployed as a sidecar to the worker as a way of detecting any problems that might arise with it, but the intention is to get it upstream with the code being executed on every

Wdyt?

Thx!
As a way of improving the first iteration we did in terms of making the worker more "healthcheck-able", this commit goes further than performing a request to `garden.Ping` and `bc.ListVolumes` and performs what would be the minimum workload that those components should be able to handle:
- creating an empty volume; and
- creating a container.

By providing an endpoint for the healthchecking to occur we can allow BOSH, k8s, plain Docker, and so on to perform the checks and determine whether the worker is in good shape or not. By providing a minimal interface we should *in theory* be able to improve the health checks even more in the future.

concourse#2753

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
Hey, just a quick update: #3025 adds the functionality I described above. Thanks!
By default the workers will retire (and then be pruned) if the following are logged by the worker:
This means that any failure to create a container will retire the worker. Creating a container can fail for any old reason, even a poorly written task, so this is less than ideal.