
Investigation: non-BOSH worker operation lifecycle #1457

Closed
topherbullock opened this issue Aug 9, 2017 · 37 comments

Comments
@topherbullock
Member

Challenge

Operators of non-BOSH Concourse clusters (BOSH deployments have drain scripts to handle the worker lifecycle) often run into issues when operating a cluster of workers (creating, restarting, scaling, etc.).

A Modest Proposal

Look into the following deployment setups and document the pain points around operating (creating, recreating, scaling) worker clusters:

  • Binaries on VMs
  • docker-compose'd
  • Kubernetes pods (via Helm, or manually deployed)

If there is a sane workflow that involves the tools currently available (concourse land-worker, concourse retire-worker, fly prune-worker), we should improve the documentation around how these commands should be used to operate a cluster whose workers undergo regular recreation, scaling, etc.

If there are gaps in operating these scenarios (workers can't be recreated easily using the available tools, errors happen due to data persistence on 'normal' deployments of K8s, docker-compose, or binaries), we should strategize and discuss how to tackle them.
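
For reference, a rough sketch of how those tools are invoked today (a sketch only; double-check flag names against concourse --help and fly --help, and note that TSA connection flags and the fly target are placeholders/omitted for brevity):

# Temporarily take a worker out of rotation; it should come back with the same state:
concourse land-worker --name="$(hostname)"

# Permanently decommission a worker whose containers/volumes will not come back:
concourse retire-worker --name="$(hostname)"

# From an operator machine, remove a worker that is already gone or stalled:
fly -t ci prune-worker --worker "$(hostname)"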

@JohannesRudolph
Contributor

Please also see my comment here for a reproduction/scenario where the current worker lifecycle is a problem:
#844 (comment)

Additionally, I have noticed that fly prune-worker does not provide immediate relief when killing off old workers that won't come back. Sometimes our two-node cluster takes more than an hour after all old workers were pruned until all build pipelines show no more errors, e.g. with check containers.

I'm curious about concourse retire-worker; I'd never heard of it. Is this something the concourse binary could trigger on receiving a signal from docker stop or docker kill?

@william-tran
Contributor

@topherbullock I've resigned myself to having workers register with unique names per invocation: helm/charts@master...autonomic-ai:concourse-worker-lifecycle-management, but this can result in stalled workers if processes don't shut down gracefully (e.g. exec into the container and kill -9), and containers on stalled workers aren't rescheduled onto running workers. In a test I've needed to prune the stalled workers before the new worker would start getting containers. Are there any plans to auto-remove workers that have been stalled for a period of time?

@william-tran
Contributor

Fixed the issue above with stalled workers by keeping the worker names consistent across invocations. A liveness probe that watches the logs will restart a stuck worker if it detects an FS panic or unknown volume, and a preStop hook takes care of retirement. Let's hope this suffices.
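
Roughly, the log-watching part of that liveness probe amounts to something like this (a sketch, not the exact probe from the chart; the patterns and the log file path are assumptions on my part, with the worker's output teed to that file by the container entrypoint):

#!/bin/sh
# Fail the probe (non-zero exit) when the worker log shows a condition we treat
# as fatal, so Kubernetes restarts the container and the preStop hook retires it.
if grep -qE 'panic|unknown volume' /concourse-work-dir/.liveness_probe; then
  exit 1
fi
exit 0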

@topherbullock
Member Author

@william-tran some clarifications on worker names, and state as it exists now:

If the hosts are the same - in that the persistent disk (or whatever storage volume they're using) for Garden and BaggageClaim is the same disk, with existing containers and volumes hanging around - then the workers should register with the same name. These workers should ideally be brought down using the concourse land-worker command when an operator knows they will be offline temporarily but will come back up with the same persistent storage.

If the hosts can be considered unique, and there is no persistent data left around on them, then registering with a unique name is the way to go. These workers should ideally be brought down using the concourse retire-worker command when an operator knows they will be retired permanently and will not come back up with the same persistent storage.

There are definitely still some issues around STALLED workers keeping their containers and those containers not being rebalanced across the whole cluster. There's an existing issue about the error messages resulting from this, but we should also tackle the other issue you've mentioned of rebalancing after some time.

@william-tran
Contributor

william-tran commented Sep 18, 2017

@topherbullock It seems that when concourse worker starts on k8s, the contents of the /concourse-work-dir folders get wiped out, e.g. depot and volumes/live. This is the case even when using Persistent Volumes. I've confirmed this by overriding the entrypoint and adding a sleep before concourse worker, then kill -9ing the worker process to trigger a container restart. The contents of /concourse-work-dir survive the container restart, but then get wiped out when concourse starts. I see in concourse worker --help

--garden-destroy-containers-on-startup             Clean up all the existing containers on startup. [$CONCOURSE_GARDEN_DESTROY_CONTAINERS_ON_STARTUP]

This is unset, what's the default behaviour?

The only reason I'm using consistent names across restarts is so I can automatically clean up and sync worker state to 0 through error detection and calling retire-worker, and avoid having STALLED workers lying around. Even if we get the container volumes to persist on restart, I think I'll keep the worker lifecycle behaviour on k8s the way it is as it seems like the safest approach. Here's the PR for it: helm/charts#2109

@william-tran
Contributor

On the above behaviour, I think #1255 needs to be re-opened.

@topherbullock
Member Author

#1458 captures fixing the root cause for those errors. That one was closed off in favor of a larger-scoped fix for how we react to resources existing on workers which are in a stalled state.

@topherbullock
Member Author

topherbullock commented Sep 19, 2017

This is unset, what's the default behaviour?

Default is false, and in that case Garden will try and restore from the depot dir.
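
So an operator who does want a clean slate on every start would have to opt in explicitly, e.g. (a minimal sketch; all other worker flags and the TSA configuration are omitted):

# Opt in to wiping existing containers on startup; left unset (false), Garden
# restores containers from the depot dir instead.
CONCOURSE_GARDEN_DESTROY_CONTAINERS_ON_STARTUP=true concourse worker --name="$(hostname)"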

@WillemMali

I'm currently integrating Concourse with my k8s/helm setup and I'm planning on upstreaming it to the main Helm chart repo; if anyone wants it, I can upload what I have right now.

@william-tran
Contributor

@WillemMali have you seen what's there right now? https://github.com/kubernetes/charts/tree/master/stable/concourse and my PR to it for improving worker lifecycle management helm/charts#2109

@WillemMali

WillemMali commented Sep 22, 2017 via email

@william-tran
Contributor

@WillemMali best to move your reasons for changes to the chart to an issue against https://github.com/kubernetes/charts

@WillemMali

I was a bit rude there, sorry. Thank you for your suggestion.

@jama22 added this to Backlog in Runtime Nov 20, 2017
@vito removed this from Backlog in Operations Nov 22, 2017
@topherbullock
Member Author

@pivotal-jwinters and I had some thoughts about this while working on #1027 and discussing having the workers' BaggageClaim initialize with a certs volume:

We could have the workers advertise their existing handles for containers and volumes when registering, as well as a unique identifier (generated on process start, maybe? persisted somewhere tied to worker state?) which would allow the ATC to more reliably determine whether it can be considered the "same" worker or is effectively a new worker.

Advertising the volumes also enables the ATC to cross-reference the state on the worker with its assumed state, and determine whether the ATC has volumes in the DB which aren't in the worker's BaggageClaim, or vice versa (tl;dr no more "unknown handle" errors).

@EugenMayer

@topherbullock I'm trying to organize my thoughts; maybe this is of some use to you:

  1. In every orchestration tool I know, hostname resets go hand in hand with anonymous-volume resets. That means if we want to make the worker state persistent (a named volume), which would be required for some of your thoughts, we cannot rely on the hostname as the identifier - we could end up with the same state (named volume) but a new hostname, and the two would get mixed up.

  2. I think "advertising" might not be needed if we can just keep the state consistent? If a worker shuts down, and the actual docker engine and volumes / overlayfs are all part of /worker-state, it should come back up exactly the way it was shut down. So just ensuring we do not lose any state that BaggageClaim or the ATC expects would make it consistent out of the box, and it ensures we keep as much as possible and restarts become just restarts.

  3. I think checking whether a volume is referenced in the DB but not present on the worker, or vice versa, is actually a huge plus for advertising, since a lot of people have mentioned state "bleeding" over time. So even if 2 ensures "maximum state persistence", advertising could help clean things up a lot better.

Right now, even if I pin the hostname to a static string and then use a named volume for state, it will not work, because the state is not recovered properly (even with a proper worker land before the restart). Can we find a reason for that?
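
For context, the setup I'm describing is roughly this (a sketch only; the image, volume name, hostname, work dir and all TSA/key configuration are placeholders):

# Named volume so the worker state survives container recreation:
docker volume create concourse-worker-state

# Pin the hostname so the worker re-registers under the same name, and mount
# the named volume at the configured work dir:
docker run --detach --privileged \
  --hostname worker-1 \
  --volume concourse-worker-state:/worker-state \
  --env CONCOURSE_WORK_DIR=/worker-state \
  concourse/concourse worker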

@topherbullock
Member Author

Yeah, @EugenMayer I'm kind of just brain-dumping on this thread to try and collect thoughts.
Regarding:

  1. Moving to a unique identifier other than the worker name would move us away from relying on the hostname as the only way to identify the workers.
  2 & 3. The idea behind advertising is to ensure that the desired state in the DB matches the state the worker currently has in its Garden + BaggageClaim. This means we can account for whatever changes have happened (if any) to the state volumes / directory, and it helps us stay agnostic of how those are actually managed.

@EugenMayer

Advertising sounds great, but maybe rather hard to really get right. It sounds like a far-off goal to me - is there any way we could do this in 2 milestones: first with unique IDs + a storage check, second with advertising to sanitize further and validate/sync?

@william-tran
Contributor

william-tran commented Dec 16, 2017

@topherbullock @EugenMayer and everyone else, I really appreciate the attention you're giving to this issue.

@EugenMayer I ran into the same issue with land vs. retire when running workers in k8s with the Helm chart; I found retire to be the better option and updated the chart to manage the worker lifecycle as a result. See https://github.com/kubernetes/charts/pull/2109/files#diff-33a16a0789cde854c77e65e7774aa2beL35 for the related changes. We're trying to solve the same problem, so I want to give you a heads-up on my approach.

@topherbullock I am running into more issues, though, around detecting fatal errors that should trigger a restart. I had unknown handle as a fatal error, but once I moved to multiple concourse-web pods, I would get lots of these. Looking at the worker logs, I saw a request to delete a volume succeed, but then another deletion request would come in (I'm guessing from a different ATC) and log unknown handle, when really the worker could continue taking on work. I've removed unknown handle from the list of fatal errors, and am monitoring closely to see if I can catch the truly unrecoverable conditions.

In all this, I've seen an opportunity to better clean up before starting the worker under the same name, something like:

# Wipe any leftover local state before re-registering under the same name.
rm -rf /concourse-work-dir/*
# Retire the previous incarnation of this worker, retrying until the TSA
# reports that no worker with this name is registered anymore.
while ! concourse retire-worker --name=${HOSTNAME} | grep -q worker-not-found; do
  sleep 5
done
# Start the worker, teeing output to a file the liveness probe can watch.
concourse worker --name=${HOSTNAME} | tee -a /concourse-work-dir/.liveness_probe

I'm relying on this log to determine whether the namesake worker is still registered or not:
https://github.com/concourse/tsa/blob/d563b2ac75c8e47e2342baaac205370765f8339b/retirer.go#L53

Sure it's hacky, but what can I do? Do you see any issues arising from calling retire-worker multiple times?

@william-tran
Contributor

Super Simple Solution: expose delete-worker as a subcommand of the concourse binary. Then I can run it before I start the worker to ensure state on both sides is clean. I've tried curl -X DELETE https://my-concourse.com/api/v1/workers/my-worker-name and it works wonderfully, except I can't actually do that from the worker.
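
In case anyone copies this: against an ATC with auth enabled the call also needs a bearer token, something like the following (ATC_TOKEN, the URL and the worker name are placeholders; I pull a token out of ~/.flyrc after a fly login):

# ATC_TOKEN is assumed to hold a valid bearer token for this Concourse deployment.
curl -X DELETE \
  -H "Authorization: Bearer ${ATC_TOKEN}" \
  https://my-concourse.com/api/v1/workers/my-worker-name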

@emcniece

🔺 I'd like to document a red herring that just cost me 2 days - this issue is where all roads lead for the file not found error; please let me know if I should post this elsewhere.

- get: my-docker-image
  params: {skip_download: true}   # <-- THIS IS A PROBLEM

- task: my-task
  image: my-docker-image
  file: tasks/my-task.yml

- put: my-docker-image
  build: docker/my-docker-image

If you flush worker state and rerun a pipeline with a config as defined in this example, the my-task step will fail with the file not found error because no previous builds of this image have been created/cached in Concourse.

If you are trying to get better performance out of building Docker images in a pipeline, use the docker-image resource's cache or load_base params instead.

This is expected behaviour, but wow I spent a lot of time upgrading and debugging workers before ripping my pipelines apart.

Kehrlann added a commit to Kehrlann/concourse-demo that referenced this issue Jan 21, 2018
- Since we run on docker compose, we start and stop the worker very often.
  Apparently this poses a resource caching issue: the worker believes it
  has the image in cache, but doesn't, hence a weird "file not found" error.
- See: concourse/concourse#1457
@william-tran
Contributor

@vito about trying to retire a stalled worker, here's the original issue I opened: #1919

@leshik

leshik commented Mar 22, 2018

I've been hit by this issue today when DigitalOcean restarted the hypervisor with my concourse installation. I'm using docker-compose deployment. After the reboot, the worker became stalled and ATC complained:

{"timestamp":"1521691858.311057329","source":"atc","message":"atc.container-collector.run.orphaned-containers.queue.worker-not-found","log_level":1,"data":{"session":"54.1078.1.21","worker-name":"5f2c668feaa4"}}

So I issued fly prune-worker to clean things up. But unfortunately, after that a new worker didn't come up, and the ATC complained with a failed-to-initialize-new-container / no workers error. I had to issue docker-compose down and then docker-compose up to make it work.

This last part is especially frustrating - how can the need to manually restart the worker's container be avoided?

@vito
Member

vito commented Jul 30, 2018

Quick update: we've merged and finished up vmware-archive/atc#247 which adds support for ephemeral workers. Ephemeral workers will just immediately go away instead of stalling. This will be off by default, but we're willing to change that for some scenarios (perhaps Docker-Compose).

This is just one piece of the puzzle but it's there now!
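
For anyone who wants to try it once it's released, it should just be a matter of setting the new flag on the worker, roughly like this (a sketch; the flag/env name is assumed from that PR, so double-check concourse worker --help, and all other worker configuration is omitted):

# Mark this worker as ephemeral so it is removed immediately instead of
# stalling when it goes away:
CONCOURSE_EPHEMERAL=true concourse worker --name="$(hostname)"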

@ddadlani
Contributor

We have added tests for this in https://github.com/concourse/concourse/blob/master/topgun/k8s/worker_lifecycle_test.go and created specific issues so I'm going to close this issue for now. Feel free to open another issue for any specific problems.
