Investigation: non-BOSH worker operation lifecycle #1457
Comments
Please also see my comment here for a reproduction/scenario where the current worker lifecycle is a problem. Additionally, I have noticed that …

I'm curious about `concourse retire-worker`, never heard of that. Is this something that the concourse binary could trigger on receiving a signal from …?
@topherbullock I've resigned myself to having workers register with unique names per invocation (helm/charts@master...autonomic-ai:concourse-worker-lifecycle-management), but this can result in stalled workers if processes don't shut down gracefully (e.g. exec into the container and `kill -9`), and containers on stalled workers aren't rescheduled onto running workers. In a test I've needed to prune the stalled workers before the new worker would start getting containers. Are there any plans to auto-remove workers that are stalled for a period of time?
Fixed the issue above with stalled workers by keeping the worker names consistent across invocations. A liveness probe that watches the logs will restart a stuck worker if it detects FS panic or …
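For illustration, a minimal sketch of what such a log-watching liveness probe could look like in a Kubernetes pod spec. The log path matches the `tee` target used later in this thread, but the error patterns and timings are assumptions, not taken from the chart:

```yaml
# Hypothetical probe: fail (and trigger a container restart) once a
# known-fatal string shows up in the worker log captured via `tee`.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # grep -q exits 0 when a match is found; the leading ! inverts that,
      # so the probe fails exactly when a fatal pattern has been logged.
      # 'panic|fatal' is a placeholder pattern, not the chart's actual one.
      - "! grep -qE 'panic|fatal' /concourse-work-dir/.liveness_probe"
  initialDelaySeconds: 30
  periodSeconds: 15
```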
@william-tran some clarifications on worker names, and state as it exists now:

If the hosts are the same - in that the persistent disk (or whatever storage volume they're using) for Garden and BaggageClaim is the same disk with existing containers and volumes hanging around - then the workers should register with the same name. These workers should ideally be brought down using the `land-worker` command.

If the hosts can be considered unique, and there is no persistent data left around on them, then registering with a unique name is the way to go. These workers should ideally be brought down using the `retire-worker` command.

There are definitely some issues around …
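To make the two shutdown paths concrete, a sketch of how either command could be wired into a Kubernetes pod spec. The preStop hook and the assumption that TSA connection settings come from `CONCOURSE_*` env vars are mine, not chart code:

```yaml
# Hypothetical graceful-shutdown hook. Use land-worker for same-named workers
# on persistent disks (state resumes on restart); swap in retire-worker for
# unique-per-invocation workers so the old name is removed for good.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "concourse land-worker --name=${HOSTNAME}"]
```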
@topherbullock It seems that when …

When this is unset, what's the default behaviour? The only reason I'm using consistent names across restarts is so I can automatically clean up and sync worker state to 0 through error detection and calling `concourse retire-worker`.
On the above behaviour, I think #1255 needs to be re-opened.
#1458 captures fixing the root cause for those errors. That one was closed off in favor of a larger-scoped fix for how we react to resources existing on workers which are in a stalled state.
Default is false, and in that case Garden will try to restore from the depot dir.
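If the setting under discussion is Garden's destroy-containers-on-startup behaviour (an assumption based on the depot-dir restore described above), a containerized worker could opt into a clean slate explicitly. The env var name below assumes Concourse's usual `CONCOURSE_GARDEN_*` flag forwarding:

```yaml
# Assumed mapping of Garden's destroy-containers-on-startup flag; with it set,
# Garden discards the depot dir on start instead of restoring old containers.
environment:
  CONCOURSE_GARDEN_DESTROY_CONTAINERS_ON_STARTUP: "true"
```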
I'm currently integrating Concourse with my k8s/helm setup, and I'm planning on upstreaming it to the main Helm chart repo. If anyone wants it, I can upload what I have right now.
@WillemMali have you seen what's there right now? https://github.com/kubernetes/charts/tree/master/stable/concourse and my PR to it for improving worker lifecycle management: helm/charts#2109
I have. It uses StatefulSets, which I'm unfamiliar with and which Helm doesn't support, so I'm making a chart that works with Deployments, …
@WillemMali best to move your reasons for changes to the chart to an issue against https://github.com/kubernetes/charts
I was a bit rude there, sorry. Thank you for your suggestion.
@pivotal-jwinters and I had some thoughts about this while working on #1027 and discussing having the workers' baggageclaim initialize with a certs volume:

We could have the workers advertise their existing handles for containers and volumes when registering, as well as a unique identifier (generated on process start, maybe? persisted somewhere tied to worker state?) which allows the ATC to more reliably determine whether it can be considered the "same" worker or effectively a new worker.

Advertising the volumes also enables the ATC to cross-reference the state on the worker with its assumed state, and determine whether the ATC has volumes in the db which aren't in the worker's baggageclaim, or vice-versa (tl;dr no more "unknown handle" errors).
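Purely as an illustration of the idea (none of these fields exist in the current registration payload), the advertised state might look something like:

```yaml
# Hypothetical registration payload sketch for the "advertise on register" idea:
worker:
  name: worker-0
  instance_id: 3f2c9a1e    # generated on process start, or persisted with worker state
  containers:              # handles Garden already has on disk
    - handle-aaaa
    - handle-bbbb
  volumes:                 # handles BaggageClaim already has on disk
    - vol-cccc
    - vol-dddd
```

The ATC could then diff these lists against its database at registration time and reconcile either side, instead of discovering mismatches later as "unknown handle" errors.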
@topherbullock I'm trying to make up my mind; maybe this is of some use to you:

Right now, even if I pin the hostname to a static string and then use a named volume for state, it will not work, because the state will not be recovered properly (even with a proper worker land prior to the restart). Can we find a reason for that?
Yeah, @EugenMayer, I'm kind of just brain-dumping on this thread to try and collect thoughts.
Advertising sounds great, but maybe rather hard to really get right. Sounds like a far-off goal to me - any way we could do this in two milestones? First with unique IDs + a storage check, second with advertising to sanitize further and validate/sync?
@topherbullock @EugenMayer and everyone else, I really appreciate the attention you're giving to this issue.

@EugenMayer I ran into the same issue with land vs retire worker running in k8s with the helm chart; I found retire to be the better option and updated the chart to manage worker lifecycle as a result. See https://github.com/kubernetes/charts/pull/2109/files#diff-33a16a0789cde854c77e65e7774aa2beL35 for the related changes. We're trying to solve the same problem, so I want to give you a heads up on my approach.

@topherbullock Though I'm running into more issues around detecting fatal errors that should trigger a restart. I had …

In all this, I've seen an opportunity to better clean up before starting the worker under the same name, something like:

```bash
# wipe stale state, retire the old registration, then start a fresh worker
rm -rf /concourse-work-dir/*
while ! concourse retire-worker --name=${HOSTNAME} | grep -q worker-not-found; do
  sleep 5
done
concourse worker --name=${HOSTNAME} | tee -a /concourse-work-dir/.liveness_probe
```

I'm relying on this log to determine whether the namesake worker is still registered or not: … Sure it's hacky, but what can I do? Do you see any issues arising from calling …
Super Simple Solution: expose …
🔺 I'd like to document a red herring that just cost me 2 days - this issue is where all roads lead for the … error:

```yaml
- get: my-docker-image
  params: {skip_download: true} # <-- THIS IS A PROBLEM
- task: my-task
  image: my-docker-image
  file: tasks/my-task.yml
- put: my-docker-image
  params: {build: docker/my-docker-image}
```

If you flush worker state and rerun a pipeline with a config as defined in this example, the task step fails, because `skip_download: true` means the image is never actually fetched onto the fresh worker.

If you are trying to get better performance out of building Docker images in a pipeline, use the `cache` or `load_base` methods instead (see the sketch below). This is expected behaviour, but wow, I spent a lot of time upgrading and debugging workers before ripping my pipelines apart.
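For reference, a sketch of the `cache`-based alternative using the resource names from the example above; `cache` and `cache_tag` are documented params of the docker-image resource, though the exact pipeline layout here is illustrative:

```yaml
- get: my-docker-image           # actually fetch the image; no skip_download
- task: my-task
  image: my-docker-image
  file: tasks/my-task.yml
- put: my-docker-image
  params:
    build: docker/my-docker-image
    cache: true                  # pull the previous image and reuse its layers
    cache_tag: latest
```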
Since we run on Docker Compose, we start and stop the worker very often. Apparently this poses a resource caching issue: the worker believes it has the image in cache, but doesn't, hence a weird "file not found" error. See: concourse/concourse#1457
I've been hit by this issue today when DigitalOcean restarted the hypervisor with my Concourse installation. I'm using …

So I issued … This last thing is especially frustrating: how can the need to restart the worker's container manually be avoided?
Quick update: we've merged and finished up vmware-archive/atc#247, which adds support for ephemeral workers. Ephemeral workers will just immediately go away instead of stalling. This will be off by default, but we're willing to change that for some scenarios (perhaps Docker Compose). This is just one piece of the puzzle, but it's there now!
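For a Docker Compose deployment like the ones discussed above, opting in might look like the following; treating the new support as a `CONCOURSE_EPHEMERAL` variable on the worker command is my assumption of how the flag is surfaced:

```yaml
# Hypothetical compose fragment: an ephemeral worker that deregisters
# immediately on shutdown instead of lingering in the stalled state.
services:
  worker:
    image: concourse/concourse
    command: worker
    privileged: true
    environment:
      CONCOURSE_TSA_HOST: web:2222
      CONCOURSE_EPHEMERAL: "true"
```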
We have added tests for this in https://github.com/concourse/concourse/blob/master/topgun/k8s/worker_lifecycle_test.go and created specific issues, so I'm going to close this issue for now. Feel free to open another issue for any specific problems.
Challenge
Operators of non-BOSH Concourse clusters (BOSH has drain scripts to handle the worker lifecycle) often run into issues when operating a cluster of workers (creating, restarting, scaling, etc.).
A Modest Proposal
Look into the following operation setups and document the pain points around operating (creating, recreating, scaling) worker clusters:

- docker-compose
- K8s
- standalone binaries

If there is a sane workflow that involves the tools available currently (`concourse land-worker`, `concourse retire-worker`, `fly prune-worker`), we should improve the documentation around how these commands should be used to operate a cluster where the workers undergo regular recreation, scaling, etc.

If there are gaps in operating these scenarios (workers can't be recreated easily using the tools available, errors happen due to data persistence on "normal" deployments of K8s, docker-compose, binaries), we should strategize and discuss how to tackle them.