
Investigation: non-BOSH worker operation lifecycle #1457

Closed
topherbullock opened this issue Aug 9, 2017 · 37 comments

Comments
@topherbullock
Member

Challenge

Operators of non-BOSH Concourse clusters (BOSH deployments have drain scripts to handle the worker lifecycle) often run into issues when operating a cluster of workers (creating, restarting, scaling, etc.).

A Modest Proposal

Look into the following deployment setups and document the pain points around operating (creating, recreating, scaling) worker clusters:

  • Binaries on VMs
  • docker-compose'd
  • Kubernetes pods (via Helm, or manually deployed)

If there is a sane workflow that involves the tools currently available (concourse land-worker, concourse retire-worker, fly prune-worker), we should improve the documentation around how these commands should be used to operate a cluster whose workers undergo regular recreation, scaling, etc.

If there are gaps in operating these scenarios (workers can't be recreated easily using the available tools, errors happen due to data persistence on 'normal' deployments of K8s, docker-compose, or binaries), we should strategize and discuss how to tackle them.
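
For reference, a rough sketch of how those tools are invoked today (a sketch only; double-check flag names against concourse --help and fly --help, and note that TSA connection flags and the fly target are placeholders/omitted for brevity):

# Temporarily take a worker out of rotation; it should come back with the same state:
concourse land-worker --name="$(hostname)"

# Permanently decommission a worker whose containers/volumes will not come back:
concourse retire-worker --name="$(hostname)"

# From an operator machine, remove a worker that is already gone or stalled:
fly -t ci prune-worker --worker "$(hostname)"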

@JohannesRudolph
Contributor

Please also see my comment here for a reproduction/scenario where the current worker lifecycle is a problem:
#844 (comment)

Additionally, I have noticed that fly prune-worker does not provide immediate relief when killing off old workers that won't come back. Sometimes our two-node cluster takes more than an hour after all old workers were pruned until all build pipelines show no more errors, e.g. with check containers.

I'm curious about concourse retire-worker; I'd never heard of it. Is this something the concourse binary could trigger on receiving a signal from docker stop or docker kill?

@william-tran
Contributor

@topherbullock I've resigned myself to having workers register with unique names per invocation: helm/charts@master...autonomic-ai:concourse-worker-lifecycle-management, but this can result in stalled workers if processes don't shut down gracefully (e.g. exec into the container and kill -9), and containers on stalled workers aren't rescheduled onto running workers. In a test I've needed to prune the stalled workers before the new worker would start getting containers. Are there any plans to auto-remove workers that have been stalled for a period of time?

@william-tran
Contributor

Fixed the issue above with stalled workers by keeping the worker names consistent across invocations. A liveness probe that watches the logs will restart a stuck worker if it detects an FS panic or unknown volume, and a preStop hook takes care of retirement. Let's hope this suffices.
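
Roughly, the log-watching part of that liveness probe amounts to something like this (a sketch, not the exact probe from the chart; the patterns and the log file path are assumptions on my part, with the worker's output teed to that file by the container entrypoint):

#!/bin/sh
# Fail the probe (non-zero exit) when the worker log shows a condition we treat
# as fatal, so Kubernetes restarts the container and the preStop hook retires it.
if grep -qE 'panic|unknown volume' /concourse-work-dir/.liveness_probe; then
  exit 1
fi
exit 0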

@topherbullock
Member Author

@william-tran some clarifications on worker names, and state as it exists now:

If the hosts are the same - in that the persistent disk (or whatever storage volume they're using) for Garden and BaggageClaim is the same disk, with existing containers and volumes hanging around - then the workers should register with the same name. These workers should ideally be brought down using the concourse land-worker command when an operator knows they will be offline temporarily but will come back up with the same persistent storage.

If the hosts can be considered unique, and there is no persistent data left around on them, then registering with a unique name is the way to go. These workers should ideally be brought down using the concourse retire-worker command when an operator knows they will be retired permanently and will not come back up with the same persistent storage.

There are definitely still some issues around STALLED workers keeping their containers and those containers not being rebalanced across the whole cluster. There's an existing issue about the error messages resulting from this, but we should also tackle the other issue you've mentioned of rebalancing after some time.

@william-tran
Contributor

william-tran commented Sep 18, 2017

@topherbullock It seems that when concourse worker starts on k8s, the contents of the /concourse-work-dir folders get wiped out, e.g. depot and volumes/live. This is the case even when using Persistent Volumes. I've confirmed this by overriding the entrypoint and adding a sleep before concourse worker, then kill -9ing the worker process to trigger a container restart. The contents of /concourse-work-dir survive the container restart, but then get wiped out when concourse starts. I see in concourse worker --help

--garden-destroy-containers-on-startup             Clean up all the existing containers on startup. [$CONCOURSE_GARDEN_DESTROY_CONTAINERS_ON_STARTUP]

This is unset, what's the default behaviour?

The only reason I'm using consistent names across restarts is so I can automatically clean up and sync worker state to 0 through error detection and calling retire-worker, and avoid having STALLED workers lying around. Even if we get the container volumes to persist on restart, I think I'll keep the worker lifecycle behaviour on k8s the way it is as it seems like the safest approach. Here's the PR for it: helm/charts#2109

@william-tran
Contributor

On the above behaviour, I think #1255 needs to be re-opened.

@topherbullock
Member Author

#1458 captures fixing the root cause for those errors. That one was closed off in favor of a larger-scoped fix for how we react to resources existing on workers which are in a stalled state.

@topherbullock
Member Author

topherbullock commented Sep 19, 2017

This is unset, what's the default behaviour?

Default is false, and in that case Garden will try and restore from the depot dir.
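
So an operator who does want a clean slate on every start would have to opt in explicitly, e.g. (a minimal sketch; all other worker flags and the TSA configuration are omitted):

# Opt in to wiping existing containers on startup; left unset (false), Garden
# restores containers from the depot dir instead.
CONCOURSE_GARDEN_DESTROY_CONTAINERS_ON_STARTUP=true concourse worker --name="$(hostname)"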

@WillemMali

I'm currently integrating Concourse with my k8s/helm setup and I'm planning on upstreaming it to the main Helm chart repo; if anyone wants it, I can upload what I have right now.

@william-tran
Contributor

@WillemMali have you seen what's there right now? https://github.com/kubernetes/charts/tree/master/stable/concourse and my PR to it for improving worker lifecycle management helm/charts#2109

@WillemMali

WillemMali commented Sep 22, 2017 via email

@william-tran
Contributor

@WillemMali best to move your reasons for changes to the chart to an issue against https://github.com/kubernetes/charts

@WillemMali

I was a bit rude there, sorry. Thank you for your suggestion.

@jama22 added this to Backlog in Runtime Nov 20, 2017
@vito removed this from Backlog in Operations Nov 22, 2017
@topherbullock
Member Author

@pivotal-jwinters and I had some thoughts about this while working on #1027 and discussing having the workers' BaggageClaim initialize with a certs volume:

We could have the workers advertise their existing handles for containers and volumes when registering, as well as a unique identifier (generated on process start, maybe? persisted somewhere tied to worker state?) which would allow the ATC to more reliably determine whether it can be considered the "same" worker or is effectively a new worker.

Advertising the volumes also enables the ATC to cross-reference the state on the worker with its assumed state, and determine whether the ATC has volumes in the DB which aren't in the worker's BaggageClaim, or vice versa (tl;dr no more "unknown handle" errors).

@EugenMayer

@topherbullock I'm trying to organize my thoughts; maybe this is of some use to you:

  1. In every orchestration tool I know, hostname resets go hand in hand with anonymous-volume resets. That means if we want to make the worker state persistent (a named volume), which would be required for some of your thoughts, we cannot rely on the hostname as the identifier - we could end up with the same state (named volume) but a new hostname, and the two would get mixed up.

  2. I think "advertising" might not be needed if we can just keep the state consistent? If a worker shuts down, and the actual docker engine and volumes / overlayfs are all part of /worker-state, it should come back up exactly the way it was shut down. So just ensuring we do not lose any state that BaggageClaim or the ATC expects would make it consistent out of the box, and it ensures we keep as much as possible and restarts become just restarts.

  3. I think checking whether a volume is referenced in the DB but not present on the worker, or vice versa, is actually a huge plus for advertising, since a lot of people have mentioned state "bleeding" over time. So even if 2 ensures "maximum state persistence", advertising could help clean things up a lot better.

Right now, even if I pin the hostname to a static string and then use a named volume for state, it will not work, because the state is not recovered properly (even with a proper worker land before the restart). Can we find a reason for that?
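
For context, the setup I'm describing is roughly this (a sketch only; the image, volume name, hostname, work dir and all TSA/key configuration are placeholders):

# Named volume so the worker state survives container recreation:
docker volume create concourse-worker-state

# Pin the hostname so the worker re-registers under the same name, and mount
# the named volume at the configured work dir:
docker run --detach --privileged \
  --hostname worker-1 \
  --volume concourse-worker-state:/worker-state \
  --env CONCOURSE_WORK_DIR=/worker-state \
  concourse/concourse worker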

@topherbullock
Member Author

Yeah, @EugenMayer I'm kind of just brain-dumping on this thread to try and collect thoughts.
Regarding:

  1. Moving to a unique identifier other than the worker name would move us away from relying on the hostname as the only way to identify the workers.
  2 & 3. The idea behind advertising is to ensure that the desired state in the DB matches the state the worker currently has in its Garden + BaggageClaim. This means we can account for whatever changes have happened (if any) to the state volumes / directory, and it helps us stay agnostic of how those are actually managed.

@EugenMayer

Advertising sounds great, but maybe rather hard to really get right. It sounds like a far-off goal to me - is there any way we could do this in 2 milestones: first with unique IDs + a storage check, second with advertising to sanitize further and validate/sync?

@william-tran
Contributor

william-tran commented Dec 16, 2017

@topherbullock @EugenMayer and everyone else, I really appreciate the attention you're giving to this issue.

@EugenMayer I ran into the same issue with land vs. retire when running workers in k8s with the Helm chart; I found retire to be the better option and updated the chart to manage the worker lifecycle as a result. See https://github.com/kubernetes/charts/pull/2109/files#diff-33a16a0789cde854c77e65e7774aa2beL35 for the related changes. We're trying to solve the same problem, so I want to give you a heads-up on my approach.

@topherbullock I am running into more issues, though, around detecting fatal errors that should trigger a restart. I had unknown handle as a fatal error, but once I moved to multiple concourse-web pods, I would get lots of these. Looking at the worker logs, I saw a request to delete a volume succeed, but then another deletion request would come in (I'm guessing from a different ATC) and log unknown handle, when really the worker could continue taking on work. I've removed unknown handle from the list of fatal errors, and am monitoring closely to see if I can catch the truly unrecoverable conditions.

In all this, I've seen an opportunity to better clean up before starting the worker under the same name, something like:

# Wipe any leftover local state before re-registering under the same name.
rm -rf /concourse-work-dir/*
# Retire the previous incarnation of this worker, retrying until the TSA
# reports that no worker with this name is registered anymore.
while ! concourse retire-worker --name=${HOSTNAME} | grep -q worker-not-found; do
  sleep 5
done
# Start the worker, teeing output to a file the liveness probe can watch.
concourse worker --name=${HOSTNAME} | tee -a /concourse-work-dir/.liveness_probe

I'm relying on this log to determine whether the namesake worker is still registered or not:
https://github.com/concourse/tsa/blob/d563b2ac75c8e47e2342baaac205370765f8339b/retirer.go#L53

Sure it's hacky, but what can I do? Do you see any issues arising from calling retire-worker multiple times?

@william-tran
Contributor

Super Simple Solution: expose delete-worker as a subcommand of the concourse binary. Then I can run it before I start the worker to ensure state on both sides is clean. I've tried curl -X DELETE https://my-concourse.com/api/v1/workers/my-worker-name and it works wonderfully, except I can't actually do that from the worker.
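
In case anyone copies this: against an ATC with auth enabled the call also needs a bearer token, something like the following (ATC_TOKEN, the URL and the worker name are placeholders; I pull a token out of ~/.flyrc after a fly login):

# ATC_TOKEN is assumed to hold a valid bearer token for this Concourse deployment.
curl -X DELETE \
  -H "Authorization: Bearer ${ATC_TOKEN}" \
  https://my-concourse.com/api/v1/workers/my-worker-name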

@emcniece

🔺 I'd like to document a red herring that just cost me 2 days - this issue is where all roads lead for the file not found error; please let me know if I should post this elsewhere.

- get: my-docker-image
  params: {skip_download: true}   # <-- THIS IS A PROBLEM

- task: my-task
  image: my-docker-image
  file: tasks/my-task.yml

- put: my-docker-image
  build: docker/my-docker-image

If you flush worker state and rerun a pipeline with a config as defined in this example, the my-task step will fail with the file not found error because no previous builds of this image have been created/cached in Concourse.

If you are trying to get better performance out of building Docker images in a pipeline, use the docker-image resource's cache or load_base params instead.

This is expected behaviour, but wow I spent a lot of time upgrading and debugging workers before ripping my pipelines apart.

Kehrlann added a commit to Kehrlann/concourse-demo that referenced this issue Jan 21, 2018
- Since we run on docker compose, we start and stop the worker very often.
  Apparently this poses a resource caching issue: the worker believes it
  has the image in cache, but doesn't, hence a weird "file not found" error.
- See: concourse/concourse#1457
@william-tran
Contributor

@vito about trying to retire a stalled worker, here's the original issue I opened: #1919

@leshik

leshik commented Mar 22, 2018

I've been hit by this issue today when DigitalOcean restarted the hypervisor with my concourse installation. I'm using docker-compose deployment. After the reboot, the worker became stalled and ATC complained:

{"timestamp":"1521691858.311057329","source":"atc","message":"atc.container-collector.run.orphaned-containers.queue.worker-not-found","log_level":1,"data":{"session":"54.1078.1.21","worker-name":"5f2c668feaa4"}}

So I issued fly prune-worker to clean things up. But unfortunately, after that a new worker didn't come up, and the ATC complained with a failed-to-initialize-new-container / no workers error. I had to issue docker-compose down and then docker-compose up to make it work.

This last part is especially frustrating - how can the need to manually restart the worker's container be avoided?

@vito
Member

vito commented Jul 30, 2018

Quick update: we've merged and finished up vmware-archive/atc#247 which adds support for ephemeral workers. Ephemeral workers will just immediately go away instead of stalling. This will be off by default, but we're willing to change that for some scenarios (perhaps Docker-Compose).

This is just one piece of the puzzle but it's there now!
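
For anyone who wants to try it once it's released, it should just be a matter of setting the new flag on the worker, roughly like this (a sketch; the flag/env name is assumed from that PR, so double-check concourse worker --help, and all other worker configuration is omitted):

# Mark this worker as ephemeral so it is removed immediately instead of
# stalling when it goes away:
CONCOURSE_EPHEMERAL=true concourse worker --name="$(hostname)"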

@ddadlani
Contributor

We have added tests for this in https://github.com/concourse/concourse/blob/master/topgun/k8s/worker_lifecycle_test.go and created specific issues so I'm going to close this issue for now. Feel free to open another issue for any specific problems.
