New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distribute container/volume garbage-collection across workers #1959

Closed
vito opened this Issue Jan 15, 2018 · 7 comments

Comments

@vito
Member

vito commented Jan 15, 2018

Feature Request

What challenge are you facing?

Currently the ATC is responsible for destroying containers/volumes across workers. This is very network-intensive and error prone and difficult to parallelize while having reasonable resource consumption (it's easy to just have a swarm of connections lead to a 'too many open files' error). This happened to our large-scale Concourse instance, Wings, resulting in the whole server being dead and volumes leaking forever.

A Modest Proposal

Here's one idea:

  1. Don't have the ATC talk to workers to destroy containers/volumes - have its GC only mark them as 'destroying'.
  2. Add an API endpoint, /api/v1/workers/<name>/sync (or something). Details described after.
  3. Add a TSA command, sync (or whatever we call this), which does a POST to the above endpoint with the worker's list of container/volume handles.
  4. Any container/volume handles not included in the submitted list will be removed from the ATC's database.
  5. The API endpoint then returns the list of container/volume handles in DESTROYING state. This is passed on to the caller of the sync command.
  6. Add a process on the worker that periodically invokes this sync command (using the worker's private key as authorization), and destroys the returned containers/volumes.

This has quite a few benefits:

  1. Much fewer moving parts in the GC - it's all just database work now, making it much less prone to locking up.
  2. Easier to reason about worker parallelism, now that it's each worker doing its own slice. We'd just need a max-in-flight, rather than a fancy per-worker job queue.
  3. More effectively distributes work across the cluster; the ATC is no longer a bottleneck for removing all containers/volumes.
  4. This also fixes the unrecoverable cases where volumes/containers are removed out-of-band from the workers, leading to unknown handle errors - the initial POST will clear them out. Ref. #1255, #1305, #1322, #1550, #1721, #1821. As a result this should help out with #1457.

@vito vito added this to Icebox in Runtime via automation Jan 15, 2018

@vito vito added the incident label Jan 15, 2018

@vito vito moved this from Icebox to Backlog in Runtime Jan 15, 2018

@marco-m

This comment has been minimized.

marco-m commented Jan 17, 2018

Regarding the step "Add a process on the worker that periodically invokes this sync command", I would suggest to add on each worker a randomization factor when calculating the time for the next "sync", to avoid a synchronized "sync" storm from the workers to the ATC.

@william-tran

This comment has been minimized.

Contributor

william-tran commented Jan 24, 2018

On k8s, when a worker dies (for whatever reason), I would like it to re-register with the same name and be in a state to take on work. To do this, I'd like it to synchronize with ATC on startup, and if the simplest and most foolproof state to agree on is the empty state, as if this worker never existed before, that works for me. I don't want to have to clear the concourse-work-dir before I start the concourse process to ensure clean startup, or need to call another command (eg retire-worker, or fly prune-worker) to get ATC to agree on the empty state.

@vito

This comment has been minimized.

Member

vito commented Jan 26, 2018

There's one thing to be very careful about here. Say this order of operations happens:

  1. Worker collects its set of containers/volumes to sync with the ATC.
  2. The ATC instructs the worker to create a container or volume.
  3. The container/volume finishes being created.
  4. The ATC receives the sync request. The original set of containers/volumes won't include the newly-created one, and it will be immediately reaped from the database. Uh oh!

Not sure what to do about this yet. One way to do it would be to linearize it in some way that the ATC can tell that the sync request is old or cannot possibly contain the newly created container/volume (as opposed to it having disappeared), and to not remove it from the database. This could be done e.g. with a timestamp included in the sync request, and the ATC would only remove containers that were created prior to that timestamp. (To make this more foolproof we wouldn't use a timestamp; a monotonically increasing number would do, but then we need to define a source of truth for that so the worker knows what to send...)

@marco-m

This comment has been minimized.

marco-m commented Jan 27, 2018

Nice distributed system race condition :-)

One possibility would be to make the ATC the source of truth for the sequence number (the "monotonically increasing number") as follows:

  1. If the worker already has a sequence number from the ATC, it uses it, otherwise it uses 0.
  2. Worker collects its set of containers/volumes to sync with the ATC, with the current sequence number.
  3. The ATC instructs the worker to create a container or volume. Among the information, there is a new sequence number.
  4. The container/volume finishes being created.
  5. The ATC receives the sync request. The original set of containers/volumes won't include the newly-created one, but the request contains the old sequence number, so enough information for the ATC to decide what to remove.
@topherbullock

This comment has been minimized.

Member

topherbullock commented Feb 5, 2018

Breaking this down into some bite-sized issues:

  • Move worker common registration in concourse/bin and the BOSH release's groundcrew job into concourse/worker
  • Add batch volume and container deletion capabilities to concourse/worker #2109
  • Add ATC "marked garbage" API for worker to query ( via TSA ) which volumes and containers they should remove
  • Add ATC "reconciliation" API to allow workers to report what resources they still have after the initial tick of GC has finished; ATC should reconcile its database with the real state of the worker.
  • Modify the GC of containers and volumes on the ATC to only mark containers for deletion, and have the workers "sweep" the list of things to delete from their own baggageclaim and garden
  • Create a component of the worker which runs on some interval (configurable with a sane default) to hit the "marked garbage" API and start the sweep phase to delete volumes and containers.
  • Report back to the ATC "reconciliation" API from the worker after the "sweep" phase finishes with the current state of the wold on the worker.

@topherbullock topherbullock moved this from Backlog to In Flight in Runtime Feb 7, 2018

@topherbullock topherbullock moved this from In Flight to Backlog in Runtime Feb 8, 2018

@topherbullock topherbullock moved this from Backlog to In Flight in Runtime Feb 15, 2018

@vito vito moved this from In Flight to Backlog in Runtime Mar 5, 2018

@jama-pivotal jama-pivotal moved this from Backlog to In Flight in Runtime Mar 26, 2018

@xtremerui xtremerui self-assigned this Apr 23, 2018

@xtremerui

This comment has been minimized.

Contributor

xtremerui commented Apr 24, 2018

Breaking out the tasks further

  • Add ATC "marked garbage" API to ATC for workers to query which volumes they should remove
  • Add ATC "marked garbage" API to ATC for workers to query which containers they should remove
  • Add TSA command for fetching fetching destroying containers from ATC
  • Add ATC "reconciliation" API to allow workers to report what containers they still have after the initial tick of GC has finished; ATC should reconcile its database with the real state of the worker.
  • Cleanup reaper client on ATC heartbeat msg
  • Add TSA command for fetching fetching destroying volumes from ATC
  • Add ATC "reconciliation" API to allow workers to report what volumes they still have after the initial tick of GC has finished; ATC should reconcile its database with the real state of the worker.

shashwathi added a commit that referenced this issue May 3, 2018

bump atc bin topgun tsa worker
#1959

Submodule src/github.com/concourse/atc 310a311..97c9af1:
  > Add APIs to list destroying containers
  > Update marked destroy containers api without team context
  > Add API to report volumes to destroyed for a given worker
  > Add mark API
  > Merge pull request #271 from timrchavez/timrchavez/issue_1717
  > Merge pull request #265 from SHyx0rmZ/set-content-type
Submodule src/github.com/concourse/bin 6cd414a80..baac81ca9:
  > Add new runner for sweeping containers
  > Merge pull request #42 from osis/releasethequickstart
Submodule src/github.com/concourse/topgun 7706c3b..e9251aa:
  > Enable debug to see more logs for Topgun
Submodule src/github.com/concourse/tsa 62eb12d..5576ee1:
  > Add command to sync work status with ATC
  > Add tsa cmd to sweep containers
Submodule src/github.com/concourse/worker 907ef55..85f0974:
  > Add sweeper runner to worker start command

Signed-off-by: Rui Yang <ryang@pivotal.io>

shashwathi pushed a commit that referenced this issue May 8, 2018

bump atc bin tsa worker
#1959

Submodule src/github.com/concourse/atc f498067..4561536:
  > cleanup of reaper URL and reaper client references
Submodule src/github.com/concourse/bin 11139b93e..505ce5502:
  > Remove reaper URL for heartbeat msg
Submodule src/github.com/concourse/tsa 51a5503..88b5b1f:
  > Remove reaper addr from forward connection of workers
Submodule src/github.com/concourse/worker 13192ce..05f8abe:
  > Cleanup http client defaults
  > Remove reaper addr from heartbeat msg

Signed-off-by: Shash Reddy <sreddy@pivotal.io>

shashwathi added a commit that referenced this issue May 8, 2018

bump atc
#1959

Submodule src/github.com/concourse/atc 4561536..f64cb7f:
  > Fix down migration

Signed-off-by: Shash Reddy <sreddy@pivotal.io>

shashwathi added a commit that referenced this issue May 8, 2018

bump atc
#1959

Submodule src/github.com/concourse/atc f64cb7f..c92ce5b:
  > Update bindata

Signed-off-by: Shash Reddy <sreddy@pivotal.io>

xtremerui added a commit that referenced this issue May 15, 2018

bump atc tsa worker
#1959

Submodule src/github.com/concourse/atc c92ce5b..448ebf3:
  > Cleanup volume GC
Submodule src/github.com/concourse/tsa 88b5b1f..0d97c09:
  >  Add sweep and report functionality for volumes
Submodule src/github.com/concourse/worker 05f8abe..f388956:
  > Run volume GC after container GC

Signed-off-by: Shash Reddy <sreddy@pivotal.io>

@xtremerui xtremerui moved this from In Flight to Done in Runtime May 16, 2018

@topherbullock

This comment has been minimized.

Member

topherbullock commented Jun 5, 2018

Looks good!
One caveat here is that if a container or volume disappears from a worker, and it hasn't been marked as destroying in the DB. The ATC will never delete that container or volume from the worker, as it hasn't been marked as destroying. So, as it exists now, this works well at ensuring the state machine marches on, but doesn't handle the case of the disappearing container.

See :
https://github.com/concourse/atc/blob/master/api/containerserver/report.go#L11:18
https://github.com/concourse/atc/blob/master/db/container_repository.go#L62

These calls are where the worker reports the list of containers it knows about, and the ATC destroys any Destroying containers which aren't in that list. Ideally this should also perform some diff action to determine the set of things which should stay, and what should go. This is where that race condition comes in, but we can make that a separate issue ( and maybe use the worker's last heartbeat time in the DB for sequencing? )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment