
GC containers & volumes from the DB if a worker reports that it no longer has them #2588

Closed
topherbullock opened this issue Sep 13, 2018 · 12 comments
Labels: accepted, enhancement, release/documented, resiliency, size/medium
topherbullock (Member) commented Sep 13, 2018

When a worker reports the list of containers/volumes it actually has back to the ATCs, we should diff that set against what the DB thinks should be there and remove the DB entries the worker didn't report having.

This will help with some of the issues that arise when a worker goes away and comes back with different state.

Gotcha (potential refactoring / documentation): gc.Destroyer's methods take in the list of reported handles, not the list of handles to destroy :)
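For illustration, the diff described above could boil down to a single query against the containers table (a sketch only, assuming handle and worker_name columns and a reported-handles parameter; the volumes side would look the same):

-- which DB rows does the worker not know about?
-- $1 = worker name, $2 = array of handles the worker just reported
SELECT handle
FROM containers
WHERE worker_name = $1
AND NOT (handle = ANY ($2));

Anything returned by a query along these lines is a candidate for removal from the DB, subject to the state concerns discussed below.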

topherbullock created this issue from a note in Runtime (Backlog) Sep 13, 2018
vito (Member) commented Sep 13, 2018

Make sure there are no race conditions during container creation, since a CREATING row will be in the db and it may not be on the worker yet. Maybe only remove CREATED entries?
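As a sketch of that restriction (assuming a state column with a 'created' value, as used elsewhere in this thread), the reconciliation would only consider settled rows:

-- only rows that have finished creating are candidates for removal
SELECT handle
FROM containers
WHERE worker_name = $1
AND state = 'created'
AND NOT (handle = ANY ($2));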

cirocosta moved this from Backlog to In Flight in Runtime Sep 18, 2018
cirocosta added the size/medium label Sep 18, 2018
cirocosta pushed a commit that referenced this issue Sep 21, 2018
#2588

Signed-off-by: Mark Huang <mhuang@pivotal.io>
cirocosta (Member) commented

Hey,

@mhuangpivotal and I started working on this issue, having tackled the first part of it (GCing containers).

Our approach is the following: add a last_seen column to containers, update it whenever a worker reports its containers, and have the GC collect any container whose last_seen is older than a configurable TTL.

Some notes:

  • Should we emit specific metrics regarding these GC'd containers?
  • Should we document that the TTL must not be smaller than the worker GC interval? Should we enforce that in atccmd? (If it's smaller, we can have a race between gc and reportcontainers.)
  • The migration that adds the last_seen column does not retroactively mark existing containers as seen, so containers that are already missing will never be collected by the GC.

The work can be found on the feature/2588 branch, to be pushed after we tackle Volume as well (it should follow the same pattern).

Wdyt? Could we be missing an edge case?

thx!

vito (Member) commented Sep 21, 2018

@cirocosta Hmm, my only concern is that this would mean a constant stream of writes to the database. Instead of marking last_seen = now() for all containers on every worker heartbeat/report, could we instead only mark something like missing_since = now() on the containers that were not in the report?
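A minimal sketch of that conditional mark, assuming a nullable missing_since timestamp column on containers:

-- mark only the containers that were NOT in the worker's report,
-- and only if they aren't already marked
-- $1 = worker name, $2 = array of reported handles
UPDATE containers
SET missing_since = now()
WHERE worker_name = $1
AND missing_since IS NULL
AND NOT (handle = ANY ($2));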

cirocosta (Member) commented

Hey @vito,

We thought of using last_seen instead of missing_since because of the case where a container marked as missing might come back again in the future, in which case we'd need to clear its missing_since.

For instance:


        t0                              t1                                 t2

      .-> no missing since                                          .-> remove missing since
      |                                                             |
containerA reported                                         containerA reported

                                .-> add missing since
                                |
                          containerA not reported

Using last_seen:

// at container reporting time, mark the containers as seen
// (a single update query over all of them - not an actual loop)
for _, container := range containers_reported {
  mark_as_seen(container)
}

Using missing_since:

// at container reporting time
containers_on_worker := find_worker_containers(worker_name)
containers_to_mark := diff(containers_on_worker, containers_reported)

// needed to take care of the case described above
for _, container := range containers_reported {
  remove_missing_since(container)
}

for _, container := range containers_to_mark {
  mark_as_missing_since_if_not_marked_yet(container)
}

It seems to us that, in the end, using missing_since would require the same number of writes as using last_seen once we account for that case.

Wdyt?

thx!


By the way, at GC time things would be the same either way:

// using last_seen
containers_to_delete := any_container_with_last_seen_gt_ttl()
for _, container := range containers_to_delete {
  delete(container)
}

// using missing_since
containers_to_delete := any_container_with_missing_since_gt_ttl()
for _, container := range containers_to_delete {
  delete(container)
}
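Concretely, the missing_since variant of that GC-time check could be a single query along these lines (a sketch; the grace-period value is an arbitrary placeholder):

-- collect the containers that have been missing longer than the grace period
SELECT handle
FROM containers
WHERE missing_since IS NOT NULL
AND now() - missing_since > interval '5 minutes';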

vito (Member) commented Sep 24, 2018

@cirocosta I don't think it'll end up being the same number of writes. You should only have to clear out missing_since for containers that both a) have it set and b) ended up being in a subsequent report of the containers. Something like:

UPDATE containers
SET missing_since = NULL
WHERE missing_since IS NOT NULL
AND handle IN (...)

We've had similar cases in the past where making the write conditional leads to a significant performance improvement. One example: 8aa4aca

So with last_seen you have a guaranteed write for every container on every report. With missing_since you only have a write for each missing container, plus a handful of writes for containers that were created right at the time boundary and need their missing_since cleared on the next report. The total number of writes should end up much lower, since they're tied to either emergent cases (containers disappearing) or edge cases (a container created close to the time of a report).

cirocosta (Member) commented

Oooh got it!

That makes sense, didn't think about that.

Thanks!

mhuangpivotal pushed a commit that referenced this issue Sep 24, 2018
#2588

Signed-off-by: Mark Huang <mhuang@pivotal.io>
mhuangpivotal added a commit that referenced this issue Sep 24, 2018
mhuangpivotal pushed a commit that referenced this issue Sep 25, 2018
#2588

Signed-off-by: Mark Huang <mhuang@pivotal.io>
mhuangpivotal added a commit that referenced this issue Sep 25, 2018
mhuangpivotal added a commit that referenced this issue Sep 25, 2018
#2588

Signed-off-by: Mark Huang <mhuang@pivotal.io>
xtreme-sameer-vohra (Contributor) commented Sep 26, 2018

Another datapoint to consider:
Concourse v4.2.1
IaaS: GCP + BOSH

We just observed an issue where a job had a step that was in progress for ~24 hours without any output or updates.
The DB stated that the container for the step was in the creating state, while the worker had no state for the container.
The ATC had been restarted due to another issue ~24 hours earlier, which corresponds with the time when the container's state went wonky.

Questions to ponder:

  • If a container is in the creating state and the ATC goes down/away, can another ATC safely recover that step?
  • We don't have access to the logs anymore; however, the ATC was restarted via BOSH and the restart was successful. Does an ATC shutting down gracefully properly manage the state of containers that are still creating?

topherbullock (Member, Author) commented

If the build is still running, then whichever ATC is tracking the build will eventually use the worker.ContainerProvider to FindOrCreateContainer in this loop.
https://github.com/concourse/concourse/blob/master/atc/worker/container_provider.go#L87

If the db record for the step's container goes away, a new "creating" container should be created in the DB and the ATC will find a worker to create it on. Looking at the code this should actually recover in any case, so maybe this flake is worth a separate issue. I'll write one up.

topherbullock (Member, Author) commented

Update: @cirocosta and I discovered an edge case where containers (and volumes) that have been in the creating state for some time could be GC'd right after they transition to created:

      WORKER                                  ATC
 T1
 T2                                           container_provider: hey make me a volume V1
 T3   bc: k i'll make V1
 T4                                           db: okay V1 is CREATING

 T5   report: here's my volumes [ V2, V3 ]    cool.. I guess V1 is missing?

 T6                                           db: V1 is missing since T5

 T7   report: here's my volumes [ V2, V3 ]    oh V1 is still missing?

 T8                                           db: V1 is missing since T5

 T9   bc: k I'm done making V1                db: okay marking V1 as CREATED

T10                                           GC: killing all CREATED volumes missing since T5

The fix for this is to only set the missing_since column when the container or volume is in the CREATED or DESTROYING state.
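In query form, the fix amounts to adding a state guard to the marking update sketched earlier (state values assumed to be 'creating', 'created', and 'destroying'):

-- never mark rows that are still in the creating phase
UPDATE containers
SET missing_since = now()
WHERE worker_name = $1
AND missing_since IS NULL
AND state IN ('created', 'destroying')
AND NOT (handle = ANY ($2));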

topherbullock pushed a commit that referenced this issue Nov 14, 2018
#2588

Signed-off-by: Topher Bullock <cbullock@pivotal.io>
vito (Member) commented Nov 14, 2018

Another bit of acceptance criteria: let's make sure we don't accidentally nuke all containers/volumes if a worker stalls for longer than the grace period. We shouldn't consider these "missing", since the whole worker is having issues. We should also be careful not to immediately remove them when the worker comes back to running. (This all might work fine as-is; I just want to confirm during acceptance.)

romangithub1024 (Contributor) commented Nov 15, 2018

We have seen several issues recently with jobs failing due to "volume-not-found" errors after switching to ephemeral workers.

Not certain about the root cause yet, but it sounds like this patch should help? Hopefully the GC/volume workflow isn't any different for ephemeral workers... just something to keep in mind.

topherbullock (Member, Author) commented

@vito We verified the worker-stalling case and everything worked as expected! We're only marking things as missing when the worker is successfully heartbeating, so it conveniently "just works"™.
