
GC containers & volumes from the DB if a worker reports that it no longer has them #2588

Closed
topherbullock opened this issue Sep 13, 2018 · 12 comments
Labels: accepted, enhancement, release/documented, resiliency, size/medium
topherbullock (Member) commented Sep 13, 2018

When a worker reports the list of containers/volumes it actually has back to the ATCs, we should diff that set against what the DB thinks should be there and remove the DB entries the worker didn't report having.

This will help with some of the issues that arise when a worker goes away and comes back with different state.

Gotcha (potential refactoring / documentation): gc.Destroyer's methods take in the list of reported handles, not the list of handles to destroy :)
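For illustration, the diff described above could boil down to a single query against the containers table (a sketch only, assuming handle and worker_name columns and a reported-handles parameter; the volumes side would look the same):

-- which DB rows does the worker not know about?
-- $1 = worker name, $2 = array of handles the worker just reported
SELECT handle
FROM containers
WHERE worker_name = $1
AND NOT (handle = ANY ($2));

Anything returned by a query along these lines is a candidate for removal from the DB, subject to the state concerns discussed below.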

topherbullock created this issue from a note in Runtime (Backlog) Sep 13, 2018
vito (Member) commented Sep 13, 2018

Make sure there are no race conditions during container creation, since a CREATING row will be in the db and it may not be on the worker yet. Maybe only remove CREATED entries?
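As a sketch of that restriction (assuming a state column with a 'created' value, as used elsewhere in this thread), the reconciliation would only consider settled rows:

-- only rows that have finished creating are candidates for removal
SELECT handle
FROM containers
WHERE worker_name = $1
AND state = 'created'
AND NOT (handle = ANY ($2));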

cirocosta moved this from Backlog to In Flight in Runtime Sep 18, 2018
cirocosta added the size/medium label Sep 18, 2018
cirocosta pushed a commit that referenced this issue Sep 21, 2018
#2588

Signed-off-by: Mark Huang <mhuang@pivotal.io>
cirocosta (Member) commented

Hey,

@mhuangpivotal and I started working on this issue, having tackled the first part of it (GCing containers).

Our approach is the following: add a last_seen column to containers, update it whenever a worker reports its containers, and have the GC collect any container whose last_seen is older than a configurable TTL.

Some notes:

  • Should we emit specific metrics regarding these GC'd containers?
  • Should we document that the TTL must not be smaller than the worker GC interval? Should we enforce that in atccmd? (If it's smaller, we can have a race between gc and reportcontainers.)
  • The migration that adds the last_seen column does not retroactively mark existing containers as seen, so containers that are already missing will never be collected by the GC.

The work can be found on the feature/2588 branch, to be pushed after we tackle Volume as well (it should follow the same pattern).

Wdyt? Could we be missing an edge case?

thx!

vito (Member) commented Sep 21, 2018

@cirocosta Hmm, my only concern is that this would mean a constant stream of writes to the database. Instead of marking last_seen = now() for all containers on every worker heartbeat/report, could we instead only mark something like missing_since = now() on the containers that were not in the report?
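A minimal sketch of that conditional mark, assuming a nullable missing_since timestamp column on containers:

-- mark only the containers that were NOT in the worker's report,
-- and only if they aren't already marked
-- $1 = worker name, $2 = array of reported handles
UPDATE containers
SET missing_since = now()
WHERE worker_name = $1
AND missing_since IS NULL
AND NOT (handle = ANY ($2));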

cirocosta (Member) commented

Hey @vito,

We thought of using last_seen instead of missing_since because of the case where a container marked as missing might come back again in the future, in which case we'd need to clear its missing_since.

For instance:


        t0                              t1                                 t2

      .-> no missing since                                          .-> remove missing since
      |                                                             |
containerA reported                                         containerA reported

                                .-> add missing since
                                |
                          containerA not reported

Using last_seen:

// at container reporting time, mark the containers as seen
// (a single update query over all of them - not an actual loop)
for _, container := range containers_reported {
  mark_as_seen(container)
}

Using missing_since:

// at container reporting time
containers_on_worker := find_worker_containers(worker_name)
containers_to_mark := diff(containers_on_worker, containers_reported)

// needed to take care of the case described above
for _, container := range containers_reported {
  remove_missing_since(container)
}

for _, container := range containers_to_mark {
  mark_as_missing_since_if_not_marked_yet(container)
}

It seems to us that, in the end, using missing_since would require the same number of writes as using last_seen once we account for that case.

Wdyt?

thx!


By the way, at GC time things would be the same either way:

// using last_seen
containers_to_delete := any_container_with_last_seen_gt_ttl()
for _, container := range containers_to_delete {
  delete(container)
}

// using missing_since
containers_to_delete := any_container_with_missing_since_gt_ttl()
for _, container := range containers_to_delete {
  delete(container)
}
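Concretely, the missing_since variant of that GC-time check could be a single query along these lines (a sketch; the grace-period value is an arbitrary placeholder):

-- collect the containers that have been missing longer than the grace period
SELECT handle
FROM containers
WHERE missing_since IS NOT NULL
AND now() - missing_since > interval '5 minutes';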

vito (Member) commented Sep 24, 2018

@cirocosta I don't think it'll end up being the same number of writes. You should only have to clear out missing_since for containers that both a) have it set and b) ended up being in a subsequent report of the containers. Something like:

UPDATE containers
SET missing_since = NULL
WHERE missing_since IS NOT NULL
AND handle IN (...)

We've had similar cases in the past where making the write conditional leads to a significant performance improvement. One example: 8aa4aca

So with last_seen you have a guaranteed write for every container on every report. With missing_since you only have a write for each missing container, plus a handful of writes for containers that were created right at the time boundary and need their missing_since cleared on the next report. The total number of writes should end up much lower, since they're tied to either emergent cases (containers disappearing) or edge cases (a container created close to the time of a report).

cirocosta (Member) commented

Oooh got it!

That makes sense, didn't think about that.

Thanks!

mhuangpivotal pushed a commit that referenced this issue Sep 24, 2018
#2588

Signed-off-by: Mark Huang <mhuang@pivotal.io>
mhuangpivotal added a commit that referenced this issue Sep 24, 2018
mhuangpivotal pushed a commit that referenced this issue Sep 25, 2018
#2588

Signed-off-by: Mark Huang <mhuang@pivotal.io>
mhuangpivotal added a commit that referenced this issue Sep 25, 2018
mhuangpivotal added a commit that referenced this issue Sep 25, 2018
#2588

Signed-off-by: Mark Huang <mhuang@pivotal.io>
xtreme-sameer-vohra (Contributor) commented Sep 26, 2018

Another datapoint to consider:
Concourse v4.2.1
IaaS: GCP + BOSH

We just observed an issue where a job had a step that was in progress for ~24 hours without any output or updates.
The DB stated that the container for the step was in the creating state, while the worker had no state for the container.
The ATC had been restarted due to another issue ~24 hours earlier, which corresponds with the time when the container's state went wonky.

Questions to ponder:

  • If a container is in the creating state and the ATC goes down/away, can another ATC safely recover that step?
  • We don't have access to the logs anymore; however, the ATC was restarted via BOSH and the restart was successful. Does an ATC shutting down gracefully properly manage the state of containers that are still creating?

topherbullock (Member, Author) commented

If the build is still running, then whichever ATC is tracking the build will eventually use the worker.ContainerProvider to FindOrCreateContainer in this loop.
https://github.com/concourse/concourse/blob/master/atc/worker/container_provider.go#L87

If the db record for the step's container goes away, a new "creating" container should be created in the DB and the ATC will find a worker to create it on. Looking at the code this should actually recover in any case, so maybe this flake is worth a separate issue. I'll write one up.

topherbullock (Member, Author) commented

Update: @cirocosta and I discovered an edge case where containers (and volumes) that have been in the creating state for some time could be GC'd right after they transition to created:

      WORKER                                  ATC
 T1
 T2                                           container_provider: hey make me a volume V1
 T3   bc: k i'll make V1
 T4                                           db: okay V1 is CREATING

 T5   report: here's my volumes [ V2, V3 ]    cool.. I guess V1 is missing?

 T6                                           db: V1 is missing since T5

 T7   report: here's my volumes [ V2, V3 ]    oh V1 is still missing?

 T8                                           db: V1 is missing since T5

 T9   bc: k I'm done making V1                db: okay marking V1 as CREATED

T10                                           GC: killing all CREATED volumes missing since T5

The fix for this is to only set the missing_since column when the container or volume is in the CREATED or DESTROYING state.
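In query form, the fix amounts to adding a state guard to the marking update sketched earlier (state values assumed to be 'creating', 'created', and 'destroying'):

-- never mark rows that are still in the creating phase
UPDATE containers
SET missing_since = now()
WHERE worker_name = $1
AND missing_since IS NULL
AND state IN ('created', 'destroying')
AND NOT (handle = ANY ($2));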

topherbullock pushed a commit that referenced this issue Nov 14, 2018
#2588

Signed-off-by: Topher Bullock <cbullock@pivotal.io>
vito (Member) commented Nov 14, 2018

Another bit of acceptance criteria: let's make sure we don't accidentally nuke all containers/volumes if a worker stalls for longer than the grace period. We shouldn't consider these "missing", since the whole worker is having issues. We should also be careful not to immediately remove them when the worker comes back to running. (This all might work fine as-is; I just want to confirm during acceptance.)

romangithub1024 (Contributor) commented Nov 15, 2018

We have seen several issues recently with jobs failing due to "volume-not-found" errors after switching to ephemeral workers.

Not certain about the root cause yet, but it sounds like this patch should help? Hopefully the GC/volume workflow isn't any different for ephemeral workers... just something to keep in mind.

topherbullock (Member, Author) commented

@vito We verified the worker-stalling case and everything worked as expected! We're only marking things as missing when the worker is successfully heartbeating, so it conveniently "just works"™.
