GC containers & volumes from the DB if a worker reports that it no longer has them #2588
Make sure there are no race conditions during container creation, since a
Hey, @mhuangpivotal and I started working on this issue, having tackled the first part of it (GCing containers). Our approach is the following:
Some notes:
The work can be found on the feature/2588 branch, to be pushed after we tackle volumes as well (they should follow the same pattern). Wdyt? Could we be missing an edge case? thx!
@cirocosta Hmm, my only concern is that this would mean a bunch of writes constantly hitting the database. Instead of marking every container as seen on each report, could we only record when a container goes missing (i.e. set a missing_since timestamp when it isn't reported)?
Hey @vito,

We compared how `last_seen` and `missing_since` would play out, and it seemed to us they'd need roughly the same writes. For instance:

Using `last_seen`:

```
// at container reporting time, mark the
// containers as seen (single update query over
// all of them - not an actual loop)
for _, container := range containers:
  mark_as_seen(container)
```

Using `missing_since`:

```
// at container reporting time
containers_on_worker = find_worker_containers(worker_name)
containers_to_mark = diff(containers_on_worker, containers_reported)

// needed for taking care of the case described above
for _, container := range containers_reported:
  remove_missing_since(container)

for _, container := range containers_to_mark:
  mark_as_missing_since_if_not_marked_yet(container)
```

It seems to us that in the end, using `missing_since` would mean the same number of writes. Wdyt? thx!

By the way, at GC time:

Using `last_seen`:

```
containers_to_delete = any_container_with_last_seen_gt_ttl()
for _, container := range containers_to_delete:
  delete(container)
```

Using `missing_since`:

```
containers_to_delete = any_container_with_missing_since_gt_ttl()
for _, container := range containers_to_delete:
  delete(container)
```
@cirocosta I don't think it'll end up being the same number of writes. You should only have to clear out `missing_since` on the containers that actually have it set:

```sql
UPDATE containers
SET missing_since = NULL
WHERE missing_since IS NOT NULL
AND handle IN (...)
```

We've had similar cases in the past where making the write conditional leads to a significant performance improvement. One example: 8aa4aca

So with `missing_since`, the common case (nothing missing) shouldn't have to write anything at all.
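To make the conditional-write idea concrete, here's a minimal sketch of what the two report-time queries could look like, assuming Postgres via database/sql and lib/pq; the function name and the worker_name column are illustrative assumptions, not Concourse's actual code:

```go
package gcsketch

import (
	"database/sql"

	"github.com/lib/pq"
)

// reconcileContainers is a hypothetical sketch: it clears missing_since for
// handles the worker reported (only where it is actually set, so the steady
// state writes nothing) and stamps missing_since on the worker's remaining
// containers that aren't marked yet.
func reconcileContainers(db *sql.DB, workerName string, reported []string) error {
	// Clear marks on containers the worker still has; the IS NOT NULL
	// guard is what keeps this a no-op most of the time.
	if _, err := db.Exec(`
		UPDATE containers
		SET missing_since = NULL
		WHERE worker_name = $1
		  AND missing_since IS NOT NULL
		  AND handle = ANY($2)`,
		workerName, pq.Array(reported)); err != nil {
		return err
	}

	// Stamp containers the worker didn't report, without overwriting an
	// existing mark (so the grace-period clock isn't reset).
	_, err := db.Exec(`
		UPDATE containers
		SET missing_since = now()
		WHERE worker_name = $1
		  AND missing_since IS NULL
		  AND handle <> ALL($2)`,
		workerName, pq.Array(reported))
	return err
}
```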
Oooh got it! That makes sense, didn't think about that. Thanks!
Another data point to consider: we just observed an issue where a job had a step that was in progress for ~24 hours without any output or updates. Questions to ponder:
If the build is still running, then whichever ATC is tracking the build will eventually go to use the step's container again. If the DB record for the step's container goes away, a new "creating" container should be created in the DB and the ATC will find a worker to create it on. Looking at the code, this should actually recover in any case, so maybe this flake is worth a separate issue. I'll write one up.
Update: @cirocosta and I discovered an edge case with this where containers which have been in the creating state for some time could be GC'd right after they transition to created.

The fix for this is to only set the missing_since timestamp on containers that have reached the created state, so that time spent in creating doesn't count against the grace period.
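Under the same assumptions (and imports) as the earlier sketch, the fix could look like restricting the stamping query to created containers; the state column and its 'created' value are assumptions here:

```go
// markMissingCreated is a hypothetical variant of the stamping query above:
// only containers already in the "created" state accrue missing_since, so a
// container that spent a long time in "creating" can't be GC'd the moment
// it transitions.
func markMissingCreated(db *sql.DB, workerName string, reported []string) error {
	_, err := db.Exec(`
		UPDATE containers
		SET missing_since = now()
		WHERE worker_name = $1
		  AND state = 'created'
		  AND missing_since IS NULL
		  AND handle <> ALL($2)`,
		workerName, pq.Array(reported))
	return err
}
```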
Another bit of acceptance criteria: let's make sure we don't accidentally nuke all containers/volumes if a worker stalls for longer than the grace period. We shouldn't consider these "missing", since the whole worker is having issues. We should also be careful to not immediately remove them when the worker comes back.
We have seen several issues recently with jobs failing due to "volume-not-found" after switching to ephemeral workers. We're not certain about the root cause yet, but it sounds like this patch should help? Hopefully the GC/volume workflow isn't any different for ephemeral workers... so, just something to keep in mind.
@vito We verified the worker-stalling case and everything worked as expected! We're only marking things as missing when the worker is successfully heartbeating, so it conveniently "just works"™
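As a rough illustration of why the stalled case falls out naturally, assuming the reconciliation only runs inside the worker's heartbeat handler (names hypothetical, building on the earlier sketch):

```go
// handleHeartbeat is illustrative: because the container report only arrives
// as part of a successful heartbeat, a stalled worker never triggers the
// diff, so its containers are never marked missing while it's away.
func handleHeartbeat(db *sql.DB, workerName string, reportedHandles []string) error {
	// ...bump the worker's TTL, record its state, etc...
	return reconcileContainers(db, workerName, reportedHandles)
}
```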
When a worker reports a list of existing containers/volumes back to the ATCs, we should diff that set against what the DB thinks should be there and remove the ones the worker didn't report having.
This will help with some of the issues we see when a worker goes away and comes back with different state. (A rough sketch of the diff follows.)
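The diff itself is just set subtraction over handles; a minimal, self-contained sketch with made-up names:

```go
// unreportedHandles returns the handles the DB expects on the worker that
// the worker did not report having; these are the records to mark missing.
func unreportedHandles(expectedInDB, reported []string) []string {
	have := make(map[string]struct{}, len(reported))
	for _, h := range reported {
		have[h] = struct{}{}
	}
	var missing []string
	for _, h := range expectedInDB {
		if _, ok := have[h]; !ok {
			missing = append(missing, h)
		}
	}
	return missing
}
```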
Gotcha (potential refactoring / documentation): gc.Destroyer's methods take in a list of reported handles, not the list of handles to destroy :)
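In other words (a hypothetical signature, not the actual gc package API), the surprise is that the parameter is the keep-list, not the kill-list:

```go
// Destroyer is illustrative only. Note the parameter: the handles the worker
// REPORTED having. Everything else on that worker is what gets destroyed.
type Destroyer interface {
	DestroyContainers(workerName string, reportedHandles []string) error
	DestroyVolumes(workerName string, reportedHandles []string) error
}
```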