Failed resource check container creates can lead to snowballing of check containers #2454

vito · 2018-08-01T18:09:47Z

This is pretty complicated to describe.

Here's the scenario but you may want to jump around the code and use this as a guide instead of trying to just read it verbatim:

1. Radar scans for a given resource on an interval. It creates a Resource Config Check Session (RCCS).
2. Radar calls (ResourceFactory).NewResource with a ResourceConfigCheckSessionContainerOwner.
3. This then goes through to (worker.Pool).FindOrCreateContainer, which first calls (worker.WorkerProvider).FindWorkerForContainerByOwner. That first finds the container owner (in this case via (resourceConfigCheckSessionContainerOwner).Find, which searches across all workers). But as part of finding the worker, it also tries to find a container.

   This is dangerous, because if we somehow have the owner already but don't have any container, we'll go down the "create" code path, which will try to create another owner, and the worker_resource_config_check_sessions table doesn't have a UNIQUE constraint.

4. Say the first time this runs, there's no container. We'll then pick a worker and call (worker.Worker).FindOrCreateContainer.
5. This passes through to the worker.ContainerProvider, which will then create the container in the database.
6. This creates the owner and the container in the same transaction, which is what you want: as we saw in point 3, if you end up with an owner with no container, you'll end up with dupes.
7. And here's where things break: if we fail to create the real container, for whatever reason, it will instead transition to the FAILED state and then be removed by the garbage collector. This leaves us with a WRCCS owner without a corresponding container.
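To make the failure mode above concrete, here is a minimal in-memory sketch of how a failed create followed by garbage collection can strand an owner row. The `db`, `owner`, and `container` types and every method name are hypothetical stand-ins for illustration; they are not Concourse's actual schema or API.

```go
package main

import "fmt"

// Hypothetical stand-ins for the owner and container tables.
type owner struct {
	id        int
	worker    string
	sessionID int
}

type container struct {
	id      int
	ownerID int
	state   string // "creating" or "failed" in this sketch
}

type db struct {
	owners     []owner
	containers []container
	nextID     int
}

// createOwnerAndContainer inserts both rows together, mimicking the
// single transaction described above.
func (d *db) createOwnerAndContainer(worker string, sessionID int) (ownerID, containerID int) {
	d.nextID++
	ownerID = d.nextID
	d.owners = append(d.owners, owner{id: ownerID, worker: worker, sessionID: sessionID})
	d.nextID++
	containerID = d.nextID
	d.containers = append(d.containers, container{id: containerID, ownerID: ownerID, state: "creating"})
	return ownerID, containerID
}

// markFailed records that the real container could not be created.
func (d *db) markFailed(containerID int) {
	for i := range d.containers {
		if d.containers[i].id == containerID {
			d.containers[i].state = "failed"
		}
	}
}

// collectFailed mimics the garbage collector in this sketch: it removes
// failed containers and leaves the owner rows untouched.
func (d *db) collectFailed() {
	kept := d.containers[:0]
	for _, c := range d.containers {
		if c.state != "failed" {
			kept = append(kept, c)
		}
	}
	d.containers = kept
}

func main() {
	d := &db{}
	_, cid := d.createOwnerAndContainer("worker-1", 42)
	d.markFailed(cid)
	d.collectFailed()
	// The owner row survives, but its container is gone.
	fmt.Printf("owners=%d containers=%d\n", len(d.owners), len(d.containers))
}
```

Under these assumptions the program prints `owners=1 containers=0`, which is the orphaned-owner state the rest of the issue builds on.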

Ok, so what happens now that we have an owner without any container?

Well, we'll find the owner. But then we won't find a container corresponding to it (point 3). So we'll create another owner, and thus another check container. As long as the resource config check session is active, this will result in duplicate WRCCS rows, each corresponding to a duplicate check container.
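The snowballing can be simulated with a small sketch. It assumes the owner lookup returns the first matching row (so it keeps landing on the orphan), and it contrasts that with a store where the owner insert behaves like an upsert on a unique (worker, session) key. The linked pull request's title mentions `ON CONFLICT` upserts, but this is only an illustration of the idea, not the actual patch; the `store` type and all names here are hypothetical.

```go
package main

import "fmt"

// key is a hypothetical identity for an owner row: one per worker and session.
type key struct {
	worker    string
	sessionID int
}

type store struct {
	unique     bool        // true models a UNIQUE constraint plus upsert on key
	owners     []key       // owner rows; the slice index serves as the row ID
	containers map[int]int // owner row ID -> number of containers
}

// findOwner returns the ID of the first owner row matching k, or -1.
func (s *store) findOwner(k key) int {
	for id, o := range s.owners {
		if o == k {
			return id
		}
	}
	return -1
}

// insertOwner adds an owner row. With unique set, it returns the
// existing row instead of adding a duplicate.
func (s *store) insertOwner(k key) int {
	if s.unique {
		if id := s.findOwner(k); id >= 0 {
			return id
		}
	}
	s.owners = append(s.owners, k)
	return len(s.owners) - 1
}

// check models one scan: reuse the container of the owner we find,
// otherwise fall through to the "create" path.
func (s *store) check(k key, createFails bool) {
	if id := s.findOwner(k); id >= 0 && s.containers[id] > 0 {
		return
	}
	id := s.insertOwner(k)
	if !createFails {
		s.containers[id]++
	}
}

// simulate runs one failed create followed by a number of successful scans.
func simulate(unique bool, ticks int) (owners, containers int) {
	s := &store{unique: unique, containers: map[int]int{}}
	k := key{worker: "worker-1", sessionID: 42}
	s.check(k, true) // the first create fails, leaving an orphaned owner
	for i := 0; i < ticks; i++ {
		s.check(k, false)
	}
	for _, n := range s.containers {
		containers += n
	}
	return len(s.owners), containers
}

func main() {
	o, c := simulate(false, 5)
	fmt.Printf("no unique key: owners=%d containers=%d\n", o, c)
	o, c = simulate(true, 5)
	fmt.Printf("with upsert:   owners=%d containers=%d\n", o, c)
}
```

With five scans after the failed create, the first case ends with 6 owners and 5 containers, while the upsert case settles at 1 owner and 1 container.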

wagdav · 2018-08-02T09:38:57Z

@vito this behavior closely matches what we saw in #2346. It was described there with less technical insight.

vito · 2018-08-02T12:22:04Z

@wagdav That issue makes no mention of container count going up. You sure it's related? That was just about the only observable metric for us. CPU and everything else was fine.

wagdav · 2018-08-02T16:27:48Z

@vito You're right, I didn't mention this in the referenced issue. However, I remember that when we were diagnosing #2346 we saw thousands of check containers on the workers and only a handful of 'task' containers. This seemed less important at the time, probably because we were too focused on the CPU consumption of the ATC.

vito added the bug label Aug 1, 2018

This was referenced Aug 1, 2018

use ON CONFLICT for upserts in more areas vmware-archive/atc#294

Merged

Jobs getting stuck at preparing build #2010

Closed

mhuangpivotal pushed a commit to vmware-archive/atc that referenced this issue Aug 2, 2018

Add a failing test for duplicated WRCCS

986dbda

concourse/concourse#2454 Signed-off-by: Mark Huang <mhuang@pivotal.io>

mhuangpivotal pushed a commit that referenced this issue Aug 2, 2018

bump atc

70e187d

#2454 Submodule src/github.com/concourse/atc 4b1839f..dad9ab3: > Merge pull request #294 from concourse/upsert > Add a failing test for duplicated WRCCS Signed-off-by: Mark Huang <mhuang@pivotal.io>

vito added the accepted label Aug 21, 2018

vito closed this as completed Aug 21, 2018

vito modified the milestone: v4.1.0 Aug 21, 2018
