This then goes through to (worker.Pool).FindOrCreateContainer, which first does a (worker.WorkerProvider).FindWorkerForContainerByOwner. This will first find the container owner (in this case via (resourceConfigCheckSessionContainerOwner).Find which will search across all workers). But as part of finding the worker, it tries to find a container.
This is dangerous, because if we somehow have the owner already but don't have any container, we'll go down the "create" code path, which will try to create another owner, and the worker_resource_config_check_sessions table doesn't have a UNIQUE constraint.
Say the first time this runs, there's no container. We'll then pick a worker and then call (worker.Worker).FindOrCreateContainer.
And here's where things break: if we fail to create the real container, for whatever reason, its database record will instead transition to the FAILED state, and then be removed by the garbage collector. This will leave us with a WRCCS owner without a corresponding container.
Ok, so what happens now that we have an owner without any container?
Well, we'll find the owner. But then we won't find a container corresponding to it (point 3). So we'll create another owner, and thus another check container. As long as the resource config check session is active, this will result in duplicate WRCCS rows, each corresponding to a duplicate check container.
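To make the failure mode concrete, here is a minimal in-memory sketch of the flow described above. It is not the ATC code: the db, owner, and container types and the check method are hypothetical stand-ins, and it assumes that the owner lookup keeps returning the oldest matching row once duplicates exist.

```go
package main

import "fmt"

// owner models one worker_resource_config_check_sessions (WRCCS) row.
type owner struct{ id int }

// container models a check container row that points at its owner.
type container struct{ ownerID int }

// db is a hypothetical in-memory stand-in for the two tables.
type db struct {
	nextID     int
	owners     []owner
	containers []container
}

// findOwner returns the first owner row. Assumption: with duplicates
// present, the lookup keeps returning the oldest (orphaned) row.
func (d *db) findOwner() (owner, bool) {
	if len(d.owners) == 0 {
		return owner{}, false
	}
	return d.owners[0], true
}

// hasContainer reports whether any container row references the owner.
func (d *db) hasContainer(o owner) bool {
	for _, c := range d.containers {
		if c.ownerID == o.id {
			return true
		}
	}
	return false
}

// check mimics one find-or-create pass. If no container is found for the
// owner, it takes the "create" path, which always inserts a new owner row
// because nothing enforces uniqueness.
func (d *db) check(realCreateFails bool) {
	if o, ok := d.findOwner(); ok && d.hasContainer(o) {
		return // existing container is reused
	}
	d.nextID++
	id := d.nextID
	d.owners = append(d.owners, owner{id: id})
	if realCreateFails {
		// The container goes to FAILED and is garbage collected,
		// but the owner row stays behind.
		return
	}
	d.containers = append(d.containers, container{ownerID: id})
}

func main() {
	d := &db{}
	d.check(true) // first attempt: real container creation fails
	for i := 0; i < 3; i++ {
		d.check(false) // later checks succeed but never find a container
	}
	fmt.Println(len(d.owners), len(d.containers)) // prints "4 3"
}
```

In this model a single failed creation is enough: every later pass finds the orphaned owner, misses its container, and adds one more owner and one more container.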
@wagdav That issue makes no mention of container count going up. You sure it's related? That was just about the only observable metric for us. CPU and everything else was fine.
#2454
Submodule src/github.com/concourse/atc 4b1839f..dad9ab3:
> Merge pull request #294 from concourse/upsert
> Add a failing test for duplicated WRCCS
Signed-off-by: Mark Huang <mhuang@pivotal.io>
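The merged branch is named upsert, but the diff itself isn't shown here, so the following is only a sketch of the general idea that name suggests, not the actual change. It assumes a WRCCS row is identified by a (worker, session) pair; ownerKey, ownerTable, and upsert are hypothetical names, and the map stands in for a table with a uniqueness rule on that pair.

```go
package main

import "fmt"

// ownerKey is a hypothetical identity for a WRCCS row: one per
// (worker, resource config check session) pair.
type ownerKey struct {
	worker  string
	session int
}

// ownerTable is an in-memory stand-in for a table that enforces
// uniqueness on ownerKey.
type ownerTable struct {
	nextID int
	ids    map[ownerKey]int
}

func newOwnerTable() *ownerTable {
	return &ownerTable{ids: map[ownerKey]int{}}
}

// upsert returns the existing row ID for key and inserts a new row
// only when none exists yet.
func (t *ownerTable) upsert(key ownerKey) int {
	if id, ok := t.ids[key]; ok {
		return id
	}
	t.nextID++
	t.ids[key] = t.nextID
	return t.nextID
}

func main() {
	tbl := newOwnerTable()
	key := ownerKey{worker: "worker-1", session: 42}

	first := tbl.upsert(key)
	// The "create" path runs again after a failed container creation,
	// but the same owner row comes back instead of a duplicate.
	second := tbl.upsert(key)

	fmt.Println(first == second, len(tbl.ids)) // prints "true 1"
}
```

With the owner row reused, a retry after a failed container creation attaches the new container to the existing owner, so the next lookup can find it.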
@vito You're right, I didn't mention this in the referenced issue. However, I remember that when we were diagnosing the problem in #2346, we saw thousands of check containers on the workers but only a handful of 'task' containers. This seemed less important at the time, probably because we were too focused on the CPU consumption of the ATC.
This is pretty complicated to describe.
Here's the scenario but you may want to jump around the code and use this as a guide instead of trying to just read it verbatim:
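1. (ResourceFactory).NewResource is called with a ResourceConfigCheckSessionContainerOwner.
2. This then goes through to (worker.Pool).FindOrCreateContainer, which first does a (worker.WorkerProvider).FindWorkerForContainerByOwner.
3. This will first find the container owner (in this case via (resourceConfigCheckSessionContainerOwner).Find, which will search across all workers). But as part of finding the worker, it tries to find a container. This is dangerous, because if we somehow have the owner already but don't have any container, we'll go down the "create" code path, which will try to create another owner, and the worker_resource_config_check_sessions table doesn't have a UNIQUE constraint.
4. Say the first time this runs, there's no container. We'll then pick a worker and then call (worker.Worker).FindOrCreateContainer.
5. This goes to the worker.ContainerProvider, and it'll then create the container in the database.
6. And here's where things break: if we fail to create the real container, for whatever reason, it will instead transition to the FAILED state, and then be removed by the garbage collector. This will leave us with a WRCCS owner without a corresponding container.
Ok, so what happens now that we have an owner without any container?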
Well, we'll find the owner. But then we won't find a container corresponding to it (point 3). So we'll create another owner, and thus another check container. As long as the resource config check session is active, this will result in duplicate WRCCS rows, each corresponding to a duplicate check container.