New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed resource check container creates can lead to snowballing of check containers #2454

Closed
vito opened this Issue Aug 1, 2018 · 3 comments

Comments

Projects
2 participants
@vito
Member

vito commented Aug 1, 2018

This is pretty complicated to describe.

Here's the scenario but you may want to jump around the code and use this as a guide instead of trying to just read it verbatim:

  1. Radar scans for a given resource on an interval. It creates a Resource Config Check Session (RCCS).
  2. Radar calls (ResourceFactory).NewResource with a ResourceConfigCheckSessionContainerOwner.
  3. This then goes through to (worker.Pool).FindOrCreateContainer, which first does a (worker.WorkerProvider).FindWorkerForContainerByOwner. This will first find the container owner (in this case via (resourceConfigCheckSessionContainerOwner).Find which will search across all workers). But as part of finding the worker, it tries to find a container.
  • This is dangerous, because if we somehow have the owner already but don't have any container, we'll go down the "create" code path which will try to create another owner, and the worker_resource_config_check_sessions don't have a UNIQUE constraint.
  1. Say the first time this runs, there's no container. We'll then pick a worker and then call (worker.Worker).FindOrCreateContainer.
  2. This passes through to the worker.ContainerProvider, and it'll then create the container in the database.
  3. This will then create the owner and the container in the same transaction, which is what you want, as we saw in point 3 if you end up with an owner with no container, you'll end up with dupes.
  4. And here's things break: if we fail to create the real container, for whatever reason, it will instead transition to FAILED state, and then be removed by the garbage collector. This will leave us with a WRCCS owner without a corresponding container.

Ok, so what happens now that we have an owner without any container?

Well, we'll find the owner. But then we won't find a container corresponding to it (point 3). So we'll create another owner, and thus another check container. As long as the resource config check session is active, this will result in duplicate WRCCS rows, each corresponding to a duplicate check container.

@vito vito added the bug label Aug 1, 2018

@vito vito added this to Icebox in Core via automation Aug 1, 2018

@vito vito moved this from Icebox to Backlog in Core Aug 1, 2018

@mhuangpivotal mhuangpivotal moved this from Backlog to In Flight in Core Aug 1, 2018

@wagdav

This comment has been minimized.

wagdav commented Aug 2, 2018

@vito this behavior is very much what we saw related to #2346 . It was described there with less technical insight.

@vito

This comment has been minimized.

Member

vito commented Aug 2, 2018

@wagdav That issue makes no mention of container count going up. You sure it's related? That was just about the only observable metric for us. CPU and everything else was fine.

mhuangpivotal added a commit to concourse/atc that referenced this issue Aug 2, 2018

Add a failing test for duplicated WRCCS
concourse/concourse#2454

Signed-off-by: Mark Huang <mhuang@pivotal.io>

mhuangpivotal added a commit that referenced this issue Aug 2, 2018

bump atc
#2454

Submodule src/github.com/concourse/atc 4b1839f..dad9ab3:
  > Merge pull request #294 from concourse/upsert
  > Add a failing test for duplicated WRCCS

Signed-off-by: Mark Huang <mhuang@pivotal.io>

@mhuangpivotal mhuangpivotal moved this from In Flight to Done in Core Aug 2, 2018

@wagdav

This comment has been minimized.

wagdav commented Aug 2, 2018

@vito You're right I didn't mention this in the referenced issue. However, I remember when we were diagnosing the problem of #2346 we saw thousands of check containers on the workers while only a handful of 'task' containers. This seemed to be less important, probably because we were too much focused on the CPU consumption of the ATC back then.

@vito vito added the accepted label Aug 21, 2018

@vito vito closed this Aug 21, 2018

@vito vito modified the milestone: v4.1.0 Aug 21, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment