runtime : allow creation of ephemeral check containers #3424

Open · topherbullock opened this issue Mar 4, 2019 · 17 comments

@topherbullock commented Mar 4, 2019

Step 1 of the runtime side of #3079 is to add a way to create ephemeral check containers and handle their destruction (along with the destruction of their rootfs volume).

There is probably some work to clean up GC now that it no longer needs to consider resource check containers, but we also need some way to ensure these containers are retained until Garden's TTL expires them.

Gotchas:

GC:

  • Garden won't deal with GC of the rootfs COW volume
  • We could have an image plugin handle deletion of the COW volume when the container is destroyed
  • Ensure that distributed GC's diffing of what's on the worker against the DB doesn't clean up the check containers out from under us (depending on which direction "missing since" looks, and where the source of truth is; probably the DB)

Performance

  • Ensure that creating these containers doesn't add significant overhead
  • Provide a mechanism to measure performance once we implement ephemeral check containers
  • Add some metrics to track the check containers across the cluster of workers, since we won't have anything in the DB (one illustrative approach is sketched below)
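
For the metrics bullet, one purely illustrative way to count ephemeral check containers per worker without any DB rows would be to tag them with a Garden container property at creation time and filter on it later; the property key, address, and function name below are made up:

import (
        "code.cloudfoundry.org/garden"
        "code.cloudfoundry.org/garden/client"
        "code.cloudfoundry.org/garden/client/connection"
)

// countEphemeralCheckContainers counts containers created with a hypothetical
// "concourse:ephemeral-check" property, so the count can be emitted as a
// per-worker metric even though nothing about them is stored in the DB.
func countEphemeralCheckContainers(gardenAddr string) (int, error) {
        gdn := client.New(connection.New("tcp", gardenAddr))

        containers, err := gdn.Containers(garden.Properties{
                "concourse:ephemeral-check": "true", // set on the ContainerSpec at create time
        })
        if err != nil {
                return 0, err
        }

        return len(containers), nil
}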

Peas!

See Garden's ProcessSpec (a rough sketch of running a check as a pea follows this list)

  • Maybe this would address some of the concerns around GC?
  • Investigate how much is shared among the peas, and between the original container and the peas
  • Do all the peas in the "pod" have a TTL?
  • Can we use the image plugin with the peas?
  • How many peas can we create off an initial "pod" container?
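
Purely for illustration (not from the original issue), a rough sketch of what running a check as a pea via the Garden client API might look like, assuming container is an existing garden.Container and the rootfs URI and I/O streams are in scope; the process ID and paths are made up:

// Run the check as a "pea": a process with its own rootfs image that shares
// the lifecycle of an existing "pod" container.
process, err := container.Run(garden.ProcessSpec{
        ID:   "check-my-resource",   // hypothetical process id
        Path: "/opt/resource/check", // resource check script
        // Setting Image is what makes this a pea: the process runs against its
        // own rootfs rather than the pod container's.
        Image: garden.ImageRef{URI: checkRootfsURI},
}, garden.ProcessIO{Stdin: stdin, Stdout: stdout, Stderr: stderr})
if err != nil {
        return err
}
exitStatus, err := process.Wait()

Whether peas get their own TTL and how they interact with the image plugin are exactly the open questions listed above.
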
@topherbullock topherbullock created this issue from a note in Runtime (Backlog) Mar 4, 2019
@vito vito moved this from Backlog to In Flight in Runtime Mar 25, 2019
@ddadlani commented Mar 27, 2019

We are spiking on the ability to create image volumes for ephemeral containers in #3607

@kcmannem kcmannem moved this from In Flight to Backlog in Runtime May 1, 2019
@ddadlani commented May 6, 2019

Blocked on #3607

@ddadlani ddadlani added the blocked label May 6, 2019
@ddadlani ddadlani added this to Icebox in Kubernetes Runtime via automation May 21, 2019
@ddadlani ddadlani moved this from Icebox to Backlog in Kubernetes Runtime May 21, 2019
@ddadlani ddadlani removed this from Backlog in Runtime May 21, 2019
@ddadlani ddadlani removed the blocked label May 30, 2019
@ddadlani commented May 30, 2019

We unblocked this because of the amount of time #3607 is going to take. The reason it was blocked on #3607 was that we needed a way to keep around image volumes and a cache for the git-resource check (git-resource checks currently have an implicit "cache" because the check containers stick around for an hour).

Currently, we fetch images for check containers and keep them around using a db.ForContainer resource_cache_use. For ephemeral containers, there is no db container created, which means that we cannot create a db.ForContainer resource_cache_use for them.

We found that because image volumes fetched for checks become resource_caches, they do not get garbage collected even if there is no resource_cache_use. We can therefore keep image volumes for ephemeral check containers around by simply removing the resource_cache_use.

We would also need the concept of "check caches", similar to resource caches, but created for a check container, to codify the implicit "caching" that the git-resource does. As an interim solution, the checks work without this cache, i.e. a git-resource check would clone the repo every time.

@ddadlani commented May 30, 2019

The way this currently works is that we created a garden_plugin for baggageclaim to create COW volumes of existing volumes to use as the rootfs for ephemeral containers. Code for this lives in https://github.com/concourse/baggageclaim/tree/gdn-plugin .

The garden.ContainerSpec used for creating ephemeral containers specifies a GraceTime for the containers, which means that the container (and its associated COW rootfs volume) will self-destruct once GraceTime has elapsed since the last request. We created separate flows for creating ephemeral containers and for fetching images for ephemeral containers. Code for this lives in https://github.com/concourse/concourse/tree/ephemeral_containers.
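
For reference, a minimal sketch (not the actual branch code) of what creating one of these ephemeral containers through the Garden client could look like; the address, handle, rootfs URI, and grace time are illustrative:

import (
        "time"

        "code.cloudfoundry.org/garden"
        "code.cloudfoundry.org/garden/client"
        "code.cloudfoundry.org/garden/client/connection"
)

func createEphemeralCheckContainer(gardenAddr, handle, cowRootfsURI string) (garden.Container, error) {
        gdn := client.New(connection.New("tcp", gardenAddr))

        return gdn.Create(garden.ContainerSpec{
                Handle:     handle,
                RootFSPath: cowRootfsURI, // COW volume created via the baggageclaim garden plugin
                // GraceTime: Garden destroys the container (and its associated
                // COW rootfs volume) this long after the last request that
                // touched it.
                GraceTime: 1 * time.Hour, // illustrative value
        })
}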

@ddadlani ddadlani moved this from Backlog to In Flight in Kubernetes Runtime May 30, 2019
@ddadlani ddadlani added this to Icebox in Runtime via automation May 30, 2019
@ddadlani ddadlani removed this from In Flight in Kubernetes Runtime May 30, 2019
@ddadlani ddadlani moved this from Icebox to In Flight in Runtime May 30, 2019
@cirocosta commented Jun 3, 2019

Hey,

I remember from the early discussions around this that we weren't sure how costly
container creation is, so I took a deeper look at how long each step involved in
creating a "guardian container" takes, to get some insight into how that whole
process looks. Here's some data from that.

steps involved

The main part in cloudfoundry/guardian where we can see the container creation
steps is gardener/gardener.go.

There we found the three main routines associated with the process of setting up
a container:

volume creation

runtimeSpec, err := g.Volumizer.Create(log, containerSpec)
if err != nil {
        return nil, err
}

(see gardener.go:220)

container creation

desiredSpec := spec.DesiredContainerSpec{
        Handle:     containerSpec.Handle,
        Hostname:   containerSpec.Handle,
        Privileged: containerSpec.Privileged,
        Env:        containerSpec.Env,
        BindMounts: append(containerSpec.BindMounts, networkBindMounts...),
        Limits:     containerSpec.Limits,
        BaseConfig: runtimeSpec,
}


if err := g.Containerizer.Create(log, desiredSpec); err != nil {
        return nil, err
}

(see gardener.go:230)

network set up

if err = g.Networker.Network(log, containerSpec, actualSpec.Pid); err != nil {
        return nil, err
}

(see gardener.go:255)

local tracing

As a quick way of checking how those times look, I started with pprof,
injecting a few trace points into the code, wrapping the methods mentioned above.

See https://gist.github.com/cirocosta/7e53cde47405f422ef66c8c2ff563630 for the patch.
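
The actual patch is in the gist above; as a rough sketch of the general shape (region names are illustrative, and runtimeSpec/err are assumed to be declared outside the closures), the wrapping amounts to putting each of the three calls inside a runtime/trace region within gardener's Create:

// requires: import "runtime/trace" and a context.Context in scope
trace.WithRegion(ctx, "volumizer-create", func() {
        runtimeSpec, err = g.Volumizer.Create(log, containerSpec)
})

trace.WithRegion(ctx, "containerizer-create", func() {
        err = g.Containerizer.Create(log, desiredSpec)
})

trace.WithRegion(ctx, "networker-network", func() {
        err = g.Networker.Network(log, containerSpec, actualSpec.Pid)
})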

Here's how it looks:

As we can see, almost all of the container creation time is split between
setting up networking and the actual containerization, which is what we'd expect,
as Concourse uses gdn with --no-image-plugin (Concourse handles the
"volumization" itself).

Looking at a few more samples, we can see the distribution over a few runs:

tracing in hush-house

To have the same tracing as the local experiment above, but running continuously in
hush-house, the approach changes a bit: instead of using pprof and hitting
its /trace endpoint to capture traces for a while, I used Jaeger,
instrumenting those very same points in the code (see my tracing branch under
cirocosta/guardian).

Interestingly, the results don't change all that much, aside from the fact that
the numbers are higher.

Looking at the last 3 hours in hush-house, here's the breakdown of time to create a container in a worker (volumizer + containerizer + networker):

[screenshot: breakdown of container creation time in hush-house]

Looking at a few traces:

[screenshots: three sample container-creation traces]

I didn't find a way of doing math on top of all of the traces captured, but looking at some traces at random, the time ratio seems consistent with the times reported in the local experiment.

Note: this is an almost "idle" worker.


thanks!

@pivotal-jamie-klassen commented Jun 3, 2019

@cirocosta this is super cool!

@kcmannem commented Jun 3, 2019

@cirocosta my test didn't produce cool graphs and is less scientific, but here's what I found: if you create 250 containers serially it takes around ~0.5s each on a docker-compose environment, and around 4 minutes total if you create them in parallel.

@kcmannem commented Jun 3, 2019

Non-issue, but I thought I'd note it down: the docker-image resource requires a privileged container to run, so we'll have to go through the chown-the-rootfs flow for every check container. This would be a problem if the filesystem had lots of files, but it's only a few.

@ddadlani commented Jun 3, 2019

@cirocosta this is pretty cool. I would be interested to repeat this with the ephemeral check containers code merged in, because for those, Garden needs to initiate COW volume creation (for the container image).

@kcmannem Did we collect any stats on a good number of containers to create in parallel? Of course it would depend on the available resources on a worker, but it might be useful information for container scheduling in the future.

@kcmannem commented Jun 3, 2019

@ddadlani nothing more concrete than a basic for loop hitting the Garden API in goroutines. That test led to creating all 250 containers in around 4 minutes. Parallel load is hard to work around because there is a global iptables lock to make sure each network is mutually exclusive. Like you mentioned, we just have to figure out where to draw the line so that the lock contention doesn't outweigh the parallel creates.
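
Roughly the shape of that loop, as a sketch rather than the actual test code; the address, handle prefix, and container count are illustrative:

import (
        "fmt"
        "log"
        "sync"

        "code.cloudfoundry.org/garden"
        "code.cloudfoundry.org/garden/client"
        "code.cloudfoundry.org/garden/client/connection"
)

// createContainersInParallel fires n Create requests at a Garden server
// concurrently, to see how creation time behaves under parallel load.
func createContainersInParallel(addr string, n int) {
        gdn := client.New(connection.New("tcp", addr))

        var wg sync.WaitGroup
        for i := 0; i < n; i++ {
                wg.Add(1)
                go func(i int) {
                        defer wg.Done()
                        if _, err := gdn.Create(garden.ContainerSpec{
                                Handle: fmt.Sprintf("bench-%d", i),
                        }); err != nil {
                                log.Printf("create %d failed: %v", i, err)
                        }
                }(i)
        }
        wg.Wait()
}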

cirocosta added a commit to concourse/hush-house that referenced this issue Jun 5, 2019
For the purpose of getting a better sense of how long certain operations
take in `gdn` (see [results][results]), some trace points were added to
`cloudfoundry/garden`. To collect those traces, [`jaeger`][jaeger] was
introduced.

[results]: concourse/concourse#3424 (comment)
[jaeger]: https://www.jaegertracing.io/

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
@vito commented Jun 5, 2019

Could we mitigate the network create/delete overhead by placing all check containers in the same network? 🤔

https://github.com/cloudfoundry/garden/blob/e9c503bfdeedec3e591f46a85e3cd0e7b1ebecc1/client.go#L132

(Would that be a security concern? 🤔 )
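
A purely illustrative sketch of that idea, assuming ContainerSpec's Network field can be given the same subnet for every check container (whether that actually shares the bridge/iptables setup, and how safe it would be, needs verifying):

// Hypothetical: place every check container in one shared subnet so gdn
// doesn't have to set up a separate network per container.
const sharedCheckSubnet = "10.254.100.0/24" // made-up subnet

container, err := gdn.Create(garden.ContainerSpec{
        Handle:    handle,
        Network:   sharedCheckSubnet,
        GraceTime: graceTime,
})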

@kcmannem commented Jun 5, 2019

Might be worthwhile to ship this alongside global-resources defaulted to true so we reduce the container creation load.

@ddadlani commented Jun 10, 2019

@cirocosta to add analysis of the performance impact of extra cgroups on the VM.

@ddadlani commented Jun 17, 2019

pending performance testing + check cache implementation

@kcmannem commented Jun 17, 2019

and tests in general

@ddadlani ddadlani added the paused label Jun 26, 2019
@ddadlani commented Jun 26, 2019

This is paused for now, waiting on a chance to pair with Core on creating check caches. Some resources (e.g. git resource) have "implicit" caches which persist for the entire resource config check session. However, with the introduction of ephemeral containers, check sessions will go away so we need to create explicit check caches for each check container to use.

A lot of work in Core (e.g. #3788) will affect the implementation of check caches, so this will wait until that is completed.

@ddadlani commented Jul 15, 2019

If we switch over to containerd (#3891), the implementation of this will change. Just a note for any future changes.

@ddadlani ddadlani moved this from In Flight to Backlog in Runtime Jul 23, 2019
@ddadlani ddadlani added this to Icebox in Ephemeral Check Containers via automation Jul 30, 2019
@ddadlani ddadlani removed this from Backlog in Runtime Jul 30, 2019
@ddadlani ddadlani moved this from Icebox to Backlog in Ephemeral Check Containers Jul 30, 2019