
Caching directories between runs of a task #230

Closed
vito opened this issue Dec 12, 2015 · 40 comments

@vito
Member

commented Dec 12, 2015

Common use case: fetching dependencies, syncing BOSH blobs, etc.

Caching the directories that these fetch into and update would dramatically speed up a bunch of builds. Our own ATC JS build spends 99% of its time just downloading npm packages.

Proposal:

Add a cache field to task configs, like so:

---
platform: linux

inputs:
- name: my-release

cache:
- path: my-release/.blobs

run:
  path: my-release/ci/scripts/create-release

Then, given I have a pipeline like so:

jobs:
- name: make-release
  plan:
  - get: my-release
  - task: create-release
    file: my-release/ci/create-release.yml

This would cache the directory my-release/.blobs between runs of that specific task in its job's build plan. So, the cache lookup key would be something like team-id+pipeline-id+job-name+task-name.

Notes:

  • This should also have the guarantee that two concurrent builds of the same job do not pollute each other's caches. There should be some sort of copy-on-write semantics, such that each build gets its own copy of the cache (initially empty, the first time around), and at the end all other caches are marked "stale" and are set to expire.
  • Assumes tools will tolerate the cached directory being present but empty on the first (cold-cache) run. I think this should be fine; without this it would be very annoying to orchestrate.
  • The caching is for purely ephemeral data, so it doesn't sacrifice Concourse's "not being a source of truth" principle.
  • Has the same cache warming semantics as gets, i.e. it may take a bit for the cache to warm across the workers; it does not influence worker placement.
  • Has no effect on one-off builds, as there is not enough information to scope/correlate the caches (compared to a job build).
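Applied to the npm use case mentioned above, the proposed field would look something like this (a sketch only; the input and script paths here are made up for illustration, and the exact field name may still change):

---
platform: linux

inputs:
- name: atc

cache:
- path: atc/web/node_modules  # persists between runs of this task within this job

run:
  path: atc/ci/scripts/build-js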
@concourse-bot

commented Dec 12, 2015

Hi there!

We use Pivotal Tracker to provide visibility into what our team is working on. A story for this issue has been automatically created.

The current status is as follows:

  • #110017520 Caching directories between runs of a task

This comment, as well as the labels on the issue, will be automatically updated as the status in Tracker changes.

@xoebus

Contributor

commented Dec 12, 2015

👍

It's an important feature but I can see it getting abused. Maybe we need a best-practices document that sets some guidelines for having a sane pipeline, e.g. don't publish a different artefact under the same version number, otherwise nothing will work properly.

Maybe we could have a random chance of running a task without a cache to make sure people aren't depending on it. 👀

@concourse-bot concourse-bot added scheduled and removed unscheduled labels Dec 12, 2015

@nicorikken

commented Dec 14, 2015

I'd certainly like to see this come about. Our use-case includes downloading NPM dependencies as well, but also downloading Nix-packages. Copy-on-write would suit that use-case.

Sharing this cache even among pipelines would make sense efficiency-wise, although that would introduce (semantic) dependencies between pipelines.

@vito

Member Author

commented Dec 14, 2015

@nicorikken We'll probably not start with cross-pipeline or even cross-job caching as that opens another can of worms: how to configure which caches apply to which tasks/etc. I could see that becoming a nightmare pretty quickly if you e.g. copy a job from another pipeline that has a cache name the same as yours because it was called something too generic.

So we'll probably just stick with as tightly-scoped as possible (individual task within job).

@nicorikken

commented Dec 17, 2015

@vito Totally agree, makes most sense. There are other 'hacks' for getting around such issues, and worst-case somebody would have to write a specific resource targeting the local disk.

@jchester

commented Dec 19, 2015

👍 from me, both privately (I have a pipeline pulling umpteen zillion jars) and professionally (pivnet is a hungry gem consumer).

@devcurmudgeon

Contributor

commented Jan 5, 2016

Not sure I understand this discussion properly, but I think just supporting the possibility of having 'cache' directories untouched between jobs would be very useful on its own. Various tools have their own cache lookup processes (e.g. bitbake).

@tomwhoiscontrary

commented Jul 19, 2016

👍 we've got microservices built with Gradle, so every one of our 27 tasks starts by downloading Gradle, then all our dependencies, and it's horrendous. Cache plox.

As a stop-gap, could we write some sort of pre-loading resource? It would somehow download the Gradle binary and some set of dependencies, then we would copy that into place before kicking off the build. That would let us exploit Concourse's local caching of resources, right?
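Something along these lines could express that idea (purely a sketch: the gradle-deps-cache resource type does not exist and would have to be written, and all names here are illustrative):

resources:
- name: gradle-deps          # hypothetical pre-loading resource
  type: gradle-deps-cache    # would need to be written; not a real resource type
  source:
    uri: https://github.com/example/our-service.git

jobs:
- name: build
  plan:
  - get: source
  - get: gradle-deps
  - task: build
    file: source/ci/build.yml
    # build.yml would copy gradle-deps into ~/.gradle before running ./gradlew build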

@simonvanderveldt

commented Jul 21, 2016

> @nicorikken We'll probably not start with cross-pipeline or even cross-job caching as that opens another can of worms: how to configure which caches apply to which tasks/etc. I could see that becoming a nightmare pretty quickly if you e.g. copy a job from another pipeline that has a cache name the same as yours because it was called something too generic.
>
> So we'll probably just stick with as tightly-scoped as possible (individual task within job).

@vito While I do understand it will be hard to make sure people don't use it for the wrong reasons, we see a serious speedup with shared caches between pipelines (and by extension also between jobs).
But the caches we share in these cases are the "normal" per-user package manager caches, so for example for npm this is ~/.npm/ and for sbt this is ~/.ivy2/.

Would it be possible to support this somehow?
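A minimal sketch of how a per-user cache directory could be redirected into a task-scoped cache under the proposal (the input name and commands are illustrative, not from any real pipeline):

---
platform: linux

inputs:
- name: source

cache:
- path: npm-cache

run:
  path: sh
  args:
  - -ec
  - |
    # point npm at the task-scoped cache instead of ~/.npm
    export npm_config_cache="$PWD/npm-cache"
    cd source
    npm install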

> @vito Totally agree, makes most sense. There are other 'hacks' for getting around such issues, and worst-case somebody would have to write a specific resource targeting the local disk.

@nicorikken Can you share those other "hacks"?

@seadowg

Contributor

commented Aug 31, 2016

As a wee push: not having this feature is still causing major problems with Concourse at some enterprise companies due to terrible connections (usually through a VPN). I've seen this drop down the backlog a couple of times, so I just thought I'd add my feelings from the world of self-important anecdotal evidence.

@tomwhoiscontrary

commented Aug 31, 2016

@seadowg Oh, I really should mail out about this: we made a Gradle-specific caching thingy on our project. Because Docker Hub automatic builds need public repos, I had to make it public, so I guess you can read it?

https://github.com/projectfalcon/gradle-cache-resource

You might be able to do something similar. The example pipeline builds Spring with it.

The way it works is that you still need to download all the dependencies in one go, but only once; after that they're cached until one of them changes, at which point you have to download them all again.

Also, I've been off sick, so by now Thorbs may have replaced this with some rival saucer science.

@nicorikken

commented Sep 16, 2016

@simonvanderveldt The hacks were a while ago, and I hope Concourse has improved on many levels since then. We tried sharing the local directories, but that didn't go so well (I don't remember the details). We eventually ended up keeping Docker images on Docker Hub for multiple stages of the pipeline: [Base build dependency image] -> (fetch latest dependencies) -> [Up-to-date build dependency image] -> (build) -> [finished build] -> (test) -> [built and tested image]
So if I'm not mistaken we were keeping about 5 images on Docker Hub per pipeline. 😞 Eventually we moved back to Jenkins. I still like the concepts of Concourse, but until this resource-sharing feature comes along, I guess we won't be switching back.

@poida

commented Oct 18, 2016

We also need to be able to cache folders (the NPM modules folder) between jobs.
Currently Concourse runs 2-3 times slower than a similar pipeline on Jenkins, just because of npm install.
Any ideas or hacks to speed it up?

@chipx86

commented Oct 19, 2016

For ours, we're using a fork of concourse-rsync-resource (https://github.com/beanbaginc/concourse-rsync-resource) and define a resource for each type of build we need to share. This looks like:

- name: my-project-env
  type: rsync-resource
  check_every: 10m
  source:
      server: our-builds.example.com
      base_dir: /path/to/unique/build/dest
      user: build-username
      private_key: {{builds-ssh-key}}

We have a job for each of these that builds the environment and puts it:

- name: my-project-prep
  plan:
      - task: prep
        file: tasks/my-project/prep.yaml
        input_mapping:
            source: my-project-src

      - put: my-project-env
        params:
            sync_dir: env

Jobs that need it just do a get:

- get: my-project-env
  trigger: true
  passed:
      - my-project-prep

That prep.yaml file looks like:

platform: linux

image_resource:
    ...

inputs:
    ...

outputs:
    - name: env
      path: env

run:
    path: tasks/my-project/prep.sh

The prep.sh just does what it needs to do, and then outputs to wherever it needs in the env directory. Other tasks can then make use of what's in there.

For node packages, we use npm-cache to help with the cache. We have a helper function for use in the prep.sh scripts to set up that cache, install the modules, and store them in the environment:

function build_and_store_node_modules() {
    # Keep the cache inside the 'env' output so it gets stored by the rsync resource.
    cache_dir=$BUILDS_DIR/env/.package_cache

    mkdir $cache_dir
    # npm-cache looks for its cache in ~/.package_cache by default, so link that to the output.
    ln -s $cache_dir ~/.package_cache

    echo Installing node modules...
    npm install -g npm-cache
    npm-cache install
}

Scripts that need to load the environment then just call this helper:

function restore_node_modules() {
    echo Restoring node modules...
    # Re-create the ~/.package_cache symlink so npm-cache finds the stored cache from the env input.
    ln -s $BUILDS_DIR/env/.package_cache ~
}

Hope that helps! I can provide more examples if needed.

@tomwhoiscontrary

commented Oct 19, 2016

@poida You should be able to use the same approach with NPM that we used with Gradle - wrap the Git resource in a script which pulls your code, downloads the dependencies, then exposes the dependencies as a resource. Replace build.gradle with package.json, ./gradlew __cacheDependencies with npm install, and ~/.gradle with ~/.node_modules.

@jchester

commented Oct 22, 2016

I gave this more thought recently and have changed my position. I still see the need, but I think it should be handled as a new resource.

The key implementation problem seems to be that any cache will either need to be centralised, or consistent amongst workers.

I bring this up because my first thought was: "why not write a resource that punches a hole through to a directory on the worker FS?", followed by the second: "uh, what if there is more than one worker, Jacques?"

I think that adopting the lightest possible blob store as a core-supported resource is the go here. It would retain existing Concourse semantics, would avoid changes to core logic and would provide a relatively smooth pathway for folk switching to and from remote blobstores as the case may be.

I don't feel qualified to evaluate the alternatives, but it seems to me that something that can distribute itself less-than-stupidly across workers would be the goal.

Some quick googling: The trendiest right now is minio (Golang u guise!1!), but it's not distributed yet (edit: apparently it now is as of last month). SX Cluster is distributed and more mature, but mired under a GPL cloud. Ceph is very mature and under LGPL, not sure how heavyweight it is. Ambry comes from LinkedIn and is presumably pretty robust, but requires a JVM.
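As one concrete illustration of the idea: with an S3-compatible store such as minio co-deployed near the workers, even the existing s3 resource could stash a dependency tarball between builds (a sketch only; the endpoint, bucket, and credential names are placeholders):

resources:
- name: deps-tarball
  type: s3
  source:
    endpoint: http://minio.internal:9000   # placeholder address for a co-deployed blobstore
    bucket: build-caches
    regexp: our-app/node_modules-(.*).tar.gz
    access_key_id: {{cache-access-key}}
    secret_access_key: {{cache-secret-key}}

A job would then get deps-tarball, unpack it before building, and put a fresh tarball whenever the dependencies change.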

@poida

commented Oct 23, 2016

@jchester Is what you're suggesting what baggageclaim already does for Concourse volumes? I wonder if that could be leveraged somehow into a resource type.

@dougtweedy

commented Nov 2, 2016

+1. Having this for our Gradle builds is essential for caching dependencies. Using volume mounts on the .gradle directory would work too, but the option needs to be exposed to the Concourse tooling (edit: and fall into the paradigm of stateless/centralized).

@aarondl

commented Nov 17, 2016

+1. Building a caching mechanism ourselves (we use an rsync server with the rsync resource) is silly and painful, because all of Concourse's paradigms make it hard to keep things between builds (like not being able to use the same directory for put/get).

@poida

commented Dec 22, 2016

Thanks @tomwhoiscontrary, what sort of speed-up does your gradle-cache-resource give your pipeline? Did you have to tweak Concourse settings to keep the resources around? Does Concourse evict the resources from the cache often?

@tomwhoiscontrary

commented Dec 22, 2016

@poida The speedup appears to be equal to the time taken to download the resources, so a minute or so. I've never noticed the resource being rebuilt after being discarded from the cache. But then, ours is a small pipeline, with only four resources that actually produce files, so there is not a lot of cache pressure. A small problem is gratuitous rebuilds - if you make an irrelevant tweak to a build file (e.g. changing whitespace), the resource will rebuild. That's still better than rebuilding every time, though.

@jchester

commented Dec 28, 2016

I (or rather, my Pivotal-affiliated alter ego) made a stab at setting up a resource during a hackday about a month ago. Didn't get a 10th of what I wanted, but I at least demonstrated using a co-deployed blobstore to avoid having to hit the internet.

@JohannesRudolph

Contributor

commented May 25, 2017

We've had this same issue for a long time and it was very annoying until we came up with a non-intrusive caching strategy using a Task and two short scripts. I've wrapped it up as a sample repo here: https://github.com/Meshcloud/concourse-cached-pipeline

In any case, I'm curious about @vito's plans to add this natively to Concourse, and how and when we can expect that to happen.

@fralalonde

commented May 29, 2017

The cache needs to be per-worker to maximize locality, accessible concurrently and totally transparent to the jobs. I'm thinking of running a caching proxy on every worker node, and route any request to our Artifactory server through it. The jobs will still "download" every dependency on each run, but transfers will be from the local worker proxy and thus will never hit the wire after first access.

Also, I don't know about other build systems, but Maven with Takari extensions allows .m2/repository to safely be updated by two concurrent build processes. This would make it possible for two jobs to use the same cache as a local network mount, or a shared docker volume, which would not require a proxy. If concourse could provide a default way to access worker-local storage to any / every job... Maybe this already exists? I've been out of the loop for a while.

@vito

Member Author

commented Jun 19, 2017

Some investigation is needed into parallel execution of builds with the same cache. But we'll pick it up and figure it out as we go.

Worst-case ontario, we could just not care, and say if you've got parallel hits of the same cache, you can just set serial: true if there are problems.
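In terms of the example pipeline from the proposal, that workaround is just the existing job-level setting:

jobs:
- name: make-release
  serial: true    # only one build of this job runs at a time, so the cache is never hit in parallel
  plan:
  - get: my-release
  - task: create-release
    file: my-release/ci/create-release.yml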

@pn-santos

commented Jun 21, 2017

One thing I don't quite understand from the proposed solution: how would the cache be evicted/reset?

Would that be something the task is responsible for? Could fly, for example, offer a way to evict the caches of a certain job?

Also, regarding the scope of caches, won't making them task-scoped make the size requirements for each worker go up considerably? i.e. if I have 3 tasks that use the same cache I'll have 3 copies of the same thing if I cache it on all 3 tasks, right? Eventually that data would be present on all workers...

Wouldn't it be preferable to make them per job (so making the key team-id+pipeline-id+job-name)? It would be like a task input that could also be an output (i.e. like mutating data in a task input path that is visible to downstream tasks). If one considers the more common scenarios (npm, maven/gradle, go vendor, etc.) it's very likely that these caches will be re-used by tasks within a job. Handling concurrency within a job is not that hard (vs handling it cross-pipeline).

@mariash mariash assigned mariash and unassigned topherbullock and ebabani Jun 22, 2017

@mariash

Contributor

commented Jun 22, 2017

The initial approach is to introduce a worker_task_caches table with the following fields:

  • worker_name
  • job_id
  • step_name
  • path

Plan ID is not consistent; it is just an incremented number within the pipeline (see NewPlanFactory) and will be different for each build. We decided to rely on the step name to persist the task cache. That means if two tasks in a job have the same name and reference the same cache path, they will share the same cache.

Volumes can now belong to worker_task_caches, the same way they belong to worker_resource_caches.

This allows us to define clear garbage collection for those volumes:

  • Delete the worker_task_cache when it is no longer needed, e.g. when it has been removed from the pipeline.
  • This nullifies worker_task_cache_id on the corresponding volumes, and those volumes become candidates for garbage collection.

Flow:

When the task is running, for each specified cache in the task:

  • If the cache volume exists: mount a COW volume at the specified path. The COW volume will belong to the container.
  • If the cache volume does not exist: mount a new volume at the specified path. That volume will belong to the container.

After the task has run, for each specified cache in the task:

  • If it was a COW volume and its sha differs from the parent volume's:
    - Create an import volume from the COW volume.
    - Set worker_task_cache_id on the import volume and remove worker_task_cache_id from the parent volume (in one transaction). That makes the old parent volume a candidate for garbage collection.
    - If two tasks modified the COW volume in parallel, figure out which one to pick, e.g. the one with the highest build ID.
  • If it is a new volume: initialize it as a worker task cache volume (set worker_task_cache_id on the volume). We do the same thing for resource cache volumes: first the volume is created for the container, then it is initialized as a resource cache volume. The container_id will be nullified when the container is removed.
  • If the sha did not change: don't do anything.

TODO:

  • rename cache to caches in the task config, to be consistent with inputs/outputs? (see the sketch after this list)
  • garbage collect task caches
  • concurrency: when two builds are running in parallel, which volume will be promoted to the task cache? As a first iteration, the latest one wins.
  • optimization: as a first iteration an import volume will be created every time. Add an optimization to check whether it was a new volume or the sha did not change.
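For reference, the syntax that ultimately shipped uses a caches list, matching the rename floated above; the original proposal's task config would become:

---
platform: linux

inputs:
- name: my-release

caches:
- path: my-release/.blobs

run:
  path: my-release/ci/scripts/create-release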

mariash added a commit to concourse/atc that referenced this issue Jun 22, 2017

add caching between task runs
concourse/concourse#230

* add worker_task_caches and worker_task_cache_id to volume
* if volume does not exist create a new volume for each cache
* if volume exists create a cow volume for each cache
* mount volume at a specified cache path
* after task is finished initialize volume as worker task cache.

TODO:
* add path to worker_task_caches to distinguish task caches for
different paths
* stream contents of cache volume if it is on different worker
* create an import volume to avoid infinite cow volumes and mark that
volume as worker task cache volume
* in the same transaction release original parent volume
* do not create an import volume if it does not have a parent
* do not create an import volume if cow volume has the same sha of
contents as the parent volume
* make sure volume promotion to task cache is concurrency safe
* garbage collect worker task caches

mariash added a commit to concourse/atc that referenced this issue Jun 23, 2017

add caching between task runs
concourse/concourse#230

* add worker_task_caches and worker_task_cache_id to volume
* if volume does not exist create a new volume for each cache
* if volume exists create a cow volume for each cache
* mount volume at a specified cache path
* after task is finished initialize volume as worker task cache
* create an import volume to avoid infinite cow volumes and mark that
volume as worker task cache volume
* in the same transaction release original parent volume
* do not create an import volume if it does not have a parent
* make sure volume promotion to task cache is concurrency safe

TODO:
* add path to worker_task_caches to distinguish task caches for
different paths
* stream contents of cache volume if it is on different worker
* garbage collect worker task caches
* [optimization] do not create an import volume if cow volume has the same sha of
contents as the parent volume

mariash added a commit to concourse/atc that referenced this issue Jun 23, 2017

add caching between task runs
concourse/concourse#230

* add worker_task_caches and worker_task_cache_id to volume
* if volume does not exist create a new volume for each cache
* if volume exists create a cow volume for each cache
* mount volume at a specified cache path
* after task is finished initialize volume as worker task cache
* create an import volume to avoid infinite cow volumes and mark that
volume as worker task cache volume
* in the same transaction release original parent volume
* do not create an import volume if it does not have a parent
* make sure volume promotion to task cache is concurrency safe

TODO:
* add path to worker_task_caches to distinguish task caches for
different paths
* stream contents of cache volume if it is on different worker
* garbage collect worker task caches
* [optimization] do not create an import volume if cow volume has the same sha of
contents as the parent volume

Signed-off-by: Chris Hendrix <chendrix@pivotal.io>

mariash added a commit to concourse/testflight that referenced this issue Jun 23, 2017

add test for task caches
concourse/concourse#230

Signed-off-by: Chris Hendrix <chendrix@pivotal.io>

mariash added a commit to concourse/atc that referenced this issue Jun 26, 2017

add caching between task runs
concourse/concourse#230

* add worker_task_caches and worker_task_cache_id to volume
* if volume does not exist create a new volume for each cache
* if volume exists create a cow volume for each cache
* mount volume at a specified cache path
* after task is finished initialize volume as worker task cache
* create an import volume to avoid infinite cow volumes and mark that
volume as worker task cache volume
* in the same transaction release original parent volume
* do not create an import volume if it does not have a parent
* make sure volume promotion to task cache is concurrency safe
* add path to worker_task_caches to distinguish task caches for
different paths
* delete work task caches for inactive jobs and steps that no longer
exist

Signed-off-by: Chris Hendrix <chendrix@pivotal.io>

mariash added a commit to concourse/testflight that referenced this issue Jun 26, 2017

add test for task caches
concourse/concourse#230

Signed-off-by: Chris Hendrix <chendrix@pivotal.io>

mariash added a commit that referenced this issue Jun 26, 2017

bump atc testflight
Submodule src/github.com/concourse/atc 01c22b4..71f27ac:
  > add caching between task runs
Submodule src/github.com/concourse/testflight 9f93560..fdd732d:
  > add test for task caches

#230

Signed-off-by: Chris Hendrix <chendrix@pivotal.io>

mariash added a commit to concourse/atc that referenced this issue Jun 27, 2017

do not initialize task cache volumes for one off builds
You cannot have task caches for one off builds, but we were never
disallowing that before.

Also, we shouldn't be ignoring foreign key violations and doing a Safe
Retry.

concourse/concourse#230

Signed-off-by: Maria Shaldibina <mshaldibina@pivotal.io>

mariash added a commit that referenced this issue Jun 27, 2017

bump atc
Submodule src/github.com/concourse/atc 71f27ac..e4b9dc9:
  > do not initialize task cache volumes for one off builds

#230

Signed-off-by: Maria Shaldibina <mshaldibina@pivotal.io>

mariash added a commit that referenced this issue Jun 27, 2017

add documentation for task caches
#230

Signed-off-by: Maria Shaldibina <mshaldibina@pivotal.io>
@vito

Member Author

commented Jun 27, 2017

@mariash minor thing - noticed these show up as "unknown" in fly volumes. could we show something special for these? e.g. task-cache and maybe the pipeline/job/step?

@vito

Member Author

commented Jun 28, 2017

Breaking that last bit out as a separate issue in 3.3.x since it's not critical enough to block release.

@vito vito closed this Jun 28, 2017

vito pushed a commit to concourse/docs that referenced this issue Sep 25, 2017

add documentation for task caches
concourse/concourse#230

Signed-off-by: Maria Shaldibina <mshaldibina@pivotal.io>

vito added a commit that referenced this issue Dec 5, 2017

bump atc
Submodule src/github.com/concourse/atc 8ad0882..c783355:
  > Merge pull request #222 from sharms/idle-timeout
  > Merge pull request #230 from jmcarp/fix-prometheus-volumes