Make maven_jar and friends smarter by re-using previously fetched artifacts across different projects #1752

Open
davido opened this Issue Sep 11, 2016 · 13 comments

davido commented Sep 11, 2016

The current maven_jar() implementation only uses a fetched artifact for the specific project it was fetched for. It doesn't provide a solution for very basic requirements that native Maven would provide:

  • previously fetched artifacts should survive a complete clean of the output base for a specific project (bazel clean --expunge)
  • previously fetched artifacts can be reused by other clones of the same project. Say I have separate clones for the master and stable branches
  • previously fetched artifacts can be reused by different projects. Say Gerrit Code Review, JGit, Gitiles and 42 Gerrit plugins (standalone build mode) are cloned and built on the same machine. Their prerequisites are almost the same

See the Gerrit Code Review maven_jar Bucklet implementation for how to get it right: [1]. That implementation puts all fetched artifacts in a project-independent area and links the artifacts into the project output: [2]. More context is here: [3,4].

Bandwidth is far too valuable a resource to throw away (or ignore) previously downloaded artifacts and re-fetch gigabytes of data again.

kchodorow commented Sep 12, 2016

+1, it's super annoying to have these re-downloaded for every new workspace. This is a little tricky to implement, because we do need a way to clear the "master cache," whether that's clearing or updating individual entries that might be corrupt or out of date, or just avoiding taking up all of the user's disk space. We'll also have to be careful about correctness; perhaps using the cache should require a hash.

(There were other requests for this a while ago, but now I can't find them. Will link if I come across them.)

jin commented Sep 12, 2016

I've thought about these problems as well while re-implementing maven_jar() (#1410), especially while dealing with the location of local repositories. Will need to find a balance between the persistence of local repositories and Bazel's reproducibility and correctness ethos.

kchodorow commented Sep 13, 2016

Aha, related to #1266.

davido commented Sep 14, 2016

Thanks for the link. There is also a similar feature request, mentioned in #1266, on the dev mailing list.

As pointed out by @jin, it would be trivial to teach the maven_jar reimplementation (as a Skylark rule) to use an alternative, project-independent directory, say ~/.bazel_external_repository: [1], right? Bonus points for making this location configurable, so that we could just point it at Buck's download artifact cache, or even teach the Skylark rule to hijack ~/.m2 for that purpose and re-use the downloaded artifacts with Maven itself (for those of us who still have to use Maven for other projects that weren't ported to Buck or Bazel yet).

jin commented Sep 16, 2016

Initial thoughts: a basic design uses a maven_local_repository rule, which is basically a thin wrapper around the native local_repository rule. It has only one attribute, path, which lets you specify the absolute path of the system's local Maven repository.

load("@bazel_tools//tools/build_defs/repo:maven_rules.bzl", "maven_local_repository")
maven_local_repository(
    path = "/home/johndoe/.m2",
)

This folder will then be symlinked to each maven_jar subfolder in //external, thus allowing caching across Bazel builds and reuse across build tools.

kchodorow commented Sep 16, 2016

This would need to work for all repository rules, not just Maven. Right now we download/link stuff into output_base/external/reponame; we'd need to come up with a central location to download/link stuff to, then add a step to create the symlink output_base/external/reponame -> $CENTRAL_CACHE/reponame.
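As an illustrative sketch (all directory names below are placeholders, not Bazel's actual layout), the proposed scheme boils down to downloading once into a central cache and symlinking each workspace's external/ entry to it:

```shell
# Sketch of the proposed central-cache layout. All paths here are
# illustrative placeholders, not Bazel's real directories.
CENTRAL_CACHE=$(mktemp -d)
OUTPUT_BASE=$(mktemp -d)

# The repository is fetched once into the central cache...
mkdir -p "$CENTRAL_CACHE/reponame"
echo "artifact-bytes" > "$CENTRAL_CACHE/reponame/foo.jar"

# ...and the workspace's external/reponame is just a symlink to it,
# so a second workspace would reuse the same download.
mkdir -p "$OUTPUT_BASE/external"
ln -s "$CENTRAL_CACHE/reponame" "$OUTPUT_BASE/external/reponame"

cat "$OUTPUT_BASE/external/reponame/foo.jar"
```

A second output base would add only another cheap symlink, not another download; the open question in the thread is how to clean or invalidate the shared cache safely.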

damienmg commented Sep 21, 2016

Caching the HttpDownloadValue is probably the easiest way to go forward. But we might want to expose those caching capabilities a bit more directly (so execute results can also be cached).

johnynek commented Sep 28, 2016

Generally, caching any download with a SHA would be great. This is a major pain point for our developers, as we have a lot of downloads (many external repos: maven_jar + git_repository).

bazel-io pushed a commit that referenced this issue Oct 5, 2016

Implemented a "--experimental_repository_cache" option as the first step to caching external repositories.

The option is categorized as hidden because it is a no-op.

Re-submit with fix from rollback in commit 9883e22 due to JDK7 build failure.

GitHub issue: #1752

--
MOS_MIGRATED_REVID=135231668

bazel-io pushed a commit that referenced this issue Oct 7, 2016

Bridged --experimental_repository_cache value to HttpDownloader. Created HttpCache skeleton to implement caching logic of HttpDownloadValues as the first step (more types of caches will come later).

Having RepositoryDelegatorFunction initialize the cache in the respective RepositoryFunction handlers decouples the cache implementation from the delegator. It delegates the choice of Cache classes to the respective RepositoryFunctions, and lets them decide what to do with the PathFragment of the cache location.

Continuation of commit 239d995.

A follow up CL will contain the implementation of HttpCache. For now, it's the empty interface of com.google.common.cache.Cache.

GITHUB: #1752

--
MOS_MIGRATED_REVID=135400724

bazel-io pushed a commit that referenced this issue Oct 19, 2016

Made HttpDownloader download calls non-static.
To set and use a RepositoryCache instance in HttpDownloader while parsing the command line options, we can pass an AtomicReference<HttpDownloader> instance from BazelRepositoryModule to the HttpArchiveFunctions. However, we'll need to change HttpDownloader download() calls to be non-static in order to initialize an instance of HttpDownloader in BazelRepositoryModule.

Remaining TODOs:

- RepositoryCache implementation and unit testing
- RepositoryCache lockfiles
- RepositoryCache integration testing

GITHUB: #1752

--
MOS_MIGRATED_REVID=136593517

bazel-io pushed a commit that referenced this issue Oct 27, 2016

Implementation of the Repository Cache get and put features.
This is a basic implementation of writing and reading artifacts downloaded by HttpDownloader, keyed by the artifact's SHA256 checksum. For an artifact to be cached, its SHA256 value needs to be specified in the rule. Rules supported: http_archive, new_http_archive, http_file, http_jar.

Remaining TODOs:

- Lockfiles for concurrent operations in the cache.
- Integration testing

GITHUB: #1752

--
MOS_MIGRATED_REVID=137289206
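The get/put behavior described in the commit above — a content-addressed store keyed by the artifact's SHA256 — can be sketched roughly as follows. This is a toy model for illustration only, not Bazel's actual RepositoryCache code:

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 hex digest of a file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


class RepositoryCacheSketch:
    """Toy content-addressed cache keyed by SHA-256 (illustrative only)."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, artifact: Path) -> str:
        """Store a downloaded artifact under its own digest; return the key."""
        key = sha256_of(artifact)
        shutil.copyfile(artifact, self.root / key)
        return key

    def get(self, key: str, destination: Path) -> bool:
        """Copy the cached entry to destination; False means cache miss."""
        entry = self.root / key
        if not entry.exists():
            return False  # miss: the caller falls back to downloading
        if sha256_of(entry) != key:
            return False  # corrupt entry must never be served
        shutil.copyfile(entry, destination)
        return True
```

Because the key is the checksum the rule already declares, a hit is correct by construction: serving the entry is indistinguishable from re-downloading the bytes.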

bazel-io pushed a commit that referenced this issue Oct 27, 2016

Integration tests for RepositoryCache.
Remaining TODOs:

- Lockfiles for concurrent operations in the cache.

GITHUB: #1752

--
MOS_MIGRATED_REVID=137296606
jin commented Oct 27, 2016

0590483 now lets you use --experimental_repository_cache=$HOME/some/path to cache downloaded artifacts that have their SHA256 values specified. This cache will survive bazel clean --expunge. Works with artifacts downloaded with new_http_archive, http_archive, http_file, http_jar, Skylark's download and download_and_extract. Maven support coming up.
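For example (the URL and checksum below are placeholders), a rule only participates in the cache when its sha256 is given:

```python
# WORKSPACE (illustrative; name, URL and sha256 are placeholders)
http_archive(
    name = "my_dep",
    url = "https://example.com/my_dep-1.0.tar.gz",
    sha256 = "<sha256-of-the-archive>",
)
```

Building with bazel build --experimental_repository_cache=$HOME/some/path //... then populates the cache on the first fetch and serves later fetches, from any workspace on the machine, out of it.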

bazel-io pushed a commit that referenced this issue Oct 28, 2016

Integration tests for RepositoryCache using Skylark's download() and download_and_execute().

GITHUB: #1752

--
MOS_MIGRATED_REVID=137535936

bazel-io pushed a commit that referenced this issue Nov 3, 2016

Refactor MavenDownloader to be a subclass of HttpDownloader to streamline instantiation of HttpDownloader and RepositoryCache in BazelRepositoryModule.

There are sufficient similarities between the download flows of HttpDownloader and MavenDownloader such that we can extend HttpDownloader to MavenDownloader, and reuse method headers such as checkCache and download.

GITHUB: #1752

--
MOS_MIGRATED_REVID=137982375

bazel-io pushed a commit that referenced this issue Nov 3, 2016

jin commented Nov 3, 2016

With 38e54ac, maven_jar artifacts with the SHA1 value specified can now be cached using --experimental_repository_cache.
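For example (the coordinates and checksum below are placeholders), the sha1 attribute is what makes the artifact cacheable:

```python
# WORKSPACE (illustrative; artifact coordinates and sha1 are placeholders)
maven_jar(
    name = "guava",
    artifact = "com.google.guava:guava:19.0",
    sha1 = "<sha1-of-the-jar>",
)
```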

davido commented Nov 5, 2016

Thanks. This is very much appreciated!

I have built from the tip of master and am trying to integrate this great feature into the Gerrit Code Review Bazel build, and I have some questions:

Neither ~/.gerritcodereview/bazel_repository_cache nor $HOME/.gerritcodereview/bazel_repository_cache seems to be supported. What we would like to do is put these lines in the .bazelrc in the root of the Gerrit project (this file is under Git control, so we don't have the option of using a resolved user name, only ~ or $HOME):

build --experimental_repository_cache=~/.gerritcodereview/bazel_repository_cache --workspace_status_command=./tools/workspace-status.sh --strategy=Javac=worker
fetch --experimental_repository_cache=~/.gerritcodereview/bazel_repository_cache

Note that Buck supports something similar: one can point the directory cache to the user's home directory (excerpt from .buckconfig):

[cache]
  mode = dir
  dir = ~/.gerritcodereview/buck-cache/locally-built-artifacts

I added this feature to Buck 2 years ago: facebook/buck@a1ba001

For now I tested it with --experimental_repository_cache=/home/davido/.gerritcodereview/bazel_repository_cache and it works as expected. I ran bazel fetch gerrit, then disconnected the wire and rebuilt with bazel build gerrit without a network connection. It has also survived bazel clean --expunge ;-). This is great!

Question: Can it be that the cached artifacts are copied to the external directory and not linked? I cannot see that symbolic links are used. Is there any particular reason not to use symbolic links for that?

kchodorow commented Nov 7, 2016

Question: Can it be that the cached artifacts are copied to the external directory and not linked? I cannot see that symbolic links are used. Is there any particular reason not to use symbolic links for that?

We may change this in the future, but for now we decided to use copies to simplify cache cleanup.

Bazel options generally support neither ~ nor $HOME; I filed #2054 to gauge interest and have that discussion.

@kchodorow kchodorow added this to the 0.5 milestone Dec 9, 2016

@kchodorow kchodorow removed this from the 0.5 milestone Dec 21, 2016

@kchodorow kchodorow modified the milestones: 0.6, 0.5 Dec 21, 2016

pwnall commented Jan 9, 2017

Aside from all the benefits mentioned above, I think that having a build cache makes it easy to use Bazel repositories with package managers that do not allow (by policy) the build process to download anything on its own.
