This repository has been archived by the owner on Aug 30, 2022. It is now read-only.

Share the download cache with the invoking Bazel installation #63

Closed
hchauvin opened this issue Apr 23, 2018 · 13 comments

Comments

@hchauvin
Contributor

... or have another way to share external dependencies.

Currently, the Bazel environment created by bazel-integration-testing has a download cache that is separate from that of the invoking Bazel installation (let's call the latter the "upstream" download cache). It is possible to keep a download cache across runs during development by pinning TEST_TMPDIR, but since the download cache is content-addressable storage, wouldn't it make sense to simply use the upstream download cache, as that would be safe anyway? In some cases this could greatly improve performance. Agreed, such external dependencies should be avoided in integration tests, but sometimes they are more convenient.
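
The safety argument rests on the cache being content-addressable: an entry's key is the hash of its bytes, so two writers storing the same archive produce the same entry, and a corrupted entry can be detected on read. A minimal sketch of that property in Python follows. The flat one-file-per-digest layout and the names `cas_put` and `cas_get` are assumptions made for illustration, not Bazel's actual on-disk format or API.

```python
import hashlib
import os


def cas_put(root, data):
    """Store data under its own SHA-256 digest and return the digest.

    Toy layout (an assumption): one file per digest directly under root.
    """
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(root, digest)
    if not os.path.exists(path):
        os.makedirs(root, exist_ok=True)
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        # Rename into place so readers never see a half-written entry.
        os.replace(tmp, path)
    return digest


def cas_get(root, digest):
    """Return the stored bytes, or None if missing or not matching the digest."""
    path = os.path.join(root, digest)
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return None
    # Re-hash on read: a tampered or truncated entry is treated as a miss.
    if hashlib.sha256(data).hexdigest() != digest:
        return None
    return data
```

Storing the same bytes twice yields the same digest and a single entry, which is why sharing such a store between two Bazel instances does not change what either of them reads.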

This is not something that could be added right away, though, as it would require Bazel to publish in some way the location of its download cache.

@ittaiz
Member

ittaiz commented Apr 23, 2018

this is really interesting. @anchlovi is also facing some issues with setting up external dependencies for the scratch workspace and your solution sounds really interesting. @anchlovi wdyt?

@ittaiz
Member

ittaiz commented Apr 24, 2018

Where is REPOSITORY_CACHE that you used in the rules_kotlin PR defined?
Also, it sounds strange that the scratch_workspace would have access to the shared cache directory, since it's outside its sandbox.

@ittaiz
Member

ittaiz commented Apr 24, 2018

I'll rephrase the last sentence as a question. Were you able to run it with sandboxing?

@hchauvin
Contributor Author

hchauvin commented Apr 24, 2018

Yep, it runs with sandboxing, but because of a horrendous hack involving absolute paths. Actually in local dev I hardcode this: "-Dbazel.repository.cache=/tmp/cache" and have /tmp/cache as the repository cache of the invoking bazel instance. That's clearly not a portable solution :)

I thought about this again and there are IMO three ways to speed up things with shared caching: one that works out of the box, without modifying bazel, one that works with a little bit of help from the user, one that means modifying bazel.

Working out of the box, without modifying bazel: prewarming

This might actually work across the spectrum, including remote execution, and without touching Bazel.

Here, the WorkspaceDriver would be used during the build phase to generate a download cache, so that the downloading time is only paid once. I actually like that better than piggybacking on the caching mechanism of the invoking bazel instance, because you make fewer assumptions about how bazel works, and those assumptions might break when you have, e.g., remote execution.

For example, you might use a slightly modified WorkspaceDriver in a genrule with the following:

  • scratch a WORKSPACE with your external deps
  • build them with a repository_cache location that you control: "bazel build @foo//bar --repository_cache=..."
  • put everything in the repository_cache in a tarball that is the output of the genrule (again, this is just a CAS, so that's perfectly safe to do).
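
The packaging step in the list above could be sketched as follows. This is only an illustration in Python: `pack_cache` is a hypothetical helper, not part of bazel-integration-testing, and it assumes the cache directory has already been populated by a Bazel run. Entries are added in sorted order with timestamps and ownership cleared, with the aim of keeping the genrule output stable between runs.

```python
import os
import tarfile


def _normalize(info):
    # Clear timestamps and ownership so they do not vary between runs.
    info.mtime = 0
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info


def pack_cache(cache_dir, tar_path):
    """Pack every file under cache_dir into an uncompressed tarball.

    Hypothetical helper: cache_dir is assumed to be the directory that was
    passed to Bazel as the repository cache. Returns the sorted member names.
    """
    members = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            arcname = os.path.relpath(path, cache_dir).replace(os.sep, "/")
            members.append((arcname, path))
    members.sort()  # fixed member order, independent of directory walk order
    with tarfile.open(tar_path, "w") as tar:
        for arcname, path in members:
            tar.add(path, arcname=arcname, filter=_normalize)
    return [arcname for arcname, _ in members]
```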

Then you test as follows:

  • put the tarball as data to the integration test
  • untar it in a repository_cache location that you control
  • invoke bazel, e.g.: "bazel test //hello:world --repository_cache=..."
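
The test-side steps above could look roughly like this in Python. Both function names are hypothetical, and the sketch assumes the tarball was produced by your own build, so it is trusted apart from a basic path check.

```python
import os
import tarfile


def unpack_cache(tar_path, dest_dir):
    """Extract the prewarmed cache tarball into dest_dir; return its real path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            # Refuse entries that would land outside the destination directory.
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError("unsafe path in cache tarball: " + member.name)
        tar.extractall(dest_root)
    return dest_root


def bazel_test_command(target, cache_dir):
    """Build the argv for the nested Bazel invocation (not executed here)."""
    return ["bazel", "test", target,
            "--repository_cache=" + os.path.abspath(cache_dir)]
```

The returned list can be handed to whatever launches the nested Bazel; building it as a list rather than a string avoids quoting problems when the cache path contains spaces.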

I am pretty sure an intuitive API can be designed to encapsulate all that. The advantage here is that you get a performance improvement across the board, without depending on a particular way of invoking Bazel. And you don't need to touch Bazel.

With a little bit of help from the user

Here, we let the user pass some caching options (there are many others besides --repository_cache) through a '--define bazel.caching.options="--repository_cache=..."'. This is straightforward to expand, see bazelbuild/bazel#3736. However, it means that caching does not work out of the box for users. Moreover, depending on how the caches are implemented, there is a risk of cache poisoning, and it is problematic overall from a security point of view.
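
Once the define value reaches the test, it still has to be turned into separate arguments for the nested Bazel. A minimal sketch in Python, assuming the raw option string is already available to the test; `nested_bazel_args` is a hypothetical name, and how the value travels from `--define` to the test process is left out.

```python
import shlex


def nested_bazel_args(command, target, caching_options):
    """Build the argv for the nested Bazel run.

    caching_options is the raw string supplied by the user, for example the
    value of bazel.caching.options. It is split with shell-like quoting rules
    so that quoted paths containing spaces stay in one argument.
    """
    return ["bazel", command, target] + shlex.split(caching_options)
```

For example, `nested_bazel_args("test", "//hello:world", "--repository_cache=/tmp/cache")` yields `["bazel", "test", "//hello:world", "--repository_cache=/tmp/cache"]`. Note that this forwards whatever the user supplied, which is exactly where the cache-poisoning concern comes from.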

Modifying bazel

Here, we let bazel give the caching options that were passed to its local instance through the ctx object, maybe as a make variable (ctx.var). Again, there is a risk of cache poisoning, and this probably seems like too particular a use case to warrant loading up the ctx object with yet another configuration variable, especially since this comes with security issues.

======

Overall, I'm in support of prewarming, but I'm curious to have your opinion about that.

@ittaiz
Member

ittaiz commented Apr 24, 2018 via email

@hchauvin
Contributor Author

hchauvin commented Apr 24, 2018

Sorry, I wrote that too quickly. I have in mind using the WorkspaceDriver in a genrule. Changed the text so that it is clearer there as well.

I don't know if it is possible, though, I need to try that.

@ittaiz
Member

ittaiz commented Apr 24, 2018 via email

@hchauvin
Contributor Author

hchauvin commented Apr 24, 2018

Ah yes they are not supposed to access the network. But does it mean that they can't?

Another possibility, in this case, would be to have a new repository rule that uses repository_ctx.download to have a compressed archive. Then it is possible to use --experimental_distdir (https://github.com/bazelbuild/bazel/blob/beef2c452bb6b1c2dd2b08d3089062f26cccc859/src/main/java/com/google/devtools/build/lib/bazel/repository/RepositoryOptions.java) with a location within the workspace to have all the repository rules cached without rewriting everything.

Something like a repository rule:

# WORKSPACE
integration_testing_archives(
    name = "archives",
    content = {
        "https://foo/bar.zip": "<sha256 digest>",
        "https://hello/world.tar.gz": "<sha256 digest>",
    }
)
# Integration WORKSPACE

# some_repository_rule is set to download https://foo/bar.zip.  But with the distdir,
# it actually looks up some predefined path for a matching sha256.
some_repository_rule(
    name = "repo",
)

So this is almost the same, but now you have a new integration_testing_archives rule where you say exactly what you want to cache. And that's the missing piece, because it actually allows you to reuse the download cache of the "invoking Bazel instance" (this is a mouthful, but I didn't find better). Then you can disable downloading entirely by setting a "block-network" tag on the java_test in bazel_java_integration_test:

bazel_java_integration_test(
    ...,
    cache = "@archives",
    tags = ["block-network"],
)
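
The lookup described in the comment above ("looks up some predefined path for a matching sha256") can be illustrated with a short Python sketch. It assumes that a local file substitutes for a download when it has the same base name as the URL and hashes to the expected digest. `find_in_distdir` is a hypothetical function written for this illustration, not Bazel's implementation.

```python
import hashlib
import os
from urllib.parse import urlparse


def find_in_distdir(url, sha256, distdirs):
    """Return the path of a local file that can stand in for url, or None.

    A candidate must share the URL's base name and match the expected
    SHA-256 digest; anything else falls through to a normal download.
    """
    basename = os.path.basename(urlparse(url).path)
    for distdir in distdirs:
        candidate = os.path.join(distdir, basename)
        if not os.path.isfile(candidate):
            continue
        with open(candidate, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest == sha256:
            return candidate
    return None
```

Because the digest is checked, a stale or wrong file in the directory is simply ignored, which is why listing archives together with their sha256 values is enough to serve them without network access.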

@ittaiz
Member

ittaiz commented Apr 24, 2018 via email

@damienmg
Contributor

Preferably, tests and builds shouldn't access the network, and I think the Linux sandbox forbids it (unless that was removed for performance reasons; the blocking is possible). For build steps the network is not restricted (for performance reasons). You can also add a tag to ensure that a target gets access to the network.

Anyway, you can make the repository cache accessible by using rctx.symlink. That does require some hack to pass around the name of the repository cache. I don't believe it makes sense to have bazel integrate such a feature.

@hchauvin
Contributor Author

@damienmg Thank you for the feedback. I agree with you: I don't think that Bazel should expose that; it belongs in the realm of the test infrastructure.

For the blocking of the network, the "block-network" tag can be added.

@ittaiz I came up with another approach which feels way less hackish, but it works only for testing bazel > 0.12.0, using --experimental_distdir, see #71.

@ittaiz
Member

ittaiz commented Apr 25, 2018 via email

@ittaiz
Member

ittaiz commented Apr 29, 2018

fixed by #71 thanks @hchauvin!

@ittaiz ittaiz closed this as completed Apr 29, 2018