POC Remote Caching #2777

Draft · wants to merge 20 commits into main
Conversation

@lihaoyi (Member) commented Sep 21, 2023

This PR allows Mill builds in different folders or on different computers to share their out/ folder data, by means of a third-party remote-cache service referenced via --remote-cache-url

  1. This means that you "never" need to build something twice across an entire fleet of machines. A target compiled on one machine can have its output downloaded and re-used on another, assuming the inputs are unchanged (detected via our normal inputsHash invalidation logic)

  2. This also means you never have to re-run tests unnecessarily (using testCached, which is a Target and thus cached). If you re-run a job due to a flaky test, most of the other tests are automatically skipped (their results are already in the cache) and only the specific flaky tests need to re-run

Limitations

  1. Remote caching assumes that all inputs are tracked. The remote cache cannot detect scenarios where un-tracked inputs cause a target with the same inputsHash to produce different outcomes, e.g. by calling different versions of a CLI tool. I included a --remote-cache-salt flag for a user to explicitly pass in whatever they want as an additional cache key, e.g. they could give developers on macOS and CI machines on Linux different salts to ensure they do not share cache entries (see the sketch below)

  2. Remote caching has known security considerations: anyone with push access to the remote cache can send arbitrary binaries to be executed on the other machines pulling from it, so all machines sharing a remote cache have to be within the same trust boundary. Other topologies are possible, e.g. only pushing to the cache from trusted machines (such as CI running on master) while allowing pulls from untrusted machines (such as developer laptops)

  3. Not everything benefits from remote caching; some targets are faster to compute locally than to fetch over the network, and it's impossible to statically determine which ones. --remote-cache-filter allows the user to select which targets they want to cache. There are probably other ways we could try to tune things, e.g. setting minimum durations or maximum output sizes for cache uploads.

None of these limitations are unique to Mill's implementation: every build tool with remote caching suffers from them, including Bazel. But it is worth calling them out for anyone who wishes to deploy such a system
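To illustrate the salting idea from point 1, here is a minimal sketch, in Scala, of how such an opaque salt might be folded into the key used to look up a target in the remote cache; the helper name and the exact hashing scheme are illustrative assumptions, not the PR's actual code.

```scala
import java.security.MessageDigest

// Hypothetical helper: derive the remote-cache lookup key for a target from its
// normal inputs hash plus an opaque, user-supplied salt. Two machines only share
// cache entries when both the inputs hash and the salt match.
def remoteCacheKey(inputsHash: Int, salt: Option[String]): String = {
  val digest = MessageDigest.getInstance("SHA-256")
  digest.update(inputsHash.toString.getBytes("UTF-8"))
  salt.foreach(s => digest.update(s.getBytes("UTF-8")))
  digest.digest().map(b => f"$b%02x").mkString
}

// e.g. keeping Linux CI machines and macOS developer laptops apart:
// remoteCacheKey(hash, Some("linux-ci")) != remoteCacheKey(hash, Some("macos-dev"))
```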

Implementation

  1. We use the bazel-remote-execution protocol for compatibility with bazel remote caches. This has become a de-facto standard, with multiple build tools supporting it as clients (Bazel, Buck, Pants, ...) and multiple backend implementations supporting it as servers (bazel-remote, Buildbarn, EngFlow, ...).

    • This also saves us from architecting/implementing/maintaining our own scalable high-performance cache backend servers, which is a huge win.
    • I didn't manage to find any other commonly-implemented file-based cache server protocols we could leverage. The closest was WebDAV, which is much simpler than bazel-remote-execution and supported by common tools like Nginx or ApacheHTTPD, but it does not come with cache-eviction and thus would be problematic to use without further configuration of cron jobs and other things
  2. The Bazel remote execution protocol is extremely detailed, and does not completely match up with Mill's data model. For this PR we integrate with it relatively shallowly: on write, we PUT a single ActionResult to the Action Cache endpoint /ac/..., which references a single output file, a .tar.gz of the foo.{dest,json,log} folders, that we PUT to the Content Addressable Store endpoint /cas/.... On read, we do the reverse: grab the /ac/ data, use it to grab the /cas/ blob, and unpack it (a minimal sketch of this flow is shown after this list).

    • This two-step process is necessary because the bazel remote cache API disallows "inline" requests that upload file contents to the /ac/ metadata store. The Bazel-Remote implementation I tested against does not seem to complain, but it is better to follow the spec anyway in case other servers are not so forgiving

    • The fact that we bundle all target data into one big .tar.gz blob does mean the cache is at a target-level granularity. The protocol allows finer grained stuff (e.g. sharing individual files), but we can leave support for that for future work

    • We limit the uploads in the .dest folder to only the things referenced by PathRefs. I use a DynamicVariable to instrument the JSON serialization of PathRefs so I can gather up all the PathRefs in a task's return value for this purpose. This is similar to the approach we discussed w.r.t. PathRef.validatedPaths (a sketch of this gathering trick, together with the relative-path serialization from point 3, is shown after this list)

  3. It took some fiddling to make input hashes consistent across different folders.

    • I made JsonFormatters.pathReadWrite serialize relative paths whenever the path is within os.pwd
    • I tweaked the MillBuildRootModule script-wrapper-code-generator to generate paths relative to os.pwd
    • The use of os.pwd is somewhat arbitrary here. It may make things a bit annoying for testing: we can only change the os.pwd via subprocesses in integration tests and not unit tests. But the alternative of passing an implicit RepoRoot everywhere in our codebase would make things more annoying for everyone else, so this may be the best option
  4. The remote cache will likely need a lot more configuration in future: certificates, authentication, HTTP proxy config, etc. This configuration cannot live in any build.sc, not even a meta-build, since it is needed in order to evaluate the build.sc's tasks in the first place, unless we accept that the meta-build is never going to be remote-cached. The alternative is to put it in some JSON/YAML file somewhere

  5. We need some way to annotate things like resolvedIvyDeps to ensure they are not remote cached, so they can be downloaded anew on each machine.

  6. I pre-built and published the Java protobuf stubs from https://github.com/bazelbuild/remote-apis via a bazelRemoteApis target in Mill's own build file, versioned separately from the rest of Mill, similar to what we do for the Zinc compiler bridges. This shouldn't change very often, and when it does the changes should generally be backwards compatible, so we shouldn't need to include it as a formal part of the Mill build graph

  7. I broke the caching-related logic (both remote and local) out of GroupEvaluator.scala so it's easier to navigate around that part of the codebase
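To make the two-step read/write flow from point 2 concrete, here is a minimal sketch of the HTTP interaction against a bazel-remote-style server using plain java.net calls. The /ac/&lt;hash&gt; and /cas/&lt;hash&gt; endpoint layout follows the bazel-remote HTTP convention; the helper names, and storing just the tarball digest as the action-cache entry, are simplifications of the PR's actual protobuf ActionResult handling.

```scala
import java.net.{HttpURLConnection, URL}
import java.security.MessageDigest

object RemoteCacheSketch {
  private def sha256Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("SHA-256").digest(bytes).map(b => f"$b%02x").mkString

  private def put(url: String, body: Array[Byte]): Unit = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("PUT")
    conn.setDoOutput(true)
    conn.getOutputStream.write(body)
    conn.getOutputStream.close()
    assert(conn.getResponseCode / 100 == 2, s"PUT $url failed: ${conn.getResponseCode}")
  }

  private def get(url: String): Array[Byte] = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.getInputStream.readAllBytes()
  }

  /** Write: upload the .tar.gz of the target's output folders to /cas/<digest>,
    * then an action-cache entry referencing that digest to /ac/<actionKey>. */
  def write(base: String, actionKey: String, destTarGz: Array[Byte]): Unit = {
    val blobDigest = sha256Hex(destTarGz)
    put(s"$base/cas/$blobDigest", destTarGz)
    put(s"$base/ac/$actionKey", blobDigest.getBytes("UTF-8"))
  }

  /** Read: fetch the /ac/ entry, use it to locate the /cas/ blob, then unpack. */
  def read(base: String, actionKey: String): Array[Byte] = {
    val blobDigest = new String(get(s"$base/ac/$actionKey"), "UTF-8")
    get(s"$base/cas/$blobDigest")
  }
}
```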
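Similarly, here is a simplified sketch of the PathRef-gathering trick from point 2 and the relative-path serialization from point 3; PathRef here is a stand-in for Mill's real class, and the gathering buffer stands in for the instrumented JSON writer.

```scala
import scala.util.DynamicVariable

// Simplified stand-in for Mill's PathRef.
case class PathRef(path: os.Path)

object PathRefSketch {
  // Set while a task's result is being serialized; collects every PathRef seen.
  val gathered = new DynamicVariable[Option[collection.mutable.Buffer[PathRef]]](None)

  // Serialize paths relative to os.pwd so the JSON is identical across checkouts.
  def serializePath(p: os.Path): String =
    if (p.startsWith(os.pwd)) p.relativeTo(os.pwd).toString else p.toString

  def serializePathRef(pr: PathRef): String = {
    gathered.value.foreach(_ += pr) // record the PathRef if gathering is active
    serializePath(pr.path)
  }

  /** Run `serializeResult` with gathering enabled and return what was seen. */
  def withGathering[T](serializeResult: => T): (T, Seq[PathRef]) = {
    val buf = collection.mutable.Buffer.empty[PathRef]
    val result = gathered.withValue(Some(buf))(serializeResult)
    (result, buf.toSeq)
  }
}
```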

Testing

Tested manually with https://github.com/buchgr/bazel-remote

$ ./bazel-remote-2.4.3-darwin-arm64 --dir dir --max_size 1 

$ cp -R example/basic/1-simple-scala example/basic/1-simple-scala-copy

$ ./mill -i dev.run example/basic/1-simple-scala -i --remote-cache-filter __.compile --remote-cache-url http://localhost:8080 run --text hello

$ ./mill -i dev.run example/basic/1-simple-scala-copy -i --remote-cache-filter __.compile --remote-cache-url http://localhost:8080 run --text hello

This is a proof of concept that demonstrates the ability for multiple different checkouts of the same repo to share the remote cache. After running the commands above, we can look at 1-simple-scala-copy/out/mill-profile.json to see that compile was cached, despite being run on a "clean" repository. Removing the --remote-cache-filter and re-doing the above steps after removing the out/ folders demonstrates that everything is cached except mill.scalalib.ZincWorkerModule.worker and run, which is to be expected. Not merge-ready, but it demonstrates the approach and can probably be cleaned up and fleshed out if we decide to move forward with it.

TODO

  1. We need some way to handle "quick" PathRefs. These replace content-hashing with a comparison of mtime timestamps, and are used for large binary files downloaded externally, where the file almost never changes (so a timestamp is good enough to ~never invalidate it) and the file is often a large binary blob (so hashing it every time is wasteful and expensive). mtimes won't work with remote caching because each machine will download the files anew and get different download times (see the sketch below).
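To make the problem concrete, here is a small sketch of the two signature styles; the names are hypothetical and not Mill's actual implementation.

```scala
// A "quick" signature: size + mtime, cheap but machine-local.
def quickSig(p: os.Path): (Long, Long) = (os.size(p), os.mtime(p))

// A content signature: expensive for large blobs, but identical everywhere.
def contentSig(p: os.Path): Int = java.util.Arrays.hashCode(os.read.bytes(p))

// Machine A builds and uploads; machine B later downloads the same bytes, so
// contentSig agrees on both machines, but quickSig differs because each
// machine's mtime is its local download time, spuriously invalidating
// anything downstream that depends on the quick signature.
```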

@lihaoyi marked this pull request as draft on September 21, 2023, 08:33
@lefou (Member) commented Sep 21, 2023

How do you deal with absolute paths? I thought we would need to adapt PathRef before considering any remote caching.

@lihaoyi (Member, Author) commented Sep 21, 2023

From what I can tell, PathRef already ignores the enclosing path; it only cares about relative paths within the target path

That's not to say this already works across different folders though. There are a bunch of absolute paths embedded in the generated sources or bytecode that make CodeSig invalidate things. This should be fixable

@lefou (Member) commented Sep 21, 2023

Yeah, we already made PathRef.sig independent of the absolute path (#2106), yet PathRef still holds an os.Path, which is absolute, so the hashCode will change for different paths. There is also another discussion about this and a potential solution, esp. in comment #2101 (comment). IIRC, this is a bit like how the virtual file API of Zinc works, too.

@lihaoyi changed the title from "WIP remote caching" to "POC remote caching" on Sep 22, 2023
@lihaoyi (Member, Author) commented Sep 22, 2023

@lefou I've got this working end-to-end(-ish): I can build in one folder, go to another folder, and have everything be downloaded from the remote cache. Still pretty rough, but it would be great if you could take a look. Seems like a pretty small change overall, but it raises some questions that are probably worth discussing:

  • Is this the correct way of handling PathRefs?
  • How do we mark targets as do-not-cache?
  • Should we have a more flexible way of tagging tasks vs the current sub-class-based approach?
  • How should we configure the remote cache client?

@lefou (Member) commented Sep 22, 2023

I have never used Bazel or bazel-remote, but I am interested in this topic as well. How does the cache handle older results? Does it store only the latest result, or all of them until some space/count criterion is met? Does that mean git bisecting may always find cached results in the best case?

I think a sub-goal of getting remote caching to work is making the out directory relocatable. So when we decide on implementation details, we should favor those that also further this goal.

  • I made JsonFormatters.pathReadWrite to serialize relative paths whenever the path is within os.pwd

Without having inspected that change yet, I think we should always encode paths as "relative" to some "base", and the first "base" I can imagine is the local workspace. Another is the (relocatable) out dir. Coursier caches are good candidates, too. We should specify supported bases and encode their standardized name in the serialized path. Results pointing to other paths shouldn't be cacheable by default.
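A minimal sketch of what such base-relative encoding could look like; the base names and locations below are purely illustrative.

```scala
// Serialize each path relative to the first known base that contains it,
// embedding the base's standardized name. Paths under no known base yield
// None and would not be cacheable by default.
def encode(p: os.Path, bases: Seq[(String, os.Path)]): Option[String] =
  bases.collectFirst {
    case (name, base) if p.startsWith(base) => s"$$$name/${p.relativeTo(base)}"
  }

// More specific bases come first, since out/ lives inside the workspace.
val bases = Seq(
  "OUT"       -> (os.pwd / "out"),
  "WORKSPACE" -> os.pwd,
  "COURSIER"  -> (os.home / ".cache" / "coursier")
)
// encode(os.pwd / "src" / "Foo.scala", bases) == Some("$WORKSPACE/src/Foo.scala")
```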

  • I tweaked the CodeSig logic to ignore any string literals that get passed to sourcecode.File, similar to how it already ignores integer literals passed to sourcecode.Line, since most of these are just logging. I then had to wrap a few usages where we actually care about the paths in T.inputs, e.g. millBuildRootModuleInfo

I'm not really sure this is the right thing. It seems like an acceptable trade-off to avoid recompilation in Mill's build scripts, but for shared user code it is probably wrong. It may result in wrong log messages in production code, for example. Instead, we should strive for some alternative to sourcecode.File that contains a relative path (to the source root), or something like that. (Here I assume the issue is the changing paths due to the differing absolute locations of source files on developer machines.)

  • We currently bundle up the entire .dest folder if present and ship it to the remote cache. This can probably be optimized to only upload things referenced via PathRefs

Agreed, the latter may reduce size and avoid leakage of temporary results. Persistent targets are probably a special case, as we can't reason about what a persistent target needs in order to fulfill its requirement to be transparent.

  • We need some way to annotate things like resolvedIvyDeps to ensure they are not remote cached, so they can be downloaded anew on each machine.

Here, the returned result contains PathRefs pointing outside the out dir. We either ignore those (as not cacheable by default) or try to match them to other known base directories like a coursier cache. But even in that case, it's probably enough to just cache the json and distribute the coursier artifacts via some dedicated shared cache, if necessary. We already have support in the Evaluator to deal with PathRefs which can't successfully revalidate when de-serialized. This should be enough to try to re-use coursier cached artifacts but re-evaluate them in case they are missing or inconsistent.

  • The remote cache will likely need a lot more configuration in future: certificates, authentication, HTTP proxy config, etc.. This stuff cannot live in any build.sc, even a meta-build, since it is necessary to evaluate the build.sc's tasks, unless we accept that the meta-build is never going to be remote-cached. The alternative is to put it in some JSON/YAML file somewhere

I think this should be part of a local Mill configuration. We already have a plan to introduce a Mill config file (#1226). If we apply best practices for Linux tools, we could have some system-wide / user-specific / project-specific override concept, so the user can keep this configuration in a separate safe place.

@lihaoyi (Member, Author) commented Sep 22, 2023

I have never used Bazel or bazel-remote, but I am interested in this topic as well. How does the cache handle older results? Does it store only the latest result, or all of them until some space/count criterion is met? Does that mean git bisecting may always find cached results in the best case?

It's up to the cache server implementation, but yes, git bisect, and checking out old branches in general, should hopefully hit the cache. testLocal should hit the cache too, since it's a target. Basically, checking out old commits and re-building should be "instant", at least until you check out a commit old enough that its cache entries have been evicted

bazel-remote lets you configure a maximum cache size, beyond which it evicts entries. It can also be configured with cloud storage backends, e.g. S3, which can themselves be configured with various eviction policies. Mill doesn't need to know about any of that, and will just re-build anything that it cannot find in the remote cache. We can leave it up to the backend to decide exactly how long things stay remote-cached.

@lihaoyi changed the title from "POC remote caching" to "Remote Caching" on Sep 25, 2023
@lihaoyi changed the title from "Remote Caching" to "POC Remote Caching" on Sep 28, 2023
@sgammon commented Jan 13, 2024

@lihaoyi Hey there. We make Buildless; I saw your comment in the discussion. We'd love to help test this on the server implementation side. Our gRPC services for Bazel should be up and running soon. Would it be helpful to build locally and give it a shot, or is it too early?

We spend a lot of time on build caching, so we may be able to provide some resources. For instance, we published the Bazel remote APIs as a Buf module, which might let you depend on the generated protos through a Maven dependency if you want to. This is generated directly from the protos and can be updated from that repo, which has now been handed over to the Buf and Bazel teams to maintain.

On the headers and auth side, support for adding custom headers would certainly cover most basic remote-caching auth needs (we support HTTP basic, Authorization with bearer tokens, X-API-Key, etc., since some tools are more customizable than others). Whatever is easiest there would work for us if there is interest in enabling integration with our service and local tools. I have definitely seen more requirements, like mTLS and corporate proxies, etc., and can expand on that if it's at all helpful.

Anyway, this looks awesome. I'm a lay follower of Mill but this gets me more interested 😄
