FR: Store resolved repository attributes in the Bzlmod lockfile #19026

Open
fmeum opened this issue Jul 24, 2023 · 8 comments
Labels
area-Bzlmod (Bzlmod-specific PRs, issues, and feature requests), P2 (We'll consider working on this in future; assignee optional), team-ExternalDeps (external dependency handling, remote repositories, WORKSPACE file), type: feature request

Comments

fmeum (Collaborator) commented Jul 24, 2023

Description of the feature request:

The information about resolved attributes returned by repo rules and collected by --experimental_repository_resolved_file should be stored in the Bzlmod lockfile.

Repository rules already use this mechanism to lazily (that is, only when evaluated) return (often cryptographic) identifiers that make their results fully reproducible, which can also result in faster fetches (via Bazel's repository cache).

If MODULE.bazel.lock could be updated with this information incrementally, users would have strong "trust on first use" guarantees without having to manually provide cryptographic identifiers in an extension-dependent manner.
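As a rough sketch of the existing mechanism (not actual Bazel or ruleset code; all names are invented), a repo rule's implementation function can return a dict of attribute values that make the fetch reproducible - this is what --experimental_repository_resolved_file collects today and what could be persisted in MODULE.bazel.lock:

```starlark
def _foo_archive_impl(repository_ctx):
    # sha256 may be empty on the first fetch; the download result reports the
    # hash that was actually observed.
    download_info = repository_ctx.download_and_extract(
        url = repository_ctx.attr.url,
        sha256 = repository_ctx.attr.sha256,
    )
    repository_ctx.file("BUILD.bazel", 'filegroup(name = "srcs", srcs = glob(["**"]))')
    # Returning a dict tells Bazel which attribute values would make this repo
    # reproducible; today this feeds --experimental_repository_resolved_file,
    # and the request here is to also record it in the Bzlmod lockfile.
    return {
        "url": repository_ctx.attr.url,
        "sha256": download_info.sha256,
    }

foo_archive = repository_rule(
    implementation = _foo_archive_impl,
    attrs = {
        "url": attr.string(mandatory = True),
        "sha256": attr.string(),
    },
)
```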

What underlying problem are you trying to solve with this feature?

Lockfiles for package managers usually serve two purposes:

  1. Pinning a particular resolution of all (transitive) dependency requirements to a set of deps and their versions.
  2. Recording cryptographic identifiers for the artifacts representing these resolved deps, e.g. hashes of archives or their contents.

With the Bzlmod lockfile available as of Bazel 6.3.0, there is a general solution for 1. that module extensions can use instead of rolling their own.

However, 2. remains unsolved: when wrapping an external package manager for the fictitious foo language into a module extension foo_deps, users who want to benefit from Bazel's repository cache either have to manually specify hashes of the resolved deps in their MODULE.bazel file or in a file referenced from it (such as go.sum for Go), or have to rely on custom module extension/repo rule logic to generate such a file (such as rules_jvm_external's concept of pinning). The latter usually requires eagerly fetching all repos once to generate the lockfile, which can be slow. Since this process differs between module extensions, it results in additional friction beyond just updating Starlark files.
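For illustration, a hypothetical MODULE.bazel snippet for such a foo_deps extension might look like this today, with the hash maintained by hand (the extension, tag class, and attributes are invented):

```starlark
foo_deps = use_extension("//:foo_deps.bzl", "foo_deps")
foo_deps.dep(
    name = "bar",
    version = "1.2.3",
    # Manually copied from upstream; the feature requested here would let the
    # repo rule report this hash and have Bazel record it in MODULE.bazel.lock.
    sha256 = "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
)
use_repo(foo_deps, "bar")
```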

Which operating system are you running Bazel on?

N/A

What is the output of bazel info release?

N/A

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

All module extensions for languages that do not come with a well-established lockfile format currently need to come up with their own format. This includes existing rulesets such as rules_jvm_external as well as new developments (e.g. for PHP).

Any other information, logs, or outputs that you want to share?

No response

Pavank1992 added the team-ExternalDeps label on Jul 24, 2023
meisterT (Member) commented:

cc @SalmaSamy

meteorcloudy (Member) commented:

> The information about resolved attributes returned by repo rules

This will require the repo rules used by the package manager ruleset to support resolved attributes, right? IIRC, currently only the http_* and git_repository repository rules return resolved attributes?

> The latter usually requires eagerly fetching all repos once to generate the lockfile, which can be slow.

This is still true if we delegate the job to the repository rule, right? The resolved attributes are returned by the implementation function of the repo rule, which means the repo must be fetched at least once to get those resolved attributes.

fmeum (Collaborator, Author) commented Jul 25, 2023

> The information about resolved attributes returned by repo rules

> This will require the repo rules used by the package manager ruleset to support resolved attributes, right? IIRC, currently only the http_* and git_repository repository rules return resolved attributes?

Third-party repo rules can and do support this as well. Even if they don't yet, this is a well-defined existing mechanism to return "pinning" information.

> The latter usually requires eagerly fetching all repos once to generate the lockfile, which can be slow.

> This is still true if we delegate the job to the repository rule, right? The resolved attributes are returned by the implementation function of the repo rule, which means the repo must be fetched at least once to get those resolved attributes.

Yes, this is similar to how module extension results are added to the lockfile as they are run, rather than all at once. I would imagine this feature emitting the resolved attributes only when a repo rule is actually fetched. If someone wants to eagerly lock everything, they could just use the upcoming offline mode feature to force everything to be fetched.

meteorcloudy (Member) commented:

> I would imagine this feature emitting the resolved attributes only when a repo rule is actually fetched.

Does this mean that if I only do dependency resolution, the lockfile will be generated without the resolved attributes info?
Then, when the external dep is fetched, the lockfile would be updated? It sounds a bit weird that the lockfile actually changes after a dependency is fetched.

fmeum (Collaborator, Author) commented Jul 25, 2023

> Does this mean that if I only do dependency resolution, the lockfile will be generated without the resolved attributes info? Then, when the external dep is fetched, the lockfile would be updated? It sounds a bit weird that the lockfile actually changes after a dependency is fetched.

Yes, I think that would necessarily be the case with lazy fetches, and it would transparently improve the situation in "update" mode: compared to now, deps that were fetched before a bazel shutdown would be fetched more efficiently afterwards.

In order to generate a classical lockfile for everything, fetching everything would be required, as is the case now with custom implementations such as r_j_e's Maven pinning.

Wyverald added the P2 and area-Bzlmod labels and removed the untriaged label on Jul 25, 2023
illicitonion (Contributor) commented:

> The latter usually requires eagerly fetching all repos once to generate the lockfile, which can be slow.

I'm not sure this is as avoidable as the issue description suggests. I think there are two styles of repository rules in use:

  1. Repository rules which are handed complete information and just change its format (I believe the Go rules work like this but I'm not sure - you hand them a go.sum file and that contains enough information to generate all of the repositories it needs). These are presumably not very interesting as far as this proposal is concerned, because that text -> text translation of complete information is already not expensive (but centralising the data and making the update workflows consistent is a great aim).
  2. Repository rules which need to do some kind of global resolve step in order to decide version numbers (e.g. rules_jvm_external and rules_rust where the input to the repository rule is incomplete - it may be missing transitive dependencies, or it may just contain version requirements not resolved to full final dependency versions) - in these cases, the global resolve step already has to do the work that probably involves getting a sha256 of every artifact anyway, because it probably involves needing to download per-target metadata in order to discover what dependencies that target has in order to complete the resolve.

I'm really excited by the idea of having a standard format and a standard mechanism for repinning, but I'm not sure to what extent incremental fetching is a practical goal.


On repinning - one of the big issues we face in rules_jvm_external and rules_rust is a conflict between two desires:

  1. We never want to perform a build where a lockfile is out of date. If you update a version requirement from ^1.0.0 to 1.1.0, we don't want to perform a build with version 1.0.0 - you may produce a binary that has a known security vulnerability. This used to be how rules_jvm_external worked, and got changed in "Detect when pinned inputs have changed and stop the build" (rules_jvm_external#509) because it caused surprises.
  2. We don't want to update the versions you use without your consent. If you've pinned to version 1.0.0, we shouldn't update you to 1.1.0 just because it's available, without you choosing to do so. (If you want that behaviour, you can get it by not opting in to a lockfile.)

Because when a repository rule runs we can't detect whether you're repinning, we need to do something like sniff an env var we get users to manually set to know we shouldn't detect that things are out of date and error. (rules_rust automatically repins if this env var is set, rules_jvm_external just doesn't error if it's set, but requires you to bazel run a special target to do the repin).
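Roughly, that check looks something like this sketch (the env var name and function are invented, not the actual rules_rust or rules_jvm_external code):

```starlark
def _check_lockfile_freshness(repository_ctx, locked_hash, current_hash):
    # Users export e.g. FOO_REPIN=1 before running the repin target; the repo
    # rule cannot otherwise tell a repin run apart from a normal build.
    repinning = repository_ctx.os.environ.get("FOO_REPIN", "") != ""
    if not repinning and locked_hash != current_hash:
        fail("Lockfile is out of date; set FOO_REPIN=1 and rerun to repin.")
```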

A problem we run into here is that there may be a required order to do repins, and we may need to ignore bad repins (or get them done as a side effect). Let's say both Java and Rust need re-pinning - right now, there's no good way to express "Rust - ignore that you're out of date, I'm repinning Java, I'll come back to you soon to repin yourself". It's fortunate that these rulesets don't depend on each other, but ideally we'd be repinning rulesets in topological order...

Another thing to note on repinning - rules_rust has different modes of repinning - you may be repinning to say:

  1. I want to update this one dependency to version X
  2. I want to update all my dependencies to their latest versions
  3. I just added a new dependency, do the minimal updates required to pull in that dependency, but keep everything else as close to how it was as possible
  4. I just updated my rules_rust, I don't think anything has actually changed, but rules_rust isn't sure so it should check

We should work out how to support these :)

fmeum (Collaborator, Author) commented Aug 14, 2023

> Repository rules which need to do some kind of global resolve step in order to decide version numbers (e.g. rules_jvm_external and rules_rust where the input to the repository rule is incomplete - it may be missing transitive dependencies, or it may just contain version requirements not resolved to full final dependency versions) - in these cases, the global resolve step already has to do the work that probably involves getting a sha256 of every artifact anyway, because it probably involves needing to download per-target metadata in order to discover what dependencies that target has in order to complete the resolve.

I agree that this is the interesting case. Even in this case, isn't there a difference between fetching the metadata for each dep eagerly and truly fetching all deps' artifacts eagerly?

I don't know Cargo that well but for r_j_e, a module extension could potentially just download the poms and sha256 files and then use that information to generate http_jar repos for each dep (plus aliases pointing to them from the hub repo). The actual jar files would then be downloaded lazily. In this case, the existing lockfile functionality would be sufficient to capture all information required by the extension for subsequent builds. If there is a need for more metadata, then http_jar could be replaced with a dedicated repo rule that captures the required per-dep information.
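As a sketch of that approach (extension and attribute names invented, with one hard-coded coordinate standing in for real resolution logic):

```starlark
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_jar")

def _foo_maven_impl(module_ctx):
    # A real extension would derive these coordinates from tags and pom files.
    artifacts = [struct(
        repo_name = "com_example_bar",
        url = "https://repo1.maven.org/maven2/com/example/bar/1.0/bar-1.0.jar",
    )]
    for artifact in artifacts:
        # Eagerly fetch only the tiny .sha256 sidecar file, not the jar itself.
        out = artifact.repo_name + ".jar.sha256"
        module_ctx.download(url = artifact.url + ".sha256", output = out)
        sha256 = module_ctx.read(out).strip().split(" ")[0]
        http_jar(
            name = artifact.repo_name,
            urls = [artifact.url],
            # The jar is only downloaded (and cached by its hash) when this
            # repo is actually fetched by a build.
            sha256 = sha256,
        )

foo_maven = module_extension(implementation = _foo_maven_impl)
```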

The changes I propose in this issue wouldn't be relevant in this context. They would apply if there were no way to get the hash of a jar without downloading it first, which is the case in other situations (think http_archive).

> On repinning - one of the big issues we face in rules_jvm_external and rules_rust is a conflict between two desires: ...

With what I described above, r_j_e's fail_if_repin_required would translate to --lockfile_mode=error. Even in update mode, version requirements would always be honored, potentially ignoring the lockfile if its contents don't match.

> Because when a repository rule runs we can't detect whether you're repinning, we need to do something like sniff an env var we get users to manually set to know we shouldn't detect that things are out of date and error. (rules_rust automatically repins if this env var is set, rules_jvm_external just doesn't error if it's set, but requires you to bazel run a special target to do the repin).
> ...

Could you explain in more detail why these different repinning modes need to exist?

Ideally (again, please give this a reality check ;-) ), we could split this into two different actions: resolving version requirements and manipulating version requirements. If users don't want this process to pick up newly released versions of unrelated deps, shouldn't they express this in their version requirements, e.g. by pinning to patch versions? Of course this can also be done by a tool that updates the version requirements with the result of the resolution - crucially, this would move the complexity of specifying exactly what a user wants to be kept at what version to the requirements instead of the lockfile or module extension logic.

> A problem we run into here is that there may be a required order to do repins, and we may need to ignore bad repins (or get them done as a side effect). Let's say both Java and Rust need re-pinning - right now, there's no good way to express "Rust - ignore that you're out of date, I'm repinning Java, I'll come back to you soon to repin yourself". It's fortunate that these rulesets don't depend on each other, but ideally we'd be repinning rulesets in topological order...

This does sound pretty complex and I hope we never get to that point, but if we do, module extensions can depend on others, which would give you the capability to cleanly evaluate them in topological order.

illicitonion (Contributor) commented:

> Could you explain in more detail why these different repinning modes need to exist?

I think there are three main reasons people repin:

  1. I want to update to the newest versions of things available (e.g. to pick up potential bugfixes). For whatever reason, I don't want to update my version requirement (maybe because I am still compatible with the older version, but the newer version may contain optimisations or bugfixes or whatever; maybe because I want to update transitive deps that I don't actually mention in my version requirements at all; maybe because I don't know what packages have updates and want to use this update as a discovery mechanism). In this case, the algorithm we want to follow is roughly: "For each package, look for the newest version which satisfies my version requirements; if it's newer than what's in the lockfile, update it". In a Cargo world, this is what cargo update does. (See the rough sketch after this list.)
  2. I have added a new dependency (or want to update one dependency), and want to perform the minimum required update in order to pull it in. I'm perfectly happy with the versions I'm using for everything, I've performed security audits on them, I have cache hits in my remote caches, I want to keep using them, but I need to add something to the lockfile (and maybe update a common transitive dependency). If I need to update that transitive dep to add my new one, that's fine, I can do it and perform a new audit (so I don't want to specify an exact match version requirement), but if I can avoid it I'd like to do so. In this case, the algorithm we want to follow is roughly: "Add a new package, check if the existing resolve is compatible with its transitive dependencies, and incrementally update/downgrade anything which conflicts". In a Cargo world, this is what happens if you edit your Cargo.toml file and run a cargo build.
  3. Some external tooling requires me to repin my dependencies, but I don't expect anything to change. With our existing pinning mechanisms, both "the version of Rust you're using" and "the version of rules_rust you're using" factor into the generated lockfile, and may change formats or hashes in them. In practice, this is effectively equivalent to a "new dependency" update, but where there are 0 new dependencies.
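As a toy illustration of the "update everything" mode from item 1 (not real ruleset code; the helpers and the version representation are invented):

```starlark
def _update_all(requirements, locked_versions, available_versions, satisfies):
    # For each package, pick the newest available version that still satisfies
    # the declared requirement; only move away from the locked version if that
    # newest satisfying version is actually newer.
    new_locked = {}
    for pkg, requirement in requirements.items():
        candidates = [v for v in available_versions[pkg] if satisfies(v, requirement)]
        newest = sorted(candidates)[-1]
        locked = locked_versions.get(pkg, newest)
        new_locked[pkg] = newest if newest > locked else locked
    return new_locked
```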

> Ideally (again, please give this a reality check ;-) ), we could split this into two different actions: resolving version requirements and manipulating version requirements.

Yes - I agree with this. Right now this is made much harder by the fact that these manipulations are done via running bazel builds, and so we have a bootstrapping problem of "In order to use bazel to do a manipulation, the fact that the lockfile may be invalid needs to be ignored", which is what gives rise to the environment variables in the first place. If bazel itself had the knowledge that "version requirements / resolved versions are being manipulated, don't perform those checks", that would make things much tidier.

> If users don't want this process to pick up newly released versions of unrelated deps, shouldn't they express this in their version requirements, e.g. by pinning to patch versions?

From a theoretical perspective, yes, but in practice I think there are valid reasons to want to specify "Technically I'm compatible with newer versions than I happened to most recently lock, but I'd rather avoid them unless I need to". Before we had this functionality in rules_rust, we received several complaints about the unexpected updates (e.g. bazelbuild/rules_rust#1522, bazelbuild/rules_rust#1535 (comment), bazelbuild/rules_rust#1231). The main reasons I think people care about this are:

  1. Avoiding rebuilding the world just because there's one new dependency. Ideally adding a dependency should only invalidate caches related to that dependency, but if suddenly your entire transitive dependency footprint has slightly newer versions, you may suddenly have a double-digit-minute wait while things re-compile even though all you wanted to do was add a new dependency which doesn't impact anything else in your tree.
  2. Sometimes there's extra work to do to get a version building with Bazel which already builds with Cargo (e.g. needing to set new env vars or compiler flags, or needing more runfiles to be present), and while people may be willing to do that work if needed, they may want to avoid it.
  3. Transitive dependency resolution may require a version which wasn't present in your specified version requirements at all. But keeping these consistent where possible may be desirable for the above reasons.

In most other cases (e.g. a breaking API change), I'd agree that version requirements should be updated, but I can see reasons for wanting to minimise churn nonetheless.

fmeum changed the title from "Store resolved repository attributes in the Bzlmod lockfile" to "FR: Store resolved repository attributes in the Bzlmod lockfile" on Apr 25, 2024