Stop doing manual string formatting of output #655

illicitonion · 2021-03-29T15:57:04Z

Use tera for all rendering.

There are things we will want to change, both about the output, and how
it's computed, but this is an incremental step in more legible, testable
code.

Also, add some hermetic unit tests for rendering.

Use tera for all rendering. There are things we will want to change, both about the output, and how it's computed, but this is an incremental step in more legible, testable code. Also, add some hermetic unit tests for rendering.

UebelAndre

This seems good to me.

One thing I'd like to see is a more declarative implementation of each component. I quite like how cargo-raze is shaking out to be.

This may be my own lack of rust experience but I've definitely found imperative styles to be difficult to test and harder to follow. No action for this PR but something I'd love to either learn more about (if you have some recommended links) or keep in mind for future changes.

UebelAndre · 2021-03-29T16:03:20Z

crate_universe/tests/renderer.rs

+    let output = String::from_utf8(output).expect("Non-UTF8 output");
+
+    // TODO: Have a single function with kwargs to enable each kind of dep, rather than multiple functions.
+    let expected_repository_rule = r###"CRATE_TARGET_NAMES = {


Feeling nervous about the volume of dumped text in the tests right now. I don't think this is a blocker for this PR, but I think it'd be better to implement #653 and have more asserts around data instead of massive string comparison.

Oh yeah, I think that everything except the renderer should be tested using much more data-y ways, but I think the renderer fundamentally kind of has to be checking raw strings against each other.

For the renderer tests themselves, making the things being tested be much more targeted (e.g. single crates), and hermetic (hard-coded inputs), and potentially using better assertions libraries, helps here, but at the end of the day these tests test what we're really outputting the right thing. I also find it pretty useful to be able to look at the diffs in PRs to see what "really" changed. I'm hoping that these small number of better tests will be sufficiently maintainable, and that we'll delete the huge unhermetic ones when we have suitable targeted replacements for each pipeline stage :)

Ah, yeah, sorry, I didn't realize I'd put that comment in the renderer code. I agree the renderer should be the place to do these kinds of tests. This is my first time really looking at the tests so just wanted to reiterate that issue.

illicitonion · 2021-03-29T17:00:12Z

One thing I'd like to see is a more declarative implementation of each component. I quite like how cargo-raze is shaking out to be.

I suspect we have similar aims and ideals here! If you ignore a bit of the noise (which we should totally clean up), the core of crate_universe_resolver's main.rs is:

let opt = Opt::from_args();
let config = ...;
let mut resolver = config.preprocess()?;
if lockfile_up_to_date {
  copy_lockfile(...);
  return Ok(());
}
let consolidator = resolver.resolve()?;
let renderer = consolidator.consolidate()?;
renderer.render(&mut output_file)?;

The real difference compared with the cargo-raze equivalent you linked to is that each component returns a state object whose no-argument function is the next stage in the pipeline (i.e. calls ::new itself, and stashes the data in self) rather than returning the argument which is manually passed to the next stage in the pipeline.

I imagine you (and I!) may prefer this to be written as:

let opt = Opt::from_args();
let config = ...;
let versions = resolve(&config);
let consolidated_versions = consolidate(&config, &versions);
render(&config, &consolidated_version, &mut output_file);

The main reason we ended up with the current structure, rather than returning values that we pipe into the next function manually, is that we wanted to ensure the hash generated in Resolver factors in everything that may influence the final rendering. If the Renderer (or the Consolidator, or really anything other than the Resolver itself) took in extra parameters which weren't transitively derived from the input to Resolver, this could affect the generated output in ways that are significant, but don't lead to an invalidation of the lockfile - this is the only way we know that we can skip re-resolving and early-return. Otherwise we'd need to be very diligent in our code review to make sure people aren't passing around data where they shouldn't - "there are no arguments" is trivially verifiable - we just need to make sure the resolver does hashing properly (and you'll get a compile error if you add a field to Resolver and don't explicitly factor it into the hash or ignore it). To take your cargo-raze example, I could easily add an overrides argument to render_files which would change the output, and if I did that, it wouldn't be appropriate to early-return looking only at the result of load_raze_settings or fetch_raze_metadata. This isn't a problem for cargo-raze because it doesn't try to do that early return, but it is relevant here given how the lockfiles currently work.

As we've discussed elsewhere (and it's worth writing down!), I'm very happy to change how lockfiles work, which would make me feel a lot more comfortable changing that code structure to be more "return a value and pass that as an argument to another function" rather than "return an object with a no-argument function to call", as long as we preserve the existing properties of the system - that if you're using a lockfile:

When I first git clone and bazel build a repo with a lockfile in place, I don't need to do expensive work. Expensive work, in my mind, includes fetching a crates index, or performing constraint solving on versions. Basically, we shouldn't need to invoke cargo.
If any of the Cargo.toml files or package entries, or anything else we use in version resolution (e.g. the version of cargo) change in ways that change anything about how the final targets are generated, it's impossible to use stale values from a previous generation.

I'm imagining that will probably look like changing from "copy lockfile and early return" to encapsulating the cache key checking into the Resolver (or a wrapper around it), so that everything before that stage really needs to be carefully encapsulated into a cache key, but we then always re-execute all of the following code, which means that we don't need to worry about data being passed around inappropriately because we'll be executing the code anew regardless :)

UebelAndre · 2021-03-29T21:49:02Z

The main reason we ended up with the current structure, rather than returning values that we pipe into the next function manually, is that we wanted to ensure the hash generated in Resolver factors in everything that may influence the final rendering

I need to digest your previous comment but one thing that stands out to me is that there are a bunch of environment variables going into the hash? This seems pretty undesirable and I feel effectively makes the lockfile un-sharable.

rules_rust/crate_universe/src/resolver.rs

Lines 187 to 193 in e744b93

    
           if env_name.starts_with("CARGO") && env_name != "CARGO_NET_GIT_FETCH_WITH_CLI" { 
        
               eprintln!("Warning: You have the {} environment variable set - this may affect your crate_universe output", env_name); 
        
               hasher.update(env_name); 
        
               hasher.update(b"\0"); 
        
               hasher.update(env_value); 
        
               hasher.update(b"\0"); 
        
           }

Not sure if this is the appropriate place to discuss this but why is this the case?

illicitonion · 2021-03-29T22:05:22Z

Not sure if this is the appropriate place to discuss this but why is this the case?

Probably not, but we don't have a better one at the moment, and we're here, so... 😁

one thing that stands out to me is that there are a bunch of environment variables going into the hash? This seems pretty undesirable and I feel effectively makes the lockfile un-sharable.

rules_rust/crate_universe/src/resolver.rs

Lines 187 to 193 in e744b93

if env_name.starts_with("CARGO") && env_name != "CARGO_NET_GIT_FETCH_WITH_CLI" {

eprintln!("Warning: You have the {} environment variable set - this may affect your crate_universe output", env_name);

hasher.update(env_name);

hasher.update(b"\0");

hasher.update(env_value);

hasher.update(b"\0");

}

Yeah, this is not super desirable, but...

Anything which may affect the resolve needs to go in the hash. So imagine a (fictional) Cargo env var CARGO_OVERRIDE_REPOSITORY=lazy_static=/path/to/lazy_static - if someone set that, and we honoured it for one user but not anther, or for one bazel invocation but not another, that would be problematic.

There are a few approaches we could take here:

We could allow them all, ignore them, and hope they don't affect the resolve in practice.
We could hard-crash if any CARGO env vars are present.
We could unset the env vars when we invoke cargo (possibly allowing an allow-list), but folks may need them and ideally they won't need to PR to us every time they want to use another one to add it to a list.
We could proactively audit every CARGO env var and decide whether to handle each one. There are many, and the list grows.
We could hope that people don't set them very often, allow them all, factor them into the cache key, but not digest ones we've specifically audited as not affecting the resolve.

There are probably other approaches too. We sided on the last of these options because:

I don't observe people setting CARGO env vars very often, unless they're actively experimenting with something (so the impact on cache keys is rare). But this may not be representative.
It unblocks users as much as possible. If they want to use an env var, they can use an env var (though they'll get a warning telling them they may be making life harder for folks). But if their team/users/community aren't aligned about it, they either need to drive alignment, or:
There's a good get-out - if an env var doesn't affect the resolve, it can be audited and added to the list of env vars ignored by the digesting (as is done with CARGO_NET_GIT_FETCH_WITH_CLI).

This was the best trade-off we came up with, but as with everything, very open to other suggestions/ideas!

UebelAndre · 2021-03-29T22:11:47Z

1. I don't observe people setting `CARGO` env vars very often, unless they're actively experimenting with something (so the impact on cache keys is rare). But this may not be representative.

Again, still need to take the time to read your thorough responses but CARGO is used to control the path to the cargo binary in a few tools I've come across. cargo_metadata does this and so does cargo-raze. I would say instead of trying to hash all environment variables, we disallow the use of them and create a "clean" environment for cargo to run in. It should be the job of the resolver to handle any and all inputs and guarantee a reproducible lockfile across different machines and platforms.

illicitonion · 2021-03-29T22:26:23Z

1. I don't observe people setting `CARGO` env vars very often, unless they're actively experimenting with something (so the impact on cache keys is rare). But this may not be representative.
Again, still need to take the time to read your thorough responses but CARGO is used to control the path to the cargo binary in a few tools I've come across.

CARGO is likely the least problematic one, and we can probably safely strip it away (because we explicitly choose the path of cargo that we run - it's the one we fetch as part of rules_rust). But some of the env vars are used for things like setting up corporate proxies for cargo to use (which perhaps a whole company uses, and without which cargo may fail), or enabling cargo to fetch git deps with the git CLI (which may allow for more authentication mechanisms than cargo's usage of libgit allows).

I would say instead of trying to hash all environment variables, we disallow the use of them and create a "clean" environment for cargo to run in.

If we agree (I don't know if we do) that some env vars may need to be passed to cargo, then it comes down to a user freedom trade-off: We can either allow by default (unblocking users, but potentially making cache keys over-sensitive), or deny by default (meaning some users won't be able to use crate_universe at all until either they fork the project, or file and get fixed an issue/PR)... Concretely, I have needed to run with CARGO_NET_GIT_FETCH_WITH_CLI=true in the past, so there is at least a need for "allow some way of configuring cargo specially some of the time"...

UebelAndre · 2021-03-29T22:32:36Z

If we agree (I don't know if we do) that some env vars may need to be passed to cargo, then it comes down to a user freedom trade-off: We can either allow by default (unblocking users, but potentially making cache keys over-sensitive), or deny by default (meaning some users won't be able to use crate_universe at all until either they fork the project, or file and get fixed an issue/PR)... Concretely, I have needed to run with CARGO_NET_GIT_FETCH_WITH_CLI=true in the past, so there is at least a need for "allow some way of configuring cargo specially some of the time"...

What I mean is we should only be including environment variables that we can guarantee have the same impact on the lockfile no matter what machine/environment they're used in. I don't even think CARGO is appropriate since if I run on my linux workstation, I'll generate a lockfile that no longer matches the one generated on my mac. It seems totally reasonable to allow users to explicitly specify an environment variable they'ed like to include, but the rule should not by default include all environment variables.

illicitonion · 2021-03-29T22:42:00Z

If we agree (I don't know if we do) that some env vars may need to be passed to cargo, then it comes down to a user freedom trade-off: We can either allow by default (unblocking users, but potentially making cache keys over-sensitive), or deny by default (meaning some users won't be able to use crate_universe at all until either they fork the project, or file and get fixed an issue/PR)... Concretely, I have needed to run with CARGO_NET_GIT_FETCH_WITH_CLI=true in the past, so there is at least a need for "allow some way of configuring cargo specially some of the time"...

What I mean is we should only be including environment variables that we can guarantee have the same impact on the lockfile no matter what machine/environment they're used in. I don't even think CARGO is appropriate since if I run on my linux workstation, I'll generate a lockfile that no longer matches the one generated on my mac. It seems totally reasonable to allow users to explicitly specify an environment variable they'ed like to include, but the rule should not by default include all environment variables.

Yeah, I think that's reasonable. I've been taking the approach of "If you decided to run Bazel with it, and decided you didn't care about the warning, you opted in to it being included", but maybe we should switch to an explicit allow-list. Whether we should allow the user to specify it, or hard-code it in crate_universe itself, I'm not sure...

UebelAndre · 2021-03-29T23:05:22Z

Yeah, I think that's reasonable. I've been taking the approach of "If you decided to run Bazel with it, and decided you didn't care about the warning, you opted in to it being included", but maybe we should switch to an explicit allow-list. Whether we should allow the user to specify it, or hard-code it in crate_universe itself, I'm not sure...

My goal is just to be able to share a lockfile across machines and operating systems. Consider the case of building crate_universe in Bazel. If I build it on a linux machine, I'm going to need a whole new lockfile on windows and macos because the hash will not match. We should be able to hookup crate_universe with it's own rule to generate 1 lockfile that is shared for each platform rules_rust supports.

illicitonion · 2021-03-29T23:14:07Z

Yeah, I think that's reasonable. I've been taking the approach of "If you decided to run Bazel with it, and decided you didn't care about the warning, you opted in to it being included", but maybe we should switch to an explicit allow-list. Whether we should allow the user to specify it, or hard-code it in crate_universe itself, I'm not sure...

My goal is just to be able to share a lockfile across machines and operating systems. Consider the case of building crate_universe in Bazel. If I build it on a linux machine, I'm going to need a whole new lockfile on windows and macos because the hash will not match. We should be able to hookup crate_universe with it's own rule to generate 1 lockfile that is shared for each platform rules_rust supports.

The lockfile is currently cross-platform. I'm using the same lockfile across Linux and Mac machines today.

Environment variables are, as far as I know, the only weird corner, and even those are only problematic if you do things like set paths in them... (Though, as you note, stripping out the CARGO env var if set is probably worthwhile!) But also, my user base hasn't set any CARGO_ env vars when running bazel, which may be uncharacteristic, so this hasn't been a problem in practice so far.

UebelAndre · 2021-03-29T23:20:19Z

The lockfile is currently cross-platform. I'm using the same lockfile across Linux and Mac machines today.

Environment variables are, as far as I know, the only weird corner, and even those are only problematic if you do things like set paths in them... (Though, as you note, stripping out the CARGO env var if set is probably worthwhile!) But also, my user base hasn't set any CARGO_ env vars when running bazel, which may be uncharacteristic, so this hasn't been a problem in practice so far.

Nice, I didn't really understand how that can be the case but glad that's the case. I think the smarter approach would still be to not include all environment variables by default if there's a chance someone could unknowingly affect the generated lockfile. I'd imagine it'd be super frustrating to not realize something was set in your environment and commit a lockfile that no one else on your team can use.

UebelAndre · 2021-03-30T13:35:24Z

Ok, so having understood everything written now, my general response to it all (and reiteration of things I've already said) is we should only force things to be included that affect the outputs and allow for users to explicitly request that additional environment variables get used. I don't want to have anyone be in a position where they need to open a PR to get some environment variable on the deny-list.

This may be my lack of familiarity with cargo's dependency resolution but I feel it's not quite correct to have the version factor into the hash of the lockfile. Can two versions of cargo not generate the same set of dependencies when given the same lockfile? If this is true, then we shouldn't track the version. I feel this should be the case for any other tool/input to the resolver. Even if it's used for generating the current lockfile, if a version were to variable were to change but all the same dependencies get defined, then it should not be factored into the hash.

and re control flow of main.rs: I'd prefer it to be more explicit. I currently have the concern that too much unnecessary data is going into the resolver and I'm worried that just as users can accidentally forget to hook something up to the hashing mechanism, that they can make a change to an underlying component and unknowingly have that affect the hash. Though perhaps proper testing can be used to solve for the latter? I'd be willing to adapt to things the way they are but would appreciate some thought-out tests with "mocked" objects. Maybe something for #654

illicitonion · 2021-03-30T14:30:26Z

we should only force things to be included that affect the outputs and allow for users to explicitly request that additional environment variables get used.

I think that's a reasonable approach - given that, what should this actually look like for env vars?

Should we have an allowlist in crate_universe_resolver (requiring changes to rules_rust to allowthings)?
Should we have a user-supplied allowlist in the crate_universe rule (potentially requiring every repo to duplicate the same data)?
Should we have both of the above, so that we have a shared list, but also users can augment it if they need?
Should we ignore or warn or error if you have CARGO_ env vars set which we will ignore, because you may get surprising different results building the same code in cargo and bazel?

This may be my lack of familiarity with cargo's dependency resolution but I feel it's not quite correct to have the version factor into the hash of the lockfile. Can two versions of cargo not generate the same set of dependencies when given the same lockfile?

Given the same Cargo.toml, two versions of cargo can resolve different deps, yeah. e.g. in 1.52 rust-lang/cargo#9255 will be released, and in the next edition the default resolver is going to switch from 1 to 2.

if a version were to variable were to change but all the same dependencies get defined, then it should not be factored into the hash.

Yes, I agree with that. And definitely trimming down to the lockfile only containing the resolved version graph, and not the whole generated rules, will help trim down what needs to be factored into the hash a lot!

and re control flow of main.rs: I'd prefer it to be more explicit. I currently have the concern that too much unnecessary data is going into the resolver

Yeah, when we stop the lockfile containing the whole rendered output, we can pass a lot less to the resolver.

and I'm worried that just as users can accidentally forget to hook something up to the hashing mechanism, that they can make a change to an underlying component and unknowingly have that affect the hash.

I think those things should all be very easy to notice in code review at the moment. Currently the rule is very simple: Don't open any sources of external data (e.g. file paths, network addresses) unless you were constructed with them. So if someone started reading a hard-coded file path, that would be bad, but easily noticed.

The two places that are allowed to violate this are:

Constructing Config - we deserialize it from an external path.
Env vars are currently handled in Resolver rather than Config - we should probably make Config be the thing to read and filter env vars, rather than Resolver, so that it doesn't violate the rule :)

The two corners which are trickier are:

Stuff that cargo metadata does. For instance, if it will read files at arbitrary paths (spoilers: it does), or read the contents of paths that env vars point at (it probably does that too, but we are making a plan!), those are things we need to be aware of.
Things which contain addresses, where we only digest the address and not the results of reading from it. e.g. if an env var contains an absolute path, we currently factor that absolute path into the hash, but not the contents of the file. By filtering out more env vars and such, I hope we'll get a bit better at this - I think one of the biggest holes we have at the moment is cargo workspaces, where we probably don't properly digest/process the non-primary Cargo.toml files. But that's definitely something we can add tests for as we improve support :)

Though perhaps proper testing can be used to solve for the latter? I'd be willing to adapt to things the way they are but would appreciate some thought-out tests with "mocked" objects. Maybe something for #654

I am very supportive of making our tests hermetic! The only real requirements we have that scrape at the edges of hermeticity are:

We need some way of showing that the system actually works end-to-end - that could be by running the examples we have, but that should actually interface with crates.io (or another registry) and build real code.
We need some way of testing that resolution actually works. For instance, testing that resolving crate foo leads to also resolving its dependencies even if they're not listed in our Cargo.toml. We could go about this with a high-fidelity mock of the crates.io endpoints, or we could do it by using less restrictive tests, something like "Try to resolve cargo-raze==0.9.0 and verify that one of the things we get back is anyhow, rather than verifying the exact versions of everything that's resolved.

There's always a tension between "how accurate are my mocks" vs "how reliable are my tests", and going all the way in either direction has issues :)

UebelAndre · 2021-03-30T14:51:59Z

@illicitonion I split the conversation into two threads since it seems to be getting quite lengthy and don't want it to distract from this PR 😄

illicitonion · 2021-03-30T15:12:06Z

@illicitonion I split the conversation into two threads since it seems to be getting quite lengthy and don't want it to distract from this PR 😄

Very good call - thanks!

hlopko

Giving rubber stamp approval from vacation in case you want to merge quickly.

Use tera for all rendering. There are things we will want to change, both about the output, and how it's computed, but this is an incremental step in more legible, testable code. Also, add some hermetic unit tests for rendering.

Stop doing manual string formatting of output

2d8bbae

Use tera for all rendering. There are things we will want to change, both about the output, and how it's computed, but this is an incremental step in more legible, testable code. Also, add some hermetic unit tests for rendering.

illicitonion requested a review from UebelAndre March 29, 2021 15:57

google-cla bot added the cla: yes label Mar 29, 2021

UebelAndre approved these changes Mar 29, 2021

View reviewed changes

illicitonion mentioned this pull request Mar 29, 2021

crate_universe: Use tera for generating top-level repository rules #652

Closed

This was referenced Mar 30, 2021

crate_universe: Review how to interract with environment variables #662

Closed

crate_universe has non hermetic tests #654

Closed

hlopko approved these changes Apr 7, 2021

View reviewed changes

Merge branch 'main' into renderer

26593c9

illicitonion merged commit de2ae12 into bazelbuild:main Apr 7, 2021

illicitonion deleted the renderer branch April 7, 2021 10:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stop doing manual string formatting of output #655

Stop doing manual string formatting of output #655

illicitonion commented Mar 29, 2021

UebelAndre left a comment

UebelAndre Mar 29, 2021

illicitonion Mar 29, 2021

UebelAndre Mar 29, 2021

illicitonion commented Mar 29, 2021

UebelAndre commented Mar 29, 2021

illicitonion commented Mar 29, 2021

UebelAndre commented Mar 29, 2021

illicitonion commented Mar 29, 2021

UebelAndre commented Mar 29, 2021

illicitonion commented Mar 29, 2021

UebelAndre commented Mar 29, 2021

illicitonion commented Mar 29, 2021

UebelAndre commented Mar 29, 2021

UebelAndre commented Mar 30, 2021

illicitonion commented Mar 30, 2021

UebelAndre commented Mar 30, 2021

illicitonion commented Mar 30, 2021

hlopko left a comment

Stop doing manual string formatting of output #655

Stop doing manual string formatting of output #655

Conversation

illicitonion commented Mar 29, 2021

UebelAndre left a comment

Choose a reason for hiding this comment

UebelAndre Mar 29, 2021

Choose a reason for hiding this comment

illicitonion Mar 29, 2021

Choose a reason for hiding this comment

UebelAndre Mar 29, 2021

Choose a reason for hiding this comment

illicitonion commented Mar 29, 2021

UebelAndre commented Mar 29, 2021

illicitonion commented Mar 29, 2021

UebelAndre commented Mar 29, 2021

illicitonion commented Mar 29, 2021

UebelAndre commented Mar 29, 2021

illicitonion commented Mar 29, 2021

UebelAndre commented Mar 29, 2021

illicitonion commented Mar 29, 2021

UebelAndre commented Mar 29, 2021

UebelAndre commented Mar 30, 2021

illicitonion commented Mar 30, 2021

UebelAndre commented Mar 30, 2021

illicitonion commented Mar 30, 2021

hlopko left a comment

Choose a reason for hiding this comment