Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non repeatable build / broken dependency propogation #184

Closed
ollie-etl opened this issue Apr 9, 2021 · 29 comments · Fixed by #224
Closed

Non repeatable build / broken dependency propogation #184

ollie-etl opened this issue Apr 9, 2021 · 29 comments · Fixed by #224

Comments

@ollie-etl
Copy link
Contributor

ollie-etl commented Apr 9, 2021

I'm not sure exactly whats up here, but cargo2nix appears broken.

I've been building in CI, and running to successful completion. Getting hold of master and issuing nix-build pulls everything from the binary cache, which no build, as expected.

However:

echo " " >> ./crates/xrfdc/Cargo.toml
cargo clean
cargo generate-lockfile
cargo2nix -f
nix-build .

results in the error below. Cargo is presumably unhappy with the metadata within encoding_rs.

I haven't bottomed out the cause yet (otherwise there would be an accompanying PR).

The fact that this has built successfully indicates there is an ordering type bug at play here? Or that cargo is emitting metadeta which renders the build invalid

   Compiling async-graphql v2.8.2 (/build/async-graphql-2.8.2)
{
    "reason": "compiler-message",
    "package_id": "xrfdc 0.1.0 (path+file:///build/source)",
    "target": {
        "kind": [
            "lib"
        ],
        "crate_types": [
            "lib"
        ],
        "name": "xrfdc",
        "src_path": "/build/source/src/lib.rs",
        "edition": "2018",
        "doc": true,
        "doctest": true,
        "test": true
    },
    "message": {
        "rendered": "error[E0460]: found possibly newer version of crate `encoding_rs` which `async_graphql` depends on
 --> src/schema.rs:2:5
  |
2 | use async_graphql::*;
  |     ^^^^^^^^^^^^^
  |
  = note: perhaps that crate needs to be recompiled?
  = note: the following crate versions were found:
          crate `encoding_rs`: /nix/store/md0bfvsf39ch8vki443wsib7sc44vcjd-crate-encoding_rs-0.8.28/lib/libencoding_rs.rlib
          crate `async_graphql`: /nix/store/akgk0qdl4fzn32mhpg54zwjgxh7nzc9a-crate-async-graphql-2.8.1/lib/libasync_graphql.rlib

",
        "children": [
            {
                "children": [],
                "code": null,
                "level": "note",
                "message": "perhaps that crate needs to be recompiled?",
                "rendered": null,
                "spans": []
            },
            {
                "children": [],
                "code": null,
                "level": "note",
                "message": "the following crate versions were found:
crate `encoding_rs`: /nix/store/md0bfvsf39ch8vki443wsib7sc44vcjd-crate-encoding_rs-0.8.28/lib/libencoding_rs.rlib
crate `async_graphql`: /nix/store/akgk0qdl4fzn32mhpg54zwjgxh7nzc9a-crate-async-graphql-2.8.1/lib/libasync_graphql.rlib",
                "rendered": null,
                "spans": []
            }
        ],
        "code": {
            "code": "E0460",
            "explanation": null
        },
        "level": "error",
        "message": "found possibly newer version of crate `encoding_rs` which `async_graphql` depends on",
        "spans": [
            {
                "byte_end": 31,
                "byte_start": 18,
                "column_end": 18,
                "column_start": 5,
                "expansion": null,
                "file_name": "src/schema.rs",
                "is_primary": true,
                "label": null,
                "line_end": 2,
                "line_start": 2,
                "suggested_replacement": null,
                "suggestion_applicability": null,
                "text": [
                    {
                        "highlight_end": 18,
                        "highlight_start": 5,
                        "text": "use async_graphql::*;"
                    }
                ]
            }
        ]
    }
}
{
    "reason": "compiler-message",
    "package_id": "xrfdc 0.1.0 (path+file:///build/source)",
    "target": {
        "kind": [
            "lib"
        ],
        "crate_types": [
            "lib"
        ],
        "name": "xrfdc",
        "src_path": "/build/source/src/lib.rs",
        "edition": "2018",
        "doc": true,
        "doctest": true,
        "test": true
    },
    "message": {
        "rendered": "error: aborting due to previous error

",
        "children": [],
        "code": null,
        "level": "error",
        "message": "aborting due to previous error",
        "spans": []
    }
}
{
    "reason": "build-finished",
    "success": false
}
@psionic-k
Copy link
Member

cargo generate-lockfile
cargo2nix -f

This step is expected to be non-reproducible as the package index will be consulted. Still, the build should succeed if cargo can succeed (Definitely point out if this is not the case).

Among search results, this link seems pointing in the right direction:

It produces a crate with incompatible symbol names (a "different version"). There's a "strict version hash" derived by hashing a representation of the structure of a crate's entire external interface. That hash value is appended to every symbol name as part of the name mangling process.

A mechanism I could believe to be at work here is if Nix evaluation thinks a dependency doesn't need rebuild when it actually does. The compiler catches our error due to the wrong interface hashes. To simplify, we might be ignoring part of the input when writing the Nix expressions. If this theory is correct, cleaning out the store paths should result in Nix not finding the (broken) cached rlib and being able to succeed. Can you give this a shot?

You can find the paths that might need rebuilding using a similar workflow:

nix-store -q  --deriver /nix/store/32d9pc56axxrc2830q5cz1iwifjbar2m-crate-cargo2nix-0.9.0-bin
nix show-derivation /nix/store/ddwgw69slmx1ajdcyjrj2z53c4kfw3ml-crate-cargo2nix-0.9.0.drv | grep '/nix.*crate.*drv

  "/nix/store/ddwgw69slmx1ajdcyjrj2z53c4kfw3ml-crate-cargo2nix-0.9.0.drv": {
      "/nix/store/3inhij8bmn3qzpn1m18j2f48bvwlil2r-crate-anyhow-1.0.35.drv": [
      "/nix/store/3l1la0qrsgwyzx6nvpxlb8kz8l98jgi0-crate-semver-0.9.0.drv": [
      "/nix/store/9lid9ypd34zxyp697ajx92qwn9bkzshf-crate-tera-1.5.0.drv": [
      "/nix/store/ixv0163kk08nyjwjk22p2z990z4335wy-crate-cargo-0.48.0.drv": [
      "/nix/store/jjkvilw9283576lxysgkyrng2v5myf33-crate-tempfile-3.1.0.drv": [
      "/nix/store/js8lidv52qbdflgnjrqbv9hnhairsd66-crate-colorify-0.2.3.drv": [
      "/nix/store/l5l4anxvgpj6kxvp9h3g2dgi37glabql-crate-serde-1.0.118.drv": [
      "/nix/store/n08g1s2v3kgcir8z06f45hn2a32ybm0k-crate-toml-0.5.7.drv": [
      "/nix/store/qyivqx0j3g4gikbf7y9gn1wfan1xl0ds-crate-pathdiff-0.2.0.drv": [
      "/nix/store/vgdpa0vxdw1yh88xbgmbdvijhm2mdl95-crate-cargo-platform-0.1.1.drv": [

@ollie-etl
Copy link
Contributor Author

To confirm, the build does not succeed. I'll have a go with what you've suggested - thanks for the pointer. It'll be next week now before i get a chance.

@psionic-k
Copy link
Member

Assuming cargo build works but nix build fails. Please correct me if that's not the case.

@ollie-etl
Copy link
Contributor Author

ollie-etl commented Apr 12, 2021

Yes, cargo works fine, nix does not. I can simplify the failure case to:

git checkout master
# confirm build is as CI - everything is pulled from binary cache
nix-build .
# Invalidate build for 1 of the crates in workspace
echo " " >> ./crates/xrfdc/Cargo.toml
nix-build .

The resultant error message is then on a downstream crate:

 found possibly newer version of crate `cvt` which `xrfclk` depends on

so here, we have crate xrfdc and xrfclk which are both dependencies of a workspace binary crate. xrfdc builds correctly with the whitespace modified toml.

I can confirm that deleting all store paths, then rebuilding with nix-build . --option substitute false results in a successful build.

We were seeing CI failures too. In CI, we build 4 sets of derivations for our software, which are

let
  rustPkgs = pkgs.rustBuilder.makePackageSet' rec {
    rustChannel = pkgs.rustChannel.channel;
    packageFun = import ./Cargo.nix;
    packageOverrides = pkgs:
      pkgs.rustBuilder.overrides.all
      ++ (import ./sys-crates.nix { inherit pkgs; });
  };

  crates = pkgs.lib.mapAttrsToList (n: _: n)
    (pkgs.lib.filterAttrs (n: v: v == "directory") (builtins.readDir ./crates));

in {
  build = pkgs.lib.genAttrs crates (n: rustPkgs.workspace."${n}" { });
  test = pkgs.lib.genAttrs crates (n: pkgs.rustBuilder.runTests rustPkgs.workspace."${n}" { });
  test-bin = pkgs.lib.genAttrs crates (n: rustPkgs.workspace."${n}" { compileMode = "test"; });
  bench-bin = pkgs.lib.genAttrs crates (n: rustPkgs.workspace."${n}" { compileMode = "bench"; });
}

we were seeing these failures also in bench-bin

@psionic-k
Copy link
Member

Guessing that you've already implemented flushing the CI cache as a workaround for now.

Based on first example, there's strong evidence that the newly generated Cargo.nix did not react to the "modified" Cargo.toml.

Question is, how does the inconsequential change to the Cargo.toml result in rustc deciding that the rlib is no longer good...

@ollie-etl
Copy link
Contributor Author

Our workaround was nota subtle one - CI is ignoring all entries in the binary cache containing our crates. We warm it up by building hello-world with the cache for all cross-compilation targets, to avoid rebuilding the universe.

I'm equally baffled. I don't understand the internals of cargo sufficiently to know where to inspect the metadata.

@ollie-etl
Copy link
Contributor Author

My leading (only) candidate for this is a single binary crate, which shares a many of the dependencies of the failing ones. Its the only override aside from some fairly straightforward sys crates including build inputs on the libraries. We have

    my_bin = pkgs.rustBuilder.rustLib.makeOverride {
      name = "my_bin";
      overrideAttrs = drv: {
        # Rust flags were being ignored: this is a workaround.
        configureCargo = ''
          mkdir -p .cargo
          cargo_config=${./.cargo/config.toml}
          cp $cargo_config .cargo/$(stripHash $cargo_config)
        '';
      };
    };

This is being done to set linkage flags for static cross compilation. I'm unaware of any other way to inject these in cargo2nix - i'd love to been shown to be incorrect.

[target."aarch64-unknown-linux-musl"]
linker = "aarch64-unknown-linux-musl-ld"
rustflags = ["-C", "target-feature=+crt-static", "-C", "link-arg=-s"]

Building the dependencies for this target will have different flags than building those dependencies for other targets, which if encoded in metadata would cause what i'm seeing if those dependencies were reused. I don't have a great handle on what metadata cargo stores, but i'd believe link flags were amongst them.

Does this seem likely? If so what would be a better method for injecting these rust flags for this target and its deps?

@Fuuzetsu
Copy link

We've come to hit this issue as well. We use crates2nix however.

@ollie-etl
Copy link
Contributor Author

@Fuuzetsu Is there a related issue in that project? Also, are you doing cross compilation?

@Fuuzetsu
Copy link

Fuuzetsu commented Aug 2, 2021

@Fuuzetsu Is there a related issue in that project? Also, are you doing cross compilation?

No issue on cargo2nix that I know of.

w.r.t. reproduction, sadly it's a closed source project that has been working well up until recently when we hit this issue. We have a workspace with maybe like 20 crates (not at PC to check) and when we added one more, the new crate is having the problem. I found this ticket by Googling.

We are not doing cross compilation.

@Fuuzetsu
Copy link

Fuuzetsu commented Aug 2, 2021

@Fuuzetsu Is there a related issue in that project? Also, are you doing cross compilation?

No issue on cargo2nix that I know of.

w.r.t. reproduction, sadly it's a closed source project that has been working well up until recently when we hit this issue. We have a workspace with maybe like 20 crates (not at PC to check) and when we added one more, the new crate is having the problem. I found this ticket by Googling.

We are not doing cross compilation.

Sorry, first line meant to say crate2nix, not cargo2nix

@ollie-etl
Copy link
Contributor Author

@Fuuzetsu Are you able to share how you invoke crate2nix?

My working assumption for this error is its something in the manner in which I'm setting the linkage flags not propagating correctly. However, that could be completely unrelated, and there is some more fundamental bug here.

The way this first manifested itself for us was very similar to what your describe - we've 15 or so crates in a workspace (closed source), and adding a new one caused this problem.

@Fuuzetsu
Copy link

Fuuzetsu commented Aug 4, 2021

@Fuuzetsu Are you able to share how you invoke crate2nix?

My working assumption for this error is its something in the manner in which I'm setting the linkage flags not propagating correctly. However, that could be completely unrelated, and there is some more fundamental bug here.

The way this first manifested itself for us was very similar to what your describe - we've 15 or so crates in a workspace (closed source), and adding a new one caused this problem.

We're not doing anything strange with the crates I think: I am adding things like buildInputs to a bunch and setting some env vars to the others but once we get past that, it's fairly vanilla. I'm not messing with linking (except for a single crate where I use extraLinkFlags to deal with NixOS/nixpkgs#119382).

I don't really think it's an issue with how we're invoking the builder nor with what you're doing. I initially was blaming non-determinism of rustc but it doesn't explain why changing the whitespace in the upstream crate makes it work sometimes... The issue happening when a crate is added is also very strange: they ought to be just separate derivations, I don't understand why they would be affecting each other.

Honestly I'm a bit lost as to how to deal with this. I was going to try and examine what's happening during the build in detail when I got some time but I haven't gotten to that point yet. If we have a way to replicate the issue then change some whitespace and have a working build, it should presumably be possible to examine what rustc is doing during the build and find the difference...

@Fuuzetsu
Copy link

Fuuzetsu commented Aug 4, 2021

@Fuuzetsu Are you able to share how you invoke crate2nix?
My working assumption for this error is its something in the manner in which I'm setting the linkage flags not propagating correctly. However, that could be completely unrelated, and there is some more fundamental bug here.
The way this first manifested itself for us was very similar to what your describe - we've 15 or so crates in a workspace (closed source), and adding a new one caused this problem.

We're not doing anything strange with the crates I think: I am adding things like buildInputs to a bunch and setting some env vars to the others but once we get past that, it's fairly vanilla. I'm not messing with linking (except for a single crate where I use extraLinkFlags to deal with NixOS/nixpkgs#119382).

I don't really think it's an issue with how we're invoking the builder nor with what you're doing. I initially was blaming non-determinism of rustc but it doesn't explain why changing the whitespace in the upstream crate makes it work sometimes... The issue happening when a crate is added is also very strange: they ought to be just separate derivations, I don't understand why they would be affecting each other.

Honestly I'm a bit lost as to how to deal with this. I was going to try and examine what's happening during the build in detail when I got some time but I haven't gotten to that point yet. If we have a way to replicate the issue then change some whitespace and have a working build, it should presumably be possible to examine what rustc is doing during the build and find the difference...

Oh, I wanted to add something. One thing that's different on say my machine where build works vs coworker's machine where build failed vs CI where build failed is the codegen-units rustc parameter that basically comes from the environment. This parameter is very suspicious and I think easily can lead to different build results. This could explain why it builds on one machine but fails in CI. I have not yet tried tuning NIX_BUILD_CORES but I wonder if it'd work to set it to say 1 in CI or something.

@ollie-etl
Copy link
Contributor Author

ollie-etl commented Aug 4, 2021

I think you may have something with codegen-units. I hadn't considered this, but our errors started around the time we bumped our CI machines from 8 to 16 cores.

I'm not sure of the exact correlation, because the CI infrastructure was done out of band from our commit history. Perhaps codegen-units should be a build parameter in cargo2nix / crate2nix.

The docs suggest this is not the case though:

This flag controls how many code generation units the crate is split into. It takes an integer greater than 0.

When a crate is split into multiple codegen units, LLVM is able to process them in parallel. Increasing parallelism may speed up compile times, but may also produce slower code. Setting this to 1 may improve the performance of generated code, but may be slower to compile.

The default value, if not specified, is 16 for non-incremental builds. For incremental builds the default is 256 which allows caching to be more granular.

@Fuuzetsu
Copy link

Fuuzetsu commented Aug 4, 2021

The docs suggest this is not the case though:

I'm not sure what you mean by this. The defaults aren't used in nix builds. See build-crate.nix and configure-crate.nix.

@Fuuzetsu
Copy link

Fuuzetsu commented Aug 4, 2021

For now I'm going to send MR to nixpkgs to make codegen-units configurable. Will need changes in crates2nix and cargo2nix to make use of it of course but have to start somewhere.

@ollie-etl
Copy link
Contributor Author

The docs suggest this is not the case though:

I'm not sure what you mean by this. The defaults aren't used in nix builds. See build-crate.nix and configure-crate.nix.

Cargo2nix uses cargo to do the building

@Fuuzetsu
Copy link

Fuuzetsu commented Aug 5, 2021

The docs suggest this is not the case though:

I'm not sure what you mean by this. The defaults aren't used in nix builds. See build-crate.nix and configure-crate.nix.

Cargo2nix uses cargo to do the building

Ah, I see. I still am not sure what you mean by "this is not the case" in this situation but as I understand, this means that probably cargo2nix is using the value of 16 as release builds are non-incremenal by default as of few rust versions ago (and nix builds aren't able to be incremental anyway so one would hope it's doing non-incremental settings anyway).

I guess the patch to nixpkgs doesn't matter for cargo2nix and for your usecase you should just either patch cargo2nix to set the flag explicitly or set it in Cargo.toml, assuming it gets honoured properly.

Fuuzetsu added a commit to Fuuzetsu/nixpkgs that referenced this issue Oct 12, 2021
This parameter is being set to `$NIX_BUILD_CORES` by default. This is a
standard practice but there's a suspicion that this can produce broken
builds. For some details see
cargo2nix/cargo2nix#184 . As a
work-around/test, it'd be good if codegen-units can be set to something
constant, such as `1`. This PR allows it.

Note that the default of `$NIX_BUILD_CORES` is preserved so this MR
causes no change in default behaviour and no rebuilds.
@Fuuzetsu
Copy link

We started doing codegen-units=1 in our codebase recently in hopes that it'd fix the issue. Sadly, I just hit this again even with codegen-units=1 so it doesn't make the issue go away. I guess this needs some real debugging ;/

@Fuuzetsu
Copy link

As codegen-units=1 did not help, I had to spend some time to investigate properly.

I suspected rustc initially because I had two crate .rlib (and then .so) that came from same nix derivations but had different crate hashes. I made the ticket at rust-lang/rust#89904 . You can read the investigation there but to save you the time, it turned out that it was a crate we had in the dependency tree that would generate different code on every compile (unintentionally).

As this crate was deep in our dependencies, it would only rarely actually get recompiled which is why we weren't hitting the issue often: only really when someone was adding crates (and so potentially causing it to build due to different resolved versions/features). I submitted a fix at frozenlib/structmeta#1 for our particular instance.

@ollie-etl if you guys still have issues, I'd recommend you try and vet macros that are in your (transitive) dependency tree. I suspect you may even have the same crate in scope. Sadly, I don't have an easy way to find what crate an issue comes from. I effectively was hacking rustc to be able to read the dependencies of crates from .rlib and .so as well as dump output while crate was being hashed which is how I finally found the problem but it's obviously not a reasonable way...

I suppose there should at least be a tool that prints crate hashes dependency tree: you can already see these hashes if you do ask rustc to output verbose output so maybe that's where one can start.

@psionic-k
Copy link
Member

Need to look at the drv's between success & failure. In the tree of dependencies, there's got to be some extra drv's that show up (along with old ones) or else the impurity is a non-determinism that's leaking in, meaning the intensional store winds up with multiple results at the same nix path. It's a great help to spot the propagation of differences in the drvs

@psionic-k
Copy link
Member

0.9.1 has fixed a behavior where trivial features were leaking down into dependencies. This might actually help the behavior.

@psionic-k
Copy link
Member

After reading more into Fuuzetsu's experience, it's pretty clear that this issue is with Rust, non-deterministic binary outputs. In general, we can't protect from non-determinism. I guess would could recommend an override workflow for the specific crate to always trigger rebuild. A workaround is the right answer if there's unavoidable non-determinism in the Rust dependency or some rustc behavior.

@Fuuzetsu
Copy link

Yes, rustc 1.55 seems OK and issues there lie with macro crates that produce non-deterministic output.
In rustc 1.56, there are issues with compiler (rather, LLVM) itself in general so I would wait for 1.57 where there are bunch of fixes.

There's a nix config flag that you can set to rebuild all packages multiple times and check hashes: this can help catch non-deterministic packages ahead of time, hopefully before they go into your binary caches etc. Another is to override your rust deps with preferLocalBuild or whatever it is.

For us, we are just on 1.55 for now.

@psionic-k
Copy link
Member

so I would wait for 1.57 where there are bunch of fixes.

@Fuuzetsu Did you see any fixes land as expected?

@Fuuzetsu
Copy link

Fuuzetsu commented Dec 7, 2021

so I would wait for 1.57 where there are bunch of fixes.

@Fuuzetsu Did you see any fixes land as expected?

Yes, there were a bunch. We ran into another issue but it turned out to be a derive crate producing code from a macro in non-deterministic way. It's looking like we're going to be upgrading from 1.55 to 1.57, skipping the broken 1.56.

You can see some issues getting fixed in rust-lang/rust#90301 – it's all LLVM stuff.

@psionic-k
Copy link
Member

Then I think it's appropriate to handle this issue as errata, noting it in the common issues on the README but not attempting to hunt it down as a nix or cargo2nix issue. With no objections, I'll include this note in my open PR and close this issue on merge.

@Fuuzetsu
Copy link

Fuuzetsu commented Dec 7, 2021

Then I think it's appropriate to handle this issue as errata, noting it in the common issues on the README but not attempting to hunt it down as a nix or cargo2nix issue. With no objections, I'll include this note in my open PR and close this issue on merge.

Yes, there is nothing on the nix side that's broken or that one can do beyond something like "disable binary caches because binaries are randomly not compatible". Definitely nothing cargo2nix or crate2nix or anything can do (except somehing crazy like enabling preferLocalBuild or something).

@psionic-k psionic-k mentioned this issue Dec 9, 2021
18 tasks
happysalada pushed a commit to NixOS/nixpkgs that referenced this issue May 1, 2022
This parameter is being set to `$NIX_BUILD_CORES` by default. This is a
standard practice but there's a suspicion that this can produce broken
builds. For some details see
cargo2nix/cargo2nix#184 . As a
work-around/test, it'd be good if codegen-units can be set to something
constant, such as `1`. This PR allows it.

Note that the default of `$NIX_BUILD_CORES` is preserved so this MR
causes no change in default behaviour and no rebuilds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants