Build ID Section for WASM #133

Closed
mitsuhiko opened this issue Nov 14, 2019 · 58 comments

Comments

@mitsuhiko

I originally brought this up in the design repo (WebAssembly/design#1306) but I believe this fits here better.

For deferred symbolication on services like Sentry it would be nice to be able to match up DWARF debug information to the main WASM file by build ID. In ELF this is typically accomplished with the GNU build ID note, on Windows with the PDB signature and age, and on Darwin the Mach-O UUID fulfills that purpose.

I would love to see a build_id custom section that contains a 16 or 20 byte ID which tools would ensure remains in both WASM files (CODE, debug companion containing DWARF info) if they get split. Capping it at 16 bytes makes it possible to roundtrip this through breakpad, which uses a 16+4 byte char array for the debug id: 16 bytes for the PDB UUID plus 4 bytes for the PDB age.

Motivation: Sentry and other systems like to be able to look up files by build ID because then they can access an external symbol server for that information. That way one just provides some sources where debug information can be found and then symbolicators just reach out to that service to find the debug information files.

@kripken
Member

kripken commented Nov 14, 2019

Sounds good to me.

Would the build ID be modified by tools that process the wasm, say by Binaryen when it optimizes the binary? Or would it stay fixed after it's emitted from the original compiler? (The former seems to make sense, as optimizations change the binary, but then we'd need to describe those changes here I think.)

@mitsuhiko
Author

Definitely should change as the file changes.

Of note is that in the Microsoft ecosystem the age on the PDB signature (those extra 4 bytes) gets incremented with every transformation. In my experience this has made things more complicated in practice, because the age was not consistently changed everywhere. For instance, the age is stored more than once in the PE format and can come out desynchronized even from Microsoft's own tools.

I think it would be wiser to explicitly tell tools to always completely override the embedded ID if it goes through a transformation. This does mean you can't track back to the original ID of the originally created WASM file but I'm not sure if that is necessary in general.

Would be curious to hear though if there are some advantages of the pdb+age system on the Microsoft side.

@sunfishcode
Member

How important is it to have an explicit field for this, as opposed to just having tools compute a hash of a wasm binary to use as an effective build ID?

@mitsuhiko
Author

@sunfishcode since the ID needs to survive a stripping of the file, it's very important. With DWARF in place you normally want to separate out the object file into two: one that contains CODE and other sections necessary to run the code, a second one with the DWARF sections (.debug_frame etc.). Since stripping/splitting the files changes the file you cannot reproduce the ID after the split.
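
The split described above can be sketched roughly as follows. This is a minimal illustration, not any existing tool's implementation: the function names (`split_debug`, `sections`, `custom_name`, `custom_section`) are hypothetical, the `build_id` payload layout is left opaque, and the rule "custom sections whose name starts with `.debug_` go to the companion file" is an assumption made for the example.

```rust
/// Decodes an unsigned LEB128 integer, returning (value, bytes consumed).
fn read_uleb128(bytes: &[u8]) -> Option<(u32, usize)> {
    let mut value: u32 = 0;
    for (i, &b) in bytes.iter().enumerate().take(5) {
        value |= u32::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Lists the sections of a module as (id, raw bytes including header, payload).
fn sections(wasm: &[u8]) -> Option<Vec<(u8, &[u8], &[u8])>> {
    if wasm.len() < 8 || !wasm.starts_with(b"\0asm") {
        return None;
    }
    let mut out = Vec::new();
    let mut pos = 8;
    while pos < wasm.len() {
        let id = wasm[pos];
        let (size, n) = read_uleb128(&wasm[pos + 1..])?;
        let start = pos + 1 + n;
        let end = start.checked_add(size as usize)?;
        let payload = wasm.get(start..end)?;
        out.push((id, &wasm[pos..end], payload));
        pos = end;
    }
    Some(out)
}

/// Returns the name of a custom section given its payload.
fn custom_name(payload: &[u8]) -> Option<&[u8]> {
    let (len, n) = read_uleb128(payload)?;
    payload.get(n..n.checked_add(len as usize)?)
}

/// Splits a module into (stripped file, debug companion).
/// The `build_id` section is copied into both so they can be matched later.
fn split_debug(wasm: &[u8]) -> Option<(Vec<u8>, Vec<u8>)> {
    let mut stripped = wasm.get(..8)?.to_vec();
    let mut debug = stripped.clone();
    for (id, raw, payload) in sections(wasm)? {
        let name = if id == 0 { custom_name(payload) } else { None };
        match name {
            Some(n) if n == &b"build_id"[..] => {
                stripped.extend_from_slice(raw);
                debug.extend_from_slice(raw);
            }
            Some(n) if n.starts_with(b".debug_") => debug.extend_from_slice(raw),
            _ => stripped.extend_from_slice(raw),
        }
    }
    Some((stripped, debug))
}

/// Helper for building small custom sections (one-byte LEB128 sizes only).
fn custom_section(name: &str, data: &[u8]) -> Vec<u8> {
    let size = 1 + name.len() + data.len();
    assert!(size < 128, "sketch only handles one-byte LEB128 sizes");
    let mut out = vec![0x00, size as u8, name.len() as u8];
    out.extend_from_slice(name.as_bytes());
    out.extend_from_slice(data);
    out
}
```

Because the ID travels with both halves, neither file needs to be re-hashed to find its counterpart.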

@sunfishcode
Member

Can tools just hash the contents of the main wasm sections then, and ignore debug info sections?

I don't have a strong opinion either way yet; I just want to understand the space.

@mitsuhiko
Author

mitsuhiko commented Nov 14, 2019

To generate the build ID they could take the hash of the main wasm sections and store it in the file. They can alternatively just generate a random UUID and embed it. I do think though that a build ID should ideally always be embedded.
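
As a rough sketch of the "hash the main wasm sections" option: the function below walks the module and feeds every non-custom section into a hasher, so stripping or adding custom sections (including debug info) does not change the result. The function name is hypothetical, and `DefaultHasher` is only a stand-in from the standard library; a real tool would presumably pick a stable, stronger hash, since `DefaultHasher` output is not guaranteed to stay the same across Rust releases.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Decodes an unsigned LEB128 integer, returning (value, bytes consumed).
fn read_uleb128(bytes: &[u8]) -> Option<(u32, usize)> {
    let mut value: u32 = 0;
    for (i, &b) in bytes.iter().enumerate().take(5) {
        value |= u32::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Hashes all known (non-custom) sections of a module, headers included.
/// Custom sections (id 0), which hold DWARF and names, are skipped.
fn hash_known_sections(wasm: &[u8]) -> Option<u64> {
    if wasm.len() < 8 || !wasm.starts_with(b"\0asm") {
        return None;
    }
    let mut hasher = DefaultHasher::new();
    let mut pos = 8;
    while pos < wasm.len() {
        let id = wasm[pos];
        let (size, n) = read_uleb128(&wasm[pos + 1..])?;
        let end = (pos + 1 + n).checked_add(size as usize)?;
        let section = wasm.get(pos..end)?;
        if id != 0 {
            hasher.write(section);
        }
        pos = end;
    }
    Some(hasher.finish())
}
```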

(This here describes the workflow where this information is particularly useful)

@sunfishcode
Member

I'm curious about what situations storing a Build ID in the file is better than computing a hash on demand whenever it's needed.

Naively, computing it on demand would seem to have several advantages:

  • tools can pick an appropriately strong hash function for their own use case
  • tools can decide which sections to hash and which to ignore for their own use case
  • tools don't have to worry about other tools stripping or omitting the Build ID
  • tools can trust that no other tools have modified the binary without updating the Build ID
  • tools can trust that the Build ID is not maliciously crafted to cause collisions, if that's important
  • tools don't have to worry about the length of the Build ID varying

@Jake-Shadle

FWIW Breakpad supports the embedded, format-specific identifiers that @mitsuhiko mentioned, but if they aren't available for any reason, it falls back to computing an MD5 of the first 1024 bytes of the TEXT section (or equivalent).

You bring up some good points about the advantages of computing the build ID, but to me the point of the Build ID is to precisely identify a particular build so that different tools can always pair the code with the debug information. Allowing tools to choose their own hash function, or which sections to hash, causes problems when tools need to communicate with each other, e.g. a debugger and a symbol store that use different hash functions.

So storing the Build ID does have some disadvantages, particularly when a tool does a transformation that doesn't also change the Build ID, but I'm much more concerned with tools having a consistent source of truth.

@sunfishcode
Member

I've now found this Stack Overflow post, which was helpful. The Build ID isn't just a hash of the contents; it's something like a hash of the contents and the debug info together, which is then recorded and preserved, even if debug info is stripped. As such, it can't always be recomputed.

There are a lot of use cases other than debug info that would seem to want something like a Build ID, but what they need is something subtly different from what the Build ID actually is. So, brainstorming here, what if we do have a Build ID section, but call it the "Debug Info ID", and say:

  • Compilers producing wasm without debug info don't need to include a Debug Info ID.
  • Post-processing tools could decide whether to strip the Debug Info ID based on whether their transformation would invalidate any associated debug info.

Would that make sense?

@mitsuhiko
Author

I think it's fair to specifically call this a debug_id and record that it's useful for that purpose.

The hashing fallback path of breakpad has caused more issues than it solved so I would prefer we don't spec out something like this.

@sbc100
Member

sbc100 commented Nov 15, 2019

Would this build ID be generated at the point when the debug info is split out (either by the linker, or some kind of post link debug-splitting tool)? Or would it be present even in binaries that still have their debug info embedded?

@mitsuhiko
Author

@sbc100 definitely already in binaries that have the debug info embedded. We for instance have lots of cases where we want to symbolicate stacktraces where the client just submitted instruction addresses and then people upload the entire binary with debug information included.

This is especially important when doing stack unwinding out of memory dumps. This is obviously less useful for wasm right now, but in terms of existing workflows, having the debug ID even in unstripped binaries has been very valuable.

@RReverser
Member

Just as a counter-point, one downside of an embedded id seems to be precisely that it would usually survive destructive operations on the code.

That is, if code is post-processed by a tool similar to wasm-opt or wasm-bindgen, and if that tool can't correctly update DWARF information, then the build id would remain the same even though the code has changed and no longer matches the debug info. In this case you as a consumer (Sentry or otherwise) explicitly don't want such debug info to be matched and used.

Arguably, every such tool should either support DWARF or be able to at least change build ID to some new unique value, but it seems that hashing of code section would alleviate this concern even more naturally.

@mitsuhiko
Author

Since we're adding WASM DWARF support at Sentry at the moment we might be going ahead and require customers to embed a build_id custom section into their files for now.

@RReverser
Member

@mitsuhiko Does the "hash of the code section" idea not work for you?

@mitsuhiko
Author

@RReverser So far I have not defined how the contents of the build_id section are to be computed. However, since the code section is inaccessible from within JavaScript but custom sections are available, I cannot compute it on demand. So one of our users can either compute the build_id by hashing the code section or alternatively just embed a random UUID; either way the result from our perspective is the same.

For what it's worth, embedding a random build_id is easier to accomplish with the existing Rust toolchain, as it can be done with #[link_section] on a static byte literal, whereas making it a hash requires injecting the custom build section after the fact. I was attempting to do this with walrus but unfortunately that appears to do something nasty with the DWARF data in the WASM file currently.
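
A minimal sketch of the `#[link_section]` approach, assuming a `wasm32` target where rustc places such a static into a custom section of the given name. The static's name and the ID bytes are arbitrary placeholders chosen for illustration (a real build would generate the bytes, for example in a build script). The attribute is gated with `cfg_attr` so the snippet also compiles on non-wasm hosts.

```rust
// On wasm32, this static's bytes become the payload of a custom section
// named "build_id". On other targets the attribute is simply not applied.
// Note: edition 2024 expects the `unsafe(link_section = "...")` spelling.
#[cfg_attr(target_arch = "wasm32", link_section = "build_id")]
// `#[used]` keeps the static from being dropped as dead data.
#[used]
static BUILD_ID: [u8; 16] = [
    // Placeholder bytes for illustration only.
    0x4f, 0x1c, 0x9a, 0x27, 0xd3, 0x5b, 0x4e, 0x80,
    0x91, 0x6a, 0x0e, 0x33, 0x7c, 0xb2, 0x58, 0xf4,
];
```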

@RReverser
Member

I was attempting to do this with walrus but unfortunately that appears to do something nasty with the DWARF data in the WASM file currently.

Yeah, walrus is a high-level IR and, as such, rewrites even the code you didn't touch, which, in turn, affects debug offsets. You need a lower-level representation instead, e.g. [shameless plug] you can try my wasmbin library which was created with similar use-cases in mind. https://github.com/GoogleChromeLabs/wasmbin

@RReverser
Member

I've pushed an example for random build_id (based on UUID v4) here: https://github.com/GoogleChromeLabs/wasmbin/blob/build_id/examples/build_id.rs

You'll probably want to extend it to be more robust (e.g. add detection of existing build_id section), but it works and attaches a section successfully.
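
Detecting an existing section, as suggested above, could look roughly like the following dependency-free sketch. It is not part of wasmbin; `find_custom_section` is a hypothetical helper that scans the raw section list and returns the payload that follows the section name.

```rust
/// Decodes an unsigned LEB128 integer, returning (value, bytes consumed).
fn read_uleb128(bytes: &[u8]) -> Option<(u32, usize)> {
    let mut value: u32 = 0;
    for (i, &b) in bytes.iter().enumerate().take(5) {
        value |= u32::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Returns the data of the first custom section called `wanted`
/// (the bytes after the section name), or `None` if there is none.
fn find_custom_section<'a>(wasm: &'a [u8], wanted: &str) -> Option<&'a [u8]> {
    if wasm.len() < 8 || !wasm.starts_with(b"\0asm") {
        return None;
    }
    let mut pos = 8;
    while pos < wasm.len() {
        let id = wasm[pos];
        let (size, n) = read_uleb128(&wasm[pos + 1..])?;
        let start = pos + 1 + n;
        let end = start.checked_add(size as usize)?;
        let payload = wasm.get(start..end)?;
        if id == 0 {
            // Custom section payload: LEB128 name length, name, then data.
            let (name_len, m) = read_uleb128(payload)?;
            let name_end = m.checked_add(name_len as usize)?;
            if payload.get(m..name_end)? == wanted.as_bytes() {
                return Some(&payload[name_end..]);
            }
        }
        pos = end;
    }
    None
}
```

A tool would call `find_custom_section(&bytes, "build_id")` first and only attach a fresh ID when it returns `None`.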

@mitsuhiko
Author

Oh this is neat. Going to use this.

@RReverser
Member

RReverser commented Nov 25, 2020

Come to think of it, due to the nature of Wasm binary format, if you didn't want to check for presence of existing build_id, you could even literally append bytes representing the custom section to the end of the file:

use std::fs::OpenOptions;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let filename = std::env::args()
        .nth(1)
        .expect("Provide a filename as an argument");
    // Open in append mode: the new section goes after all existing ones.
    let mut f = OpenOptions::new().append(true).open(filename)?;
    f.write_all(&[
        // Custom section (id=0)
        0x00,
        // Length of payload (length of length of name + length of name + length of UUID)
        1 + 8 + 16,
        // Length of name
        8,
    ])?;
    f.write_all("build_id".as_bytes())?;
    // Requires the `uuid` crate with its `v4` feature enabled.
    f.write_all(uuid::Uuid::new_v4().as_bytes())?;
    Ok(())
}

Won't save too much in terms of perf and the code won't be as clean, but hey, it's possible in case you want to avoid any dependencies altogether and make a tiny util :)

@mitsuhiko
Author

I extended your tool into one that does not override existing build IDs and also splits the file into two: getsentry/symbolicator#303

@dschuff
Member

dschuff commented Jul 23, 2021

I think we should pick this up and add support to LLVM/emscripten to make this easier.
Is this a correct summary of people's current thoughts/current usage?

  1. We are thinking of build_id similar to ELF, in that it reflects the semantics/origin/sources of the program and therefore:
    a) it conceptually includes the debug info
    b) It should be changed (dropped?) by any tool that modifies only the code (since that would invalidate the debug info). @sunfishcode says above that any transformation that wouldn't invalidate debug info wouldn't need to rewrite the ID. That makes sense to me, although the set of such possible transformations seems rather small.
    c) It should be changed by any tool that modifies the code and updates the debug info (this is kind of a funny thing to say for an optimizer that isn't supposed to change the semantics of the program, but I think it's correct because of course changing the code will change function indexes, section/module offsets, etc)
  2. Current tools (other than emscripten) just add a section called build_id with a random UUID

@sunfishcode also suggests above that tools not write a build ID if they don't generate debug info. I don't really see the harm either way; a wasm file that never had debug info will be indistinguishable from one that had debug info stripped out.
Thinking about this some more: there is no practical way to tell whether a file has been modified incorrectly (i.e. rewriting the code section but failing to change the build id), or modified at all. In other words, if it is known that the build ID is e.g. a hash of all the known sections plus specified debug info sections, a tool could verify that a wasm file with debug info (or one that never had debug info) hasn't been modified, but won't be able to infer anything from a file with no debug info and an "incorrect" hash, since it can't tell whether a file previously had debug info or not. (Unless we also embed some kind of indication in the build id or otherwise in the wasm, that there was previously debug info, and ask tools not to strip that out. Not sure if that's worth it or not).

If we specify that (or even just implement the linker such that) the build id is a hash of some file contents, that would slow down linking, so we'd want to get some benefit in return for it.

@dschuff
Member

dschuff commented Jul 23, 2021

/cc @walkingeyerobot @trybka

@trybka

trybka commented Jul 26, 2021

I don't think we want a random UUID for build_id in (2), do we? Ideally the same inputs should generate the same outputs, including build_id -- remote builds care a lot about this kind of reproducibility.

@dschuff
Member

dschuff commented Jul 28, 2021

Yeah, build determinism is a good point; LLVM and emscripten should definitely have that, even if other tools might not care. GNU ld and ELF lld actually have all three options (hashing sections, picking a random UUID, and using a value specified on the command line).
I guess that probably means we need to hash all of the sections that LLVM produces by default, including:

  1. all of the known sections
  2. all of the debuginfo sections
  3. name section, on the same grounds that the debuginfo sections are included

... Actually, looking at ELF lld's implementation, maybe we just want to hash the entire output file.

@dschuff
Member

dschuff commented Jan 7, 2022

Ah, looking back at that code, there are a couple of details about the format of the section itself:
In particular, ELF build IDs support several different kinds of hashes: "fast", MD5, random UUID, SHA-1, and arbitrary user-supplied hex string. I can see use cases for several of those, and it would be very straightforward to support all of them in lld. Is there any reason not to?
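
To make the list of styles concrete, here is a small sketch of how a tool might parse such an option value. It is not lld's code: the enum, the function name, and the accepted spellings (`fast`, `md5`, `sha1`, `uuid`, `0x<hex>`) are assumptions modeled on the ELF linkers' `--build-id=` values.

```rust
/// The build ID styles listed above (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum BuildIdStyle {
    Fast,
    Md5,
    Sha1,
    Uuid,
    /// Arbitrary user-supplied bytes, given as a hex string.
    Hex(Vec<u8>),
}

/// Parses an option value such as "sha1" or "0xdeadbeef".
fn parse_build_id_style(arg: &str) -> Option<BuildIdStyle> {
    match arg {
        "fast" => Some(BuildIdStyle::Fast),
        "md5" => Some(BuildIdStyle::Md5),
        "sha1" => Some(BuildIdStyle::Sha1),
        "uuid" => Some(BuildIdStyle::Uuid),
        _ => {
            let hex = arg.strip_prefix("0x")?;
            // Require a non-empty, even-length run of hex digits.
            let valid = !hex.is_empty()
                && hex.len() % 2 == 0
                && hex.bytes().all(|b| b.is_ascii_hexdigit());
            if !valid {
                return None;
            }
            let bytes = (0..hex.len())
                .step_by(2)
                .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
                .collect::<Option<Vec<u8>>>()?;
            Some(BuildIdStyle::Hex(bytes))
        }
    }
}
```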

Then there's the format of the section itself. The most straightforward encoding would be

  1. a ULEB field designating the hash type (assuming we ever support more than 1)
  2. a length-prefixed wasm string field containing the hash itself. Although IIRC the "standard" wasm binary-format strings need to be UTF-8 and come to think of it I'm not sure if build-id hashes are strings, or just arbitrary binary data. So we'd have to figure that out.
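
The strawman encoding in this list can be sketched as below, treating the ID as arbitrary bytes rather than a UTF-8 string. The function names and the numeric `hash_type` values are hypothetical; this illustrates the proposal at this point in the discussion, not a finalized format.

```rust
/// Appends `value` to `out` as unsigned LEB128.
fn write_uleb128(mut value: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            break;
        }
        // More bytes follow: set the continuation bit.
        out.push(byte | 0x80);
    }
}

/// Encodes a `build_id` custom section as:
/// name, ULEB hash type, ULEB ID length, ID bytes.
fn encode_build_id_section(hash_type: u32, id: &[u8]) -> Vec<u8> {
    let name = b"build_id";
    let mut payload = Vec::new();
    write_uleb128(name.len() as u32, &mut payload);
    payload.extend_from_slice(name);
    write_uleb128(hash_type, &mut payload);
    write_uleb128(id.len() as u32, &mut payload);
    payload.extend_from_slice(id);

    // Section header: id 0 (custom) followed by the payload size.
    let mut section = vec![0x00];
    write_uleb128(payload.len() as u32, &mut section);
    section.extend_from_slice(&payload);
    section
}
```

Values below 128 encode to a single LEB128 byte, so a small enum of hash types costs one byte either way.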

@dschuff
Member

dschuff commented Jan 7, 2022

(sorry we raced). Yes, a tool-conventions doc like that one would be perfect, to specify the section's format.
As for the LLVM change, IIRC I tried it out on a simple case and it seemed to work; the main thing it needs is tests (e.g. for different hash types, and maybe use of the feature in conjunction with other linker features such as synthetic sections and relocatable output).
Also, the way it works (by writing a placeholder during the normal synthetic-section generation phase, and then writing the real hash in a special phase at the very end) seems slightly ugly to me, but I don't know of a better way to do it; maybe @sbc100 would have an opinion on that.

@sbc100
Member

sbc100 commented Jan 7, 2022

That approach seems reasonable to me. I guess this is not unlike relocation entries, which get written with placeholders and then updated. The difference here is that we would obviously need to wait until all other sections have been written, since we would be hashing their final content.
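
A rough sketch of the placeholder-then-patch flow discussed here, under several assumptions: the ID is 8 bytes and stored directly after the section name, the hash covers the whole output with the placeholder still zeroed, and `DefaultHasher` stands in for whatever hash the linker really uses. The function names are hypothetical and this is not lld's implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

const ID_LEN: usize = 8;

/// Appends a `build_id` custom section whose ID is all zeros for now.
/// Returns the offset of the ID bytes so they can be patched later.
fn write_placeholder(out: &mut Vec<u8>) -> usize {
    out.push(0x00); // custom section id
    out.push((1 + 8 + ID_LEN) as u8); // payload size (fits in one LEB byte)
    out.push(8); // name length
    out.extend_from_slice(b"build_id");
    let offset = out.len();
    out.extend_from_slice(&[0u8; ID_LEN]);
    offset
}

/// Final phase: hash the finished output and overwrite the placeholder.
fn patch_build_id(out: &mut [u8], offset: usize) {
    let mut hasher = DefaultHasher::new();
    hasher.write(&out[..]);
    let digest = hasher.finish().to_le_bytes();
    out[offset..offset + ID_LEN].copy_from_slice(&digest);
}
```

Because the placeholder has the same size as the final ID, patching it does not move any other section, which is what makes the late write safe.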

@bkotsopoulossc

Thanks for the extra details. Some thoughts (with the caveat that I am not familiar with other conventions here from ELF or other formats):

  • If we do encode the hash, maybe just a uint8 that is essentially an enum, like the tag type here
  • I'm in favour of having arbitrary binary data and not having to think of this as a string, as it feels more generic
  • Maybe having the hash type could be useful to be able to parse or interpret the build ID data in some way. But maybe it's sufficient for all consumers to just treat it as opaque binary data, and not have to care how it was hashed.
  • I do wonder if it's worth fleshing out all of these different types of hashes now, or just start with a strawman that is extensible to different types in the future. For example, it sounds like starting with just a random ID avoids some of the ambiguity in llvm around placeholders and such. To me, getting this into the spec and the binary format is a big win - adding more options around how the build ID is generated could be done later

@dschuff
Member

dschuff commented Jan 7, 2022

  • Yes, an enum is what I had in mind. If there are < 128 values, a uint8 is the same as an LEB so it doesn't really matter what we call it.
  • Yes, it does look like binary data. Actually I should have reread this thread because it's the same conclusion we came to already above 🤣
  • Regarding the hash type and what tools do with it, there's also discussion of that above. Per that discussion I think the primary/default build ID type at least for LLVM needs to be deterministic (so, a hash rather than a random ID). As discussed above, tools that modify binaries will probably have to make a case-by-case decision about whether to modify the build ID too. But it might be useful for them to know that hash type in that case? Maybe also any tool that wants to print the build ID might want to know the type (so it could e.g. format UUIDs differently from hashes)? Maybe it's not useful to distinguish different hash types though?

@bkotsopoulossc

Ahh yeah, I guess random is problematic when it comes to reproducible builds. Maybe the user-supplied string is an easy one to start with then? The idea of supporting the various different types sounds great, but I'm just seeing if we can scope this down a bit so it's easier to make progress on.

@mitsuhiko
Author

mitsuhiko commented Jan 7, 2022

As for hash format there is probably quite some flexibility here, but traditionally the limitation was often the intention to support some form of breakpad compatibility. The default debug id field has space for a UUID/GUID + 4 bytes as u32 (the age field). Since Mach-O selects a UUID for the hash and PDB uses this UUID + 32-bit age, it's probably not a bad idea to encourage tools to emit a reproducible UUID (v3 or v5) as build ID. That has the highest form of compatibility.

Knowing which exact type of a build ID something is has not been useful in our experience.


(For additional context, this is the abstraction we use for what we call breakpad-compatible debug ids: https://docs.rs/debugid/0.7.2/debugid/struct.DebugId.html. Any GNU build ID longer than 16 bytes is chopped off and an age of 0 is always used. We then use the original GNU build ID as secondary information for debug file lookup. Our symbol server lookup strategies are documented here: https://getsentry.github.io/symbolicator/advanced/symbol-lookup/)
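
The truncation rule described above can be sketched as follows. This is not the `debugid` crate's API; the struct and function are hypothetical, and zero-padding IDs shorter than 16 bytes is an assumption made for the example. Any byte-order handling a real implementation may perform is left out.

```rust
/// A 16-byte ID plus a 32-bit age, mirroring the breakpad field layout.
#[derive(Debug, PartialEq)]
struct BreakpadStyleId {
    uuid: [u8; 16],
    age: u32,
}

/// Keeps the first 16 bytes of a build ID and always uses an age of 0.
fn to_breakpad_style(build_id: &[u8]) -> BreakpadStyleId {
    let mut uuid = [0u8; 16];
    let n = build_id.len().min(16);
    // Longer IDs (e.g. a 20-byte SHA-1) are chopped off at 16 bytes;
    // shorter ones are zero-padded (an assumption of this sketch).
    uuid[..n].copy_from_slice(&build_id[..n]);
    BreakpadStyleId { uuid, age: 0 }
}
```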

@dschuff
Member

dschuff commented Jun 1, 2022

Sorry I've sat on this so long. Let's finally get it done. I uploaded #183 which I think captures what we've discussed here. After hearing @mitsuhiko's experience that knowing the exact type of ID isn't useful (and not being able to think of any use myself) I decided to just leave it out of the encoding.

@dschuff
Member

dschuff commented Jun 1, 2022

Also I just realized that I didn't take @mitsuhiko's advice and encourage a reproducible UUID as the output (or implement one in lld in https://reviews.llvm.org/D107662); instead I went with the same default lld uses for ELF (which is actually just an 8-byte "fast" hash). Do you think that's compatible "enough" or should we invent something new in lld?

@dschuff
Member

dschuff commented Jun 11, 2022

@mitsuhiko I guess a follow-up question: if I were to make lld generate a v5 UUID (based on a hash of the contents), what would I use as the "namespace" UUID to go with it?

@bkotsopoulossc

Would it be reasonable to just generate a random UUID once and bake it into the llvm code, as an "llvm namespace"?

@mitsuhiko
Author

@dschuff about the namespace it probably doesn't matter. You can probably hardcode a random ID and just use that consistently and document it. I don't have any expectations that there is a tool independent way of generating the same reproducible IDs. It's more important that the tool itself has some stability.

@dschuff
Member

dschuff commented Feb 22, 2023

I updated the prototype in https://reviews.llvm.org/D107662
It supports several different styles for compatibility (mostly the same ones as ELF).
The default style ("fast" aka "tree") hashes the contents of the output and (unlike ELF) generates a v5 UUID based on the hash (using a random namespace).
It also supports generating a random v4 UUID, a sha1 hash, and a user-specified string (as ELF does).
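
For readers unfamiliar with v5 UUIDs, the final step of deriving one from a digest can be sketched as below. This assumes a 20-byte SHA-1 digest of namespace plus content has already been computed elsewhere (the standard library has no SHA-1), and it is not the lld code; the function names are hypothetical. Only the version and variant bit-stamping defined for v5 UUIDs is shown.

```rust
/// Turns a 20-byte digest into a version-5 UUID by keeping the first
/// 16 bytes and stamping the version and variant bits.
fn uuid_v5_from_digest(digest: &[u8; 20]) -> [u8; 16] {
    let mut uuid = [0u8; 16];
    uuid.copy_from_slice(&digest[..16]);
    uuid[6] = (uuid[6] & 0x0f) | 0x50; // version 5 in the high nibble
    uuid[8] = (uuid[8] & 0x3f) | 0x80; // RFC 4122 variant (10xx xxxx)
    uuid
}

/// Formats a UUID in the usual 8-4-4-4-12 lowercase hex form.
fn format_uuid(uuid: &[u8; 16]) -> String {
    let hex: String = uuid.iter().map(|b| format!("{:02x}", b)).collect();
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}
```

Since the namespace only feeds into the digest, any fixed, documented namespace yields reproducible IDs for identical output.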

@dschuff dschuff closed this as completed in 9b80cd2 Mar 2, 2023
@dschuff
Member

dschuff commented Mar 2, 2023

I think the implementation and document in #183 capture what we've worked out here. Feel free to reopen (or open a new issue, as appropriate) if there are objections or changes we should make; since there's no ecosystem yet, I don't think it's too late to make breaking changes if we do it soon.

@bkotsopoulossc

This is awesome, I see that this just landed in llvm, can you update once we know the version of emscripten this is associated with?

@dschuff
Member

dschuff commented Mar 9, 2023

This change is now included in emscripten 3.1.33

@jedisct1
Member

jedisct1 commented Mar 9, 2023

The Zig wasm linker supports it as well.

@sbc100
Member

sbc100 commented Mar 9, 2023

Zig has its own wasm linker? Is it based on wasm-ld or something different?

@jedisct1
Member

jedisct1 commented Mar 9, 2023

Written from scratch, like other linkers.

@RReverser
Member

FWIW this tag is now also natively recognised by wasmbin (used by Sentry's wasm-split): CustomSection::BuildId


10 participants