Understanding DWARF+WebAssembly offsets #9

Open
kripken opened this issue Dec 19, 2019 · 6 comments

@kripken

kripken commented Dec 19, 2019

Working on binaryen support for DWARF, I realized I don't know how to read the line info data. The main issues are:

  • The code addresses doc says offsets are "the offset of an instruction relative within the Code section of the WebAssembly file". Does "the Code section" mean the entire code section, including the 0x0a byte that declares the code section and the LEB for its length? Or just the body, without those?
  • Can debug lines refer to code section offsets that are not code? (Like the function declarations.)
  • Can debug lines refer to inner parts of an instruction, and not the start?

In more detail, here is what I am trying: I started with @yurydelendik's fib2 sample,

__attribute__((used))
int fib(int n) {
  int i, t, a = 0, b = 1;
  for (i = 0; i < n; i++) {
    t = a;
    a = b;
    b += t;
  }
  return b;
}

and I build it with

clang fib2.c -O3 -g -o fib2.clang.wasm  -target wasm32-unknown-emscripten -nostdlib -Wl,--no-entry

LLVM's dwarfdump says this:

Address            Line   Column File   ISA Discriminator Flags
------------------ ------ ------ ------ --- ------------- -------------
0x0000000000000002      2      0      1   0             0  is_stmt
0x000000000000000b      4     17      1   0             0  is_stmt prologue_end
0x0000000000000010      4      3      1   0             0 
0x0000000000000012      0      3      1   0             0 
0x000000000000001e      7      7      1   0             0  is_stmt
0x0000000000000025      0      7      1   0             0 
0x0000000000000029      4     17      1   0             0  is_stmt
0x000000000000002e      4      3      1   0             0 
0x0000000000000034      9      3      1   0             0  is_stmt
0x0000000000000037      9      3      1   0             0  is_stmt end_sequence

The first line there says address 2. If the offset is in the code section body, then that's in the middle of the function declaration, and not executable code. Is that expected?

The fifth line has address 0x1e. Looking in the binary, though, the code section's body starts at 0x2d, and adding the offset we get 0x4b. That is the second of the 2 bytes of an i32.const -1, which seems odd?
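The arithmetic above can be written down as a small helper. This is only a sketch: `dwarf_addr_to_file_offset` is a hypothetical name, not part of any tool mentioned here, and it encodes the assumption being tested in this comment, namely that line-table addresses are relative to the start of the code section body (0x2d in this binary).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from any tool in this thread): map a DWARF
 * line-table address to an absolute file offset, ASSUMING the address is
 * relative to the start of the code section body. */
uint32_t dwarf_addr_to_file_offset(uint32_t code_body_start, uint32_t dwarf_addr) {
    /* Guard against wrap-around on malformed input. */
    assert(dwarf_addr <= UINT32_MAX - code_body_start);
    return code_body_start + dwarf_addr;
}
```

With the values from this binary, `dwarf_addr_to_file_offset(0x2d, 0x1e)` gives 0x4b, the offset discussed above.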

fib2.clang.wasm.zip

@kripken

kripken commented Dec 19, 2019

Also, when I load the wasm in the code explorer, it only shows 3 lines in the UI (2, 4, 7), while the debug line table also mentions line 9. That line 9 entry starts at 0x34, which, relative to the start of the code section's body, is at 0x61. Isn't that past the end of the code section?

cc @dschuff @yurydelendik

@yurydelendik

the offset of an instruction relative within the Code section of the WebAssembly file

The Code section starts at its function count LEB. Several decisions led to that:

  • We can potentially point to function locals bytes (see the related response below), so it was decided that it is better to start before the first function's length LEB.
  • No valid DWARF offset shall be 0, and no range shall start from 0. We are reserving that for dead symbols: when the linker cannot relocate an entry, it places 0 in the .debug_info or .debug_line table.
  • WASM files can potentially be manipulated to remove sections (and rewrite the section header), so the decision was made to make DWARF code offsets relative to the actual code section start.
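As an illustration of "starts at its function count LEB", here is a minimal sketch that walks the top-level sections of a wasm binary and returns the file offset where the code section's contents begin. The function names are hypothetical. It relies only on the standard binary layout (8-byte header, then sections encoded as an id byte, a LEB128 size, and the contents, with 10 as the code section id) and does no validation beyond bounds checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode an unsigned 32-bit LEB128 at *pos, advancing *pos past it.
 * Returns 0 on success, -1 if the input is truncated or too long. */
static int read_uleb32(const uint8_t *buf, size_t len, size_t *pos, uint32_t *out) {
    uint32_t result = 0;
    for (int shift = 0; shift < 35; shift += 7) {
        if (*pos >= len) return -1;
        uint8_t byte = buf[(*pos)++];
        result |= (uint32_t)(byte & 0x7f) << shift;
        if ((byte & 0x80) == 0) {
            *out = result;
            return 0;
        }
    }
    return -1;
}

/* Hypothetical helper: return the file offset of the code section's
 * contents (the function count LEB), or -1 if there is no code section
 * or the file is malformed. */
long find_code_section_start(const uint8_t *buf, size_t len) {
    assert(buf != NULL);
    size_t pos = 8; /* skip the "\0asm" magic and the 4-byte version */
    while (pos < len) {
        uint8_t id = buf[pos++];
        uint32_t size;
        if (read_uleb32(buf, len, &pos, &size) != 0) return -1;
        if (id == 10) return (long)pos; /* 10 (0x0a) is the code section id */
        if (size > len - pos) return -1;
        pos += size; /* skip this section's contents */
    }
    return -1;
}
```

Under the convention described above, a DWARF address of N would then refer to the byte at `find_code_section_start(...) + N` in the file.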

Can debug lines refer to code section offsets that are not code?

In theory, yes. .debug_info will have ranges that point to the entire function body. On the debugger side, a "PC" pointing at the locals bytes may signal entering a frame. It is not used at the moment; we can change that requirement and use only offsets that point to code section body/instructions.

Can debug lines refer to inner parts of an instruction, and not the start?

I am not sure DWARF has a requirement to point only to the start of an instruction.

The relocation section is definitely capable of pointing to inner parts of an instruction.

@kripken

kripken commented Dec 19, 2019

Thanks @yurydelendik !

No valid DWARF offset shall be 0, and no range shall start from 0. We are reserving that for dead symbols: when the linker cannot relocate an entry, it places 0 in the .debug_info or .debug_line table.

Interesting, why not just drop that line then, seems like it won't be usable later anyhow? Or is there some other use for the information?

I am not sure DWARF has a requirement to point only to the start of an instruction.

It would require some additional logic in binaryen to support that. I was hoping not to need it...

@yurydelendik

Interesting, why not just drop that line then

lld cannot parse, optimize, or rewrite DWARF data due to its complexity. @sbc100, is that correct?

seems like it won't be usable later anyhow?

It is not useful. Note that .debug_line encodes only a few absolute offsets; the rest are deltas. That means the deltas that follow a dead entry become invalid/dead as well.
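To make the point about deltas concrete, here is a deliberately simplified model, not the real .debug_line encoding: every operation either sets an absolute address or advances the current one, and each operation yields one row. The type and function names are hypothetical. Once an absolute address of 0 is set (a dead entry), every row derived from it by deltas is marked dead too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for "set an absolute address" and "advance by a
 * delta". These are NOT the actual DWARF opcodes. */
typedef enum { OP_SET_ADDRESS, OP_ADVANCE } OpKind;
typedef struct { OpKind kind; uint32_t value; } LineOp;

/* Replay the ops, recording each row's address and whether it is dead.
 * A row is dead if the most recent absolute address was 0. */
void replay_addresses(const LineOp *ops, size_t n, uint32_t *addrs, int *dead) {
    assert(ops != NULL && addrs != NULL && dead != NULL);
    uint32_t address = 0;
    int in_dead_sequence = 1; /* nothing is valid until an address is set */
    for (size_t i = 0; i < n; i++) {
        if (ops[i].kind == OP_SET_ADDRESS) {
            address = ops[i].value;
            in_dead_sequence = (address == 0);
        } else {
            address += ops[i].value; /* delta relative to the previous row */
        }
        addrs[i] = address;
        dead[i] = in_dead_sequence;
    }
}
```

In this model a consumer can only skip dead rows; it cannot recover their real addresses, because the deltas carry no absolute information.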

It would require some additional logic in binaryen to support that.

Agree. We can recommend that for WebAssembly DWARF.

@sbc100

sbc100 commented Dec 20, 2019 via email

kripken added a commit to WebAssembly/binaryen that referenced this issue Dec 21, 2019
With this, we can update DWARF debug line info properly as
we write a new binary.

To do that we track binary locations as we write. Each
instruction is mapped to the location it is written to. We
must also adjust them as we move code around because
of LEB optimization (we emit a function or a section
with a 5-byte LEB placeholder, the maximal size; later
we shrink it which is almost always possible).
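The LEB adjustment described above can be sketched as follows. This is an illustration with hypothetical names, not binaryen's actual code: it computes how many bytes a 5-byte placeholder shrinks by once the real size is known, and shifts a recorded location that lies after the placeholder by that amount.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_LEB32_SIZE = 5 }; /* size of the placeholder for a 32-bit LEB */

/* Number of bytes in the minimal unsigned LEB128 encoding of value. */
size_t uleb128_size(uint32_t value) {
    size_t size = 1;
    while (value >= 0x80) {
        value >>= 7;
        size++;
    }
    return size;
}

/* Bytes saved when the placeholder is replaced by the minimal encoding. */
size_t leb_shrink_amount(uint32_t value) {
    return MAX_LEB32_SIZE - uleb128_size(value);
}

/* Hypothetical adjustment: a location recorded after the placeholder moves
 * back by the number of bytes the placeholder shrank. */
uint32_t adjust_location(uint32_t location, uint32_t size_value) {
    size_t shrink = leb_shrink_amount(size_value);
    assert(location >= shrink);
    return location - (uint32_t)shrink;
}
```

For example, a function body of 0x37 bytes needs a 1-byte LEB, so every location recorded inside it moves back by 4 bytes.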

writeDWARFSections() now takes a second param, the new
locations of instructions. It then maps debug line info from the
original offsets in the binary to the new offsets in the binary
being written.

The core logic for updating the debug line section is in
wasm-debug.cpp. It basically tracks state machine logic
both to read the existing debug lines and to emit the new
ones. I couldn't find a way to reuse LLVM code for this, but
reading LLVM's code was very useful here.

A final tricky thing we need to do is to update the DWARF
section's internal size annotation. The LLVM YAML writing
code doesn't do that for us. Luckily it's pretty easy, in
fixEmittedSection we just update the first 4 bytes in place
to have the section size, after we've emitted it and know
the size.
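A sketch of that in-place fix, under stated assumptions: the section holds a single 32-bit DWARF unit, and the length field does not count its own 4 bytes (as is the case for the DWARF32 initial length). `patch_unit_length` is a hypothetical name; what fixEmittedSection does exactly is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: overwrite the first 4 bytes of an emitted section
 * with the little-endian length of everything that follows them.
 * ASSUMES one DWARF32 unit per section. */
void patch_unit_length(uint8_t *section, size_t total_size) {
    assert(section != NULL && total_size >= 4);
    uint32_t length = (uint32_t)(total_size - 4); /* exclude the field itself */
    section[0] = (uint8_t)(length & 0xff);
    section[1] = (uint8_t)((length >> 8) & 0xff);
    section[2] = (uint8_t)((length >> 16) & 0xff);
    section[3] = (uint8_t)((length >> 24) & 0xff);
}
```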

This ignores debug lines with a 0 in the line, col, or addr,
see WebAssembly/debugging#9 (comment)

This ignores debug line offsets into the middle of
instructions, which LLVM sometimes emits for some
reason, see WebAssembly/debugging#9 (comment)
Handling that would likely at least double our memory
usage, which is unfortunate - we are run in an LTO manner,
where the entire app's DWARF is present, and it may be
massive. I think we should see if such odd offsets are
a bug in LLVM, and if we can fix or prevent that.
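One way to picture the two "ignore" rules above is a lookup from old instruction-start offsets to new ones. The sketch below is an assumption-laden illustration, not binaryen's data structure: it uses a table sorted by old offset and a binary search, treats address 0 as dead, and reports "not found" for any address that is not the start of a known instruction, so the caller can drop that line entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: where an instruction started in the original
 * binary and where it starts in the binary being written. */
typedef struct { uint32_t old_offset; uint32_t new_offset; } Relocated;

/* Look up old_addr in a table sorted by old_offset. Returns 1 and stores
 * the new offset in *out on an exact match. Returns 0 for address 0 and
 * for addresses that are not instruction starts (e.g. mid-instruction). */
int remap_address(const Relocated *table, size_t n, uint32_t old_addr, uint32_t *out) {
    assert(table != NULL && out != NULL);
    if (old_addr == 0) return 0; /* reserved for dead entries */
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].old_offset == old_addr) {
            *out = table[mid].new_offset;
            return 1;
        }
        if (table[mid].old_offset < old_addr) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return 0;
}
```

Supporting mid-instruction addresses would need extra information per instruction (for example its length), which is the kind of additional memory cost mentioned above.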

This does not emit "special" opcodes for debug lines. Those
are purely an optimization, which I wanted to leave for
later. (Even without them we decrease the size quite a lot,
btw, as many lines have 0s in them...)

This adds some testing that shows we can load and save
fib2.c and fannkuch.cpp properly. The latter includes more
than one function and has nontrivial code.

To actually emit correct offsets a few minor fixes are
done here:

*   Fix the code section location tracking during reading -
    the correct offset we care about is the body of the code
    section, not including the section declaration and size.
*   Fix wasm-stack debug line emitting. We need to update
    in BinaryInstWriter::visit(), that is, right before writing
    bytes for the instruction. That differs from
    BinaryenIRWriter::visit which is a recursive function
    that also calls the children - so the offset there would be
    of the first child. For some reason that is correct with
    source maps, I don't understand why, but it's wrong for
    DWARF...
*   Print code section offsets in hex, to match other tools.

Remove DWARFUpdate pass, which was useful for testing
temporarily, but doesn't make sense now (it just updates without
writing a binary).

cc @yurydelendik

@turbolent

turbolent commented Feb 10, 2022

Thank you for the explanation @yurydelendik!

I'm trying to parse DWARF line info for https://github.com/turbolent/w2c2 and had the same questions after reading the spec, i.e. where the start of the code section is, and whether it is normal that line addresses sometimes point to the middle of instructions. Maybe it is worth documenting this better in the spec?

I'm still a bit confused about the last part, addresses pointing to the middle of instructions. Why not require alignment?
