Rework DWARF parsing to fix inline information correctness issues #607

mstange · 2022-06-26T16:45:55Z

Fixes #603.
Fixes #602.

This change fixes handling of multi-range inlinees, and the resulting code is slightly shorter overall.
It's probably also a bit faster, because it uses EntriesRaw and skip_attributes.

The traversal matches what the addr2line crate was doing as of this commit:
https://github.com/gimli-rs/addr2line/blob/d4d29061f4298a5f9ad1a0dd2080464757839b06/src/lib.rs#L769

(addr2line changed a bit when it switched to a two-pass traversal to improve performance when you only look up a few addresses and don't need to parse all functions. The linked commit is before that change; it's doing a one-pass traversal.)

If a function or an inlined call has multiple address ranges, we now create individual Function objects for each range.

We no longer attempt to split line records which straddle the boundary of an inline function range. Such line records make no sense; if there is an inline call starting / ending at an address, then the compiler should also emit a new line record for that address. If it doesn't do that, then splitting a line record is also not going to lead to useful output.

codecov-commenter · 2022-06-26T19:15:49Z

Codecov Report

Merging #607 (289f35b) into master (33fd1b0) will increase coverage by 0.70%.
The diff coverage is 96.33%.

@@            Coverage Diff             @@
##           master     #607      +/-   ##
==========================================
+ Coverage   68.24%   68.95%   +0.70%     
==========================================
  Files          83       83              
  Lines       16434    16831     +397     
==========================================
+ Hits        11216    11606     +390     
- Misses       5218     5225       +7

Swatinem

This looks very nice!

We did a couple of experiments back when we re-designed the symcache, to construct the inlined source locations directly via scanning over the DWARF tree once. I tried to outline the approach here: https://github.com/getsentry/symbolic/blob/ref/symcache-experiments/symbolic-symcache-new/brainstorming.md#how-to-convert-dwarf=
We also did experiments comparing the output of that algorithm with whatever addr2line and the existing symbolic-debuginfo abstraction did, and we found that all three disagreed at times, and my approach was closer to addr2line in other times.

Your FunctionBuilder is very similar, though it first parses things into line records and inlinee hierarchy before sorting and combining the two, so maybe my approach just made some wrong assumptions.

Either way, I trust you did things correctly, and things look a lot more readable and understandable than before.

symbolic-debuginfo/src/dwarf.rs

Swatinem · 2022-06-27T10:21:55Z

symbolic-debuginfo/src/dwarf.rs

-                                continue;
-                            }
+        let name = symbol_name
+            .or_else(|| self.resolve_dwarf_name(&self.inner.unit.entry(dw_die_offset).unwrap()))


can the dw_die_offset ever be invalid? I think so, given we parse that from the raw entries without validation.
In that case I think we should raise a proper Err here, and down below for inlinees.

The dw_die_offset is just the byte offset of the reader, as far as I can tell. It comes from the caller who just parsed an abbrev from that very offset. So I think it's always valid.

So what happens here is that we parse the attributes for the same entry twice: Once to get the address ranges, and then again to get the name. There is an opportunity here to merge the two, but I didn't want to bother with that change in a PR that's already quite large. So the call to unit.entry() is more of an escape hatch.

mstange · 2022-06-27T18:37:52Z

Thanks for the review!

I tried to outline the approach here: https://github.com/getsentry/symbolic/blob/ref/symcache-experiments/symbolic-symcache-new/brainstorming.md#how-to-convert-dwarf=

Oh, interesting! I don't have any strong opinions on the data representation. It sounds like you did the right thing when you did some back-of-the-napkin calculations on real world examples (Electron).

I did notice the following:

Let's say representation A is lines-at-deepest-level-plus-call-locations, and representation B is lines-per-level. DWARF and breakpad use representation A, whereas symbolic and pdb use representation B.

For the Firefox profiler, we end up doing three conversions between A and B: The symbolic dwarf parser goes from A to B, and then dump_syms converts from B back to A. Then in the profiler symbol server we load the breakpad file using symbolic, which again converts from A to B. Then we save the B representation to a symcache and look up addresses in it.
I also noticed that the symbolic representation stores a filename for each line record, even though with representation B you'd expect all line records of a function to have the same filename: The file in which that function's code lives.

Swatinem · 2022-06-28T10:16:34Z

with representation B you'd expect all line records of a function to have the same filename: The file in which that function's code lives.

Not necessarily. I think @jan-auer brought up the interesting case that macros can come from completely different files/lines as well. They are not "inlined" code, but rather just some snippets that have a completely different source location.

mstange · 2022-06-28T15:15:08Z

Not necessarily. I think @jan-auer brought up the interesting case that macros can come from completely different files/lines as well. They are not "inlined" code, but rather just some snippets that have a completely different source location.

Hmm, have you seen this in practice? Jeff and I were thinking the same thing and then did some cursory checking of what happened in Rust, and found out that at least in Rust what I said is true: Any code from macros is assigned to the line where the macro is "called" from. The only time we saw debug info point to a line inside a macro was for lines inside a function which was defined by a macro - i.e. the entire function was defined inside the macro. So the property still held.

Swatinem · 2022-06-28T15:44:27Z

I think that was rather the case for C macros, and also compiler specific. IIRC MSVC was the one that behaved that way. Which is interesting since you mentioned that PDBs have a different internal structure than DWARF.

Fixes getsentry#603. Fixes getsentry#602. This change fixes handling of multi-range inlinees, and the resulting code is slightly shorter overall. It's probably also a bit faster, because it uses `EntriesRaw` and `skip_attributes`. The traversal matches what the addr2line crate was doing as of this commit: https://github.com/gimli-rs/addr2line/blob/d4d29061f4298a5f9ad1a0dd2080464757839b06/src/lib.rs#L769 (addr2line changed a bit when switched to a two-pass traversal to improve performance when you only look up a few addresses and don't need to parse all functions. The linked commit is before that change; it's doing a one-pass traversal.) If a function or an inlined call has multiple address ranges, we now create individual `Function` objects for each range. We no longer attempt to split line records which straddle the boundary of an inline function range. Such line records make no sense; if there is an inline call starting / ending at an address, then the compiler should also emit a new line record for that address. If it doesn't do that, then splitting a line record is also not going to lead to useful output. I regenerated crash.inlines.sym by re-running dump_syms with the symbolic-debuginfo fix from this PR. With that regenerated sym file, the test snapshot now makes more sense. And you can also compare test_objects__breakpad_functions_mac_with_inlines.snap with test_objects__mach_functions.snap (updated in previous commit): the two files are initially generated from the same object, one directly and one indirectly via dump_syms / breakpad parsing. You can see that the information roundtrips through dump_syms correctly now; the only differences between the two files are resolved filenames, demangled function names, and deduplicated line records. All of these differences are expected from dump_syms.

mstange · 2022-06-28T15:54:02Z

I think that was rather the case for C macros, and also compiler specific. IIRC MSVC was the one that behaved that way.

Ah, interesting. That's too bad, I guess we can't simplify the symcache format (or the Function / LineInfo structs) then.

It's also a bit unfortunate for the profiler UI, because the profiler call tree displays call nodes as <function name> <file name>, so the UI assumes that all stack frames in a given function have the same file. Oh well, I guess we'll just have weird sibling call nodes of the same function in different files in those MSVC C macro cases.

Swatinem · 2022-06-28T15:56:47Z

You also have the case of different functions in different files, but with the same name. So the function is uniquely defined by its declaration file/line either way. I would like to add this metadata at some point, but we don’t have that right now, only the comp_dir which on its own is pretty useless though.

mstange · 2022-06-28T15:58:45Z

You also have the case of different functions in different files, but with the same name. So the function is uniquely defined by its declaration file/line either way.

Sure, yes, but in those cases you'd legitimately want two different call nodes.

I would like to add this metadata at some point, but we don’t have that right now, only the comp_dir which on its own is pretty useless though.

Right, and the addr2line crate doesn't return this information either. So at the moment the profiler can't really differentiate between the two cases.

mstange force-pushed the better-dwarf-parsing branch from f93dd45 to 289f35b Compare June 26, 2022 19:10

Swatinem reviewed Jun 27, 2022

View reviewed changes

mstange mentioned this pull request Jun 27, 2022

Add support for parsing breakpad sym files containing INLINE_ORIGIN and INLINE records #605

Merged

mstange force-pushed the better-dwarf-parsing branch from 289f35b to 19eb11c Compare June 27, 2022 18:20

mstange marked this pull request as ready for review June 27, 2022 18:20

mstange requested a review from a team June 27, 2022 18:20

mstange changed the title ~~WIP: Rework DWARF parsing to fix inline information correctness issues~~ Rework DWARF parsing to fix inline information correctness issues Jun 27, 2022

Swatinem approved these changes Jun 28, 2022

View reviewed changes

mstange force-pushed the better-dwarf-parsing branch from 19eb11c to dd50717 Compare June 28, 2022 15:49

Swatinem enabled auto-merge (squash) June 28, 2022 15:57

Swatinem merged commit e902660 into getsentry:master Jun 28, 2022

mstange mentioned this pull request Jul 9, 2022

PDB parsing produces overlapping inlinees #620

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework DWARF parsing to fix inline information correctness issues #607

Rework DWARF parsing to fix inline information correctness issues #607

mstange commented Jun 26, 2022 •

edited

Loading

codecov-commenter commented Jun 26, 2022

Swatinem left a comment

Swatinem Jun 27, 2022

mstange Jun 27, 2022

mstange Jun 27, 2022

mstange commented Jun 27, 2022 •

edited

Loading

Swatinem commented Jun 28, 2022

mstange commented Jun 28, 2022

Swatinem commented Jun 28, 2022

mstange commented Jun 28, 2022 •

edited

Loading

Swatinem commented Jun 28, 2022

mstange commented Jun 28, 2022

Rework DWARF parsing to fix inline information correctness issues #607

Rework DWARF parsing to fix inline information correctness issues #607

Conversation

mstange commented Jun 26, 2022 • edited Loading

codecov-commenter commented Jun 26, 2022

Codecov Report

Swatinem left a comment

Choose a reason for hiding this comment

Swatinem Jun 27, 2022

Choose a reason for hiding this comment

mstange Jun 27, 2022

Choose a reason for hiding this comment

mstange Jun 27, 2022

Choose a reason for hiding this comment

mstange commented Jun 27, 2022 • edited Loading

Swatinem commented Jun 28, 2022

mstange commented Jun 28, 2022

Swatinem commented Jun 28, 2022

mstange commented Jun 28, 2022 • edited Loading

Swatinem commented Jun 28, 2022

mstange commented Jun 28, 2022

mstange commented Jun 26, 2022 •

edited

Loading

mstange commented Jun 27, 2022 •

edited

Loading

mstange commented Jun 28, 2022 •

edited

Loading