Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework DWARF parsing to fix inline information correctness issues #607

Merged
merged 1 commit into from
Jun 28, 2022

Conversation

mstange
Copy link
Contributor

@mstange mstange commented Jun 26, 2022

Fixes #603.
Fixes #602.

This change fixes handling of multi-range inlinees, and the resulting code is slightly shorter overall.
It's probably also a bit faster, because it uses EntriesRaw and skip_attributes.

The traversal matches what the addr2line crate was doing as of this commit:
https://github.com/gimli-rs/addr2line/blob/d4d29061f4298a5f9ad1a0dd2080464757839b06/src/lib.rs#L769

(addr2line changed a bit when it switched to a two-pass traversal to improve performance when you only look up a few addresses and don't need to parse all functions. The linked commit is before that change; it's doing a one-pass traversal.)

If a function or an inlined call has multiple address ranges, we now create individual Function objects for each range.

We no longer attempt to split line records which straddle the boundary of an inline function range. Such line records make no sense; if there is an inline call starting / ending at an address, then the compiler should also emit a new line record for that address. If it doesn't do that, then splitting a line record is also not going to lead to useful output.

@codecov-commenter
Copy link

Codecov Report

Merging #607 (289f35b) into master (33fd1b0) will increase coverage by 0.70%.
The diff coverage is 96.33%.

@@            Coverage Diff             @@
##           master     #607      +/-   ##
==========================================
+ Coverage   68.24%   68.95%   +0.70%     
==========================================
  Files          83       83              
  Lines       16434    16831     +397     
==========================================
+ Hits        11216    11606     +390     
- Misses       5218     5225       +7     

Copy link
Member

@Swatinem Swatinem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very nice!

We did a couple of experiments back when we re-designed the symcache, to construct the inlined source locations directly via scanning over the DWARF tree once. I tried to outline the approach here: https://github.com/getsentry/symbolic/blob/ref/symcache-experiments/symbolic-symcache-new/brainstorming.md#how-to-convert-dwarf=
We also did experiments comparing the output of that algorithm with whatever addr2line and the existing symbolic-debuginfo abstraction did, and we found that all three disagreed at times, and my approach was closer to addr2line in other times.

Your FunctionBuilder is very similar, though it first parses things into line records and inlinee hierarchy before sorting and combining the two, so maybe my approach just made some wrong assumptions.

Either way, I trust you did things correctly, and things look a lot more readable and understandable than before.

symbolic-debuginfo/src/dwarf.rs Outdated Show resolved Hide resolved
continue;
}
let name = symbol_name
.or_else(|| self.resolve_dwarf_name(&self.inner.unit.entry(dw_die_offset).unwrap()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can the dw_die_offset ever be invalid? I think so, given we parse that from the raw entries without validation.
In that case I think we should raise a proper Err here, and down below for inlinees.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dw_die_offset is just the byte offset of the reader, as far as I can tell. It comes from the caller who just parsed an abbrev from that very offset. So I think it's always valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what happens here is that we parse the attributes for the same entry twice: Once to get the address ranges, and then again to get the name. There is an opportunity here to merge the two, but I didn't want to bother with that change in a PR that's already quite large. So the call to unit.entry() is more of an escape hatch.

@mstange mstange marked this pull request as ready for review June 27, 2022 18:20
@mstange mstange requested a review from a team June 27, 2022 18:20
@mstange mstange changed the title WIP: Rework DWARF parsing to fix inline information correctness issues Rework DWARF parsing to fix inline information correctness issues Jun 27, 2022
@mstange
Copy link
Contributor Author

mstange commented Jun 27, 2022

Thanks for the review!

I tried to outline the approach here: https://github.com/getsentry/symbolic/blob/ref/symcache-experiments/symbolic-symcache-new/brainstorming.md#how-to-convert-dwarf=

Oh, interesting! I don't have any strong opinions on the data representation. It sounds like you did the right thing when you did some back-of-the-napkin calculations on real world examples (Electron).

I did notice the following:

Let's say representation A is lines-at-deepest-level-plus-call-locations, and representation B is lines-per-level. DWARF and breakpad use representation A, whereas symbolic and pdb use representation B.

For the Firefox profiler, we end up doing three conversions between A and B: The symbolic dwarf parser goes from A to B, and then dump_syms converts from B back to A. Then in the profiler symbol server we load the breakpad file using symbolic, which again converts from A to B. Then we save the B representation to a symcache and look up addresses in it.
I also noticed that the symbolic representation stores a filename for each line record, even though with representation B you'd expect all line records of a function to have the same filename: The file in which that function's code lives.

@Swatinem
Copy link
Member

with representation B you'd expect all line records of a function to have the same filename: The file in which that function's code lives.

Not necessarily. I think @jan-auer brought up the interesting case that macros can come from completely different files/lines as well. They are not "inlined" code, but rather just some snippets that have a completely different source location.

@mstange
Copy link
Contributor Author

mstange commented Jun 28, 2022

Not necessarily. I think @jan-auer brought up the interesting case that macros can come from completely different files/lines as well. They are not "inlined" code, but rather just some snippets that have a completely different source location.

Hmm, have you seen this in practice? Jeff and I were thinking the same thing and then did some cursory checking of what happened in Rust, and found out that at least in Rust what I said is true: Any code from macros is assigned to the line where the macro is "called" from. The only time we saw debug info point to a line inside a macro was for lines inside a function which was defined by a macro - i.e. the entire function was defined inside the macro. So the property still held.

@Swatinem
Copy link
Member

I think that was rather the case for C macros, and also compiler specific. IIRC MSVC was the one that behaved that way. Which is interesting since you mentioned that PDBs have a different internal structure than DWARF.

Fixes getsentry#603.
Fixes getsentry#602.

This change fixes handling of multi-range inlinees, and the resulting
code is slightly shorter overall.
It's probably also a bit faster, because it uses `EntriesRaw` and
`skip_attributes`.

The traversal matches what the addr2line crate was doing as of this commit:
https://github.com/gimli-rs/addr2line/blob/d4d29061f4298a5f9ad1a0dd2080464757839b06/src/lib.rs#L769

(addr2line changed a bit when switched to a two-pass traversal to improve
performance when you only look up a few addresses and don't need to parse
all functions. The linked commit is before that change; it's doing a
one-pass traversal.)

If a function or an inlined call has multiple address ranges, we now
create individual `Function` objects for each range.

We no longer attempt to split line records which straddle the boundary
of an inline function range. Such line records make no sense; if there
is an inline call starting / ending at an address, then the compiler
should also emit a new line record for that address. If it doesn't do
that, then splitting a line record is also not going to lead to useful
output.

I regenerated crash.inlines.sym by re-running dump_syms with the
symbolic-debuginfo fix from this PR.
With that regenerated sym file, the test snapshot now makes more sense.

And you can also compare test_objects__breakpad_functions_mac_with_inlines.snap
with test_objects__mach_functions.snap (updated in previous commit): the
two files are initially generated from the same object, one directly and
one indirectly via dump_syms / breakpad parsing. You can see that the
information roundtrips through dump_syms correctly now; the only
differences between the two files are resolved filenames, demangled
function names, and deduplicated line records. All of these differences
are expected from dump_syms.
@mstange
Copy link
Contributor Author

mstange commented Jun 28, 2022

I think that was rather the case for C macros, and also compiler specific. IIRC MSVC was the one that behaved that way.

Ah, interesting. That's too bad, I guess we can't simplify the symcache format (or the Function / LineInfo structs) then.

It's also a bit unfortunate for the profiler UI, because the profiler call tree displays call nodes as <function name> <file name>, so the UI assumes that all stack frames in a given function have the same file. Oh well, I guess we'll just have weird sibling call nodes of the same function in different files in those MSVC C macro cases.

@Swatinem
Copy link
Member

You also have the case of different functions in different files, but with the same name. So the function is uniquely defined by its declaration file/line either way. I would like to add this metadata at some point, but we don’t have that right now, only the comp_dir which on its own is pretty useless though.

@Swatinem Swatinem enabled auto-merge (squash) June 28, 2022 15:57
@mstange
Copy link
Contributor Author

mstange commented Jun 28, 2022

You also have the case of different functions in different files, but with the same name. So the function is uniquely defined by its declaration file/line either way.

Sure, yes, but in those cases you'd legitimately want two different call nodes.

I would like to add this metadata at some point, but we don’t have that right now, only the comp_dir which on its own is pretty useless though.

Right, and the addr2line crate doesn't return this information either. So at the moment the profiler can't really differentiate between the two cases.

@Swatinem Swatinem merged commit e902660 into getsentry:master Jun 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants