Add support for SeparatedCode symbols. #622

mstange · 2022-07-13T21:41:11Z

SeparatedCode symbols usually directly follow their associated Procedure
symbol. They describe a part of the function that has been extracted out
and moved to a separate place. They come with their own set of line
information and can be treated very similarly to regular functions.

The impact of this patch can be tested on a combase.pdb from the
Microsoft symbol server:

% curl -o combase.pdb -L "https://msdl.microsoft.com/download/symbols/combase.pdb/071849A7C75FD246A3367704EE1CA85B1/combase.pdb"
% cargo run --release -p symcache_debug -- --report -d combase.pdb
% cargo run --release -p addr2line -- -Cfi -e combase.pdb 0x1a147c

Before this patch:

% cargo run -p addr2line -- -Cfi -e combase.pdb 0x1a147c
Containers::Intrusive::Details::HashTableBase<class std::tuple<class ObjectLibrary::OpaqueString,enum WinRT::Metadata::Marshaling::MetadataType::ObjectType,enum WinRT::Metadata::Marshaling::MethodParameterInfo>,struct WinRT::Metadata::Marshaling::TypeRegistry::TypeEntry,struct Containers::Intrusive::Details::MemberPtrAccessor<struct WinRT::Metadata::Marshaling::TypeRegistry::TypeEntry,struct Containers::Intrusive::HashMemberHook<class std::tuple<class ObjectLibrary::OpaqueString,enum WinRT::Metadata::Marshaling::MetadataType::ObjectType,enum WinRT::Metadata::Marshaling::MethodParameterInfo>,struct WinRT::Metadata::Marshaling::TypeRegistry::TypeEntry>,24>,struct Containers::Intrusive::Details::Key<class std::tuple<class ObjectLibrary::OpaqueString,enum WinRT::Metadata::Marshaling::MetadataType::ObjectType,enum WinRT::Metadata::Marshaling::MethodParameterInfo> > >::ResizeInternalCache
  at ??:0

(wrong symbol)

After this patch:

CComCatalog::GetProcessInfoInternal
  at onecore\com\combase\catalog\catalog.cxx:4132

The symcache for combase.pdb grows from 10.6 MB to 11 MB, and from
having 28844 functions to having 31194 functions.

codecov-commenter · 2022-07-13T21:46:52Z

Codecov Report

Merging #622 (778ad9e) into master (7d4e25a) will decrease coverage by 0.05%.
The diff coverage is 61.29%.

❗ Current head 778ad9e differs from pull request most recent head 1ed3e84. Consider uploading reports for the commit 1ed3e84 to get more accurate results

@@            Coverage Diff             @@
##           master     #622      +/-   ##
==========================================
- Coverage   69.19%   69.13%   -0.06%     
==========================================
  Files          84       84              
  Lines       16873    16883      +10     
==========================================
- Hits        11675    11672       -3     
- Misses       5198     5211      +13

Swatinem

This looks good.

As you mention changes to the symcache size / number of functions:

We currently uniquely identify functions by their name and entry_pc / address.
The growth in number of functions you observe is thus duplicates with a "bogus" entry_pc.

The intention of Function::address, and entry_pc as we persist it into the symcache was to give, well, the entry point of the function, as you would call it from other code. Thats why we already have duplicates for inline functions. We may have one "unique function" for the non-inlined version with its "real" entry_pc. Plus another value for the inlined version for which we set the entry_pc to a sentinel value that says "has none".

Do you think that makes sense in general?

And specific to this change: What does address mean in this case for such split functions (which I guess pdb calls SeparatedCode). Can you actually properly call these with their address? Or are they missing the normal function prolog/epilog?

mstange · 2022-07-15T13:10:04Z

We currently uniquely identify functions by their name and entry_pc / address. The growth in number of functions you observe is thus duplicates with a "bogus" entry_pc.

Oh, interesting, I didn't know about entry_pc - I've only ever looked at symbolic_debuginfo::Function, never at symbolic_symcache::Function.

The intention of Function::address, and entry_pc as we persist it into the symcache was to give, well, the entry point of the function, as you would call it from other code.

I see. But isn't this a mixing of two very different concerns?

Given a code address, tell me which function the instruction at that address belongs to.
Given a function, tell me how to call it.

These two questions seem to have very different areas of application. For my purposes, I am only ever interested in the first, never in the second. If I was interested in the second, then the address wouldn't be enough, I'd also need to know calling convention / parameter types / etc.
So what is entry_pc used for?

Thats why we already have duplicates for inline functions. We may have one "unique function" for the non-inlined version with its "real" entry_pc. Plus another value for the inlined version for which we set the entry_pc to a sentinel value that says "has none".

Do you think that makes sense in general?

Sure, I guess? You can't call "inlined" functions so it makes sense to give them a value of None. But I can't really say more unless I know what people are using entry_pc for.

And specific to this change: What does address mean in this case for such split functions (which I guess pdb calls SeparatedCode). Can you actually properly call these with their address? Or are they missing the normal function prolog/epilog?

I don't know for sure but I strongly suspect that they're missing prolog / epilog and you can't call them.

Swatinem · 2022-07-15T13:17:56Z

The idea of entry_pc really is to give you a "symbol relative offset" which is reported in different ways by different tools.
We don’t have that exposed yet in our UI, but we do return that internally (with varying accuracy though). Eventually we would expose that in the UI as well, I remember this feature was requested by customers at some point.

Speaking of split functions: Is it guaranteed that the split code is always "higher"/"after" the entry point? I wonder if we can assume that this offset is always positive? Thats a question for both pdb and dwarf though.

mstange · 2022-07-15T14:23:07Z

The idea of entry_pc really is to give you a "symbol relative offset" which is reported in different ways by different tools.

Oh, of course! For things like some_system_library_function() + 0x2b.

I'm even using that exact field in the profiler! I totally forgot. I use it to collect a list of symbol addresses.

And this list of symbol addresses, or ideally symbol "address ranges", is what I'm planning to use in the upcoming assembly view in the profiler:

I want to display the sampled instruction in the context of its surrounding assembly code. So I want to know "Where should I start and stop disassembling?"
I want to show all samples from this "chunk of machine code" in this view. So I want to know "Which code addresses are associated with the same chunk of machine code / symbol address?"

For this purpose, I think having multiple entry_pc values for split functions would actually work best.

Speaking of split functions: Is it guaranteed that the split code is always "higher"/"after" the entry point? I wonder if we can assume that this offset is always positive? Thats a question for both pdb and dwarf though.

I don't know, but I would be surprised if there was such a guarantee. For example I could imagine that the linker might want to reorder your functions sometimes, without much care for which functions belong together.

Swatinem · 2022-07-15T14:42:37Z

For example I could imagine that the linker might want to reorder your functions sometimes, without much care for which functions belong together.

I would assume as much, yes :-(

So for split top-level functions, you would most likely really have the offset relative to that chunk of code. No matter if it is a "real" function with prolog etc that you call, or just some random cold code that was split out and that you jmp to.

So for the use-case of "show me the surrounding asm code", you would really want the complete addr range and not only the entry_pc. Lets CC @jan-auer @loewenheim and see what they think of this?

mstange requested a review from a team July 13, 2022 21:41

mstange mentioned this pull request Jul 13, 2022

Consider using ObjectDebugSession for PDBs as well, not just for mach-O and ELF mozilla/dump_syms#406

Closed

loewenheim approved these changes Jul 14, 2022

View reviewed changes

mstange force-pushed the sepcode branch from ca62b62 to 1ed3e84 Compare July 14, 2022 20:42

Swatinem approved these changes Jul 15, 2022

View reviewed changes

mstange force-pushed the sepcode branch from 1ed3e84 to 48530ce Compare July 15, 2022 13:14

feat: Add support for SeparatedCode symbols

d16c081

mstange force-pushed the sepcode branch from 48530ce to d16c081 Compare July 15, 2022 13:14

Swatinem enabled auto-merge (squash) July 15, 2022 13:21

Swatinem merged commit 91a9ccc into getsentry:master Jul 15, 2022

mstange mentioned this pull request Jul 31, 2022

Remove PDB code and use symbolic_debuginfo for all platforms mozilla/dump_syms#433

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for SeparatedCode symbols. #622

Add support for SeparatedCode symbols. #622

mstange commented Jul 13, 2022

codecov-commenter commented Jul 13, 2022 •

edited

Loading

Swatinem left a comment

mstange commented Jul 15, 2022

Swatinem commented Jul 15, 2022

mstange commented Jul 15, 2022

Swatinem commented Jul 15, 2022

Add support for SeparatedCode symbols. #622

Add support for SeparatedCode symbols. #622

Conversation

mstange commented Jul 13, 2022

codecov-commenter commented Jul 13, 2022 • edited Loading

Codecov Report

Swatinem left a comment

Choose a reason for hiding this comment

mstange commented Jul 15, 2022

Swatinem commented Jul 15, 2022

mstange commented Jul 15, 2022

Swatinem commented Jul 15, 2022

codecov-commenter commented Jul 13, 2022 •

edited

Loading