See how much bloat is generated by template class #17

Open
stgatilov opened this issue May 11, 2018 · 5 comments

Comments

@stgatilov
Contributor

A lot of code bloat comes from generic template classes. Take, for instance, a class MyVector defined in MyVector.h. It would be great if SymbolSort could show how much code was generated by such a class.

Right now it is possible to analyze object files (COMDAT), but in that case there is no way to group symbols by class or by header file. It is also possible to analyze a PDB, but then the duplication of symbols across object files is not taken into account (and that duplication is important for analyzing build times).
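To make the point about duplication concrete, here is a minimal sketch of tallying symbols across object files. The record layout `(object_file, symbol_name, size)` and the sample values are assumptions made for illustration; they are not SymbolSort's actual data model.

```python
from collections import defaultdict

def tally_duplicates(records):
    """Sum up how often each symbol appears across object files.

    records: iterable of (object_file, symbol_name, size_in_bytes).
    This layout is hypothetical, chosen only for the example.
    """
    stats = defaultdict(lambda: {"count": 0, "total": 0})
    for _obj, name, size in records:
        stats[name]["count"] += 1      # number of object files containing the symbol
        stats[name]["total"] += size   # bytes generated before the linker folds COMDATs
    return dict(stats)

# Hypothetical input: the same template instantiation compiled into two object files.
records = [
    ("a.obj", "MyVector<int>::push_back", 120),
    ("b.obj", "MyVector<int>::push_back", 120),
    ("b.obj", "main", 40),
]
stats = tally_duplicates(records)
```

A PDB-only view would show one copy of `MyVector<int>::push_back`, while the per-object tally shows that the compiler generated it twice.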

I see two approaches to implement this feature:

  1. Extract classes from symbol names. Ideally, they can be extracted together with namespaces, e.g. std::_XTree, and then grouped the way SymbolSort groups paths. This is perhaps the best approach, but given how many special kinds of symbols exist, it is very hard to get right. In fact, doing it right requires a full-fledged parser of symbol names (and decorated symbols are perhaps even easier to parse than undecorated ones).

  2. Attribute each symbol to the source file where its code is located. This information is absent from object files, but it is present in PDB files. So it is possible to read object file dumps for the main data, then read PDB files solely to assign a code location to each symbol. This approach has some disadvantages: mainly, not all symbols are present in the PDB, and not all symbols have a location in source code.
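The second approach can be sketched roughly as follows. This is only an illustration under stated assumptions: `obj_symbols` and `pdb_locations` are hypothetical inputs standing in for parsed object file dumps and PDB lookups, and the `[unknown]` bucket is an invented name for symbols without a source location.

```python
from collections import defaultdict
from pathlib import PureWindowsPath

def bloat_by_path(obj_symbols, pdb_locations):
    """Charge each symbol's size to its source file and all parent directories.

    obj_symbols:   iterable of (symbol_name, size), one entry per object file copy.
    pdb_locations: dict mapping symbol_name -> source file path (hypothetical input).
    """
    totals = defaultdict(int)
    for name, size in obj_symbols:
        src = pdb_locations.get(name)
        if src is None:
            # Not every symbol has a location in the PDB.
            totals["[unknown]"] += size
            continue
        path = PureWindowsPath(src)
        totals[str(path)] += size
        for parent in path.parents:
            totals[str(parent)] += size
    return dict(totals)

# Hypothetical data: push_back was emitted into two object files.
obj_symbols = [("push_back", 120), ("push_back", 120), ("main", 40), ("vftable", 8)]
pdb_locations = {
    "push_back": r"C:\proj\include\MyVector.h",
    "main": r"C:\proj\src\main.cpp",
}
totals = bloat_by_path(obj_symbols, pdb_locations)
```

Because sizes are summed over every object file copy, the header entry reflects the duplicated code, and the directory entries give per-directory totals.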

@stgatilov
Contributor Author

stgatilov commented May 11, 2018

I have implemented the second approach in my fork. You can see the full set of changes here.

Please let me know if a pull request is welcome.

P.S. Approach 2 has some additional advantages. For instance, in theory it is possible to produce an annotated version of the source files, where count/total stats are added as a comment before each function.

@stgatilov changed the title from "See how much bloat is generated by template" to "See how much bloat is generated by template class" on May 11, 2018
@stgatilov
Contributor Author

I have also implemented the first approach, i.e. extracting the classpath from the symbol name. It works like this:

  1. Take the raw symbol name (i.e. the mangled/decorated one).
  2. Undecorate it partially, omitting the return value and function parameters (and probably something else).
  3. Parse undecorated name using several templates, regexes, and other dirty stuff like that.
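As a rough illustration of step 3, the sketch below splits a partially undecorated name on top-level `::` and drops template arguments. It is not the code from the fork: the function name `classpath` and the sample names are assumptions, and the sketch deliberately ignores hard cases such as `operator<` or `operator->`, which is exactly where a real implementation needs extra special-casing.

```python
def classpath(name):
    """Split a name on '::' outside template brackets, dropping template arguments."""
    parts, current, depth = [], [], 0
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == "<":
            depth += 1                      # entering template arguments
        elif ch == ">":
            depth -= 1                      # leaving template arguments
        elif depth == 0 and name.startswith("::", i):
            parts.append("".join(current))  # finish the current path component
            current = []
            i += 2
            continue
        elif depth == 0:
            current.append(ch)              # keep only text outside the brackets
        i += 1
    parts.append("".join(current))
    return parts

# Hypothetical inputs, already stripped of return type and parameters.
path_a = classpath("std::vector<std::pair<int,int> >::push_back")
path_b = classpath("main")
```

With these inputs, `path_a` is `["std", "vector", "push_back"]` and `path_b` is `["main"]`, so symbols can then be grouped by path prefix in the same way file paths are grouped by directory.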

First I tried to use UnDecorateSymbolName for step 2, but it is located in dbghelp.dll, which has not been updated for quite a long time. It cannot handle C++11 features like rvalue references. This implementation is currently in the classpath branch.

Then I switched to calling the undname.exe utility from the MSVC distribution. It works perfectly (it is perhaps the only official way to demangle MSVC symbols today). The code is in the classpath2 branch. All the differences can be seen here.

@adrianstone55
Owner

Hi, sorry for the slow response, but I've been away on vacation. I think your analysis of the problem is spot on. PDBs are interesting, but to analyze code bloat from weak instantiations you need to look at the OBJ files. I would probably lean towards the second approach, because trying to correlate input from two different sources could get messy, but there are advantages and disadvantages both ways.

If you want to put together a pull request, I'll happily consider it, but I might be a bit slow because I'm not actively maintaining the code anymore and I haven't used it more than a couple of times in the past five years.

@stgatilov
Contributor Author

stgatilov commented May 19, 2018

Both approaches already work for me. Of course, both have pluses and minuses.

In the classpath approach, the analysis relies on hacky regexes for parsing symbol names. Despite that, almost all symbols are taken into account.
In the PDB filepath approach, not all symbols actually have a location in the PDB. About 20-30% of symbols are usually implicitly generated stuff or some data. On the plus side, it gives per-directory stats, so it is very simple to see the code bloat from the whole STL.

My plan is to write a blog article about these two options. Then it will be easier to make a decision.
P.S. For now, I will continue to post small pull requests...

@stgatilov
Contributor Author

OK, I have finished the article.

Here is the full article.
To save time, I suggest you start reading from the Improvements section.

Now I'll prepare pull requests for both features.
