Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase size of RegisterRuleMap to avoid TooManyRegisterRules #487

Merged
merged 1 commit into from
Apr 21, 2020

Conversation

jan-auer
Copy link
Contributor

@jan-auer jan-auer commented Apr 16, 2020

Increases the size of RegisterRuleMap to fit more rules in rare cases.

@coveralls
Copy link

coveralls commented Apr 16, 2020

Coverage Status

Coverage increased (+0.02%) to 85.391% when pulling e471a94 on jan-auer:fix/too-many-rules into 53c802a on gimli-rs:master.

@philipc
Copy link
Collaborator

philipc commented Apr 16, 2020

I think ArrayVec was intentional because this code is meant to be async signal safe and not allocate, as mentioned in the comment.

I wonder how other unwinders handle this.

@jan-auer
Copy link
Contributor Author

Completely missed that comment.

What if we keep the error and swap out ArrayVec vs SmallVec based on a feature flag? There may be contexts where signal safety is not required in which case it would be desirable to collect all rules.

@philipc
Copy link
Collaborator

philipc commented Apr 17, 2020

I'm not keen on using a feature flag to switch between ArrayVec and SmallVec. We need to be able to handle all rules, even when signal safety is required. libunwind uses an architecture specific value for its array (33 for x86-64, 97 for aarch64). How would you feel about increasing the fixed size for now, and later making it configurable with const generics?

@jan-auer jan-auer changed the title Use smallvec to avoid TooManyRegisterRules in CFI Increase size of RegisterRuleMap to avoid TooManyRegisterRules Apr 17, 2020
@jan-auer
Copy link
Contributor Author

Sounds good to me. I should have done the reading in libunwind right away. Would you mind to link me to it, for some reason I'm not able to find it.

Increased the limit to 100 now (since arrayvec::Array is not implemented for size 97). The first part of this comment is now a bit redundant, since we are actually allocating that many rules:

gimli/src/read/cfi.rs

Lines 2297 to 2299 in 53c802a

// We tend to have very few register rules: usually only a couple. Even if we
// have a rule for every register, on x86-64 with SSE and everything we're
// talking about ~100 rules. So rather than keeping the rules in a hash map, or

Would you like me to change that?

@philipc
Copy link
Collaborator

philipc commented Apr 17, 2020

Would you mind to link me to it, for some reason I'm not able to find it.

Sure: the register rule table, and the x86-64 and aarch64 numbers. ARM (128) and MIPS (188) are actually even larger.

Would you like me to change that?

Yeah, can you change the comment to mention that it is based on the numbers from libunwind?

@jan-auer jan-auer force-pushed the fix/too-many-rules branch 2 times, most recently from 5a2c6a7 to aa6c516 Compare April 20, 2020 15:53
@jan-auer
Copy link
Contributor Author

Thanks for the links, @philipc! Was going back and forth with using a smaller array, but ended up increasing size to 192. This bumps sizes of the register map to 6kB and the unwind context to a whopping 7+ pages (30kB).

@philipc
Copy link
Collaborator

philipc commented Apr 21, 2020

Just out of interest, the benchmark difference on my machine is:

 name                                                       before ns/iter  after ns/iter  diff ns/iter   diff %  speedup 
 cfi::eval_longest_fde_instructions_new_ctx_everytime       498             2,893                 2,395  480.92%   x 0.17 
 cfi::eval_longest_fde_instructions_same_ctx                314             651                     337  107.32%   x 0.48 
 cfi::iterate_entries_and_do_not_parse_any_fde              50,855          49,392               -1,463   -2.88%   x 1.03 
 cfi::iterate_entries_and_parse_every_fde                   387,405         391,387               3,982    1.03%   x 0.99 
 cfi::iterate_entries_and_parse_every_fde_and_instructions  889,495         896,817               7,322    0.82%   x 0.99 
 cfi::iterate_entries_evaluate_every_fde                    1,204,579       2,542,256         1,337,677  111.05%   x 0.47 
 cfi::parse_longest_fde_instructions                        203             209                       6    2.96%   x 0.97 

I don't have any ideas for what we can do better, but maybe @fitzgen does? I don't think anyone has spent any serious effort on optimising this yet though, other than #148 (which may no longer be an optimisation).

@fitzgen
Copy link
Member

fitzgen commented Apr 21, 2020

@philipc

I wonder how other unwinders handle this.

Some versions of libunwind (there are so many forks floating around) have a specialized mmap-only, lock-free allocator for use inside signal handlers. Given the snippets you found, it looks like that isn't used for register rules, but it is something we could look into.

Other unwinders, that are focused just on fast, common case unwinding for profiling, only ever recover rsp and rbp.

Exposing this limit through generics somehow seems promising and not overly ocean boil-y.

@fitzgen
Copy link
Member

fitzgen commented Apr 21, 2020

I don't have any ideas for what we can do better, but maybe @fitzgen does? I don't think anyone has spent any serious effort on optimising this yet though, other than #148 (which may no longer be an optimisation).

The fastest implementations I've seen either compile CFI to machine code ahead of time and replace the CFI, or do it at runtime with a super limited JIT compiler and an associative cache mapping PCs to JIT stubs.

That is, they largely avoid the cost of repeatedly parsing and interpreting CFI rules.

@philipc
Copy link
Collaborator

philipc commented Apr 21, 2020

Some versions of libunwind (there are so many forks floating around) have a specialized mmap-only, lock-free allocator for use inside signal handlers. Given the snippets you found, it looks like that isn't used for register rules, but it is something we could look into.

Yeah libunwind (at least the primary version of it) uses mmap for DW_CFA_remember_state. The libgcc unwinder does the same thing as libunwind for the register array, and uses alloca for DW_CFA_remember_state.

The fastest implementations I've seen either compile CFI to machine code ahead of time and replace the CFI, or do it at runtime with a super limited JIT compiler and an associative cache mapping PCs to JIT stubs.

I'm aware of https://fzn.fr/projects/frdwarf/, but not sure if anything in production uses that technique.

Anyway, it sounds like this PR is the way to go for now at least.

@philipc philipc merged commit 1e0d07b into gimli-rs:master Apr 21, 2020
@fitzgen
Copy link
Member

fitzgen commented Apr 22, 2020

but not sure if anything in production uses that technique.

Firefox's profiler's unwinder uses the JIT approach.

@jan-auer jan-auer deleted the fix/too-many-rules branch April 22, 2020 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants