[RFC] Optimizing extraction of all unwind rows for an FDE #308
Hi @davea42,
For our use case, we're interested in extracting all unwind rows for a given binary. To do so, we're currently using the `dwarf_get_fde_info_for_all_regs3_b` function by repeatedly calling it in a loop until all rows have been consumed. This looks something like this:

For a binary we're testing against, this takes ~1 second to extract all unwind rows for all FDEs in the binary. The binary has:
We analyzed this, and the root cause is that this is essentially a somewhat hidden quadratic loop:
`dwarf_get_fde_info_for_all_regs3_b` internally uses `_dwarf_exec_frame_instr` (via `_dwarf_get_fde_info_for_a_pc_row`) to execute all instructions for the given FDE until it reaches the `search_pc_val` passed in, but it starts from the first instruction for each call.

This means that if you're iterating through the rows for an FDE like this, the loop essentially looks like this in pseudo code:
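As a rough, self-contained illustration of that cost pattern, here is a toy model in C. It does not call libdwarf: `ToyFde`, `toy_count_rows`, `toy_exec_until_row`, and `toy_total_when_restarting` are hypothetical names, and the model assumes each instruction either advances the location (closing the current row) or does not.

```c
#include <stddef.h>

/* Toy stand-in for an FDE's instruction stream (hypothetical, not libdwarf's
   representation): advances[i] != 0 means instruction i advances the
   location and therefore closes the current unwind row. */
typedef struct {
    const unsigned char *advances;
    size_t count;
} ToyFde;

/* Number of rows: one per advancing instruction, plus the final row that is
   closed by the end of the instruction stream. */
size_t toy_count_rows(const ToyFde *fde)
{
    size_t rows = 1;
    for (size_t i = 0; i < fde->count; i++) {
        if (fde->advances[i]) rows++;
    }
    return rows;
}

/* Mimics one lookup call: always restart at instruction 0 and execute until
   row `wanted_row` has been closed or the stream ends.
   Returns how many instructions this single call executed. */
size_t toy_exec_until_row(const ToyFde *fde, size_t wanted_row)
{
    size_t executed = 0;
    size_t rows_closed = 0;
    for (size_t i = 0; i < fde->count; i++) {
        executed++;
        if (fde->advances[i]) {
            rows_closed++;
            if (rows_closed > wanted_row) break;
        }
    }
    return executed;
}

/* Mimics the caller's loop: one restarting lookup per row.
   Returns the total number of instructions executed across all lookups. */
size_t toy_total_when_restarting(const ToyFde *fde)
{
    size_t total = 0;
    size_t rows = toy_count_rows(fde);
    for (size_t row = 0; row < rows; row++) {
        total += toy_exec_until_row(fde, row);
    }
    return total;
}
```

In this model, a stream of n advancing instructions costs n(n+1)/2 + n executed instructions when every lookup restarts from the beginning, versus n for a single pass.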
This PR implements a new function `dwarf_iterate_fde_info_for_all_regs3` that fixes this. The new function uses a new (internal) helper function `_dwarf_iterate_frame_instr` that executes all instructions for the FDE, invoking a callback for each row. This turns the quadratic loop into a linear loop over the instructions, which results in a significant speedup: iteration for this binary goes from 1007ms to 83ms, which is ~12x faster.

Open questions
I'm sending this PR as a RFC/draft, because there are some questions around the change as implemented that I'm not sure about:
First of all, the new `_dwarf_iterate_frame_instr` helper is a copy/paste of the existing `_dwarf_exec_frame_instr` with some minor modifications to invoke the callback instead. This results in significant code duplication that is probably not desired. It is technically possible to implement `_dwarf_exec_frame_instr` in terms of the new `_dwarf_iterate_frame_instr` to prevent this code duplication.

I haven't included that as part of the PR, but that could look something like this:
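The following is only a rough sketch of the shape such a refactor could take. It is not the code from this PR and not libdwarf's actual internals. All `toy_*` functions and the `ToyRow`, `ToyInstr`, and `ToySearch` types are hypothetical, and the convention that a nonzero callback return stops iteration is an assumption made for the sketch.

```c
#include <stddef.h>

/* Hypothetical toy types; not libdwarf's Dwarf_Frame or instruction stream. */
typedef struct {
    unsigned long loc;   /* first pc covered by this row */
    int cfa_offset;      /* stand-in for the row's register rules */
} ToyRow;

typedef struct {
    unsigned long advance; /* nonzero: close the current row, move loc forward */
    int cfa_delta;         /* otherwise: adjust the current row's rule */
} ToyInstr;

/* Assumed convention: return nonzero from the callback to stop early. */
typedef int (*toy_row_callback)(const ToyRow *row, void *user_data);

/* Single linear pass: run each instruction once, reporting every closed row. */
void toy_iterate_frame_instr(const ToyInstr *instr, size_t count,
    unsigned long initial_loc, toy_row_callback cb, void *user_data)
{
    ToyRow row = { initial_loc, 0 };
    for (size_t i = 0; i < count; i++) {
        if (instr[i].advance) {
            if (cb(&row, user_data)) return;
            row.loc += instr[i].advance;
        } else {
            row.cfa_offset += instr[i].cfa_delta;
        }
    }
    cb(&row, user_data); /* final row, closed by the end of the stream */
}

/* State for the "find the row covering search_pc" wrapper. */
typedef struct {
    unsigned long search_pc;
    ToyRow result;
    int found;
} ToySearch;

/* Remember the last row starting at or before search_pc; stop once a row
   starts beyond it. */
static int toy_search_cb(const ToyRow *row, void *user_data)
{
    ToySearch *s = (ToySearch *)user_data;
    if (row->loc > s->search_pc) return 1;
    s->result = *row;
    s->found = 1;
    return 0;
}

/* The single-pc lookup expressed in terms of the iterator.
   Returns 1 and fills *out if a row was found, 0 otherwise. */
int toy_exec_frame_instr(const ToyInstr *instr, size_t count,
    unsigned long initial_loc, unsigned long search_pc, ToyRow *out)
{
    ToySearch s = { search_pc, { 0, 0 }, 0 };
    toy_iterate_frame_instr(instr, count, initial_loc, toy_search_cb, &s);
    if (s.found) *out = s.result;
    return s.found;
}
```

Note that in this sketch the wrapper only knows a row is the right one after seeing the next row start, so it computes one row more than a hand-written lookup would. The toy model also ignores the end of the FDE's address range.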
But this is not a risk-free change.
Secondly, the new `dwarf_iterate_fde_info_for_all_regs3` has to make use of an internal callback function `_dwarf_iterate_fde_info_for_all_regs3_callback` that's passed to the new `_dwarf_iterate_frame_instr` helper. The reason for this is that `_dwarf_iterate_frame_instr` (and `_dwarf_exec_frame_instr`) work with an internal struct `Dwarf_Frame` that's not currently exposed in `libdwarf.h`. This means we need to copy from `Dwarf_Frame` to `Dwarf_Regtable3`, which is a bit wasteful.

It would be nice if `Dwarf_Frame` (and related structs) could be exposed in the API to avoid this copy step. As I understand it, the difference between `Dwarf_Frame` and `Dwarf_Regtable3` was introduced to avoid breaking the API for existing functions like `dwarf_get_fde_info_for_all_regs3_b`, but since there is a new function being introduced here, there is no risk of that.

I realize this is quite a lot taken together, but it would be great to get your thoughts/feedback on all of this to see if we can get this PR in a shape where it could be upstreamed (or if you have any ideas around how a similar optimization could be implemented in a way that fits a bit better in libdwarf's current architecture).