[RFC] Optimizing extraction of all unwind rows for an FDE #308

rovarma · 2025-11-23T20:48:44Z

For our use case, we're interested in extracting all unwind rows for every FDE in a given binary. To do so, we currently call dwarf_get_fde_info_for_all_regs3_b in a loop for each FDE until all of its rows have been consumed. This looks something like this:

Dwarf_Addr currentRowRVA = fdeStartAddress;
Dwarf_Bool hasMoreRows = false;

// Iterate through all rows for this FDE while there are more rows available
do
{
	Dwarf_Addr rowRVA;

	// Get row for the current address, which also outputs the address of the next
	// available row
	Dwarf_Addr nextRowRVA = 0;
	int res = dwarf_get_fde_info_for_all_regs3_b(fde, currentRowRVA, &registerTable, &rowRVA, &hasMoreRows, &nextRowRVA, &dwarfError);
	if (res != DW_DLV_OK)
		break; // on failure the outputs may not be set, so stop iterating

	// do something with register table

	// Move to the next available row
	currentRowRVA = nextRowRVA;
} while (hasMoreRows);

For a binary we're testing against, this takes ~1 second to extract all unwind rows for all FDEs in the binary. The binary has:

154,361 FDEs
1,548,145 rows

We analyzed this, and the root cause is that this is essentially a somewhat hidden quadratic loop: dwarf_get_fde_info_for_all_regs3_b internally uses _dwarf_exec_frame_instr (via _dwarf_get_fde_info_for_a_pc_row) to execute all instructions for the given FDE until it reaches the search_pc_val passed in, but it starts from the first instruction for each call.

This means that if you're iterating through the rows for an FDE like this, the loop essentially looks like this in pseudo code:

for each row in function:
	for each instruction in function:
		execute instruction
		if current location > search_pc_val:
			break

This PR implements a new function dwarf_iterate_fde_info_for_all_regs3 that fixes this. The new function uses a new (internal) helper function _dwarf_iterate_frame_instr that executes all instructions for the FDE once, invoking a callback for each row. This turns the quadratic loop into a linear pass over the instructions, which results in a significant speedup: iteration for this binary goes from 1007ms to 83ms, which is ~12x faster.
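To make the cost difference concrete, here is a minimal toy model in C. It does not use libdwarf: toy_lookup_row, toy_extract_by_lookup, toy_iterate_rows and toy_count_rows are hypothetical names, and the model assumes one "advance" instruction per row with a positive delta. It only counts how many instructions each strategy executes.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, NOT libdwarf: an "FDE" is an array of positive advance
   deltas, and row i covers [pc_i, pc_i + deltas[i]). */

/* Callback type for the single-pass strategy; nonzero return stops. */
typedef int (*toy_row_callback)(unsigned long row_pc, void *user_data);

/* Restart-per-lookup: replay instructions from the first one until the
   row containing search_pc is found. Returns instructions executed. */
size_t toy_lookup_row(const unsigned *deltas, size_t count,
    unsigned long start_pc, unsigned long search_pc,
    unsigned long *row_pc, unsigned long *next_pc, int *has_more)
{
    unsigned long pc = start_pc;
    size_t executed = 0;
    size_t i = 0;

    assert(deltas != NULL || count == 0);
    for (i = 0; i < count; i++) {
        executed++;
        if (pc + deltas[i] > search_pc) {
            *row_pc = pc;
            *next_pc = pc + deltas[i];
            *has_more = (i + 1 < count);
            return executed;
        }
        pc += deltas[i];
    }
    /* search_pc is past the last row (or there are no rows). */
    *row_pc = pc;
    *next_pc = pc;
    *has_more = 0;
    return executed;
}

/* Extract every row via repeated lookups, as in the loop shown earlier.
   Returns the total number of instructions executed. */
size_t toy_extract_by_lookup(const unsigned *deltas, size_t count,
    unsigned long start_pc, size_t *rows_out)
{
    unsigned long cur = start_pc;
    size_t total = 0;
    size_t rows = 0;
    int has_more = 0;

    if (count > 0) {
        do {
            unsigned long row_pc = 0;
            unsigned long next_pc = 0;
            total += toy_lookup_row(deltas, count, start_pc, cur,
                &row_pc, &next_pc, &has_more);
            rows++;
            cur = next_pc;
        } while (has_more);
    }
    *rows_out = rows;
    return total;
}

/* Single pass: execute each instruction once and report every row
   through the callback. Returns the number of instructions executed. */
size_t toy_iterate_rows(const unsigned *deltas, size_t count,
    unsigned long start_pc, toy_row_callback cb, void *user_data)
{
    unsigned long pc = start_pc;
    size_t executed = 0;
    size_t i = 0;

    for (i = 0; i < count; i++) {
        executed++;
        if (cb(pc, user_data)) {
            break;
        }
        pc += deltas[i];
    }
    return executed;
}

/* Example callback: counts rows in a size_t passed as user_data. */
int toy_count_rows(unsigned long row_pc, void *user_data)
{
    (void)row_pc;
    (*(size_t *)user_data)++;
    return 0;
}
```

In this model an FDE with n rows costs n(n+1)/2 instruction executions with repeated lookups and n with the single pass, which matches the quadratic versus linear behavior described above.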

Open questions

I'm sending this PR as an RFC/draft, because there are some questions around the change as implemented that I'm not sure about:

First of all, the new _dwarf_iterate_frame_instr helper is a copy/paste of the existing _dwarf_exec_frame_instr with some minor modifications to invoke the callback instead. This results in significant code duplication that is probably not desired. It is technically possible to implement _dwarf_exec_frame_instr in terms of the new _dwarf_iterate_frame_instr to prevent this code duplication.

I haven't included that as part of the PR, but that could look something like this:

struct Dwarf_Exec_Frame_Callback_Info {
    Dwarf_Bool  search_pc;
    Dwarf_Addr  search_pc_val;
    Dwarf_Frame output_table;
};

Dwarf_Bool _dwarf_exec_frame_instr_callback(Dwarf_Frame table,
    Dwarf_Addr subsequent_pc, Dwarf_Bool is_last_row, void* user_data)
{
    struct Dwarf_Exec_Frame_Callback_Info* exec_data =
        (struct Dwarf_Exec_Frame_Callback_Info*)user_data;

    Dwarf_Bool done = exec_data->search_pc &&
        (subsequent_pc > exec_data->search_pc_val);

    if ((done || is_last_row) && exec_data->output_table) {

        struct Dwarf_Reg_Rule_s* t2reg = exec_data->output_table->fr_reg;
        struct Dwarf_Reg_Rule_s* t3reg = table->fr_reg;
        unsigned minregcount = (unsigned)MIN(exec_data->output_table->fr_reg_count,
            table->fr_reg_count);
        unsigned curreg = 0;

        exec_data->output_table->fr_loc = table->fr_loc;
        for (; curreg < minregcount; curreg++, t3reg++, t2reg++) {
            *t2reg = *t3reg;
        }

        exec_data->output_table->fr_cfa_rule = table->fr_cfa_rule;
    }

    return done;
}

int
_dwarf_exec_frame_instr(Dwarf_Bool make_instr,
    Dwarf_Bool search_pc,
    Dwarf_Addr search_pc_val,
    Dwarf_Addr initial_loc,
    Dwarf_Small* start_instr_ptr,
    Dwarf_Small* final_instr_ptr,
    Dwarf_Frame table,
    Dwarf_Cie cie,
    Dwarf_Debug dbg,
    Dwarf_Unsigned reg_num_of_cfa,
    Dwarf_Bool* has_more_rows,
    Dwarf_Addr* subsequent_pc,
    Dwarf_Frame_Instr_Head* ret_frame_instr_head,
    Dwarf_Unsigned* returned_frame_instr_count,
    Dwarf_Error* error)
{
    struct Dwarf_Exec_Frame_Callback_Info user_data;
    user_data.search_pc = search_pc;
    user_data.search_pc_val = search_pc_val;
    user_data.output_table = table;

    return _dwarf_iterate_frame_instr(&_dwarf_exec_frame_instr_callback,
        &user_data, make_instr, initial_loc, start_instr_ptr,
        final_instr_ptr, cie, dbg, reg_num_of_cfa, has_more_rows,
        subsequent_pc, ret_frame_instr_head, returned_frame_instr_count, error);
}

But this is not a risk-free change.

Secondly, the new dwarf_iterate_fde_info_for_all_regs3 has to make use of an internal callback function _dwarf_iterate_fde_info_for_all_regs3_callback that's passed to the new _dwarf_iterate_frame_instr helper. The reason for this is that _dwarf_iterate_frame_instr (and _dwarf_exec_frame_instr) work with an internal struct Dwarf_Frame that's not currently exposed in libdwarf.h. This means we need to copy from Dwarf_Frame to Dwarf_Regtable3, which is a bit wasteful.

It would be nice if Dwarf_Frame (and related structs) could be exposed in the API to avoid this copy step. As I understand it, the difference between Dwarf_Frame and Dwarf_Regtable3 was introduced to avoid breaking the API for existing functions like dwarf_get_fde_info_for_all_regs3_b, but since there is a new function being introduced here, there is no risk of that.

I realize this is quite a lot taken together, but it would be great to get your thoughts/feedback on all of this to see if we can get this PR in a shape where it could be upstreamed (or if you have any ideas around how a similar optimization could be implemented in a way that fits a bit better in libdwarf's current architecture).

…e_info_for_all_regs3 API function This greatly optimizes iteration through all FDE unwind rows. The new helper function invokes a callback for each row in the FDE, instead of having to repeatedly call _dwarf_exec_frame_instr, which performs a lot of redundant work per row and has quadratic behavior. The new helper is mostly a copy of _dwarf_exec_frame_instr with some minor modifications. The new dwarf_iterate_fde_info_for_all_regs3() function uses the new helper function internally

davea42 · 2025-11-25T21:20:34Z

I have not looked at this yet, still working on harmless errors/dwarfdump.

The problem with public structs in the API is that when DWARF changes something
affecting those structs it is sort of impossible to preserve source compatibility
for library users. It also causes code bloat.

A function interface, on the other hand, even with N return values via pointers,
can easily deal with new versions by adding
new functions, and there are several examples in the library.
Basically painless.
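As an illustration of that point (a hypothetical sketch, not libdwarf code; toy_row, toy_row_info and toy_row_info_b are made-up names), an opaque handle with accessor functions can gain a new output by adding a second function, while existing callers keep compiling unchanged:

```c
#include <stddef.h>

/* Hypothetical example, NOT libdwarf code. */
#define TOY_OK    0
#define TOY_ERROR 1

/* Callers only ever hold this opaque handle. */
typedef struct toy_row_s *toy_row;

/* In a real library this definition would live in a private header,
   so its layout can change without breaking callers. */
struct toy_row_s {
    unsigned long pc;
    long          cfa_offset;
    unsigned      cfa_register; /* field added in a later version */
};

/* Original accessor: values are returned through pointers. */
int toy_row_info(toy_row row, unsigned long *pc, long *cfa_offset)
{
    if (!row || !pc || !cfa_offset) {
        return TOY_ERROR;
    }
    *pc = row->pc;
    *cfa_offset = row->cfa_offset;
    return TOY_OK;
}

/* Later version: a new function exposes the extra value, and the
   original function stays exactly as it was. */
int toy_row_info_b(toy_row row, unsigned long *pc, long *cfa_offset,
    unsigned *cfa_register)
{
    int res = 0;

    if (!cfa_register) {
        return TOY_ERROR;
    }
    res = toy_row_info(row, pc, cfa_offset);
    if (res != TOY_OK) {
        return res;
    }
    *cfa_register = row->cfa_register;
    return TOY_OK;
}
```

Had struct toy_row_s been public, adding cfa_register would have changed a layout that callers compile against.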

Your code in this pull request looks pretty sensible. I need to look again, but it looks promising.

rovarma · 2025-11-25T21:23:56Z

Thanks! No rush -- happy to discuss when you have the room for it.

davea42 · 2025-11-26T21:41:47Z

> It would be nice if Dwarf_Frame (and related structs) could be exposed in the API to avoid this copy step. As I understand it, the difference between Dwarf_Frame and Dwarf_Regtable3 was introduced to avoid breaking the API for existing functions like dwarf_get_fde_info_for_all_regs3_b, but since there is a new function being introduced here, there is no risk of that.

Well, that assumes yet more change won't be required in the future.
But I assume change will be required. So not introducing more public structs.
The transformation is not expensive for anyone, but breaking the API
is inevitably expensive for some people (including me).

Anyway, the overall idea looks good.

I'm about to write my own dwarf_iterate_fde_info_for_all_regs3() to verify various details.
No code duplication will be involved. Existing thorough tests will ensure changes
break nothing that already works. I don't do new code in my head (wish I could though).

I will also write a new dwarfexample/frame2.c using the new approach. Maybe with
an automated comparison of the important output of frame1.c with frame2.c

You are not the first to request data on all the fde pc value rows, so this will help
others.

There are only three calls to _dwarf_exec_frame_instr (before changes)
so it's not difficult to get the entire picture. Thanks for pushing me
to look at this!

It would be nice to see your iterate versions as a cross-check on my efforts.

davea42 · 2025-11-26T21:43:56Z

Changing draft to pull request was a mistake on my part. So back to draft now.
DavidA

davea42 marked this pull request as ready for review November 25, 2025 21:22

davea42 marked this pull request as draft November 26, 2025 21:42
