Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cached random access to CUs and DIEs #264

Merged
merged 4 commits into from Apr 22, 2020
Merged

Conversation

mdmillerii
Copy link
Contributor

Here is a series that refactors the DIE cache to cache entries without walking from the top of a CU,
and also one that adds a similar cache to the CU lookup in the DWARFInfo.

@mdmillerii
Copy link
Contributor Author

I've been working on this for a few weeks now, developing a test to locate static variables. I've been testing using a full Linux kernel build so testing a few is useful.

This part has been refactored enough that I'm comfortable pushing it; the bigger script still needs a few rounds.

This addresses #262 so I've put some efffort into getting this commented and pushed. I just added the LUTEntry object today.

@eliben
Copy link
Owner

eliben commented Feb 4, 2020

Thanks for the contribution. Do you think it would be possible to split this PR into two? One fir DIE cache and one for CU cache?

@sevaa would you like to help review this, since it addresses #262?

@sevaa
Copy link
Contributor

sevaa commented Feb 4, 2020

_get_cached_DIE assumes that the _diemap of a CU already has at least one entry. get_DIE_from_refaddr eventually calls _get_cached_DIE for the target CU.

I'm dealing with cases of both intra-CU and cross-CU references (Xcode loves the latter). If one tries to follow the reference to another CU where not a single DIE has been parsed/cached so far, will that work? Please test the scenario.

If it works as it is, remove the misleading comment please :)

Note that self._diemap cannot be empty because a die, the argument,
is already parsed.

@mdmillerii
Copy link
Contributor Author

@sevaa does a4fb515 address your concern?

I left this as two commits to show the code motion for review but will squash as requested by Eli.

@sevaa
Copy link
Contributor

sevaa commented Feb 5, 2020

It does.

@mdmillerii
Copy link
Contributor Author

mdmillerii commented Feb 6, 2020

Do you think it would be possible to split this PR into two? One fir DIE cache and one for CU cache?

I initially read this as can I factor this into two commits, but then I realized you asked for multiple PRs.
It would take at least 3 to get back to this state:

  1. The first two commits (the get_parent and the DIE by ref within a CU) can be squashed.
  2. CU caching is the third commmit
  3. The DIE API helpers and the testcase can be squashed.

Based on your request I squashed DIE cache cleanup (which was always just showing my work but caused review noise), and consolidated the find a DIE pieces into a helpers commit.

Review note: github seems to show the commits by authored date instead of parent order so the test case is appearing before the API helpers in the GUI.

@eliben
Copy link
Owner

eliben commented Mar 6, 2020

@rhelmot @rupran @woodruffw pinging some folks who've been active recently -- since this is a large change, I'd really appreciate some help with reviewing :-) My review queue for pycparser has grown rather long.

@eliben
Copy link
Owner

eliben commented Mar 6, 2020

@mdmillerii do you have some measurements / benchmark numbers to report? What kind of operations does this make faster, and how much faster?

Note to self: we really need some portfolio of files for performance measurements

@woodruffw
Copy link
Contributor

I can review (and test) here in a bit, thanks for the ping!

@eliben
Copy link
Owner

eliben commented Mar 7, 2020

I can review (and test) here in a bit, thanks for the ping!

Thanks! I'll assign you, just unassign when you're done

@@ -112,6 +117,84 @@ def has_debug_info(self):
"""
return bool(self.debug_info_sec)

def get_DIE_from_lute(self, lutentry):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Maybe rename to either get_DIE_from_lut or get_DIE_from_lut_entry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eliben want me to fix this up and rebase? (I'd choose lut_entry) .. then I can rebase @265 on top

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lut_entry SGTM

Copy link
Contributor

@woodruffw woodruffw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good to me! I left one (style) nit, and am currently testing this against a big suite of internal projects to see if it breaks anything.

@woodruffw
Copy link
Contributor

Confirmed that this works with my internal tooling. I'll do some additional performance testing.

@eliben
Copy link
Owner

eliben commented Mar 9, 2020

@woodruffw out of curiosity, what is this big suite of projects you have? Do you have a setup to analyze many different ELF binaries with pyelftools?

""" Search for our parent by, starting with the CU top die,
setting the _parent link of each child and then iterating on
the child which was closest to object DIE offset without over,
whithan to the object DIE offset, which is an ancestor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence does not make sense to me, maybe reword or split the sentence in two parts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that was trying to stuff everything into a sentence and it got lost.

Copy link
Owner

@eliben eliben left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. I left some minor comments. Feel free to fix these and the comments made by others (thanks for the reviews!), and rebase.

self.cu_die_offset <= refaddr < self.cu_offset + self.size,
'refaddr %s not in DIE range of CU %s' % (refaddr, self.cu_offset))

die = self._get_cached_DIE(refaddr)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just return this directly without the die = intermediate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

search = self.cu.get_top_DIE()

while search.offset < self.offset:

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: remove this empty line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


search = prev

pass
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what's going on here. pass outside the loop?

Should this line be removed?

Also the commented code following it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactoring remenants, fixed.

@@ -112,6 +117,84 @@ def has_debug_info(self):
"""
return bool(self.debug_info_sec)

def get_DIE_from_lute(self, lutentry):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lut_entry SGTM

@@ -253,23 +336,41 @@ def range_lists(self):

#------ PRIVATE ------#

def _parse_CUs_iter(self):
def _parse_CUs_iter(self, offset=0):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document what offset means

See get_CU_at().
"""
# If the insert point is 0 the entry does not match (perhaps the cache
# is empty). Otherwise on a miss parse and insert.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what the second sentence in this comment means. Please rephrase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was saying if the first sentence was false, then do the actions. Hopefully the new is clearer,

for (k, v) in pt.items():
ndie = dwarf.get_DIE_from_lute(v)
self.dprint(ndie)
# if not ndie.tag == 'DW_TAG_variable':
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove?

@eliben eliben assigned mdmillerii and unassigned woodruffw Mar 10, 2020
@woodruffw
Copy link
Contributor

woodruffw commented Mar 10, 2020

out of curiosity, what is this big suite of projects you have? Do you have a setup to analyze many different ELF binaries with pyelftools?

Sorry for the ambiguity there: it's both a big suite of projects (i.e. test binaries) and tools that use pyelftools (as a dependency) to extract information from those binaries.

There's nothing publicly available at the moment, but it's all part of the DARPA CHESS program.

Add a function to set the _parent link of known chldren, iterating
down each parent of a target DIE.  Walk all children of a given
parent and set each child's ._parent to avoid O(n^2) walking.

A future commit will add other methods to instatiate a DIE that will
not set the _parent link as the DIE is instantiated.

This walk uses the knowledge that in a flattened tree a parents offset
will always be less than the childs.

The call to die.set_parent in compile_unit iter_DIE_children could be
removed to make the method private,, but it is free to set starting
from the top DIE.  Alternativly make it an optional argument to
DIE creation.
Accept a resolved reference address for a DIE in a compile unit and
parse the DIE at that location.  Insert into the _diemap / _dielist
cache shared with iter_DIE_children() for fast repeated lookups.

This can be used to follow attribute references to a DIE that be
referenced several times (eg for a DW_AT_type reference) or find
a DIE referenced in a lookup table.
Maintain a cache of compile units parsed and a map of their offsets
similar to the one mainained of DIEs by compile units.

Add the ability to parse a random compile unit when the offset of
the compile unit header is known.

Add the ability to search for a compile unit containing (spanning)
a given refaddr, such as that obtained from a DIE reference class
attribute, starting from the closest previous cached compile unit.
Add APIs to lookup a DIE from: (a) a DIE reference class attribute
taking into account the attribute form, (b) from a lookup table entry
(NameLUTEntry) from a .pub_types or .pub_names section, or (c) directly
from a reference addresss (.debug_info offset) regardless of how it
was obtained.

Add a test that will lookup dies from pubnames and follow die by ref.

	This is a simple test that exercises the new cache lookup
	methods and provides a starting point on how to determine a
	variables type.

For now raise NotImplemented exception for type signature lookup
and supplemental dwarf object files.
mdmillerii added a commit to mdmillerii/pyelftools that referenced this pull request Apr 9, 2020
This commit addresses the review comments in eliben#264 showing the changes
to the prior submission after rebasing before folding them into the prior
commits.
@mdmillerii
Copy link
Contributor Author

Here's my update. I also pushed a branch random-die-review that shows the previous content rebased then 1 patch reverted so reviewers can see what was changed, specifically on a5d42a4

@eliben eliben merged commit 519a234 into eliben:master Apr 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants