Preserve Analysis through reboot #478

Open
AaronOpfer opened this issue Nov 24, 2016 · 10 comments

AaronOpfer (Contributor) commented Nov 24, 2016

Analysis can be slow on some very large binaries. It would be nice if we could save the results of the analysis to disk.

To prevent loading stale analysis when a binary changes, we can compare the analysis timestamp against the binary's modification timestamp at load time; if the binary is newer, we discard the analysis.
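The timestamp check described above could look roughly like this. This is only a minimal sketch: the function name and the idea of a separate analysis file on disk are assumptions, not edb's actual layout.

```cpp
#include <cassert>
#include <chrono>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Hypothetical helper: reuse the saved analysis only if it is at least as
// new as the binary it was computed from.
bool analysis_is_fresh(const fs::path &binary, const fs::path &analysis) {
    if (!fs::exists(analysis)) {
        return false; // no saved analysis at all
    }
    // Discard the cache if the binary was modified after the analysis was saved.
    return fs::last_write_time(analysis) >= fs::last_write_time(binary);
}
```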

10110111 (Contributor) commented Nov 24, 2016

A hash sum seems more robust than a timestamp.
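A content-hash check along those lines could be sketched as follows. `std::hash` is only a stand-in for a real digest such as MD5, and the `AnalysisCache` record is hypothetical.

```cpp
#include <functional>
#include <string>

// Hypothetical cache record: the hash of the bytes that were analyzed,
// followed (in a real implementation) by the serialized results.
struct AnalysisCache {
    std::size_t content_hash = 0;
};

// Reuse the cache only if the region's bytes still hash to the same value,
// regardless of file timestamps.
bool cache_matches(const AnalysisCache &cache, const std::string &region_bytes) {
    return cache.content_hash == std::hash<std::string>{}(region_bytes);
}
```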

AaronOpfer (Contributor) commented Nov 29, 2016

You're right, but it could be expensive on binaries with lots of embedded resources. Maybe it should be an opt-in.

eteran (Owner) commented Nov 30, 2016

The analyzer already does an MD5 of every region it analyzes (in particular to detect changes), so that could be used directly. Fortunately, it hasn't proven to be particularly time consuming yet.

eteran (Owner) commented Dec 1, 2016

I think the first step would be to make the analysis data store addresses relative to the module/region base instead of absolute, as it is currently. That would make saving/restoring much simpler when ASLR is involved.

AaronOpfer (Contributor) commented Dec 6, 2016

We could do that, but it would probably be better to just save the base address alongside the absolute addresses so that we can do corrections at load-time.
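The load-time correction being suggested is just a base-delta shift. A sketch, with invented names:

```cpp
#include <cstdint>

// Store the base address that was current when the analysis was saved,
// then shift every saved absolute address by the delta between that base
// and the base observed at load time (ASLR correction).
std::uint64_t rebase(std::uint64_t saved_addr,
                     std::uint64_t saved_base,
                     std::uint64_t current_base) {
    return saved_addr - saved_base + current_base;
}
```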

eteran (Owner) commented Dec 6, 2016

Sure, that would work equally well.

AaronOpfer (Contributor) commented Dec 21, 2016

Using the MD5 sum to determine whether analysis is still relevant might be a problem for binaries that use relocation tables. Relocation will cause a relocated binary to have a different hash each time it is loaded.
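One common mitigation, which the thread does not settle on and is offered here purely as an illustration, is to blank out the bytes patched by relocations before hashing, so the digest no longer depends on the load address. Again, `std::hash` stands in for MD5:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical relocation-invariant hash: zero each relocated field
// (reloc_width bytes at each offset) before computing the digest.
std::size_t relocation_invariant_hash(std::vector<std::uint8_t> bytes,
                                      const std::vector<std::size_t> &reloc_offsets,
                                      std::size_t reloc_width = 4) {
    for (std::size_t off : reloc_offsets) {
        for (std::size_t i = 0; i < reloc_width && off + i < bytes.size(); ++i) {
            bytes[off + i] = 0; // neutralize the relocated field
        }
    }
    return std::hash<std::string>{}(std::string(bytes.begin(), bytes.end()));
}
```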

eteran (Owner) commented Dec 21, 2016

Well, I think that will generally be a problem for any solution that is based on "did the data in this region change". I am of course open to alternatives.

BTW, do you know if my push fixed #528?

AaronOpfer (Contributor) commented Dec 21, 2016

Any suggestions on what we should serialize? Serializing everything looks impractical, since the function objects hold vectors of basic blocks, which in turn hold vectors of instructions. Literally writing all of the instructions into a file sounds pretty redundant and probably slower than real analysis.

The most benefit would come from skipping the most expensive parts of analysis. The fuzzy analysis and basic block steps essentially save every function's start address, end address, and reference counts (expensive), and then disassemble and save all of those functions' instructions (cheap-ish). If we could serialize that expensive information, we might be able to recreate the Function and BasicBlock objects and their disassembled instructions more quickly than we normally would.

Does this seem like a reasonable approach? I don't want to get too far off the deep end before I confirm this is reasonable.
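The minimal per-function record proposed here can be sketched as a fixed-size struct with a binary round trip. The field names and on-disk format are invented for illustration; instructions would be re-disassembled from the binary at load time rather than stored.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical record of the expensive-to-recompute facts per function.
struct FunctionRecord {
    std::uint64_t start     = 0;
    std::uint64_t end       = 0;
    std::uint32_t ref_count = 0;
};

// Write one record as raw little-endian-native bytes.
void write_record(std::ostream &os, const FunctionRecord &r) {
    os.write(reinterpret_cast<const char *>(&r.start), sizeof r.start);
    os.write(reinterpret_cast<const char *>(&r.end), sizeof r.end);
    os.write(reinterpret_cast<const char *>(&r.ref_count), sizeof r.ref_count);
}

// Read one record back; returns false on a short or failed read.
bool read_record(std::istream &is, FunctionRecord &r) {
    is.read(reinterpret_cast<char *>(&r.start), sizeof r.start);
    is.read(reinterpret_cast<char *>(&r.end), sizeof r.end);
    is.read(reinterpret_cast<char *>(&r.ref_count), sizeof r.ref_count);
    return static_cast<bool>(is);
}
```

A real format would also want a version field and the content hash discussed earlier, so a stale or incompatible file can be rejected outright.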

eteran (Owner) commented Dec 21, 2016

I'll take a look at it and get back to you. But we should probably lean towards storing too much rather than potentially storing too little.

AaronOpfer added a commit to AaronOpfer/edb-debugger that referenced this issue Dec 23, 2016

Scaffolding for Analysis Persistence
This is work toward #478, but not a complete implementation yet.

In order to meaningfully serialize the analysis data of a particular region, we need to know:
* What file the region belongs to,
* What the base address is of that file in memory,
* And how that compares to where the file was actually loaded (relocation).

So first we needed to add a public `base_address()` method to all of the `IBinaryInfo` plugins to satisfy the second requirement (PE32 is a stub right now, though). Then I added a method that generates a file path that can be used to store the region's analysis. This is a little rough: we have to iterate over all regions in memory to locate the ones that belong to the same file as our analyzed region (because of the possibility of multiple code sections), and then we try the BinaryInfo plugins on each region until one fits. We use that to obtain the base address and the load address of the module. Finally, we calculate a relative offset and put it into the filename.

Once this is in place, we just need to actually implement serialization of the analysis data. This is where I got a little bamboozled, so I'm putting up what I have so far, since I think it is pretty uncontroversial.
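The naming scheme the commit describes (module identity plus a base-relative offset in the file name) might look like this. The exact format string is invented; only the idea of keying on the relative offset comes from the commit message.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical cache-file name: module name plus the region's offset from
// the module base, so the name stays stable under ASLR.
std::string analysis_file_name(const std::string &module_name,
                               std::uint64_t region_start,
                               std::uint64_t module_base) {
    std::ostringstream os;
    os << module_name << '.' << std::hex << (region_start - module_base)
       << ".analysis";
    return os.str();
}
```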
