Non-drcov format support #41

Closed
yrp604 opened this issue Aug 24, 2018 · 3 comments

Comments

@yrp604
Contributor

@yrp604 yrp604 commented Aug 24, 2018

Hey so, we've talked about this a bit but I wanted to document why lighthouse supporting other formats would be nice while it was still fresh.

So, drcov is useful in that there are easy, cross-platform tools to generate it; however, it has some pretty significant shortcomings which I'm running into. Specifically, drcov is made up of a header which gives the module map, followed by a series of tuples (module id, bb offset, bb size). The main issue here is the bb size field. If you're generating a trace with something that is aware of the bb sizes (e.g. a DBI), this is all cool. However, if you're dumping a trace from something that is not bb-aware (e.g. an emulator, or collecting code coverage via sampling), you just have a list of PC values.
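To make the bb size problem concrete, here is a minimal sketch of packing a single basic block record. The field widths and on-disk order (32-bit offset, 16-bit size, 16-bit module id, little-endian) are an assumption based on the commonly described drcov binary layout, and `pack_bb` / `unpack_bb` are hypothetical helper names; verify against the drcov version you actually target.

```python
import struct

# Assumed layout of one drcov basic block record (little-endian):
#   uint32 offset from module base, uint16 block size, uint16 module id
BB_ENTRY = struct.Struct("<IHH")

def pack_bb(mod_id, offset, size):
    """Pack one (module id, bb offset, bb size) tuple into 8 bytes."""
    return BB_ENTRY.pack(offset, size, mod_id)

def unpack_bb(raw):
    """Inverse of pack_bb: return (module id, bb offset, bb size)."""
    offset, size, mod_id = BB_ENTRY.unpack(raw)
    return mod_id, offset, size
```

A tracer that only sees PC values can fill in `mod_id` and `offset`, but has nothing trustworthy to put in `size` without consulting a disassembler.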

Assuming you have a module map and a list of PC values, there are a few things you could do:

  • Tell Lighthouse that every instruction is a single bb, however with variable length instructions this requires a disassembler
  • Use a disassembler to go from PC -> block, then get the block size and base from that. This is what I've most commonly implemented, but it's a pain in the ass: it requires IDA to have all covered code in recognized functions, and it doesn't properly handle calls in the middle of a block (if the call does not return, the entire block is still painted)

Basically, both of these require pre-processing the coverage in IDA before loading, which is doable but is a pain in the ass.

So, I'm pretty agnostic with regards to what the actual format is, but the feature request is the ability to load any coverage data format which can be generated from the module mappings and a list of PC values.
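As a rough illustration of what "generated from the module mappings and a list of PC values" could mean in practice, the sketch below resolves absolute PCs to (module, offset) pairs. The `(name, base, size)` module map shape and the `resolve_pcs` name are assumptions made for this example, not anything Lighthouse defines.

```python
from bisect import bisect_right

def resolve_pcs(modules, pcs):
    """Resolve absolute PCs against a module map.

    modules: list of (name, base, size) tuples (hypothetical shape).
    pcs: iterable of absolute addresses.
    Returns (resolved, unresolved): resolved holds (name, offset) pairs,
    unresolved holds PCs that fall outside every module.
    """
    ordered = sorted(modules, key=lambda m: m[1])
    bases = [m[1] for m in ordered]
    resolved, unresolved = [], []
    for pc in pcs:
        # Find the last module whose base is <= pc.
        i = bisect_right(bases, pc) - 1
        if i >= 0:
            name, base, size = ordered[i]
            if pc < base + size:
                resolved.append((name, pc - base))
                continue
        unresolved.append(pc)
    return resolved, unresolved
```

Note that this needs no disassembler and no block sizes: only the module map and the raw PCs.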

@gaasedelen
Owner

@gaasedelen gaasedelen commented Aug 29, 2018

You are not the first to ask for the ability to load instruction traces, and I think that it is a reasonable request. It is trivial to implement for the 'perfect' trace, but gets a bit murky for the general case.

I haven't sat down to enumerate the considerations we have to be mindful of when adding this feature. This is a good opportunity to do so, and I welcome any of your thoughts.

0 - usability

Like loading from drcov files, it is important that loading a trace will just work without spamming users with dialogs and asking for their 'help' to identify relevant modules or to enter a base address (e.g. because of ASLR) for the module of interest. This is a shitty user experience which can become both tedious and annoying, especially when it's avoidable.

I also don't want menu options to choose between 10 different trace formats to load from. Novice users might not know what trace format they collected, or what to try to load it as.

1 - existing traces

There is no 'standardized' trace format, making the contents ambiguous.

0x4522ed
0x4522f0
0x4522f2
0x4522f4
0x4522fb
0x452301
0x452304
0x452306
0x5a52e1
0x5a52e2
0x5a52e5
0x5a52e9
0x5a52ef
0x5a52f6
0x5a52fd
0x5a5304
0x5a530b
...
  • Is this a binary or ASCII (text) trace file?
  • Are these instruction addresses?
  • Are these basic block addresses?
    • the definition of a basic block can differ from tool to tool (disassemblers, dbis, ...)
  • Is there more than one module/address space in this trace?
  • Which module addresses correspond with the binary open in IDA?
  • What is the base address for the traced module? (assuming the VA is different than IDA)
  • If we can't use heuristics to confidently identify a base address or module of interest, what are the odds that the user will have logged it? Is it really worth trying to ask them to enter it?
    • Now imagine getting spammed 10x in a row asking for the base address of each trace you load.

2 - custom traces

We could force lighthouse to only load custom traces with additional metadata (sort of like drcov) which can address all of the ambiguity stated above.

The problem with this approach is that we immediately lose support for existing tracing solutions such as the in-box PIN tool by Intel, a DynamoRIO equivalent, or virtually any other tracing technology out there.

3 - expectations vs reality

Right now, lighthouse does not paint 'coverage' or compute statistics for code that falls outside of defined functions. This is what I call 'unmapped coverage' which is entirely invisible to users at the present moment, and a skeleton in the lighthouse closet.

This is relevant to this issue because instruction traces are more frequently used to capture abnormal execution (e.g. malware). In these cases, it will be more common for traces to 'collect coverage' on code or instructions that are not within defined functions (e.g. obfuscated code). Even if you could capture an instruction trace, lighthouse might not be showing you everything that is getting executed.

The biggest reason for this shortcoming is simply because lighthouse metadata aggregation works at a function level. We do not iterate over individual instructions outside of defined functions. Integrating this change will probably require some degree of re-architecting the metadata collection and storage process which I have not investigated.
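To illustrate what 'unmapped coverage' means, here is a minimal sketch that splits traced addresses into those inside defined functions and those outside. It is not how lighthouse stores metadata; the half-open `(start, end)` function ranges and the `split_coverage` name are assumptions for the example.

```python
from bisect import bisect_right

def split_coverage(functions, addresses):
    """Partition addresses into (mapped, unmapped).

    functions: non-overlapping half-open (start, end) ranges
               (hypothetical stand-in for the database's functions).
    addresses: iterable of executed addresses.
    """
    ordered = sorted(functions)
    starts = [start for start, _ in ordered]
    mapped, unmapped = [], []
    for addr in addresses:
        # Find the last function starting at or before addr.
        i = bisect_right(starts, addr) - 1
        if i >= 0 and addr < ordered[i][1]:
            mapped.append(addr)
        else:
            unmapped.append(addr)
    return mapped, unmapped
```

With function-level aggregation, everything that lands in the `unmapped` list is silently dropped, which is exactly the case an instruction trace of obfuscated code is likely to hit.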


These are just some immediate thoughts, I may add more later.

@yrp604
Contributor Author

@yrp604 yrp604 commented Oct 25, 2018

So after thinking about this and looking at my existing toolchain, I think the best option is a text file with mod+off.

E.g. we have a.exe which loads b.dll and c.dll. Our trace would look something like:

a+4141
b+5242
b+5243
a+4142
c+6361
a+4143

If we have a.exe open, we paint the matching blocks (if idaapi.get_root_filename() matches). If we have b.dll open, we paint b's blocks, etc. Maybe we spit out some info to the console about the unmatched things (e.g. 'Warning: did not paint 100,000 trace points due to non-matching base name').

I think mod+off is the right way to go because it's pretty much the simplest format that's still usable, has the fewest assumptions, etc.
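A minimal sketch of the proposed loading behavior, under some stated assumptions: offsets are hexadecimal, module names are compared case-insensitively with the file extension stripped, and the database name is passed in as a plain string (inside IDA it would come from idaapi.get_root_filename()). `load_modoff` is a hypothetical name, not an existing lighthouse function.

```python
import os

def load_modoff(lines, root_filename):
    """Parse mod+off lines, keeping offsets for the open database only.

    Returns (offsets, skipped), where skipped counts lines that were
    malformed or belonged to a different module.
    """
    # Assumption: match on base name without extension, ignoring case.
    target = os.path.splitext(root_filename)[0].lower()
    offsets, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        module, sep, offset = line.rpartition("+")
        if not sep:
            skipped += 1
            continue
        try:
            value = int(offset, 16)  # assumption: offsets are hex
        except ValueError:
            skipped += 1
            continue
        if os.path.splitext(module)[0].lower() == target:
            offsets.append(value)
        else:
            skipped += 1
    return offsets, skipped
```

The `skipped` count is what would feed a console warning about trace points that were not painted.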

@gaasedelen
Owner

@gaasedelen gaasedelen commented Mar 10, 2019

There is now a rough implementation of mod+off available on the develop branch (v0.8.4-DEV and newer). Simply try loading a mod+off file through the normal workflow.

Let me know if you experience any immediate issues or oddities, I'll be cleaning up the code soon.
