integrating byteweight into bap. #22

Closed
ivg opened this issue Nov 26, 2014 · 3 comments

Comments

@ivg
Member

ivg commented Nov 26, 2014

So, we have byteweight merged into bap, but there are a few issues we need to discuss. We should understand that currently it is mostly not a part of bap, but more of a demo application. That's not bad, but it is not enough.
What we should do next is split it into library and application parts, so that we can grab some neat stuff from byteweight and use it inside bap itself. We also need to make a plugin out of byteweight. But before doing this we should figure out what kind of service it provides. Currently in BAP there is only one service, named bap.image, which provides facilities to load and parse binary files. So it is time to add a new service.
Now we should figure out the interface of the service. In fact, we need two interfaces: one for the backend (i.e., the service provider) and another for the frontend (the service itself) (cf. elf_backend and image).
So, let's start with the frontend. Two variants came to my mind: something like a function start identifier (FSI) or a function boundaries identifier (FBI). Currently, only dwarf can provide the latter. But since dwarf can't be relied on in real conditions, we can forget about it. We also have elf itself, which can provide some useful information even for a stripped binary. But afaik it can also provide only function starts (correct me if I'm wrong, but all that we can rely on is the dynsym table coupled with the relocation table, and they give us only starting locations). So my idea is that, instead of starting with FBI and then downcasting it to FSI, we should start with the latter.
Another question is symbol names. I think that function boundaries and function names are orthogonal ideas and shouldn't be mixed. It would be better to have a separate service that resolves names.
So, back to FSI. What this service can actually provide is a predicate over a binary that marks certain addresses as function starts, which gives us image -> addr seq or mem -> arch -> addr seq. The problem with these interfaces is that they don't grant any access to file metainformation, so we can't implement any providers that rely on it (like dwarf or elf).
That means that the FSI backend should work at a lower level, directly with the file, so we come up with Bigstring.t -> arch -> addr seq. Also, having in mind some other possible backend implementations, such as one based on llvm code, we can make it even a little more low-level:
Bigstring.t -> arch -> addr -> bool. So, I'm eager to hear from others. Everyone is welcome.
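
To make the two backend shapes concrete, here is a minimal sketch of how the predicate form (Bigstring.t -> arch -> addr -> bool) could be lifted to the sequence form (Bigstring.t -> arch -> addr seq). Everything in it is an assumption for illustration: the arch, addr and bigstring types are stand-ins rather than BAP's own, addresses are treated as plain file offsets, and toy_backend (which flags the x86 byte 0x55, push ebp) is a placeholder, not how byteweight classifies.

```ocaml
(* Stand-in types; hypothetical, not the real BAP definitions. *)
type arch = X86 | X86_64 | ARM
type addr = int64 (* assumption: an address is just a file offset here *)
type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Low-level backend: a predicate over the raw file contents. *)
type fsi_backend = bigstring -> arch -> addr -> bool

(* Lazy sequence of the integers i, i+1, ..., n-1. *)
let rec range i n () =
  if i >= n then Seq.Nil else Seq.Cons (i, range (i + 1) n)

(* Derive the sequence form by testing every offset of the file. *)
let starts (backend : fsi_backend) (data : bigstring) (arch : arch)
  : addr Seq.t =
  range 0 (Bigarray.Array1.dim data)
  |> Seq.map Int64.of_int
  |> Seq.filter (backend data arch)

(* Toy backend: on x86, flag every offset holding 0x55 (push ebp). *)
let toy_backend : fsi_backend = fun data arch addr ->
  match arch with
  | X86 | X86_64 ->
    let off = Int64.to_int addr in
    off >= 0
    && off < Bigarray.Array1.dim data
    && Bigarray.Array1.get data off = '\x55'
  | ARM -> false

(* Helper for experiments: copy a string into a bigstring. *)
let of_string (s : string) : bigstring =
  let b =
    Bigarray.Array1.create Bigarray.char Bigarray.c_layout (String.length s)
  in
  String.iteri (fun i c -> Bigarray.Array1.set b i c) s;
  b
```

With this, List.of_seq (starts toy_backend (of_string "\x55\x90\x55\xc3") X86) lists the offsets 0 and 2. A backend that reads file metadata (elf, dwarf) would fit the same type, since it receives the whole file rather than an already loaded image.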

ivg added a commit that referenced this issue Nov 26, 2014
@ivg ivg mentioned this issue Nov 26, 2014
@dbrumley
Contributor

I agree with Ivan. In addition, we should think about how, architecture-wise, we want to split the training and classification parts. Anyone using byteweight will probably want to do their own training, test accuracy (against symbols), and so on. Once they have a trie, they will want to use it. So one question is: in the library, which trie do we load? Do we make the user specify one, or is there a default? These are questions you two should resolve as well.

@ivg
Member Author

ivg commented Nov 26, 2014

train is a program that we already have. It can be called to obtain signatures (it is not yet added to oasis, so it isn't built automatically, see #23, but we can assume that it is already added).
And currently byteweight comes as an executable; it doesn't provide a library-level interface. We decided to move in small steps: first make it work as it is, and then split it into parts, factoring out whatever is useful. For example, I'm tempted to move the trie implementation into Bap_types.
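
For readers who haven't looked at the code, a minimal sketch of the kind of trie that could be factored out: a byte-keyed prefix tree with a longest-prefix lookup. This is not byteweight's actual implementation; the type, the function names and the weights in the usage note are all hypothetical.

```ocaml
module CharMap = Map.Make (Char)

(* A node optionally carries a value and maps the next byte to a child. *)
type 'a trie = { value : 'a option; children : 'a trie CharMap.t }

let empty = { value = None; children = CharMap.empty }

(* Insert [v] under the bytes of [key], starting from position [i]. *)
let rec add_from key i v t =
  if i = String.length key then { t with value = Some v }
  else
    let c = key.[i] in
    let child =
      match CharMap.find_opt c t.children with
      | Some child -> child
      | None -> empty
    in
    { t with children = CharMap.add c (add_from key (i + 1) v child) t.children }

let add key v t = add_from key 0 v t

(* Value stored under the longest key that is a prefix of [s], if any. *)
let longest_match s t =
  let rec go i t best =
    let best = match t.value with Some _ as v -> v | None -> best in
    if i = String.length s then best
    else
      match CharMap.find_opt s.[i] t.children with
      | Some child -> go (i + 1) child best
      | None -> best
  in
  go 0 t None
```

For example, after empty |> add "\x55" 0.5 |> add "\x55\x89\xe5" 0.9, looking up "\x55\x89\xe5\x83" yields Some 0.9 and looking up "\x55\x90" falls back to Some 0.5. Keeping the structure polymorphic in the stored value is what would make it reusable outside byteweight.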

@tiffanyb
Member

I agree that we should split bw into an application and libraries. As for customized signature files, I propose that we support them in the application but not in the library. This is because, as one of the libraries in BAP, it should only use the signatures that BAP generates. In this case one can consider BAP as a user of bw. Similarly for training, I think we should regard it as an application as well.
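
One way to picture this split, as a sketch only: the library never exposes a choice of signatures, while the application lets an optional user-supplied path override the default. The type and function names below are hypothetical and no file is actually read.

```ocaml
(* Where the classifier's signatures come from. *)
type source =
  | Bundled          (* the signatures generated and shipped by BAP *)
  | Custom of string (* a user-trained signature file, application only *)

(* Library side: no choice is exposed, the bundled signatures are used. *)
let library_source : source = Bundled

(* Application side: an optional command-line path overrides the default. *)
let application_source (user_path : string option) : source =
  match user_path with
  | Some path -> Custom path
  | None -> library_source

(* Human-readable description, e.g. for a log message. *)
let describe = function
  | Bundled -> "bundled signatures"
  | Custom path -> "custom signatures from " ^ path
```

Under this arrangement the default question raised above is answered in one place (library_source), and training stays outside the library entirely.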
