Extraction of images from "Data" directory in office OLE files #457

Open
Solumnant opened this issue Jun 26, 2019 · 2 comments

@Solumnant

I'm trying to extend a production document classification application so that it uses images extracted from Office documents.

Our team has been using oletools to extract macros from the files we're looking at. At first glance it would appear that oletools supports image extraction, given that it works with Microsoft Compound files, but none of the tools seem to look inside the "Data" directory within the file, where the images are held.

I was hoping that oletools could add a module that extracts all nonstandard media from Office files in a form that other tools can use. Another useful question oletools could answer is whether a document contains embedded pictures at all, without extracting them.
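
To illustrate the "does it contain pictures, without extracting them" question, here is a minimal sketch. It assumes the file is opened with olefile (the OLE parser oletools builds on) and uses only its `exists()` and `get_size()` methods. The helper name `data_stream_size` is hypothetical, and a non-empty "Data" entry is only a hint that embedded pictures may be present, not proof.

```python
def data_stream_size(ole, name="Data"):
    """Return the size in bytes of the named stream, or None if it is absent.

    `ole` is expected to behave like olefile.OleFileIO, i.e. to offer
    exists(name) and get_size(name). The helper itself is hypothetical
    and not part of oletools.
    """
    if not ole.exists(name):
        return None
    return ole.get_size(name)


# Possible usage, assuming olefile is installed and "report.doc" is a
# placeholder path:
#
#   import olefile
#   ole = olefile.OleFileIO("report.doc")
#   try:
#       print(data_stream_size(ole))
#   finally:
#       ole.close()
```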

OLEimage

@decalage2
Owner

For now I do not plan to parse the internal structure of Word/Excel/PPT/etc files in oletools, as that would require a lot of work. However, if you are willing to contribute some code to do so, please do not hesitate to send me a pull request.

It looks like what you are trying to achieve is to carve image files from stream data. In that case, I suggest looking at file carving tools such as these:
https://hachoir.readthedocs.io/en/latest/subfile.html
https://github.com/simsong/bulk_extractor
https://github.com/sleuthkit/scalpel
http://foremost.sourceforge.net/
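
For readers unfamiliar with carving, the idea can be sketched in a few lines of plain Python. This is not how any of the tools above are implemented; it only scans a byte string (for example, the contents of a stream read from the file) for PNG and JPEG signatures. The function names are hypothetical. PNG ends are found by walking the chunk list to `IEND`; the JPEG end is a naive search for the first `FF D9` marker, which can cut an image short if it embeds a thumbnail.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_START = b"\xff\xd8\xff"
JPEG_END = b"\xff\xd9"


def _png_end(data, start):
    """Return the offset just past the IEND chunk, or None if truncated."""
    pos = start + len(PNG_SIGNATURE)
    while pos + 12 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, payload, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        pos += 12 + length
        if pos > len(data):
            return None
        if chunk_type == b"IEND":
            return pos
    return None


def carve_images(data):
    """Yield (kind, offset, image_bytes) for each PNG or JPEG candidate."""
    pos = 0
    while pos < len(data):
        found = [
            (offset, kind)
            for offset, kind in (
                (data.find(PNG_SIGNATURE, pos), "png"),
                (data.find(JPEG_START, pos), "jpeg"),
            )
            if offset != -1
        ]
        if not found:
            return
        start, kind = min(found)  # handle whichever signature comes first
        if kind == "png":
            end = _png_end(data, start)
        else:
            marker = data.find(JPEG_END, start + len(JPEG_START))
            end = marker + len(JPEG_END) if marker != -1 else None
        if end is None:
            pos = start + 1  # incomplete candidate, keep scanning
            continue
        yield kind, start, data[start:end]
        pos = end
```

The CRC of each PNG chunk is not verified here, so a carved candidate should still be validated with an image library before it is used.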

@christian-intra2net
Contributor

I did start some code in the direction of "let's understand the structure as Office does it" with the ppt_record_parser. However, there is just so much different stuff in these files, and sometimes Microsoft does not adhere to its own standards (or I misread them), so pretty early on I fell back to parsing only the types of data needed to extract macros and ignored the rest. But it should be easily extendable (at least for ppt, where everything is record-based).
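
To show what "record-based" means here, the following is a minimal sketch of walking PPT-style records. It does not use or reproduce the ppt_record_parser API; the names `RecordHeader` and `iter_records` are hypothetical. It assumes the 8-byte little-endian record header described in the MS-PPT specification (4-bit version, 12-bit instance, 16-bit type, 32-bit length) and treats records with version 0xF as containers whose payload holds child records.

```python
import struct
from typing import Iterator, NamedTuple, Optional


class RecordHeader(NamedTuple):
    offset: int        # position of the header in the data
    rec_ver: int       # low 4 bits of the first 16-bit field
    rec_instance: int  # high 12 bits of the first 16-bit field
    rec_type: int
    rec_len: int       # payload length, excluding the 8-byte header


HEADER = struct.Struct("<HHI")


def iter_records(data: bytes, start: int = 0,
                 end: Optional[int] = None) -> Iterator[RecordHeader]:
    """Yield record headers depth-first, descending into containers."""
    end = len(data) if end is None else end
    pos = start
    while pos + HEADER.size <= end:
        ver_instance, rec_type, rec_len = HEADER.unpack_from(data, pos)
        header = RecordHeader(pos, ver_instance & 0xF, ver_instance >> 4,
                              rec_type, rec_len)
        yield header
        body_start = pos + HEADER.size
        if header.rec_ver == 0xF:
            # Container: its payload is itself a sequence of records.
            yield from iter_records(data, body_start,
                                    min(body_start + rec_len, end))
        pos = body_start + rec_len
```

Extending such a walker to other content would mean dispatching on `rec_type` for the record types of interest and ignoring the rest.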
