Extraction of images from "Data" directory in office OLE files #457

Solumnant · 2019-06-26T21:07:28Z

I'm trying to extend a production document classification application with features based on images extracted from Office documents.

Our team has been using oletools to extract macros from the files we look at. At first glance it would seem that oletools supports image extraction, given that it works with Microsoft Compound files, but none of the tools seem to look inside the "Data" directory within the file, where the images are held.

I was hoping that oletools could add a module that extracts all nonstandard media from Office files in a form that other tools can use. It would also be useful if oletools could tell whether a document contains embedded pictures without extracting them.
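To make the second request concrete, here is a minimal sketch of a "does this file contain pictures?" check that extracts nothing. It is not an oletools feature. It assumes an object that exposes `listdir()` and `openstream()` the way `olefile.OleFileIO` does, and the function name `streams_with_images` and the signature table are made up for illustration. It only searches each stream for a few well-known magic byte sequences, so short signatures (the 3-byte JPEG one in particular) can produce false positives in large binary streams.

```python
# Magic bytes for a few common raster formats (not exhaustive).
IMAGE_SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
    "gif": b"GIF8",
}


def streams_with_images(ole):
    """Map stream path -> sorted list of formats whose signature appears in it.

    `ole` is any object exposing listdir() and openstream() the way
    olefile.OleFileIO does. Nothing is extracted or written to disk.
    """
    hits = {}
    for path in ole.listdir():  # each path is a list of name components
        data = ole.openstream(path).read()
        found = sorted(
            name for name, magic in IMAGE_SIGNATURES.items() if magic in data
        )
        if found:
            hits["/".join(path)] = found
    return hits
```

With olefile installed, this could be called as `streams_with_images(olefile.OleFileIO("sample.doc"))`, where `sample.doc` is a placeholder file name. An empty result means no signature was seen, not that the document is guaranteed to be image-free.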

decalage2 · 2019-06-27T07:23:44Z

For now I do not plan to parse the internal structure of Word/Excel/PPT/etc files in oletools, as that would require a lot of work. However, if you are willing to contribute some code to do so, please do not hesitate to send me a pull request.

It looks like what you are trying to achieve is to carve image files from stream data. In that case, I suggest looking at file carving tools such as these:
https://hachoir.readthedocs.io/en/latest/subfile.html
https://github.com/simsong/bulk_extractor
https://github.com/sleuthkit/scalpel
http://foremost.sourceforge.net/
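For illustration only, the carving idea can be sketched in a few lines of Python. This is a naive approximation and not how any of the tools above work internally: it assumes the stream bytes are already in memory (for example, read from the "Data" stream with olefile), handles only PNG and JPEG, and the function name `carve_images` is hypothetical. The JPEG branch stops at the first end-of-image marker, so a JPEG with an embedded thumbnail would be cut short.

```python
def carve_images(data):
    """Return a list of (format, offset, blob) for PNG and JPEG data found in `data`."""
    results = []

    # PNG: fixed 8-byte signature; the file ends with the IEND chunk type
    # followed by a 4-byte CRC.
    png_sig = b"\x89PNG\r\n\x1a\n"
    pos = data.find(png_sig)
    while pos != -1:
        end = data.find(b"IEND", pos)
        if end == -1:
            break
        end += 8  # 4 bytes "IEND" + 4 bytes CRC
        results.append(("png", pos, data[pos:end]))
        pos = data.find(png_sig, end)

    # JPEG: start-of-image marker followed by another marker; this sketch
    # treats the first end-of-image marker (FF D9) as the end of the file.
    jpeg_sig = b"\xff\xd8\xff"
    pos = data.find(jpeg_sig)
    while pos != -1:
        end = data.find(b"\xff\xd9", pos + len(jpeg_sig))
        if end == -1:
            break
        end += 2
        results.append(("jpeg", pos, data[pos:end]))
        pos = data.find(jpeg_sig, end)

    results.sort(key=lambda item: item[1])  # order by offset in the stream
    return results
```

Each returned blob can be written to a file or passed to an image library for validation. Validating the result is advisable, since signature matching alone cannot tell a real image from a coincidental byte sequence.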

christian-intra2net · 2019-06-27T10:37:56Z

I did start some code in the direction of "let's understand the structure the way Office does" with the ppt_record_parser. However, there is so much different stuff in these files, and sometimes Microsoft does not adhere to its own standards (or I misread them), that pretty early on I fell back to parsing only the types of data needed to extract macros and ignored the rest. But it should be easily extensible (at least for ppt, where everything is record-based).
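To illustrate what "record-based" means here, below is a minimal sketch of walking PPT-style records. It is not the ppt_record_parser API. It assumes the 8-byte record header layout as I understand it from the MS-PPT specification: a little-endian 16-bit field holding a 4-bit version and a 12-bit instance, a 16-bit record type, and a 32-bit payload length, with version 0xF marking a container whose payload is itself a sequence of records. The function name `iter_records` is made up.

```python
import struct


def iter_records(data, offset=0, end=None, depth=0):
    """Yield (depth, rec_type, rec_instance, payload_offset, payload_len) per record."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        # Header: version/instance (uint16), type (uint16), length (uint32).
        ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", data, offset)
        rec_ver = ver_inst & 0x000F
        rec_instance = ver_inst >> 4
        payload = offset + 8
        yield depth, rec_type, rec_instance, payload, rec_len
        if rec_ver == 0xF:
            # Container record: descend into its payload, clamped to the
            # enclosing range in case the declared length is too large.
            yield from iter_records(
                data, payload, min(payload + rec_len, end), depth + 1
            )
        offset = payload + rec_len
```

A picture extractor built on this would filter on the record types that carry image data and slice `data[payload_offset:payload_offset + payload_len]`. Which types those are, and which stream holds them, has to be taken from the specification.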

decalage2 added the question label Jun 27, 2019
