A utility to identify and map the semantic structure of files, including polyglots, chimeras, and schizophrenic files. It can be used in conjunction with its sister tool PolyTracker for Automated Lexical Annotation and Navigation of Parsers, a backronym devised solely for the purpose of collectively referring to the tools as The ALAN Parsers Project.
In the same directory as this README, run:
pip3 install -e .
This will automatically install the polyfile executable in your path.
$ polyfile --help
usage: polyfile [-h] [--html HTML] [--debug] [--quiet] FILE

A utility to recursively map the structure of a file.

positional arguments:
  FILE                  The file to analyze

optional arguments:
  -h, --help            show this help message and exit
  --html HTML, -t HTML  Path to write an interactive HTML file for exploring
                        the PDF
  --debug, -d           Print debug information
  --quiet, -q           Suppress all log output (overrides --debug)
To generate a JSON mapping of a file, run:
polyfile INPUT_FILE > output.json
You can optionally have PolyFile output an interactive HTML page containing a labeled, interactive hexdump of the file:
polyfile INPUT_FILE --html output.html > output.json
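The same invocations can be scripted. The following is a minimal sketch, assuming the polyfile executable is on your path and writes its JSON mapping to standard output as in the commands above. The helper names `build_command` and `map_file` are hypothetical and not part of PolyFile. No JSON schema is assumed, so the sketch only parses the output and returns it unchanged.

```python
import json
import subprocess


def build_command(input_file, html_path=None):
    """Assemble the polyfile command line shown above (hypothetical helper)."""
    cmd = ["polyfile", input_file]
    if html_path is not None:
        # --html additionally writes the interactive hexdump page
        cmd += ["--html", html_path]
    return cmd


def map_file(input_file, html_path=None):
    """Run polyfile and return its parsed JSON mapping.

    Assumes stdout contains only the JSON document; the schema is not assumed.
    """
    result = subprocess.run(
        build_command(input_file, html_path),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)
```

Calling `map_file("INPUT_FILE", html_path="output.html")` would correspond to the second command above, with the JSON returned as a Python object instead of being redirected to output.json.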
PolyFile can identify all 10,000+ file formats in the TrID database. It currently has support for parsing and semantically mapping the following formats:
- PDF, using an instrumented version of Didier Stevens' public domain, permissive, forensic parser
- ZIP, including recursive identification of all ZIP contents
- JPEG/JFIF, using its Kaitai Struct grammar
- iNES
- Any other format specified in a KSY grammar
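To illustrate what recursive identification of ZIP contents involves, here is a minimal sketch using only the Python standard library. It is not PolyFile's implementation: the function name `walk_zip` and the `!/` path separator for nested archives are assumptions made for this example, and it reports only member paths and sizes rather than a semantic mapping.

```python
import io
import zipfile


def walk_zip(data, prefix=""):
    """Yield (path, size) for every member, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            member = archive.read(info.filename)
            path = prefix + info.filename
            yield path, len(member)
            # Detect nested archives by content, not by file extension
            if zipfile.is_zipfile(io.BytesIO(member)):
                yield from walk_zip(member, prefix=path + "!/")
```

Checking the content of each member, rather than its name, matters for the kinds of files PolyFile targets, since a polyglot may carry a ZIP payload under an unrelated extension.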
For an example that exercises all of these file formats, run:
curl -v --silent https://www.sultanik.com/files/ESultanikResume.pdf | polyfile --html ESultanikResume.html - > ESultanikResume.json
- The instrumented Kaitai Struct parser generator implementation has only been tested on the JPEG/JFIF grammar; other KSY definitions may exercise portions of the KSY specification that have not yet been implemented
- The JSON output schema will soon be replaced with the similar SBuD format
This research was developed by Trail of Bits with funding from the Defense Advanced Research Projects Agency (DARPA) under the SafeDocs program as a subcontractor to Galois. It is licensed under the Apache 2.0 license. The PDF parser is modified from the parser developed by Didier Stevens and released into the public domain. © 2019, Trail of Bits.