Python tool for bulk PDF feature extraction. This tool is a prototype.
db
jobs
pdfminer
peepdf
scripts
util
.gitignore
JSAnalysis.py
LICENSE.md
README.md
__init__.py
build_pdf_objects.py
cfg.py
db_mgmt.py
huntterp.py
pdfrankenstein.py
sdhasher.py
storage.py
xml_creator.py


PDFrankenstein

Python tool for bulk malicious PDF feature extraction.

Dependencies

  • PyV8 (and V8) (optional: needed only if you intend to use JS deobfuscation). Note: JS deobfuscation must be run in a safe environment, as you would treat any malware.
  • lxml
  • scandir (optional: module included in lib folder)
  • postgresql and psycopg2 (optional: if you intend to use postgresql backing storage)
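
Before a bulk run it can help to see which of these modules are importable. The following is a minimal sketch, not part of PDFrankenstein itself: the `check_dependencies` helper and the notes in `DEPENDENCIES` are illustrative assumptions, and the import names (`PyV8`, `lxml`, `scandir`, `psycopg2`) are taken from the list above.

```python
from importlib.util import find_spec

# Import names taken from the dependency list above. psycopg2 stands in for
# the PostgreSQL backend, since the database server is not a Python module.
# The notes are a summary of the README, not values read from the tool.
DEPENDENCIES = {
    "PyV8": "optional, JS deobfuscation",
    "lxml": "no optional note in the README",
    "scandir": "optional, a copy ships in the lib folder",
    "psycopg2": "optional, PostgreSQL backing storage",
}


def check_dependencies(names=DEPENDENCIES):
    """Return {module_name: bool} saying whether each module can be located.

    find_spec only looks the module up; it does not import it.
    """
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in check_dependencies().items():
        status = "found" if present else "missing"
        print(f"{name:10s} {status:8s} {DEPENDENCIES[name]}")
```

A missing optional module only matters if you plan to use the corresponding feature.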

Usage

$ pdfrankenstein.py --help

Output to a delimited plain-text file after parsing ALL files in pdf-dir/

$ pdfrankenstein.py -o file -n fileoutput.txt ~/pdf-dir

Output to an sqlite database

$ pdfrankenstein.py -o sqlite3 -n pdf-db ~/pdf-dir
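
The README does not document the database schema, so a reasonable first step after a run is to list the tables that were created. The sketch below uses only Python's standard `sqlite3` module. `list_tables` is a hypothetical helper, and the assumption that the database file is named exactly `pdf-db` on disk may not hold (the tool might append an extension).

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in an SQLite database file, sorted."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Example (assumes the file written by the command above is called "pdf-db"):
#     print(list_tables("pdf-db"))
```

Note that `sqlite3.connect` creates an empty file if the path does not exist, so check the path first if the list comes back empty.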

Output to stdout after parsing all files listed inside file-with-pdfs

$ pdfrankenstein.py -o stdout ~/file-with-pdfs

pdf_in          PDF input for analysis. Can be a single PDF file or a directory of files.
-d, --debug     Print debugging messages.
-o, --out       Analysis output filename or type. Defaults to an 'unnamed-out.*' file in the CWD. Options: 'sqlite3' || 'postgres' || 'stdout' || [filename]
-n, --name      Name for the output database.
--hasher        Which type of hasher to use: PeePDF | PDFMiner (default). The PDFMiner option provides better parsing capabilities.
-v, --verbose   Spam the terminal (TODO).
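
For readers scripting around the tool, the options above map naturally onto Python's `argparse`. This is a hedged sketch of how such a parser could be declared, not the code in pdfrankenstein.py: `build_parser` is a hypothetical name, treating `--debug` and `--verbose` as boolean switches is an assumption, and the `--out` default is left unset because the README gives it only as the pattern 'unnamed-out.*'.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="pdfrankenstein.py",
        description="Bulk malicious PDF feature extraction.",
    )
    parser.add_argument(
        "pdf_in",
        help="PDF input for analysis: a single PDF file or a directory of files.",
    )
    # Assumed to be on/off switches; the README does not say they take values.
    parser.add_argument("-d", "--debug", action="store_true",
                        help="Print debugging messages.")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Spam the terminal (TODO in the original tool).")
    # Default left as None here; the README only describes it as 'unnamed-out.*'.
    parser.add_argument("-o", "--out", default=None,
                        help="'sqlite3', 'postgres', 'stdout', or a filename.")
    parser.add_argument("-n", "--name",
                        help="Name for the output database.")
    # No `choices=` restriction: the accepted spellings are not documented.
    parser.add_argument("--hasher", default="PDFMiner",
                        help="PeePDF or PDFMiner (default).")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this sketch, `build_parser().parse_args(["-o", "sqlite3", "-n", "pdf-db", "pdf-dir"])` yields `out="sqlite3"`, `name="pdf-db"`, and `pdf_in="pdf-dir"`, matching the SQLite example in the usage section.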

References

Open Source PDF Tools