Commit
Version 1.9 (#28)
* Adding new methods for stream handling (from_stream, magic_stream) (thanks to Robbert Korving)

Co-authored-by: Robbert Korving <r.korving@gmail.com>
cdgriffith and robkorv authored Jun 15, 2020
1 parent f37d19a commit b57020f
Showing 14 changed files with 274 additions and 114 deletions.
17 changes: 17 additions & 0 deletions .black.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
[tool.black]
line-length = 120
target-version = ['py36', 'py37', 'py38']
exclude = '''
/(
\.eggs
| \.git
| \.idea
| \.pytest_cache
| \.github
| _build
| build
| dist
| venv
| test/resources
)/
'''
13 changes: 13 additions & 0 deletions .coveragerc
@@ -0,0 +1,13 @@
[report]
omit =
*/python?.?/*
*/site-packages/*
*/test/*
*/pypy/*
*/venv/*
*/.*/*
*/*.egg-info/*
*/.mypy_cache/*
*/.pytest_cache/*
exclude_lines =
command_line_entry
17 changes: 17 additions & 0 deletions .flake8
@@ -0,0 +1,17 @@
[flake8]
max-line-length = 120
exclude = .git,.idea,__pycache__,.gitignore,venv,.github,build,dist,test
ignore =
# E203 whitespace before ':'
# black will insert some non-E203-compliant whitespace
E203,
# W503 line break before binary operator
# black will insert non-W503-compliant line breaks
W503,
# F401 imported but unused
# for __version__, __author__
F401,
# F403 used; unable to detect undefined names
# When importing star (*) but some names are undefined
F403,
T001
7 changes: 5 additions & 2 deletions .github/workflows/pythonpublish.yml
@@ -29,10 +29,13 @@ jobs:
run: |
python setup.py sdist bdist_wheel
twine upload dist/*
PURE_MAGIC_VERSION=$(python -c "import puremagic; print(puremagic.__version__)")
- name: Find whl file
run: |
WHL=$(find dist -name *.whl -print -quit)
echo ${WHL}
- name: Upload to release
uses: JasonEtco/upload-to-release@master
with:
args: dist/puremagic-${PURE_MAGIC_VERSION}-py3-none-any.whl
args: ${WHL}
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
19 changes: 9 additions & 10 deletions .github/workflows/tests.yml
@@ -27,14 +27,14 @@ jobs:
run: |
python -m pip install --upgrade pip
pip install -r requirements-test.txt
pip install coveralls
pip install coveralls black flake8 setuptools wheel twine
- name: Verify Code with Black
run: |
black --config=.black.toml --check puremagic test
- name: Lint with flake8
run: |
pip install flake8 flake8-print
# stop the build if there are Python syntax errors or undefined names
flake8 box --count --select=E9,F63,F7,F82,T001,T002,T003,T004 --show-source --statistics
# exit-zero treats all errors as warnings.
flake8 . --count --exit-zero --max-complexity=20 --max-line-length=120 --statistics
# stop the tests if there are linting errors
flake8 puremagic --count --show-source --statistics
- name: Test with pytest
env:
COVERALLS_REPO_TOKEN: ${{ secrets.COVERALLS_REPO_TOKEN }}
@@ -43,9 +43,8 @@ jobs:
coveralls || true
- name: Check distribution long description
run: |
pip install setuptools wheel twine
python setup.py sdist bdist_wheel
twine check dist/*
PURE_MAGIC_VERSION=$(python -c "import puremagic; print(puremagic.__version__)")
ls -lah "dist/puremagic-${PURE_MAGIC_VERSION}-py3-none-any.whl"
ls -lah "dist/puremagic-${PURE_MAGIC_VERSION}.tar.gz"
ls -lah "dist/"
WHL=$(find dist -name *.whl -print -quit)
echo ${WHL}
26 changes: 26 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,26 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.5.0
hooks:
- id: mixed-line-ending
exclude: ^test/resources/
- id: trailing-whitespace
exclude: ^test/resources/
- id: requirements-txt-fixer
- id: fix-encoding-pragma
- id: check-byte-order-marker
- id: debug-statements
- id: check-yaml
- repo: https://github.com/ambv/black
rev: 19.10b0
hooks:
- id: black
args: [--config=.black.toml]
- repo: https://gitlab.com/pycqa/flake8
rev: master
hooks:
- id: flake8
- repo: https://github.com/pre-commit/mirrors-mypy
rev: 'v0.770'
hooks:
- id: mypy
3 changes: 2 additions & 1 deletion AUTHORS.rst
@@ -11,4 +11,5 @@ A big thank you to everyone that has helped!
- andrewpmk
- bannsec
- Don Tsang (DonaldTsang)
- Oleksandr (msdinit)
- Oleksandr (msdinit)
- Robbert Korving (robkorv)
7 changes: 6 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,11 @@
Changelog
=========

Version 1.9
-----------

- Adding new methods for stream handling (from_stream, magic_stream) (thanks to Robbert Korving)

Version 1.8
-----------

@@ -13,7 +18,7 @@ Version 1.7
- Adding support for PCAPNG files (thanks to bannsec)
- Adding support for numerous other files updated by Gary C. Kessler
- Adding script for parsing FTK GCK sigs
- Changing test suites to github workflows instead of TravisCI
- Changing test suites to github workflows instead of TravisCI
- Removing official support, new packages and test for python 2

Version 1.6
5 changes: 1 addition & 4 deletions README.rst
@@ -35,8 +35,7 @@ Disadvantages:
Compatibility
~~~~~~~~~~~~~

- Python 2.7+
- Python 3.4+
- Python 3.5+
- Pypy

Using travis-ci to run continuous integration tests on listed platforms.
@@ -50,8 +49,6 @@ In either a virtualenv or globally, simply run:
$ python setup.py install
It has no dependencies (other than the 2.7+ built-in argparse)

Usage
-----

133 changes: 91 additions & 42 deletions puremagic/main.py
@@ -1,5 +1,5 @@
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# -*- coding: utf-8 -*-
"""
puremagic is a pure python module that will identify a file based off its
magic numbers. It is designed to be minimalistic and inherently cross platform
@@ -19,64 +19,70 @@
from collections import namedtuple

__author__ = "Chris Griffith"
__version__ = "1.8"
__all__ = ['magic_file', 'magic_string', 'from_file', 'from_string',
'ext_from_filename', 'PureError', 'magic_footer_array',
'magic_header_array']
__version__ = "1.9"
__all__ = [
"magic_file",
"magic_string",
"magic_stream",
"from_file",
"from_string",
"from_stream",
"ext_from_filename",
"PureError",
"magic_footer_array",
"magic_header_array",
]

here = os.path.abspath(os.path.dirname(__file__))

MAGIC_INFO_TYPES = ('byte_match', 'offset', 'extension', 'mime_type', 'name',)
PureMagic = namedtuple('PureMagic', MAGIC_INFO_TYPES)
PureMagicWithConfidence = namedtuple('PureMagicWithConfidence',
(MAGIC_INFO_TYPES + ('confidence',)))
MAGIC_INFO_TYPES = (
"byte_match",
"offset",
"extension",
"mime_type",
"name",
)
PureMagic = namedtuple("PureMagic", MAGIC_INFO_TYPES) # type: ignore
PureMagicWithConfidence = namedtuple("PureMagicWithConfidence", (MAGIC_INFO_TYPES + ("confidence",))) # type: ignore


class PureError(LookupError):
"""Do not have that type of file in our databanks"""


def _magic_data(filename=os.path.join(here, 'magic_data.json')):
def _magic_data(filename=os.path.join(here, "magic_data.json")):
""" Read the magic file"""
with open(filename) as f:
data = json.load(f)
headers = sorted((_create_puremagic(x) for x in data['headers']),
key=lambda x: x.byte_match)
footers = sorted((_create_puremagic(x) for x in data['footers']),
key=lambda x: x.byte_match)
headers = sorted((_create_puremagic(x) for x in data["headers"]), key=lambda x: x.byte_match)
footers = sorted((_create_puremagic(x) for x in data["footers"]), key=lambda x: x.byte_match)
return headers, footers


def _create_puremagic(x):
return PureMagic(byte_match=binascii.unhexlify(x[0].encode('ascii')),
offset=x[1],
extension=x[2],
mime_type=x[3],
name=x[4])
return PureMagic(
byte_match=binascii.unhexlify(x[0].encode("ascii")), offset=x[1], extension=x[2], mime_type=x[3], name=x[4]
)


magic_header_array, magic_footer_array = _magic_data()


def _max_lengths():
""" The length of the largest magic string + its offset"""
max_header_length = max([len(x.byte_match) + x.offset
for x in magic_header_array])
max_footer_length = max([len(x.byte_match) + abs(x.offset)
for x in magic_footer_array])
max_header_length = max([len(x.byte_match) + x.offset for x in magic_header_array])
max_footer_length = max([len(x.byte_match) + abs(x.offset) for x in magic_footer_array])
return max_header_length, max_footer_length
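
For readers skimming the hunk, here is a minimal standalone sketch of what `_max_lengths` computes: how many leading and trailing bytes have to be read so that every signature, at its offset, fits inside the window. The `Sig` tuple and the three sample signatures are made up for illustration and are not taken from `magic_data.json`; footer offsets are assumed to be negative, which is what the `abs()` in the diff suggests.

```python
from collections import namedtuple

# Hypothetical stand-in for PureMagic, keeping only the fields used here.
Sig = namedtuple("Sig", ("byte_match", "offset"))

# Illustrative signatures only; not real entries from magic_data.json.
headers = [Sig(b"\x89PNG", 0), Sig(b"ustar", 257)]
footers = [Sig(b"IEND\xaeB`\x82", -8)]


def max_lengths(headers, footers):
    """Return how many leading and trailing bytes cover every signature."""
    max_head = max(len(s.byte_match) + s.offset for s in headers)
    # Footer offsets are assumed negative (counted from the end), hence abs().
    max_foot = max(len(s.byte_match) + abs(s.offset) for s in footers)
    return max_head, max_foot


head_len, foot_len = max_lengths(headers, footers)
```

With these sample values the header window is driven by the signature with the largest offset, not the longest byte string.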


def _confidence(matches, ext=None):
""" Rough confidence based on string length and file extension"""
results = []
for match in matches:
con = (0.8 if len(match.extension) > 9 else
float("0.{0}".format(len(match.extension))))
con = 0.8 if len(match.extension) > 9 else float("0.{0}".format(len(match.extension)))
if ext == match.extension:
con = 0.9
results.append(
PureMagicWithConfidence(confidence=con, **match._asdict()))
results.append(PureMagicWithConfidence(confidence=con, **match._asdict()))
return sorted(results, key=lambda x: x.confidence, reverse=True)
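
The scoring rule in `_confidence` is easier to see in isolation. The sketch below restates it for a single match; `Match` is a hypothetical, trimmed-down stand-in for `PureMagic`, and the sample extensions are only examples. Note that, as written in the diff, the score is derived from the length of `extension`.

```python
from collections import namedtuple

# Hypothetical, trimmed-down stand-in for PureMagic.
Match = namedtuple("Match", ("extension", "name"))


def confidence(match, ext=None):
    """Mirror the scoring rule from the diff for one match."""
    n = len(match.extension)
    # Lengths 0..9 map to 0.0..0.9; anything longer is capped at 0.8.
    con = 0.8 if n > 9 else float("0.{0}".format(n))
    # A filename extension that agrees with the match overrides the score.
    if ext == match.extension:
        con = 0.9
    return con


scores = {
    "short": confidence(Match(".gz", "gzip")),
    "long": confidence(Match(".tar.gz", "gzipped tar")),
    "hinted": confidence(Match(".gz", "gzip"), ext=".gz"),
}
```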


@@ -111,8 +117,7 @@ def _magic(header, footer, mime, ext=None):
info = _identify_all(header, footer, ext)[0]
if mime:
return info.mime_type
return info.extension if not \
isinstance(info.extension, list) else info[0].extension
return info.extension if not isinstance(info.extension, list) else info[0].extension


def _file_details(filename):
@@ -134,6 +139,15 @@ def _string_details(string):
return string[:max_head], string[-max_foot:]


def _stream_details(stream):
""" Grab the start and end of the stream"""
max_head, max_foot = _max_lengths()
head = stream.read(max_head)
stream.seek(-max_foot, os.SEEK_END)
foot = stream.read()
return head, foot
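
Two things worth knowing about `_stream_details` as written: it needs a seekable stream, and `seek(-max_foot, os.SEEK_END)` asks for a position before the start when the stream is shorter than `max_foot`, which binary file objects reject with `OSError`. It also leaves the stream positioned at its end. The following is a sketch of a more defensive variant, not the code in this commit; the name `stream_details` and the explicit `max_head` / `max_foot` parameters are hypothetical.

```python
import io
import os


def stream_details(stream, max_head, max_foot):
    """Return (head, foot) without seeking before the start of the stream."""
    start = stream.tell()
    head = stream.read(max_head)
    # seek() returns the new absolute position, i.e. the stream size here.
    size = stream.seek(0, os.SEEK_END)
    # Clamp so a short stream never triggers a seek before its start.
    stream.seek(max(start, size - max_foot))
    foot = stream.read()
    # Put the stream back where the caller had it.
    stream.seek(start)
    return head, foot


short = io.BytesIO(b"GIF89a")
head, foot = stream_details(short, max_head=4, max_foot=16)
position_after = short.tell()
```

For a stream shorter than the footer window, `foot` is simply the whole content, which matches what slicing does in `_string_details`.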


def ext_from_filename(filename):
""" Scan a filename for its extension.
@@ -143,10 +157,9 @@ def ext_from_filename(filename):
try:
base, ext = filename.lower().rsplit(".", 1)
except ValueError:
return ''
return ""
ext = ".{0}".format(ext)
all_exts = [x.extension for x in chain(magic_header_array,
magic_footer_array)]
all_exts = [x.extension for x in chain(magic_header_array, magic_footer_array)]

if base[-4:].startswith("."):
# For double extensions like .tar.gz
@@ -186,6 +199,22 @@ def from_string(string, mime=False, filename=None):
return _magic(head, foot, mime, ext)


def from_stream(stream, mime=False, filename=None):
""" Reads in stream, attempts to identify content based
off magic number and will return the file extension.
If mime is True it will return the mime type instead.
If filename is provided it will be used in the computation.
:param stream: stream representation to check
:param mime: Return mime, not extension
:param filename: original filename
:return: guessed extension or mime
"""
head, foot = _stream_details(stream)
ext = ext_from_filename(filename) if filename else None
return _magic(head, foot, mime, ext)


def magic_file(filename):
""" Returns tuple of (num_of_matches, array_of_matches)
arranged highest confidence match first.
@@ -222,18 +251,38 @@ def magic_string(string, filename=None):
return info


def magic_stream(stream, filename=None):
""" Returns tuple of (num_of_matches, array_of_matches)
arranged highest confidence match first
If filename is provided it will be used in the computation.
:param stream: stream representation to check
:param filename: original filename
:return: list of possible matches, highest confidence first
"""
head, foot = _stream_details(stream)
if not head:
raise ValueError("Input was empty")
ext = ext_from_filename(filename) if filename else None
info = _identify_all(head, foot, ext)
info.sort(key=lambda x: x.confidence, reverse=True)
return info


def command_line_entry(*args):
from argparse import ArgumentParser
import sys
desc = "puremagic is a pure python file identification module. \
It looks for matching magic numbers in the file to locate the file type. "
parser = ArgumentParser(description=desc)
parser.add_argument("-m",
"--mime",
action="store_true",
dest="mime",
help="Return the mime type instead of file type")
parser.add_argument('files', nargs="+")

parser = ArgumentParser(
description=(
"puremagic is a pure python file identification module. "
"It looks for matching magic numbers in the file to locate the file type. "
)
)
parser.add_argument(
"-m", "--mime", action="store_true", dest="mime", help="Return the mime type instead of file type"
)
parser.add_argument("files", nargs="+")
args = parser.parse_args(args if args else sys.argv[1:])

for fn in args.files:
@@ -246,5 +295,5 @@ def command_line_entry(*args):
print("'{0}' : could not be Identified".format(fn))


if __name__ == '__main__':
if __name__ == "__main__":
command_line_entry()
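
A note on the `ArgumentParser` call in `command_line_entry`: its `description` is built from adjacent string literals, which Python joins with no separator, so the space between the two sentences has to live inside one of the literals. A small sketch of that pattern follows; `example.gif` is a made-up file name and the parser only mirrors the arguments shown in the diff.

```python
from argparse import ArgumentParser

# Adjacent literals are concatenated as-is, so the trailing space is explicit.
description = (
    "puremagic is a pure python file identification module. "
    "It looks for matching magic numbers in the file to locate the file type."
)

parser = ArgumentParser(description=description)
parser.add_argument(
    "-m", "--mime", action="store_true", dest="mime", help="Return the mime type instead of file type"
)
parser.add_argument("files", nargs="+")

# Parse a sample command line instead of sys.argv.
args = parser.parse_args(["--mime", "example.gif"])
```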