Commit

Merge branch 'msg-support'
Dean Malmgren committed Jun 23, 2015
2 parents aead5ba + 6b894c0 commit 42b7f17
Showing 9 changed files with 52 additions and 4 deletions.
4 changes: 4 additions & 0 deletions docs/changelog.rst
@@ -13,6 +13,8 @@ latest changes in development for next release
* support for ``.rtf`` files (`#84`_)

* support for ``.msg`` files (`#87`_ and `#17`_ by `@anthonygarvan`_)


1.2.0
-----
@@ -198,6 +200,7 @@ latest changes in development for next release
.. _#8: https://github.com/deanmalmgren/textract/issues/8
.. _#9: https://github.com/deanmalmgren/textract/issues/9
.. _#13: https://github.com/deanmalmgren/textract/issues/13
.. _#17: https://github.com/deanmalmgren/textract/issues/17
.. _#21: https://github.com/deanmalmgren/textract/issues/21
.. _#25: https://github.com/deanmalmgren/textract/issues/25
.. _#26: https://github.com/deanmalmgren/textract/issues/26
@@ -227,3 +230,4 @@ latest changes in development for next release
.. _#79: https://github.com/deanmalmgren/textract/issues/79
.. _#82: https://github.com/deanmalmgren/textract/issues/82
.. _#84: https://github.com/deanmalmgren/textract/issues/84
.. _#87: https://github.com/deanmalmgren/textract/issues/87
3 changes: 3 additions & 0 deletions docs/index.rst
@@ -62,6 +62,8 @@ file types by either mentioning them on the `issue tracker

* ``.mp3`` via `SpeechRecognition`_ and `sox`_

* ``.msg`` via `msg-extractor`_

* ``.odt`` via python builtins

* ``.ogg`` via `SpeechRecognition`_ and `sox`_
@@ -90,6 +92,7 @@ file types by either mentioning them on the `issue tracker
.. _antiword: http://www.winfield.demon.nl/
.. _beautifulsoup4: http://beautiful-soup-4.readthedocs.org/en/latest/
.. _ebooklib: https://github.com/aerkalov/ebooklib
.. _msg-extractor: https://github.com/mattgwwalker/msg-extractor
.. _pdfminer: https://euske.github.io/pdfminer/
.. _pdftotext: http://poppler.freedesktop.org/
.. _ps2text: http://pages.cs.wisc.edu/~ghost/doc/pstotext.htm
1 change: 1 addition & 0 deletions requirements/python
@@ -9,3 +9,4 @@ beautifulsoup4
xlrd
EbookLib
SpeechRecognition>=1.1.0
https://github.com/deanmalmgren/msg-extractor/zipball/pip-installable
14 changes: 10 additions & 4 deletions setup.py
@@ -14,13 +14,18 @@
 github_url='https://github.com/deanmalmgren/textract'

 # read in the dependencies from the virtualenv requirements file
-dependencies = []
+dependencies, dependency_links = [], []
 filename = os.path.join("requirements", "python")
 with open(filename, 'r') as stream:
     for line in stream:
-        package = line.strip().split('#')[0]
-        if package:
-            dependencies.append(package)
+        line = line.strip()
+        if line.startswith("http"):
+            dependency_links.append(line)
+        else:
+            package = line.split('#')[0]
+            if package:
+                dependencies.append(package)
+

 setup(
     name=textract.__name__,
@@ -38,5 +43,6 @@
         'textract.parsers',
     ],
     install_requires=dependencies,
+    dependency_links=dependency_links,
     zip_safe=False,
 )
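
For readers following the setup.py change, the new requirements-parsing loop can be sketched as a standalone function. This is an illustration only, not code from the commit: the name `split_requirements` is hypothetical, and the extra `.strip()` after removing a trailing comment is a small deviation from the committed code.

```python
def split_requirements(lines):
    """Split requirement lines into (dependencies, dependency_links).

    Hypothetical helper that mirrors the loop in setup.py: lines starting
    with "http" are treated as dependency links, everything else as a
    package name with any trailing '#' comment removed.
    """
    dependencies, dependency_links = [], []
    for line in lines:
        line = line.strip()
        if line.startswith("http"):
            # URL requirements are kept whole, fragment included
            dependency_links.append(line)
        else:
            # Deviation from the commit: strip again so that
            # "pkg  # note" yields "pkg" rather than "pkg  "
            package = line.split('#')[0].strip()
            if package:
                dependencies.append(package)
    return dependencies, dependency_links
```

Checking for the URL before splitting on `#` matters because a link may carry a `#` fragment that should not be treated as a comment.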
Binary file added tests/msg/raw_text.msg
Binary file not shown.
15 changes: 15 additions & 0 deletions tests/msg/raw_text.txt
@@ -0,0 +1,15 @@
Test for TIF files

This is a test email to experiment with the MS Outlook MSG Extractor


--


Kind regards




Brian Zhou

Binary file added tests/msg/standardized_text.msg
Binary file not shown.
7 changes: 7 additions & 0 deletions tests/test_msg.py
@@ -0,0 +1,7 @@
import unittest

import base


class MsgTestCase(base.BaseParserTestCase, unittest.TestCase):
    extension = 'msg'
12 changes: 12 additions & 0 deletions textract/parsers/msg_parser.py
@@ -0,0 +1,12 @@
from ExtractMsg import Message

from .utils import BaseParser


class Parser(BaseParser):
    """Extract text from Microsoft Outlook files (.msg)
    """

    def extract(self, filename, **kwargs):
        m = Message(filename)
        return m.subject + '\n\n' + m.body
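
The parser joins the subject and body with a blank line. A minimal sketch of that join as a standalone function follows. The helper name `join_subject_and_body` is hypothetical, and the `None` handling is an assumption: this commit does not say whether `ExtractMsg` can return `None` for a missing subject or body.

```python
def join_subject_and_body(subject, body):
    """Join subject and body with a blank line between them.

    Hypothetical helper, not part of the commit. Treating None as an
    empty string is an assumption about how missing fields might appear.
    """
    return (subject or '') + '\n\n' + (body or '')
```

With both fields present, this produces the same string as `m.subject + '\n\n' + m.body`.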
